pg_0250

<<<

Index

>>>

9-18

PROGRAMMING WITH THE STREAMING SIMD EXTENSIONS

The non-temporal store instructions (MOVNTPS, MOVNTQ, and MASKMOVQ) minimize

cache pollution while writing data. The main difference between a non-temporal store and a

regular cacheable store is in the write-allocation policy. The memory type of the region being

written to can override the non-temporal hint, leading to the following considerations. If the

programmer specifies a non-temporal store to:

Uncacheable memory, the store behaves like an uncacheable store; the non-temporal hint is

ignored, and the memory type for the region is retained. Uncacheable as referred to here

means that the region being written to has been mapped with either a UC or WP memory

type. If the memory region has been mapped as WB, WT, or WC, the non-temporal store

will implement weakly-ordered (WC) semantic behavior.

Cacheable memory, two cases may result. If the data is:

Present in the cache hierarchy, the hint is ignored and the cache line is updated

normally. A given processor may choose different ways to implement this; some

examples include: updating data in-place in the cache hierarchy while preserving

the memory type semantics assigned to that region, or evicting the data from the

caches and writing the new non-temporal data to memory (with WC semantics).

Not present in the cache hierarchy, and the destination region is mapped as WB,

WT, or WC, the transaction will be weakly-ordered, and is subject to all WC

memory semantics; consequently, the programmer is responsible for maintaining

coherency. The non-temporal store will not write allocate (i.e., the processor will

not fetch the corresponding cache line into the cache hierarchy, prior to

performing the store). Different implementations may choose to collapse and

combine these stores prior to issuing them to memory.

In general, WC semantics require software to ensure coherency, with respect to other processors

and other system agents (such as graphics cards). Appropriate use of synchronization and a

fencing operation (refer to SFENCE, below) must be performed for producer-consumer usage

models. Fencing ensures that all system agents have global visibility of the stored data. For

instance, failure to fence may result in a written cache line staying within a processor, and the

line would not be visible to other agents. For processors that implement non-temporal stores by

updating data in-place that already resides in the cache hierarchy, the destination region should

also be mapped as WC. Otherwise, if mapped as WB or WT, there is the potential for speculative

processor reads to bring the data into the caches. In this case, non-temporal stores would then

update in place, and data would not be flushed from the processor by a subsequent fencing oper-

ation.

The memory type visible on the bus in the presence of memory type aliasing is implementation-

specific. As one possible example, the memory type written to the bus may reflect the memory

type for the first store to this line, as seen in program order; other alternatives are possible. This

behavior should be considered reserved, and dependence on the behavior of any particular

implementation risks future incompatibility.

The PREFETCH (Load 32 or greater number of bytes) instructions load either non-temporal

data or temporal data in the specified cache level. This access and the cache level are specified

as a hint. The prefetch instructions do not affect functional behavior of the program and will be

implementation-specific.

<<<

Index

>>>