CHAPTER 10 MEMORY CACHE CONTROL
Monday, 28. August 2006, 05:49:40
This chapter describes the IA-32 architecture’s memory cache and cache control mechanisms, the
TLBs, and the store buffer. It also describes the memory type range registers (MTRRs) found in
the P6 family processors and how they are used to control caching of physical memory locations.
10.1 INTERNAL CACHES, TLBS, AND BUFFERS
The IA-32 architecture supports caches, translation lookaside buffers (TLBs), and a store buffer
for temporary on-chip (and external) storage of instructions and data. (Figure 10-1 shows the
arrangement of caches, TLBs, and the store buffer for the Pentium 4 and Intel Xeon processors.)
Table 10-1 shows the characteristics of these caches and buffers for the Pentium 4, Intel Xeon,
P6 family, and Pentium processors. The sizes and characteristics of these units are machine
specific and may change in future versions of the processor. The CPUID instruction returns
the sizes and characteristics of the caches and buffers for the processor on which the instruction
is executed (see “CPUID—CPU Identification” in Chapter 3 of the IA-32 Intel Architecture Software
Developer’s Manual, Volume 2).
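As a rough illustration of how software might interpret what CPUID reports, the sketch below decodes one set of register values from the deterministic cache parameters leaf (CPUID with EAX = 4). That leaf is not described in this chapter, and the structure and function names are hypothetical. The register values are passed in as plain integers so that the decoding logic can be read apart from the instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one decoded cache level. */
struct cache_params {
    unsigned type;        /* 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;       /* cache level: 1, 2, 3, ... */
    unsigned ways;        /* ways of associativity */
    unsigned partitions;  /* physical line partitions */
    unsigned line_size;   /* line size in bytes */
    unsigned sets;        /* number of sets */
    uint32_t size_bytes;  /* ways * partitions * line_size * sets */
};

/* Decode the EAX, EBX, and ECX values returned by CPUID with EAX = 4. */
struct cache_params decode_cpuid_leaf4(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    struct cache_params p;
    p.type       = eax & 0x1F;                 /* EAX[4:0]   */
    p.level      = (eax >> 5) & 0x7;           /* EAX[7:5]   */
    p.ways       = ((ebx >> 22) & 0x3FF) + 1;  /* EBX[31:22] holds ways - 1 */
    p.partitions = ((ebx >> 12) & 0x3FF) + 1;  /* EBX[21:12] holds partitions - 1 */
    p.line_size  = (ebx & 0xFFF) + 1;          /* EBX[11:0]  holds line size - 1 */
    p.sets       = ecx + 1;                    /* ECX holds sets - 1 */
    p.size_bytes = p.ways * p.partitions * p.line_size * p.sets;
    return p;
}

/* Usage: register values describing a 32-KByte, 8-way L1 data cache with 64-byte lines. */
void example_decode_leaf4(void)
{
    struct cache_params p = decode_cpuid_leaf4(0x21, 0x01C0003F, 63);
    assert(p.level == 1 && p.ways == 8 && p.line_size == 64);
    assert(p.size_bytes == 32768);
}
```

On real hardware the three register values would come from executing CPUID repeatedly with ECX = 0, 1, 2, and so on until the type field reads 0.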
Figure 10-1. Cache Structure of the Pentium 4 and Intel Xeon Processors
[Figure not reproduced. Blocks shown: System Bus (External), Physical Memory, Bus Interface Unit, L3 Cache†, L2 Cache, Data Cache Unit (L1), Data TLBs, Store Buffer, Instruction TLBs, Instruction Decoder, Trace Cache.]
† Intel Xeon processors only
Table 10-1. Characteristics of the Caches, TLBs, Store Buffer, and
Write Combining Buffer in IA-32 processors
Cache or Buffer Characteristics
Trace Cache† - Pentium 4 and Intel Xeon processors: 12 Kμops, 8-way set associative.
- Pentium M processor: not implemented.
- P6 family and Pentium processors: not implemented.
L1 Instruction Cache - Pentium 4 and Intel Xeon processors: not implemented.
- Pentium M processor: 32-KByte, 8-way set associative.
- P6 family and Pentium processors: 8- or 16-KByte, 4-way set associative,
32-byte cache line size; 2-way set associative for earlier Pentium processors.
L1 Data Cache - Pentium 4 and Intel Xeon processors: 8-KByte, 4-way set associative, 64-byte
cache line size.
- Pentium 4 and Intel Xeon processors: 16-KByte, 8-way set associative, 64-byte
cache line size.
- Pentium M processor: 32-KByte, 8-way set associative, 64-byte cache line size.
- P6 family processors: 16-KByte, 4-way set associative, 32-byte cache line size;
8-KBytes, 2-way set associative for earlier P6 family processors.
- Pentium processors: 16-KByte, 4-way set associative, 32-byte cache line size;
8-KByte, 2-way set associative for earlier Pentium processors.
L2 Unified Cache - Pentium 4 and Intel Xeon processors: 256, 512, 1024, or 2048-KByte, 8-way set
associative, 64-byte cache line size, 128-byte sector size.
- Pentium M processor: 1 or 2-MByte, 8-way set associative, 64-byte cache line
size.
- P6 family processors: 128-KByte, 256-KByte, 512-KByte, 1-MByte, or 2-MByte,
4-way set associative, 32-byte cache line size.
- Pentium processor (external optional): System specific, typically 256- or
512-KByte, 4-way set associative, 32-byte cache line size.
L3 Unified Cache - Intel Xeon processors: 512-KByte, 1-MByte, 2-MByte, or 4-MByte, 8-way set
associative, 64-byte cache line size, 128-byte sector size.
Instruction TLB (4-KByte Pages) - Pentium 4 and Intel Xeon processors: 128 entries, 4-way set associative.
- Pentium M processor: 128 entries, 4-way set associative.
- P6 family processors: 32 entries, 4-way set associative.
- Pentium processor: 32 entries, 4-way set associative; fully set associative for
Pentium processors with MMX technology.
Data TLB (4-KByte Pages) - Pentium 4 and Intel Xeon processors: 64 entries, fully set associative; shared
with large page data TLBs.
- Pentium M processor: 128 entries, 4-way set associative.
- Pentium and P6 family processors: 64 entries, 4-way set associative; fully set
associative for Pentium processors with MMX technology.
Instruction TLB (Large Pages) - Pentium 4 and Intel Xeon processors: large pages are fragmented.
- Pentium M processor: 2 entries, fully associative.
- P6 family processors: 2 entries, fully associative.
- Pentium processor: Uses same TLB as used for 4-KByte pages.
Data TLB (Large Pages) - Pentium 4 and Intel Xeon processors: 64 entries, fully set associative; shared
with small page data TLBs.
- Pentium M processor: 8 entries, fully associative.
- P6 family processors: 8 entries, 4-way set associative.
- Pentium processor: 8 entries, 4-way set associative; uses same TLB as used for
4-KByte pages in Pentium processors with MMX technology.
The IA-32 processors implement four types of caches: the trace cache, the level 1 (L1) cache,
the level 2 (L2) cache, and the level 3 (L3) cache (see Figure 10-1). The use of these caches
differs among the Pentium 4, Intel Xeon, P6 family, and Pentium processors, as follows:
• Pentium 4 and Intel Xeon processors — The trace cache caches decoded instructions
(μops) from the instruction decoder, and the L1 cache contains only data. The L2 and L3
caches are unified data and instruction caches that are located on the processor chip. (The
L3 cache is only implemented on Intel Xeon processors.)
• P6 family processors — The L1 cache is divided into two sections: one dedicated to
caching IA-32 architecture instructions (pre-decoded instructions) and one to caching data.
The L2 cache is a unified data and instruction cache that is located on the processor chip.
The P6 family processors do not implement a trace cache.
• Pentium processors — The L1 cache has the same structure as on the P6 family
processors (and a trace cache is not implemented). The L2 cache is a unified data and
instruction cache that is external to the processor chip on earlier Pentium processors and
implemented on the processor chip in later Pentium processors. For Pentium processors
where the L2 cache is external to the processor, access to the cache is through the system
bus.
The cache lines for the L1 and L2 caches in the Pentium 4 and the L1, L2, and L3 caches in the
Intel Xeon processors are 64 bytes wide. The processor always reads a cache line from system
memory beginning on a 64-byte boundary. (A 64-byte aligned cache line begins at an address
with its 6 least-significant bits clear.) A cache line can be filled from memory with an 8-transfer
burst transaction. The caches do not support partially-filled cache lines, so caching even a single
doubleword requires caching an entire line.
The L1 and L2 cache lines in the P6 family and Pentium processors are 32 bytes wide, with
cache line reads from system memory beginning on a 32-byte boundary (5 least-significant bits
of a memory address clear). A cache line can be filled from memory with a 4-transfer burst transaction.
Partially-filled cache lines are not supported.
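The alignment rules above can be sketched in a few lines of C. The function names are illustrative only, and the figure of 8 bytes per burst transfer is inferred from the text (64 bytes in 8 transfers, 32 bytes in 4 transfers) rather than stated directly.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first byte of the cache line that contains addr.
   line_size must be a power of two (64 or 32 in the text above). */
uint32_t line_base(uint32_t addr, uint32_t line_size)
{
    return addr & ~(line_size - 1);   /* clear the low 6 (or 5) bits */
}

/* Number of cache lines touched by an access of len bytes at addr (len > 0). */
uint32_t lines_touched(uint32_t addr, uint32_t len, uint32_t line_size)
{
    uint32_t first = line_base(addr, line_size);
    uint32_t last  = line_base(addr + len - 1, line_size);
    return (last - first) / line_size + 1;
}

/* Burst transfers needed to fill one line, assuming 8 bytes per transfer. */
uint32_t burst_transfers(uint32_t line_size)
{
    return line_size / 8;
}

/* Usage: a doubleword that straddles a 64-byte boundary costs two line fills. */
void example_line_math(void)
{
    assert(line_base(0x1234, 64) == 0x1200);
    assert(lines_touched(0x123E, 4, 64) == 2);
    assert(burst_transfers(64) == 8);
}
```

Because partially-filled lines are not supported, the doubleword at 123EH in the usage example brings in 128 bytes, not 4.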
Table 10-1. Characteristics of the Caches, TLBs, Store Buffer, and
Write Combining Buffer in IA-32 processors (Contd.)
Store Buffer - Pentium 4 and Intel Xeon processors: 24 entries.
- Pentium M processor: 16 entries.
- P6 family processors: 12 entries.
- Pentium processor: 2 buffers, 1 entry each (Pentium processors with MMX
technology have 4 buffers for 4 entries).
Write Combining (WC) Buffer - Pentium 4 and Intel Xeon processors: 6 or 8 entries.
- Pentium M processor: 6 entries.
- P6 family processors: 4 entries.
NOTES:
† Introduced to the IA-32 architecture in the Pentium 4 and Intel Xeon processors.
The trace cache in the Pentium 4 and Intel Xeon processors is an integral part of the Intel
NetBurst microarchitecture and is available in all execution modes: protected mode, system
management mode (SMM), and real-address mode. The L1, L2, and L3 caches are also available
in all execution modes; however, use of them must be handled carefully in SMM (see Section
13.4.2, “SMRAM Caching”).
The TLBs store the most recently used page-directory and page-table entries. They speed up
memory accesses when paging is enabled by reducing the number of memory accesses that are
required to read the page tables stored in system memory. The TLBs are divided into four
groups: instruction TLBs for 4-KByte pages, data TLBs for 4-KByte pages, instruction TLBs
for large pages (2-MByte or 4-MByte pages), and data TLBs for large pages. The TLBs are
normally active only in protected mode with paging enabled. When paging is disabled or the
processor is in real-address mode, the TLBs maintain their contents until explicitly or implicitly
flushed (see Section 10.9, “Invalidating the Translation Lookaside Buffers (TLBs)”).
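To make the saving concrete, the following sketch splits a linear address into the indexes used to walk the page structures. It assumes 32-bit paging with 4-KByte pages and no physical address extension, a layout this chapter does not spell out, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 32-bit linear address under the assumed 4-KByte paging layout. */
struct linear_address_parts {
    uint32_t dir_index;    /* bits 31:22, selects a page-directory entry */
    uint32_t table_index;  /* bits 21:12, selects a page-table entry */
    uint32_t offset;       /* bits 11:0, byte offset within the page */
};

struct linear_address_parts split_linear_address(uint32_t la)
{
    struct linear_address_parts p;
    p.dir_index   = la >> 22;
    p.table_index = (la >> 12) & 0x3FF;
    p.offset      = la & 0xFFF;
    return p;
}

/* Page-structure reads needed for one translation under this assumption:
   none on a TLB hit, one page-directory read plus one page-table read on a miss. */
unsigned page_structure_reads(int tlb_hit)
{
    return tlb_hit ? 0u : 2u;
}

/* Usage: decompose the linear address 12345678H. */
void example_split(void)
{
    struct linear_address_parts p = split_linear_address(0x12345678);
    assert(p.dir_index == 0x48 && p.table_index == 0x345 && p.offset == 0x678);
}
```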
The store buffer is associated with the processor’s instruction execution units. It allows writes to
system memory and/or the internal caches to be saved and in some cases combined to optimize
the processor’s bus accesses. The store buffer is always enabled in all execution modes.
The processor’s caches are for the most part transparent to software. When enabled, instructions
and data flow through these caches without the need for explicit software control. However,
knowledge of the behavior of these caches may be useful in optimizing software performance.
For example, knowledge of cache dimensions and replacement algorithms gives an indication
of how large a data structure can be operated on at once without causing cache thrashing.
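One way to reason about thrashing is to compute which set an address maps to. The sketch below assumes the set index is taken from the address bits directly above the line offset, which is a common arrangement but is not specified in this chapter. The geometry in the usage example (8 KBytes, 4-way, 64-byte lines) is one of the L1 data cache configurations listed in Table 10-1, and the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Basic cache geometry as listed in Table 10-1. */
struct cache_geom {
    uint32_t size_bytes;
    uint32_t ways;
    uint32_t line_size;
};

uint32_t num_sets(struct cache_geom g)
{
    return g.size_bytes / (g.ways * g.line_size);
}

/* Assumption: the set index is the line number modulo the number of sets. */
uint32_t set_index(struct cache_geom g, uint32_t addr)
{
    return (addr / g.line_size) % num_sets(g);
}

/* Distance in bytes between addresses that compete for the same set. */
uint32_t conflict_stride(struct cache_geom g)
{
    return num_sets(g) * g.line_size;
}

/* Usage: in an 8-KByte, 4-way cache with 64-byte lines, addresses 2048 bytes
   apart share a set, so a fifth such address must displace one of the first four. */
void example_conflicts(void)
{
    struct cache_geom l1d = { 8 * 1024, 4, 64 };
    assert(num_sets(l1d) == 32);
    assert(conflict_stride(l1d) == 2048);
    assert(set_index(l1d, 0) == set_index(l1d, 4 * 2048));
}
```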
In multiprocessor systems, maintenance of cache consistency may, in rare circumstances,
require intervention by system software. For these rare cases, the processor provides privileged
cache control instructions for use in flushing caches and forcing memory ordering.
The Pentium III, Pentium 4, and Intel Xeon processors introduced several instructions that software
can use to improve the performance of the L1, L2, and L3 caches, including the
PREFETCHh and CLFLUSH instructions and the non-temporal move instructions (MOVNTI,
MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD). The use of these instructions is
discussed in Section 10.5.5, “Cache Management Instructions”.
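As a hedged illustration, C compilers commonly expose these instructions through intrinsics. The sketch below uses _mm_prefetch, _mm_stream_si32, _mm_sfence, and _mm_clflush from the compiler headers <xmmintrin.h> and <emmintrin.h>. Those names belong to the compilers, not to this manual, and the helper functions are hypothetical. When SSE2 is not available, the code falls back to ordinary stores so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE header */
#endif

/* Copy n 32-bit words using non-temporal stores where SSE2 is available. */
void copy_nontemporal(int32_t *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        /* Once per 16 words (64 bytes), prefetch the source data 64 bytes ahead. */
        if (i % 16 == 0 && i + 16 < n)
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_NTA);  /* PREFETCHNTA */
        _mm_stream_si32((int *)&dst[i], src[i]);                     /* MOVNTI */
#else
        dst[i] = src[i];
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  /* SFENCE: order the non-temporal stores before later stores */
#endif
}

/* Flush the cache line containing p from the cache hierarchy (CLFLUSH). */
void flush_line(const void *p)
{
#if defined(__SSE2__)
    _mm_clflush(p);
#else
    (void)p;
#endif
}
```

The fence after the loop matters because non-temporal stores are weakly ordered. Without it, a later store that signals "data ready" could become visible before the data itself.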
10.2 CACHING TERMINOLOGY
The IA-32 architecture (beginning with the Pentium processor) uses the MESI (modified, exclusive,
shared, invalid) cache protocol to maintain consistency with internal caches and caches in
other processors (see Section 10.4, “Cache Control Protocol”).
When the processor recognizes that an operand being read from memory is cacheable, the
processor reads an entire cache line into the appropriate cache (L1, L2, L3, or all). This operation
is called a cache line fill. If the memory location containing that operand is still cached the next
time the processor attempts to access the operand, the processor can read the operand from the
cache instead of going back to memory. This operation is called a cache hit.
When the processor attempts to write an operand to a cacheable area of memory, it first checks
if a cache line for that memory location exists in the cache. If a valid cache line does exist, the
processor (depending on the write policy currently in force) can write the operand into the cache
instead of writing it out to system memory. This operation is called a write hit. If a write misses
the cache (that is, a valid cache line is not present for the area of memory being written to), the
processor performs a cache line fill (write allocation). Then it writes the operand into the cache
line and (depending on the write policy currently in force) can also write it out to memory. If the
operand is to be written out to memory, it is written first into the store buffer, and then written
from the store buffer to memory when the system bus is available. (Note that for the Pentium
processor, write misses do not result in a cache line fill; they always result in a write to memory.
For this processor, only read misses result in cache line fills.)
When operating in an MP system, IA-32 processors (beginning with the Intel486 processor)
have the ability to snoop other processors’ accesses to system memory and to their internal
caches. They use this snooping ability to keep their internal caches consistent both with system
memory and with the caches in other processors on the bus. For example, in the Pentium and P6
family processors, if through snooping one processor detects that another processor intends to
write to a memory location that it currently has cached in shared state, the snooping processor
will invalidate its cache line, forcing it to perform a cache line fill the next time it accesses the
same memory location.
Beginning with the P6 family processors, if a processor detects (through snooping) that another
processor is trying to access a memory location that it has modified in its cache, but has not yet
written back to system memory, the snooping processor will signal the other processor (by
means of the HITM# signal) that the cache line is held in modified state and will perform an
implicit write-back of the modified data. The implicit write-back is transferred directly to the
initial requesting processor and snooped by the memory controller to assure that system memory
has been updated. Here, the processor with the valid data may pass the data to the other processors
without actually writing it to system memory; however, it is the responsibility of the
memory controller to snoop this operation and update memory.
10.3 METHODS OF CACHING AVAILABLE
The processor allows any area of system memory to be cached in the L1, L2, and L3 caches. For
individual pages or regions of system memory, it allows the type of caching (also called
memory type) to be specified (see Section 10.5). Memory types currently defined for the IA-32
architecture are as follows (see Table 10-2):
• Strong Uncacheable (UC) —System memory locations are not cached. All reads and
writes appear on the system bus and are executed in program order without reordering. No
speculative memory accesses, page-table walks, or prefetches of speculated branch targets
are made. This type of cache-control is useful for memory-mapped I/O devices. When
used with normal RAM, it greatly reduces processor performance.
NOTE
The behavior of FP and SSE/SSE2 operations on operands in UC memory is
implementation dependent. In some implementations, accesses to UC
memory may occur more than once. To ensure predictable behavior, use loads
and stores of general purpose registers to access UC memory that may have
read or write side effects.
• Uncacheable (UC-) — Has same characteristics as the strong uncacheable (UC) memory
type, except that this memory type can be overridden by programming the MTRRs for the
WC memory type. This memory type is available in the Pentium 4, Intel Xeon, and
Pentium III processors and can only be selected through the PAT.
• Write Combining (WC) — System memory locations are not cached (as with
uncacheable memory) and coherency is not enforced by the processor’s bus coherency
protocol. Speculative reads are allowed. Writes may be delayed and combined in the write
combining buffer (WC buffer) to reduce memory accesses. If the WC buffer is partially
filled, the writes may be delayed until the next occurrence of a serializing event, such as
an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached
memory, an interrupt occurrence, or a LOCK instruction execution. This type of cache-control
is appropriate for video frame buffers, where the order of writes is unimportant as
long as the writes update memory so they can be seen on the graphics display. See Section
10.3.1, “Buffering of Write Combining Memory Locations”, for more information about
caching the WC memory type. This memory type is available in the Pentium Pro and
Pentium II processors by programming the MTRRs or in the Pentium III, Pentium 4, and
Intel Xeon processors by programming the MTRRs or by selecting it through the PAT.
• Write-through (WT) — Writes and reads to and from system memory are cached. Reads
come from cache lines on cache hits; read misses cause cache fills. Speculative reads are
allowed. All writes are written to a cache line (when possible) and through to system
memory. When writing through to memory, invalid cache lines are never filled, and valid
cache lines are either filled or invalidated. Write combining is allowed. This type of cache-control
is appropriate for frame buffers or when there are devices on the system bus that
access system memory, but do not perform snooping of memory accesses. It enforces
coherency between caches in the processors and system memory.
Table 10-2. Memory Types and Their Properties
Memory Type and Mnemonic  Cacheable       Writeback   Allows       Memory Ordering Model
                                          Cacheable   Speculative
                                                      Reads
Strong Uncacheable (UC)   No              No          No           Strong Ordering
Uncacheable (UC-)         No              No          No           Strong Ordering. Can only be selected through the PAT.
                                                                   Can be overridden by WC in MTRRs.
Write Combining (WC)      No              No          Yes          Weak Ordering. Available by programming MTRRs or by
                                                                   selecting it through the PAT.
Write Through (WT)        Yes             No          Yes          Speculative Processor Ordering.
Write Back (WB)           Yes             Yes         Yes          Speculative Processor Ordering.
Write Protected (WP)      Yes for reads;  No          Yes          Speculative Processor Ordering. Available by
                          no for writes                            programming MTRRs.
• Write-back (WB) — Writes and reads to and from system memory are cached. Reads
come from cache lines on cache hits; read misses cause cache fills. Speculative reads are
allowed. Write misses cause cache line fills (in the Pentium 4, Intel Xeon, and P6 family
processors), and writes are performed entirely in the cache, when possible. Write
combining is allowed. The write-back memory type reduces bus traffic by eliminating
many unnecessary writes to system memory. Writes to a cache line are not immediately
forwarded to system memory; instead, they are accumulated in the cache. The modified
cache lines are written to system memory later, when a write-back operation is performed.
Write-back operations are triggered when cache lines need to be deallocated, such as when
new cache lines are being allocated in a cache that is already full. They also are triggered
by the mechanisms used to maintain cache consistency. This type of cache-control
provides the best performance, but it requires that all devices that access system memory
on the system bus be able to snoop memory accesses to ensure system memory and cache
coherency.
• Write protected (WP) — Reads come from cache lines when possible, and read misses
cause cache fills. Writes are propagated to the system bus and cause corresponding cache
lines on all processors on the bus to be invalidated. Speculative reads are allowed. This
memory type is available in the Pentium 4, Intel Xeon, and P6 family processors by
programming the MTRRs (see Table 10-6).
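The properties listed above and in Table 10-2 can be captured in a small lookup table, for instance when system software needs to reason about a region's behavior. This is only a sketch: the structure, field names, and lookup function are hypothetical, and the values are transcribed from Table 10-2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum ordering_model { STRONG_ORDERING, WEAK_ORDERING, SPECULATIVE_PROCESSOR_ORDERING };

/* One row of Table 10-2. */
struct mem_type {
    const char *mnemonic;
    bool reads_cacheable;
    bool writes_cacheable;
    bool writeback_cacheable;
    bool speculative_reads;
    enum ordering_model ordering;
};

const struct mem_type MEM_TYPES[] = {
    { "UC",  false, false, false, false, STRONG_ORDERING },
    { "UC-", false, false, false, false, STRONG_ORDERING },
    { "WC",  false, false, false, true,  WEAK_ORDERING },
    { "WT",  true,  true,  false, true,  SPECULATIVE_PROCESSOR_ORDERING },
    { "WB",  true,  true,  true,  true,  SPECULATIVE_PROCESSOR_ORDERING },
    { "WP",  true,  false, false, true,  SPECULATIVE_PROCESSOR_ORDERING },  /* cacheable for reads only */
};

/* Return the table row for a mnemonic, or NULL if it is not a defined memory type. */
const struct mem_type *find_mem_type(const char *mnemonic)
{
    for (size_t i = 0; i < sizeof MEM_TYPES / sizeof MEM_TYPES[0]; i++)
        if (strcmp(MEM_TYPES[i].mnemonic, mnemonic) == 0)
            return &MEM_TYPES[i];
    return NULL;
}

/* Usage: only WB accumulates writes in the cache. */
void example_mem_types(void)
{
    assert(find_mem_type("WB")->writeback_cacheable);
    assert(!find_mem_type("WT")->writeback_cacheable);
}
```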
Table 10-3 shows which of these caching methods are available in the Pentium, P6 Family,
Pentium 4, and Intel Xeon processors.
Table 10-3. Methods of Caching Available in Pentium 4, Intel Xeon, P6 Family,
and Pentium Processors
Memory Type               Pentium 4 and Intel   P6 Family    Pentium
                          Xeon Processors       Processors   Processor
Strong Uncacheable (UC)   Yes                   Yes          Yes
Uncacheable (UC-)         Yes                   Yes*         No
Write Combining (WC)      Yes                   Yes          No
Write Through (WT)        Yes                   Yes          Yes
Write Back (WB)           Yes                   Yes          Yes
Write Protected (WP)      Yes                   Yes          No
NOTE:
* Introduced in the Pentium III processor; not available in the Pentium Pro or Pentium II processors.
10.3.1 Buffering of Write Combining Memory Locations
Writes to the WC memory type are not cached in the typical sense of the word cached. They are
retained in an internal write combining buffer (WC buffer) that is separate from the internal L1,
L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide
data coherency. Buffering of writes to WC memory is done to allow software a small window
of time to supply more modified data to the WC buffer while remaining as non-intrusive to software
as possible. The buffering of writes to WC memory also causes data to be collapsed; that
is, multiple writes to the same memory location will leave the last data written in the location
and the other writes will be lost.
The size and structure of the WC buffer are not architecturally defined. For the Pentium 4 and
Intel Xeon processors, the WC buffer is made up of several 64-byte WC buffers. For the P6
family processors, the WC buffer is made up of several 32-byte WC buffers.
When software begins writing to WC memory, the processor begins filling the WC buffers one
at a time. When one or more WC buffers have been filled, the processor has the option of evicting
the buffers to system memory. The protocol for evicting the WC buffers is implementation
dependent and should not be relied on by software for system memory coherency. When using
the WC memory type, software must be sensitive to the fact that the writing of data to system
memory is being delayed and must deliberately empty the WC buffers when system memory
coherency is required.
Once the processor has started to evict data from the WC buffer into system memory, it will
make a bus-transaction style decision based on how much of the buffer contains valid data. If
the buffer is full (that is, all bytes are valid), the processor will execute a burst-write transaction
on the bus that will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4
and Intel Xeon processors) being transmitted on the data bus in a single burst transaction. If one
or more of the WC buffer’s bytes are invalid (that is, have not been written by software),
then the processor will transmit the data to memory using “partial write” transactions (one chunk
at a time, where a “chunk” is 8 bytes). This will result in a maximum of 4 partial write
transactions (for P6 family processors) or 8 partial write transactions (for the Pentium 4 and
Intel Xeon processors) for one WC buffer of data sent to memory.
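The eviction rule can be modeled as a short function: a completely valid buffer produces one burst, and otherwise each 8-byte chunk that holds at least one valid byte produces one partial write. This is a sketch of the rule as described above, not of any processor's actual eviction logic. The byte-valid bitmask and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { CHUNK_BYTES = 8 };

/* Bus transactions generated when one WC buffer is evicted. */
struct wc_eviction {
    unsigned bursts;          /* 0 or 1 */
    unsigned partial_writes;  /* 0 to buffer_bytes / 8 */
};

/* valid_mask: bit i is set if byte i of the buffer has been written.
   buffer_bytes: 32 (P6 family) or 64 (Pentium 4 and Intel Xeon). */
struct wc_eviction plan_wc_eviction(uint64_t valid_mask, unsigned buffer_bytes)
{
    struct wc_eviction e = { 0, 0 };
    uint64_t full = (buffer_bytes == 64) ? UINT64_MAX
                                         : ((UINT64_C(1) << buffer_bytes) - 1);
    valid_mask &= full;

    if (valid_mask == full) {      /* every byte valid: single burst write */
        e.bursts = 1;
        return e;
    }
    for (unsigned c = 0; c < buffer_bytes / CHUNK_BYTES; c++)
        if ((valid_mask >> (c * CHUNK_BYTES)) & 0xFF)
            e.partial_writes++;    /* chunk holds at least one valid byte */
    return e;
}

/* Usage: one missing byte turns a single burst into 8 partial writes. */
void example_wc_eviction(void)
{
    assert(plan_wc_eviction(UINT64_MAX, 64).bursts == 1);
    assert(plan_wc_eviction(UINT64_MAX >> 1, 64).partial_writes == 8);
}
```

The usage example shows why software that streams data through WC memory generally benefits from writing whole buffers.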
The WC memory type is weakly ordered by definition. Once the eviction of a WC buffer has
started, the data is subject to the weak ordering semantics of its definition. Ordering is not maintained
between the successive allocation/deallocation of WC buffers (for example, writes to WC
buffer 1 followed by writes to WC buffer 2 may appear as buffer 2 followed by buffer 1 on the
system bus). When a WC buffer is evicted to memory as partial writes there is no guaranteed
ordering between successive partial writes (for example, a partial write for chunk 2 may appear
on the bus before the partial write for chunk 1 or vice versa).
The only elements of WC propagation to the system bus that are guaranteed are those provided
by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer
will always be propagated as a single 32-byte burst transaction using any chunk order. In a WC
buffer eviction where the data will be evicted as partials, all data contained in the same chunk
(0 mod 8 aligned) will be propagated simultaneously. Likewise, with a Pentium 4 or Intel Xeon
processor, a full WC buffer will always be propagated as a single burst transaction, using any
chunk order within a transaction. For partial buffer propagations, all data contained in the same
chunk will be propagated simultaneously.
10.3.2 Choosing a Memory Type
The simplest system memory model does not use memory-mapped I/O with read or write side
effects, does not include a frame buffer, and uses the write-back memory type for all memory.
An I/O agent can perform direct memory access (DMA) to write-back memory and the cache
protocol maintains cache coherency.
A system can use strong uncacheable memory for other memory-mapped I/O, and should
always use strong uncacheable memory for memory-mapped I/O with read side effects.
Dual-ported memory can be considered a write side effect, making relatively prompt writes
desirable, because those writes cannot be observed at the other port until they reach the memory
agent. A system can use strong uncacheable, uncacheable, write-through, or write-combining
memory for frame buffers or dual-ported memory that contains pixel values displayed on a
screen. Frame buffer memory is typically large (a few megabytes) and is usually written more
than it is read by the processor. Using strong uncacheable memory for a frame buffer generates
very large amounts of bus traffic, because operations on the entire buffer are implemented using
partial writes rather than line writes. Using write-through memory for a frame buffer can
displace almost all other useful cached lines in the processor's L2 and L3 caches and L1 data
cache. Therefore, systems should use write-combining memory for frame buffers whenever
possible.
Software can use page-level cache control to assign appropriate effective memory types when
software will not access data structures in ways that benefit from write-back caching. For
example, software may read a large data structure once and not access the structure again until
the structure is rewritten by another agent. Such a large data structure should be marked as
uncacheable, or reading it will evict cached lines that the processor will be referencing again.
A similar example would be a write-only data structure that is written to (to export the data to
another agent), but never read by software. Such a structure can be marked as uncacheable,
because software never reads the values that it writes (though as uncacheable memory, it will be
written using partial writes, while as write-back memory, it will be written using line writes,
which may not occur until the other agent reads the structure and triggers implicit write-backs).
On the Pentium III, Pentium 4, and Intel Xeon processors, new instructions are provided that
give software greater control over the caching, prefetching, and the write-back characteristics of
data. These instructions allow software to use weakly ordered or processor ordered memory
types to improve processor performance and, when necessary, to force strong ordering on
memory reads and/or writes. They also allow software greater control over the caching of data.
For a description of these instructions and their intended use, see Section 10.5.5, “Cache
Management Instructions”.
10.4 CACHE CONTROL PROTOCOL
The following section describes the cache control protocol currently defined for the IA-32 architecture.
This protocol is used by the Pentium 4, Intel Xeon, P6 family, and Pentium processors.
In the L1 data cache and in the L2 and L3 unified caches, the MESI (modified, exclusive, shared,
invalid) cache protocol maintains consistency with caches of other processors. The L1 data
cache and the L2 and L3 unified caches have two MESI status flags per cache line. Each line
can thus be marked as being in one of the states defined in Table 10-4. In general, the operation
of the MESI protocol is transparent to programs.
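A simplified model of the protocol may help when reading Table 10-4. The sketch below follows the usual textbook MESI transitions for write-back memory with write allocation. It is not a description of any particular processor's implementation, and the Pentium processor in particular does not allocate on write misses. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };
enum mesi_event { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE };

/* Last row of Table 10-4: does a write to a line in this state reach the system bus? */
bool write_reaches_bus(enum mesi_state s)
{
    return s == MESI_SHARED || s == MESI_INVALID;
}

/* Next state of one cache line in the simplified model.
   others_have_copy is consulted only when a local read misses. */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event ev, bool others_have_copy)
{
    switch (ev) {
    case LOCAL_READ:
        if (s == MESI_INVALID)  /* read miss: line fill */
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return s;               /* read hit: no change */
    case LOCAL_WRITE:
        return MESI_MODIFIED;   /* gain ownership if needed, then modify */
    case SNOOP_READ:
        /* Another processor reads: a valid line becomes shared
           (a modified line is written back first). */
        return s == MESI_INVALID ? MESI_INVALID : MESI_SHARED;
    case SNOOP_WRITE:
        return MESI_INVALID;    /* another processor writes: invalidate */
    }
    return s;
}

/* Usage: a line read alone, written, then read by another processor. */
void example_mesi(void)
{
    enum mesi_state s = mesi_next(MESI_INVALID, LOCAL_READ, false);
    assert(s == MESI_EXCLUSIVE);
    s = mesi_next(s, LOCAL_WRITE, false);
    assert(s == MESI_MODIFIED);
    s = mesi_next(s, SNOOP_READ, false);
    assert(s == MESI_SHARED);
}
```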
The L1 instruction cache in P6 family processors implements only the “SI” part of the MESI
protocol, because the instruction cache is not writable. The instruction cache monitors changes
in the data cache to maintain consistency between the caches when instructions are modified.
See Section 10.6, “Self-Modifying Code”, for more information on the implications of caching
instructions.
10.5 CACHE CONTROL
The IA-32 architecture provides a variety of mechanisms for controlling the caching of data and
instructions and for controlling the ordering of reads and writes between the processor, the
caches, and memory. These mechanisms can be divided into two groups:
• Cache control registers and bits — The IA-32 architecture defines several dedicated
registers and various bits within control registers and page-directory and page-table entries that
control the caching of system memory locations in the L1, L2, and L3 caches. These
mechanisms control the caching of virtual memory pages and of regions of physical
memory.
Table 10-4. MESI Cache Line States
Cache Line State             M (Modified)      E (Exclusive)     S (Shared)              I (Invalid)
This cache line is valid?    Yes               Yes               Yes                     No
The memory copy is…          Out of date       Valid             Valid                   —
Copies exist in caches of    No                No                Maybe                   Maybe
other processors?
A write to this line …       Does not go to    Does not go to    Causes the processor    Goes directly to
                             the system bus.   the system bus.   to gain exclusive       the system bus.
                                                                 ownership of the line.
• Cache control and memory ordering instructions — The IA-32 architecture provides
several instructions that control the caching of data, the ordering of memory reads and
writes, and the prefetching of data. These instructions allow software to control the
caching of specific data structures, to control memory coherency for specific locations in
memory, and to force strong memory ordering at specific locations in a program.
The following sections describe these two groups of cache control mechanisms.
10.5.1 Cache Control Registers and Bits
The current IA-32 architecture provides the following cache-control registers and bits for use in
enabling and/or restricting caching to various pages or regions in memory (see Figure 10-2):
• CD flag, bit 30 of control register CR0 — Controls caching of system memory locations
(see Section 2.5, “Control Registers”). If the CD flag is clear, caching is enabled for the
whole of system memory, but may be restricted for individual pages or regions of memory
by other cache-control mechanisms. When the CD flag is set, caching is restricted in the
processor’s caches (cache hierarchy) for the Pentium 4, Intel Xeon, and P6 family
processors and prevented for the Pentium processor (see note below). With the CD flag set,
however, the caches will still respond to snoop traffic. Caches should be explicitly flushed
to ensure memory coherency. For highest processor performance, both the CD and the NW
flags in control register CR0 should be cleared. Table 10-5 shows the interaction of the CD
and NW flags.
The effect of setting the CD flag is somewhat different for the Pentium 4, Intel Xeon,
and P6 family processors than for the Pentium processor (see Table 10-5). To ensure
memory coherency after the CD flag is set, the caches should be explicitly flushed (see
Section 10.5.3, “Preventing Caching”). Setting the CD flag for the Pentium 4, Intel
Xeon, and P6 family processors modifies cache line fill and update behavior. Also for
the Pentium 4, Intel Xeon, and P6 family processors, setting the CD flag does not force
strict ordering of memory accesses unless the MTRRs are disabled and/or all memory is
referenced as uncached (see Section 7.2.4, “Strengthening or Weakening the Memory
Ordering Model”).
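The interaction of the CD and NW flags summarized in Table 10-5 can be expressed as a small decoder. The sketch takes a CR0 value as a plain integer, since the register itself can be read only by privileged code. The position of the NW flag (bit 29) is not stated in the text above and is taken as known, and the enumeration and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_NW (UINT32_C(1) << 29)  /* not write-through (bit position assumed, see lead-in) */
#define CR0_CD (UINT32_C(1) << 30)  /* cache disable (bit 30, as stated in the text) */

/* The four CD/NW combinations of Table 10-5. */
enum cache_mode {
    CACHE_NORMAL,                /* CD = 0, NW = 0: normal cache mode */
    CACHE_INVALID_SETTING,       /* CD = 0, NW = 1: loading it generates #GP(0) */
    CACHE_NO_FILL,               /* CD = 1, NW = 0: no-fill mode, coherency maintained */
    CACHE_NO_FILL_NOT_COHERENT   /* CD = 1, NW = 1: memory coherency not maintained */
};

enum cache_mode cache_mode_from_cr0(uint32_t cr0)
{
    int cd = (cr0 & CR0_CD) != 0;
    int nw = (cr0 & CR0_NW) != 0;

    if (!cd)
        return nw ? CACHE_INVALID_SETTING : CACHE_NORMAL;
    return nw ? CACHE_NO_FILL_NOT_COHERENT : CACHE_NO_FILL;
}

/* Usage: both flags clear selects the highest performance mode. */
void example_cache_mode(void)
{
    assert(cache_mode_from_cr0(0) == CACHE_NORMAL);
    assert(cache_mode_from_cr0(CR0_CD) == CACHE_NO_FILL);
}
```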
Figure 10-2. Cache-Control Registers and Bits Available in IA-32 Processors
[Figure not reproduced. Elements shown:
- CR0: CD and NW flags control overall caching of system memory.
- CR3: PCD and PWT flags control caching of the page directory.
- CR4: PGE flag enables global pages designated with the G flag.
- Page-directory or page-table entry: PCD and PWT flags control page-level caching; G flag (note 1) controls page-level flushing of TLBs.
- PAT (note 4) controls caching of virtual memory pages.
- MTRRs (note 3) control caching of selected regions of physical memory (0 to FFFFFFFFH, note 2).
- IA32_MISC_ENABLE MSR: 3rd Level Cache Disable.
- Also shown: TLBs, Store Buffer, Physical Memory.]
1. G flag only available in Pentium 4, Intel Xeon, and P6 family processors.
2. The maximum physical address size is reported by CPUID leaf function 80000008H. The maximum physical address size of FFFFFFFFFH applies only if 36-bit physical addressing is used.
3. MTRRs available only in Pentium 4 and P6 family processors; similar control available in Pentium processor with the KEN# and WB/WT# pins.
4. PAT available only in Pentium III and Pentium 4 processors.
Table 10-5. Cache Operating Modes
CD  NW  Caching and Read/Write Policy                                           L1    L2/L3¹
0   0   Normal Cache Mode. Highest performance cache operation.
        - Read hits access the cache; read misses may cause replacement.        Yes   Yes
        - Write hits update the cache.                                          Yes   Yes
        - Only writes to shared lines and write misses update system memory.    Yes   Yes
        - Write misses cause cache line fills.                                  Yes   Yes
        - Write hits can change shared lines to modified under control of the   Yes   Yes
          MTRRs and with associated read invalidation cycle.
        - (Pentium processor only.) Write misses do not cause cache line fills. Yes
        - (Pentium processor only.) Write hits can change shared lines to       Yes
          exclusive under control of WB/WT#.
        - Invalidation is allowed.                                              Yes   Yes
        - External snoop traffic is supported.                                  Yes   Yes
0   1   Invalid setting.
        Generates a general-protection exception (#GP) with an error code of 0. NA    NA
1   0   No-fill Cache Mode. Memory coherency is maintained.
        - (Pentium 4 and Intel Xeon processors.) State of processor after a     Yes   Yes
          power up or reset.
        - Read hits access the cache; read misses do not cause replacement      Yes   Yes
          (see Pentium 4 and Intel Xeon processors reference below).
        - Write hits update the cache.                                          Yes   Yes
        - Only writes to shared lines and write misses update system memory.    Yes   Yes
        - Write misses access memory.                                           Yes   Yes
        - Write hits can change shared lines to exclusive under control of the  Yes   Yes
          MTRRs and with associated read invalidation cycle.
        - (Pentium processor only.) Write hits can change shared lines to       Yes
          exclusive under control of the WB/WT#.
        - (Pentium 4, Intel Xeon, and P6 family processors only.) Strict memory Yes   Yes
          ordering is not enforced unless the MTRRs are disabled and/or all
          memory is referenced as uncached (see Section 7.2.4, “Strengthening
          or Weakening the Memory Ordering Model”).
        - Invalidation is allowed.                                              Yes   Yes
        - External snoop traffic is supported.                                  Yes   Yes
1 1 Memory coherency is not maintained.2
- (P6 family and Pentium processors.) State of the processor after a
power up or reset.
- Read hits access the cache; read misses do not cause replacement.
- Write hits update the cache and change exclusive lines to modified.
- Shared lines remain shared after write hit.
- Write misses access memory.
- Invalidation is inhibited when snooping; but is allowed with INVD and
WBINVD instructions.
- External snoop traffic is supported.
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
NOTES:
1. The L2/L3 column in this table is definitive for the Pentium 4, Intel Xeon, and P6 family processors. It is
intended to represent what could be implemented in a system based on a Pentium processor with an
external, platform specific, write-back L2 cache.
2. The Pentium 4 and Intel Xeon processors do not support this mode; setting the CD and NW bits to 1
selects the no-fill cache mode.
• NW flag, bit 29 of control register CR0 — Controls the write policy for system memory
locations (see Section 2.5, “Control Registers”). If the NW and CD flags are clear, write-back
is enabled for the whole of system memory, but may be restricted for individual pages
or regions of memory by other cache-control mechanisms. Table 10-5 shows how the other
combinations of CD and NW flags affect caching.
NOTES
For the Pentium 4 and Intel Xeon processors, the NW flag is a don’t care flag;
that is, when the CD flag is set, the processor uses the no-fill cache mode,
regardless of the setting of the NW flag.
For the Pentium processor, when the L1 cache is disabled (the CD and NW
flags in control register CR0 are set), external snoops are accepted in DP
(dual-processor) systems and inhibited in uniprocessor systems.
When snoops are inhibited, address parity is not checked and APCHK# is not
asserted for a corrupt address; however, when snoops are accepted, address
parity is checked and APCHK# is asserted for corrupt addresses.
• PCD flag in the page-directory and page-table entries — Controls caching for
individual page tables and pages, respectively (see Section 3.7.6, “Page-Directory and
Page-Table Entries”). This flag only has effect when paging is enabled and the CD flag in
control register CR0 is clear. The PCD flag enables caching of the page table or page when
clear and prevents caching when set.
• PWT flag in the page-directory and page-table entries — Controls the write policy for
individual page tables and pages, respectively (see Section 3.7.6, “Page-Directory and
Page-Table Entries”). This flag only has effect when paging is enabled and the NW flag in
control register CR0 is clear. The PWT flag enables write-back caching of the page table or
page when clear and write-through caching when set.
• PCD and PWT flags in control register CR3 — Control the global caching and write
policy for the page directory (see Section 2.5, “Control Registers”). The PCD flag enables
caching of the page directory when clear and prevents caching when set. The PWT flag
enables write-back caching of the page directory when clear and write-through caching
when set. These flags do not affect the caching and write policy for individual page tables.
These flags only have effect when paging is enabled and the CD flag in control register
CR0 is clear.
• G (global) flag in the page-directory and page-table entries (introduced to the IA-32
architecture in the P6 family processors) — Controls the flushing of TLB entries for
individual pages. See Section 3.12, “Translation Lookaside Buffers (TLBs)”, for more
information about this flag.
• PGE (page global enable) flag in control register CR4 — Enables the establishment of
global pages with the G flag. See Section 3.12, “Translation Lookaside Buffers (TLBs)”,
for more information about this flag.
• Memory type range registers (MTRRs) (introduced in P6 family processors) —
Control the type of caching used in specific regions of physical memory. Any of the
caching types described in Section 10.3, “Methods of Caching Available”, can be selected.
See Section 10.11, “Memory Type Range Registers (MTRRs)”, for a detailed description
of the MTRRs.
• Page Attribute Table (PAT) MSR (introduced in the Pentium III processor) — Extends
the memory typing capabilities of the processor to permit memory types to be assigned on
a page-by-page basis (see Section 10.12, “Page Attribute Table (PAT)”).
• Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (introduced
in the Intel Xeon processors) — Allows the L3 cache to be disabled and enabled,
independently of the L1 and L2 caches.
• KEN# and WB/WT# pins (Pentium processor) — Allow external hardware to control
the caching method used for specific areas of memory. They perform similar (but not
identical) functions to the MTRRs in the P6 family processors.
• PCD and PWT pins (Pentium processor) — These pins (which are associated with the
PCD and PWT flags in control register CR3 and in the page-directory and page-table
entries) permit caching in an external L2 cache to be controlled on a page-by-page basis,
consistent with the control exercised on the L1 cache of these processors. The Pentium 4,
Intel Xeon, and P6 family processors do not provide these pins because the L2 cache is
internal to the chip package.
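The interaction of the CD and NW flags summarized in Table 10-5 can be sketched in C as follows. This is an illustrative decoder only, not a definition of processor behavior. The NW bit position (29) is given in the text above; the CD bit position (30) and all identifiers (cache_mode, cache_mode_from_cr0, nw_is_dont_care) are assumptions introduced for the example.

```c
#include <stdint.h>

/* CR0 flag positions: NW is bit 29 (stated in the text); CD is assumed to be bit 30. */
#define CR0_NW (UINT32_C(1) << 29)
#define CR0_CD (UINT32_C(1) << 30)

/* Hypothetical names for the four CD/NW combinations of Table 10-5. */
typedef enum {
    CACHE_MODE_NORMAL,       /* CD = 0, NW = 0 */
    CACHE_MODE_INVALID,      /* CD = 0, NW = 1: raises #GP(0) */
    CACHE_MODE_NO_FILL,      /* CD = 1, NW = 0 */
    CACHE_MODE_NO_COHERENCY  /* CD = 1, NW = 1 */
} cache_mode;

/*
 * Classify a CR0 value. Pass nw_is_dont_care = 1 to model the Pentium 4 and
 * Intel Xeon behavior, where NW is ignored once CD is set.
 */
static cache_mode cache_mode_from_cr0(uint32_t cr0, int nw_is_dont_care)
{
    int cd = (cr0 & CR0_CD) != 0;
    int nw = (cr0 & CR0_NW) != 0;

    if (!cd)
        return nw ? CACHE_MODE_INVALID : CACHE_MODE_NORMAL;
    if (nw_is_dont_care)
        return CACHE_MODE_NO_FILL;
    return nw ? CACHE_MODE_NO_COHERENCY : CACHE_MODE_NO_FILL;
}
```

The page-level and MTRR controls listed above then restrict caching further only when this function would report the normal mode.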
10.5.2 Precedence of Cache Controls
The cache control flags and MTRRs operate hierarchically for restricting caching. That is, if the
CD flag is set, caching is prevented globally (see Table 10-5). If the CD flag is clear, the page-level
cache control flags and/or the MTRRs can be used to restrict caching. If there is an overlap
of page-level and MTRR caching controls, the mechanism that prevents caching has precedence.
For example, if an MTRR makes a region of system memory uncacheable, a page-level
caching control cannot be used to enable caching for a page in that region. The converse is also
true; that is, if a page-level caching control designates a page as uncacheable, an MTRR cannot
be used to make the page cacheable.
In cases where there is an overlap in the assignment of the write-back and write-through caching
policies to a page and a region of memory, the write-through policy takes precedence. The write-combining
policy (which can only be assigned through an MTRR or the PAT) takes precedence
over either write-through or write-back.
The selection of memory types at the page level varies depending on whether PAT is being used
to select memory types for pages, as described in the following sections.
The third-level cache disable flag (bit 6 of the IA32_MISC_ENABLE MSR) takes precedence over
the CD flag, MTRRs, and PAT for the L3 cache. That is, when the third-level cache disable flag
is set (cache disabled), the other cache controls have no effect on the L3 cache; when the flag is
clear (enabled), the cache controls have the same effect on the L3 cache as they have on the L1
and L2 caches.
10.5.2.1 Selecting Memory Types for Pentium Pro and Pentium II Processors
The Pentium Pro and Pentium II processors do not support the PAT. Here, the effective memory
type for a page is selected with the MTRRs and the PCD and PWT bits in the page-table or page-directory
entry for the page. Table 10-6 describes the mapping of MTRR memory types and
page-level caching attributes to effective memory types, when normal caching is in effect (the
CD and NW flags in control register CR0 are clear). Combinations that appear in gray are implementation-defined
for the Pentium Pro and Pentium II processors. System designers are encouraged
to avoid these implementation-defined combinations.
When normal caching is in effect, the effective memory type shown in Table 10-6 is determined
using the following rules:
1. If the PCD and PWT attributes for the page are both 0, then the effective memory type is
identical to the MTRR-defined memory type.
2. If the PCD flag is set, then the effective memory type is UC.
3. If the PCD flag is clear and the PWT flag is set, the effective memory type is WT for the
WB memory type and the MTRR-defined memory type for all other memory types.
4. Setting the PCD and PWT flags to opposite values is considered model-specific for the WP
and WC memory types and architecturally-defined for the WB, WT, and UC memory
types.
Table 10-6. Effective Page-Level Memory Type for Pentium Pro and
Pentium II Processors
MTRR Memory Type (1)   PCD Value   PWT Value   Effective Memory Type
UC                     X           X           UC
WC                     0           0           WC
WC                     0           1           WC
WC                     1           0           WC
WC                     1           1           UC
WT                     0           X           WT
WT                     1           X           UC
WP                     0           0           WP
WP                     0           1           WP
WP                     1           0           WC
WP                     1           1           UC
WB                     0           0           WB
WB                     0           1           WT
WB                     1           X           UC
NOTE:
1. These effective memory types also apply to the Pentium 4, Intel Xeon, and Pentium III processors
when the PAT bit is not used (set to 0) in page-table and page-directory entries.
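As an illustration, the rows of Table 10-6 can be expressed as a small C function. This is a sketch of the table as printed, not an architectural definition; the type and function names (mem_type, effective_type_no_pat) are hypothetical. The PCD = 1, PWT = 0 results for the WC and WP types are the model-specific combinations described in rule 4 and are reproduced here only because the table lists them.

```c
/* Memory types that can appear in Table 10-6 (names are illustrative). */
typedef enum { MT_UC, MT_WC, MT_WT, MT_WP, MT_WB } mem_type;

/*
 * Effective page-level memory type when the PAT is not used,
 * following Table 10-6 row by row.
 */
static mem_type effective_type_no_pat(mem_type mtrr, int pcd, int pwt)
{
    if (mtrr == MT_UC)
        return MT_UC;                 /* UC in the MTRR always wins */
    if (pcd && pwt)
        return MT_UC;                 /* PCD = 1, PWT = 1 */
    if (pcd) {
        /* PCD = 1, PWT = 0: the table lists WC for WC and WP (model-specific). */
        return (mtrr == MT_WC || mtrr == MT_WP) ? MT_WC : MT_UC;
    }
    if (pwt)
        return (mtrr == MT_WB) ? MT_WT : mtrr;   /* rule 3 */
    return mtrr;                      /* rule 1: PCD = 0, PWT = 0 */
}
```

For example, a page with PWT set inside a WB region resolves to WT, while the same flags inside a WP region leave the type as WP.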
10.5.2.2 Selecting Memory Types for Pentium 4, Intel Xeon,
and Pentium III Processors
The Pentium 4, Intel Xeon, and Pentium III processors use the PAT to select effective page-level
memory types. Here, a memory type for a page is selected by the MTRRs and the value in a PAT
entry that is selected with the PAT, PCD and PWT bits in a page-table or page-directory entry
(see Section 10.12.3, “Selecting a Memory Type from the PAT”). Table 10-7 describes the
mapping of MTRR memory types and PAT entry types to effective memory types, when normal
caching is in effect (the CD and NW flags in control register CR0 are clear). The combinations
shown in gray are implementation-defined for the Pentium 4, Intel Xeon, and Pentium III processors.
System designers are encouraged to avoid the implementation-defined combinations.
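The combinations listed in Table 10-7 lend themselves to a two-dimensional lookup. The following C sketch encodes the table as printed; the enumerator and function names are hypothetical, and the footnote distinctions (whether the UC result originates from the MTRRs or from the page-level entry) are not modeled.

```c
/* MTRR memory types, in the row order of Table 10-7. */
typedef enum { MTRR_UC, MTRR_WC, MTRR_WT, MTRR_WB, MTRR_WP, MTRR_TYPES } mtrr_type;

/* PAT entry types, in the order used within each row group of Table 10-7. */
typedef enum { PAT_UC, PAT_UC_MINUS, PAT_WC, PAT_WT, PAT_WB, PAT_WP, PAT_TYPES } pat_type;

/* Effective memory types. */
typedef enum { EFF_UC, EFF_WC, EFF_WT, EFF_WB, EFF_WP } eff_type;

static const eff_type table_10_7[MTRR_TYPES][PAT_TYPES] = {
    /* PAT:      UC      UC-     WC      WT      WB      WP    */
    /* UC */ { EFF_UC, EFF_UC, EFF_WC, EFF_UC, EFF_UC, EFF_UC },
    /* WC */ { EFF_UC, EFF_WC, EFF_WC, EFF_UC, EFF_WC, EFF_UC },
    /* WT */ { EFF_UC, EFF_UC, EFF_WC, EFF_WT, EFF_WT, EFF_WP },
    /* WB */ { EFF_UC, EFF_UC, EFF_WC, EFF_WT, EFF_WB, EFF_WP },
    /* WP */ { EFF_UC, EFF_WC, EFF_WC, EFF_WT, EFF_WP, EFF_WP },
};

/* Look up the effective page-level memory type for an MTRR/PAT pair. */
static eff_type effective_type_pat(mtrr_type mtrr, pat_type pat)
{
    return table_10_7[mtrr][pat];
}
```

A table is used rather than branching logic because several of the combinations are implementation-defined and do not follow a single precedence rule.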
Table 10-7. Effective Page-Level Memory Types for Pentium III, Pentium 4,
and Intel Xeon Processors
MTRR Memory Type   PAT Entry Value   Effective Memory Type
UC                 UC                UC (1)
UC                 UC-               UC (1)
UC                 WC                WC
UC                 WT                UC (1)
UC                 WB                UC (1)
UC                 WP                UC (1)
WC                 UC                UC (2)
WC                 UC-               WC
WC                 WC                WC
WC                 WT                UC (2,3)
WC                 WB                WC
WC                 WP                UC (2,3)
WT                 UC                UC (2)
WT                 UC-               UC (2)
WT                 WC                WC
WT                 WT                WT
WT                 WB                WT
WT                 WP                WP (3)
WB                 UC                UC (2)
WB                 UC-               UC (2)
WB                 WC                WC
WB                 WT                WT
WB                 WB                WB
WB                 WP                WP
WP                 UC                UC (2)
WP                 UC-               WC (3)
WP                 WC                WC
WP                 WT                WT (3)
WP                 WB                WP
WP                 WP                WP
NOTES:
1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches
since the data could never have been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check
their caches because the data may be cached due to page aliasing, which is not recommended.
3. These combinations were specified as “undefined” in previous editions of the IA-32 Intel Architecture
Software Developer’s Manual. However, all processors that support both the PAT and the MTRRs determine
the effective page-level memory types for these combinations as given.
10.5.2.3 Writing Values Across Pages with Different Memory Types
If two adjoining pages in memory have different memory types, and a word or longer operand
is written to a memory location that crosses the page boundary between those two pages, the
operand might be written to memory twice. This action does not present a problem for writes to
actual memory; however, if a device is mapped to the memory space assigned to the pages, the
device might malfunction.
10.5.3 Preventing Caching
To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills,
perform the following steps:
1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag
to 0.)
2. Flush all caches using the WBINVD instruction.
3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the
uncached memory type (see the discussion of the TYPE field and the E
flag in Section 10.11.2.1, “IA32_MTRR_DEF_TYPE MSR”).
The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency.
If the caches are not flushed, cache hits on reads will still occur and data will be read from valid
cache lines.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching
behavior as indicated in Table 10-5, but it does not force the effective
memory type for all physical memory to be UC nor does it force strict
memory ordering. To force the UC memory type and strict memory ordering
on all of physical memory, either the MTRRs must all be programmed for the
UC memory type or they must be disabled.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps
given above has been executed, the cache lines containing the code between
the end of the WBINVD instruction and before the MTRRs have actually
been disabled may be retained in the cache hierarchy. Here, to remove code
from the cache completely, a second WBINVD instruction must be executed
after the MTRRs have been disabled.
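The ordering of the cache-disable steps, including the second WBINVD described in the note for the Pentium 4 and Intel Xeon processors, can be sketched in C as follows. Because the real operations are privileged, they are represented here by caller-supplied hooks; struct cache_ops, disable_caches, and the mock_* helpers are hypothetical names introduced for the example, and the CD bit position (30) is an assumption. The mock only records the order of operations so the sequence can be checked in user mode.

```c
#include <stdint.h>

#define CR0_NW (UINT32_C(1) << 29)
#define CR0_CD (UINT32_C(1) << 30)   /* assumed bit position of the CD flag */

/* Hypothetical hooks standing in for privileged operations. */
struct cache_ops {
    uint32_t (*read_cr0)(void *ctx);
    void     (*write_cr0)(void *ctx, uint32_t value);
    void     (*wbinvd)(void *ctx);
    void     (*disable_mtrrs)(void *ctx);
};

/* Apply the cache-disable steps in order, then flush a second time. */
static void disable_caches(const struct cache_ops *ops, void *ctx)
{
    uint32_t cr0 = ops->read_cr0(ctx);

    ops->write_cr0(ctx, (cr0 | CR0_CD) & ~CR0_NW); /* step 1: no-fill mode   */
    ops->wbinvd(ctx);                              /* step 2: flush caches   */
    ops->disable_mtrrs(ctx);                       /* step 3: MTRRs off      */
    ops->wbinvd(ctx);                              /* second flush (P4/Xeon) */
}

/* Mock processor state that records each operation as one letter. */
struct mock_cpu {
    uint32_t cr0;
    char     trace[8];
    int      n;
};

static uint32_t mock_read_cr0(void *c) { return ((struct mock_cpu *)c)->cr0; }

static void mock_write_cr0(void *c, uint32_t v)
{
    struct mock_cpu *m = c;
    m->cr0 = v;
    m->trace[m->n++] = 'C';
}

static void mock_wbinvd(void *c)
{
    struct mock_cpu *m = c;
    m->trace[m->n++] = 'W';
}

static void mock_disable_mtrrs(void *c)
{
    struct mock_cpu *m = c;
    m->trace[m->n++] = 'M';
}

static const struct cache_ops mock_ops = {
    mock_read_cr0, mock_write_cr0, mock_wbinvd, mock_disable_mtrrs
};
```

In system software the hooks would wrap the corresponding MOV CR0, WBINVD, and WRMSR instructions; that wiring is outside the scope of this sketch.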
10.5.4 Disabling and Enabling the L3 Cache
The third-level cache disable flag (bit 6 of the IA32_MISC_ENABLE MSR) allows the L3 cache
to be disabled and enabled, independently of the L1 and L2 caches. Prior to using this control to
disable or enable the L3 cache, software should disable and flush all the processor caches, as
described earlier in Section 10.5.3, “Preventing Caching”, to prevent loss of information
stored in the L3 cache. After the L3 cache has been disabled or enabled, caching for the whole
processor can be restored.
10.5.5 Cache Management Instructions
The IA-32 architecture provides several instructions for managing the L1, L2, and L3 caches. The
INVD and WBINVD instructions are system instructions that operate on the L1, L2,
and L3 caches as a whole. The PREFETCHh and CLFLUSH instructions and the non-temporal
move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD), which
were introduced in SSE/SSE2 extensions, offer more granular control over caching.
The INVD and WBINVD instructions are used to invalidate the contents of the L1, L2, and L3
caches. The INVD instruction invalidates all internal cache entries, then generates a special-function
bus cycle that indicates that external caches also should be invalidated. The INVD
instruction should be used with care. It does not force a write-back of modified cache lines;
therefore, data stored in the caches and not written back to system memory will be lost. Unless
there is a specific requirement or benefit to invalidating the caches without writing back the
modified lines (such as during testing or fault recovery where cache coherency with main
memory is not a concern), software should use the WBINVD instruction.
The WBINVD instruction first writes back any modified lines in all the internal caches, then
invalidates the contents of the L1, L2, and L3 caches. It ensures that cache coherency with
main memory is maintained regardless of the write policy in effect (that is, write-through or
write-back). Following this operation, the WBINVD instruction generates one (P6 family
processors) or two (Pentium and Intel486 processors) special-function bus cycles to indicate to
external cache controllers that write-back of modified data followed by invalidation of external
caches should occur.
The PREFETCHh instructions allow a program to suggest to the processor that a cache line from
a specified location in system memory be prefetched into the cache hierarchy (see Section 10.8,
“Explicit Caching”).
The CLFLUSH instruction allows selected cache lines to be flushed from the caches. This instruction
gives a program the ability to explicitly free up cache space, when it is known that a cached
section of system memory will not be accessed in the near future.
The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and
MOVNTPD) allow data to be moved from the processor’s registers directly into system memory
without being also written into the L1, L2, and/or L3 caches. These instructions can be used to
prevent cache pollution when operating on data that is going to be modified only once before
being stored back into system memory. These instructions operate on data in the general-purpose,
MMX, and XMM registers.
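From application code, the non-temporal move and CLFLUSH instructions are usually reached through compiler intrinsics. The following C sketch assumes a compiler that provides the SSE2 intrinsics in emmintrin.h (_mm_stream_si32 for MOVNTI, _mm_clflush for CLFLUSH, _mm_sfence and _mm_mfence for ordering); where SSE2 is not available it falls back to ordinary stores. The function names and the cache-line size parameter are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_clflush, _mm_sfence, _mm_mfence */
#endif

/* Fill dst with value using non-temporal stores where SSE2 is available. */
static void fill_nontemporal(int *dst, size_t count, int value)
{
    for (size_t i = 0; i < count; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&dst[i], value);   /* MOVNTI: store without caching */
#else
        dst[i] = value;                    /* portable fallback */
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();   /* order the weakly ordered stores before later stores */
#endif
}

/*
 * Flush every cache line that overlaps [p, p + len). The line size is a
 * parameter (assumed to be a power of two) because it is processor specific.
 */
static void flush_lines(const void *p, size_t len, size_t line)
{
#if defined(__SSE2__)
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(line - 1);
    uintptr_t end  = (uintptr_t)p + len;

    for (; addr < end; addr += line)
        _mm_clflush((const void *)addr);   /* CLFLUSH one line */
    _mm_mfence();
#else
    (void)p; (void)len; (void)line;        /* nothing to do without SSE2 */
#endif
}
```

Non-temporal stores suit buffers that are written once and not read back soon; for data that will be reused shortly, ordinary cached stores are generally the better choice.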
10.5.6 L1 Data Cache Context Mode
L1 data cache context mode is a feature of IA-32 processors that support Hyper-Threading Technology.
When CPUID.1:ECX[bit 10] = 1, the processor supports setting L1 data cache context
mode using the L1 data cache context mode flag (IA32_MISC_ENABLE[bit 24]). Selectable
modes are adaptive mode (default) and shared mode.
The BIOS is responsible for configuring the L1 data cache context mode.
10.5.6.1 Adaptive Mode
Adaptive mode facilitates L1 data cache sharing between logical processors. When running in
adaptive mode, the L1 data cache is shared across logical processors in the same core if:
• CR3 control registers for logical processors sharing the cache are identical.
• The same paging mode is used by logical processors sharing the cache.
In this situation, the entire L1 data cache is available to each logical processor (instead of being
competitively shared).
If CR3 values are different for the logical processors sharing an L1 data cache or the logical
processors use different paging modes, processors compete for cache resources. This reduces
the effective size of the cache for each logical processor. Aliasing of the cache is not allowed
(which prevents data thrashing).
10.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is
true even if the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear
address in the cache can point to different physical locations. The mechanism for resolving
aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the
preferred configuration for IA-32 processors that support Hyper-Threading Technology.
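The feature check and mode selection described in this section can be sketched in C as follows. The function takes previously read CPUID.1:ECX and IA32_MISC_ENABLE values as plain integers, so it does not execute any privileged instruction. The identifiers are hypothetical, and the interpretation that bit 24 = 1 selects shared mode is inferred from the statement that bit 24 = 0 is the preferred (adaptive, default) configuration.

```c
#include <stdint.h>

#define CPUID1_ECX_L1_CONTEXT  (UINT32_C(1) << 10)  /* CPUID.1:ECX[bit 10]      */
#define MISC_ENABLE_L1_CONTEXT (UINT64_C(1) << 24)  /* IA32_MISC_ENABLE[bit 24] */

typedef enum {
    L1_CTX_UNSUPPORTED,  /* feature flag clear: mode is not selectable */
    L1_CTX_ADAPTIVE,     /* bit 24 = 0 (default, preferred)            */
    L1_CTX_SHARED        /* bit 24 = 1 (inferred encoding)             */
} l1_ctx_mode;

/* Report the L1 data cache context mode implied by the two register values. */
static l1_ctx_mode l1_context_mode(uint32_t cpuid1_ecx, uint64_t misc_enable)
{
    if ((cpuid1_ecx & CPUID1_ECX_L1_CONTEXT) == 0)
        return L1_CTX_UNSUPPORTED;
    return (misc_enable & MISC_ENABLE_L1_CONTEXT) ? L1_CTX_SHARED
                                                  : L1_CTX_ADAPTIVE;
}
```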
10.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the processor causes
the associated cache line (or lines) to be invalidated. This check is based on the physical address
of the instruction. In addition, the P6 family and Pentium processors check whether a write to a
code segment may modify an instruction that has been prefetched for execution. If the write
affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on
the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a
snoop of an instruction in a code segment, where the target instruction is already decoded and
resident in the trace cache, invalidates the entire trace cache. The latter behavior means that
programs that self-modify code can cause severe degradation of performance when run on the
Pentium 4 and Intel Xeon processors.
In practice, the check on linear addresses should not create compatibility problems among IA-32
processors. Applications that include self-modifying code use the same linear address for modifying
and fetching the instruction. Systems software, such as a debugger, that might possibly
modify an instruction using a different linear address than that used to fetch the instruction, will
execute a serializing operation, such as a CPUID instruction, before the modified instruction is
executed, which will automatically resynchronize the instruction cache and prefetch queue. (See
Section 7.1.3, “Handling Self- and Cross-Modifying Code”, for more information about the use
of self-modifying code.)
For Intel486 processors, a write to an instruction in the cache will modify it in both the cache
and memory, but if the instruction was prefetched before the write, the old version of the instruction
could be the one executed. To prevent the old instruction from being executed, flush the
instruction prefetch unit by coding a jump instruction immediately after any write that modifies
an instruction.
10.7 IMPLICIT CACHING (PENTIUM 4, INTEL XEON,
AND P6 FAMILY PROCESSORS)
Implicit caching occurs when a memory element is made potentially cacheable, although the
element may never have been accessed in the normal von Neumann sequence. Implicit caching
occurs on the Pentium 4, Intel Xeon, and P6 family processors due to aggressive prefetching,
branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of
existing Intel386, Intel486, and Pentium processor systems, since software running on these
processor families also has not been able to deterministically predict the behavior of instruction
prefetch.
To avoid problems related to implicit caching, the operating system must explicitly invalidate
the cache when changes are made to cacheable data that the cache coherency mechanism does
not automatically handle. This includes writes to dual-ported or physically aliased memory
boards that are not detected by the snooping mechanisms of the processor, and changes to page-table
entries in memory.
The code in Example 10-1 shows the effect of implicit caching on page-table entries. The linear
address F000H points to physical location B000H (the page-table entry for F000H contains the
value B000H), and the page-table entry for linear address F000H is PTE_F000.
Example 10-1. Effect of Implicit Caching on Page-Table Entries
mov EAX, CR3           ; Invalidate the TLB
mov CR3, EAX           ; by copying CR3 to itself
mov PTE_F000, A000H    ; Change F000H to point to A000H
mov EBX, [F000H]       ; Load through linear address F000H
Because of speculative execution in the Pentium 4, Intel Xeon, and P6 family processors, the
last MOV instruction performed would place the value at physical location B000H into EBX,
rather than the value at the new physical address A000H. This situation is remedied by placing
a TLB invalidation between the store to the page-table entry and the load that follows it.
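The effect can be illustrated with a toy software model in C. This is not a description of how the processor implements its TLBs; it models a single cached translation for linear page F000H and makes the speculative refill an explicit step so that the stale result can be observed. All names (toy_mmu, translate_f000, run_example) are hypothetical.

```c
#include <stdint.h>

/* Toy model: one page-table entry and one cached (TLB) translation. */
struct toy_mmu {
    uint32_t pte_f000;   /* physical page that PTE_F000 points to    */
    int      tlb_valid;  /* nonzero if the TLB holds a translation   */
    uint32_t tlb_phys;   /* cached translation for linear page F000H */
};

/* Translate linear page F000H: use the TLB if valid, else walk and refill. */
static uint32_t translate_f000(struct toy_mmu *m)
{
    if (!m->tlb_valid) {
        m->tlb_phys  = m->pte_f000;
        m->tlb_valid = 1;
    }
    return m->tlb_phys;
}

static void flush_tlb(struct toy_mmu *m)
{
    m->tlb_valid = 0;
}

/*
 * Replay Example 10-1. The speculative access is modeled as an explicit
 * translation that happens after the CR3 reload but before the PTE store.
 * Returns the physical page used by the final load.
 */
static uint32_t run_example(int invalidate_after_store)
{
    struct toy_mmu m = { 0xB000, 1, 0xB000 };

    flush_tlb(&m);               /* mov CR3, EAX                       */
    (void)translate_f000(&m);    /* speculative refill caches B000H    */
    m.pte_f000 = 0xA000;         /* mov PTE_F000, A000H                */
    if (invalidate_after_store)
        flush_tlb(&m);           /* remedy: invalidate after the store */
    return translate_f000(&m);   /* mov EBX, [F000H]                   */
}
```

In this model the load uses B000H when no invalidation follows the page-table store and A000H when one does.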
10.8 EXPLICIT CACHING
The Pentium III processor introduced four new instructions, the PREFETCHh instructions, that
provide software with explicit control over the caching of data. These instructions provide
“hints” to the processor that the data requested by a PREFETCHh instruction should be read into
the cache hierarchy now or as soon as possible, in anticipation of its use. The instructions provide
different variations of the hint that allow selection of the cache level into which data will be read.
The PREFETCHh instructions can help reduce the long latency typically associated with
reading data from memory and thus help prevent processor “stalls.” However, these instructions
should be used judiciously. Overuse can lead to resource conflicts and hence reduce the performance
of an application. Also, these instructions should only be used to prefetch data from
memory; they should not be used to prefetch instructions. For more detailed information on the
proper use of the prefetch instruction, refer to Chapter 6, “Optimizing Cache Usage for the Intel
Pentium 4 Processors”, in the Pentium 4 Processor Optimization Reference Manual (see
Section 1.4, “Related Literature”, for the document order number).
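A data prefetch hint can be issued from C through the _mm_prefetch intrinsic declared in xmmintrin.h, which compilers targeting SSE provide. The following sketch prefetches a fixed distance ahead while summing an array; the function name and the prefetch distance parameter are illustrative assumptions, and a suitable distance has to be found by measurement on the target processor. Without SSE support the loop simply runs with no hints.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#endif

/* Sum an array, hinting that the element `ahead` positions later will be needed. */
static long sum_with_prefetch(const int *data, size_t n, size_t ahead)
{
    long sum = 0;

    (void)ahead;   /* unused when SSE is not available */
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE__)
        if (i + ahead < n)
            _mm_prefetch((const char *)&data[i + ahead], _MM_HINT_T0);
#endif
        sum += data[i];
    }
    return sum;
}
```

The hint never changes the result of the computation; it only affects when data arrives in the cache hierarchy, which is why the same function is correct with or without it.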
10.9 INVALIDATING THE TRANSLATION LOOKASIDE BUFFERS
(TLBS)
The processor updates its address translation caches (TLBs) transparently to software. Several
mechanisms are available, however, that allow software and hardware to invalidate the TLBs
either explicitly or as a side effect of another operation.
The INVLPG instruction invalidates the TLB for a specific page. This instruction is the most
efficient in cases where software only needs to invalidate a specific page, because it improves
performance over invalidating the whole TLB. This instruction is not affected by the state of the
G flag in a page-directory or page-table entry.
The following operations invalidate all TLB entries except global entries. (A global entry is one
for which the G (global) flag is set in its corresponding page-directory or page-table entry. The
global flag was introduced into the IA-32 architecture in the P6 family processors, see Section
10.5, “Cache Control”.)
• Writing to control register CR3.
• A task switch that changes control register CR3.
The following operations invalidate all TLB entries, irrespective of the setting of the G flag:
• Asserting or de-asserting the FLUSH# pin.
• (Pentium 4, Intel Xeon, and P6 family processors only.) Writing to an MTRR (with a
WRMSR instruction).
• Writing to control register CR0 to modify the PG or PE flag.
• (Pentium 4, Intel Xeon, and P6 family processors only.) Writing to control register CR4 to
modify the PSE, PGE, or PAE flag.
See Section 3.12, “Translation Lookaside Buffers (TLBs)”, for additional information about the
TLBs.
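The three classes of invalidation described above differ only in which entries they remove. The following C sketch models them on a small software table; it is a toy illustration with hypothetical names, not a model of the hardware, and it assumes global pages are enabled (PGE flag set) so that the G flag is honored.

```c
#include <stddef.h>
#include <stdint.h>

/* One modeled TLB entry. */
struct tlb_entry {
    uint32_t page;    /* linear page number                           */
    int      valid;   /* nonzero while the translation is cached      */
    int      global;  /* G flag from the page-directory or page-table */
};

/* INVLPG: drop the entry for one page, regardless of its G flag. */
static void tlb_invlpg(struct tlb_entry *tlb, size_t n, uint32_t page)
{
    for (size_t i = 0; i < n; i++)
        if (tlb[i].page == page)
            tlb[i].valid = 0;
}

/* CR3 write or task switch that changes CR3: drop all non-global entries. */
static void tlb_flush_non_global(struct tlb_entry *tlb, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!tlb[i].global)
            tlb[i].valid = 0;
}

/* FLUSH# pin, MTRR write, or CR0/CR4 paging-flag change: drop everything. */
static void tlb_flush_all(struct tlb_entry *tlb, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tlb[i].valid = 0;
}

/* Count the entries that are still valid. */
static size_t tlb_valid_count(const struct tlb_entry *tlb, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (tlb[i].valid)
            count++;
    return count;
}
```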
10.10 STORE BUFFER
IA-32 processors temporarily store each write (store) to memory in a store buffer. The store
buffer improves processor performance by allowing the processor to continue executing instructions
without having to wait until a write to memory and/or to a cache is complete. It also allows
writes to be delayed for more efficient use of memory-access bus cycles.
In general, the existence of the store buffer is transparent to software, even in systems that use
multiple processors. The processor ensures that write operations are always carried out in
program order. It also ensures that the contents of the store buffer are always drained to memory
in the following situations:
• When an exception or interrupt is generated.
• (Pentium 4, Intel Xeon, and P6 family processors only) When a serializing instruction is
executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (Pentium 4, Intel Xeon, and P6 family processors only) When a BINIT operation is
performed.
• (Pentium III, Pentium 4, and Intel Xeon processors only) When using an SFENCE
instruction to order stores.
• (Pentium 4 and Intel Xeon processors only) When using an MFENCE instruction to order
stores.
The discussion of write ordering in Section 7.2, “Memory Ordering”, gives a detailed description
of the operation of the store buffer.
10.11 MEMORY TYPE RANGE REGISTERS (MTRRS)
The following section pertains only to the Pentium 4, Intel Xeon, and P6 family processors.
The memory type range registers (MTRRs) provide a mechanism for associating the memory
types (see Section 10.3, “Methods of Caching Available”) with physical-address ranges in
system memory. They allow the processor to optimize operations for different types of memory
such as RAM, ROM, frame-buffer memory, and memory-mapped I/O devices. They also
simplify system hardware design by eliminating the memory control pins used for this function
on earlier IA-32 processors and the external logic needed to drive them.
The MTRR mechanism allows up to 96 memory ranges to be defined in physical memory, and
it defines a set of model-specific registers (MSRs) for specifying the type of memory that is
contained in each range. Table 10-8 shows the memory types that can be specified and their
properties; Figure 10-3 shows the mapping of physical memory with MTRRs. See Section 10.3,
“Methods of Caching Available”, for a more detailed description of each memory type.
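The layout in Figure 10-3 accounts for the 96 ranges: 8 + 16 + 64 = 88 fixed ranges below 1 MByte plus 8 variable ranges. As a worked illustration, the following C function maps a physical address below 100000H to the fixed range that covers it. The sequential index (0 through 87) is a numbering chosen for this example only and does not correspond to any register field.

```c
#include <stdint.h>

/*
 * Return an illustrative index (0..87) of the fixed range covering addr,
 * or -1 if addr lies at or above 100000H, where only variable ranges
 * and the default type apply.
 */
static int fixed_range_index(uint32_t addr)
{
    if (addr < 0x80000)                              /* 8 ranges of 64 KBytes  */
        return (int)(addr >> 16);
    if (addr < 0xC0000)                              /* 16 ranges of 16 KBytes */
        return 8 + (int)((addr - 0x80000) >> 14);
    if (addr < 0x100000)                             /* 64 ranges of 4 KBytes  */
        return 24 + (int)((addr - 0xC0000) >> 12);
    return -1;
}
```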
Following a hardware reset, a Pentium 4, Intel Xeon, or P6 family processor disables all the
fixed and variable MTRRs, which in effect makes all of physical memory uncacheable. Initialization
software should then set the MTRRs to a specific, system-defined memory map. Typically,
the BIOS (basic input/output system) software configures the MTRRs. The operating
system or executive is then free to modify the memory map using the normal page-level cacheability
attributes.
In a multiprocessor system, different Pentium 4, Intel Xeon, or P6 family processors MUST use
the identical MTRR memory map so that software has a consistent view of memory, independent
of the processor executing a program.
Table 10-8. Memory Types That Can Be Encoded in MTRRs
Memory Type and Mnemonic Encoding in MTRR
Uncacheable (UC) 00H
Write Combining (WC) 01H
Reserved* 02H
Reserved* 03H
Write-through (WT) 04H
Write-protected (WP) 05H
Writeback (WB) 06H
Reserved* 7H through FFH
NOTE:
* Use of these encodings results in a general-protection exception (#GP).
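The encodings in Table 10-8 can be captured in C as an enumeration with a validity check, which software might apply before writing a type field so that a reserved encoding is never used. The identifiers below are hypothetical; only the numeric values come from the table.

```c
#include <stdint.h>

/* Memory type encodings from Table 10-8. */
enum mtrr_encoding {
    MTRR_ENC_UC = 0x00,  /* Uncacheable     */
    MTRR_ENC_WC = 0x01,  /* Write Combining */
    MTRR_ENC_WT = 0x04,  /* Write-through   */
    MTRR_ENC_WP = 0x05,  /* Write-protected */
    MTRR_ENC_WB = 0x06   /* Writeback       */
};

/* Return the mnemonic for a valid encoding, or "reserved" otherwise. */
static const char *mtrr_encoding_name(uint8_t enc)
{
    switch (enc) {
    case MTRR_ENC_UC: return "UC";
    case MTRR_ENC_WC: return "WC";
    case MTRR_ENC_WT: return "WT";
    case MTRR_ENC_WP: return "WP";
    case MTRR_ENC_WB: return "WB";
    default:          return "reserved";  /* 02H, 03H, 07H through FFH */
    }
}

/* Nonzero if the encoding is one of the five defined memory types. */
static int mtrr_encoding_is_valid(uint8_t enc)
{
    return enc == MTRR_ENC_UC || enc == MTRR_ENC_WC || enc == MTRR_ENC_WT ||
           enc == MTRR_ENC_WP || enc == MTRR_ENC_WB;
}
```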
10.11.1 MTRR Feature Identification
The availability of the MTRR feature is model-specific. Software can determine if MTRRs are
supported on a processor by executing the CPUID instruction and reading the state of the MTRR
flag (bit 12) in the feature information register (EDX).
If the MTRR flag is set (indicating that the processor implements MTRRs), additional information
about MTRRs can be obtained from the 64-bit IA32_MTRRCAP MSR (named MTRRcap
MSR for the P6 family processors). The IA32_MTRRCAP MSR is a read-only MSR that can
be read with the RDMSR instruction. Figure 10-4 shows the contents of the IA32_MTRRCAP
MSR. The functions of the flags and field in this register are as follows:
• VCNT (variable range registers count) field, bits 0 through 7 — Indicates the number
of variable ranges implemented on the processor. The Pentium 4, Intel Xeon, and P6
family processors have eight pairs of MTRRs for setting up eight variable ranges.
• FIX (fixed range registers supported) flag, bit 8 — Fixed range MTRRs
(IA32_MTRR_FIX64K_00000 through IA32_MTRR_FIX4K_0F8000) are supported
when set; no fixed range registers are supported when clear.
Figure 10-3. Mapping Physical Memory With MTRRs
[Figure not reproduced. It shows physical memory from 0 to FFFFFFFFH divided as follows:]
- 0 through 7FFFFH (512 KBytes): 8 fixed ranges (64 KBytes each)
- 80000H through BFFFFH (256 KBytes): 16 fixed ranges (16 KBytes each)
- C0000H through FFFFFH (256 KBytes): 64 fixed ranges (4 KBytes each)
- 100000H through FFFFFFFFH: 8 variable ranges (from 4 KBytes to maximum size of physical memory)
- Address ranges not mapped by an MTRR are set to a default type.
• WC (write combining) flag, bit 10 — The write-combining (WC) memory type is
supported when set; the WC type is not supported when clear.
Bit 9 and bits 11 through 63 in the IA32_MTRRCAP MSR are reserved. If software attempts to
write to the IA32_MTRRCAP MSR, a general-protection exception (#GP) is generated.
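The feature check and the IA32_MTRRCAP fields described above can be decoded as in the following C sketch. It operates on register values that have already been read (CPUID feature flags in EDX and the 64-bit MSR value), so it contains no privileged instructions. The structure and function names are hypothetical; the bit positions are those given in the text.

```c
#include <stdint.h>

#define CPUID_EDX_MTRR (UINT32_C(1) << 12)  /* MTRR flag in the feature information */

/* Decoded view of the IA32_MTRRCAP MSR (field names are illustrative). */
struct mtrrcap {
    unsigned vcnt;  /* bits 0 through 7: number of variable ranges */
    int      fix;   /* bit 8: fixed range MTRRs supported          */
    int      wc;    /* bit 10: write-combining type supported      */
};

/* Nonzero if the CPUID feature flags report MTRR support. */
static int cpu_has_mtrr(uint32_t cpuid_edx)
{
    return (cpuid_edx & CPUID_EDX_MTRR) != 0;
}

/* Extract the defined fields; reserved bits are ignored. */
static struct mtrrcap decode_mtrrcap(uint64_t msr)
{
    struct mtrrcap cap;

    cap.vcnt = (unsigned)(msr & 0xFF);
    cap.fix  = (int)((msr >> 8) & 1);
    cap.wc   = (int)((msr >> 10) & 1);
    return cap;
}
```

Software would consult cpu_has_mtrr before reading the MSR at all, since the MSR is only meaningful when the feature flag is set.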
For the Pentium 4, Intel Xeon, and P6 family processors, the IA32_MTRRCAP MSR always
contai
CHAPTER 10 MEMORY CACHE CONTROL
This chapter describes the IA-32 architecture’s memory cache and cache control mechanisms, the
TLBs, and the store buffer. It also describes the memory type range registers (MTRRs) found in
the P6 family processors and how they are used to control caching of physical memory locations.
10.1 INTERNAL CACHES, TLBS, AND BUFFERS
The IA-32 architecture supports caches, translation lookaside buffers (TLBs), and a store buffer
for temporary on-chip (and external) storage of instructions and data. (Figure 10-1 shows the
arrangement of caches, TLBs, and the store buffer for the Pentium 4 and Intel Xeon processors.)
Table 10-1 shows the characteristics of these caches and buffers for the Pentium 4, Intel Xeon,
P6 family, and Pentium processors. The sizes and characteristics of these units are machine
specific and may change in future versions of the processor. The CPUID instruction returns
the sizes and characteristics of the caches and buffers for the processor on which the instruction
is executed (see “CPUID—CPU Identification” in Chapter 3 of the IA-32 Intel Architecture Software
Developer’s Manual, Volume 2).
Figure 10-1. Cache Structure of the Pentium 4 and Intel Xeon Processors
[Figure: block diagram with the following labeled units: System Bus (External), Physical Memory, Bus Interface Unit, L3 Cache†, L2 Cache, Data Cache Unit (L1), Trace Cache, Instruction Decoder, Instruction TLBs, Data TLBs, Store Buffer]
† Intel Xeon processors only
Table 10-1. Characteristics of the Caches, TLBs, Store Buffer, and Write Combining Buffer in IA-32 processors

Cache or Buffer    Characteristics

Trace Cache†
- Pentium 4 and Intel Xeon processors: 12 Kμops, 8-way set associative.
- Pentium M processor: not implemented.
- P6 family and Pentium processors: not implemented.

L1 Instruction Cache
- Pentium 4 and Intel Xeon processors: not implemented.
- Pentium M processor: 32-KByte, 8-way set associative.
- P6 family and Pentium processors: 8- or 16-KByte, 4-way set associative, 32-byte cache line size; 2-way set associative for earlier Pentium processors.

L1 Data Cache
- Pentium 4 and Intel Xeon processors: 8-KByte, 4-way set associative, 64-byte cache line size.
- Pentium 4 and Intel Xeon processors: 16-KByte, 8-way set associative, 64-byte cache line size.
- Pentium M processor: 32-KByte, 8-way set associative, 64-byte cache line size.
- P6 family processors: 16-KByte, 4-way set associative, 32-byte cache line size; 8-KByte, 2-way set associative for earlier P6 family processors.
- Pentium processors: 16-KByte, 4-way set associative, 32-byte cache line size; 8-KByte, 2-way set associative for earlier Pentium processors.

L2 Unified Cache
- Pentium 4 and Intel Xeon processors: 256, 512, 1024, or 2048-KByte, 8-way set associative, 64-byte cache line size, 128-byte sector size.
- Pentium M processor: 1 or 2-MByte, 8-way set associative, 64-byte cache line size.
- P6 family processors: 128-KByte, 256-KByte, 512-KByte, 1-MByte, or 2-MByte, 4-way set associative, 32-byte cache line size.
- Pentium processor (external optional): System specific, typically 256- or 512-KByte, 4-way set associative, 32-byte cache line size.

L3 Unified Cache
- Intel Xeon processors: 512-KByte, 1-MByte, 2-MByte, or 4-MByte, 8-way set associative, 64-byte cache line size, 128-byte sector size.

Instruction TLB (4-KByte Pages)
- Pentium 4 and Intel Xeon processors: 128 entries, 4-way set associative.
- Pentium M processor: 128 entries, 4-way set associative.
- P6 family processors: 32 entries, 4-way set associative.
- Pentium processor: 32 entries, 4-way set associative; fully set associative for Pentium processors with MMX technology.

Data TLB (4-KByte Pages)
- Pentium 4 and Intel Xeon processors: 64 entries, fully set associative; shared with large page data TLBs.
- Pentium M processor: 128 entries, 4-way set associative.
- Pentium and P6 family processors: 64 entries, 4-way set associative; fully set associative for Pentium processors with MMX technology.

Instruction TLB (Large Pages)
- Pentium 4 and Intel Xeon processors: large pages are fragmented.
- Pentium M processor: 2 entries, fully associative.
- P6 family processors: 2 entries, fully associative.
- Pentium processor: Uses same TLB as used for 4-KByte pages.

Data TLB (Large Pages)
- Pentium 4 and Intel Xeon processors: 64 entries, fully set associative; shared with small page data TLBs.
- Pentium M processor: 8 entries, fully associative.
- P6 family processors: 8 entries, 4-way set associative.
- Pentium processor: 8 entries, 4-way set associative; uses same TLB as used for 4-KByte pages in Pentium processors with MMX technology.
The IA-32 processors implement four types of caches: the trace cache, the level 1 (L1) cache,
the level 2 (L2) cache, and the level 3 (L3) cache (see Figure 10-1). The use of these caches
differs among the Pentium 4, Intel Xeon, P6 family, and Pentium processors, as follows:
• Pentium 4 and Intel Xeon processors — The trace cache caches decoded instructions
(μops) from the instruction decoder, and the L1 cache contains only data. The L2 and L3
caches are unified data and instruction caches that are located on the processor chip. (The
L3 cache is only implemented on Intel Xeon processors.)
• P6 family processors — The L1 cache is divided into two sections: one dedicated to
caching IA-32 architecture instructions (pre-decoded instructions) and one to caching data.
The L2 cache is a unified data and instruction cache that is located on the processor chip.
The P6 family processors do not implement a trace cache.
• Pentium processors — The L1 cache has the same structure as on the P6 family
processors (and a trace cache is not implemented). The L2 cache is a unified data and
instruction cache that is external to the processor chip on earlier Pentium processors and
implemented on the processor chip in later Pentium processors. For Pentium processors
where the L2 cache is external to the processor, access to the cache is through the system
bus.
The cache lines for the L1 and L2 caches in the Pentium 4 and the L1, L2, and L3 caches in the
Intel Xeon processors are 64 bytes wide. The processor always reads a cache line from system
memory beginning on a 64-byte boundary. (A 64-byte aligned cache line begins at an address
with its 6 least-significant bits clear.) A cache line can be filled from memory with an 8-transfer
burst transaction. The caches do not support partially-filled cache lines, so caching even a single
doubleword requires caching an entire line.
The L1 and L2 cache lines in the P6 family and Pentium processors are 32 bytes wide, with
cache line reads from system memory beginning on a 32-byte boundary (the 5 least-significant bits
of a memory address clear). A cache line can be filled from memory with a 4-transfer burst transaction.
Partially-filled cache lines are not supported.
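The alignment rule above (6 clear low-order bits for 64-byte lines, 5 for 32-byte lines) can be illustrated with a short sketch. The helper names line_base and lines_touched are hypothetical, and the code is only an arithmetic model of which lines an access covers; it does not describe how any processor implements its fill logic.

```python
def line_base(address: int, line_size: int) -> int:
    """Return the address of the first byte of the cache line holding address."""
    # line_size must be a power of two (64 or 32 for the processors discussed).
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    # Clearing the low-order bits yields the aligned line boundary.
    return address & ~(line_size - 1)


def lines_touched(address: int, length: int, line_size: int) -> int:
    """Count the cache lines covered by length bytes starting at address."""
    if length <= 0:
        return 0
    first = line_base(address, line_size)
    last = line_base(address + length - 1, line_size)
    return (last - first) // line_size + 1


if __name__ == "__main__":
    # A doubleword that straddles a 64-byte boundary needs two line fills.
    print(hex(line_base(0x103E, 64)), lines_touched(0x103E, 4, 64))
```

Because partially-filled lines are not supported, a doubleword that straddles a line boundary causes two full lines to be cached, which is one reason to keep frequently accessed data naturally aligned.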
Table 10-1. Characteristics of the Caches, TLBs, Store Buffer, and Write Combining Buffer in IA-32 processors (Contd.)

Cache or Buffer    Characteristics

Store Buffer
- Pentium 4 and Intel Xeon processors: 24 entries.
- Pentium M processor: 16 entries.
- P6 family processors: 12 entries.
- Pentium processor: 2 buffers, 1 entry each (Pentium processors with MMX technology have 4 buffers for 4 entries).

Write Combining (WC) Buffer
- Pentium 4 and Intel Xeon processors: 6 or 8 entries.
- Pentium M processor: 6 entries.
- P6 family processors: 4 entries.

NOTES:
† Introduced to the IA-32 architecture in the Pentium 4 and Intel Xeon processors.
The trace cache in the Pentium 4 and Intel Xeon processors is an integral part of the Intel
NetBurst microarchitecture and is available in all execution modes: protected mode, system
management mode (SMM), and real-address mode. The L1, L2, and L3 caches are also available
in all execution modes; however, use of them must be handled carefully in SMM (see Section
13.4.2, “SMRAM Caching”).
The TLBs store the most recently used page-directory and page-table entries. They speed up
memory accesses when paging is enabled by reducing the number of memory accesses that are
required to read the page tables stored in system memory. The TLBs are divided into four
groups: instruction TLBs for 4-KByte pages, data TLBs for 4-KByte pages, instruction TLBs
for large pages (2-MByte or 4-MByte pages), and data TLBs for large pages. The TLBs are
normally active only in protected mode with paging enabled. When paging is disabled or the
processor is in real-address mode, the TLBs maintain their contents until explicitly or implicitly
flushed (see Section 10.9, “Invalidating the Translation Lookaside Buffers (TLBs)”).
The store buffer is associated with the processor’s instruction execution units. It allows writes to
system memory and/or the internal caches to be saved and in some cases combined to optimize
the processor’s bus accesses. The store buffer is always enabled in all execution modes.
The processor’s caches are for the most part transparent to software. When enabled, instructions
and data flow through these caches without the need for explicit software control. However,
knowledge of the behavior of these caches may be useful in optimizing software performance.
For example, knowledge of cache dimensions and replacement algorithms gives an indication
of how large a data structure can be operated on at once without causing cache thrashing.
In multiprocessor systems, maintenance of cache consistency may, in rare circumstances,
require intervention by system software. For these rare cases, the processor provides privileged
cache control instructions for use in flushing caches and forcing memory ordering.
The Pentium III, Pentium 4, and Intel Xeon processors introduced several instructions that software
can use to improve the performance of the L1, L2, and L3 caches, including the
PREFETCHh and CLFLUSH instructions and the non-temporal move instructions (MOVNTI,
MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD). The use of these instructions is
discussed in Section 10.5.5, “Cache Management Instructions”.
10.2 CACHING TERMINOLOGY
The IA-32 architecture (beginning with the Pentium processor) uses the MESI (modified, exclusive,
shared, invalid) cache protocol to maintain consistency with internal caches and caches in
other processors (see Section 10.4, “Cache Control Protocol”).
When the processor recognizes that an operand being read from memory is cacheable, the
processor reads an entire cache line into the appropriate cache (L1, L2, L3, or all). This operation
is called a cache line fill. If the memory location containing that operand is still cached the next
time the processor attempts to access the operand, the processor can read the operand from the
cache instead of going back to memory. This operation is called a cache hit.
When the processor attempts to write an operand to a cacheable area of memory, it first checks
if a cache line for that memory location exists in the cache. If a valid cache line does exist, the
processor (depending on the write policy currently in force) can write the operand into the cache
instead of writing it out to system memory. This operation is called a write hit. If a write misses
the cache (that is, a valid cache line is not present for the area of memory being written to), the
processor performs a cache line fill (write allocation). Then it writes the operand into the cache
line and (depending on the write policy currently in force) can also write it out to memory. If the
operand is to be written out to memory, it is written first into the store buffer, and then written
from the store buffer to memory when the system bus is available. (Note that for the Pentium
processor, write misses do not result in a cache line fill; they always result in a write to memory.
For this processor, only read misses result in cache line fills.)
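The difference between a write miss that allocates a line and one that does not can be modeled in a few lines. This is a deliberately simplified sketch: the cache is represented as a set of line base addresses, write policy and the store buffer are ignored, and the function name write and its result strings are hypothetical.

```python
def write(cache: set, line: int, write_allocate: bool) -> str:
    """Classify one write to a cacheable line and update the set of valid lines.

    cache holds the base addresses of the lines that are currently valid.
    write_allocate is False to model the Pentium processor, where write
    misses do not result in a cache line fill.
    """
    if line in cache:
        return "write hit"
    if write_allocate:
        cache.add(line)  # cache line fill (write allocation)
        return "write miss, line filled"
    return "write miss, written to memory only"


if __name__ == "__main__":
    lines = set()
    print(write(lines, 0x1000, write_allocate=True))
    print(write(lines, 0x1000, write_allocate=True))
```

With write allocation, a second write to the same line becomes a write hit; without it, repeated writes keep missing until a read miss fills the line.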
When operating in an MP system, IA-32 processors (beginning with the Intel486 processor)
have the ability to snoop other processors’ accesses to system memory and to their internal
caches. They use this snooping ability to keep their internal caches consistent both with system
memory and with the caches in other processors on the bus. For example, in the Pentium and P6
family processors, if through snooping one processor detects that another processor intends to
write to a memory location that it currently has cached in shared state, the snooping processor
will invalidate its cache line, forcing it to perform a cache line fill the next time it accesses the
same memory location.
Beginning with the P6 family processors, if a processor detects (through snooping) that another
processor is trying to access a memory location that it has modified in its cache, but has not yet
written back to system memory, the snooping processor will signal the other processor (by
means of the HITM# signal) that the cache line is held in modified state and will perform an
implicit write-back of the modified data. The implicit write-back is transferred directly to the
initial requesting processor and snooped by the memory controller to assure that system memory
has been updated. Here, the processor with the valid data may pass the data to the other processors
without actually writing it to system memory; however, it is the responsibility of the
memory controller to snoop this operation and update memory.
10.3 METHODS OF CACHING AVAILABLE
The processor allows any area of system memory to be cached in the L1, L2, and L3 caches. For
individual pages or regions of system memory, it allows the type of caching (also called
memory type) to be specified (see Section 10.5). Memory types currently defined for the IA-32
architecture are as follows (see Table 10-2):
• Strong Uncacheable (UC) — System memory locations are not cached. All reads and
writes appear on the system bus and are executed in program order without reordering. No
speculative memory accesses, page-table walks, or prefetches of speculated branch targets
are made. This type of cache-control is useful for memory-mapped I/O devices. When
used with normal RAM, it greatly reduces processor performance.
NOTE
The behavior of FP and SSE/SSE2 operations on operands in UC memory is
implementation dependent. In some implementations, accesses to UC
memory may occur more than once. To ensure predictable behavior, use loads
and stores of general purpose registers to access UC memory that may have
read or write side effects.
• Uncacheable (UC-) — Has the same characteristics as the strong uncacheable (UC) memory
type, except that this memory type can be overridden by programming the MTRRs for the
WC memory type. This memory type is available in the Pentium 4, Intel Xeon, and
Pentium III processors and can only be selected through the PAT.
• Write Combining (WC) — System memory locations are not cached (as with
uncacheable memory) and coherency is not enforced by the processor’s bus coherency
protocol. Speculative reads are allowed. Writes may be delayed and combined in the write
combining buffer (WC buffer) to reduce memory accesses. If the WC buffer is partially
filled, the writes may be delayed until the next occurrence of a serializing event, such as
an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached
memory, an interrupt occurrence, or a LOCK instruction execution. This type of cache-control
is appropriate for video frame buffers, where the order of writes is unimportant as
long as the writes update memory so they can be seen on the graphics display. See Section
10.3.1, “Buffering of Write Combining Memory Locations”, for more information about
caching the WC memory type. This memory type is available in the Pentium Pro and
Pentium II processors by programming the MTRRs or in the Pentium III, Pentium 4, and
Intel Xeon processors by programming the MTRRs or by selecting it through the PAT.
• Write-through (WT) — Writes and reads to and from system memory are cached. Reads
come from cache lines on cache hits; read misses cause cache fills. Speculative reads are
allowed. All writes are written to a cache line (when possible) and through to system
memory. When writing through to memory, invalid cache lines are never filled, and valid
cache lines are either filled or invalidated. Write combining is allowed. This type of cache-control
is appropriate for frame buffers or when there are devices on the system bus that
access system memory, but do not perform snooping of memory accesses. It enforces
coherency between caches in the processors and system memory.

Table 10-2. Memory Types and Their Properties

Memory Type and          Cacheable       Writeback   Allows        Memory Ordering Model
Mnemonic                                 Cacheable   Speculative
                                                     Reads
Strong Uncacheable (UC)  No              No          No            Strong Ordering
Uncacheable (UC-)        No              No          No            Strong Ordering. Can only be
                                                                   selected through the PAT. Can be
                                                                   overridden by WC in MTRRs.
Write Combining (WC)     No              No          Yes           Weak Ordering. Available by
                                                                   programming MTRRs or by
                                                                   selecting it through the PAT.
Write Through (WT)       Yes             No          Yes           Speculative Processor Ordering.
Write Back (WB)          Yes             Yes         Yes           Speculative Processor Ordering.
Write Protected (WP)     Yes for reads;  No          Yes           Speculative Processor Ordering.
                         no for writes                             Available by programming
                                                                   MTRRs.

• Write-back (WB) — Writes and reads to and from system memory are cached. Reads
come from cache lines on cache hits; read misses cause cache fills. Speculative reads are
allowed. Write misses cause cache line fills (in the Pentium 4, Intel Xeon, and P6 family
processors), and writes are performed entirely in the cache, when possible. Write
combining is allowed. The write-back memory type reduces bus traffic by eliminating
many unnecessary writes to system memory. Writes to a cache line are not immediately
forwarded to system memory; instead, they are accumulated in the cache. The modified
cache lines are written to system memory later, when a write-back operation is performed.
Write-back operations are triggered when cache lines need to be deallocated, such as when
new cache lines are being allocated in a cache that is already full. They also are triggered
by the mechanisms used to maintain cache consistency. This type of cache-control
provides the best performance, but it requires that all devices that access system memory
on the system bus be able to snoop memory accesses to ensure system memory and cache
coherency.
• Write protected (WP) — Reads come from cache lines when possible, and read misses
cause cache fills. Writes are propagated to the system bus and cause corresponding cache
lines on all processors on the bus to be invalidated. Speculative reads are allowed. This
memory type is available in the Pentium 4, Intel Xeon, and P6 family processors by
programming the MTRRs (see Table 10-6).
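The properties listed in Table 10-2 lend themselves to a simple lookup structure. The following sketch encodes the table as data so that software tools (for example, a configuration checker) can query it. The names MemoryType, MEMORY_TYPES, and no_caching_or_speculation are hypothetical, and the filter function is only an illustration of querying the table, not a rule defined by the architecture.

```python
from typing import NamedTuple


class MemoryType(NamedTuple):
    cacheable: str           # "yes", "no", or "reads only"
    writeback: bool          # write-back cacheable
    speculative_reads: bool  # speculative reads allowed
    ordering: str            # memory ordering model


# Values transcribed from Table 10-2.
MEMORY_TYPES = {
    "UC":  MemoryType("no", False, False, "strong"),
    "UC-": MemoryType("no", False, False, "strong"),
    "WC":  MemoryType("no", False, True, "weak"),
    "WT":  MemoryType("yes", False, True, "speculative processor"),
    "WB":  MemoryType("yes", True, True, "speculative processor"),
    "WP":  MemoryType("reads only", False, True, "speculative processor"),
}


def no_caching_or_speculation(mnemonic: str) -> bool:
    """True if the type neither caches nor speculatively reads a location."""
    t = MEMORY_TYPES[mnemonic]
    return t.cacheable == "no" and not t.speculative_reads


if __name__ == "__main__":
    print([m for m in MEMORY_TYPES if no_caching_or_speculation(m)])
```

Only the two uncacheable types pass this filter. Note that UC- can still be overridden to WC through the MTRRs, so the filter alone does not establish that a type is suitable for a given device.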
Table 10-3 shows which of these caching methods are available in the Pentium, P6 Family,
Pentium 4, and Intel Xeon processors.
Table 10-3. Methods of Caching Available in Pentium 4, Intel Xeon, P6 Family, and Pentium Processors

Memory Type               Pentium 4 and Intel   P6 Family    Pentium
                          Xeon Processors       Processors   Processor
Strong Uncacheable (UC)   Yes                   Yes          Yes
Uncacheable (UC-)         Yes                   Yes*         No
Write Combining (WC)      Yes                   Yes          No
Write Through (WT)        Yes                   Yes          Yes
Write Back (WB)           Yes                   Yes          Yes
Write Protected (WP)      Yes                   Yes          No

NOTE:
* Introduced in the Pentium III processor; not available in the Pentium Pro or Pentium II processors.
10.3.1 Buffering of Write Combining Memory Locations
Writes to the WC memory type are not cached in the typical sense of the word cached. They are
retained in an internal write combining buffer (WC buffer) that is separate from the internal L1,
L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide
data coherency. Buffering of writes to WC memory is done to allow software a small window
of time to supply more modified data to the WC buffer while remaining as non-intrusive to software
as possible. The buffering of writes to WC memory also causes data to be collapsed; that
is, multiple writes to the same memory location will leave the last data written in the location
and the other writes will be lost.
The size and structure of the WC buffer are not architecturally defined. For the Pentium 4 and
Intel Xeon processors, the WC buffer is made up of several 64-byte WC buffers. For the P6
family processors, the WC buffer is made up of several 32-byte WC buffers.
When software begins writing to WC memory, the processor begins filling the WC buffers one
at a time. When one or more WC buffers have been filled, the processor has the option of evicting
the buffers to system memory. The protocol for evicting the WC buffers is implementation
dependent and should not be relied on by software for system memory coherency. When using
the WC memory type, software must be sensitive to the fact that the writing of data to system
memory is being delayed and must deliberately empty the WC buffers when system memory
coherency is required.
Once the processor has started to evict data from the WC buffer into system memory, it will
make a bus-transaction style decision based on how much of the buffer contains valid data. If
the buffer is full (that is, all bytes are valid) the processor will execute a burst-write transaction
on the bus that will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4
and Intel Xeon processors) being transmitted on the data bus in a single burst transaction. If one
or more of the WC buffer’s bytes are invalid (that is, have not been written by software)
then the processor will transmit the data to memory using “partial write” transactions (one chunk
at a time, where a “chunk” is 8 bytes).
This will result in a maximum of 4 partial write transactions (for P6 family processors) or 8
partial write transactions (for the Pentium 4 and Intel Xeon processors) for one WC buffer of
data sent to memory.
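The eviction decision described above can be sketched as a small function that takes a per-byte valid map for one WC buffer. This is an illustrative model only: the function name eviction_transactions is hypothetical, and the sketch assumes that a chunk containing no valid bytes generates no partial write, which the text does not state explicitly.

```python
from typing import Sequence, Tuple

CHUNK_BYTES = 8  # size of one "partial write" chunk


def eviction_transactions(valid: Sequence[bool]) -> Tuple[str, int]:
    """Return the transaction kind and count for evicting one WC buffer.

    valid has one entry per byte of the buffer: 32 entries for the P6 family,
    64 entries for the Pentium 4 and Intel Xeon processors.
    """
    if len(valid) not in (32, 64):
        raise ValueError("WC buffer must be 32 or 64 bytes")
    if all(valid):
        return ("burst", 1)  # a full buffer goes out as a single burst write
    # Assumption: one partial write per 8-byte chunk holding any valid byte.
    chunks = sum(
        any(valid[i:i + CHUNK_BYTES]) for i in range(0, len(valid), CHUNK_BYTES)
    )
    return ("partial", chunks)


if __name__ == "__main__":
    almost_full = [True] * 63 + [False]
    print(eviction_transactions(almost_full))
```

A single unwritten byte turns one burst into as many as 8 partial writes on a 64-byte buffer, so software writing to WC memory generally benefits from filling whole buffers.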
The WC memory type is weakly ordered by definition. Once the eviction of a WC buffer has
started, the data is subject to the weak ordering semantics of its definition. Ordering is not maintained
between the successive allocation/deallocation of WC buffers (for example, writes to WC
buffer 1 followed by writes to WC buffer 2 may appear as buffer 2 followed by buffer 1 on the
system bus). When a WC buffer is evicted to memory as partial writes there is no guaranteed
ordering between successive partial writes (for example, a partial write for chunk 2 may appear
on the bus before the partial write for chunk 1 or vice versa).
The only elements of WC propagation to the system bus that are guaranteed are those provided
by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer
will always be propagated as a single 32-byte burst transaction using any chunk order. In a WC
buffer eviction where the data will be evicted as partials, all data contained in the same chunk
(0 mod 8 aligned) will be propagated simultaneously. Likewise, with a Pentium 4 or Intel Xeon
processor, a full WC buffer will always be propagated as a single burst transaction, using any
chunk order within a transaction. For partial buffer propagations, all data contained in the same
chunk will be propagated simultaneously.
10.3.2 Choosing a Memory Type
The simplest system memory model does not use memory-mapped I/O with read or write side
effects, does not include a frame buffer, and uses the write-back memory type for all memory.
An I/O agent can perform direct memory access (DMA) to write-back memory and the cache
protocol maintains cache coherency.
A system can use strong uncacheable memory for other memory-mapped I/O, and should
always use strong uncacheable memory for memory-mapped I/O with read side effects.
Dual-ported memory can be considered a write side effect, making relatively prompt writes
desirable, because those writes cannot be observed at the other port until they reach the memory
agent. A system can use strong uncacheable, uncacheable, write-through, or write-combining
memory for frame buffers or dual-ported memory that contains pixel values displayed on a
screen. Frame buffer memory is typically large (a few megabytes) and is usually written more
than it is read by the processor. Using strong uncacheable memory for a frame buffer generates
very large amounts of bus traffic, because operations on the entire buffer are implemented using
partial writes rather than line writes. Using write-through memory for a frame buffer can
displace almost all other useful cached lines in the processor's L2 and L3 caches and L1 data
cache. Therefore, systems should use write-combining memory for frame buffers whenever
possible.
Software can use page-level cache control to assign appropriate effective memory types when
software will not access data structures in ways that benefit from write-back caching. For
example, software may read a large data structure once and not access the structure again until
the structure is rewritten by another agent. Such a large data structure should be marked as
uncacheable, or reading it will evict cached lines that the processor will be referencing again.
A similar example would be a write-only data structure that is written to (to export the data to
another agent), but never read by software. Such a structure can be marked as uncacheable,
because software never reads the values that it writes (though as uncacheable memory, it will be
written using partial writes, while as write-back memory, it will be written using line writes,
which may not occur until the other agent reads the structure and triggers implicit write-backs).
On the Pentium III, Pentium 4, and Intel Xeon processors, new instructions are provided that
give software greater control over the caching, prefetching, and the write-back characteristics of
data. These instructions allow software to use weakly ordered or processor ordered memory
types to improve processor performance and, when necessary, to force strong ordering on
memory reads and/or writes. They also allow software greater control over the caching of data.
For a description of these instructions and their intended use, see Section 10.5.5, “Cache
Management Instructions”.
10.4 CACHE CONTROL PROTOCOL
The following section describes the cache control protocol currently defined for the IA-32 architecture.
This protocol is used by the Pentium 4, Intel Xeon, P6 family, and Pentium processors.
In the L1 data cache and in the L2 and L3 unified caches, the MESI (modified, exclusive, shared,
invalid) cache protocol maintains consistency with caches of other processors. The L1 data
cache and the L2 and L3 unified caches have two MESI status flags per cache line. Each line
can thus be marked as being in one of the states defined in Table 10-4. In general, the operation
of the MESI protocol is transparent to programs.
The L1 instruction cache in P6 family processors implements only the “SI” part of the MESI
protocol, because the instruction cache is not writable. The instruction cache monitors changes
in the data cache to maintain consistency between the caches when instructions are modified.
See Section 10.6, “Self-Modifying Code”, for more information on the implications of caching
instructions.
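The four line states defined in Table 10-4 can be captured in a small model that answers the same questions the table does. This is a sketch for illustration: the names Mesi, line_is_valid, memory_is_up_to_date, may_be_in_other_caches, and write_stays_off_bus are hypothetical, and the code models only the table, not the bus transactions or state transitions of any processor.

```python
from enum import Enum
from typing import Optional


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def line_is_valid(state: Mesi) -> bool:
    """A line is valid in every state except invalid."""
    return state is not Mesi.INVALID


def memory_is_up_to_date(state: Mesi) -> Optional[bool]:
    """False for a modified line; None for an invalid line (no table entry)."""
    if state is Mesi.INVALID:
        return None
    return state is not Mesi.MODIFIED


def may_be_in_other_caches(state: Mesi) -> bool:
    """Copies may exist in other processors' caches only for S and I."""
    return state in (Mesi.SHARED, Mesi.INVALID)


def write_stays_off_bus(state: Mesi) -> bool:
    """True when a write to the line does not go to the system bus (M or E)."""
    return state in (Mesi.MODIFIED, Mesi.EXCLUSIVE)


if __name__ == "__main__":
    for s in Mesi:
        print(s.value, line_is_valid(s), write_stays_off_bus(s))
```

Four states fit in the two MESI status flags kept per cache line; the actual bit encoding is not specified here and is not modeled.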
10.5 CACHE CONTROL
The IA-32 architecture provides a variety of mechanisms for controlling the caching of data and
instructions and for controlling the ordering of reads and writes between the processor, the
caches, and memory. These mechanisms can be divided into two groups:
• Cache control registers and bits — The IA-32 architecture defines several dedicated
registers and various bits within control registers and page-directory and page-table entries that
control the caching of system memory locations in the L1, L2, and L3 caches. These
mechanisms control the caching of virtual memory pages and of regions of physical
memory.
Table 10-4. MESI Cache Line States

Cache Line State            M (Modified)     E (Exclusive)    S (Shared)         I (Invalid)
This cache line is valid?   Yes              Yes              Yes                No
The memory copy is…         Out of date      Valid            Valid              —
Copies exist in caches of   No               No               Maybe              Maybe
other processors?
A write to this line …      Does not go to   Does not go to   Causes the         Goes directly to
                            the system bus.  the system bus.  processor to       the system bus.
                                                              gain exclusive
                                                              ownership of the
                                                              line.
• Cache control and memory ordering instructions — The IA-32 architecture provides
several instructions that control the caching of data, the ordering of memory reads and
writes, and the prefetching of data. These instructions allow software to control the
caching of specific data structures, to control memory coherency for specific locations in
memory, and to force strong memory ordering at specific locations in a program.
The following sections describe these two groups of cache control mechanisms.
10.5.1 Cache Control Registers and Bits
The current IA-32 architecture provides the following cache-control registers and bits for use in
enabling and/or restricting caching to various pages or regions in memory (see Figure 10-2):
• CD flag, bit 30 of control register CR0 — Controls caching of system memory locations
(see Section 2.5, “Control Registers”). If the CD flag is clear, caching is enabled for the
whole of system memory, but may be restricted for individual pages or regions of memory
by other cache-control mechanisms. When the CD flag is set, caching is restricted in the
processor’s caches (cache hierarchy) for the Pentium 4, Intel Xeon, and P6 family
processors and prevented for the Pentium processor (see note below). With the CD flag set,
however, the caches will still respond to snoop traffic. Caches should be explicitly flushed
to ensure memory coherency. For highest processor performance, both the CD and the NW
flags in control register CR0 should be cleared. Table 10-5 shows the interaction of the CD
and NW flags.
The effect of setting the CD flag is somewhat different for the Pentium 4, Intel Xeon,
and P6 family processors than for the Pentium processor (see Table 10-5). To ensure
memory coherency after the CD flag is set, the caches should be explicitly flushed (see
Section 10.5.3, “Preventing Caching”). Setting the CD flag for the Pentium 4, Intel
Xeon, and P6 family processors modifies cache line fill and update behavior. Also for
the Pentium 4, Intel Xeon, and P6 family processors, setting the CD flag does not force
strict ordering of memory accesses unless the MTRRs are disabled and/or all memory is
referenced as uncached (see Section 7.2.4, “Strengthening or Weakening the Memory
Ordering Model”).
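The interaction of the CD and NW flags summarized in Table 10-5 can be sketched as a decoder for a raw CR0 value. This is an illustration under stated assumptions: reading CR0 requires privileged native code, so the value is assumed to be supplied by the caller; the CD flag position (bit 30) is given above, while the NW flag position (bit 29) is taken from the CR0 register layout rather than from this section. The function name cache_mode and its result strings are hypothetical.

```python
CR0_NW = 1 << 29  # NW flag; bit position assumed from the CR0 layout
CR0_CD = 1 << 30  # CD flag, bit 30 of control register CR0


def cache_mode(cr0: int) -> str:
    """Map the CD and NW flags of a raw CR0 value to a Table 10-5 mode."""
    cd = bool(cr0 & CR0_CD)
    nw = bool(cr0 & CR0_NW)
    if not cd and not nw:
        return "normal cache mode"
    if not cd and nw:
        return "invalid setting (#GP)"
    if cd and not nw:
        return "no-fill cache mode, memory coherency maintained"
    return "memory coherency not maintained"


if __name__ == "__main__":
    print(cache_mode(CR0_CD))
```

For highest performance both flags are clear, which corresponds to the first branch.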
Figure 10-2. Cache-Control Registers and Bits Available in IA-32 Processors
[Figure: diagram relating the cache-control registers and bits to the TLBs, the store buffer, and physical memory (0 through FFFFFFFFH, see note 2). Labels in the figure:]
- CR0: CD and NW flags control overall caching of system memory.
- CR3: PCD and PWT flags control caching of page directory.
- Page-directory or page-table entry: PCD and PWT flags control page-level caching; G flag (see note 1) controls page-level flushing of TLBs; PAT (see note 4) controls caching of virtual memory pages.
- CR4: PGE enables global pages designated with G flag.
- MTRRs (see note 3) control caching of selected regions of physical memory.
- IA32_MISC_ENABLE MSR: 3rd Level Cache Disable.
1. G flag only available in Pentium 4, Intel Xeon, and P6 family processors.
2. The maximum physical address size is reported by CPUID leaf function 80000008H. The maximum physical address size of FFFFFFFFFH applies only if 36-bit physical addressing is used.
3. MTRRs available only in Pentium 4 and P6 family processors; similar control available in Pentium processor with the KEN# and WB/WT# pins.
4. PAT available only in Pentium III and Pentium 4 processors.
Table 10-5. Cache Operating Modes
CD NW  Caching and Read/Write Policy                                           L1    L2/L3 (1)
0  0   Normal Cache Mode. Highest performance cache operation.
       - Read hits access the cache; read misses may cause replacement.        Yes   Yes
       - Write hits update the cache.                                          Yes   Yes
       - Only writes to shared lines and write misses update system memory.    Yes   Yes
       - Write misses cause cache line fills.                                  Yes   Yes
       - Write hits can change shared lines to modified under control of       Yes   Yes
         the MTRRs and with associated read invalidation cycle.
       - (Pentium processor only.) Write misses do not cause cache line        Yes
         fills.
       - (Pentium processor only.) Write hits can change shared lines to       Yes
         exclusive under control of WB/WT#.
       - Invalidation is allowed.                                              Yes   Yes
       - External snoop traffic is supported.                                  Yes   Yes
0  1   Invalid setting.
       Generates a general-protection exception (#GP) with an error code       NA    NA
       of 0.
1  0   No-fill Cache Mode. Memory coherency is maintained.
       - (Pentium 4 and Intel Xeon processors.) State of processor after a     Yes   Yes
         power up or reset.
       - Read hits access the cache; read misses do not cause replacement      Yes   Yes
         (see Pentium 4 and Intel Xeon processors reference below).
       - Write hits update the cache.                                          Yes   Yes
       - Only writes to shared lines and write misses update system memory.    Yes   Yes
       - Write misses access memory.                                           Yes   Yes
       - Write hits can change shared lines to exclusive under control of      Yes   Yes
         the MTRRs and with associated read invalidation cycle.
       - (Pentium processor only.) Write hits can change shared lines to       Yes
         exclusive under control of the WB/WT#.
       - (Pentium 4, Intel Xeon, and P6 family processors only.) Strict        Yes   Yes
         memory ordering is not enforced unless the MTRRs are disabled
         and/or all memory is referenced as uncached (see Section 7.2.4,
         “Strengthening or Weakening the Memory Ordering Model”).
       - Invalidation is allowed.                                              Yes   Yes
       - External snoop traffic is supported.                                  Yes   Yes
1  1   Memory coherency is not maintained. (2)
       - (P6 family and Pentium processors.) State of the processor after a    Yes   Yes
         power up or reset.
       - Read hits access the cache; read misses do not cause replacement.     Yes   Yes
       - Write hits update the cache and change exclusive lines to modified.   Yes   Yes
       - Shared lines remain shared after write hit.                           Yes   Yes
       - Write misses access memory.                                           Yes   Yes
       - Invalidation is inhibited when snooping; but is allowed with INVD     Yes   Yes
         and WBINVD instructions.
       - External snoop traffic is supported.                                  No    Yes
NOTES:
1. The L2/L3 column in this table is definitive for the Pentium 4, Intel Xeon, and P6 family processors. It is
intended to represent what could be implemented in a system based on a Pentium processor with an
external, platform specific, write-back L2 cache.
2. The Pentium 4 and Intel Xeon processors do not support this mode; setting the CD and NW bits to 1
selects the no-fill cache mode.
• NW flag, bit 29 of control register CR0 — Controls the write policy for system memory
locations (see Section 2.5, “Control Registers”). If the NW and CD flags are clear, write-back
is enabled for the whole of system memory, but may be restricted for individual pages
or regions of memory by other cache-control mechanisms. Table 10-5 shows how the other
combinations of CD and NW flags affect caching.
NOTES
For the Pentium 4 and Intel Xeon processors, the NW flag is a don’t care flag;
that is, when the CD flag is set, the processor uses the no-fill cache mode,
regardless of the setting of the NW flag.
For the Pentium processor, when the L1 cache is disabled (the CD and NW
flags in control register CR0 are set), external snoops are accepted in DP
(dual-processor) systems and inhibited in uniprocessor systems.
When snoops are inhibited, address parity is not checked and APCHK# is not
asserted for a corrupt address; however, when snoops are accepted, address
parity is checked and APCHK# is asserted for corrupt addresses.
• PCD flag in the page-directory and page-table entries — Controls caching for
individual page tables and pages, respectively (see Section 3.7.6, “Page-Directory and
Page-Table Entries”). This flag only has effect when paging is enabled and the CD flag in
control register CR0 is clear. The PCD flag enables caching of the page table or page when
clear and prevents caching when set.
• PWT flag in the page-directory and page-table entries — Controls the write policy for
individual page tables and pages, respectively (see Section 3.7.6, “Page-Directory and
Page-Table Entries”). This flag only has effect when paging is enabled and the NW flag in
control register CR0 is clear. The PWT flag enables write-back caching of the page table or
page when clear and write-through caching when set.
• PCD and PWT flags in control register CR3 — Control the global caching and write
policy for the page directory (see Section 2.5, “Control Registers”). The PCD flag enables
caching of the page directory when clear and prevents caching when set. The PWT flag
enables write-back caching of the page directory when clear and write-through caching
when set. These flags do not affect the caching and write policy for individual page tables.
These flags only have effect when paging is enabled and the CD flag in control register
CR0 is clear.
• G (global) flag in the page-directory and page-table entries (introduced to the IA-32
architecture in the P6 family processors) — Controls the flushing of TLB entries for
individual pages. See Section 3.12, “Translation Lookaside Buffers (TLBs)”, for more
information about this flag.
• PGE (page global enable) flag in control register CR4 — Enables the establishment of
global pages with the G flag. See Section 3.12, “Translation Lookaside Buffers (TLBs)”,
for more information about this flag.
• Memory type range registers (MTRRs) (introduced in P6 family processors) —
Control the type of caching used in specific regions of physical memory. Any of the
caching types described in Section 10.3, “Methods of Caching Available”, can be selected.
See Section 10.11, “Memory Type Range Registers (MTRRs)”, for a detailed description
of the MTRRs.
• Page Attribute Table (PAT) MSR (introduced in the Pentium III processor) — Extends
the memory typing capabilities of the processor to permit memory types to be assigned on
a page-by-page basis (see Section 10.12, “Page Attribute Table (PAT)”).
• Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (introduced
in the Intel Xeon processors) — Allows the L3 cache to be disabled and enabled,
independently of the L1 and L2 caches.
• KEN# and WB/WT# pins (Pentium processor) — Allow external hardware to control
the caching method used for specific areas of memory. They perform similar (but not
identical) functions to the MTRRs in the P6 family processors.
• PCD and PWT pins (Pentium processor) — These pins (which are associated with the
PCD and PWT flags in control register CR3 and in the page-directory and page-table
entries) permit caching in an external L2 cache to be controlled on a page-by-page basis,
consistent with the control exercised on the L1 cache of these processors. The Pentium 4,
Intel Xeon, and P6 family processors do not provide these pins because the L2 cache is
internal to the chip package.
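The CD and NW combinations listed in Table 10-5 can be sketched as a small decoding function. This is only an illustrative model: the names cache_mode, cache_mode_from_cr0, and the nw_ignored parameter are hypothetical, and the code takes a CR0 value as an argument rather than reading the register (which requires privilege level 0). The CD flag is assumed to be bit 30 of CR0, as described in Section 2.5.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the cache-control flags in CR0. */
#define CR0_NW (UINT32_C(1) << 29) /* not write-through */
#define CR0_CD (UINT32_C(1) << 30) /* cache disable */

typedef enum {
    CACHE_MODE_NORMAL,       /* CD=0, NW=0 */
    CACHE_MODE_INVALID,      /* CD=0, NW=1: attempting this setting raises #GP(0) */
    CACHE_MODE_NO_FILL,      /* CD=1, NW=0 */
    CACHE_MODE_NOT_COHERENT  /* CD=1, NW=1 on P6 family and Pentium processors */
} cache_mode;

/* nw_ignored models the Pentium 4 and Intel Xeon behavior, where NW is a
 * don't-care flag once CD is set (note 2 of Table 10-5). */
cache_mode cache_mode_from_cr0(uint32_t cr0, bool nw_ignored)
{
    bool cd = (cr0 & CR0_CD) != 0;
    bool nw = (cr0 & CR0_NW) != 0;

    if (!cd)
        return nw ? CACHE_MODE_INVALID : CACHE_MODE_NORMAL;
    if (nw && !nw_ignored)
        return CACHE_MODE_NOT_COHERENT;
    return CACHE_MODE_NO_FILL;
}
```

For example, cache_mode_from_cr0(CR0_CD | CR0_NW, true) yields CACHE_MODE_NO_FILL, which mirrors the Pentium 4 and Intel Xeon behavior described in the notes above.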
10.5.2 Precedence of Cache Controls
The cache control flags and MTRRs operate hierarchically for restricting caching. That is, if the
CD flag is set, caching is prevented globally (see Table 10-5). If the CD flag is clear, the page-level
cache control flags and/or the MTRRs can be used to restrict caching. If there is an overlap
of page-level and MTRR caching controls, the mechanism that prevents caching has precedence.
For example, if an MTRR makes a region of system memory uncacheable, a page-level
caching control cannot be used to enable caching for a page in that region. The converse is also
true; that is, if a page-level caching control designates a page as uncacheable, an MTRR cannot
be used to make the page cacheable.
In cases where there is an overlap in the assignment of the write-back and write-through caching
policies to a page and a region of memory, the write-through policy takes precedence. The write-combining
policy (which can only be assigned through an MTRR or the PAT) takes precedence
over either write-through or write-back.
The selection of memory types at the page level varies depending on whether PAT is being used
to select memory types for pages, as described in the following sections.
The third-level cache disable flag (bit 6 of the IA32_MISC_ENABLE MSR) takes precedence over
the CD flag, MTRRs, and PAT for the L3 cache. That is, when the third-level cache disable flag
is set (cache disabled), the other cache controls have no effect on the L3 cache; when the flag is
clear (enabled), the cache controls have the same effect on the L3 cache as they have on the L1
and L2 caches.
10.5.2.1 Selecting Memory Types for Pentium Pro and Pentium II Processors
The Pentium Pro and Pentium II processors do not support the PAT. Here, the effective memory
type for a page is selected with the MTRRs and the PCD and PWT bits in the page-table or
page-directory entry for the page. Table 10-6 describes the mapping of MTRR memory types and
page-level caching attributes to effective memory types, when normal caching is in effect (the
CD and NW flags in control register CR0 are clear). Combinations that appear in gray are
implementation-defined for the Pentium Pro and Pentium II processors. System designers are
encouraged to avoid these implementation-defined combinations.
When normal caching is in effect, the effective memory type shown in Table 10-6 is determined
using the following rules:
1. If the PCD and PWT attributes for the page are both 0, then the effective memory type is
identical to the MTRR-defined memory type.
2. If the PCD flag is set, then the effective memory type is UC.
3. If the PCD flag is clear and the PWT flag is set, the effective memory type is WT for the
WB memory type and the MTRR-defined memory type for all other memory types.
4. Setting the PCD and PWT flags to opposite values is considered model-specific for the WP
and WC memory types and architecturally-defined for the WB, WT, and UC memory
types.
Table 10-6. Effective Page-Level Memory Type for Pentium Pro and Pentium II Processors
MTRR Memory Type (1)   PCD Value   PWT Value   Effective Memory Type
UC                     X           X           UC
WC                     0           0           WC
                       0           1           WC
                       1           0           WC
                       1           1           UC
WT                     0           X           WT
                       1           X           UC
WP                     0           0           WP
                       0           1           WP
                       1           0           WC
                       1           1           UC
WB                     0           0           WB
                       0           1           WT
                       1           X           UC
NOTE:
1. These effective memory types also apply to the Pentium 4, Intel Xeon, and Pentium III processors
when the PAT bit is not used (set to 0) in page-table and page-directory entries.
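The selection described by Table 10-6 can be sketched as a function of the MTRR memory type and the PCD and PWT flags. This is a hedged illustration, not a definitive implementation: the names mem_type, effective_type, and effective_type_no_pat are hypothetical. Where the table and rule 2 differ (WC and WP with PCD set and PWT clear, for which the table lists WC), the sketch follows the table and reports the combination as model-specific, per rule 4.

```c
#include <stdbool.h>

typedef enum { MT_UC, MT_WC, MT_WT, MT_WP, MT_WB } mem_type;

typedef struct {
    mem_type type;        /* effective memory type as listed in Table 10-6 */
    bool model_specific;  /* true for combinations rule 4 calls model-specific */
} effective_type;

effective_type effective_type_no_pat(mem_type mtrr, bool pcd, bool pwt)
{
    effective_type r = { mtrr, false };

    /* Rule 4: opposite PCD/PWT values are model-specific for WC and WP. */
    if ((mtrr == MT_WC || mtrr == MT_WP) && pcd != pwt)
        r.model_specific = true;

    if (mtrr == MT_UC)
        return r;                 /* UC row: UC for any PCD/PWT */
    if (pcd && pwt) {
        r.type = MT_UC;           /* PCD=1, PWT=1 gives UC in every row */
        return r;
    }
    if (pcd) {
        /* PCD=1, PWT=0: the table lists WC for WC and WP, UC for WT and WB. */
        r.type = (mtrr == MT_WC || mtrr == MT_WP) ? MT_WC : MT_UC;
        return r;
    }
    if (pwt && mtrr == MT_WB)
        r.type = MT_WT;           /* Rule 3: WB becomes WT when PWT is set */
    return r;                     /* Rule 1, and rule 3 for the other types */
}
```

A caller that wants to avoid the implementation-defined combinations can simply reject any result whose model_specific member is true.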
10.5.2.2 Selecting Memory Types for Pentium 4, Intel Xeon,
and Pentium III Processors
The Pentium 4, Intel Xeon, and Pentium III processors use the PAT to select effective page-level
memory types. Here, a memory type for a page is selected by the MTRRs and the value in a PAT
entry that is selected with the PAT, PCD and PWT bits in a page-table or page-directory entry
(see Section 10.12.3, “Selecting a Memory Type from the PAT”). Table 10-7 describes the
mapping of MTRR memory types and PAT entry types to effective memory types, when normal
caching is in effect (the CD and NW flags in control register CR0 are clear). The combinations
shown in gray are implementation-defined for the Pentium 4, Intel Xeon, and Pentium III processors.
System designers are encouraged to avoid the implementation-defined combinations.
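The mapping given in Table 10-7 lends itself to a two-dimensional lookup. The following sketch transcribes the table into an array indexed by MTRR memory type and PAT entry value; the type and function names are hypothetical, and the sketch does not mark which combinations are implementation-defined.

```c
#include <assert.h>

typedef enum { MTRR_UC, MTRR_WC, MTRR_WT, MTRR_WB, MTRR_WP, MTRR_TYPE_COUNT } mtrr_type;
typedef enum { PAT_UC, PAT_UC_MINUS, PAT_WC, PAT_WT, PAT_WB, PAT_WP, PAT_TYPE_COUNT } pat_type;
typedef enum { EFF_UC, EFF_WC, EFF_WT, EFF_WB, EFF_WP } eff_type;

/* Rows follow the MTRR order of Table 10-7; columns are the PAT entry
 * values in the order UC, UC-, WC, WT, WB, WP. */
static const eff_type table_10_7[MTRR_TYPE_COUNT][PAT_TYPE_COUNT] = {
    /* MTRR UC */ { EFF_UC, EFF_UC, EFF_WC, EFF_UC, EFF_UC, EFF_UC },
    /* MTRR WC */ { EFF_UC, EFF_WC, EFF_WC, EFF_UC, EFF_WC, EFF_UC },
    /* MTRR WT */ { EFF_UC, EFF_UC, EFF_WC, EFF_WT, EFF_WT, EFF_WP },
    /* MTRR WB */ { EFF_UC, EFF_UC, EFF_WC, EFF_WT, EFF_WB, EFF_WP },
    /* MTRR WP */ { EFF_UC, EFF_WC, EFF_WC, EFF_WT, EFF_WP, EFF_WP },
};

/* Returns the effective page-level memory type when normal caching is in
 * effect (CD and NW clear in CR0). */
eff_type effective_type_with_pat(mtrr_type mtrr, pat_type pat)
{
    assert((unsigned)mtrr < (unsigned)MTRR_TYPE_COUNT);
    assert((unsigned)pat < (unsigned)PAT_TYPE_COUNT);
    return table_10_7[mtrr][pat];
}
```

For instance, effective_type_with_pat(MTRR_WB, PAT_WC) returns EFF_WC, which reflects the precedence of the write-combining policy described in Section 10.5.2.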
Table 10-7. Effective Page-Level Memory Types for Pentium III, Pentium 4, and Intel Xeon Processors
MTRR Memory Type   PAT Entry Value   Effective Memory Type
UC                 UC                UC (1)
                   UC-               UC (1)
                   WC                WC
                   WT                UC (1)
                   WB                UC (1)
                   WP                UC (1)
WC                 UC                UC (2)
                   UC-               WC
                   WC                WC
                   WT                UC (2, 3)
                   WB                WC
                   WP                UC (2, 3)
WT                 UC                UC (2)
                   UC-               UC (2)
                   WC                WC
                   WT                WT
                   WB                WT
                   WP                WP (3)
WB                 UC                UC (2)
                   UC-               UC (2)
                   WC                WC
                   WT                WT
                   WB                WB
                   WP                WP
WP                 UC                UC (2)
                   UC-               WC (3)
                   WC                WC
                   WT                WT (3)
                   WB                WP
                   WP                WP
NOTES:
1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches
since the data could never have been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check
their caches because the data may be cached due to page aliasing, which is not recommended.
3. These combinations were specified as “undefined” in previous editions of the IA-32 Intel Architecture
Software Developer’s Manual. However, all processors that support both the PAT and the MTRRs determine
the effective page-level memory types for these combinations as given.
10.5.2.3 Writing Values Across Pages with Different Memory Types
If two adjoining pages in memory have different memory types, and a word or longer operand
is written to a memory location that crosses the page boundary between those two pages, the
operand might be written to memory twice. This action does not present a problem for writes to
actual memory; however, if a device is mapped to the memory space assigned to the pages, the
device might malfunction.
10.5.3 Preventing Caching
To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills,
perform the following steps:
1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag
to 0.)
2. Flush all caches using the WBINVD instruction.
3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the
uncached memory type (see the discussion of the TYPE field and the E
flag in Section 10.11.2.1, “IA32_MTRR_DEF_TYPE MSR”).
The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency.
If the caches are not flushed, cache hits on reads will still occur and data will be read from valid
cache lines.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching
behavior as indicated in Table 10-5, but it does not force the effective
memory type for all physical memory to be UC nor does it force strict
memory ordering. To force the UC memory type and strict memory ordering
on all of physical memory, either the MTRRs must all be programmed for the
UC memory type or they must be disabled.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps
given above has been executed, the cache lines containing the code that
executes between the end of the WBINVD instruction and the point at which
the MTRRs are actually disabled may be retained in the cache hierarchy. To
remove this code from the cache completely, a second WBINVD instruction must be executed
after the MTRRs have been disabled.
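The register values written in steps 1 and 3 above can be sketched as pure bit manipulations. This is a minimal illustration under stated assumptions: the function names are hypothetical, the CD flag is taken to be bit 30 of CR0, and the IA32_MTRR_DEF_TYPE layout (default type in bits 0 through 7, E flag in bit 11) is assumed from Section 10.11.2.1. The privileged operations themselves (MOV to CR0, WBINVD, and WRMSR) are not shown because they can be executed only at privilege level 0.

```c
#include <stdint.h>

#define CR0_NW (UINT32_C(1) << 29)
#define CR0_CD (UINT32_C(1) << 30)

/* Assumed IA32_MTRR_DEF_TYPE layout: default memory type in bits 0-7,
 * E (MTRR enable) flag in bit 11. */
#define MTRR_DEF_TYPE_MASK UINT64_C(0xFF)
#define MTRR_DEF_E         (UINT64_C(1) << 11)

/* Step 1: CR0 value that selects the no-fill cache mode (CD=1, NW=0).
 * All other CR0 bits are preserved. */
uint32_t cr0_no_fill(uint32_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

/* Step 3: IA32_MTRR_DEF_TYPE value with the MTRRs disabled (E=0) and the
 * default memory type set to UC (encoding 00H). Other bits are preserved. */
uint64_t mtrr_def_type_all_uc(uint64_t def_type)
{
    return def_type & ~(MTRR_DEF_E | MTRR_DEF_TYPE_MASK);
}
```

System software would write the first value to CR0, execute WBINVD (step 2), write the second value to the MSR, and, on the Pentium 4 and Intel Xeon processors, execute WBINVD once more as described in the note above.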
10.5.4 Disabling and Enabling the L3 Cache
The third-level cache disable flag (bit 6 of the IA32_MISC_ENABLE MSR) allows the L3 cache
to be disabled and enabled, independently of the L1 and L2 caches. Prior to using this control to
disable or enable the L3 cache, software should disable and flush all the processor caches, as
described earlier in Section 10.5.3, “Preventing Caching”, to prevent loss of information
stored in the L3 cache. After the L3 cache has been disabled or enabled, caching for the whole
processor can be restored.
10.5.5 Cache Management Instructions
The IA-32 architecture provides several instructions for managing the L1, L2, and L3 caches. The
INVD and WBINVD instructions are system instructions that operate on the L1, L2,
and L3 caches as a whole. The PREFETCHh and CLFLUSH instructions and the non-temporal
move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD), which
were introduced in SSE/SSE2 extensions, offer more granular control over caching.
The INVD and WBINVD instructions are used to invalidate the contents of the L1, L2, and L3
caches. The INVD instruction invalidates all internal cache entries, then generates a special-function
bus cycle that indicates that external caches also should be invalidated. The INVD
instruction should be used with care. It does not force a write-back of modified cache lines;
therefore, data stored in the caches and not written back to system memory will be lost. Unless
there is a specific requirement or benefit to invalidating the caches without writing back the
modified lines (such as, during testing or fault recovery where cache coherency with main
memory is not a concern), software should use the WBINVD instruction.
The WBINVD instruction first writes back any modified lines in all the internal caches, then
invalidates the contents of the L1, L2, and L3 caches. It ensures that cache coherency with
main memory is maintained regardless of the write policy in effect (that is, write-through or
write-back). Following this operation, the WBINVD instruction generates one (P6 family
processors) or two (Pentium and Intel486 processors) special-function bus cycles to indicate to
external cache controllers that write-back of modified data followed by invalidation of external
caches should occur.
The PREFETCHh instructions allow a program to suggest to the processor that a cache line from
a specified location in system memory be prefetched into the cache hierarchy (see Section 10.8,
“Explicit Caching”).
The CLFLUSH instruction allows selected cache lines to be flushed from the caches. This instruction
gives a program the ability to explicitly free up cache space, when it is known that a cached
section of system memory will not be accessed in the near future.
The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and
MOVNTPD) allow data to be moved from the processor’s registers directly into system memory
without being also written into the L1, L2, and/or L3 caches. These instructions can be used to
prevent cache pollution when operating on data that is going to be modified only once before
being stored back into system memory. These instructions operate on data in the general-purpose,
MMX, and XMM registers.
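Compilers commonly expose the non-privileged instructions described above through intrinsics. The following sketch assumes a compiler that provides the SSE/SSE2 intrinsic headers (xmmintrin.h and emmintrin.h) and falls back to ordinary loads and stores otherwise. The function names are hypothetical, and the cache line size is passed in by the caller because it is machine specific (it can be obtained with CPUID).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence */
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI), _mm_clflush (CLFLUSH) */
#endif

/* Fill dst[0..n) with value using non-temporal stores, so that the data
 * written need not displace other lines in the caches. */
void fill_non_temporal(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&dst[i], value);
#else
        dst[i] = value;  /* fallback: ordinary store */
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  /* order the streaming stores before later stores */
#endif
}

/* Sum src[0..n), issuing one prefetch hint per 16 elements rather than
 * one per element, since overuse of the hints can reduce performance. */
long sum_with_prefetch(const int *src, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        if (i % 16 == 0 && i + 16 < n)
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_T0);
#endif
        total += src[i];
    }
    return total;
}

/* Flush every cache line that overlaps [p, p + len). */
void flush_range(const void *p, size_t len, size_t line_size)
{
    assert(line_size != 0);
#if defined(__SSE2__)
    uintptr_t addr = (uintptr_t)p;
    uintptr_t end = addr + len;
    for (addr -= addr % line_size; addr < end; addr += line_size)
        _mm_clflush((const void *)addr);
#else
    (void)p;
    (void)len;
#endif
}
```

None of these calls changes the values a program observes; they affect only where the data resides in the cache hierarchy. The INVD and WBINVD instructions have no counterpart in this sketch because they are system instructions.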
10.5.6 L1 Data Cache Context Mode
L1 data cache context mode is a feature of IA-32 processors that support Hyper-Threading Technology.
When CPUID.1:ECX[bit 10] = 1, the processor supports setting L1 data cache context
mode using the L1 data cache context mode flag ( IA32_MISC_ENABLE[bit 24] ). Selectable
modes are adaptive mode (default) and shared mode.
The BIOS is responsible for configuring the L1 data cache context mode.
10.5.6.1 Adaptive Mode
Adaptive mode facilitates L1 data cache sharing between logical processors. When running in
adaptive mode, the L1 data cache is shared across logical processors in the same core if:
• CR3 control registers for logical processors sharing the cache are identical.
• The same paging mode is used by logical processors sharing the cache.
In this situation, the entire L1 data cache is available to each logical processor (instead of being
competitively shared).
If CR3 values are different for the logical processors sharing an L1 data cache or the logical
processors use different paging modes, processors compete for cache resources. This reduces
the effective size of the cache for each logical processor. Aliasing of the cache is not allowed
(which prevents data thrashing).
10.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is
true even if the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear
address in the cache can point to different physical locations. The mechanism for resolving
aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the
preferred configuration for IA-32 processors that support Hyper-Threading Technology.
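The IA32_MISC_ENABLE bits discussed in Sections 10.5.4 and 10.5.6 can be decoded as in the following sketch. The names are hypothetical, and the MSR and CPUID values are passed in as arguments rather than read with RDMSR and CPUID. The interpretation of bit 24 (0 selects adaptive mode, 1 selects shared mode) is inferred from the statement above that bit 24 = 0 is the preferred configuration and that adaptive mode is the default.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_ECX_L1D_CONTEXT  (UINT32_C(1) << 10) /* CPUID.1:ECX[bit 10] */
#define MISC_ENABLE_L3_DISABLE   (UINT64_C(1) << 6)  /* third-level cache disable */
#define MISC_ENABLE_L1D_CONTEXT  (UINT64_C(1) << 24) /* L1 data cache context mode */

typedef enum { L1D_MODE_UNSUPPORTED, L1D_MODE_ADAPTIVE, L1D_MODE_SHARED } l1d_mode;

/* True when the third-level cache disable flag is set (L3 cache disabled). */
bool l3_cache_disabled(uint64_t misc_enable)
{
    return (misc_enable & MISC_ENABLE_L3_DISABLE) != 0;
}

/* Reports the L1 data cache context mode, or L1D_MODE_UNSUPPORTED when
 * CPUID does not advertise the mode flag. */
l1d_mode l1d_context_mode(uint32_t cpuid_1_ecx, uint64_t misc_enable)
{
    if ((cpuid_1_ecx & CPUID_1_ECX_L1D_CONTEXT) == 0)
        return L1D_MODE_UNSUPPORTED;
    return (misc_enable & MISC_ENABLE_L1D_CONTEXT) ? L1D_MODE_SHARED
                                                   : L1D_MODE_ADAPTIVE;
}
```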
10.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the processor causes
the associated cache line (or lines) to be invalidated. This check is based on the physical address
of the instruction. In addition, the P6 family and Pentium processors check whether a write to a
code segment may modify an instruction that has been prefetched for execution. If the write
affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on
the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a
snoop of an instruction in a code segment, where the target instruction is already decoded and
resident in the trace cache, invalidates the entire trace cache. The latter behavior means that
programs that self-modify code can cause severe degradation of performance when run on the
Pentium 4 and Intel Xeon processors.
In practice, the check on linear addresses should not create compatibility problems among IA-32
processors. Applications that include self-modifying code use the same linear address for modifying
and fetching the instruction. Systems software, such as a debugger, that might possibly
modify an instruction using a different linear address than that used to fetch the instruction, will
execute a serializing operation, such as a CPUID instruction, before the modified instruction is
executed, which will automatically resynchronize the instruction cache and prefetch queue. (See
Section 7.1.3, “Handling Self- and Cross-Modifying Code”, for more information about the use
of self-modifying code.)
For Intel486 processors, a write to an instruction in the cache will modify it in both the cache
and memory, but if the instruction was prefetched before the write, the old version of the instruction
could be the one executed. To prevent the old instruction from being executed, flush the
instruction prefetch unit by coding a jump instruction immediately after any write that modifies
an instruction.
10.7 IMPLICIT CACHING (PENTIUM 4, INTEL XEON,
AND P6 FAMILY PROCESSORS)
Implicit caching occurs when a memory element is made potentially cacheable, although the
element may never have been accessed in the normal von Neumann sequence. Implicit caching
occurs on the Pentium 4, Intel Xeon, and P6 family processors due to aggressive prefetching,
branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of
existing Intel386, Intel486, and Pentium processor systems, since software running on these
processor families also has not been able to deterministically predict the behavior of instruction
prefetch.
To avoid problems related to implicit caching, the operating system must explicitly invalidate
the cache when changes are made to cacheable data that the cache coherency mechanism does
not automatically handle. This includes writes to dual-ported or physically aliased memory
boards that are not detected by the snooping mechanisms of the processor, and changes to page-table
entries in memory.
The code in Example 10-1 shows the effect of implicit caching on page-table entries. The linear
address F000H points to physical location B000H (the page-table entry for F000H contains the
value B000H), and the page-table entry for linear address F000H is PTE_F000.
Example 10-1. Effect of Implicit Caching on Page-Table Entries
mov EAX, CR3          ; Invalidate the TLB
mov CR3, EAX          ; by copying CR3 to itself
mov PTE_F000, A000H   ; Change F000H to point to A000H
mov EBX, [F000H]      ; Load from linear address F000H
Because of speculative execution in the Pentium 4, Intel Xeon, and P6 family processors, the
last MOV instruction performed would place the value at physical location B000H into EBX,
rather than the value at the new physical address A000H. This situation is remedied by placing
a TLB invalidation between the store and the load.
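The hazard in Example 10-1 can be illustrated with a toy software model. The sketch below is not a description of how any processor implements its TLBs; it assumes a single page-table entry and a one-entry TLB, and it models speculative execution as a translation of F000H that happens before the store to PTE_F000 takes effect. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one page-table entry and a one-entry TLB for linear page F000H. */
typedef struct {
    uint32_t pte_f000;   /* physical page currently stored in PTE_F000 */
    bool     tlb_valid;  /* does the TLB hold a translation for F000H? */
    uint32_t tlb_phys;   /* cached physical page */
} toy_mmu;

/* Translate F000H: use the TLB entry if present; otherwise walk the page
 * table and cache the result. A speculative access takes the same path. */
uint32_t toy_translate(toy_mmu *m)
{
    if (!m->tlb_valid) {
        m->tlb_phys = m->pte_f000;
        m->tlb_valid = true;
    }
    return m->tlb_phys;
}

void toy_invalidate_tlb(toy_mmu *m)
{
    m->tlb_valid = false;
}

/* Replays Example 10-1 and returns the physical page used by the final
 * load. When invalidate_after_store is true, the remedy is applied. */
uint32_t toy_example_10_1(bool invalidate_after_store)
{
    toy_mmu m = { 0xB000, false, 0 };

    toy_invalidate_tlb(&m);      /* mov CR3, EAX */
    (void)toy_translate(&m);     /* speculative access refills the TLB */
    m.pte_f000 = 0xA000;         /* mov PTE_F000, A000H */
    if (invalidate_after_store)
        toy_invalidate_tlb(&m);  /* remedy: invalidate before the load */
    return toy_translate(&m);    /* mov EBX, [F000H] */
}
```

In this model, toy_example_10_1(false) returns B000H (the stale translation), while toy_example_10_1(true) returns A000H.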
10.8 EXPLICIT CACHING
The Pentium III processor introduced four new instructions, the PREFETCHh instructions, that
provide software with explicit control over the caching of data. These instructions provide
“hints” to the processor that the data requested by a PREFETCHh instruction should be read into
the cache hierarchy now or as soon as possible, in anticipation of its use. The instructions provide
different variations of the hint that allow selection of the cache level into which data will be read.
The PREFETCHh instructions can help reduce the long latency typically associated with
reading data from memory and thus help prevent processor “stalls.” However, these instructions
should be used judiciously. Overuse can lead to resource conflicts and hence reduce the performance
of an application. Also, these instructions should only be used to prefetch data from
memory; they should not be used to prefetch instructions. For more detailed information on the
proper use of the prefetch instruction, refer to Chapter 6, “Optimizing Cache Usage for the Intel
Pentium 4 Processors”, in the Pentium 4 Processor Optimization Reference Manual (see
Section 1.4, “Related Literature”, for the document order number).
10.9 INVALIDATING THE TRANSLATION LOOKASIDE BUFFERS
(TLBS)
The processor updates its address translation caches (TLBs) transparently to software. Several
mechanisms are available, however, that allow software and hardware to invalidate the TLBs
either explicitly or as a side effect of another operation.
The INVLPG instruction invalidates the TLB entry for a specific page. This instruction is the most
efficient choice when software needs to invalidate only a specific page, because it avoids
invalidating the whole TLB. This instruction is not affected by the state of the
G flag in a page-directory or page-table entry.
The following operations invalidate all TLB entries except global entries. (A global entry is one
for which the G (global) flag is set in its corresponding page-directory or page-table entry. The
global flag was introduced into the IA-32 architecture in the P6 family processors; see Section
10.5, “Cache Control”.)
• Writing to control register CR3.
• A task switch that changes control register CR3.
The following operations invalidate all TLB entries, irrespective of the setting of the G flag:
• Asserting or de-asserting the FLUSH# pin.
• (Pentium 4, Intel Xeon, and P6 family processors only.) Writing to an MTRR (with a
WRMSR instruction).
• Writing to control register CR0 to modify the PG or PE flag.
• (Pentium 4, Intel Xeon, and P6 family processors only.) Writing to control register CR4 to
modify the PSE, PGE, or PAE flag.
See Section 3.12, “Translation Lookaside Buffers (TLBs)”, for additional information about the
TLBs.
10.10 STORE BUFFER
IA-32 processors temporarily store each write (store) to memory in a store buffer. The store
buffer improves processor performance by allowing the processor to continue executing instructions
without having to wait until a write to memory and/or to a cache is complete. It also allows
writes to be delayed for more efficient use of memory-access bus cycles.
In general, the existence of the store buffer is transparent to software, even in systems that use
multiple processors. The processor ensures that write operations are always carried out in
program order. It also ensures that the contents of the store buffer are always drained to memory
in the following situations:
• When an exception or interrupt is generated.
• (Pentium 4, Intel Xeon, and P6 family processors only) When a serializing instruction is
executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (Pentium 4, Intel Xeon, and P6 family processors only) When a BINIT operation is
performed.
• (Pentium III, Pentium 4, and Intel Xeon processors only) When using an SFENCE
instruction to order stores.
• (Pentium 4 and Intel Xeon processors only) When using an MFENCE instruction to order
stores.
The discussion of write ordering in Section 7.2, “Memory Ordering”, gives a detailed description
of the operation of the store buffer.
10.11 MEMORY TYPE RANGE REGISTERS (MTRRS)
The following section pertains only to the Pentium 4, Intel Xeon, and P6 family processors.
The memory type range registers (MTRRs) provide a mechanism for associating the memory
types (see Section 10.3, “Methods of Caching Available”) with physical-address ranges in
system memory. They allow the processor to optimize operations for different types of memory
such as RAM, ROM, frame-buffer memory, and memory-mapped I/O devices. They also
simplify system hardware design by eliminating the memory control pins used for this function
on earlier IA-32 processors and the external logic needed to drive them.
The MTRR mechanism allows up to 96 memory ranges to be defined in physical memory, and
it defines a set of model-specific registers (MSRs) for specifying the type of memory that is
contained in each range. Table 10-8 shows the memory types that can be specified and their
properties; Figure 10-3 shows the mapping of physical memory with MTRRs. See Section 10.3,
“Methods of Caching Available”, for a more detailed description of each memory type.
Following a hardware reset, a Pentium 4, Intel Xeon, or P6 family processor disables all the
fixed and variable MTRRs, which in effect makes all of physical memory uncacheable. Initialization
software should then set the MTRRs to a specific, system-defined memory map. Typically,
the BIOS (basic input/output system) software configures the MTRRs. The operating
system or executive is then free to modify the memory map using the normal page-level cacheability
attributes.
In a multiprocessor system, different Pentium 4, Intel Xeon, or P6 family processors MUST use
the identical MTRR memory map so that software has a consistent view of memory, independent
of the processor executing a program.
Table 10-8. Memory Types That Can Be Encoded in MTRRs
Memory Type and Mnemonic Encoding in MTRR
Uncacheable (UC) 00H
Write Combining (WC) 01H
Reserved* 02H
Reserved* 03H
Write-through (WT) 04H
Write-protected (WP) 05H
Writeback (WB) 06H
Reserved* 07H through FFH
NOTE:
* Use of these encodings results in a general-protection exception (#GP).
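The encodings in Table 10-8 can be captured in a small helper that distinguishes defined memory types from reserved values. This is an illustrative sketch with hypothetical names; it only classifies an encoding and does not write any MTRR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory type encodings from Table 10-8. */
enum {
    MTRR_ENC_UC = 0x00,  /* Uncacheable */
    MTRR_ENC_WC = 0x01,  /* Write Combining */
    MTRR_ENC_WT = 0x04,  /* Write-through */
    MTRR_ENC_WP = 0x05,  /* Write-protected */
    MTRR_ENC_WB = 0x06   /* Writeback */
};

/* Returns the mnemonic for a defined encoding, or NULL for a reserved
 * encoding (02H, 03H, and 07H through FFH). */
const char *mtrr_encoding_name(uint8_t enc)
{
    switch (enc) {
    case MTRR_ENC_UC: return "UC";
    case MTRR_ENC_WC: return "WC";
    case MTRR_ENC_WT: return "WT";
    case MTRR_ENC_WP: return "WP";
    case MTRR_ENC_WB: return "WB";
    default:          return NULL;
    }
}

/* Reserved encodings cause a general-protection exception (#GP) when
 * used, so software can check a value before programming an MTRR. */
bool mtrr_encoding_is_valid(uint8_t enc)
{
    return mtrr_encoding_name(enc) != NULL;
}
```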
10.11.1 MTRR Feature Identification
The availability of the MTRR feature is model-specific. Software can determine if MTRRs are
supported on a processor by executing the CPUID instruction and reading the state of the MTRR
flag (bit 12) in the feature information register (EDX).
If the MTRR flag is set (indicating that the processor implements MTRRs), additional information
about MTRRs can be obtained from the 64-bit IA32_MTRRCAP MSR (named MTRRcap
MSR for the P6 family processors). The IA32_MTRRCAP MSR is a read-only MSR that can
be read with the RDMSR instruction. Figure 10-4 shows the contents of the IA32_MTRRCAP
MSR. The functions of the flags and field in this register are as follows:
• VCNT (variable range registers count) field, bits 0 through 7 — Indicates the number
of variable ranges implemented on the processor. The Pentium 4, Intel Xeon, and P6
family processors have eight pairs of MTRRs for setting up eight variable ranges.
• FIX (fixed range registers supported) flag, bit 8 — Fixed range MTRRs
(IA32_MTRR_FIX64K_00000 through IA32_MTRR_FIX4K_0F8000) are supported
when set; no fixed range registers are supported when clear.
Figure 10-3. Mapping Physical Memory With MTRRs
Address Range           Size         MTRR Coverage
0 to 7FFFFH             512 KBytes   8 fixed ranges (64 KBytes each)
80000H to BFFFFH        256 KBytes   16 fixed ranges (16 KBytes each)
C0000H to FFFFFH        256 KBytes   64 fixed ranges (4 KBytes each)
100000H to FFFFFFFFH                 8 variable ranges (from 4 KBytes to maximum size of
                                     physical memory)
Address ranges not mapped by an MTRR are set to a default type.
• WC (write combining) flag, bit 10 — The write-combining (WC) memory type is
supported when set; the WC type is not supported when clear.
Bit 9 and bits 11 through 63 in the IA32_MTRRCAP MSR are reserved. If software attempts to
write to the IA32_MTRRCAP MSR, a general-protection exception (#GP) is generated.
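The feature check and the IA32_MTRRCAP fields described above can be decoded as in the following sketch. The names are hypothetical, and the values are passed in as arguments: obtaining them requires the CPUID instruction (function 1 is assumed for the feature flags in EDX) and the RDMSR instruction, which executes only at privilege level 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_EDX_MTRR (UINT32_C(1) << 12) /* MTRR flag in the feature information */

typedef struct {
    unsigned vcnt;  /* number of variable ranges, bits 0 through 7 */
    bool     fix;   /* fixed range MTRRs supported, bit 8 */
    bool     wc;    /* write-combining memory type supported, bit 10 */
} mtrr_cap;

/* True when the processor reports that it implements MTRRs. */
bool cpu_has_mtrr(uint32_t cpuid_1_edx)
{
    return (cpuid_1_edx & CPUID_1_EDX_MTRR) != 0;
}

/* Extracts the defined fields of an IA32_MTRRCAP value; bit 9 and
 * bits 11 through 63 are reserved and ignored here. */
mtrr_cap decode_mtrrcap(uint64_t msr)
{
    mtrr_cap cap;
    cap.vcnt = (unsigned)(msr & 0xFF);
    cap.fix  = ((msr >> 8) & 1) != 0;
    cap.wc   = ((msr >> 10) & 1) != 0;
    return cap;
}
```

A value of 508H, for example, decodes to eight variable ranges with both the fixed range registers and the WC memory type supported.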
For the Pentium 4, Intel Xeon, and P6 family processors, the IA32_MTRRCAP MSR always
contai







