Skip navigation.

Bit's World

Hope the best,plan the worst!

PC Processor Microarchitecture -- The Memory Subsystem

from : http://www.extremetech.com/article2/0,1697,1155316,00.asp
The Memory Subsystem

The memory subsystem plays a big part in the microarchitecture of a CPU. Notice that both the Instruction Access stage and the Data Access stage of our simple processor must get to memory. This memory can be split into separate sections for instructions and data, allowing each stage to have a dedicated (hence faster) port to memory.

This is called a "Harvard Architecture", a term from work at Harvard University in the 1940s that has been extended to also refer to architectures with separate instruction and data caches--even though main memory (and sometimes L2 cache) is "unified". For some background on cache design, you can refer to the memory hierarchy discussion in the article, "PC Motherboard Technology". That article also covers the system bus interface, an important part of the PC CPU design that is tailored to support the internal microarchitecture.

Virtual Memory: Making Life Easier for the Programmer and Tougher for the Hardware Designer
To make life simpler for the programmer, most addresses are "virtual addresses" that allow the software designer to pretend to have a large, linear block of memory. These virtual addresses are translated into "physical addresses" that refer to the actual addresses of the memory in the computer. In almost all x86 chips, the caches contain memory data that is addressed with physical addresses. Before the cache is accessed, any virtual addresses are translated in a "Translation Look-aside Buffer (TLB)". A TLB is like a cache of recently-used virtual address blocks (pages), responding back with the physical address page that corresponds to the virtual address presented by the CPU core. If the virtual address isn't in one of the pages stored by the TLB (a TLB miss), then the TLB must be updated from a bigger table stored in main memory--a huge performance hit (especially if the page isn't in main memory and must be loaded from disk). Some CPUs have multiple levels of TLBs, similar to the notion of cache memory hierarchy. The size and structure of the TLBs and caches will be important during our CPU comparisons later, but we'll focus mainly on the CPU core for our analysis.

Exploiting ILP Through Pipelining

Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our "pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're now able to execute a new instruction every single clock. We made our computer five times faster, just with this "simple" change.

Let's Just Think About This a Minute
We'll use a bunch of computer engineering terms in a moment, since we've got to keep that person in the back of the room happy. Before doing that, take a step back and think about what we did to the machine. (Even experienced engineers forget to do that sometimes.) Suddenly, memory fetches have to occur five times faster then before. This implies that system and cache must now run five times as fast, even though each instruction still takes five cycles to completely execute.

We've also made a huge assumption that each stage was taking exactly the same amount of time, since that's the rule that our pipeline clock is enforcing. What about the assumption that the processor was even going to run the next four instructions in that order? We (usually) won't even know until the execute stage whether we need to branch to some other instruction address. Hey, what would happen if the sequence of instructions called for the processor to load some data from memory and then try to perform a math operation using that data in the next instruction? The math operation would likely be delayed, due to memory latency slowing down the process.

They're Called Pipeline Hazards
What we're describing are called "pipeline hazards", and their effects can get really ugly. There are three types of hazards that can cause our pipeline to come to a screeching halt--or cause nasty errors if we don't put in extra hardware to detect them. The first hazard is a "data hazard", such as the problem of trying to use data before it's available (a "data dependency"). Another type is a "control hazard" where the pipeline contains instructions that come after a branch. A "structural hazard" is caused by resource conflicts where an instruction sequence can cause multiple instructions to need the same processor resource during a given clock cycle. We'd have a structural hazard if we tried to use the same memory port for both instructions and data.

Modern Pipelines Can Have a Lot of Stall Cycles
There are ways to reduce the chances of a pipeline hazard occurring, and we'll discuss some of the ways that CPU architects deal with the various cases. In a practical sense, there will always be some hazards that will cause the pipeline to stall. One way to describe the situation is to say that an instruction will "block" part of the pipe (something modern implementations help minimize). When the pipe stalls, every (blocked) instruction behind the stalled stage will have to wait, while the instructions fetched earlier can continue on their way. This opens up a gap (a "pipeline bubble") between blocked instructions and the instructions proceeding down the pipeline in front of the blocked instructions.

When the blocked instruction restarts, the bubble will continue down the pipeline. For some hazards, like the control hazard caused by a (mispredicted) branch instruction, the following instructions in the pipeline need to be killed, since they aren't supposed to execute. If the branch target address isn't in the instruction cache, the pipeline can stall for a large number of clock cycles. The stall would be extended by the latency of accesses to the L2 cache or, worse, accesses to main memory. Stalls due to branches are a serious problem, and this is one of the two major areas where designers have focused their energy (and transistor budget). The other major area, not surprisingly, is when the pipeline goes to memory to load data. Most of our analysis will focus in on these 2 latency-induced problems.

Design Tricks To Reduce Data Hazards
For some data hazards, one commonly-used solution is to forward result data from a completed instruction straight to another instruction yet to execute in the pipeline (data "forwarding", though sometimes called "bypassing"). This is much faster than writing out the data and forcing the other instruction to read it back in. Our case of a math operation needing data from a previous memory load instruction would seem to be a good candidate for this technique. The data loaded from memory into a register can also be forwarded straight to the ALU execute stage, instead of going all the way through the register write-back stage. An instruction in the write-back stage could forward data straight to an instruction in the execute stage.

Why wait 2 cycles? Why not forward straight from the data access stage? In reality, the data load stage is far from instantaneous and suffers from the same memory latency risk as instruction fetches. The figure below shows how this can occur. What if the data is not in the cache? There would be a huge pipeline bubble. As it turns out, data access is even more challenging than an instruction fetch, since we don't know the memory address until we've calculated the Effective Address. While instructions are usually accessed sequentially, allowing several cache lines to be prefetched from the instruction cache (and main memory) into a fast local buffer near the execution core, data accesses don't always have such nice "locality of reference".



The Limits of Pipelining
If five stages made us run up to five times faster, why not chop up the work into a bunch more stages? Who cares about pipeline hazards when it gives the marketing folks some really high peak performance numbers to brag about? Well, every x86 processor we'll analyze has a lot more than five stages. Originally called "super-pipelining" until Intel (for no obvious reason) decided to rename it "hyper-pipelining" in their Pentium 4 design, this technique breaks up various processing stages into multiple clock cycles.

This also has the architectural benefit of giving better granularity to operations, so there should be fewer cases where a fast operation waits around while slow operations throttle the clock rate. With some of the clever design techniques we'll examine, the pipeline hazards can be managed, and clock rates can be cranked into the stratosphere. The real limit isn't an architectural issue, but is related to the way digital circuits clock data between pipeline stages.

To pipeline an operation, each new stage of the pipeline must store information passed to it from a prior stage, since each stage will (usually) contain information for a different instruction. This staged data is held in a storage device (usually a "latch"). As you chop up a task into smaller and smaller pipeline stages, the overhead time it takes to clock data into the latch ("set-up and hold" times and allowance for clock "skew" between circuits) becomes a significant percentage of the entire clock period. At some point, there is no time left in the clock cycle to do any real work. There are some exotic circuit tricks that can help, but it would burn a lot of power - not a good trade-off for chips that already exceed 70 watts in some cases.

xploiting ILP Via Superscalar Processing

While our simple machine doesn't have any serious structural hazards, that's only because it is a "single-issue" architecture. Only a single instruction can be executed during a clock cycle. In a "superscalar" architecture, extra compute resources are added to achieve another dimension of instruction-level parallelism. The original Pentium provided 2 separate pipelines that Intel called the U and V pipelines. In theory, each pipeline could be working simultaneously on 2 different sets of instructions.

With a multi-issue processor (where multiple instructions can be dispatched each clock cycle to multiple pipelines in the single processor), we can have even more data hazards, since an operation in one pipeline could depend on data that is in another pipeline. The control hazards can get worse, since our "instruction fetch bandwidth" rises (doubled in a 2-issue machine, for example). A (mispredicted) branch instruction could cause both pipelines to need instructions flushed.

Issue Restrictions Limit How Often Parallelism Can Be Achieved
In practice, a superscalar machine has lots of "issue restrictions" that limit what each pipeline is capable of processing. This structural hazard limited how often both the U and V pipe of the Pentium could simultaneously execute 2 instructions. The limitations are caused by the cost of duplicating all the hardware for each pipeline, so the designers focus instead on exploiting parallelism in as many cases as practical.



Combining Superscalar with Super-Pipelining to Get the Best of Both
Another approach to superscalar is to duplicate portions of the pipeline. This becomes much easier in the new architectures that don't require instructions to proceed at the same rate through the pipeline (or even in the original program order). An obvious stage for exploiting superscalar design techniques is the execute stage, since PC's process three different types of data. There are integer operations, floating-point operations and now "media" operations. We know all about integer and floating-point. A media instruction processes graphics, sound or video data (as well as communications data). The instruction sets now include MMX, 3DNow!, Enhanced 3DNow!, SSE, and SSE2 media instructions. The execute stage could attempt to simultaneously process all three types of instructions, as long as there is enough hardware to avoid structural hazards.

In practice, there are several structural hazards that require issue restrictions. Each new execution resource could also have its own pipeline. Many floating-point instructions and media instructions require multiple clocks and aren't fully pipelined in some implementations. We'll clear up any confusion when we analyze some real processors later. For now, it's only important to understand the fundamentals of superscalar design and realize that modern architectures include combinations of multiple pipelines running simultaneously.

Exploiting Data-Level Parallelism Via SIMD

We'll talk more about this later, but the new focus on media instructions has allowed CPU designers to recognize the inherent parallelism in the way data is processed. The same operation is often performed on independent data sets, such as multiplying data stored in a vector or a matrix. A single instruction is repeated over and over for multiple pieces of data. We can design special hardware to do this more efficiently, and we call this a "Single Instruction Multiple Data (SIMD)" computing model.

More Pressure on the Memory System
Once again, take a step back and think about the implications before that person in the back of the room gets us to dive into implementation details. With some intuitive analysis, we can observe that we've once again put tremendous pressure on our memory subsystem. A single instruction coming down our pipeline(s) could force multiple data load and store operations. Thinking a bit further about the nature of media processing, some of the streaming media types (like video) have critical timing constraints, and the streams can last for a long time (i.e. as a viewer of video, you expect a continuous flow of the video stream over time, preferably without choppiness or interruptions). Our data caches may not do us much good, since the data may only get processed once before the next chunk of data wants to replace it (data caches are most effective when the same data is accessed over and over). Thus the CPU architects have some new challenges to solve.

Where Should Designers Focus The Effort?

By now, you've likely come to realize that every CPU vendor is trying to solve similar problems. They're all trying to take a 1970s instruction set and do as much parallel processing as possible, but they're forced to deal with the limitations of both the instruction set and the nature of memory systems. There is a practical limit to how many instructions can be processed in parallel, and it gets more and more difficult for the hardware to "dynamically" schedule instructions around any possible instruction blockage. The compilers are getting better at "statically" scheduling, based on the limited information available at compile time. However, the hardware is being pushed to the limits in an attempt to look as far ahead in the instruction stream as possible in the search for non-blocking instructions.

It's All About Memory Latency
As we've shown, there are 2 stages of our computer model where the designers can get the most return on their efforts. These are Instruction Fetch and Data Access, and both can cause an enormous performance loss if not handled properly. The problem is caused by the fact that our pipelines are now running at over one GHz, and it can take over 100 pipeline cycles to get something from main memory. The key to solving the problem is to make sure that the required instructions or data aren't sitting in main memory when you need them, but instead, are already in a buffer inside your pipeline--or at least sitting in an upper level of your cache hierarchy.

Branch Prediction Can Solve the Problem With I-Fetch Latency
If we could predict with 100% certainty which direction a program branch is going (forward or backward in the instruction stream), then we could make sure that the instructions following the branch instruction are in the correct sequence in the pipeline. That's not possible, but improvement in the branch predictor can have a dramatic performance gain for these modern, deeply-pipelined architectures. We'll analyze some branch prediction approaches later.

Data Memory Latency is Much Tougher to Handle
One way to deal with data latency is to have "non-blocking loads" so that other memory operations can proceed while we're waiting for the data for a specific instruction to come back from the memory system. Every x86 architecture does this now. Still, if the data is sitting in main memory when the load is being executed, the chip's performance will take a severe hit. The key is to pre-fetch blocks of data before they're needed, and special instructions have been added to directly allow the software to deal with the limited locality of data.

There are also some ways that the pipeline can help by buffering up load requests and using intelligent data pre-fetching techniques based on the processor's knowledge of the instruction stream. We'll analyze some of the vendor solutions to the problem of data access.


VIM quick referenceGeneral Rules for Optimization

Write a comment

You must be logged in to write a comment. If you're not a registered member, please sign up.