Skip navigation.

Software Development

Correcting The Future

Massively Parallel Code

Parallel programming will come in many forms. Right now, most people think of multicore systems as the main type of parallel programming. However, I don't think this will be the way of the future. In the past, I said things will get smaller and faster and more numerous. Before we get there, there is an alternative and I've somewhat discussed it on a prior date. It's to have massive amount of data being transformed by the exact same piece of code. That's why I'm really fascinated by 3D hardware. In every image operation, the same set of operations is done on each and every pixel. Before that, each and every vertex is operated upon by its own, but identical, set of operations.

Why is this important? Anyone that's taken a parallel computing course knows that there are four types of computing.

1. SISD
2. SIMD
3. MISD
4. MIMD

1. Single Instruction, Single Data

This is what is commonly known as scalar operations. It's what is used all the time. A=B+C type of operations. Each operation works on one set of input and produces one set of output.

2. Single Instruction, Multiple Data

This is what is commonly known as vector operations. MMX, SSE and AltiVec are typical instruction sets that are SIMD. You issue a single instruction, but work on multiple sets of input at a time. For example:

addps xmm0, xmm1

The above is a SSE instruction that adds four pairs of 32bit floats at the same time. Each register can hold four distinct floating point values and this is how vector operations achieve a speedup. Of course, if you don't need to use the same operation on all four values, then you lose out on the speedup and actually end up wasting processing power.

3. Multiple Instruction, Single Data

This isn't used much. It's basically where you would need to perform different operations on the same data. Fault tolerant systems would be one use. I could also see this being used where you want to choose the best specific implementation for a particular task. For example, if you want to compress data, you could try several algorithms in parallel and pick the one that has the best compression ratio.

4. Multiple Instruction, Multiple Data

This is multicore programming. You have different programs running on different cores all working toward a common goal.


The scenario I was talking about earlier doesn't quite fit any of these models. It is closest to the SIMD model though. But it's not just ONE single instruction that is being repeated. It is a set of instructions. And it's not 2, 4, 8 or 16 data items. It's millions.

Image and video processing along with 3D rendering are a few obvious areas where this would apply. Could this be used in other areas? I can think of plenty of places. Google is the obvious example. Map/Reduce is the functional and more limited version of what I'm talking about. Anything where you use Map/Reduce would translate to this model. But not everything in this model translates to Map/Reduce.

The big question is how would you convert your code to use this kind of hardware? Realise that this kind of hardware is cheap to build and can offer additional substantial performance to even the most archaic of machines. So it'd be foolish to pass up this opportunity. Unfortunately, there is no demand for this kind of processing because hardware manufacturers wait for demand and don't try to produce it. AMD has a huge advantage here and you can see that Intel has made an alliance with NVidia. This is not a coincidence. Things will go parallel and may go in any of several directions. Problem is that no one is looking at the big picture or in how to take advantage of these assets.

There are really four basic requirements for this kind of data processing to work.

1. Replication
2. Reduction
3. Pipelining
4. Dependent reads

1. Replication

I made up this term. If there's a more appropriate one, let me know. It's basically where you can make your data grow. Scaling an image up is one example. Say you want both the quotient and remainder. That would double the data set. Basically, any type of task where the output is larger than the input.

2. Reduction

This is the opposite of the last one. It's where you reduce the amount of output. Taking the average of a list of numbers is the most obvious example.

3. Pipelining

In graphics programming, you would use multiple "passes". It's where the output of one task is passed onto the next stage. This is what gives this technique most of its power and where Map/Reduce is incapable of the same power. Pipelining gives you the ability to either wait for each pass to complete before starting the next stage, or you can add more hardware so that multiple stages can operate in parallel. Note that this last option would be used for non-linear pipelining where one stage may need the outputs of multiple prior stages.

4. Dependent reads

What are dependent reads? In graphics programming, it's where you can change the source location of the input. Say you want to average two textures in a blend operation, the pixels have a one to one relationship. Pixel 1 of the first image is blended with pixel 1 of the second image. With dependent reads, you can change what pixels you want to operate on within the code itself. The simplest example of this is bump mapping. Another example would be a type of switchboard where you want to redirect data as is done with routers.



This kind of programming is said to be very difficult. While multicore programming is also difficult, massively parallel code has its own set of problems. The first and main problem is that you don't have access to the results of the other outputs that are computed at the same time. You need multiple passes for that. So what happens is that dependent code fails in this scenario.

Example:

A[0] = B[0] + C[0];
A[1] = B[1] + A[0];


Why does this fail? Because all the operations aren't the same. Every single data item must be processed by the exact same code. So what you can do is input a third selection that tells the code what computing path to take. You could send in D where it's simply an alternating array of booleans and change the code to this.

if (!D[i])
  return B[i] + C[i];
else
  return B[i] + B[i-1] + C[i-1];
end if


The above code would work because none of the inputs relies on the output of a parallel computation. Note that this uses dependent reads (B[i-1] & C[i-1]). Also note that B[i-1]+C[i-1] is a computation that is done TWICE! Once for i-1 and again for i. This is contrary to every commonly known programming practice. Yet it's the correct way to do this. And that brings me back to the main problem I talked about earlier. There's no easy way to tell a compiler to convert a two pass algorithm like the last one into a one pass algorithm like this one. Sure the two stage code above could be converted. But unfortunately, you would not code up a two stage algorithm in that manner.

So think of every single for loop you've ever coded up where one iteration used the results of the last iteration. Now think about accomplishing the same thing without being able to use prior results. Just think of computing the Fibonacci sequence. How would you do it? The simplest algorithm won't work.

And that's really the root of the issue when it comes to parallel programming. If you use conventional programming practices or even what I've described here, you are essentially programming the hardware. There is no way you can write a piece of code that will run either on single core, multicores or on massively parallel hardware without changing the code. Parallel programming isn't just about coding for single cores as well as multicore. It goes beyond that. There are many different types of parallel architectures. Many of them that requires forgetting common programming practices. How often do you write software where the correct method involves recomputing the same results a million times over? How often do you replicate data simply to work around the fact that you can't use loops?

Loops are truly the one obstacle standing in the way of having a universal computing "language". If we did away with loops, it would open up doors that you never knew were there. The obstacle that loops represent is that they bind together a set of operations into an atomic compound operation. If you want to add an operation, you can't simply add it on. You must open up this loop, look inside and change its internals. The coupling that loops incur is prohibitive. With the massively parallel model I've discussed, you can simply use pipelining and tack on the operation.

What's the difference if you have to change the dependencies in both cases? It's this. Pipelining enables parallel processing. Loops do not. Pipelining involves implicit processing of many data items. Loops do not. Loops also hinder composition. You can't simply chain two loops one after the other in most cases. Instead, you have to break open both loops and merge them together. With pipelining, chaining is natural.

The main point I hope I got across is that programming for one particular model is no different than programming the hardware and is not the way we should be doing things. Also, that hardware need not exist in reality. Functional programming is hitting the hardware. Imperative is hitting the hardware. OOP is hitting the hardware. The web is hitting the hardware. These all have their own models and don't work on all different kinds of underlying platforms, only specific ones. I've heard many times from functional programmers that if the hardware was different, functional programming would be a better fit and that the Von Neumann architecture is what's causing all the problems. Well, that's actually a point against.

A real software development tool should be able to adapt to any hardware. I am learning new things every day that I wished I had known a LONG time ago. There are no books and no course you can take to find out about this information. I really wish others before me would have taken a closer look at the fundamentals of computing. Instead, we have theories that there are no silver bullets. We have nay sayers to good concepts in unknowingly wrong environments and yes men to the programming flavour du jour.

When we move forward into the future, are we really going to write software specifically for multicore and have that fall back on single core? What will happen when the next parallel model comes along? And the next? At least now, I have a better understanding of why this information still hasn't come out into the mainstream today instead of 50 years ago as it should have. It's not because humans think sequentially or that people have their favourite tools or any of that. It's because we were led to believe that computing has a different set of rules than the real world. That it was unique and that our imaginations were the limits of what is possible only to end up with the same obstacles no matter what was tried. It's not a lack of imagination, but rather an unfortunate path that was impossible to backtrack. Early on, major advances were being made. Long afterwards when programmers started hitting the same obstacles, everything thus far was built on top of those early advances. So it became impossible to throw all that away. Not that it was necessary, but that is the commonly held notion when you substitute a language or platform as has been the case every time since. In the irony of ironies, here we are incapable of going forward because we are unable to go backwards.

The physical world has certain properties that cannot be avoided even in the computing world. It's why we have the concept of leaky abstractions. It's also why we can find out information about the underlying platform despite the best efforts to keep things generic. Realistically, the notion that it's the hardware that's the problem and not the programming technique is untenable. It's simply not acceptable that when new hardware comes out, we have to code everything anew every time. That's no better than the DOS days. Also, if we did things before computers existed, then surely we can do them in a like fashion with computers. And that's the ultimate path. It's why patterns are so important. It's also why so many people do not understand how to create their own. Patterns go beyond the machine. It's these kinds of notions that will lead to the future of computing. It's also these that are the most resisted.

Who Is The Safe Choice For The White House? (Updated May 11, 2008)Battlestar Galactica: The Final Cylon

Write a comment

Comment
(BBcode and HTML is turned off for anonymous user comments.)

Please type this security code : 1204a1

Smilies

November 2008
S M T W T F S
October 2008December 2008
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29