Software Development

Correcting The Future

Heterogeneous Multicore Programming

A recent article in EETimes seems to indicate a panic of sorts when it comes to programming a diverse set of cores. The belief is that the common theme in the future will not be multiple cores of the same make, but rather different chips for both different and similar uses. Only over time will their functionality be merged and become more homogeneous. How best to deal with these problems is the ongoing theme.

I dare to posit a different scenario. Heterogeneous multicores are here now. Video cards are the biggest showcase of this. But let us not forget all the other kinds of processing that are done on our computers. Hard drives have their own caches and their own processors. So do sound cards and network cards. These often go unnoticed because they are very specialised. The only one that is different is the video card. More and more, video cards are becoming general purpose parallel processing cores. You can do a lot of processing on video cards that is unmatched by other hardware. Not only are they well suited for video processing, but you can do all sorts of manipulation on large sets of data. The only problem right now is that each processing pass must be independent. In other words, you can process millions of data items in parallel, but all the information needed must be provided at the beginning of the pass. One item cannot use the results of another item unless you use multiple passes. Often, you can simply recompute the other item that you need and this will not slow anything down. This runs against what common sense would indicate. In massively parallel environments, you often want to recompute something instead of waiting for the data to arrive.

Using video cards as an example where you can send streams of instructions, we can see that heterogeneous processing is already here. And we're already seeing this processing power go to waste. Truth be told, a LOT of the processing power of video cards can give older machines new life. I have two older boxes here used for backups that are nearly a decade old. One of them has a cheap 4MB video card and is almost unusable with KDE. The other box, running Windows, has an old ATI Radeon 9200 and is just as responsive as the dual core machines of today. It can't do much processing, but the video card takes over 50% of the processing time off the hands of the CPU, and yet the video card basically remains idle. The article linked above seems to want to continue this trend by making modules available for common tasks.

At one point though, we're going to get a scenario where we have different general purpose processors in the same machine. And this is where panic would seem to set in. If video cards mostly go to waste today, then there's no reason other processing cores would not suffer the same fate. What we should remember is that the programming model should change. And not just the stuff I've been talking about in the past with respect to data flow.

One thing we must learn is that wasteful computations WILL happen. Take an example with video cards. Say you want every pixel to also use neighbouring pixels in its computation. Instead of two stages, you want to merge them into one stage. What you do is recompute the first stage of each neighbouring pixel and then compute the result. This means roughly eight times as many first-stage computations are done as are necessary. But guess what? It's faster than the alternative. Here's why. First, the alternative is two passes, and multiple passes are always slower because you need to rearrange the data for the second pass. Second, recomputing the data is faster because you don't need to transfer the data between processing pipelines, which can then operate in parallel. Today, you can get several hundred pipelines going at once. Even the old ATI Radeon 9200 can do 4 pipelines at once.
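The trade-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any video card API: `stage1`, `neighbourhood`, `two_pass` and `fused` are hypothetical names, the first stage is an arbitrary per-pixel function, and the second stage is a plain sum over the 3x3 neighbourhood. The sketch counts how often the first stage runs in each approach. It says nothing about speed on a CPU, where the stored intermediate is cheap to read.

```python
# Hypothetical two-stage image filter. All names here are illustrative only.
calls = {"stage1": 0}

def stage1(image, x, y):
    """First stage: any per-pixel function of the input (here, a square)."""
    calls["stage1"] += 1
    return image[y][x] * image[y][x]

def neighbourhood(w, h, x, y):
    """Yield the in-bounds coordinates of the 3x3 block around (x, y)."""
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                yield nx, ny

def two_pass(image):
    """Pass 1 stores stage1 for every pixel; pass 2 reads neighbours' results."""
    h, w = len(image), len(image[0])
    mid = [[stage1(image, x, y) for x in range(w)] for y in range(h)]
    return [[sum(mid[ny][nx] for nx, ny in neighbourhood(w, h, x, y))
             for x in range(w)] for y in range(h)]

def fused(image):
    """Single pass: each pixel recomputes stage1 for its whole neighbourhood,
    so no intermediate image is stored or exchanged between pixels."""
    h, w = len(image), len(image[0])
    return [[sum(stage1(image, nx, ny) for nx, ny in neighbourhood(w, h, x, y))
             for x in range(w)] for y in range(h)]

image = [[(3 * x + 5 * y) % 7 for x in range(8)] for y in range(8)]

calls["stage1"] = 0
a = two_pass(image)
two_pass_calls = calls["stage1"]

calls["stage1"] = 0
b = fused(image)
fused_calls = calls["stage1"]

print(a == b, two_pass_calls, fused_calls)
```

Both versions produce the same image. On this 8x8 input the fused version runs the first stage 484 times against 64 for the two-pass version (border pixels have fewer neighbours), but every output pixel in the fused version depends only on the original input, which is exactly the independence a single parallel pass requires.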

What does all this mean? Simple. The cost of recomputation can often be far less than the cost of data transfer and reorganisation, especially between processing nodes. Most bottlenecks happen in data paths. They rarely happen during a computation where all the data is already available. This is contrary to conventional programming practice. So the question should not be about wasting processing. Rather, wasteful computations should be the norm when dealing with multicores. The problem really comes about in how to get these cores active. But I think the distinction is great enough to warrant closer attention.

Not only is getting all cores active a problem, we now also have to think about how best to transfer data. Do we transfer all the data, or do we only send enough data so that the other core can recompute that same information on its own? If you do tests, recomputation will often win out. The notion of not doing twice what you can do once is out the window. Today, with a minimal number of cores, recomputation isn't on anyone's mind. But when we deal with hundreds or thousands of cores, there is simply no way to maintain a global memory pool at full speed.
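To make the "send the data or send the recipe" question concrete, here is a small Python sketch. Everything in it is an assumption for illustration: `build_table` stands in for any derived data that the receiving core could rebuild itself, and `pickle` stands in for whatever serialisation a real interconnect would use. The two workers run in the same process here, so only the payload sizes are being compared, not real transfer times.

```python
import pickle

def build_table(start, count):
    """Derived data that either side can compute from two integers."""
    return [(i * i) % 1009 for i in range(start, start + count)]

def worker_with_data(payload):
    """Receives the full precomputed table and only consumes it."""
    table = pickle.loads(payload)
    return sum(table)

def worker_with_recipe(payload):
    """Receives just the parameters and recomputes the table locally."""
    start, count = pickle.loads(payload)
    return sum(build_table(start, count))

# Option 1: compute once on the sender and ship all of the data.
full = pickle.dumps(build_table(0, 100_000))

# Option 2: ship only enough for the receiver to redo the work itself.
recipe = pickle.dumps((0, 100_000))

result_a = worker_with_data(full)
result_b = worker_with_recipe(recipe)

print(result_a == result_b, len(full), len(recipe))
```

The results are identical, the table is computed twice in total, and the second payload is a tiny fraction of the first. Whether that wins overall depends on how expensive the recomputation is relative to the data path, which is why it is worth measuring.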

Another problem with multicore is the imperative model of programming that we use today. This includes procedural, object oriented and functional amongst others. I won't go into the debate about functional programming being partially imperative, but I will mention the one drawback it has when it comes to multicores. Functional programming uses the substitution model. That means that all computations must come back to the same location. So the best you can hope for is delegation. So client/server is the best you can achieve without breaking the functional paradigm (by using data flow monads and arrows for example).

Whether you use peer to peer (threads) or client/server or whatever else is in use today, there always has to be a central entity that controls how everything interacts. You cannot take part in the system without some other entity initiating an action on your behalf. This can be good, but it's mostly bad when you deal with scalable systems. You need to be able to add and remove computing nodes at any time.

The problem with peer to peer and client/server is this. Remove the central tracker or remove the server and the whole system falls apart. And that's my original point: most programming techniques use peer to peer or client/server as their concurrency model, regardless of what programming paradigm is involved.

I won't go into data flow. I've talked about this enough that anyone can go back and read up on why it's automatically concurrent. What I want to go into is how the problem of concurrency relates to portability. I've said it before, but I truly believe this next statement: portability is making the best use of each individual piece of hardware, while emulation simply involves reproducing the functionality of another platform. That's a HUGE difference. If we are to achieve true concurrency, we're going to have to come to terms with how best to achieve portability as I've defined it.

It comes down to a similar situation to that of handling people with different skill sets. How do you best organise them? For some reason, people think that this should be automatic. Panic sets in. Well, you know what? Programming involves a lot of the same decisions as in other fields. The problem is really that programmers are expected to be able to do it all. Most problems in programming have already been dealt with in the real world. Think about it. Is it cheaper to move (transmit) a house than it is to build (recompute) a new one? Do certain people (cores) have specialised skills (functionality)? After the electrical crew comes in, does not the drywall crew come in (pipelining) where the electrical crew now works on another house? Why resource management is so frowned upon is mind boggling to me.

Right now, while working on Project V, it's not difficult, but it is tedious. I have to completely reorganise all the resources and how they are managed. I've had to take a million steps backwards just to take one forward. Sometimes I think the reason people think programming is hard is because it's such a mess right now. Some of the best programmers don't deal with the technology at the start, but rather with the big picture that goes beyond the machine. That way, it simply becomes a matter of translating the big picture onto the physical hardware. And that's what true portability is all about. We need a way to let the machine do this for us. That's why it's taking me a long time. Everything today is done opposite to this idea. But I'll dispense with my usual rant against VMs (horrible disasters that they are).

Just to give you an idea of how to advance, think about this. With humans, we're always thinking about individuals and coordinating free and able bodies. You always end up with conflicts and organisation problems. So these problems eventually find their way into the way we write software. We bring those same problems with us. But with multicore, there is no longer the individual (CPU). There are only the masses (concurrency). Again, this problem was solved in ancient times as well as in 1908 by Ford. Here's a paragraph from an article that discusses processor pipelines.

The pipeline itself comprises a whole task that has been broken out into smaller sub-tasks. The concept actually has its roots in mass production manufacturing plants, such as Ford Motor Company. Henry Ford determined long ago that even though it took several hours to physically build a car, he could actually produce a car a minute if he broke out all of the steps required to put a car together into different physical stations on an assembly line. As such, one station was responsible for putting in the engine, another the tires, another the seats, and so on.


From Wikipedia:

As a result [of the assembly line], Ford's cars came off the line in three minute intervals, much faster than previous methods, increasing production by seven to one (requiring 12.5 man-hours before, 1 hour 33 minutes after), while using less manpower. In 1914, an assembly line worker could buy a Model T with four months' pay.


With computers, we have the luxury that the tasks are already broken down (that's what programmers are best at doing) and we don't need a linear pipeline. We can have nodes take in multiple inputs and produce multiple outputs. We can merge and split data as we see fit. We can also move nodes from one task to another quite easily, in the same way we do multitasking today, but as part of a pipeline where a node alternates between computing tasks. I've learned more about programming in the real world than I ever will in computer courses or programming books. Assembly lines are commonplace today in most production facilities. Yet programming still regresses because there is a clear lack of understanding when it comes to the fundamentals of computing. Below all paradigms, there are only two computing models. These are imperative and data flow. There are NO other fundamental computing models. Data flow has historically been the clear winner when it comes to production. Unfortunately, there are very few tools and virtually no material that discusses these topics for computers. If more developers can learn about these techniques and stop dismissing them out of hand (such as saying visual programming never works when it's used in the wrong places), we'll be better off. On this issue, the future has already been written. It's called the past. Sometimes you do have to take a million steps backwards to take one forward.
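As a rough picture of nodes with multiple inputs and outputs, here is a minimal Python sketch of a non-linear pipeline that splits data and merges it again. It is not taken from Project V or any existing data flow tool. The `Node` class, its port names and the firing rule (a node fires once every input port holds a value) are assumptions made for this example, and the nodes run one after another in a single thread.

```python
class Node:
    """A node fires as soon as every one of its input ports holds a value."""

    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func        # returns a dict {output_port: value}
        self.inputs = inputs    # list of input port names
        self.pending = {}       # values received so far, keyed by input port
        self.wires = {}         # output_port -> list of (node, input_port)

    def connect(self, out_port, target, in_port):
        self.wires.setdefault(out_port, []).append((target, in_port))

    def receive(self, port, value, log):
        self.pending[port] = value
        if all(p in self.pending for p in self.inputs):
            args = [self.pending[p] for p in self.inputs]
            self.pending = {}
            log.append(self.name)
            # Push each output value down every wire attached to its port.
            for out_port, out_value in self.func(*args).items():
                for target, in_port in self.wires.get(out_port, []):
                    target.receive(in_port, out_value, log)

results = []
split = Node("split", lambda xs: {"evens": [x for x in xs if x % 2 == 0],
                                  "odds": [x for x in xs if x % 2]}, ["in"])
sum_evens = Node("sum_evens", lambda xs: {"out": sum(xs)}, ["in"])
sum_odds = Node("sum_odds", lambda xs: {"out": sum(xs)}, ["in"])
merge = Node("merge", lambda a, b: {"out": a + b}, ["a", "b"])
sink = Node("sink", lambda v: results.append(v) or {}, ["in"])

split.connect("evens", sum_evens, "in")   # one node, two outputs
split.connect("odds", sum_odds, "in")
sum_evens.connect("out", merge, "a")      # one node, two inputs
sum_odds.connect("out", merge, "b")
merge.connect("out", sink, "in")

log = []
split.receive("in", [1, 2, 3, 4, 5, 6], log)
print(results, log)
```

Nothing in the graph says when `sum_evens` and `sum_odds` run relative to each other. They depend only on `split`, so a scheduler would be free to place them on different cores, and `merge` simply waits until both of its inputs have arrived.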
