Merging of Different Operations
Sunday, 12. April 2009, 22:45:19
For example, in video cards, we render one texture at a certain location to essentially copy one texture into an (x,y) location in another (the backbuffer). Leaving out 3D operations for the moment, isn't this really just an array copy? How would a generic operation be described?
Beyond that, pixel shaders now let different operations happen between source textures (arrays) before the result is applied to the backbuffer. For now, the operations available when applying the result to the backbuffer are rather limited. Still, a few are possible: copying, addition, subtraction, modulation, etc. If there is an alpha channel, you can use it as a set of weights, which might be of use in neural network applications, for example.
As we move more toward generic computations on video cards, I think older cards still have a potential to be used in ways that may not have been obvious or easy in the past.
Should we allow something like:
Array.copy(sourceArray,x,y);
Array.add(sourceArray,x,y);
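Here is a rough sketch of what such calls might boil down to on plain memory, assuming row-major float arrays. The names Array2D, BlendOp and array_apply are made up for illustration and are not part of any existing API; the operation list simply mirrors the backbuffer operations mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2D array type: row-major floats. */
typedef struct {
    float *data;
    size_t width, height;
} Array2D;

/* Hypothetical operation selector, mirroring the blend modes named in the text. */
typedef enum { OP_COPY, OP_ADD, OP_SUB, OP_MODULATE } BlendOp;

/* Apply src onto dst with the top-left corner of src placed at (x, y). */
void array_apply(Array2D *dst, const Array2D *src, size_t x, size_t y, BlendOp op)
{
    /* Assumption: the source must fit entirely inside the destination. */
    assert(x + src->width <= dst->width);
    assert(y + src->height <= dst->height);

    for (size_t row = 0; row < src->height; ++row) {
        for (size_t col = 0; col < src->width; ++col) {
            float s = src->data[row * src->width + col];
            float *d = &dst->data[(y + row) * dst->width + (x + col)];
            switch (op) {
            case OP_COPY:     *d = s;  break;
            case OP_ADD:      *d += s; break;
            case OP_SUB:      *d -= s; break;
            case OP_MODULATE: *d *= s; break;
            }
        }
    }
}
```

With this shape, Array.copy(sourceArray,x,y) would correspond to array_apply(&dst, &src, x, y, OP_COPY), and Array.add to the same call with OP_ADD. Note that every mode except OP_COPY reads the destination as well as writing it.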
In a dataflow environment, I think using operations (and even pixel shaders) between sources would be the appropriate way to do things. One area where video cards don't work very well with the dataflow way of doing things is that the destination can be both a source and a destination at the same time. Sorta like this.
a+=b;
So I think having components that take inputs and where you provide actions between those inputs would be cool. There aren't just transformations on the data, though. There are also transformations on the addresses (x,y).
I'm thinking of adding more than just array types: full matrix support with index transformation and multi-channel support (basically multiple data per element, like structures).
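To make the two ideas concrete, here is a minimal sketch of a multi-channel element plus an address transformation applied while copying. Everything here (Element, IndexTransform, gather, and the two transform functions) is a hypothetical name chosen for the example, and it assumes square n-by-n arrays to keep the bounds simple.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical multi-channel element: several values per array slot. */
typedef struct { float r, g, b, a; } Element;

/* Hypothetical address transform: maps a destination (x, y) to a source (sx, sy). */
typedef void (*IndexTransform)(size_t x, size_t y, size_t *sx, size_t *sy);

void transform_identity(size_t x, size_t y, size_t *sx, size_t *sy)
{
    *sx = x;
    *sy = y;
}

void transform_transpose(size_t x, size_t y, size_t *sx, size_t *sy)
{
    *sx = y;
    *sy = x;
}

/* Fill an n-by-n dst by reading src through the address transform.
   The data is untouched; only the addresses change. */
void gather(Element *dst, const Element *src, size_t n, IndexTransform t)
{
    for (size_t y = 0; y < n; ++y) {
        for (size_t x = 0; x < n; ++x) {
            size_t sx, sy;
            t(x, y, &sx, &sy);
            assert(sx < n && sy < n); /* transform must stay inside the source */
            dst[y * n + x] = src[sy * n + sx];
        }
    }
}
```

Keeping the address transform separate from the data operation means the same copy or add could be reused with a transpose, a flip, or an offset without writing a new loop for each combination.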
This way, no matter if you're using graphics or arrays, or whatever else, it'll use the best hardware available.
The biggest issue I have right now is being able to detect generic operations and map them to specific hardware. That's going to be tough. And I'm not talking about little things. I'm talking about hardware that can handle massive amounts of data at once. That's where the biggest gains are. But even if I only succeed a little bit, that'll be an advantage that we wouldn't have otherwise anyhow.
Also, to give readers a better idea of where I'm heading, I am building Project V components in stages. The first components will be built-in. Each one will have a built-in piece of code that gets activated every time that component is executed. Kinda boring.
But later on, this isn't how it will work. Native components will instead describe how to access the hardware. The compiler will then be able to merge different operations together. Not just in the way that current compilers work though. In dataflow, there is a very dynamic property to it. For example, if you get only one data value coming in, then you'll use a generic assembly operation. If there are two to four floating point values coming in, you can use SSE. If there is a huge amount of data coming in, you can use the video card to process them all at once. Not only that, but it can often be SLOWER to do it the way I just described. Sometimes, it's best to merge multiple operations together before going on to the next element. So if there are 12 elements that come in, that's too few to send to the video card. But assume there is an increment and a multiply operation in sequence: it's best to apply BOTH operations to each element in one pass instead of running one operation at a time over all the data.
// Component 1
for(int i=0;i<12;i+=4)
{
output[i..i+3] = input[i..i+3]+1;
}
// Component 2 (output of component 1 is inputA)
for(int i=0;i<12;i+=4)
{
output[i..i+3] = inputA[i..i+3]*inputB[i..i+3];
}
versus
// Merged: increment and multiply in a single pass over the original input
for(int i=0;i<12;i+=4)
{
output[i..i+3] = (input[i..i+3]+1)*inputB[i..i+3];
}
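The comparison above uses a made-up slice notation for 4-wide operations. As a compilable sketch of the same idea, here it is in plain scalar C; the function names are hypothetical, and a real implementation would presumably use SSE for the 4-wide part rather than relying on the compiler to vectorize these loops.

```c
#include <assert.h>
#include <stddef.h>

/* Component 1 on its own: increment every element. */
void increment(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + 1.0f;
}

/* Component 2 on its own: multiply two inputs element by element. */
void multiply(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

/* Merged version: one pass over the data, with no intermediate
   buffer written and re-read between the two components. */
void increment_then_multiply(float *out, const float *in, const float *b, size_t n)
{
    assert(out && in && b);
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] + 1.0f) * b[i];
}
```

Calling increment into a temporary buffer and then multiply gives the same values as a single call to increment_then_multiply, but the merged form touches each element once and skips the temporary entirely.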
That's not all. Depending on the number of elements, different code will be produced and cached. So I'm going to have to store descriptions for different code. Sometimes, I won't generate the code on the fly right away, but instead keep a counter on different code descriptions that COULD be used. In the meantime, I'll execute the generic code (or something else that is capable of producing the result). Once a particular code description gets a high enough count, the controller will generate that code in the background. This will be another task just like the software that it is executing. Once it's available, it'll use that. So the cost of generating code is only paid when it should actually produce an advantage in the long run. The main obstacle I see is producing some code that doesn't get used anymore just as that code becomes available. This happens in CPUs right now with branch prediction.
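A minimal sketch of the counting scheme, under some loud assumptions: the threshold value is arbitrary, "generating code" is faked by swapping in a prewritten function, and the swap happens synchronously instead of in a background task. CodeDescription, run_component and both kernels are hypothetical names for the example only.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*Kernel)(float *out, const float *in, size_t n);

/* Generic fallback: always available, handles any element count. */
void kernel_generic(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + 1.0f;
}

/* Stand-in for generated code specialized to blocks of four elements. */
void kernel_block4(float *out, const float *in, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        out[i]     = in[i]     + 1.0f;
        out[i + 1] = in[i + 1] + 1.0f;
        out[i + 2] = in[i + 2] + 1.0f;
        out[i + 3] = in[i + 3] + 1.0f;
    }
}

enum { HOT_THRESHOLD = 3 }; /* assumed value, picked for the example */

typedef struct {
    unsigned hits;   /* times this description COULD have been used */
    Kernel compiled; /* NULL until the specialized code is available */
} CodeDescription;

void run_component(CodeDescription *desc, float *out, const float *in, size_t n)
{
    int fits = (n > 0 && n % 4 == 0);

    if (fits && desc->compiled) {
        desc->compiled(out, in, n); /* specialized code is ready: use it */
        return;
    }

    kernel_generic(out, in, n);     /* otherwise run the generic code */

    /* Count the missed opportunity; once hot, make the specialized code
       available. A real engine would queue this as a background task. */
    if (fits && ++desc->hits >= HOT_THRESHOLD)
        desc->compiled = kernel_block4;
}
```

The generic path always produces the result, so correctness never depends on the specialized code showing up. The obstacle described above is visible here too: if inputs stop arriving in multiples of four right after the threshold is reached, the specialized code was produced for nothing.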
As you can see, the runtime engine will be LOTS of fun for me and this is what I really want to do. The rest of it is cool too. But this is where it's at for me. And then think about adding network latency between machines instead of RAM latencies between cores. That'll affect the kind of code that gets generated too. FUN STUFF!

