OpenGL: Pixel Shaders and Why The Future of Software Depends on it.
Tuesday, 22. January 2008, 23:33:43
Recently, while coding up some OpenGL for the Project V GUI, I stumbled upon something rather strange. I guess I've been hiding in a dark hole all this time. I heard rumours. I heard stories. Mostly, I heard it was tough to get a handle on. What I had stumbled upon were pixel shaders and render to texture capabilities. I'm only using the old ATI SmartShaders (PixelShader 1.4 in DirectX) right now, but will get into more advanced shaders later on. The advantage with these old pixel shaders is that they work on old ATI Radeon cards (pre-9500) as well as newer ones. NVidia has their own custom shaders I think, but I don't have one of those. Luckily, it seems that OpenGL 2.0 has a standard for pixel shaders finally.
Anyways, this is the coolest stuff I've seen in a long time. Just when I had lost a lot of the drive to code, I find out about this stuff. I find it incredibly fun. I still haven't dealt with more advanced stuff like creating displacement maps and height maps and calculating normals for Phong shading and all that stuff. What I did learn are the basics. With SmartShader 1.0, you can do a total of 16 operations (32 if you use paired alpha channel computations), along with 6 extra operations that will read from 6 texture units. Newer cards with newer shader models allow for more registers and more texture units.
While 16 (or 32) instructions don't seem like much, they're actually quite powerful. I'm currently implementing a wavelet codec completely within the video card except for the packing and unpacking of compressed units. The compression actually takes place on the card, but most of the combining of the results is done on the CPU. Parallel compression was a challenge considering that compression normally relies on sequential operations.
So why am I looking into this? Well, the video card can do parallel processing. I wasn't fully aware of what was possible in this area. In fact, I didn't even know that the older cards in some of the machines I have here had this power at all. Project V is perfectly suited to use this processing power. So I'm implementing my test application on the video card itself to see how well it works. Then I'll use this knowledge to make it possible within Project V. In short, it'll use the video card if it can as well as the CPU and possibly other machines along with their video cards. Old machines with a bunch of ATI 9200 cards (which work in old AGP 2X machines) could outperform machines that are several orders of magnitude more costly and complex.
That's not the only issue at hand. Now we have CPU extensions such as AMD's extra registers, both general purpose and SSE. Then we have the MMX and SSE instructions themselves, which come in different versions. With video cards, especially under OpenGL, it's all about extensions. Pixel and vertex shaders come in several different flavours. So it'll become increasingly difficult to manage all these different capabilities. So much so that I believe the future must have a mechanism to handle this aside from defining a lowest common denominator the way VMs do. I hope that people will realise that a VM is nothing more than another platform, and we already have plenty of those: Windows, Linux, MacOS and whatever else. That it's hardware or software matters not. It changes nothing. It makes nothing easier unless you're doing stuff that's already been done.
The web is increasingly coming to realise that it is not immune to these forces. It too has a multitude of standards, browsers and compatibility problems. In the past, people tended to pick a winner. These days, there are so many people that use computers that you cannot simply dismiss certain hardware or standards or browsers without incurring a deep penalty.
Multi core seems to be the crisis that everyone talks about. But it's not the only game in town. Pixel shaders, vertex shaders, multitexturing, general purpose graphics processing units, web standards, etc. are all part of the same coming crisis. The multi core crisis is just more obvious. It contrasts with other technologies in that people are struggling to find ways to make the fullest use of multiple cores. With everything else, there are some people who are already using them. So it doesn't look like there's a problem.
The future will deal not only with the multi core crisis, but with portability. Portability within the same framework. When the veil finally comes off, you'll see sparks fly. OpenGL has a model for how to handle this. So does DirectX. They have a lot of flaws, as anyone who's dealt with this will tell you. Still, it's FAR better than general programming practices. Fallback measures are found everywhere in graphics programming. Unfortunately, these are cumbersome and difficult to keep track of. Same thing for processor capabilities. You can use CPUID to find out the capabilities of the processor, but how many people even know about this instruction, let alone understand it?
Hardware drives software no matter how much we may wish otherwise. And software is not immune to the same forces that affect hardware, in that multiple revisions mean multiple ways to do things. How we handle this will determine the future. We can no longer afford to micromanage and hardcode everything. Half-assed memory management and device handling that still work the same as in the days of old are one thing. New hardware and different software APIs (such as the web) are another.
Video Compression
This article is complete, but for those who want to know more about what I'm doing with pixel shaders, I'll leave a little explanation (and by little, I mean several pages worth). My codec has multiple steps. First, we convert from RGB to YUV color format. Each of these steps must be reversible, so keep that in mind while I go through them. Pixel shaders can have two stages (for SmartShaders). The first is usually used to do computations on the addressing. In other words, you compute WHERE in your texture you want to grab a texel (texture pixel). Then you grab it with the computed result. With SmartShaders, you can access 6 texture units at once, so you can calculate and grab up to six texels. Then you can do some processing on those (up to 6) pixel colors and output the result to the screen or into an off screen texture. As you can see, doing a color conversion is trivial because it's simply three equations, one for each color component: Y, U and V. There's a DOT3 instruction that can process one component at a time. It'll multiply each RGB component by one of three coefficients of your choosing and add the products together.
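To make the "three equations" concrete, here is a minimal sketch in Python rather than shader code. It assumes the common BT.601 coefficients, since the post doesn't say which matrix the codec uses, and the names `RGB_TO_YUV`, `dot3` and `rgb_to_yuv` are made up for illustration. Note that this floating-point version is only approximately invertible; a fully reversible step would need an integer transform.

```python
# Colour conversion as three dot products, one per output component,
# mirroring what a DOT3-style instruction computes for each pixel.
# ASSUMPTION: BT.601 coefficients; the article does not name its matrix.
RGB_TO_YUV = (
    (0.299, 0.587, 0.114),        # Y row
    (-0.14713, -0.28886, 0.436),  # U row
    (0.615, -0.51499, -0.10001),  # V row
)


def dot3(a, b):
    """Multiply two 3-component vectors element-wise and sum the products."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rgb_to_yuv(rgb):
    """Convert one (r, g, b) pixel with components in [0, 1] to (y, u, v)."""
    return tuple(dot3(row, rgb) for row in RGB_TO_YUV)
```

On the GPU the same dot products run for every pixel in parallel; the loop over pixels simply disappears.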
The thing to remember is that video cards have multiple pipelines. A video card is a massively parallel machine. Older cards like the ATI 9200 can handle 4 pixels at a time within the pixel shader while newer cards like the HD3850 can potentially handle 320 pixels at once (though this is combined for vertex shaders, geometry and pixel shaders). The Radeon HD 3870 X2 has two GPUs, so you can double that amount. Eventually, we'll be able to process ALL pixels at once. Pipelining is being talked about again in GPU design, and to a stronger degree. Right now, I believe pipelining is already being used for different stages such as geometry, rasterisation and rendering. Each stage can be processing different sets of sequentially issued commands.
Back to the codec... after I have YUV components, I use a wavelet algorithm on them. I'm using one that is cryptically named (2,2). I've tested all the others including Daubechies, Antonini and CDF and they all suck. Well, some of them are ok, but the extra complexity isn't worth the savings IMO. Besides, this is just a test. So I perform it once on the X axis and then on the Y axis. I repeat this on the first quadrant (one quarter the size of the original) and keep repeating until I have only one pixel left. The final picture will be exactly the same size as the original, but will be organised differently and with different values for pixels. Yet, wavelets allow you to reconstruct the original picture exactly. What wavelets do is organise all the most important data in the upper left corner of the picture. That's why we repeat the operation so that the top left pixel contains the most information. That means all other pixels will take less and less data to encode the further away you get from the top left corner. No compression has taken place yet. It's how we encode this new frame that handles this.
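Here is a rough one-dimensional sketch of a reversible wavelet step. It assumes that "(2,2)" refers to the integer lifting transform of that name (also known as the 5/3 wavelet); the post doesn't spell out the formula, so treat the exact rounding and the edge handling (neighbours are simply repeated at the boundary) as assumptions. The function names are made up for illustration.

```python
def forward_22(x):
    """One level of an integer (2,2) lifting transform on an even-length list.

    Returns the smooth (low-pass) half followed by the detail half.
    """
    n = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict: detail = odd sample minus the rounded average of its two
    # even neighbours (the last even sample is repeated at the right edge).
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)] + 1) >> 1)
         for i in range(n)]
    # Update: smooth = even sample plus a rounded quarter of the two
    # neighbouring details (the first detail is repeated at the left edge).
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s + d


def inverse_22(c):
    """Undo forward_22 exactly by running the lifting steps in reverse."""
    n = len(c) // 2
    s, d = c[:n], c[n:]
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)] + 1) >> 1)
           for i in range(n)]
    x = [0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x
```

Applying `forward_22` to every row, then every column, and then recursing on the low-pass quadrant gives the layout described above, with the coarsest data ending up in the top left corner. Because each lifting step only adds or subtracts a value computed from the other half, the inverse is exact in integer arithmetic.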
What I do for compression is reorganise the data again, this time into 8x8 blocks. But I build each block from dependent pixels only. The upper 8x8 block will obviously stay intact. But for lower levels, the data is not in 8x8 blocks. So we'll have to put those pixels back together again. Pixel shaders are great at this, but I'll try to order them correctly initially so that this isn't needed.
The next step is quantization. This is where you scale your data. This is a drastic measure and can have profound effects on the quality of the picture. Pixel shaders can handle this quite easily. Next is the encoding. While pixel shaders aren't effective in this stage, they can still do a lot. For example, they can count how often each pixel value occurs. With this information, you can construct a table for use with Huffman compression. The pixel shaders can even look up the code value and the number of bits used for each pixel. Then all the CPU has to do is put these values together and write them to disk. That's a simple view, but that's pretty much what's involved. The decompression goes through all these steps backwards.
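The quantization and table-building steps can be sketched on the CPU like this. It is only an illustration under assumptions: a single uniform step size, rounding toward zero, and a textbook Huffman construction that returns code lengths rather than the actual bit patterns. `quantize` and `huffman_code_lengths` are hypothetical names, not part of any codec.

```python
import heapq
from collections import Counter


def quantize(values, step):
    """Scale coefficients down by `step`, rounding toward zero (lossy)."""
    return [int(v / step) for v in values]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} derived from symbol counts."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]
```

The counting part (`Counter`) is what the text suggests moving to the GPU; merging the tree is inherently sequential and stays on the CPU in this sketch.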
The above steps are for I-frames. For P and B frames, something else happens. P frames are based off changes since the last I frame. And B frames are based off changes between I and P frames. For P frames, you want to search for where a block of 8x8 or 16x16 pixels has the closest match in the previous I frame. You encode this location along with a compressed version of the changes and that's your compressed frame. This is called motion compensation. The cool thing about GPUs and wavelets is that at every iteration of the wavelet computations, you obtain a smaller version of the original in the upper left corner. So you can do motion compensation at a MUCH smaller level than you ever could using conventional techniques. In fact, each iteration is only 4 pixels larger, so you can do 4x4 pixel checks rather quickly since even very old video cards have 4 texture units. With 4 passes, you can check all four locations FOR THE ENTIRE FRAME. The result will be U and V coordinates for the best matches. At the end, you'll have a UV map for every single pixel on screen. But most pixels will have UV offsets that will be VERY close to their neighbours unless something moved on screen since the last frame. This is PERFECT for wavelet encoding. Here, we do a lossless transform. If we do compress it, then we have to uncompress it again to see what final changes are needed to get the correct final result. I haven't decided yet what to do. For B frames, the process is the same except we do it for prior and subsequent frames. Usually, there are 3 B frames between I and P frames. Lately, B frames have fallen out of favour entirely.
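For reference, the block search that motion compensation performs can be sketched as a plain exhaustive search on the CPU. This is not the GPU multi-pass scheme described above, just an illustration of what is being computed. The sum of absolute differences (SAD) cost and the function names are assumptions; frames are lists of rows of integers.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in `cur` at (bx, by)
    and the block in `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, size, search):
    """Try every offset within +/- `search` pixels and return the (dx, dy)
    whose reference block matches the current block most closely."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= w
                    and 0 <= by + dy and by + dy + size <= h):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

Running the same search on a small low-pass level of the wavelet pyramid first, then refining at the next level, keeps the search window tiny at every step, which is the advantage the paragraph above is pointing at.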
With MPEG encoding, something else they do is use sub pixel motion compensation. They enlarge the picture and see if your block (8x8 or 16x16) of pixels, stretched out by the same amount, fits better with those interpolated pixels. Only the original 8x8 or 16x16 pixels are compared. Say you enlarge the I frame by 4 so that it's twice as wide and twice as high. You'd compute the missing pixels using an average of neighbouring pixels. For the 8x8 block you're comparing, you'd stretch it out as well, but leave the new pixels transparent. You don't use them. Then you go and compare again. Only now, you have 4 times as many locations to check, although there are algorithms to speed that up too. Here's where the GPU comes in. You don't need to stretch anything. Video cards can automatically supply sub pixels. Just ask the card for a pixel located at a fractional position and it will compute that pixel automatically as a weighted blend of the nearest texels. In OpenGL, GL_LINEAR will do this for you. Most video cards support at least 4 bits of sub pixel precision, meaning you can go to 16 sub pixels, not just 2 or 4 like most video codecs do. And you get this for free other than perhaps one or two extra passes. The benefit in quality would far surpass the extra computational time.
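What the hardware does for a fractional lookup is essentially bilinear interpolation. The sketch below is a CPU approximation under simplifying assumptions: coordinates index pixel centres directly (OpenGL's own texture coordinate convention differs), edges are clamped, and `sample_bilinear` is a made-up name.

```python
import math


def sample_bilinear(img, x, y):
    """Sample a 2-D list of numbers at fractional (x, y), clamping at edges."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two rows, then blend those results vertically.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

With a sampler like this, a quarter-pixel motion candidate is just the same block comparison with 0.25 added to the lookup coordinates; nothing has to be enlarged first.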
I'm mentioning all this here in case someone tries to patent this stuff. If these already exist, then I'm not doing the above. In fact, I swear none of this exists at the time of this writing!
Happy coding.
Interesting article. Thank you!
I generally follow the logic, especially where you say "Pixel shaders can handle this quite easily." And I think I understand what you are saying about compression. As a self-test, I will try to expound on your comment "The decompression goes through all these steps backwards."
It seems you are saying the steps for decompression would be:
(1) the CPU is used to convert the Huffman bit stream into an array of { length, value } pairs.
(2) a GPU pixel shader is used to convert these back into quantized DCT values
(3) a GPU pixel shader is used to do the iDCT to get the pixel values back
Very interesting (if I got that right), though the de-Huffman is still a boat load of work on my CPU.
Again, thanks for the article.
-Jesse Chisholm at USA dot net
By anonymous user, # 29. January 2008, 02:54:11
Maybe taking a look at the mplayer source code wouldn't hurt. Specifically, ffdshow. That's all in software on the CPU and it's extremely fast.
By Vorlath, # 29. January 2008, 14:29:37