# Software Development

Correcting The Future

## Project V: Advanced Alpha Blending

Everything you ever wanted to know about alpha blending and more.

When it comes to alpha blending, few implementations put in the extra effort needed for maximum flexibility, because there is always a battle between speed, memory and complexity. As we shall see, there are many different ways of doing alpha blending.

By far, the most common form of alpha blending is applying a semi-transparent image over a static background. The semi-transparent image contains red, green and blue channels as well as an alpha channel that specifies how opaque each pixel is. Each channel holds values ranging from 0 to 255 in a 32-bit image.

What this means is that an alpha value of 255 means full intensity and an alpha value of 0 means that the background shows completely through. Any other value gives some percentage of the image and the inverse percentage of the background. For example, if the alpha value is at 30%, you'll see 30% of the image and 70% of the background showing through.

The equation for a semi-transparent image applied on a static background is as follows:

`pixel = (alpha * image + (255 - alpha) * background)/255`

It may not be obvious what is happening here. Most articles on the subject do not even attempt to explain it. First, let's look at how a pixel is actually stored. In most cases, you'll see blue, then green, then red and finally the alpha channel for each and every pixel.

Byte 0 is B (blue).
Byte 1 is G (green).
Byte 2 is R (red).
Byte 3 is A (alpha).

This is for the image. The background is stored the same way, but the alpha channel is ignored. So if we have S for the source image and D for the destination (background), we would get the following algorithm.

```
D[B] = (S[A] * S[B] + (255 - S[A]) * D[B])/255
D[G] = (S[A] * S[G] + (255 - S[A]) * D[G])/255
D[R] = (S[A] * S[R] + (255 - S[A]) * D[R])/255
```
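
As a concrete sketch, here is the above algorithm in C for a single 32-bit BGRA pixel. The function name is my own, not from any particular library.

```c
#include <stdint.h>

/* "Source over static background" blend, exactly as in the equations above.
 * Pixels are 32-bit BGRA: byte 0 = blue, byte 1 = green, byte 2 = red,
 * byte 3 = alpha.  The destination's alpha is ignored (static background). */
uint32_t blend_over(uint32_t src, uint32_t dst)
{
    uint32_t a = (src >> 24) & 0xFF;          /* S[A] */
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;   /* source channel */
        uint32_t d = (dst >> shift) & 0xFF;   /* dest channel   */
        uint32_t c = (a * s + (255 - a) * d) / 255;
        out |= c << shift;
    }
    return out | 0xFF000000;                  /* result is fully opaque */
}
```

With alpha at 255 the source comes through untouched; with alpha at 0 the background does.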

Here, most other articles proceed to shorten the above equations. We'll get to that, but there is something much more important here. Notice that the source is multiplied by the alpha value while the destination is multiplied by its additive inverse. Why is this? Well, let's look at the equation for the blue channel again.

`D[B] = (S[A] * S[B] + (255 - S[A]) * D[B])/255`

Let's propagate the division by 255 so that we get two independent parts being added together.

`D[B] = (S[A] * S[B])/255  +  ((255 - S[A]) * D[B])/255`

The left part takes a certain percentage from S[B] (source blue), and the right part takes the leftover percentage from D[B] (dest blue). What this means is that the total percentage is always 100%. Let's look at that.

```
S[A]/255 + (255 - S[A])/255
(S[A] + 255 - S[A])/255
255/255
1
```

So the reason we take the additive inverse from the destination is because we want to fill in the leftover percentage not used by source. What's important is that the total percentage is always 100%. This way, the final result will never exceed 100% of the maximum value for that pixel. If the source takes 10%, the dest takes 90%. The maximum will be 100% of 255 which is 255. If the source takes 32%, the dest will take 68%. Again, it adds up to 100%.

Ok, so what does this mean? Let's back up first. In most other articles, the main obstacle is the division by 255, so they show how the equations can be simplified. Let's do that right now. Here are our original equations.

```
D[B] = (S[A] * S[B] + (255 - S[A]) * D[B])/255
D[G] = (S[A] * S[G] + (255 - S[A]) * D[G])/255
D[R] = (S[A] * S[R] + (255 - S[A]) * D[R])/255
```

We will expand the inside brackets.

```
D[B] = (S[A] * S[B] + 255 * D[B] - S[A] * D[B])/255
D[G] = (S[A] * S[G] + 255 * D[G] - S[A] * D[G])/255
D[R] = (S[A] * S[R] + 255 * D[R] - S[A] * D[R])/255
```

We pull out S[A].

```
D[B] = (S[A] * (S[B] - D[B]) + 255 * D[B])/255
D[G] = (S[A] * (S[G] - D[G]) + 255 * D[G])/255
D[R] = (S[A] * (S[R] - D[R]) + 255 * D[R])/255
```

We can split off the terms multiplied by 255, cancelling them against the division.

```
D[B] = (S[A] * (S[B] - D[B]))/255 + D[B]
D[G] = (S[A] * (S[G] - D[G]))/255 + D[G]
D[R] = (S[A] * (S[R] - D[R]))/255 + D[R]
```

That's much shorter since there is now only one multiplication for each channel. This is usually the best that you can find.
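
In C, the one-multiply form for a single channel might look like this (a sketch with my own function name). Note that S - D can be negative, so the arithmetic must be done signed; C's truncation toward zero can also make the result differ by one from the original equation when S < D.

```c
#include <stdint.h>

/* D = D + (A * (S - D))/255, the simplified blend with one multiply.
 * The difference (s - d) can be negative, hence the signed arithmetic. */
uint8_t blend_channel(uint8_t s, uint8_t d, uint8_t a)
{
    int32_t t = (int32_t)a * ((int32_t)s - (int32_t)d);
    return (uint8_t)((int32_t)d + t / 255);
}
```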

But there's more. When we paste transparent images onto a background, the source images usually don't change. So we can precalculate their values as used in the above equations and store the precalculated values in our source images. This saves a lot of computation when the same images are used several times. For this, we need to use an earlier set of equations.

```
D[B] = (S[A] * S[B])/255  +  ((255 - S[A]) * D[B])/255
D[G] = (S[A] * S[G])/255  +  ((255 - S[A]) * D[G])/255
D[R] = (S[A] * S[R])/255  +  ((255 - S[A]) * D[R])/255
```

We precompute the following:

```
S[PB] = (S[A] * S[B])/255
S[PG] = (S[A] * S[G])/255
S[PR] = (S[A] * S[R])/255
S[PA] = 255 - S[A]
```

We convert all our source images with the above equations. When it comes time to blend them with a background, we can now use the much simplified equations below.

```
D[B] = S[PB] + (S[PA] * D[B])/255
D[G] = S[PG] + (S[PA] * D[G])/255
D[R] = S[PR] + (S[PA] * D[R])/255
```

Notice that you just add the precomputed source value. This is significantly simpler and faster.
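
Sketched in C (with my own helper names), precomputation and the fast paste look like this. One caveat: because the precomputed form does two separate integer divisions, its output can differ from the direct equation by one intensity level.

```c
#include <stdint.h>

/* Precompute a BGRA source pixel: PB = A*B/255, PG = A*G/255,
 * PR = A*R/255, and store the inverse alpha PA = 255-A in the alpha byte. */
uint32_t precompute(uint32_t src)
{
    uint32_t a = (src >> 24) & 0xFF;
    uint32_t out = (255 - a) << 24;              /* stored inverse alpha */
    for (int shift = 0; shift < 24; shift += 8)
        out |= (((src >> shift) & 0xFF) * a / 255) << shift;
    return out;
}

/* Fast paste of a precomputed pixel onto a static background:
 * D = P + PA*D/255 per colour channel.  Dest alpha is ignored. */
uint32_t paste_precomputed(uint32_t pre, uint32_t dst)
{
    uint32_t ia = (pre >> 24) & 0xFF;            /* PA = 255 - A */
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t p = (pre >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        out |= (p + ia * d / 255) << shift;
    }
    return out | 0xFF000000;                     /* result is fully opaque */
}
```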

The only other topic you are likely to find is how to get rid of the division by 255. Divisions are inherently slow. One easy way to speed up this algorithm is to just divide by 256 instead. Dividing by 256 is simply shifting right by 8 bits. It doesn't get any faster than that. The problem is that you never get maximum intensity. You'll always get some of the background showing through. No matter how hard you try, 254 will be your maximum value. All your images will be a little darker, and if you apply an image several times, it will become noticeable.

Another option is some bit twiddling. The reciprocal of 255 is very nearly 257/65536, so taking your number X, multiplying it by 257 and shifting right by 16 bits gives a result close to X/255. Better yet, 257 has two set bits exactly 8 bits apart, so the multiplication is free: ((X>>8) + X)>>8 approximates X/255. But this too is a little off. Some claim that if you add 128 to X and then apply the preceding technique, it gives the correct result. I could not get this to work and frankly, it's not supposed to. The problem is that this approximation is just that: an approximation. Multiplying by 257 is not the actual inverse of 255. The true reciprocal of 255 is an infinite series, so each extra shift-and-add term gets closer to the expected value but never actually reaches it. Any value that comes out exactly right does so only because we throw away the remainder when dividing, so the errors sometimes cancel out.

Now, we can finally go back to what we talked about earlier. We said before that the entire purpose of the equations for alpha channels is to have the source use up a certain percentage and have the destination use up the remaining percentage to make up a total of 100%. So what really matters is to be able to take percentages of the source and destination and have them add up to 100%. The fact that the alpha channel can take values up to 255 makes us think that we have to divide by 255. But this is not so. We can divide by 256 instead with a little tweaking.

Looking at alpha channels with maximum values of 255, we want the source to use up a portion of this 255, say 50, and have the destination use up the rest. So the remainder of 50 from 255 is 205. 205+50 = 255. Basically, we want the total to always be 255 (which is 100%). The techniques used to divide by 255 may well provide an adequate result. They'll just be slightly off some of the time by one intensity level. Not a big deal. But it takes two shifts and two additions. Maybe there's a better way.

Why not use a total of 256 instead of 255? Well, because the maximum value we can store in an alpha channel is 255. We'd never be able to have an opaque pixel because we can never store the required 256. So how about adding 1 to all alpha values greater than or equal to 128? Now we can shift right by 8 to divide by 256 and get proper results.

But how do we add 1 for alpha values greater than or equal to 128? Would this not require a comparison? Comparisons with branches are expensive. Well, no. Not if you use assembly. Comparing against 128 sets the carry flag exactly when the number is less than 128, and a subtract-with-borrow of -1 then adds 1 only when the carry is clear, giving the correct adjusted alpha value.

In x86 assembly:

```
CMP EAX, 128   ; carry flag set if EAX is less than 128
SBB EAX, -1    ; EAX = EAX - (-1) - carry
```

That last one is tricky. Subtracting -1 means we're actually adding 1, but SBB also subtracts the carry. So if EAX is lower than 128, we add 1 and subtract 1 (the carry); they cancel out and the value stays the same. On the other hand, if EAX is 128 or above, the carry is 0, so we only add 1 (or subtract -1). And this is what we want. So we can adjust our alpha value in two instructions. Another way is to copy EAX, shift the copy right by 7 bits and add it back to EAX. That's three instructions and uses an extra register, but the technique is useful if you use MMX.
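
The same adjustment in C is just an add and a shift (a sketch; `adjust_alpha` is my own name for it):

```c
/* Bump alpha by 1 when it is 128 or more, so that a plain >>8 can stand
 * in for the division by 255.  The result can be 256, so it needs more
 * than eight bits. */
unsigned adjust_alpha(unsigned a)
{
    return a + (a >> 7);    /* a>>7 is 1 exactly when a >= 128 */
}
```

`(adjust_alpha(a) * pixel) >> 8` then replaces `(a * pixel)/255`, and an alpha of 255 really does give full intensity.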

So let's do a quick recap. If you don't care about speed, you can use the default equations and your output will be exact. If you don't like dividing by 255, you can divide by 256 and accept that your images will always be a little darker and can never reach 100% opacity. The next method uses the approximate inverse of 255 with the equation ((t>>8)+t)>>8, optionally adding 128 to t first. And the final method is to add 1 to your alpha value when it's greater than or equal to 128. I compared all the methods against division by 255: if you multiply an alpha value with a pixel value and divide by 255, that is the reference result. Then I ran all the other algorithms and compared their output with what it should have been. Here's what I got.

1. 100% DUH! (alpha * pixel)/255
2. 74.5% (alpha * pixel)/256
3. 51.5% t = (alpha * pixel) + 128; ((t>>8)+t)>>8
4. 97.7% t = (alpha * pixel); ((t>>8)+t)>>8
5. 99.4% t = (alpha * pixel); (((t+128)>>8)+t)>>8
6. 100% t = (alpha * pixel); (((t+255)>>8)+t)>>8
7. 82.3% (((alpha>127)?alpha+1:alpha) * pixel)>>8
8. 86.6% (((alpha>127)?alpha+1:alpha) * ((pixel>127)?pixel+1:pixel))>>8

I don't know why many sites report the third option as producing correct output; according to my results it's way off. Option 6 looks pretty good: it requires two adds and two shifts as well as a temporary variable. The second-to-last one is what I suggested. It's above 80%, and the only differences are one intensity level spread out over mid-range values. For speed with adequate results, it's the best you can get. But for one or two instructions more, you can get 100% correct results. I doubt the difference will be noticeable. The choice is yours.
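
Two of these claims are easy to check exhaustively in C: method 6 should match integer (floor) division by 255 for every product alpha*pixel, and method 3 should match division by 255 rounded to the nearest value (which is also what the EDIT at the end of this post finds). This check is my own, not part of the original test harness.

```c
/* Verify method 6 against floor division and method 3 against rounded
 * division by 255, for every possible alpha*pixel product (0..65025). */
int check_methods(void)
{
    for (unsigned a = 0; a <= 255; a++) {
        for (unsigned p = 0; p <= 255; p++) {
            unsigned t  = a * p;
            unsigned m6 = (((t + 255) >> 8) + t) >> 8;   /* method 6 */
            unsigned r  = t + 128;
            unsigned m3 = ((r >> 8) + r) >> 8;           /* method 3 */
            if (m6 != t / 255)         return 0;  /* floor division   */
            if (m3 != (t + 127) / 255) return 0;  /* nearest rounding */
        }
    }
    return 1;
}
```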

Let's see what it would look like using MMX. If you're not into this, skip over it.

First, we'll assume our data is a rectangular region. So we need a source and a destination. We don't need to know where in the destination the source should be pasted because the destination pointer should be precalculated with this in mind. Next, we need to know the pitch of both the source and destination. The pitch is the number of pixels between each line. Finally, we need to know how many rows and columns to copy over from the source. We'll call this the width and height.

In C, this would be our prototype:

`void __stdcall MMXPasteAlpha(int *dest, int *source, int destPitch, int sourcePitch, int width, int height)`

We need to grab the source and destination pointers. We'll put these in eax and ebx. At each row, we'll add the pitch and then copy them over to esi and edi.

```
mov eax, source
mov ebx, dest
```

Now, we need to get the width and height and check if they are not zero.

```
mov edx, height
cmp edx, 0
jle near .out

cmp dword width, 0
jle near .out
```

If you have an AMD or Intel processor in 64-bit mode, you can use the extra registers to store the width and pitch. Everyone else will have to keep them on the stack. Not that it matters much; even older processors alias memory references that are used several times onto a virtual register.

Next, we need to clear an MMX register to zero because we'll need it later when we convert our values to 16-bit. We also need 255. I'll explain why later.

```
pxor mm7, mm7

pcmpeqw mm6, mm6    ; FFFF FFFF FFFF FFFF
psrlw mm6, 8        ; 00FF 00FF 00FF 00FF (255,255,255,255)

; Set mm6 to 127 if using method 7, by using the following in place of
; the preceding instruction.
; psrlw mm6, 9        ; 007F 007F 007F 007F (127,127,127,127)
```

Now, we can start the outer loop. We copy over our main source and dest pointers to esi and edi and use that for every pixel in the row.

```
.outerloop:
mov esi, eax
mov edi, ebx
```

We need our counter for the row. We'll use ecx.

```
mov ecx, width
cmp ecx, 1
jz .onePixel
sub ecx, 2
```

Here, we use a little trick. We're going to be processing two pixels at once, so we need to check if there's only one pixel left, as we don't want any memory access errors. The subtraction by two is our trick. At the end of our inner loop, we could again check if there's only one pixel left and then check again if there's none left, otherwise loop back. But there's an easier way. All we really want to know is if there are at least two pixels left. So we start off by subtracting two. This means our counter is always 2 less than what it should be. If there's only one pixel left, it reads -1. If there are two pixels left, it reads 0. At the end of our loop, after we update our counter, we can just use a branch instruction that jumps if the result is greater than or equal to 0. We don't even need a compare instruction, since the update instruction sets the CPU flags for us.

We can finally get to the good stuff. We load up two pixels from the source and two from the destination.

```
.innerloop:
movq mm0, [esi]       ; a2r2 g2b2 a1r1 g1b1   (groups of 16 bits from high to low)
movq mm1, [edi]       ; a4r4 g4b4 a3r3 g3b3
```

Pixels 1 and 2 are the source, which will be blended onto destination pixels 3 and 4 respectively.

We need to convert our values to 16-bit because we're going to be multiplying. We also need to separate out the alpha values from the source. For this example, we're assuming the source pixels are precalculated.

```
movq mm3, mm0         ; a2xx xxxx a1xx xxxx (we only care about the alpha values)
; Erase the following instruction if using method 2 and you have a PIII or better.
psrlw mm3, 8          ; 00a2 00xx 00a1 00xx (shift right)

; If using method 7, we need to adjust our alpha values.
; movq mm5, mm3       ; copy alpha values
; psrlw mm5, 7        ; get high bit only (to low bit)
; paddw mm3, mm5      ; add 1 if greater than or equal to 128

; We need the alpha byte on every word.

; Use the following 5 instructions if you have a Pentium or PII,
; and delete the two instructions after that.
;movq mm4, mm3         ; 00a2 00xx 00a1 00xx (copy it for our second pixel)
;punpcklwd mm3, mm3    ; 00a1 00a1 00xx 00xx
;punpckhwd mm4, mm4    ; 00a2 00a2 00xx 00xx

;punpckhdq mm3, mm3    ; 00a1 00a1 00a1 00a1
;punpckhdq mm4, mm4    ; 00a2 00a2 00a2 00a2

pshufw mm4, mm3, 0ffh  ; 00a2 00a2 00a2 00a2
pshufw mm3, mm3, 055h  ; 00a1 00a1 00a1 00a1

; Convert dest RGB to 16-bit.
movq mm2, mm1         ; a4r4 g4b4 a3r3 g3b3

punpcklbw mm1, mm7    ; 00a3 00r3 00g3 00b3
punpckhbw mm2, mm7    ; 00a4 00r4 00g4 00b4
```

Finally, we can do the actual alpha blending.

```
pmullw mm1, mm3
pmullw mm2, mm4
```

Now, we have to put our pixels back to 8-bit as well as divide by 255 or 256, depending on the method we chose. Dividing by 256 is simple (methods 2 and 7).

```
psrlw mm1, 8
psrlw mm2, 8
```

If we're using method 2 (divide by 256 only), we can do much better than this. We can leave our alpha values in 8.8 fixed-point format. When we multiply, we get a 24.8 result, but we still need to divide by 256, so in reality our result is in 16.16 format. This means we only need the high-order 16 bits. It just so happens that there is an instruction that multiplies unsigned words and returns only the high word of each result, if you have a PIII or better. If you're using method 2, you can replace the above two code segments with the following.

```
pmulhuw mm1, mm3   ; unsigned multiply, keep the high 16 bits (PIII or better)
pmulhuw mm2, mm4
```

But what if we are using method 6? We need to do (((t+255)>>8)+t)>>8. How is that done? One step at a time. It should now be apparent why we set mm6 to 255. This code goes in place of the two shifts above, right after the multiplication "pmullw mm2, mm4".

```
movq mm3, mm1
movq mm4, mm2

paddw mm3, mm6   ; t+255
paddw mm4, mm6

psrlw mm3, 8     ; (t+255)>>8
psrlw mm4, 8

paddw mm1, mm3   ; ((t+255)>>8)+t
paddw mm2, mm4

psrlw mm1, 8     ; (((t+255)>>8)+t)>>8
psrlw mm2, 8
```

This will give exact results. Somehow I like my way better, just adjusting the alpha values and dividing by 256. There's no way anyone will see the difference, and it takes far fewer instructions.

At last, we pack up our values, add 'em up and write it out.

```
packuswb mm1, mm2   ; xxr4 g4b4 xxr3 g3b3

paddusb mm1, mm0    ; add in the precomputed source pixels

movq [edi], mm1
```

The rest of the code is easy, so I'll just write it out. I'm also leaving out how to handle a single pixel as it's the same as above except you only write out one pixel with "movd" instead of "movq".

```
; increment our pointers
add esi, 8          ; two pixels, four bytes each
add edi, 8

.onePixel:
sub ecx, 2
jge .innerloop

cmp ecx, -1
jnz .cont

; Here we would do a single pixel
...

.cont:
; advance source and dest by one row (pitch is in pixels)
mov ecx, sourcePitch
lea eax, [eax + ecx*4]
mov ecx, destPitch
lea ebx, [ebx + ecx*4]

sub edx, 1
jnz .outerloop

emms
.out:
; cleanup code
```

The next thing I want to talk about is the more generic case of alpha blending, where the alpha values of both the source and the destination are used. When would this be useful? It's useful when you want to create an intermediate image that will then be used later on. For example, if I want to create a composite image from two or more other images, then I will need this composite image to retain alpha information. This alpha will be used when I paste the intermediate image onto the final background. The reasons for composite images are many, but usually it's when you want to create images on the fly from other images.

So image A and B are merged to produce image C. Image C is then pasted any number of times on the background. In order to produce image C, the alpha channels of both A and B must be used. How is this accomplished?

Well, let's take the example that A is pasted onto a copy of B (which will end up being C). We know that when we use an alpha channel, the topmost image uses a certain percentage and the background image uses the leftover percentage. Our background in this case is B. In effect, our destination alpha channel will have to be reduced to the amount specified by this leftover percentage. We'll rename A to S, and B to D.

```
Alpha channel S: Remains the same.
Alpha channel D: ((255-S[A])*D[A])/255
```

Now we have to convert our color channels. How do we do this? Well, since we know the total final alpha value, we check what percentage our original alphas are. This will determine by how much we must adjust our values.

```
// Get final alpha channel.
FA = S[A] + ((255-S[A])*D[A])/255

// Get percentage (out of 255) of source alpha compared to final alpha.
if (FA==0) SA = 0
else SA = S[A]*255/FA

// Destination percentage is just the additive inverse.
DA = 255-SA

D[B] = (S[B] * SA + D[B] * DA)/255
D[G] = (S[G] * SA + D[G] * DA)/255
D[R] = (S[R] * SA + D[R] * DA)/255

// Final color is FA,D[R],D[G],D[B]
```

It's not that much more complicated, but it is a lot slower. There's a real division there that can't be hacked away. The reason for this extra computation is that the R,G,B values must be at full intensity, so we have to compute what those values are BEFORE any alpha channel is applied. For example, if I have a red value of 255 but an alpha of 128, then when I finally draw it on the screen, I'll have to apply the alpha to that red value before writing it out, because the screen has no concept of alpha values.

```
alpha = 128
pixel = 255
final output = (alpha * pixel)/255
             = (128 * 255)/255
             = 128
```

Once I apply the alpha channel, I don't need it anymore. I can just keep the RGB values because the alpha channel will always be 100%. However, if we don't apply the alpha channel, then we must make sure to keep our RGB values at full intensity. That's why it gets more complicated. But like we said, if we can apply the alpha channel, then we could simplify things a great deal.
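
The normal-on-normal equations above transcribe straight into C (a sketch, with my own function name):

```c
#include <stdint.h>

/* Composite a normal BGRA pixel s over a normal BGRA pixel d, keeping
 * full-intensity colours and a combined alpha, as in the equations above. */
uint32_t composite(uint32_t s, uint32_t d)
{
    uint32_t sa = (s >> 24) & 0xFF;
    uint32_t da = (d >> 24) & 0xFF;
    uint32_t fa = sa + (255 - sa) * da / 255;    /* final alpha        */
    uint32_t wa = fa ? sa * 255 / fa : 0;        /* source share of fa */
    uint32_t out = fa << 24;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t sc = (s >> shift) & 0xFF;
        uint32_t dc = (d >> shift) & 0xFF;
        out |= ((sc * wa + dc * (255 - wa)) / 255) << shift;
    }
    return out;
}
```

The real division by fa is the part that can't be hacked away with shifts.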

Let's assume both our source and our destination are precalculated. Well, the source values are already computed. The destination simply gets reduced to the leftover percentage. And since both images hold alpha values that say how much leftover percentage there is, we can just multiply them together to get our final alpha value.

```
// Both source and dest are precalculated.
D[B] = S[B] + (S[A] * D[B])/255
D[G] = S[G] + (S[A] * D[G])/255
D[R] = S[R] + (S[A] * D[R])/255
D[A] = (S[A] * D[A])/255
```

Here, we can see significant advantages to using precalculated images. It's as simple as our original case except for the extra alpha channel computation. Oddly enough, this comes for free if we use MMX since we can multiply all of the destination's RGBA values at once.

There are only two cases left. And they get a little more complicated. We'll deal with the case where only the destination is precalculated. Because the above is so simple, we'll just precalculate source pixels on the fly and apply the above algorithm.

```
IA = 255 - S[A]
D[B] = (S[B] * S[A] + D[B] * IA)/255
D[G] = (S[G] * S[A] + D[G] * IA)/255
D[R] = (S[R] * S[A] + D[R] * IA)/255
D[A] = (D[A] * IA)/255
```

It's the exact same thing as before, but with the precalculation algorithm included for the source.

Now, we get to the last case: a precomputed source on a normal destination. Ironically, this setup is the most complicated, whereas it used to be the simplest when the destination was static. There are many special cases. Remember that a precomputed image stores the inverse alpha. If the source's stored alpha is 255, then only the destination is used; nothing need be done. If it is 0, then only the source is used; again, nothing need be done except change the alpha to 255.

If the destination alpha is 0, then we only use the source, but it must be un-precomputed.
If the destination alpha is 255, then the destination need only be adjusted by the source alpha. In other words, we multiply the dest RGB values by the source alpha and divide by 255, then add the source RGB values to get the new destination pixel. The final alpha will be 255.

Failing all of the above, we can un-precompute the source and then use the algorithm above for when both the source and destination are not precomputed. I will only show how to undo a precomputation.

```
if S[A]==255 then
    S[A] = 0
    S[R] = 0
    S[G] = 0
    S[B] = 0
else
    IA = 255 - S[A]
    S[B] = S[B]*255/IA
    S[G] = S[G]*255/IA
    S[R] = S[R]*255/IA
    S[A] = IA    // restore the normal (non-inverted) alpha
end if
```

Note that in all the above when I say 'alpha', I mean the value stored in the image as is. So for normal images, it's the regular alpha value. But if the image is precomputed, I'm talking about the inverse alpha that's stored. In both cases, I'm always talking about the stored value when I use the term 'alpha'.

Here are all the possible combinations:

Static is an image where the alpha channel is unused.
Normal is an image with alpha channel.
Precomputed is an image where the alpha channel is pre-applied to RGB and the alpha channel is 255-alpha.

1. Normal on Static
2. Precomputed on Static
3. Normal on Normal
4. Normal on Precomputed
5. Precomputed on Normal
6. Precomputed on Precomputed

The fastest algorithms are 2 and 6. These two algorithms should be enough for 99% of your needs. Other useful algorithms are those used to undo precomputed images into normal images as well as the precomputation algorithm itself.

It may seem like this kind of library could easily be written, but the truth is that I can find none, and it's 2006. Some libraries handle transparent images rather well, but they always lack the flexibility needed to achieve that extra speed. There's also the issue of configuration: do you use normal, static or precomputed images? How do you specify which ones you want? Then there's the issue that different hardware has different instructions that can be used. Auto-detection isn't so hard, but having multiple functions for the same thing is error prone. You end up having to maintain multiple versions of the same code.

MMX was and is a great idea. Allowing the processing of multiple data items at once is fantastic. Now with SSE2, you can use 128-bit vector operations. Not only that, but 64-bit mode also doubles the number of registers all around: you have twice the number of general purpose registers and twice the number of MMX and SSE registers. However, all this power comes at a cost, and that cost is a maintenance nightmare. Not only that, but the more versions you have, the slower your final implementation if you have to check which version to use every time you call a function. You can use a jump table, and that's better than the alternative. Still, it adds extra complexity.

Just to tie all this in with Project V, what I want to do is make it so your application can continue to compile after shipping. For example, let's say you have a library with functions for all the different kinds of alpha blending as well as different functions for each kind of processor currently available. When a user launches your software, it should check what hardware is available and link directly into your application the appropriate function calls where needed. No jump tables. No conditionals. What you get is an executable at runtime specifically tailored for your machine.

That's not the only power here. At runtime, your software can use different algorithms depending on the circumstances. For example, there are sort routines that work well with a small set of data while other sort routines work well with large data sets. So depending on the incoming data, your software can automatically use the correct one. It can even start with one algorithm and as more data comes in, it can switch to a better routine.

If technologies like MMX, SSE, GPU and other custom hardware are to become accessible to the public, we need better tools to handle them for us. Currently, there are not many compilers that can take advantage of any of these technologies. Everything is still hand coded. The future will provide a way to create an algorithm once for a specific configuration and never have to rewrite it again. After a while, everything and anything that can be created will have been done. At least stuff like alpha blending where it won't matter what software development tool you use. That's a goal we should all be excited about.

EDIT: I've retested the algorithms when compared to the floating point version of (a * p)/255 rounded to the nearest value.

Here are the integer results from before.
1. 100% DUH! (alpha * pixel)/255
2. 74.5% (alpha * pixel)/256
3. 51.5% t = (alpha * pixel) + 128; ((t>>8)+t)>>8
4. 97.7% t = (alpha * pixel); ((t>>8)+t)>>8
5. 99.4% t = (alpha * pixel); (((t+128)>>8)+t)>>8
6. 100% t = (alpha * pixel); (((t+255)>>8)+t)>>8
7. 82.3% (((alpha>127)?alpha+1:alpha) * pixel)>>8
8. 86.6% (((alpha>127)?alpha+1:alpha) * ((pixel>127)?pixel+1:pixel))>>8

And here are the new results compared to the floating point algorithm.
1. 51.6% (alpha * pixel)/255
2. 28.2% (alpha * pixel)/256
3. 100% t = (alpha * pixel) + 128; ((t>>8)+t)>>8
4. 49.3% t = (alpha * pixel); ((t>>8)+t)>>8
5. 51.0% t = (alpha * pixel); (((t+128)>>8)+t)>>8
6. 51.6% t = (alpha * pixel); (((t+255)>>8)+t)>>8
7. 35.1% (((alpha>127)?alpha+1:alpha) * pixel)>>8
8. 53.3% (((alpha>127)?alpha+1:alpha) * ((pixel>127)?pixel+1:pixel))>>8

Ok, so that's why those sites state that algorithm 3 gives exact results: it's exact compared to floating point, which is the more accurate reference for sure. Everything else looks fairly bad in comparison. Still, for all those other algorithms, no individual output is more than one intensity level away from the expected value. The algorithms to watch out for are the ones that do not give correct results at the extremes of (0,0) and (255,255). These are no good because you can never get 100% transparency or 100% opacity. The algorithms that do not give proper results at the extremes are 2, 4, 5 and 8.

I've found that this algorithm is over 82.6% accurate: ((((alpha>127)?alpha+1:alpha) * pixel)+128)>>8

EDIT2: I've recently found that ((alpha+1)*pixel)>>8 is over 92% accurate compared with the integer version of division by 255. It uses very few instructions. And you can get the multiply and divide all in one shot by using the multiply-and-return-high-word instruction if you leave the alpha in the high byte. This is what I'm going to use, and I'll report back how it goes.

Unregistered user Friday, February 16, 2007 9:04:35 PM

Martin writes: WOW, pretty cool, I just need this and it looks really well. A time ago I was thinking about this, and figured the first method (* pixel = (alpha * image + (255 - alpha) * background)/255 *) by myself. It works fine, but I wasn't thinking very deep and therefore I didn't realise your great improvements - I like especially the one with CMP AX,0x80; SBB AX,0xFF (I use BC++ 3.1 under DOS). Well, precomputing values would be great too, but I think in this case it's not useful (at least for me), because you have to precompute values with known alpha values. And that's the problem - you can't just move your semitransparent image through the background, because every pixel of it has a different opacity. Maybe I missed something :P. However, I appreciate this article very much, I don't have to use DIV instruction anymore :))). Thank you, Yours sincerely Martin K.

Vorlath Saturday, February 17, 2007 12:21:56 AM

Glad you found it useful, but you should really look into precomputing your images. I guarantee you can find a use for it.

The equation you provide is for a semi-transparent image pasted onto a static background (by static I mean that it's 100% opaque). If you use a semi-transparent source image more than once, you should consider precomputing them because you're calculating a lot of the same things over and over for no good reason.

Each R,G,B,A value would be converted to this:

newR = (oldR * oldA) / 255
newG = (oldG * oldA) / 255
newB = (oldB * oldA) / 255
newA = 255 - oldA

You do this for every pixel in your image. You replace [oldA,oldR,oldG,oldB] with [newA,newR,newG,newB] for every pixel in your source image.

You can use divide because this would be in your setup. You can also use something like (oldR * (oldA+1)) >> 8, or the CMP trick.

Now when you go to paste, instead of using
pixel = (alpha * image + (255 - alpha) * background)/255

You use:
pixel = image + (alpha * background) / 255

There's one multiplication less. So it's twice as fast. Again, you can use the CMP trick on alpha, multiply and shift by 8. Or you can just add 1 to alpha before multiplying and then shifting by 8. Either way will work ok.

The reason precomputation works is because the first part of the original equation is always the same. Here's the original equation again.

pixel = (alpha * image + (255 - alpha) * background)/255

"alpha * image" will give the same result for a source image every time you use it. So why not just store this value directly into the image? That's what precomputation does.

The only other place we need the alpha value is in "255 - alpha". So why not store 255-alpha directly as our new alpha value? This way, the equation technically stays the same, but all the computations that can be done ahead of time are. And the cool thing is that you can paste as many translucent images one on top of the other as you want with this technique, just as long as they're pasted in the proper order.

I should add that if you change the alpha values in your source image at runtime, then precomputing the image makes no sense. However, this should be rare anyhow.

Unregistered user Sunday, February 18, 2007 4:33:21 PM

Martin writes: Oh thank you, now it's clear to me. You're right, it doesn't make a sense to calculate it over and over again. It could be pushed even further - when loading WinXP icons, it contains R,G,B,A bytes/pixel. This way the image can be directly saved (converted) to file in the "newR,newG,newB,newA" format and the alpha calculation would be much quicker. Of course, we can't use different alpha then (easily).

Vorlath Monday, February 19, 2007 5:14:36 AM

Exactly! However, I only precalculate after loading the file. I don't actually save the output in that format, because you may want to later change the graphics or alter them for a future release. Any external files I try to keep in a common file format. But internally, just after loading, I precompute what I can for faster drawing afterwards. It's a tradeoff between interoperability and efficiency.

Unregistered user Thursday, October 30, 2008 6:28:27 PM

Anjelous writes: i am glad u understand all that its greek to me :doh:
