Assembly: Interlaced Zig Zag x264 (Updated Dec 8,2009)
Monday, December 7, 2009 7:21:02 PM
Furthermore, as should be clear by the “SSSE3″, there’s still no version of this assembly function for any AMD CPU or any pre-Core 2 Intel chip.
I like MMX and SSE programming, so I thought I'd give it a shot. I wanted to implement the standard zigzag algorithm, but someone had already coded it up and it didn't look like I could do anything much faster. So I decided to give it a shot with the interlaced version since it's not available. Took about 30 minutes to code up and another half hour to 45 min to test. First test to make sure the output was correct. Only had two values that were off. This was caused by an incorrect shuffle on the high part when it should have been the low. Easy one letter fix. And then I did some timing tests. Since this is written for PIII machines, I tested it on an actual PIII 600Mhz machine. I used a single page of memory (4K) and repeated the algorithm 625000 times on the entire page. The reason I use the same memory is because I want to test the algorithm itself and not have memory latency skew the results.
Generic version: 6.1 seconds
Extended MMX version: 4.7 seconds
Not bad for a first try with it being 23% faster. Not that great either. For grayscale frames of 640x360, you would save 1.4 seconds every 11000 frames. But it's a saving I will gladly take on older machines. With a framesize of 1024x512, you save 1.4 second every 5000 frames. 5000 frames at 24fps is about 3 minutes 24 seconds of video. Like I said the savings aren't that great. Over a 90 minute video, you save 36.4 seconds. So it all adds up. There may also be a way to speed it up by removing the push and pop instructions I use to save and restore esi and edi. And I didn't really look at instruction scheduling. I did a LITTLE tweaking and it saved a few tenths of a second. Some more tweaking may yield better results. Note that the savings per frame are THEORETICAL savings based on the math. They are not actual savings as I haven't tested that (I only tested the speed difference in the two implementations).
As to the code, it's not that bad. Here is the order of the output required.
0 1 2 8 9 3 4 10 16 11 5 6 7 12 17 24 18 13 14 15 19 25 32 26 20 21 22 23 27 33 40 34 28 29 30 31 35 41 48 42 36 37 38 39 43 49 50 44 45 46 47 51 56 57 52 53 54 55 58 59 60 61 62 63
There's one thing I noticed right away from what the original author mentioned. Keeping in mind that MMX only works on four 16bit values at a time, I noticed that there were four sequential sequences with groups of four entries.
[20 21 22 23]
[28 29 30 31]
[36 37 38 39]
[60 61 62 63]
That's two lines worth of the entire grid. We're one quarter done right there. Though this gets us ahead, it also negates any speed advantage since these are just copies. Perhaps we get a slight speed advantage by copying four values at a time instead of two, but from experience this doesn't amount to much.
Next thing I noticed were almost sequential sequences.
[0 1 2 8]
[18 13 14 15]
[45 46 47 51]
For these, I used movd or pextrw instructions to grab element 8, 18 and 51. And then I used pinsrw to insert those values at the correct locations. These are high latency instructions, but they're extremely simple to use meaning that any alternative method would use many more instructions to accomplish the same task. I haven't timed those alternative though.
The next thing I noticed were the second and third last groups of four.
[56 57 52 53]
[54 55 58 59]
See how the first two and last two elements are sequential? Also note how all eight entries are sequential if you were to sort them? This is a perfect situation for unpack instructions.
movq mm0, [esi+2*56] ; 59 58 57 56 movq mm1, [esi+2*52] ; 55 54 53 52 movq mm2, mm0 punpckldq mm2, mm1 ; 53 52 57 56 punpckhdq mm1, mm0 ; 59 58 55 54 movq [edi+2*52], mm2 movq [edi+2*56], mm1
Only three instructions to produce the correct result (plus the load and stores).
Now we have 9 groups of four completed. That's 4*9 = 36 values done which is 56% of our goal.
The rest is not so easy though. There were a few situations that did lend themselves to SOME optimizations.
[19 25 32 26]
[27 33 40 34]
[35 41 48 42]
[43 49 50 44]
What do these four groups have in common? They all have two entries that are sequential. Unfortunately, they are not next to each other. The last group is even stranger in that the sequential entries are at the ends and in the center.
So what I did was process all entries in a earlier groups until only entry 19 remained from [16 17 18 19]. I then shifted it over to the lowest vector element to give [19 XX XX XX] (XX=0, but not element 0).
See how I need element 32 now in [19 25 32 26]? Well, the group where 32 is located is this [32 33 34 35]. It happens to be the lowest entry as well, so an unpack instruction will yield [19 32 XX 33]. For elements 25 and 26, they are at the center of their group. [24 25 26 27]. So I have to copy it and shift it over to give [25 26 27 XX]. And one last unpack on the low words...
[19 32 XX 33]
[25 26 27 XX]
gives
[19 25 32 26].
I used the same technique for the other three groups. Note how the group [24 25 26 27] only has 27 that has not been output yet? So I can use the same technique again that I used with element 19 where I will now shift 27 from the group [24 25 26 27] to the lowest vector element. So all four of the groups that we output actually work very well together. The last group required an extra shuffle to swap the last two elements.
I was quite happy with this, especially how each section fit in with the next.
We are now 81% complete. Only three groups remain.
[7 12 17 24]
[9 3 4 10]
[16 11 5 6]
These all had to be created manually. For group [9 3 4 10], the original elements are in three different groups. So I extracted element 3 with a shuffle, unpacked it with element 4 and then unpacked those two elements with elements 9 and 10 (after these were shifted over). I then used a shuffle instruction to put them in their correct locations. It's messy. Real messy. Maybe there's an easier way, but I could not find it. The other two groups were done in a similar fashion, but ended up being more straightforward.
I'll now see if they want this code for their x264 codec.
Here is the code for reference in NASM format. I am placing this code in the public domain. Do with it as you wish.
(update below code)
; void _cdecl ASM_zigzag(short *dest, short *input);
GLOBAL ASM_zigzag
%define dest [esp+12]
%define input [esp+16]
align 4
ASM_zigzag:
push esi
push edi
mov esi, input
mov edi, dest
movq mm0, [esi+2*0] ; 03 02 01 00 *
movq mm1, [esi+2*4] ; 07 06 05 04
movq mm2, [esi+2*8] ; 11 10 09 08
pshufw mm3, mm0, 011111111b ; 03 03 03 03
movd eax, mm2 ; 09 08
pshufw mm2, mm2, 000111001b ; 08 11 10 09
punpcklwd mm3, mm1 ; 05 03 04 03
pinsrw mm0, eax, 3 ; 08 02 01 00
movq mm4, mm2
punpcklwd mm2, mm3 ; 04 10 03 09
pshufw mm2, mm2, 010110100b ; 10 04 03 09
movq [edi+2*0], mm0 ; output 08 02 01 00
movq [edi+2*4], mm2 ; output 10 04 03 09
; want
; 06 05 11 16
movq mm3, [esi+2*12] ; 15 14 13 12
movq mm5, [esi+2*16] ; 19 18 17 16
; mm1= 07 06 05 04
; mm4= 08 11 10 09
punpckldq mm6, mm5 ; 17 16 XX XX
psrlq mm1, 16 ; XX 07 06 05
punpckhwd mm6, mm4 ; 08 17 11 16
punpckldq mm6, mm1 ; 06 05 11 16
movq [edi+2*8], mm6 ; 06 05 11 16
; want
; 24 17 12 07
; mm1= XX 07 06 05
; mm3= 15 14 13 12
; mm5= 19 18 17 16
psrlq mm1, 16 ; XX XX 07 06
punpcklwd mm1, mm5 ; 17 07 16 06
movq mm0, [esi+2*20] ; 23 22 21 20
movq mm2, [esi+2*24] ; 27 26 25 24
movq mm6, mm3
punpckhdq mm1, mm1 ; 17 07 17 07
punpcklwd mm6, mm2 ; 25 13 24 12
pextrw eax, mm5, 2
movq [edi+2*24], mm0 ; 23 22 21 20
punpcklwd mm1, mm6 ; 24 17 12 07
movq [edi+2*12], mm1
; mm3= 15 14 13 12
; mm5= 19 18 17 16
; mm2= 27 26 25 24
; want
; 15 14 13 18
pinsrw mm3, eax, 0 ; 15 14 13 18
movq [edi+2*16], mm3 ; 15 14 13 18
; mm5= 19 18 17 16
; mm2= 27 26 25 24
; want
; 26 32 25 19
movq mm7, [esi+2*28]
movq mm0, [esi+2*32] ; 35 34 33 32
psrlq mm5, 48 ; XX XX XX 19
pshufw mm1, mm2, 011111001b ; 27 27 26 25
punpcklwd mm5, mm0 ; 33 XX 32 19
psrlq mm2, 48 ; XX XX XX 27
punpcklwd mm5, mm1 ; 26 32 25 19
movq [edi+2*32], mm7
movq [edi+2*20], mm5 ; 26 32 25 19
; mm0= 35 34 33 32
; mm2= XX XX XX 27
; want
; 34 40 33 27
movq mm7, [esi+2*36]
movq mm1, [esi+2*40] ; 43 42 41 40
pshufw mm3, mm0, 011111001b ; 35 35 34 33
punpcklwd mm2, mm1 ; 41 XX 40 27
movq [edi+2*40], mm7
punpcklwd mm2, mm3 ; 34 40 33 27
movq [edi+2*28], mm2
; mm0= 35 34 33 32
; mm1= 43 42 41 40
; want
; 42 48 41 35
movq mm7, [esi+2*44] ; 47 46 45 44
movq mm2, [esi+2*48] ; 51 50 49 48
psrlq mm0, 48 ; XX XX XX 35
punpcklwd mm0, mm2 ; 49 XX 48 35
pshufw mm3, mm1, 011111001b ; 43 43 42 41
punpcklwd mm0, mm3 ; 42 48 41 35
movq [edi+2*36], mm0
; mm1= 43 42 41 40
; mm7= 47 46 45 44
; mm2= 51 50 49 48
; want
; 44 50 49 43
; 51 47 46 45
pextrw eax, mm2, 3 ; 51
psrlq mm1, 48 ; XX XX XX 43
punpcklwd mm1, mm7 ; 45 XX 44 43
psrlq mm2, 16 ; XX 51 50 49
punpcklwd mm1, mm2 ; 50 44 49 43
pshufw mm1, mm1, 010110100b ; 44 50 49 43
movq [edi+2*44], mm1
; mm7= 47 46 45 44
psrlq mm7, 16 ; XX 47 46 45
pinsrw mm7, eax, 3 ; 51 47 46 45
movq [edi+2*48], mm7
; 52-59
movq mm0, [esi+2*56] ; 59 58 57 56
movq mm1, [esi+2*52] ; 55 54 53 52
movq mm2, mm0
movq mm7, [esi+2*60]
punpckldq mm2, mm1 ; 53 52 57 56
punpckhdq mm1, mm0 ; 59 58 55 54
movq [edi+2*52], mm2
movq [edi+2*56], mm1
; Move 60-63
movq [edi+2*60], mm7
.end:
pop edi
pop esi
ret
UPDATE Dec 9, 2009:
I was contacted by Dark Shikari, the author of the original blog entry that talks about the zigzag algorithm. Here are the timing results that he produced on his i7:
zigzag_scan_8x8_field_c: 532
zigzag_scan_8x8_field_mmx: 209 (my version)
zigzag_scan_8x8_field_ssse3: 210
I was shocked to see this for three reasons. First, having larger registers usually yields faster results by the simple fact that you can manipulate more data in registers (nothing is faster than registers). Second, my code is much longer. Third, I only got a 23% speed increase on an actual P3. I did not really expect the speedup to be more pronounced (61%, aka more than twice as fast) on more recent processors. Perhaps naively, I thought the difference between C code and custom assembly would be less on more recent processors. It seems the opposite is true in certain cases.
Dark Shikari has said that since my code is just as fast, that mine will be used instead since it works on all processors since P3. I'm very happy that I could be of help for the MMX version (though I wasn't expecting to replace the original SSSE3 version).
On some projects, I've submitted a lot of code and it never went anywhere. So in this case, I'm elated as well as eager to help out some more.


Unregistered user # Tuesday, December 8, 2009 10:53:59 PM
Vorlath # Tuesday, December 8, 2009 11:09:15 PM
Vorlath # Thursday, December 10, 2009 12:07:04 AM
Unregistered user # Wednesday, February 17, 2010 3:02:10 AM
Vorlath # Wednesday, February 17, 2010 3:20:24 PM
BTW, I don't see the additional 8 registers as being that big a deal. Yes, you can get a boost. But depending on the code, there are ways to use a great deal more than 8 registers even back on the PIII. What is the bigger boost is the wider registers.