Software Development

Correcting The Future

Assembly: Interlaced Zig Zag x264 (Updated Dec 8,2009)

Via reddit, we find this article about implementing zig zag in assembly. See zig zag figure in section 1.6.6 of this article for more info. The author implemented the interlaced version of zigzag (16bit word elements) as it was not already available. This version is for more recent processors as can be seen by this quote.

Furthermore, as should be clear by the “SSSE3″, there’s still no version of this assembly function for any AMD CPU or any pre-Core 2 Intel chip.



I like MMX and SSE programming, so I thought I'd give it a shot. I wanted to implement the standard zigzag algorithm, but someone had already coded it up and it didn't look like I could do anything much faster. So I decided to give it a shot with the interlaced version since it's not available. Took about 30 minutes to code up and another half hour to 45 min to test. First test to make sure the output was correct. Only had two values that were off. This was caused by an incorrect shuffle on the high part when it should have been the low. Easy one letter fix. And then I did some timing tests. Since this is written for PIII machines, I tested it on an actual PIII 600Mhz machine. I used a single page of memory (4K) and repeated the algorithm 625000 times on the entire page. The reason I use the same memory is because I want to test the algorithm itself and not have memory latency skew the results.

Generic version: 6.1 seconds
Extended MMX version: 4.7 seconds

Not bad for a first try with it being 23% faster. Not that great either. For grayscale frames of 640x360, you would save 1.4 seconds every 11000 frames. But it's a saving I will gladly take on older machines. With a framesize of 1024x512, you save 1.4 second every 5000 frames. 5000 frames at 24fps is about 3 minutes 24 seconds of video. Like I said the savings aren't that great. Over a 90 minute video, you save 36.4 seconds. So it all adds up. There may also be a way to speed it up by removing the push and pop instructions I use to save and restore esi and edi. And I didn't really look at instruction scheduling. I did a LITTLE tweaking and it saved a few tenths of a second. Some more tweaking may yield better results. Note that the savings per frame are THEORETICAL savings based on the math. They are not actual savings as I haven't tested that (I only tested the speed difference in the two implementations).

As to the code, it's not that bad. Here is the order of the output required.

 0  1  2  8  9  3  4 10
16 11  5  6  7 12 17 24
18 13 14 15 19 25 32 26
20 21 22 23 27 33 40 34
28 29 30 31 35 41 48 42
36 37 38 39 43 49 50 44
45 46 47 51 56 57 52 53
54 55 58 59 60 61 62 63


There's one thing I noticed right away from what the original author mentioned. Keeping in mind that MMX only works on four 16bit values at a time, I noticed that there were four sequential sequences with groups of four entries.

[20 21 22 23]
[28 29 30 31]
[36 37 38 39]
[60 61 62 63]

That's two lines worth of the entire grid. We're one quarter done right there. Though this gets us ahead, it also negates any speed advantage since these are just copies. Perhaps we get a slight speed advantage by copying four values at a time instead of two, but from experience this doesn't amount to much.

Next thing I noticed were almost sequential sequences.

[0 1 2 8]
[18 13 14 15]
[45 46 47 51]

For these, I used movd or pextrw instructions to grab element 8, 18 and 51. And then I used pinsrw to insert those values at the correct locations. These are high latency instructions, but they're extremely simple to use meaning that any alternative method would use many more instructions to accomplish the same task. I haven't timed those alternative though.

The next thing I noticed were the second and third last groups of four.

[56 57 52 53]
[54 55 58 59]

See how the first two and last two elements are sequential? Also note how all eight entries are sequential if you were to sort them? This is a perfect situation for unpack instructions.

movq mm0, [esi+2*56] ;  59 58 57 56
movq mm1, [esi+2*52] ;  55 54 53 52
movq mm2, mm0

punpckldq mm2, mm1  ; 53 52 57 56
punpckhdq mm1, mm0  ; 59 58 55 54

movq [edi+2*52], mm2
movq [edi+2*56], mm1


Only three instructions to produce the correct result (plus the load and stores).

Now we have 9 groups of four completed. That's 4*9 = 36 values done which is 56% of our goal.

The rest is not so easy though. There were a few situations that did lend themselves to SOME optimizations.

[19 25 32 26]
[27 33 40 34]
[35 41 48 42]
[43 49 50 44]

What do these four groups have in common? They all have two entries that are sequential. Unfortunately, they are not next to each other. The last group is even stranger in that the sequential entries are at the ends and in the center.

So what I did was process all entries in a earlier groups until only entry 19 remained from [16 17 18 19]. I then shifted it over to the lowest vector element to give [19 XX XX XX] (XX=0, but not element 0).

See how I need element 32 now in [19 25 32 26]? Well, the group where 32 is located is this [32 33 34 35]. It happens to be the lowest entry as well, so an unpack instruction will yield [19 32 XX 33]. For elements 25 and 26, they are at the center of their group. [24 25 26 27]. So I have to copy it and shift it over to give [25 26 27 XX]. And one last unpack on the low words...

[19 32 XX 33]
[25 26 27 XX]
gives
[19 25 32 26].

I used the same technique for the other three groups. Note how the group [24 25 26 27] only has 27 that has not been output yet? So I can use the same technique again that I used with element 19 where I will now shift 27 from the group [24 25 26 27] to the lowest vector element. So all four of the groups that we output actually work very well together. The last group required an extra shuffle to swap the last two elements.

I was quite happy with this, especially how each section fit in with the next.

We are now 81% complete. Only three groups remain.

[7 12 17 24]
[9 3 4 10]
[16 11 5 6]

These all had to be created manually. For group [9 3 4 10], the original elements are in three different groups. So I extracted element 3 with a shuffle, unpacked it with element 4 and then unpacked those two elements with elements 9 and 10 (after these were shifted over). I then used a shuffle instruction to put them in their correct locations. It's messy. Real messy. Maybe there's an easier way, but I could not find it. The other two groups were done in a similar fashion, but ended up being more straightforward.

I'll now see if they want this code for their x264 codec.

Here is the code for reference in NASM format. I am placing this code in the public domain. Do with it as you wish.

(update below code)

        ; void _cdecl ASM_zigzag(short *dest, short *input);
        GLOBAL ASM_zigzag
%define dest     [esp+12]
%define input    [esp+16]
  align 4
ASM_zigzag:
	push esi
	push edi

  mov esi, input
  mov edi, dest

movq mm0, [esi+2*0]  ; 03 02 01 00  *
movq mm1, [esi+2*4]  ; 07 06 05 04
movq mm2, [esi+2*8]  ; 11 10 09 08
pshufw mm3, mm0, 011111111b ; 03 03 03 03
movd eax, mm2 ; 09 08
pshufw mm2, mm2, 000111001b ; 08 11 10 09
punpcklwd mm3, mm1 ; 05 03 04 03

pinsrw mm0, eax, 3 ; 08 02 01 00
movq mm4, mm2
punpcklwd mm2, mm3 ; 04 10 03 09
pshufw mm2, mm2, 010110100b ; 10 04 03 09
movq [edi+2*0], mm0  ; output 08 02 01 00
movq [edi+2*4], mm2  ; output 10 04 03 09

; want 
; 06 05 11 16

movq mm3, [esi+2*12] ; 15 14 13 12
movq mm5, [esi+2*16] ; 19 18 17 16

; mm1= 07 06 05 04
; mm4= 08 11 10 09

punpckldq mm6, mm5  ; 17 16 XX XX
psrlq mm1, 16  ; XX 07 06 05

punpckhwd mm6, mm4  ; 08 17 11 16
punpckldq mm6, mm1  ; 06 05 11 16

movq [edi+2*8], mm6  ; 06 05 11 16

; want
; 24 17 12 07

; mm1= XX 07 06 05
; mm3= 15 14 13 12
; mm5= 19 18 17 16

psrlq mm1, 16 ; XX XX 07 06
punpcklwd mm1, mm5 ; 17 07 16 06

movq mm0, [esi+2*20] ; 23 22 21 20
movq mm2, [esi+2*24] ; 27 26 25 24

movq mm6, mm3
punpckhdq mm1, mm1 ; 17 07 17 07
punpcklwd mm6, mm2 ; 25 13 24 12
pextrw eax, mm5, 2
movq [edi+2*24], mm0 ; 23 22 21 20
punpcklwd mm1, mm6 ; 24 17 12 07

movq [edi+2*12], mm1

; mm3= 15 14 13 12
; mm5= 19 18 17 16
; mm2= 27 26 25 24

; want
; 15 14 13 18
pinsrw mm3, eax, 0 ; 15 14 13 18
movq [edi+2*16], mm3 ; 15 14 13 18

; mm5= 19 18 17 16
; mm2= 27 26 25 24

; want 
; 26 32 25 19

movq mm7, [esi+2*28]

movq mm0, [esi+2*32] ; 35 34 33 32

psrlq mm5, 48 ; XX XX XX 19
pshufw mm1, mm2, 011111001b ; 27 27 26 25
punpcklwd mm5, mm0 ; 33 XX 32 19
psrlq mm2, 48 ; XX XX XX 27
punpcklwd mm5, mm1 ; 26 32 25 19

movq [edi+2*32], mm7
movq [edi+2*20], mm5 ; 26 32 25 19

; mm0= 35 34 33 32
; mm2= XX XX XX 27

; want
; 34 40 33 27

movq mm7, [esi+2*36]

movq mm1, [esi+2*40] ; 43 42 41 40

pshufw mm3, mm0, 011111001b ; 35 35 34 33
punpcklwd mm2, mm1 ; 41 XX 40 27
movq [edi+2*40], mm7
punpcklwd mm2, mm3 ; 34 40 33 27

movq [edi+2*28], mm2

; mm0= 35 34 33 32
; mm1= 43 42 41 40

; want
; 42 48 41 35

movq mm7, [esi+2*44]  ; 47 46 45 44
movq mm2, [esi+2*48]  ; 51 50 49 48

psrlq mm0, 48 ; XX XX XX 35
punpcklwd mm0, mm2 ; 49 XX 48 35
pshufw mm3, mm1, 011111001b ; 43 43 42 41

punpcklwd mm0, mm3 ; 42 48 41 35
movq [edi+2*36], mm0

; mm1= 43 42 41 40
; mm7= 47 46 45 44
; mm2= 51 50 49 48

; want
; 44 50 49 43
; 51 47 46 45  

pextrw eax, mm2, 3 ; 51
psrlq mm1, 48 ; XX XX XX 43
punpcklwd mm1, mm7 ; 45 XX 44 43
psrlq mm2, 16 ; XX 51 50 49
punpcklwd mm1, mm2 ; 50 44 49 43
pshufw mm1, mm1, 010110100b ; 44 50 49 43

movq [edi+2*44], mm1

; mm7= 47 46 45 44
psrlq mm7, 16 ; XX 47 46 45
pinsrw mm7, eax, 3 ; 51 47 46 45
movq [edi+2*48], mm7

; 52-59
movq mm0, [esi+2*56] ;  59 58 57 56
movq mm1, [esi+2*52] ;  55 54 53 52
movq mm2, mm0

movq mm7, [esi+2*60]

punpckldq mm2, mm1  ; 53 52 57 56
punpckhdq mm1, mm0  ; 59 58 55 54

movq [edi+2*52], mm2
movq [edi+2*56], mm1

; Move 60-63
movq [edi+2*60], mm7

.end:
  pop edi
  pop esi
  ret



UPDATE Dec 9, 2009:

I was contacted by Dark Shikari, the author of the original blog entry that talks about the zigzag algorithm. Here are the timing results that he produced on his i7:

zigzag_scan_8x8_field_c: 532
zigzag_scan_8x8_field_mmx: 209 (my version)
zigzag_scan_8x8_field_ssse3: 210

I was shocked to see this for three reasons. First, having larger registers usually yields faster results by the simple fact that you can manipulate more data in registers (nothing is faster than registers). Second, my code is much longer. Third, I only got a 23% speed increase on an actual P3. I did not really expect the speedup to be more pronounced (61%, aka more than twice as fast) on more recent processors. Perhaps naively, I thought the difference between C code and custom assembly would be less on more recent processors. It seems the opposite is true in certain cases.

Dark Shikari has said that since my code is just as fast, that mine will be used instead since it works on all processors since P3. I'm very happy that I could be of help for the MMX version (though I wasn't expecting to replace the original SSSE3 version).

On some projects, I've submitted a lot of code and it never went anywhere. So in this case, I'm elated as well as eager to help out some more.

Intel 48 Core Processor (Update: Larrabee canceled)The Scientific Method (Updated: New 2008 map)

Comments

Unregistered user Tuesday, December 8, 2009 10:53:59 PM

gary writes: Interesting read. Well done! My assembly sandbox has been limited to the MIPS instruction set. When I was a kid I read "Programming the 6809" (Zaks and Labiak), and discovered that I loved processor architecture and logic.

Vorlath Tuesday, December 8, 2009 11:09:15 PM

MIPS is actually a nice architecture. Three address opcodes! Lots of registers. What's not to like? Unfortunately, I never got to play with it. I always found it weird that the x86 took so long to add new registers and then it was AMD that did that, not Intel. Heck, even the M68K series had 16 registers (though they went overboard with the addressing modes).

Vorlath Thursday, December 10, 2009 12:07:04 AM

A little update. I did profile it and it's not worth changing my code with instruction re-ordering. The timings are spread evenly throughout except for a few spots, but changing it has variable effects. Sometimes, it's 1.6% faster and sometimes it's 5% slower from the first version of the MMX code. I'm not sure what causes this fluctuation. It's never in between. However, 1.6% faster isn't worth it if there's any chance of it running slower. So I'm leaving it as is.

Unregistered user Wednesday, February 17, 2010 3:02:10 AM

babel writes: It's nice piece of artistry, whatever make you think otherwise dont ;) Anyhow it's all 32-bit piece of code and as that much more speedup could be derived from use of additional 8 amd-x64 registers than to try to optimize it for pure MMX(+) beside that MMX become obsolete in true 64bit world and they decided to abandon it. I dont know if this is also reality of 3DNow!/Pro (since AXP, usually abb as 3dnow+) And as far cpu goes Core/Core2/Corei7 is based on well established P6 (PII/PIII) research only they make ooo arbiter (reservation station) fattier so that can do more work in parallel, and thats nothing big since PII was 350nm and Core is 65nm just 128 times smaller process node and only 33% wider 24 entry RS vs. 32 entry RS on C2 but ROB grew up 2.5x and become capable to handle multiple load store thru now separated memory ROB and depending ports. And the uttermost thing memory latency drop significantly especially for this small chunk of code with 4/6MB highly associative L2 cache that now works @3000MHz instead 500MHz of first p3 Anyhow good job and respect for it. Btw how much time was neede for al that speed tests?

Vorlath Wednesday, February 17, 2010 3:20:24 PM

Speed tests don't take long to set up. A few minutes to write and compile. Running the tests can be part of the coding process. In my case, I only tested at the end and the runs were for varying amounts of memory. Like I said in the article, I took about 45 min of testing.

BTW, I don't see the additional 8 registers as being that big a deal. Yes, you can get a boost. But depending on the code, there are ways to use a great deal more than 8 registers even back on the PIII. What is the bigger boost is the wider registers.

Write a comment

New comments have been disabled for this post.

June 2012
S M T W T F S
May 2012July 2012
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30