
Rants & Ramblings

Old Mac vs. alpha blending


A while ago Someone™ sent me a Performa 6360 - it's got a 160MHz 603ev CPU, not exactly high end even in 1996. Installation was quite painful since the firmware supports neither the onboard video nor booting anything other than MacOS from CD-ROM. Since I wanted to upgrade the hard disk anyway I just prepared the 'new' disk in another Mac. And since I wanted to do some voodoofb hackery I put a Voodoo3, which does have the right firmware goo to serve as OpenFirmware console, in the single PCI slot. This particular Performa came with a standard 10MBit/s Apple Ethernet board, no modem, no TV module and the standard 256kB cache module. I would have tried the G3 accelerator from my PowerMac 4400 since it uses the same cache slot, but the graphics card is in the way. Either way, I found two suitable 64MB modules, so now RAM is maxed out at a whopping 136MB.
Now for the other reason why I've been playing with this machine. It's quite slow and therefore a nice test bed for CPU-intensive tasks like alpha blending. The Voodoo3's 2D engine doesn't support alpha blending and the 3D engine is 16bit only, even though the rest of the card will happily do 24bit colour. So the first step was to add support for anti-aliased fonts to voodoofb, for now only in 8 bit. As usual, rendering is done in software, but the actual drawing of the characters uses host blits so we can use the pipeline instead of having to wait for the engine every time we want to draw something. This is already pretty fast. In order to make it faster I added a simple caching scheme which stores commonly used characters ( as in, everything that uses the default attribute ) in video memory when they're drawn the first time. If they're needed again we simply blit them in place from off-screen memory. That made it even faster.
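To illustrate the software side of this, here is a minimal sketch of blending one anti-aliased glyph pixel into an 8 bit pixel value. It assumes the glyph provides an 8 bit alpha value per pixel and that the 8 bit palette is laid out as R3G3B2. Both are assumptions made for the example, not something taken from voodoofb, and the function names are hypothetical.

```c
#include <stdint.h>

/* Blend one 8-bit channel: alpha 0 yields bg, alpha 255 yields fg. */
static inline uint8_t
blend_channel(uint8_t fg, uint8_t bg, uint8_t alpha)
{
	return (uint8_t)((fg * alpha + bg * (255 - alpha) + 127) / 255);
}

/* Pack 8-bit R, G, B into an index of an (assumed) R3G3B2 palette. */
static inline uint8_t
pack_r3g3b2(uint8_t r, uint8_t g, uint8_t b)
{
	return (uint8_t)((r & 0xe0) | ((g & 0xe0) >> 3) | (b >> 6));
}

/* Blend fg over bg (both 0x00RRGGBB) using the glyph's alpha value. */
static uint8_t
blend_pixel_8bit(uint32_t fg, uint32_t bg, uint8_t alpha)
{
	uint8_t r = blend_channel((fg >> 16) & 0xff, (bg >> 16) & 0xff, alpha);
	uint8_t g = blend_channel((fg >> 8) & 0xff, (bg >> 8) & 0xff, alpha);
	uint8_t b = blend_channel(fg & 0xff, bg & 0xff, alpha);

	return pack_r3g3b2(r, g, b);
}
```

The per-pixel arithmetic is only a few multiplies and shifts, so even on a slow CPU this part tends to be cheap compared to getting the result into video memory.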
While there I finally added DDC2 support ( mode switching has been in the driver for years although mostly unused ) which works nicely up to 1680x1200 ( my TV's 1920x1080 didn't work for some reason so these modes are disabled for now until I find out what's (not) going on ).
The other other reason for reviving this machine was the unsupported onboard video. In OF it shows up as /valkyrie, the 'screen' devalias points to it by default, so apparently it was intended to be the console at some point.
Of course the only documentation ( if you can call it that ) is the Linux driver, which was apparently reverse engineered from MacOS. The hardware is rather primitive - there is an i2c-controlled PLL which generates the pixel clock, 1MB of framebuffer memory, a simple RGB DAC and a handful of registers to program video modes, colour depth and interrupts. Video mode programming is weird - there's a single 8bit register and the upper two bits are used to turn off video output and sync signals. The lower 4 or 5 bits apparently correspond to video mode numbers used by MacOS, so there is no way to program arbitrary modes although we can use whatever pixel clock we want. My driver is therefore split into two parts. One is a driver for the PLL, so it can attach to CUDA's i2c bus and may be useful for other, similar video hardware which programs the pixel clock the same way. The other is the actual framebuffer driver. It switches video modes by matching the requested mode against a list of suitable MacOS modes and then programming the PLL with the right pixel clock. Works alright so far. Since there is no drawing engine whatsoever, everything is drawn in software, which brings us to the next point, namely anti-aliased fonts on dumb framebuffers.
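The mode matching described above could look roughly like the following sketch. The table layout, the function name and especially the `hw_mode` numbers are placeholders invented for the example; the resolutions and pixel clocks are standard VESA values used purely for illustration and are not the list of modes the real hardware accepts.

```c
#include <stddef.h>

struct fb_mode {
	int width, height, refresh;	/* pixels, pixels, Hz */
	int pixclock_khz;		/* pixel clock to program into the PLL */
	int hw_mode;			/* placeholder mode number, NOT a real value */
};

/* Illustrative table only, see the note above. */
static const struct fb_mode mode_table[] = {
	{  640, 480, 60, 25175, 1 },
	{  800, 600, 60, 40000, 2 },
	{ 1024, 768, 60, 65000, 3 },
};

/* Return the table entry matching the requested mode, or NULL if none does. */
static const struct fb_mode *
find_mode(int width, int height, int refresh)
{
	size_t i;

	for (i = 0; i < sizeof(mode_table) / sizeof(mode_table[0]); i++) {
		const struct fb_mode *m = &mode_table[i];

		if (m->width == width && m->height == height &&
		    m->refresh == refresh)
			return m;
	}
	return NULL;
}
```

A caller would write `hw_mode` into the mode register and hand `pixclock_khz` to the PLL driver, and refuse the mode switch when `find_mode()` returns NULL.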
As it turned out, even on a low latency bus with a relatively slow CPU, the time it takes to draw an anti-aliased character is dominated by the time it takes to shove the pixels into the framebuffer, not the actual calculations. The fact that my first implementation of the drawing method was quite inefficient didn't help either.
In order to speed things up I now let it render each scanline into a buffer in main memory and then use memcpy to move it into video memory, instead of writing each pixel separately. That gave a nice boost. Then I discovered that the 'fast path' for blank characters which I copied from an existing putchar() method was even worse - it drew every pixel separately and every time it checked for a shadow framebuffer in order to update that as well, pixel by pixel. Replacing that with memset() gave another big boost.
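As a rough sketch of the two tricks above: render each scanline of a glyph into a small buffer in main memory and push it out with one memcpy(), and fill blank cells with memset(). The function names, the fixed maximum glyph width and the greyscale blending are assumptions made to keep the example short, not the actual driver code.

```c
#include <stdint.h>
#include <string.h>

#define MAX_GLYPH_WIDTH	32	/* assumed upper bound for this sketch */

/*
 * Draw an anti-aliased glyph into an 8-bit framebuffer. Each scanline is
 * built in main memory first and then copied to the framebuffer in one go.
 * fg and bg are treated as grey levels to keep the blending short.
 */
static void
draw_glyph(uint8_t *fb, size_t stride, const uint8_t *alpha,
    size_t width, size_t height, uint8_t fg, uint8_t bg)
{
	uint8_t line[MAX_GLYPH_WIDTH];
	size_t x, y;

	if (width > MAX_GLYPH_WIDTH)
		return;
	for (y = 0; y < height; y++) {
		for (x = 0; x < width; x++) {
			uint8_t a = alpha[y * width + x];

			line[x] = (uint8_t)((fg * a + bg * (255 - a) + 127) / 255);
		}
		/* one burst per scanline instead of one write per pixel */
		memcpy(fb + y * stride, line, width);
	}
}

/* Fast path for blank characters: no per-pixel work at all. */
static void
draw_blank(uint8_t *fb, size_t stride, size_t width, size_t height,
    uint8_t bg)
{
	size_t y;

	for (y = 0; y < height; y++)
		memset(fb + y * stride, bg, width);
}
```

A driver that keeps a shadow framebuffer would do the same thing once per scanline for the shadow copy, rather than checking for it on every pixel.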
The benchmark I used was to scroll a bunch of text ( always the same file of course ) and measure how long it takes.
In its first incarnation valkyriefb took about 56 seconds. The memcpy() trick reduced it to 50 seconds. Using memset() to draw blanks got it down to 32 seconds, and caching glyphs in main memory reduced it to 27 seconds.
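A glyph cache in main memory can be as simple as the following sketch. It assumes a fixed 8x16 cell at 8 bit per pixel and caches only printable ASCII drawn with the default attribute. The sizes, names and the choice of what to cache are all assumptions for the example.

```c
#include <stdint.h>
#include <string.h>

#define GLYPH_W		8	/* assumed fixed cell size */
#define GLYPH_H		16
#define GLYPH_BYTES	(GLYPH_W * GLYPH_H)
#define CACHE_FIRST	0x20	/* cache printable ASCII only */
#define CACHE_SLOTS	96

struct glyph_cache {
	uint8_t valid[CACHE_SLOTS];
	uint8_t pixels[CACHE_SLOTS][GLYPH_BYTES];
};

/* Map a character to its cache slot, or -1 if it is not cacheable. */
static int
cache_slot(int c, int default_attr)
{
	if (!default_attr || c < CACHE_FIRST || c >= CACHE_FIRST + CACHE_SLOTS)
		return -1;
	return c - CACHE_FIRST;
}

/* Return the already rendered pixels for c, or NULL on a cache miss. */
static const uint8_t *
cache_lookup(const struct glyph_cache *gc, int c, int default_attr)
{
	int slot = cache_slot(c, default_attr);

	if (slot < 0 || !gc->valid[slot])
		return NULL;
	return gc->pixels[slot];
}

/* Remember freshly rendered pixels; returns 1 if they were cached. */
static int
cache_store(struct glyph_cache *gc, int c, int default_attr,
    const uint8_t *pixels)
{
	int slot = cache_slot(c, default_attr);

	if (slot < 0)
		return 0;
	memcpy(gc->pixels[slot], pixels, GLYPH_BYTES);
	gc->valid[slot] = 1;
	return 1;
}
```

On a hit the drawing code can copy the cached scanlines straight into the framebuffer and skip the blending entirely, so the cache removes calculations but none of the writes to video memory.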
Note that scrolling on a dumb framebuffer redraws the entire page instead of reading from video memory. According to the same benchmark, scrolling by copying video memory is even slower than the original, inefficient putchar() implementation.
So, out of 32 seconds of constant drawing of anti-aliased characters, the actual calculations took a mere 5 seconds ( the difference between rendering every glyph and taking it from the cache ). The rest is almost all writing to video memory. I also experimented with mapping video memory cacheable or with relaxed ordering restrictions, but when using memcpy() and memset() neither one made a measurable difference.
For comparison, the same benchmark on voodoofb took an average of ~2.15 seconds without caching in video memory, and ~1.3 seconds with caching. The difference is that on voodoofb scrolling is done by the blitter so it doesn't draw nearly as many characters, which is why the calculations don't amount to the same 5 seconds.
The same optimizations yielded a visible speedup on a 2GHz Athlon 64 with PCIe graphics running as a dumb framebuffer. You'd think a CPU like that, with a link to video memory that's way faster than the Performa's CPU bus, would render circles around the 603ev / Voodoo3 combo. Nope, it doesn't. It's barely faster than valkyriefb ( the same benchmark took 29 seconds without glyph caching ) and doesn't come anywhere near the Voodoo3. I'll have to figure out how to do host blits on modern Radeons.
Lessons learned:
  • video memory is slow, likely slower than you think it is
  • video memory reads are to be avoided at almost all cost. If you think you're at the point where the cost is too high, you're probably wrong
  • forget your intuition, your CPU is probably faster than your video memory. It's always a good idea to measure instead of going with probably ill-supported assumptions.
  • fast methods to write video memory may compensate even for a vastly faster CPU
  • PCIe is ridiculously fast with big burst transfers, but it really, really hates it when you transfer small chunks

