English Amiga Board - View Single Post - Optimizing Wolf3D-style rendering

Kalms · 06 January 2018, 13:44

Hm, I realized a mistake: when magnifying, pixel pairs will alternate between [texel(x,y) and texel(x,y)] and [texel(x,y) and texel(x,y+1)]. When minifying in the scale range 0.5 ... 1.0, the pixel pairs will alternate between [texel(x,y) and texel(x,y+1)] and [texel(x,y) and texel(x,y+2)]. Still, I think 3 different versions of a texture are acceptable in order to support everything from arbitrary magnification down to a scale of 0.5.
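A quick way to check which pair variants are needed is sketched below. It assumes floor sampling with a constant texel step of 1/scale per screen pixel, which the thread does not spell out; `pair_offsets` is a hypothetical helper name.

```python
from fractions import Fraction

def pair_offsets(scale, rows=256):
    """Return the set of texel offsets between vertically adjacent screen pixels.

    Assumption: screen row r reads texel floor(r / scale).
    """
    step = 1 / Fraction(scale)                      # texels advanced per screen pixel
    texel = [int(r * step) for r in range(rows + 1)]
    return {texel[r + 1] - texel[r] for r in range(rows)}

print(pair_offsets(Fraction(3, 2)))   # magnifying
print(pair_offsets(Fraction(3, 4)))   # minifying, scale between 0.5 and 1.0
```

Under that assumption the offsets seen across both ranges are 0, 1 and 2, which matches the three precomputed texture versions.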

Quote:

Originally Posted by britelite

Yeah, I was actually thinking of mipmapping too, also for the reason of (at least when working in HAM-mode) making the smaller textures gradually darker, to get kind of a free depth shading. And the memory requirement per texture would basically only double, which would be bearable.

Indeed!

Quote:

Originally Posted by britelite

What I'm doing is having the pixel format being basically a3a3a2a2a1a1a0a0 (although the demo has two pixels packed into each byte, to get the dithering) interleaved on four planes.
<snip>
The blitter c2p in this case only needs to do 4 and 2 bit merges.

OK so the demo maps horizontal pairs of pixels; in other words the "sampling resolution" has 2x2 pixel size.

Quote:

Originally Posted by LaBodilsen

Quote:

Originally Posted by britelite

One idea I had would be to rotate the whole chunkybuffer 90 degrees, which would let me draw to the buffer with the destination using (a0)+ instead of offset(a0), saving a lot of cycles. That would of course require rewriting the c2p-routine.

I had this idea too, as it would make more sense in the memory layout, but would it be possible to make the C2P just as fast?

Be aware that there are variations on 90 degree rotation that may be better; one variation is to split the framebuffer into NxM pixel blocks where each block is rotated/transposed internally. Not sure, haven't thought it through.

Quote:

Originally Posted by chb

Just an idea: If you render from the middle to the upper and lower end, you can exploit some symmetry. If you put texel i above the center, you'll always put texel h-i-1 below (h = texture height). If you store your texture rotated 90 degrees and scramble it like 0,h-1,1,h-2,2,h-3..., you can read and write a word with two pixels instead of a byte with one pixel, thereby doubling the speed. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or you could maybe even manage to bake it into the c2p?), but you could still come out faster.

EDIT: Here's a nice example where a byte-swap instruction is missing...

Intriguing. Yes, that could work quite well.
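For reference, chb's scrambled texture order can be sketched as follows. This is only an illustration, assuming an even texture height and one texture column held as a list; the function names are made up here.

```python
def scramble_order(h):
    """Row order 0, h-1, 1, h-2, ... for a texture column of even height h."""
    order = []
    for i in range(h // 2):
        order.append(i)          # texel drawn above the center line
        order.append(h - 1 - i)  # its mirror, drawn below the center line
    return order

def scramble_column(column):
    """Reorder one texture column so that mirrored texels sit next to each other."""
    return [column[i] for i in scramble_order(len(column))]

print(scramble_order(8))
```

With mirrored texels adjacent in memory, a single word read fetches both the texel for the upper half and the one for the lower half.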

Parameters used below: 160 slices in a frame, 320x128 pixels of planar frame buffer, 4 bitplanes. 50% of the frame buffer will be covered by vertical spans, on average.

Let's say we combine Britelite's rendering of horizontal pixel pairs (for dithering) with chb's idea of byte-interleaving row 0 and 127, row 1 and 126, etc... Then, the render operation is just MOVE.W xx(an),yy(am). This will fetch and write two mirrored 2x1 pixel pairs. This results in %a3b3a2b2a1b1a0b0 A3B3A2B2A1B1A0B0 where a,b are two consecutive pixels on line 0 and A,B are two consecutive pixels on line 127.

The blitter work required to unscramble is an 8bit, a 4bit and a 2bit pass.
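The word layout described above can be written out as a small sketch. It assumes 4-bit pixel values and that the line-0 pair sits in the high byte, as the bit pattern is written MSB first; the helper names are hypothetical.

```python
def interleave_pair(a, b):
    """Pack two 4-bit pixels into one byte as a3 b3 a2 b2 a1 b1 a0 b0."""
    out = 0
    for bit in range(4):
        out |= ((a >> bit) & 1) << (2 * bit + 1)  # a's bits go to the odd positions
        out |= ((b >> bit) & 1) << (2 * bit)      # b's bits go to the even positions
    return out

def mirrored_word(a, b, A, B):
    """16-bit word moved by MOVE.W: line-0 pair in the high byte, mirrored-line pair in the low byte."""
    return (interleave_pair(a, b) << 8) | interleave_pair(A, B)

print(hex(mirrored_word(0xF, 0x0, 0x0, 0xF)))
```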

CPU time for drawing spans: 20 cycles * 0.5 * 160*128 * 50% = 102400 cycles

Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing

We can get rid of the 8bit pass by changing the CPU logic to do MOVE.W xx(an),dn / MOVEP.W dn,yy(am). This enables us to word-interleave lines instead of byte-interleaving them. This takes 12+16 = 28 cycles per MOVE/MOVEP pair, so CPU time increases to 143360 cycles, but it probably results in just 40kB of blitter output (so ~1 frame worth of blitter processing).
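To see why MOVEP gives word-interleaving, here is a minimal model of the store. MOVEP.W writes the high byte of the word to the target address and the low byte two bytes further on; the buffer layout and the idea that the neighbouring column fills the skipped bytes are my assumptions, not something stated in the thread.

```python
def movep_w(memory, addr, value):
    """Model of 68000 MOVEP.W Dn,d16(An): store a word's two bytes at addr and addr+2."""
    memory[addr] = (value >> 8) & 0xFF   # high byte -> addr
    memory[addr + 2] = value & 0xFF      # low byte  -> addr + 2

buf = bytearray(4)
movep_w(buf, 0, 0xAA55)   # column 0: line-0 pair 0xAA, mirrored-line pair 0x55
movep_w(buf, 1, 0xBB66)   # column 1 fills the bytes that column 0 skipped
print(buf.hex())
```

After both stores, the first word holds line-0 data for two columns and the second word holds the mirrored line, so the lines end up interleaved per word rather than per byte.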

If we go with Britelite's 2x1 dithered pixel format + chb's mirroring idea + my idea of precomputing pixel pairs vertically (but ignore MOVEP), then we end up with the first 4 bytes of the framebuffer containing pixel data for row 0, then row 127, then row 1, then row 126. We perform MOVE.L xx(an),yy(am) to fetch and write two mirrored 2x2 pixel pairs.

CPU time for drawing spans: 28 cycles * 0.25 * 160*128 * 50% = 71680 cycles

Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing

If we do the previous thing but replace the MOVE.L xx(an),yy(am) with a MOVE.L xx(an),dn / MOVEP.L dn,yy(am) pair, then the CPU time spent will reach 102400 cycles but blitter output should again be 40kB (so ~1 frame of blitter work).
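The four estimates can be reproduced with a few lines of arithmetic. This is a sketch using the parameters stated earlier in the post; the 40 cycles for the MOVE.L/MOVEP.L pair (16+24) and the two blitter passes for the MOVEP variants are inferred from the quoted totals rather than stated explicitly.

```python
SLICES, ROWS, COVERAGE = 160, 128, 0.5   # parameters from the post
WIDTH, PLANES = 320, 4

def span_cycles(cycles_per_op, pairs_per_op):
    """CPU cycles per frame; one operation covers pairs_per_op 2x1 pixel pairs."""
    pairs_drawn = SLICES * ROWS * COVERAGE
    return int(cycles_per_op * pairs_drawn / pairs_per_op)

def blitter_bytes(passes):
    """Bytes written by the blitter, one full planar frame per pass."""
    return passes * (WIDTH * ROWS * PLANES // 8)

# name: (cycles per operation, 2x1 pairs per operation, blitter passes)
variants = {
    "MOVE.W": (20, 2, 3),
    "MOVE.W + MOVEP.W": (28, 2, 2),   # 2 passes inferred from the 40kB figure
    "MOVE.L": (28, 4, 3),
    "MOVE.L + MOVEP.L": (40, 4, 2),   # 16+24 cycles, inferred from the total
}

costs = {name: (span_cycles(c, p), blitter_bytes(n)) for name, (c, p, n) in variants.items()}
for name, (cycles, out_bytes) in costs.items():
    print(f"{name:18s} {cycles:7d} cycles  {out_bytes // 1024} kB blitter output")
```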