Optimizing Wolf3D-style rendering

britelite · 05 January 2018, 19:19

As I promised in the other Wolfenstein 3D thread, I started a new thread for a more technical discussion on texture mapped wolf3d-style rendering. Let's keep the discussion on topic, and leave daydreaming (and preferably discussion about non-texturemapped approaches) in another thread.

I built another preview of my routine (http://dekadence64.org/wolf3d_v2.lha), which should run at least a bit smoother than the previous one. Still room for more optimizing though.

I also included a binary blob of raycasted data (wolf.3d), if anyone wants to try out the wall rendering, without having to write a raycaster. The format is pretty simple, 1024 frames of 320 bytes each. Every frame consists of 2 bytes per slice (160 slices in total), first byte being the height of the wall (0-127) and second byte being the texture u-coordinate (0-63 for texture one, 64-127 for texture two, textures being 64x64 in size).
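For anyone who wants to poke at the blob on a PC first, here is a minimal sketch of a reader for the format described above. The names `parse_frame`, `FRAME_BYTES` and `SLICES_PER_FRAME` are made up for illustration, and splitting the second byte into a texture number and a 0-63 coordinate is just one reading of the description.

```python
FRAME_BYTES = 320        # 160 slices * 2 bytes, as described in the post
SLICES_PER_FRAME = 160


def parse_frame(blob: bytes, frame_index: int):
    """Return a list of (height, texture_id, u) tuples for one frame.

    height     : first byte of the slice (0-127)
    texture_id : 0 for u-bytes 0-63, 1 for u-bytes 64-127
    u          : column inside the 64x64 texture (0-63)
    """
    start = frame_index * FRAME_BYTES
    frame = blob[start:start + FRAME_BYTES]
    if frame_index < 0 or len(frame) != FRAME_BYTES:
        raise ValueError("frame index out of range")
    slices = []
    for i in range(SLICES_PER_FRAME):
        height = frame[2 * i]
        texture_id, u = divmod(frame[2 * i + 1], 64)
        slices.append((height, texture_id, u))
    return slices
```

With the real file this would be called as `parse_frame(open("wolf.3d", "rb").read(), n)` for a frame number `n` between 0 and 1023.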

So, about the wall rendering. The simple approach would be to have unrolled loops of code for rendering the wall from top to bottom, basically amounting to a lot of:

Code:

...
move.b 0(a1),0(a0)
move.b 64(a1),160(a0)
move.b 128(a1),320(a0)
move.b 192(a1),480(a0)
...

For slightly better performance I'm not rendering top-down, but instead from the middle towards top and bottom (which makes generating the code on the fly slightly easier). Also, I've rotated the textures 90 degrees, so that I can in some cases read from the texture without offsets, making use of post increments.

So, for cases where the wall size is smaller than the height of the texture, I still use offsets. But when we start stretching the texture (wall size 64 and above), I drop the offsets and start doing post increments instead. So for example for doubled size I'd do:

Code:

...
move.b (a1),0(a0) ; zoom factor 2.0, each texel drawn two times
move.b (a1)+,160(a0)
move.b (a1),320(a0)
move.b (a1)+,480(a0)
...
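To see which texel lands on which row for a given wall height, a small generator like the one below can print the unrolled listing. This is only a model: it goes top-down with plain displacements and nearest-neighbour sampling, whereas the routine described here renders from the middle outwards and uses post increments where it can. The function name `emit_column` and its parameters are hypothetical.

```python
def emit_column(wall_height, tex_height=64, row_stride=160):
    """Return one 'move.b' line per destination row of a wall slice.

    Assumes the texture is stored rotated, so consecutive texels of a
    column are consecutive bytes, and a 160-byte-wide chunky buffer.
    """
    lines = []
    for row in range(wall_height):
        texel = (row * tex_height) // wall_height   # nearest-neighbour sample
        lines.append(f"move.b {texel}(a1),{row * row_stride}(a0)")
    return lines
```

For `wall_height=128` every texel index appears twice in a row, which is the zoom factor 2.0 case shown above written with explicit offsets.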

Further improvements could be to optimize zoom factor >2.0 like this:

Code:

...
move.b (a1)+,d2 ; zoom factor of 3.0, each texel drawn three times
move.b d2,0(a0)
move.b d2,160(a0)
move.b d2,320(a0)
move.b (a1)+,d2
move.b d2,480(a0)
...

Could also be a good idea to have a look at the cases for zoom factor <1.0, to see if a combination of post increments and addq.l #value,a1 could speed up the rendering, like:

Code:

...
move.b (a1)+,0(a0)
move.b (a1)+,160(a0)
addq.l #1,a1 ; skipping an additional byte
move.b (a1)+,320(a0)
move.b (a1)+,480(a0)
...

One idea I had would be to rotate the whole chunkybuffer 90 degrees, which would let me draw to the buffer with the destination using (a0)+ instead of offset(a0), saving a lot of cycles. That would of course require rewriting the c2p-routine.

I hope my explanation made any sense, let's see if anyone might have some other (hopefully better) approaches for this.

Kalms · 05 January 2018, 22:00

Neat! I like your testcase definition.

I presume you are targeting A500. If so, a couple of things:

Use mipmapping. By using mipmapping you can keep a lid on the worst-case minification that you will ever do. For example, if you do log2 mipmaps (128, 64, 32, ...) then you will never minify more than a factor of 2x. This may simplify the rendering.
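A rough sketch of how a mip level could be picked so that the minification stays below 2x. It assumes a 64-texel base texture (as in the first post) with every level half the size of the previous one; `pick_mip_size` is a made-up name, and walls taller than the base texture simply magnify from the base level.

```python
def pick_mip_size(wall_height, base_size=64):
    """Return the smallest power-of-two mip size that is still >= wall_height.

    The ratio mip_size / wall_height then stays below 2, so the renderer
    never skips more than one texel between two rows.
    """
    if wall_height < 1:
        raise ValueError("wall_height must be at least 1")
    size = base_size
    while size // 2 >= wall_height:
        size //= 2
    return size
```

For example a wall 33 pixels tall would still sample the 64-texel level, while a wall 32 pixels tall drops to the 32-texel level.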

The total time spent is roughly (number of pixels mapped) * (cycles per pixel for mapping) + (c2p time). It looks like typically 50% of on-screen pixels are being mapped. The goal will be to minimize the overall time.

If you have log2 mipmapping, then you can render a pair of pixels in one go: you know that two consecutive MOVE.Bs from the original routine will either fetch the texel pair texel(x,y) & texel(x,y+1), or texel(x,y) & texel(x,y), and write this pair to screen(x,y) & screen(x,y+1).

Now, you could make two versions of each texture. Generate one which contains the data %a3a3a2a2a1a1a0a0, and another which contains %a3A3a2A2a1A1a0A0. (a = texel(x,y), A = texel(x,y+1)) This way, each byte contains a texel-pair, either the same texel duplicated, or the texel combined with the texel below. Generate rendering code like this:

Code:

   move.b offs0(a0),0(a1)
   move.b offs1(a0),320(a1)
   move.b offs2(a0),640(a1)
   ...

Each move.b will fetch a pair of texels, and write them to the destination buffer. The first 320 bytes of the framebuffer will contain pixel data for scanline 0 and 1 interleaved. This way, you reduce the per-pixel mapping cost to half.

There is still a hefty cost for the c2p conversion. You get output like:
a3A3a2A2a1A1a0A0 b3B3b2B2b1B1b0B0 c3C3c2C2c1C1c0C0 d3D3d2D2d1D1d0D0

... which is not great for c2p merging, but reordering columns can make it better. I think this will turn out quicker in the end.

If you have more memory available for extra versions of the texture (so you can keep one %..xx..xx..xx..xx and one %xx..xx..xx..xx.. version, or similar) then it may turn out quicker to collect multiple pixels in a register by MOVE/OR combinations. You may also be able to get the data slightly better structured for the c2p pass.

I have wondered about whether or not it helps to store the screen swizzled in, say, 4x4 or similar pixel groups. With a fixed group size it may be possible to collect more things in registers and then dump it out.

Pardon if this is a bit hand-wavy - I'm having a pretty bad cold today.

Kalms · 06 January 2018, 09:48

After some sleep, I realized that there's no benefit to scrambling pixel order. Just having one texture with data %a3a2a1a0a3a2a1a0 and another with %a3a2a1a0A3A2A1A0 (so one pixel in the high nibble and another in the low nibble) works just as well. Then run 4bit, 2bit and 1bit merges with the blitter.
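As a sketch of what the two texture versions could look like when built offline, assuming 4-bit texels and one texture column given as a list from top to bottom. The name `build_pair_textures` is hypothetical, and repeating the last texel at the bottom edge is an assumption the post does not spell out.

```python
def build_pair_textures(column):
    """Pack one texture column into the two nibble-pair layouts.

    Returns (same, below):
      same[i]  = texel i in both nibbles              (%a3a2a1a0a3a2a1a0)
      below[i] = texel i high, texel i+1 in low nibble (%a3a2a1a0A3A2A1A0)
    """
    if any(not 0 <= t <= 15 for t in column):
        raise ValueError("texels must be 4-bit values")
    last = len(column) - 1
    same = bytes((t << 4) | t for t in column)
    below = bytes(
        (column[i] << 4) | column[min(i + 1, last)]   # clamp at the bottom edge
        for i in range(len(column))
    )
    return same, below
```

Each output byte is then exactly what one move.b would copy to the buffer for a pair of rows.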

britelite · 06 January 2018, 10:30

Quote:

Originally Posted by Kalms

Use mipmapping. By using mipmapping you can keep a lid on the worst-case minification that you will ever do. For example, if you do log2 mipmaps (128, 64, 32, ...) then you will never minify more than a factor of 2x. This may simplify the rendering.

Yeah, I was actually thinking of mipmapping too, also for the reason of (at least when working in HAM-mode) making the smaller textures gradually darker, to get kind of a free depth shading. And the memory requirement per texture would basically only double, which would be bearable.

Quote:

There is still a hefty cost for the c2p conversion. You get output like:
a3A3a2A2a1A1a0A0 b3B3b2B2b1B1b0B0 c3C3c2C2c1C1c0C0 d3D3d2D2d1D1d0D0

What I'm doing is having the pixel format being basically a3a3a2a2a1a1a0a0 (although the demo has two pixels packed in to each byte, to get the dithering) interleaved on four planes.

Code:

byte 0 - plane 0
byte 1 - plane 1
byte 2 - plane 2
byte 3 - plane 3
byte 4 - plane 0
byte 5 - plane 1
...

The blitter c2p in this case only needs to do 4 and 2 bit merges.

LaBodilsen · 06 January 2018, 10:38

Quote:

Originally Posted by britelite

As I promised in the other Wolfenstein 3D thread, I started a new thread for a more technical discussion on texture mapped wolf3d-style rendering. Let's keep the discussion on topic, and leave daydreaming (and preferably discussion about non-texturemapped approaches) in another thread.

Looking forward to a more technical discussion on this.

Quote:

I built another preview of my routine (http://dekadence64.org/wolf3d_v2.lha), which should run at least a bit smoother than the previous one. Still room for more optimizing though.

Thank you.. although i now have to decrunch and disassemble all over again.

Quote:

I also included a binary blob of raycasted data (wolf.3d), if anyone wants to try out the wall rendering, without having to write a raycaster. The format is pretty simple, 1024 frames of 320 bytes each. Every frame consists of 2 bytes per slice (160 slices in total), first byte being the height of the wall (0-127) and second byte being the texture u-coordinate (0-63 for texture one, 64-127 for texture two, textures being 64x64 in size).

There seem to be some glitches in the blob, so i took the liberty to fix them.
Frame 17 Column 80 : $0F3A to $0E62
Frame 47 Column 80 : $4044 to $125F
Frame 48 Column 80 : $427F to $125F
Frame 86 Column 80 : $2C35 to $1E66

I've attached the fixed blob packed with lha.

Quote:

One idea I had would be to rotate the whole chunkybuffer 90 degrees, which would let me draw to the buffer with the destination using (a0)+ instead of offset(a0), saving a lot of cycles. That would of course require rewriting the c2p-routine.

I had this idea too, as it would make more sense in the memory layout, but would it be possible to make the C2P just as fast?

chb · 06 January 2018, 12:14

Just an idea: If you render from the middle to the upper and lower end, you can exploit some symmetry. If you put texel i above the center, you'll always put texel h-i-1 below (h being the texture height). If you store your texture rotated 90 degrees and scramble it like 0,h-1,1,h-2,2,h-3..., you can read and write a word with two pixels instead of a byte with one pixel, thereby doubling the speed. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or you could maybe even manage to bake it into the c2p?), but you could still come out faster.

EDIT: Here's a nice example where a byte-swap instruction is missing...
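The scrambled order is easy to precompute when converting textures. A minimal sketch, assuming one texture column as a list of texels and an even height; `scramble_column` is a made-up name.

```python
def scramble_column(column):
    """Reorder a column as 0, h-1, 1, h-2, 2, h-3, ...

    After this, every aligned pair of entries holds a texel and its
    mirror image, so one word read fetches both.
    """
    h = len(column)
    if h % 2:
        raise ValueError("column height must be even")
    out = []
    for i in range(h // 2):
        out.append(column[i])           # texel i, drawn above the center
        out.append(column[h - 1 - i])   # texel h-i-1, drawn below the center
    return out
```

For an 8-texel column this gives the order 0, 7, 1, 6, 2, 5, 3, 4.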

Kalms · 06 January 2018, 13:44

Hm, I realized a mistake: pixel pairs will alternate between [texel(x,y) and texel(x,y)] and [texel(x,y) and texel(x,y+1)] when magnifying. When minifying in the scale range 0.5 ... 1.0 then the pixel pairs will alternate between [texel(x,y) and texel(x,y+1)] and [texel(x,y) and texel(x,y+2)]. Still, 3 different versions of a texture is acceptable I think to be able to support arbitrary magnification through scaling down to 0.5.
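A quick way to check which texel distances actually occur inside a pixel pair is to enumerate them. The sketch below assumes nearest-neighbour sampling with integer math and pairs made of rows (0,1), (2,3), and so on; `pair_deltas` is a hypothetical helper, not part of anyone's routine.

```python
def pair_deltas(wall_height, tex_height=64):
    """Return the set of texel-index differences inside each vertical pixel pair."""
    deltas = set()
    for y in range(0, wall_height - 1, 2):
        t0 = (y * tex_height) // wall_height
        t1 = ((y + 1) * tex_height) // wall_height
        deltas.add(t1 - t0)
    return deltas
```

With a 64-texel texture, a wall 96 pixels tall gives the distances {0, 1} and a wall 48 pixels tall gives {1, 2}, which is where the count of three texture versions (distance 0, 1 and 2) comes from.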

Quote:

Originally Posted by britelite

Yeah, I was actually thinking of mipmapping too, also for the reason of (at least when working in HAM-mode) making the smaller textures gradually darker, to get kind of a free depth shading. And the memory requirement per texture would basically only double, which would be bearable.

Indeed!

Quote:

Originally Posted by britelite

What I'm doing is having the pixel format being basically a3a3a2a2a1a1a0a0 (although the demo has two pixels packed in to each byte, to get the dithering) interleaved on four planes.
<snip>
The blitter c2p in this case only needs to do 4 and 2 bit merges.

OK so the demo maps horizontal pairs of pixels; in other words the "sampling resolution" has 2x2 pixel size.

Quote:

Originally Posted by LaBodilsen

Quote:

Originally Posted by britelite

One idea I had would be to rotate the whole chunkybuffer 90 degrees, which would let me draw to the buffer with the destination using (a0)+ instead of offset(a0), saving a lot of cycles. That would of course require rewriting the c2p-routine.

I had this idea too, as it would make more sense in the memory layout, but would it be possible to make the C2P just as fast?

Be aware that there are variations on 90 degree rotation that may be better; one variation is to split the framebuffer into NxM pixel blocks where each block is rotated/transposed internally. Not sure, haven't thought it through.

Quote:

Originally Posted by chb

Just an idea: If you render from the middle to the upper and lower end, you can exploit some symmetry. if you put texel i above the center, you'll always put texel h-i-1 below (h texture height). If you store your texture 90 deg. rotated and scramble it like 0,h-1,1,h-2,2,h-3..., you can read and write a word with two pixels instead of a byte with one pixel, thereby double the speed. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or you could maybe even manage to bake it into the c2p?), but you could still come out faster.

EDIT: Here's a nice example where a byte-swap instruction is missing...

Intriguing. Yes, that could work quite well.

Parameters used below: 160 slices in a frame, 320x128 pixels of planar frame buffer, 4 bitplanes. 50% of the frame buffer will be covered by vertical spans, on average.

Let's say we combine Britelite's rendering of horizontal pixel pairs (for dithering) with chb's idea of byte-interleaving row 0 and 127, row 1 and 126, etc. Then the render operation is just MOVE.W xx(an),yy(am). This will fetch and write two mirrored 2x1 pixel pairs, which results in %a3b3a2b2a1b1a0b0 A3B3A2B2A1B1A0B0 where a,b are two consecutive pixels on line 0 and A,B are two consecutive pixels on line 127.

The blitter work required to unscramble is an 8bit, a 4bit and a 2bit pass.

CPU time for drawing spans: 20 cycles * 0.5 * 160*128 * 50% = 102400 cycles

Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing

We can get rid of the 8bit pass by changing the CPU logic to do MOVE.W xx(an),dn / MOVEP.W dn,yy(am). This enables us to word-interleave lines instead of byte-interleaving them. This will take 12+16 = 28 cycles per word written, which increases CPU time to 143360 cycles, but it probably results in just 40kB of blitter output (so ~1 frame worth of blitter processing).

If we go with Britelite's 2x1 dithered pixel format + chb's mirroring idea + my idea of precomputing pixel pairs vertically (but ignore MOVEP), then we end up with the first 4 bytes of the framebuffer containing pixel data for row 0, then row 127, then row 1, then row 126. We perform MOVE.L xx(an),yy(am) to fetch and write two mirrored 2x2 pixel pairs.

CPU time for drawing spans: 28 cycles * 0.25 * 160*128 * 50% = 71680 cycles

Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing

If we do the previous thing but replace the MOVE.L xx(an),yy(am) with a MOVE.L xx(an),dn / MOVEP.L dn,yy(am) pair, then the CPU time spent will reach 102400 cycles but blitter output should again be 40kB (so ~1 frame of blitter work).
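The four estimates above can be reproduced with a few lines of arithmetic. The sketch below uses the parameters stated in this post (160 slices, 128 rows, 50% coverage, 4 bitplanes); the function names are made up, and the 40 cycles for the MOVE.L/MOVEP.L pair is the value implied by the 102400 total rather than a figure quoted in the post.

```python
SLICES = 160
ROWS = 128
COVERAGE = 0.5                       # share of the buffer covered by wall spans
PLANAR_BYTES = 320 * 128 * 4 // 8    # one full pass over 4 bitplanes


def span_cycles(cycles_per_op, ops_per_pixel):
    """CPU cycles per frame for drawing the wall spans."""
    return cycles_per_op * ops_per_pixel * SLICES * ROWS * COVERAGE


def blitter_bytes(passes):
    """Bytes written by the blitter for a given number of merge passes."""
    return passes * PLANAR_BYTES


variants = {
    "MOVE.W mem,mem":   (span_cycles(20, 0.5),  blitter_bytes(3)),
    "MOVE.W + MOVEP.W": (span_cycles(28, 0.5),  blitter_bytes(2)),
    "MOVE.L mem,mem":   (span_cycles(28, 0.25), blitter_bytes(3)),
    "MOVE.L + MOVEP.L": (span_cycles(40, 0.25), blitter_bytes(2)),
}

for name, (cpu, blit) in variants.items():
    print(f"{name:18s} {cpu:8.0f} cycles  {blit // 1024:3d} kB blitter output")
```

Changing `COVERAGE` shows how sensitive the CPU side is to how much of the screen the walls fill, while the blitter cost stays fixed.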

Leo42 · 06 January 2018, 23:47

I made a quick video of the latest wolf3D_v2 demo running on UAE (A500 unexpanded configuration) so that anyone can see your impressive work

Is there a way to exit the demo? Didn't find it.

redblade · 07 January 2018, 03:31

The status bar at the bottom, is that a different bitplane pointer on the copperlist? Also, what about his face, is that a 16 colour sprite? Looks wicked.

LaBodilsen · 07 January 2018, 12:29

Quote:

Originally Posted by Leo42

I made a quick video of the latest wolf3D_v2 demo running on UAE (A500 unexpanded configuration) so that anyone can see your impressive work

Is there a way to exit the demo? Didn't find it.

It exits fine with a single left mouse click here?

Quote:

Originally Posted by redblade

The status bar at the bottom, is that a different bitplane pointer on the copperlist? Also, what about his face, is that a 16 colour sprite? Looks wicked.

Not a different bitplane pointer. The status bar is copied to the screenbuffer (twice for double buffering), and then the colors are changed on the rasterline where the status bar starts. But it's only for display, to make it look more like Wolfenstein. It's just a normal 16 color picture, no sprites used.

Frog · 07 January 2018, 18:08

Indeed, it's good work getting this running on an A500.

saimon69 · 07 January 2018, 23:33

Am positively impressed - that looks darn good for a 500. Now what about step 2 (having the player control the walk, and objects around)? I personally think that if this routine can run all the stuff at 25fps (2 frames PAL) or 17fps (3 frames PAL), that should be good enough, but that's me.

Dunny · 08 January 2018, 00:27

Is that really running on a plain unexpanded A500? Looks like it's got a FastRAM expansion added in. How much RAM does the demo need?

britelite · 08 January 2018, 08:18

Quote:

Originally Posted by Dunny

Is that really running on a plain unexpanded A500? Looks like it's got a FastRAM expansion added in. How much RAM does the demo need?

Yes, it runs on an A500 512k+512k. Currently the rendering doesn't even make use of Fast RAM, as both the screen buffers and the wall rendering code are in Chip RAM.

britelite · 08 January 2018, 08:22

Quote:

Originally Posted by chb

Just an idea: If you render from the middle to the upper and lower end, you can exploit some symmetry. if you put texel i above the center, you'll always put texel h-i-1 below (h texture height). If you store your texture 90 deg. rotated and scramble it like 0,h-1,1,h-2,2,h-3..., you can read and write a word with two pixels instead of a byte with one pixel, thereby double the speed. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or you could maybe even manage to bake it into the c2p?), but you could still come out faster.

That's a really nice idea, and would definitely make the rendering faster as also sprite rendering would benefit from this. Will definitely try it out when I have some time

LaBodilsen · 08 January 2018, 09:29

Quote:

Originally Posted by britelite

That's a really nice idea, and would definitely make the rendering faster as also sprite rendering would benefit from this. Will definitely try it out when I have some time

I don't see it working too well with sprites, as transparency would be different for the top and bottom half of the sprite, so you would have to evaluate each pixel individually, and plot them one at a time anyway. Also, at least in Wolf3D, some sprites are only located on the bottom half of the screen (ammo boxes, health packs etc.).
Of course, in some cases where there is data in both the top and bottom pixel, it would work great.

Unless I'm clueless on how to implement sprites.

britelite · 08 January 2018, 11:20

Quote:

Originally Posted by LaBodilsen

I don't see it working too well with sprites, as transparency would be different for the top and bottom half of the sprite, so you would have to evaluate each pixel individually, and plot them one at a time anyway.

With bitmasks this wouldn't be a problem. But there are of course other issues to also consider with sprite rendering.

Quote:

Also, at least in Wolf3D, some sprites are only located on the bottom half of the screen (ammo boxes, health packs etc.).

I would definitely make a separate routine for these cases.

chb · 08 January 2018, 15:37

@Kalms: Great estimation! And clever use of movep. I was actually considering that instruction for a moment, but was somehow under the impression that it works only on word boundaries, duh!

Writing out four pixels in a longword has another nice effect: As you are completely free to scramble the line order in the display buffer by using a custom copper list (e.g. you can use bpl 0 line 0, bpl 0 line 127, bpl 1 line 0,.. bpl 0 line 1, bpl 0 line 126,...), you can put all high-order bits (x3 and x2) in the first word and the low-order bits (x1 and x0) in the second word. This does not speed up the c2p, but has an advantage:

Use a 2-bit texture, and you can load the 4 texels using a move.w (a0)+,Dn instead of a move.l, as the remaining bits (upper word in Dn) are constant. Either write-out a longword with constant upper word (easier) or fill with blitter afterwards (more complicated, but potentially faster).

You'd need a second set of unrolled code for the 2-bit textures, so higher memory usage, but the textures will be smaller.

Most efficiently one would probably optimize the palette so that often-used textures (like the stone walls in wolf) fit in one of 4-color-blocks each. Not all lines in the texture need to use the same high-order bits, and not all lines need to use 2-bit textures. So a couple of columns with additional colors is possible and would not degrade performance a lot.

And another idea to burn memory: Anisotropic mipmaps, so textures scaled in x direction (in addition to the usual mipmaps). As you always plot columns two pixels wide, the texture may look strange when scaling is non-uniform (e.g. a wall at a steep angle). Doubles memory usage for textures, though.

Anyway, I guess I should dig out my assembler to put some code where my mouth is. Just fearing I'm quite a bit rusty.

(And I quite dislike Wolfenstein as a game.)

aros-sg · 09 January 2018, 18:20

Random stupid idea: have a completely unrolled chunky2planar routine for a vertical stripe (16 x screenheight or 32 x screenheight) and do the chunky rendering/raycasting directly into that code by modifying it (roughly: move.l #$12345678,d0 -> poke the pixel data into the immediate operand at opcode+2). So no extra chunkybuffer, because it is nested inside the code itself.

britelite · 09 January 2018, 19:22

I had some spare time and decided to implement mipmapping to the rendering, resulting in getting rid of displacements when reading the texture. So, now the code is pretty much:

Code:

...
move.b (a1)+,0(a0)
move.b (a1),160(a0)
move.b (a1)+,320(a0)
...

and for really up close walls (zoom factor >2.0):

Code:

...
move.b (a1)+,d2
move.b d2,0(a0)
move.b d2,160(a0)
move.b d2,320(a0)
move.b (a1)+,d2
...

With the stream I've been using this gets the speed to around 1.7-2.5 frames (including blitter clear of buffer and c2p), while the previous version barely had a best case of under 2 frames.

The routine still only renders bytes, so next up would be to try out chb's method of rendering in pairs. With this approach I might also try turning the framebuffer 90 degrees, so I could do the writing with pre/post-increments instead of displacements, saving up a few additional cycles, as an additional blitter pass will anyway be required for handling the byte-pairs.
