Optimizing Wolf3D-style rendering (English Amiga Board)

britelite · 05 January 2018, 19:19

As I promised in the other Wolfenstein 3D thread, I started a new thread for a more technical discussion on texture-mapped wolf3d-style rendering. Let's keep the discussion on topic, and leave daydreaming (and preferably discussion about non-texture-mapped approaches) to another thread.

I built another preview of my routine (http://dekadence64.org/wolf3d_v2.lha), which should run at least a bit smoother than the previous one. Still room for more optimizing though.

I also included a binary blob of raycast data (wolf.3d), if anyone wants to try out the wall rendering without having to write a raycaster. The format is pretty simple: 1024 frames of 320 bytes each. Every frame consists of 2 bytes per slice (160 slices in total), the first byte being the height of the wall (0-127) and the second byte being the texture u-coordinate (0-63 for texture one, 64-127 for texture two, textures being 64x64 in size).
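
For anyone who wants to inspect the blob on a PC first, here is a minimal Python sketch of a decoder for the layout described above. The function name and the split of the u byte into (texture index, column) are illustrative choices, and the post does not say in which unit the height byte is measured, so it is returned as-is.

```python
FRAME_COUNT = 1024          # frames in wolf.3d, per the post
SLICES_PER_FRAME = 160      # one slice per screen column
BYTES_PER_SLICE = 2         # height byte followed by u byte
FRAME_SIZE = SLICES_PER_FRAME * BYTES_PER_SLICE  # 320 bytes


def decode_frame(data, frame_index):
    """Return one (height, texture_index, texture_column) tuple per slice.

    decode_frame is a hypothetical helper, not part of the original routine.
    """
    start = frame_index * FRAME_SIZE
    frame = data[start:start + FRAME_SIZE]
    if len(frame) != FRAME_SIZE:
        raise ValueError("frame %d is out of range or truncated" % frame_index)
    slices = []
    for i in range(0, FRAME_SIZE, BYTES_PER_SLICE):
        height = frame[i]       # 0-127, unit not specified in the post
        u = frame[i + 1]        # 0-63 texture one, 64-127 texture two
        slices.append((height, u >> 6, u & 63))
    return slices
```

Typical use would be to read the whole file with `open("wolf.3d", "rb").read()` and call `decode_frame(data, n)` for n from 0 to 1023.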

So, about the wall rendering. The simple approach would be to have unrolled loops of code for rendering the wall from top to bottom, basically amounting to a lot of:

Code:

...
move.b 0(a1),0(a0) ; source steps one texture row (64 bytes), destination one buffer row (160 bytes)
move.b 64(a1),160(a0)
move.b 128(a1),320(a0)
move.b 192(a1),480(a0)
...
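
The unrolled loop above could be produced by a small offline generator. The Python sketch below is only an illustration: the nearest-neighbour mapping `y * 64 // wall_height` and the function name are assumptions, not taken from the actual routine. It emits one move per screen row for a given wall height.

```python
TEXTURE_SIZE = 64       # textures are 64x64, one byte per texel
BUFFER_STRIDE = 160     # bytes per buffer row, as in 160(a0), 320(a0), ...
MAX_DISPLACEMENT = 32767  # d16(An) takes a signed 16-bit displacement


def emit_naive_column(wall_height):
    """Emit top-down moves for one wall slice (hypothetical helper)."""
    lines = []
    for y in range(wall_height):
        # Assumed nearest-neighbour scaling from screen row to texel row.
        v = y * TEXTURE_SIZE // wall_height
        src = v * TEXTURE_SIZE      # unrotated texture: next row is +64
        dst = y * BUFFER_STRIDE
        if dst > MAX_DISPLACEMENT:
            raise ValueError("destination offset does not fit in d16(a0)")
        lines.append("move.b %d(a1),%d(a0)" % (src, dst))
    return lines
```

With `wall_height` set to 64 the first four lines match the listing above; smaller heights skip texture rows, larger ones repeat them.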

For slightly better performance I'm not rendering top-down, but instead from the middle towards top and bottom (which makes generating the code on the fly slightly easier). Also, I've rotated the textures 90 degrees, so that I can in some cases read from the texture without offsets, making use of post increments.

So, for cases where the wall size is smaller than the height of the texture, I still use offsets. But when we start stretching the texture (wall size 64 and above), I drop the offsets and start doing post increments instead. So for example for doubled size I'd do:

Code:

...
move.b (a1),0(a0) ; zoom factor 2.0, each texel drawn two times
move.b (a1)+,160(a0)
move.b (a1),320(a0)
move.b (a1)+,480(a0)
...
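
The same pattern generalizes to any wall size of 64 and above: read with (a1) while the texel repeats and with (a1)+ on its last use. A minimal Python sketch of such a generator follows, again assuming a nearest-neighbour mapping and using a made-up function name.

```python
TEXTURE_SIZE = 64       # texels per (rotated) texture column
BUFFER_STRIDE = 160     # bytes per buffer row


def emit_postinc_column(wall_height):
    """Emit post-increment moves for a stretched wall (hypothetical helper)."""
    if wall_height < TEXTURE_SIZE:
        raise ValueError("post-increment variant assumes wall_height >= 64")
    # Assumed nearest-neighbour mapping from screen row to texel index.
    texel = [y * TEXTURE_SIZE // wall_height for y in range(wall_height)]
    lines = []
    for y in range(wall_height):
        # Advance the texture pointer only when this row is the last one
        # that shows the current texel.
        last_use = y == wall_height - 1 or texel[y + 1] != texel[y]
        src = "(a1)+" if last_use else "(a1)"
        lines.append("move.b %s,%d(a0)" % (src, y * BUFFER_STRIDE))
    return lines
```

For a wall height of 128 this reproduces the zoom factor 2.0 listing above, and for non-integer zoom factors it mixes single and repeated reads as needed.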

Further improvements could be to optimize zoom factor >2.0 like this:

Code:

...
move.b (a1)+,d2 ; zoom factor of 3.0, each texel drawn three times
move.b d2,0(a0)
move.b d2,160(a0)
move.b d2,320(a0)
move.b (a1)+,d2
move.b d2,480(a0)
...
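
A generator can pick between the two forms per texel. Going by the standard 68000 timing tables (an assumption, since the post does not name the target CPU), a memory-to-d16(An) byte move costs 16 cycles, a load into a data register 8, and a register-to-d16(An) store 12, so caching in d2 breaks even at two repeats (32 vs 8+24) and wins from three (48 vs 8+36). The Python sketch below uses that threshold; the function name and the mapping are illustrative.

```python
TEXTURE_SIZE = 64       # texels per (rotated) texture column
BUFFER_STRIDE = 160     # bytes per buffer row


def emit_cached_column(wall_height, min_run=3):
    """Emit moves that cache long texel runs in d2 (hypothetical helper)."""
    if wall_height < TEXTURE_SIZE:
        raise ValueError("this variant assumes wall_height >= 64")
    # Assumed nearest-neighbour mapping from screen row to texel index.
    texel = [y * TEXTURE_SIZE // wall_height for y in range(wall_height)]
    lines = []
    y = 0
    while y < wall_height:
        # Count how many consecutive rows show the same texel.
        run = 1
        while y + run < wall_height and texel[y + run] == texel[y]:
            run += 1
        if run >= min_run:
            # Long run: load the texel once, then store from the register.
            lines.append("move.b (a1)+,d2")
            for r in range(run):
                lines.append("move.b d2,%d(a0)" % ((y + r) * BUFFER_STRIDE))
        else:
            # Short run: memory-to-memory, incrementing on the last use.
            for r in range(run):
                src = "(a1)+" if r == run - 1 else "(a1)"
                lines.append("move.b %s,%d(a0)" % (src, (y + r) * BUFFER_STRIDE))
        y += run
    return lines
```

A wall height of 192 reproduces the zoom factor 3.0 listing above, while 128 falls back to the plain post-increment form.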

It could also be a good idea to have a look at the cases for zoom factor <1.0, to see whether a combination of post increments and addq.l #value,a1 could speed up the rendering, like:

Code:

...
move.b (a1)+,0(a0)
move.b (a1)+,160(a0)
addq.l #1,a1 ; skipping an additional byte
move.b (a1)+,320(a0)
move.b (a1)+,480(a0)
...
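
To experiment with this, a generator could emit a post-increment read per row and insert addq where texels are skipped. The Python sketch below does that under an assumed nearest-neighbour mapping, with an illustrative function name. On stock 68000 timings (an assumption) addq.l #n,a1 costs 8 cycles against 4 saved per read, so this only pays off when skips are less frequent than every second row; for very small walls plain offsets are likely cheaper.

```python
TEXTURE_SIZE = 64       # texels per (rotated) texture column
BUFFER_STRIDE = 160     # bytes per buffer row


def emit_skip_column(wall_height):
    """Emit post-increment reads plus addq skips (hypothetical helper)."""
    if not 0 < wall_height <= TEXTURE_SIZE:
        raise ValueError("this variant assumes 0 < wall_height <= 64")
    # Assumed nearest-neighbour mapping from screen row to texel index.
    texel = [y * TEXTURE_SIZE // wall_height for y in range(wall_height)]
    lines = []
    for y in range(wall_height):
        lines.append("move.b (a1)+,%d(a0)" % (y * BUFFER_STRIDE))
        if y + 1 < wall_height:
            # (a1)+ already advanced by one; skip the remaining texels.
            skip = texel[y + 1] - texel[y] - 1
            while skip > 0:
                n = min(skip, 8)    # addq takes an immediate of 1..8
                lines.append("addq.l #%d,a1" % n)
                skip -= n
    return lines
```

For example, a wall height of 48 yields three reads followed by one `addq.l #1,a1`, repeating down the column.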

One idea I had would be to rotate the whole chunky buffer 90 degrees, which would let me draw to the buffer with the destination using (a0)+ instead of offset(a0), saving a lot of cycles. That would of course require rewriting the c2p routine.
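
To put a rough number on that saving, the Python sketch below emits both destination forms and estimates the cost per slice. The cycle counts are the standard 68000 figures for a byte move from (An) or (An)+ to d16(An) (16 cycles) and to (An)+ (12 cycles); the target CPU is an assumption, bus contention is ignored, and the function names are made up for illustration.

```python
TEXTURE_SIZE = 64       # texels per (rotated) texture column
BUFFER_STRIDE = 160     # bytes per buffer row in the unrotated buffer

# Assumed 68000 cycle counts for move.b from (a1) or (a1)+,
# keyed by destination addressing mode.
CYCLES = {"offset": 16, "postinc": 12}


def emit_column(wall_height, rotated_buffer):
    """Emit moves for a stretched wall slice (hypothetical helper)."""
    if wall_height < TEXTURE_SIZE:
        raise ValueError("this variant assumes wall_height >= 64")
    texel = [y * TEXTURE_SIZE // wall_height for y in range(wall_height)]
    lines = []
    for y in range(wall_height):
        last_use = y == wall_height - 1 or texel[y + 1] != texel[y]
        src = "(a1)+" if last_use else "(a1)"
        # A rotated buffer stores a screen column contiguously.
        dst = "(a0)+" if rotated_buffer else "%d(a0)" % (y * BUFFER_STRIDE)
        lines.append("move.b %s,%s" % (src, dst))
    return lines


def estimate_cycles(wall_height, rotated_buffer):
    """Rough per-slice cost: one move per row, no other overhead."""
    return wall_height * CYCLES["postinc" if rotated_buffer else "offset"]
```

Under these assumptions a 128-pixel slice drops from 2048 to 1536 cycles, about 25% less, before accounting for any extra cost in the rewritten c2p routine.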

I hope my explanation made some sense. Let's see if anyone has some other (hopefully better) approaches for this.
