05 January 2018, 19:19 | #1 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Optimizing Wolf3D-style rendering
As I promised in the other Wolfenstein 3D thread, I started a new thread for a more technical discussion on texture-mapped wolf3d-style rendering. Let's keep the discussion on topic, and leave daydreaming (and preferably discussion about non-texture-mapped approaches) to another thread.
I built another preview of my routine (http://dekadence64.org/wolf3d_v2.lha), which should run at least a bit smoother than the previous one. Still room for more optimizing though.
I also included a binary blob of raycasted data (wolf.3d), if anyone wants to try out the wall rendering, without having to write a raycaster. The format is pretty simple, 1024 frames of 320 bytes each. Every frame consists of 2 bytes per slice (160 slices in total), first byte being the height of the wall (0-127) and second byte being the texture u-coordinate (0-63 for texture one, 64-127 for texture two, textures being 64x64 in size).
So, about the wall rendering. The simple approach would be to have unrolled loops of code for rendering the wall from top to bottom, basically amounting to a lot of:
Code:
...
move.b 0(a1),0(a0)
move.b 64(a1),160(a0)
move.b 128(a1),320(a0)
move.b 192(a1),480(a0)
...
So, for cases where the wall size is smaller than the height of the texture, I still use offsets. But when we start stretching the texture (wall size 64 and above), I drop the offsets and start doing post increments instead. So for example for doubled size I'd do:
Code:
...
move.b (a1),0(a0) ; zoom factor 2.0, each texel drawn two times
move.b (a1)+,160(a0)
move.b (a1),320(a0)
move.b (a1)+,480(a0)
...
Code:
...
move.b (a1)+,d2 ; zoom factor of 3.0, each texel drawn three times
move.b d2,0(a0)
move.b d2,160(a0)
move.b d2,320(a0)
move.b (a1)+,d2
move.b d2,480(a0)
...
Code:
...
move.b (a1)+,0(a0)
move.b (a1)+,160(a0)
addq.l #1,a1 ; skipping an additional byte
move.b (a1)+,320(a0)
move.b (a1)+,480(a0)
...
I hope my explanation made any sense, let's see if anyone might have some other (hopefully better) approaches for this.
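For anyone who wants to inspect the wolf.3d blob on a host machine before writing any 68000 code, here is a minimal Python sketch of the format described in this post (1024 frames, 160 slices per frame, 2 bytes per slice). The function names, and the choice to split the u-coordinate into a texture index and a column, are assumptions made for illustration and are not taken from the demo.

```python
FRAMES = 1024               # frames in the blob, per the post
SLICES = 160                # wall slices per frame
FRAME_BYTES = SLICES * 2    # 320 bytes: (height, u) per slice


def slice_offset(frame, column):
    """Byte offset of one slice record inside the blob."""
    return frame * FRAME_BYTES + column * 2


def read_slice(blob, frame, column):
    """Return (height, texture_index, texture_column) for one slice.

    height is 0-127. u is 0-63 for texture one and 64-127 for texture
    two, so the texture index is u // 64 and the column is u % 64.
    """
    off = slice_offset(frame, column)
    height, u = blob[off], blob[off + 1]
    return height, u // 64, u % 64


def read_frame(blob, frame):
    """Return all 160 slices of one frame as a list of tuples."""
    return [read_slice(blob, frame, c) for c in range(SLICES)]


if __name__ == "__main__":
    # Synthetic stand-in for wolf.3d so the sketch runs without the file.
    demo = bytearray(FRAMES * FRAME_BYTES)
    off = slice_offset(3, 10)
    demo[off:off + 2] = bytes([100, 70])
    print(read_slice(demo, 3, 10))
```

With the real file, `blob = open("wolf.3d", "rb").read()` would replace the synthetic buffer, assuming the file has no header in front of the frame data.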
05 January 2018, 22:00 | #2 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Neat! I like your testcase definition.
I presume you are targeting A500. If so, a couple of things:
Use mipmapping. By using mipmapping you can keep a lid on the worst-case minification that you will ever do. For example, if you do log2 mipmaps (128, 64, 32, ...) then you will never minify by more than a factor of 2x. This may simplify the rendering.
The total time spent is roughly (number of pixels mapped) * (cycles per pixel for mapping) + (c2p time). It looks like typically 50% of on-screen pixels are being mapped. The goal will be to minimize the overall time.
If you have log2 mipmapping, then you can render a pair of pixels in one go: you know that two consecutive MOVE.Bs from the original routine will either fetch the texel pair texel(x,y) & texel(x,y+1), or texel(x,y) & texel(x,y), and write this pair to screen(x,y) & screen(x,y+1).
Now, you could make two versions of each texture. Generate one which contains the data %a3a3a2a2a1a1a0a0, and another which contains %a3A3a2A2a1A1a0A0. (a = texel(x,y), A = texel(x,y+1)) This way, each byte contains a texel pair, either the same texel duplicated, or the texel combined with the texel below. Generate rendering code like this:
Code:
move.b offs0(a0),0(a1)
move.b offs1(a0),320(a1)
move.b offs2(a0),640(a1)
...
There is still a hefty cost for the c2p conversion. You get output like:
a3A3a2A2a1A1a0A0
b3B3b2B2b1B1b0B0
c3C3c2C2c1C1c0C0
d3D3d2D2d1D1d0D0
...
which is not great for c2p merging, but reordering columns can make it better. I think this will turn out quicker in the end.
If you have more memory available for extra versions of the texture (so you can keep one %..xx..xx..xx..xx and one %xx..xx..xx..xx.. version, or similar) then it may turn out quicker to collect multiple pixels in a register by MOVE/OR combinations. You may also be able to get the data slightly better structured for the c2p pass.
I have wondered whether or not it helps to store the screen swizzled in, say, 4x4 or similar pixel groups. With a fixed group size it may be possible to collect more things in registers and then dump them out.
Pardon if this is a bit hand-wavy - I'm having a pretty bad cold today.
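The "never minify by more than 2x" property of log2 mipmaps can be checked with a few lines of Python. This is only a sketch of one possible selection rule (pick the smallest power-of-two mip that is at least as tall as the wall slice); the function name `pick_mip` and the default mip range are assumptions, not something specified in the thread.

```python
def pick_mip(height, largest=128, smallest=1):
    """Return the mip size to use for a wall slice `height` pixels tall.

    Picks the smallest power-of-two mip that is still >= height, so the
    texture is minified by less than 2x. Walls taller than the largest
    mip are magnified from the largest mip instead.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    mip = smallest
    while mip < height and mip < largest:
        mip *= 2
    return mip


if __name__ == "__main__":
    for h in (1, 33, 64, 65, 200):
        print(h, pick_mip(h))
```

A renderer that only wants to magnify (reading the texture with post-increments and no skips) would instead pick the largest mip that is no taller than the wall; the loop is the same with the comparison flipped.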
06 January 2018, 09:48 | #3 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
After some sleep, I realized that there's no benefit to scrambling pixel order. Just having one texture with data %a3a2a1a0a3a2a1a0 and another with %a3a2a1a0A3A2A1A0 (so one pixel in the high nibble and another in the low nibble) works just as well. Then run 4bit, 2bit and 1bit merges with the blitter.
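As a rough illustration of the two nibble-packed texture versions, the following Python sketch builds both from one column of 4-bit texels. The name `pack_pairs` and the handling of the last texel (paired with itself, since there is no row below it) are assumptions made for this example.

```python
def pack_pairs(column):
    """Build the two packed variants of a column of 4-bit texels.

    Returns (same, below): in `same` each byte holds texel y in both
    nibbles; in `below` the high nibble is texel y and the low nibble is
    texel y+1. The last texel has no row below it, so it is paired with
    itself here (an arbitrary choice for this sketch).
    """
    if any(not 0 <= t <= 15 for t in column):
        raise ValueError("texels must be 4-bit values")
    same = bytes((t << 4) | t for t in column)
    nxt = list(column[1:]) + list(column[-1:])
    below = bytes((t << 4) | n for t, n in zip(column, nxt))
    return same, below


if __name__ == "__main__":
    same, below = pack_pairs([1, 2, 15])
    print(same.hex(), below.hex())
```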
|
06 January 2018, 10:30 | #4 | ||
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Quote:
Quote:
Code:
byte 0 - plane 0
byte 1 - plane 1
byte 2 - plane 2
byte 3 - plane 3
byte 4 - plane 0
byte 5 - plane 1
...
||
06 January 2018, 10:38 | #5 | ||||
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Quote:
Quote:
Frame 17 Column 80 : $0F3A to $0E62
Frame 47 Column 80 : $4044 to $125F
Frame 48 Column 80 : $427F to $125F
Frame 86 Column 80 : $2C35 to $1E66
I've attached the fixed blob packed with lha.
Quote:
|
||||
06 January 2018, 12:14 | #6 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Just an idea: If you render from the middle to the upper and lower end, you can exploit some symmetry. If you put texel i above the center, you'll always put texel h-i-1 below (h = texture height). If you store your texture rotated 90 degrees and scramble it like 0,h-1,1,h-2,2,h-3..., you can read and write a word with two pixels instead of a byte with one pixel, thereby doubling the speed. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or you could maybe even manage to bake it into the c2p?), but you could still come out faster.
EDIT: Here's a nice example where a byte-swap instruction is missing...
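The scrambled order 0,h-1,1,h-2,... is easy to precompute offline. Below is a small Python sketch of that reordering for one texture column; the function name `mirror_interleave` and the restriction to even heights are assumptions for the sake of the example.

```python
def mirror_interleave(column):
    """Reorder one texture column as 0, h-1, 1, h-2, 2, h-3, ...

    Each consecutive byte pair then holds texel i and its mirror texel
    h-i-1, so a single word read fetches both. Assumes an even height.
    """
    h = len(column)
    if h % 2:
        raise ValueError("column height must be even")
    out = bytearray()
    for i in range(h // 2):
        out.append(column[i])
        out.append(column[h - i - 1])
    return bytes(out)


if __name__ == "__main__":
    print(list(mirror_interleave(bytes(range(8)))))
```

For a 64-texel column this gives 32 words, each covering one row above the center and its mirror row below.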
06 January 2018, 13:44 | #7 | |||||
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Hm, I realized a mistake: pixel pairs will alternate between [texel(x,y) and texel(x,y)] and [texel(x,y) and texel(x,y+1)] when magnifying. When minifying in the scale range 0.5 ... 1.0 then the pixel pairs will alternate between [texel(x,y) and texel(x,y+1)] and [texel(x,y) and texel(x,y+2)]. Still, 3 different versions of a texture is acceptable I think to be able to support arbitrary magnification through scaling down to 0.5.
Quote:
Quote:
Quote:
Quote:
Parameters used below: 160 slices in a frame, 320x128 pixels of planar frame buffer, 4 bitplanes. 50% of the frame buffer will be covered by vertical spans, on average.
Let's say we combine Britelite's rendering of horizontal pixel pairs (for dithering) with chb's idea of byte-interleaving row 0 and 127, row 1 and 126, etc. Then, the render operation is just MOVE.W xx(an),yy(am). This will fetch and write two mirrored 2x1 pixel pairs. This results in %a3b3a2b2a1b1a0b0 A3B3A2B2A1B1A0B0 where a,b are two consecutive pixels on line 0 and A,B are two consecutive pixels on line 127. The blitter work required to unscramble is an 8bit, a 4bit and a 2bit pass.
CPU time for drawing spans: 20 cycles * 0.5 * 160*128 * 50% = 102400 cycles
Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing
We can get rid of the 8bit pass by changing the CPU logic to do MOVE.W xx(an),dn / MOVEP.W dn,yy(am). This enables us to word-interleave lines instead of byte-interleaving them. This will take 12+16 = 28 cycles per MOVE/MOVEP pair, so it increases CPU time to 143360 cycles, but it probably results in just 40kB of blitter output (so ~1 frame worth of blitter processing).
If we go with Britelite's 2x1 dithered pixel format + chb's mirroring idea + my idea of precomputing pixel pairs vertically (but ignore MOVEP), then we end up with the first 4 bytes of the framebuffer containing pixel data for row 0, then row 127, then row 1, then row 126. We perform MOVE.L xx(an),yy(am) to fetch and write two mirrored 2x2 pixel pairs.
CPU time for drawing spans: 28 cycles * 0.25 * 160*128 * 50% = 71680 cycles
Blitter output: 3 * (320*128 * 4 / 8) = 60kB = ~1.5 frames worth of processing
If we do the previous thing but replace the MOVE.L xx(an),yy(am) with a MOVE.L xx(an),dn / MOVEP.L dn,yy(am) pair, then the CPU time spent will reach 102400 cycles but blitter output should again be 40kB (so ~1 frame of blitter work).
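The estimates in this post can be reproduced with a short Python calculation, which also makes it easy to try other coverage or resolution figures. The variable and function names are made up for this sketch. The 16+24 cycle split for the MOVE.L/MOVEP.L pair is not spelled out in the post; it is the standard 68000 timing for those two instructions and matches the 102400 total quoted.

```python
SLICES = 160      # 2-pixel-wide slices per frame
ROWS = 128        # frame buffer height
COVERAGE = 0.5    # fraction of the frame covered by wall spans

PLANAR_BYTES = 320 * 128 * 4 // 8   # one full pass over 4 bitplanes


def cpu_cycles(cycles_per_op, ops_per_slice_row):
    """Estimated 68000 cycles per frame for drawing the wall spans."""
    return int(cycles_per_op * ops_per_slice_row * SLICES * ROWS * COVERAGE)


def blitter_bytes(passes):
    """Bytes written by the blitter for the given number of merge passes."""
    return passes * PLANAR_BYTES


# name: (CPU cycles per frame, blitter bytes per frame)
variants = {
    "MOVE.W mem,mem": (cpu_cycles(20, 0.5), blitter_bytes(3)),
    "MOVE.W + MOVEP.W": (cpu_cycles(12 + 16, 0.5), blitter_bytes(2)),
    "MOVE.L mem,mem": (cpu_cycles(28, 0.25), blitter_bytes(3)),
    "MOVE.L + MOVEP.L": (cpu_cycles(16 + 24, 0.25), blitter_bytes(2)),
}

if __name__ == "__main__":
    for name, (cycles, blit) in variants.items():
        print(f"{name:18s} {cycles:7d} cycles  {blit // 1024} kB blitter")
```

None of this accounts for chip RAM contention between CPU and blitter, so the figures are best read as relative comparisons between the four variants.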
|||||
06 January 2018, 23:47 | #8 |
Senior Member
Join Date: Jan 2003
Location: Paris
Posts: 134
|
I made a quick video of the latest wolf3D_v2 demo running on UAE (A500 unexpanded configuration) so that anyone can see your impressive work
Is there a way to exit the demo? Didn't find it.
07 January 2018, 03:31 | #9 |
Zone Friend
Join Date: Mar 2004
Location: Middle Earth
Age: 40
Posts: 2,129
|
The status bar at the bottom, is that a different bitplane pointer in the copperlist? Also, what about his face, is that a 16 colour sprite? Looks wicked.
|
07 January 2018, 12:29 | #10 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Not a different bitplane pointer. The status bar is copied to the screen buffer (twice for double buffering), and then the colors are changed on the raster line where the status bar starts. But it's only for display, to make it look more like Wolfenstein. It's just a normal 16 color picture, no sprites used.
|
07 January 2018, 18:08 | #11 |
Junior Member
Join Date: Aug 2001
Location: France
Posts: 1,385
|
Indeed, it's good work getting this running on an A500.
|
07 January 2018, 23:33 | #12 |
J.M.D - Bedroom Musician
Join Date: Apr 2014
Location: los angeles,ca
Posts: 3,603
|
I'm positively impressed - that looks darn good for a 500. Now what about step 2 (having the player control the walk, and objects around)? I personally think that if this routine can run all the stuff at 25fps (2 frames PAL) or 17fps (3 frames PAL), that should be good enough, but that's me.
|
08 January 2018, 00:27 | #13 |
Registered User
Join Date: Aug 2006
Location: Scunthorpe/United Kingdom
Posts: 2,094
|
Is that really running on a plain unexpanded A500? Looks like it's got a FastRAM expansion added in. How much RAM does the demo need?
|
08 January 2018, 08:18 | #14 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Yes, it runs on an A500 512k+512k. Currently the rendering doesn't even make use of Fast RAM, as both the screen buffers and the wall rendering code are in Chip RAM.
|
08 January 2018, 08:22 | #15 | |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Quote:
|
|
08 January 2018, 09:29 | #16 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Of course, in some cases where there is data in both the top and bottom pixel, it would work great. Unless I'm clueless on how to implement sprites.
|
08 January 2018, 11:20 | #17 | ||
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Quote:
Quote:
|
||
08 January 2018, 15:37 | #18 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
@kalms: Great estimation! And clever use of movep. I was actually considering that instruction for a moment, but was somehow under the impression that it works only on word boundaries, duh!
Writing out four pixels in a longword has another nice effect: As you are completely free to scramble the line order in the display buffer by using a custom copper list (e.g. you can use bpl 0 line 0, bpl 0 line 127, bpl 1 line 0,.. bpl 0 line 1, bpl 0 line 126,...), you can put all high-order bits (x3 and x2) in the first word and the low-order bits (x1 and x0) in the second word. This does not speed up the c2p, but has an advantage: Use a 2-bit texture, and you can load the 4 texels using a move.w (a0)+,Dn instead of a move.l, as the remaining bits (upper word in Dn) are constant. Either write out a longword with constant upper word (easier) or fill with the blitter afterwards (more complicated, but potentially faster).
You'd need a second set of unrolled code for the 2-bit textures, so higher memory usage, but the textures will be smaller. Most efficiently, one would probably optimize the palette so that often-used textures (like the stone walls in wolf) each fit in one 4-color block. Not all lines in the texture need to use the same high-order bits, and not all lines need to use 2-bit textures. So a couple of columns with additional colors are possible and would not degrade performance a lot.
And another idea to burn memory: Anisotropic mipmaps, so textures scaled in x direction (in addition to the usual mipmaps). As you always plot columns two pixels wide, the texture may look strange when scaling is non-uniform (e.g. a wall at a steep angle). Doubles memory usage for textures, though.
Anyway, I guess I should dig out my assembler to put some code where my mouth is. Just fearing I'm quite a bit rusty. (And I quite dislike Wolfenstein as a game)
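To see which parts of a texture would qualify for the 2-bit treatment, one could scan each stored texture line offline. The Python sketch below is only an illustration under assumptions: it takes a texture as a list of columns of 4-bit colour indices, and treats a column as eligible when all its texels share the same upper two bits (x3 and x2). The function name and the return layout are invented for this example.

```python
def two_bit_columns(texture):
    """For each column of 4-bit colour indices, report whether it fits 2 bits.

    A column qualifies when all its texels share the same upper two bits
    (x3, x2), i.e. they fall into one 4-colour block of the palette.
    Returns a list of (eligible, high_bits, low_bits) per column, where
    high_bits is None and low_bits is empty for non-eligible columns.
    """
    result = []
    for column in texture:
        highs = {c >> 2 for c in column}
        if len(highs) == 1:
            result.append((True, highs.pop(), [c & 3 for c in column]))
        else:
            result.append((False, None, []))
    return result


if __name__ == "__main__":
    tex = [[4, 5, 7, 6], [0, 9, 1, 2]]
    for entry in two_bit_columns(tex):
        print(entry)
```

Counting the eligible columns per texture gives a quick measure of how much a given palette ordering would benefit from the second set of unrolled code.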
09 January 2018, 18:20 | #19 |
Registered User
Join Date: Nov 2015
Location: Italy
Posts: 192
|
Random stupid idea: have a completely unrolled chunky2planar routine for a vertical stripe (16 x screenheight or 32 x screenheight) and do the chunky rendering/raycasting directly into that code by modifying it (~move.l #$12345678,d0 -> poke into move.l opcode+2). So no extra chunky buffer, because it is nested inside the code itself.
|
09 January 2018, 19:22 | #20 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
I had some spare time and decided to implement mipmapping in the rendering, which got rid of the displacements when reading the texture. So, now the code is pretty much:
Code:
...
move.b (a1)+,0(a0)
move.b (a1),160(a0)
move.b (a1)+,320(a0)
...
Code:
...
move.b (a1)+,d2
move.b d2,0(a0)
move.b d2,160(a0)
move.b d2,320(a0)
move.b (a1)+,d2
...
The routine still only renders bytes, so next up would be to try out chb's method of rendering in pairs. With this approach I might also try turning the framebuffer 90 degrees, so I could do the writing with pre/post-increments instead of displacements, saving a few additional cycles, as an additional blitter pass will be required anyway for handling the byte-pairs.
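Unrolled code like this is usually generated rather than typed by hand. The following Python sketch emits 68000 source text in the same style for one wall column, assuming the mip being read is no taller than the wall (so texels are only repeated, never skipped) and a 160-byte row stride as in the snippets in this thread. The function name `emit_column` is hypothetical, and the sketch does not apply the register trick (going through d2) shown for texels repeated three times.

```python
def emit_column(height, mip, stride=160):
    """Emit unrolled 68000 source for one wall column `height` rows tall.

    `mip` is the texture height being read, with mip <= height so every
    texel is drawn at least once. Rows that repeat a texel read (a1)
    without advancing; the last row of each texel uses (a1)+.
    """
    if not 0 < mip <= height:
        raise ValueError("need 0 < mip <= height")
    lines = []
    for y in range(height):
        texel = y * mip // height
        last_row = y == height - 1
        advance = last_row or ((y + 1) * mip // height) != texel
        src = "(a1)+" if advance else "(a1)"
        lines.append(f"\tmove.b\t{src},{y * stride}(a0)")
    return lines


if __name__ == "__main__":
    print("\n".join(emit_column(3, 2)))
```

Running it once per wall height and concatenating the results (each followed by an rts) would give a jump table of column routines; the largest displacement, 127 * 160, still fits in the 16-bit signed offset of the d16(An) addressing mode.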