Optimizing Wolf3D-style rendering - Page 3

LaBodilsen · 13 January 2018, 14:45

Quote:

Originally Posted by Master484

Could a system be used where the graphics of the previous frame are never cleared, but instead we preserve them, and when drawing a new frame we only draw those pixels that have been changed since the last frame, and therefore need to be updated.

And those pixels that remain the same color as they did last frame, are simply skipped.

So every frame we would go through the pixels one by one, and check the current color versus the color that it should be; and draw the pixel only in the case where the "should be color" is different from the current color.

I think in quite many cases the individual pixel colors in two consecutive frames would be the same, and also you would never need to totally "clear" the screen. So could this method work or boost the speed?

explained for a non-coder

what is currently happening:

Code:

copy TexturePixel1 to ScreenPixel1
copy TexturePixel2 to ScreenPixel2
copy TexturePixel3 to ScreenPixel3
etc

what you suggest would be

Code:

Read ScreenPixel1
Compare ScreenPixel1 to TexturePixel1
IF not equal:	Write TexturePixel1 to ScreenPixel1
If equal:	Read ScreenPixel2
Compare ScreenPixel2 to TexturePixel2
IF not equal:	Write TexturePixel2 to ScreenPixel2
If equal:	Read ScreenPixel3
Compare ScreenPixel3 to TexturePixel3
IF not equal:	Write TexturePixel3 to ScreenPixel3
If equal etc.....

So when the pixels are not equal, the current way is _MUCH_ faster than your suggestion. And if the pixels are equal, the current way is still faster, as you would have to read each screen pixel anyway to make the comparison. Hope it makes sense.
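A rough way to check the argument above is to count screen memory accesses for both approaches. This is only a sketch: the function names are made up for illustration, and it ignores texture reads as well as the extra compare and branch instructions, which would make the second approach look even worse on a real 68000.

```python
def plain_copy(texture, screen):
    """Current approach: write every pixel unconditionally.
    Returns (screen_reads, screen_writes)."""
    for i, pixel in enumerate(texture):
        screen[i] = pixel
    return 0, len(texture)


def compare_then_write(texture, screen):
    """Suggested approach: read each screen pixel, write only if it differs.
    Returns (screen_reads, screen_writes)."""
    writes = 0
    for i, pixel in enumerate(texture):
        if screen[i] != pixel:      # the compare always costs one screen read
            screen[i] = pixel
            writes += 1
    return len(texture), writes


texture = [1, 2, 3, 4]
print(plain_copy(texture, [1, 2, 0, 0]))          # (0, 4): 4 screen accesses
print(compare_then_write(texture, [1, 2, 0, 0]))  # (4, 2): 6 screen accesses
print(compare_then_write(texture, [1, 2, 3, 4]))  # (4, 0): still 4 screen accesses
```

Even when every pixel is unchanged, the compare version touches screen memory as often as the plain copy does, so it can never come out ahead.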

Master484 · 13 January 2018, 18:31

Thanks for the explanation. Indeed that makes my idea look quite slow. So back to the drawing board.

Although in theory the idea is fascinating, because the previous frame always approximately contains the same image that we are going to draw next. So if the old data could be somehow used to build the current frame, without too many reads and comparisons, then something like this might be useful.

But right now it's just an interesting theory, and I don't know how to make it work.

chb · 13 January 2018, 21:27

Quote:

Originally Posted by Master484

Although in theory the idea is fascinating, because the previous frame always approximately contains the same image that we are going to draw next

That's a wrong assumption, or a misconception of what "approximately" means in this context. Imagine walking in z-direction in the game. The zoom factor for every column changes and, because of perspective, the columns also move horizontally. No wall pixel will stay in its place, at least not in an easily predictable way.

You might think of something like video motion compensation, where blocks of pixels are moved to approximate the next frame. But that's not possible here: predicting the next frame would be much slower than just rendering it.

What could work: There are sometimes large parts of the image that are not textured - floor and ceiling. One could divide the screen into vertical stripes 16 or 32 pixels wide, determine for each stripe the maximum distance from the center in y, and do c2p and clearing only on that part. But there are a number of disadvantages: Depending on your chunky buffer layout, this might be rather hard to do; you have to call the blitter for c2p much more often (n times for every stripe, instead of n times for the whole image), which introduces some overhead - more if you use interrupts, less if you do copper waits, but in the latter case you have to be careful with frame boundaries (the copper list is restarted at vertical blank); it improves only some cases (lots of floor & ceiling), but not all - so the frame rate could be quite unsteady, albeit higher on average. Probably not that desirable for a game. On the other hand, if you have a lot of columns with a high zoom factor (so close to a wall), you might in general see fewer enemies and other sprites, which speeds up rendering and could compensate for this.
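The stripe bookkeeping described above could look roughly like the sketch below. It assumes the renderer already knows, for every screen column, how far the wall extends above and below the centre line (`half_heights`, a hypothetical name); the function names and the 16-pixel default are illustrative only, and the actual c2p and blitter setup is left out.

```python
def stripe_extents(half_heights, stripe_width=16):
    """For each vertical stripe, return the largest wall half-height
    (distance from the centre line, in pixels) among its columns."""
    return [
        max(half_heights[x:x + stripe_width])
        for x in range(0, len(half_heights), stripe_width)
    ]


def stripe_rows(extent, screen_height):
    """Rows (top, bottom_exclusive) of one stripe that contain wall pixels
    and therefore need c2p and clearing; the rest is floor or ceiling."""
    centre = screen_height // 2
    top = max(0, centre - extent)
    bottom = min(screen_height, centre + extent)
    return top, bottom


# Two stripes of 16 columns: a distant wall, then a partly closer one.
half_heights = [10] * 16 + [40] * 8 + [5] * 8
for extent in stripe_extents(half_heights):
    print(stripe_rows(extent, 128))
```

A stripe that is completely filled by a nearby wall degenerates to the full screen height, which is why the gain depends so much on the scene.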

Master484 · 14 January 2018, 10:59

I thought a little bit more about my idea, and I came up with a version that doesn't need any data reads or pixel comparisons.

So we would start with the same thing: no screen clearing and the previous frame is always preserved.

But when drawing the next frame, we only draw every other column. So if screen has 320 columns, we only update 160 of them. This results in a screen where every other column is new and every other is from the previous frame. And when the next frame arrives then we change the draw order: if in previous frame we drew columns 1, 3 and 5 then in this frame we draw columns 2, 4 and 6.

So this method would surely be faster: no screen clearing, and only need to raycast half of the columns each frame.

However, I'm not sure if this would look good. But because every other column would be 100% right, and the other half would be "approximately right", then maybe it could look acceptable?

---

Also, a more advanced version of this could be something like this: we draw every column, but skip every other pixel in the columns.

So in each column only every other pixel would be updated, and the rest would be from the previous frame. And for the next column we change the order, so if in the current column we updated pixels 1, 3 and 5 then in the next column we update pixels 2, 4 and 6.

So the resulting screen would have a "chessboard pattern" of new and old pixels. And when the frame changes, then we switch the pixel updating order in the columns, so that the old and new pixels alternate every frame.

But again, would this look good? Don't know.

Galahad/FLT · 14 January 2018, 11:38

Quote:

Originally Posted by Master484

I thought a little bit more about my idea, and I came up with a version that doesn't need any data reads or pixel comparisons.

So we would start with the same thing: no screen clearing and the previous frame is always preserved.

But when drawing the next frame, we only draw every other column. So if screen has 320 columns, we only update 160 of them. This results in a screen where every other column is new and every other is from the previous frame. And when the next frame arrives then we change the draw order: if in previous frame we drew columns 1, 3 and 5 then in this frame we draw columns 2, 4 and 6.

So this method would surely be faster: no screen clearing, and only need to raycast half of the columns each frame.

However, I'm not sure if this would look good. But because every other column would be 100% right, and the other half would be "approximately right", then maybe it could look acceptable?

---

Also, a more advanced version of this could be something like this: we draw every column, but skip every other pixel in the columns.

So in each column only every other pixel would be updated, and the rest would be from the previous frame. And for the next column we change the order, so if in the current column we updated pixels 1, 3 and 5 then in the next column we update pixels 2, 4 and 6.

So the resulting screen would have a "chessboard pattern" of new and old pixels. And when the frame changes, then we switch the pixel updating order in the columns, so that the old and new pixels alternate every frame.

But again, would this look good? Don't know.

Alternating the pixel drawing would result in the whole screen appearing to "flash" as it redraws, which in lowres would be magnified over something like hires.

Kalms · 14 January 2018, 14:09

@master484 How about you make proof-of-concepts of your ideas? If it is just a question of "what will the end result look like" and not "how quick will it run on the a500" then you don't need to implement a full Wolf3d renderer yourself; you can simulate the various ideas by doing video processing offline on a set of frame captures. (For source data, you could run wolf3d_v2 in an amiga emulator and do a video capture.) That way you will better understand which ideas are worth sharing, and that in turn saves time for others in the thread.
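As a starting point for such an offline test, the two interleaving ideas from earlier in the thread can be simulated on captured frames. The sketch below assumes each frame is a list of rows of palette indices; the function names are hypothetical, and loading the captures and writing the result back to video is left out.

```python
def interleave_columns(previous, current, frame_index):
    """Keep the previous image and refresh only every other column.
    Even frames refresh even columns, odd frames refresh odd columns."""
    parity = frame_index % 2
    return [
        [cur if x % 2 == parity else old
         for x, (old, cur) in enumerate(zip(old_row, cur_row))]
        for old_row, cur_row in zip(previous, current)
    ]


def interleave_checkerboard(previous, current, frame_index):
    """Refresh pixels in a chessboard pattern that flips every frame."""
    parity = frame_index % 2
    return [
        [cur if (x + y) % 2 == parity else old
         for x, (old, cur) in enumerate(zip(old_row, cur_row))]
        for y, (old_row, cur_row) in enumerate(zip(previous, current))
    ]


old_frame = [[0, 0, 0, 0], [0, 0, 0, 0]]
new_frame = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(interleave_columns(old_frame, new_frame, 0))       # [[1, 0, 1, 0], [1, 0, 1, 0]]
print(interleave_checkerboard(old_frame, new_frame, 0))  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

To simulate a running sequence, the output of one call would be fed back in as `previous` for the next captured frame, since the stale pixels come from what was on screen rather than from the ideal previous frame.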

britelite · 14 January 2018, 16:47

I have to agree with Kalms here, posting random suggestions without displaying any knowledge of the subject at hand is better suited for the other Wolf3d-thread.

Master484 · 14 January 2018, 18:01

Ok, I'll exit the thread and leave it to you guys.

I just thought to share these ideas, as they seemed good to me. But agreed, maybe it would be better to make some kind of proof-of-concept demo first. I do have a book about raycasting, so maybe I'll try to cook something with Blitz if I have time.

britelite · 17 January 2018, 21:13

Alright, had some spare time to tinker with the rendering again, and now the stream runs at a nice steady 25fps (around 1.6-1.9 frames). The rendering is now done horizontally, with the previously added double pixel blitter pass modified to also rotate the buffer back 90 degrees.

There's still room for improvement in the wall rendering (for example, the code slices could be modified to draw longwords where possible). But I think the next step would be to make the raycasting real time for interactivity, although this will probably not be happening any time soon.

The usual preview is available here

LaBodilsen · 18 January 2018, 08:19

Quote:

Originally Posted by britelite

Alright, had some spare time to tinker with the rendering again, and now the stream runs at a nice steady 25fps (around 1.6-1.9 frames).

i'm lost for words.. this is simply amazing.

Quote:

The rendering is now done horizontally, with the previously added double pixel blitter pass modified to also rotate the buffer back 90 degrees.

So you get the 90 degrees rotation almost for free, as you have to perform that blitter pass anyway to sort out upper and lower pixels?

Quote:

There's still room for improvement in the wall rendering (for example, the code slices could be modified to draw longwords where possible).

Would that really help? As the A500 memory bus is 16-bit, if you write a longword, wouldn't it still take as many cycles as writing 2 words? I think any improvement would be minimal.

Quote:

But I think the next step would be to make the raycasting real time for interactivity, although this will probably not be happening any time soon.

Maybe someone else can contribute to this. Would you mind if anyone used the wall renderer for their own project, as long as proper credits are given?

If we could make a kinda framework for a game engine, then maybe others would love to create a game (or port) with that.

Quote:

The usual preview is available here

So cool, something to play with over the weekend. and maybe combine it with the raycaster i'm currently trying to create.

britelite · 18 January 2018, 08:29

Quote:

Originally Posted by LaBodilsen

So you get the 90 degrees rotation almost for free, as you have to perform that blitter pass anyway to sort out upper and lower pixels?

Indeed, the pass is slightly slower now as I need to restart the blitter more often.

Quote:

Would that really help? As the A500 memory bus is 16-bit, if you write a longword, wouldn't it still take as many cycles as writing 2 words? I think any improvement would be minimal.

It might save a few cycles in some cases, but as you say it would probably be minimal.

Quote:

Maybe someone else can contribute to this. Would you mind if anyone used the wall renderer for their own project, as long as proper credits are given?

Of course not, would be happy to see something real materialize from this.

chb · 18 January 2018, 10:28

Wow, running in 1.6 -1.9 frames? Totally awesome! Congratulations, great achievement.

Quote:

Originally Posted by LaBodilsen

Would that really help? As the A500 memory bus is 16-bit, if you write a longword, wouldn't it still take as many cycles as writing 2 words? I think any improvement would be minimal.

The memory bus is only 16 bit, true, but you always need to fetch the instruction itself, too. That's why move.l is faster even on the A500: a move.w Dn,(a0)+ is 8 cycles (two memory accesses, one for the instruction and one for the data), a move.l is 12 cycles (three memory accesses, one for the instruction and two for the data). So one move.l Dn,(a0)+ takes 25% less time than two word moves. But that's a quite optimal case, for instructions that do more memory fetches (e.g. move.x (a0)+,(a1)+) the ratio of instruction fetches to memory access is lower and therefore the advantage smaller. On the other hand, if you use an instruction with complex address calculation like "d(An)" or "d(An,ix)", the gain may be even higher, as address calculations are always 32-bit anyway and have to be carried out twice for the two word instructions *- but then again that internal calculation is not slowed down by other DMA memory access.... So, it's complicated.

EDIT: The neogeo dev wiki has a nice table with instruction/addressing modes timings, AFAICS it's from the official 68000 manual, but with nicer formatting:
https://wiki.neogeodev.org/index.php...ctions_timings

EDIT2: *Hmm, does not seem to be true according to the table I linked...
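The move.w versus move.l arithmetic from this post can be written out as a small calculation. It is a sketch that only models the simple `Dn,(a0)+` case described above, with one bus access costing 4 cycles and no slowdown from DMA contention; the function name is made up for illustration.

```python
CYCLES_PER_ACCESS = 4  # each 16-bit bus access costs 4 CPU cycles


def move_dn_to_postinc(size_bytes):
    """Cycles for move.<size> Dn,(a0)+: one instruction fetch
    plus one data write per 16-bit word."""
    instruction_fetches = 1
    data_writes = size_bytes // 2
    return (instruction_fetches + data_writes) * CYCLES_PER_ACCESS


two_word_moves = 2 * move_dn_to_postinc(2)    # 2 * 8 cycles
one_long_move = move_dn_to_postinc(4)         # 12 cycles
saving = 1 - one_long_move / two_word_moves   # fraction of time saved

print(two_word_moves, one_long_move, saving)  # 16 12 0.25
```

The saving comes entirely from the instruction fetch that the second word move no longer needs.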

LaBodilsen · 18 January 2018, 11:51

@chb
makes sense, so at least some gain could be made by using longwords. Maybe not much, but for a game like this running on an A500, we would need to make use of all the tricks available.

@Britelite
Do you have any good ideas for sprites, as i would see that as the next performance killer.

Mathesar · 18 January 2018, 12:16

Quote:

Originally Posted by britelite

Alright, had some spare time to tinker with the rendering again, and now the stream runs at a nice steady 25fps (around 1.6-1.9 frames). The rendering is now done horizontally, with the previously added double pixel blitter pass modified to also rotate the buffer back 90 degrees.

Simply amazing!

About the raycaster. All tutorials I have seen state that you need to cast a ray for every horizontal pixel in the projection plane. Which is, in this case, 160 pixels, right? So 160 rays to be cast.

I was thinking, if the scene is simple, most rays would hit the same wall. What if you cast every other ray (only even rays) and, when two rays hit the same wall, just interpolate the ray in between? Might save a few cycles. You could do it even dirtier by only calculating every fourth ray or so and, when they hit the same wall, interpolating the other 3 rays.
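A sketch of that idea, under some loud assumptions: `cast_ray(x)` is a hypothetical function returning a `(wall_id, distance)` pair for screen column `x`, the skipped columns get a linearly interpolated distance (which is only an approximation of the true perspective-correct value), and texture coordinates are not handled at all.

```python
def cast_with_interpolation(cast_ray, num_columns, step=2):
    """Cast every `step`-th ray. Columns in between are interpolated when
    both neighbouring rays hit the same wall, otherwise they are cast too.
    Returns (hits, number_of_rays_actually_cast)."""
    hits = [None] * num_columns
    anchors = list(range(0, num_columns, step))
    if anchors[-1] != num_columns - 1:
        anchors.append(num_columns - 1)       # always cast the last column
    casts = 0
    for x in anchors:
        hits[x] = cast_ray(x)
        casts += 1
    for left, right in zip(anchors, anchors[1:]):
        wall_l, dist_l = hits[left]
        wall_r, dist_r = hits[right]
        for x in range(left + 1, right):
            if wall_l == wall_r:
                t = (x - left) / (right - left)
                hits[x] = (wall_l, dist_l + t * (dist_r - dist_l))
            else:
                hits[x] = cast_ray(x)         # wall edge in between: cast for real
                casts += 1
    return hits, casts


def demo_ray(x):
    """Toy scene: wall A recedes over columns 0-4, wall B is flat after that."""
    return ("A", 10 + x) if x < 5 else ("B", 20)


hits, casts = cast_with_interpolation(demo_ray, 9)
print(casts)  # 6 rays cast instead of 9
```

With larger steps a thin wall or a corner that falls entirely between two matching rays would be missed, so the step size trades accuracy for speed.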

britelite · 18 January 2018, 12:28

Quote:

Originally Posted by LaBodilsen

Do you have any good ideas for sprites, as i would see that as the next performance killer.

One idea would be to check the zoom factor of a visible sprite and, depending on that factor, choose one of two different methods of drawing. If the sprite is small enough, just scale the sprite/mask directly to the buffer with the CPU; if the sprite is larger, scale the sprite/mask vertically to a separate buffer with the CPU and then draw it to the screen buffer with the blitter, expanding it horizontally in the process.

LaBodilsen · 19 January 2018, 08:07

Quote:

Originally Posted by britelite

The usual preview is available here

I've had a chance to view the new preview, and the speed is simply amazing. But I'm not particularly fond of the mipmap artifacts, as they really degrade the visual quality.

Would it make much of a performance difference if the mipmapping was done differently? Instead of upscaling the smaller levels, downscale the larger level until the next mipmap level is reached, and compensate for the performance loss by using longwords where possible.

ex: for wall heights between 64-33 use the full-size texture, for 32-17 use the first mipmap level texture and for 16-0 use the lowest mipmap level texture. (hope I make sense)

I wouldn't mind if I had to manually hand-optimize every wall height below 64 pixels, as it's no more than 32 code segments.
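The level selection proposed in this post can be sketched as a simple lookup. The function names are hypothetical, and the 32px and 16px sizes of the smaller levels are an assumption (each level half the size of the previous one, starting from the 64px texture).

```python
def mipmap_level(wall_height):
    """Pick the texture that is at least as tall as the wall, so the
    wall is always drawn by downscaling: 0 = full size, 2 = lowest."""
    if wall_height > 32:
        return 0        # heights 64-33: full-size texture
    if wall_height > 16:
        return 1        # heights 32-17: first mipmap level
    return 2            # heights 16-0: lowest mipmap level


def texture_size_for(wall_height, full_size=64):
    """Source texture height in pixels for the chosen level."""
    return full_size >> mipmap_level(wall_height)


for height in (64, 48, 33, 32, 17, 16, 4):
    print(height, mipmap_level(height), texture_size_for(height))
```

Walls taller than 64 pixels also end up on level 0 here and would still have to be upscaled from the full-size texture.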

britelite · 19 January 2018, 08:22

Quote:

Originally Posted by LaBodilsen

But I'm not particularly fond of the mipmap artifacts, as they really degrade the visual quality.

I agree that the visual quality is degraded, but would it really matter during gameplay?

Quote:

Would it make much of a performance difference if the mipmapping was done differently? Instead of upscaling the smaller levels, downscale the larger level until the next mipmap level is reached, and compensate for the performance loss by using longwords where possible.

It would add 4 cycles to every pixel drawn with downscaling instead of upscaling, so the best case scenarios (when the walls are smaller and have fewer pixels to draw) would take a hit, but the worst case (when drawing full height strips) would remain the same. Mipmapping could also be added as an option in a game, so the player could choose between slightly better framerate or more precision.

LaBodilsen · 19 January 2018, 09:37

Quote:

Originally Posted by britelite

It would add 4 cycles to every pixel drawn with downscaling instead of upscaling, so the best case scenarios (when the walls are smaller and have fewer pixels to draw) would take a hit, but the worst case (when drawing full height strips) would remain the same.

Would it add 4 cycles in all cases if you used the approach that you suggested in the first post for walls smaller than the texture size?

Code:

...
move.w (a1)+,(a0)+
move.w (a1)+,(a0)+
addq.l #2,a1 ; skipping an additional word
move.w (a1)+,(a0)+
move.w (a1)+,(a0)+
...

Of course, the closer we get to "move, add, move, add", the less sense it would make.

Quote:

Mipmapping could also be added as an option in a game, so the player could choose between slightly better framerate or more precision.

I like that, it would be a great way for people to choose what is more important for them.

LaBodilsen · 19 January 2018, 10:11

Just tried some rough calculations of the cycle count, to see if using downscaling with longwords could be worth it. I assume you upscale the mipmap like below.

Mipmap upscale:

Code:

move.w	(a1),(a0)+		; 12 cycles
move.w	(a1)+,(a0)+		; 12 cycles
move.w	(a1),(a0)+		; 12 cycles
move.w	(a1)+,(a0)+		; 12 cycles
= 48 cycles

Mipmap downscale:

Code:

move.w	(a1)+,(a0)+		; 12 cycles
move.w	(a1)+,(a0)+		; 12 cycles
addq.l	#2,a1			; 8 cycles
move.w	(a1)+,(a0)+		; 12 cycles
move.w	(a1)+,(a0)+		; 12 cycles
= 56 cycles

Mipmap downscale with longwords:

Code:

move.l	(a1)+,(a0)+		; 20 cycles
addq.l	#4,a1			; 8 cycles
move.l	(a1)+,(a0)+		; 20 cycles
= 48 cycles

So in some cases downscaling with longwords would be as fast as mipmap upscaling*, and downscaling with words would in some cases be only 8 cycles slower per 4 pixels. Or am I missing the point here? Of course, using longwords with mipmap upscaling would in some cases be even faster.

Just an idea to take advantage of this:
for walls between 63 - 48 use downscaling of 64px texture with longwords where possible, and for 47-33 use mipmap upscaling of 32px texture with longwords where possible.

As mentioned, I wouldn't mind if I had to manually hand-optimize every wall height below 64 pixels.

EDIT: *Changed the cycle count after Toni corrected me

Toni Wilen · 19 January 2018, 12:25

addq.l #x,an is 8 cycles. (memory access + 4 idle cycles)
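Plugging that figure into the three sequences from the previous post, with addq.l counted as 8 cycles throughout, gives the totals below. This is just a tally of the per-instruction timings quoted in the thread; it assumes no wait states from DMA contention, and the dictionary and list names are made up for illustration.

```python
CYCLES = {
    "move.w (a1),(a0)+": 12,
    "move.w (a1)+,(a0)+": 12,
    "move.l (a1)+,(a0)+": 20,
    "addq.l #n,a1": 8,          # Toni Wilen's figure from this post
}

UPSCALE = ["move.w (a1),(a0)+", "move.w (a1)+,(a0)+"] * 2
DOWNSCALE_WORDS = (["move.w (a1)+,(a0)+"] * 2
                   + ["addq.l #n,a1"]
                   + ["move.w (a1)+,(a0)+"] * 2)
DOWNSCALE_LONGS = ["move.l (a1)+,(a0)+", "addq.l #n,a1", "move.l (a1)+,(a0)+"]


def total_cycles(sequence):
    """Sum the cycle counts of a straight-line instruction sequence."""
    return sum(CYCLES[instruction] for instruction in sequence)


print(total_cycles(UPSCALE))          # 48
print(total_cycles(DOWNSCALE_WORDS))  # 56
print(total_cycles(DOWNSCALE_LONGS))  # 48
```

Each sequence writes four words to the screen buffer, so the totals are directly comparable per four output pixels.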
