OutRun: I started porting it! - Page 5

jotd · 17 February 2024, 21:29

yes, that's why "technology equivalent" ports are difficult. That's also why I personally engage only on very old games (and there's still some challenge as even some 1983 games can have a 256 color palette, and others have selectable palette per sprite, with possible 64 sprites!!)

A1200/020 with fastmem is what most users with A1200 have (with CF IDE + whdload) so it's a reasonable target when you can't do vanilla A1200.

chb · 18 February 2024, 09:31

Quote:

Originally Posted by reassembler

However, once you're scaling and stretching sprites in real-time, there's no real performance advantage in pre-flipping. You're ultimately sampling the source data on a per pixel basis anyway. So the flipping comes for free as part of the process.
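To illustrate the point about flipping being free, here is a minimal sketch of a nearest-neighbour row scaler. It is not code from the port; the name `scale_row` and the 16.16 fixed-point step are assumptions for illustration. The flipped case only changes the start position and the sign of the step, so the inner loop is identical.

```python
def scale_row(src, dst_width, flip=False):
    """Nearest-neighbour scale of one sprite row to dst_width pixels.

    Hypothetical helper: walks the source with a 16.16 fixed-point step.
    """
    step = (len(src) << 16) // dst_width   # source pixels per output pixel
    pos = 0
    if flip:
        # Start at the last sample and walk backwards: same loop, same cost.
        pos = (dst_width - 1) * step
        step = -step
    out = []
    for _ in range(dst_width):
        out.append(src[pos >> 16])         # sample the source per pixel
        pos += step
    return out
```

For example, `scale_row([1, 2, 3, 4], 8)` doubles every pixel, and passing `flip=True` yields the mirror image of that result.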

You might then logically conclude - why not pre-scale AND pre-flip everything. Well, maybe. OutRun has 1mb of sprite data. Let's say you pre-cache 50 scaled values of each sprite in memory, which is less granularity than you'd have doing it at runtime (OutRun actually has approx 300 potential scaled sizes per sprite). 50 scaled sprites would probably take around 75mb of memory given that a lot of the scaling makes things larger. Now double that 75mb again for the flipping and I suppose you're at 150mb ram just for the sprites.
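The estimate above can be written out as a small calculation. The average growth factor of 1.5 per cached copy is an assumption, chosen only because it reproduces the 75 MB figure quoted; the function name is hypothetical.

```python
def precache_mb(sprite_data_mb, cached_scales, avg_growth, with_flips=True):
    """Rough memory needed to pre-scale (and optionally pre-flip) all sprites.

    avg_growth is an assumed average size of a scaled copy relative to
    the original sprite data (1.5 reproduces the 75 MB estimate).
    """
    total = sprite_data_mb * cached_scales * avg_growth
    if with_flips:
        total *= 2   # one mirrored copy of every scaled sprite
    return total

print(precache_mb(1, 50, 1.5, with_flips=False))  # scaled copies only
print(precache_mb(1, 50, 1.5))                    # scaled and flipped
```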

There's an approach using less granularity. But I'd rather have smoother scaling. If I get to a point where I finish the engine to the point where there is the potential for a performant game, there are other level layout optimisations I'd make first.

It seems that in OutRun only a small subset of objects is shown during a stage, and those usually repeat. So does it make sense to cache objects per stage? Objects would be scaled into a "stage buffer" before being drawn, and if an object has already been rendered at the desired zoom factor (or one that is close), it is copied from the buffer instead.
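A minimal sketch of that stage-buffer idea, assuming a hypothetical `scaler(sprite_id, zoom)` callback that does the expensive scaling. Zoom values are quantised into buckets so that a "close enough" zoom reuses an existing entry; the bucket size of 4 is an arbitrary illustration, not a measured value.

```python
class StageCache:
    """Per-stage cache of scaled sprites, keyed on (sprite, zoom bucket)."""

    def __init__(self, scaler, zoom_step=4):
        self.scaler = scaler        # hypothetical: scaler(sprite_id, zoom)
        self.zoom_step = zoom_step  # zooms within one step share an entry
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, sprite_id, zoom):
        bucket = zoom // self.zoom_step
        key = (sprite_id, bucket)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            # Scale once at the bucket's base zoom and keep the result.
            self.entries[key] = self.scaler(sprite_id, bucket * self.zoom_step)
        return self.entries[key]

    def clear(self):
        """Drop everything, e.g. when the next stage loads."""
        self.entries.clear()
```

A larger `zoom_step` saves memory and scaling time at the cost of visibly coarser zoom steps, which is the granularity trade-off discussed in the quoted post.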

Also, with such an approach some sort of incremental scaling like the old blitter zoom would be possible, splitting the object in four parts that are simply copied, then inserting one row/column from the original (or doubling a line if scale >1) for the next zoom step. Which ofc also looks different from a real zoom.

Probably has a lot of issues too, like overhead for indirect drawing and alignment in memory, and would get pretty complex...

BTW, fantastic work so far!

trixster · 18 February 2024, 10:17

The latest screenshots look superb, amazing work so far. I’m glad I got the add4 board for my blizzard 1220/4, I know my 1200 is looking forward to this!

coder76 · 18 February 2024, 11:12

Have you thought about doing the sprite scaling in bitmap mode, and avoid using a c2p routine?

I think it would be possible to have some table lookup, where you precompute the scaled graphic in 8 bit increments. So given 8 pixels of graphic data (1 byte) and a zoom level, the table lookup returns what the scaled data should look like. You then need a 256 byte table for each zoom level, so scaling 16 pixels at a time would create tables that are probably too big (64K * scale levels). This could be combined with mip mapping to reduce distortion in sprite scaling.

But it needs to be thought out how to minimize memory accesses, accumulate scaled data in a 32-bit data register, and always write out a full longword into chip RAM. In bitmap mode the y-scaling is as simple as in chunky mode, just skip rows.
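The table idea can be sketched as follows for a single bitplane and for shrinking only (output widths of 1 to 8 bits per source byte). The function names and the MSB-first bit order are assumptions for illustration. `build_table` creates the 256-entry table for one zoom level, and `scale_plane_row` accumulates the looked-up bits and emits full 32-bit longwords, as suggested above.

```python
def build_table(out_bits):
    """256-entry table: 8 source pixels (one byte) -> out_bits scaled pixels."""
    table = []
    for byte in range(256):
        out = 0
        for k in range(out_bits):
            src = k * 8 // out_bits                   # nearest source pixel
            out = (out << 1) | ((byte >> (7 - src)) & 1)
        table.append(out)
    return table

def scale_plane_row(row, out_bits, table):
    """Scale one bitplane row; returns a list of 32-bit longwords."""
    acc = 0        # bit accumulator (stands in for a data register)
    nbits = 0      # number of valid bits currently in acc
    longs = []
    for byte in row:
        acc = (acc << out_bits) | table[byte]
        nbits += out_bits
        if nbits >= 32:
            nbits -= 32
            longs.append((acc >> nbits) & 0xFFFFFFFF)  # one full longword
            acc &= (1 << nbits) - 1                    # keep the leftover bits
    if nbits:
        longs.append((acc << (32 - nbits)) & 0xFFFFFFFF)  # pad the last one
    return longs
```

For 8 bitplanes the same lookup would be repeated per plane. Because each 8-pixel group is scaled independently, the result is not identical to a true per-pixel zoom, which is the distortion mentioned above.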

reassembler · 18 February 2024, 22:50

As promised made a video on hardware.

Also the C2P routine I'm using is here:
https://github.com/Kalmalyzer/kalms-...1_8_c5_030_2.s

Is there a faster one?


copse · 18 February 2024, 23:05

Coo, just look at all the flocks of windsurfers in their natural habitat. Nowhere on Earth other than Outrunland has as high a population of this endangered species.

Great work, really looking good!

paraj · 18 February 2024, 23:06

Very nice progress!
Regarding C2P you're not likely to beat Kalms for general cases, but that doesn't mean there aren't ways to improve performance for specific uses.

What and how are you using the C2P routine?

Some things you may want to look into:
- Do you always use it for 8 bitplanes? (Maybe you can avoid some writes otherwise)
- Premade/unrolled versions for specific widths (or maybe even heights as well), so you avoid reinitializing + cache flush
- Combining C2P with other processing if possible.

I guess you're not C2P directly to chip ram? But if you are, there is a large benefit on 030+ to doing calculations that only depend on fast ram/CPU while chip store is completing
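For readers unfamiliar with what a C2P routine computes, here is a deliberately naive reference version. It is not how the Kalms routine works (optimised routines typically use bit-merge passes rather than per-pixel loops), and the name `c2p_naive` is hypothetical. The `bitplanes` parameter illustrates the first bullet above: with fewer planes, fewer bytes have to be written.

```python
def c2p_naive(chunky, bitplanes=8):
    """Convert chunky pixels (one value per pixel) into planar bytes.

    len(chunky) must be a multiple of 8. Returns one bytearray per plane,
    with the leftmost pixel in the most significant bit of each byte.
    """
    planes = [bytearray(len(chunky) // 8) for _ in range(bitplanes)]
    for x, pixel in enumerate(chunky):
        for p in range(bitplanes):
            if (pixel >> p) & 1:
                planes[p][x >> 3] |= 0x80 >> (x & 7)
    return planes
```

With `bitplanes=8` every chunky byte produces one byte of planar output in total; with `bitplanes=4` the output (and the number of chip RAM writes) is halved.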

reassembler · 18 February 2024, 23:14

Quote:

Originally Posted by paraj

I guess you're not C2P directly to chip ram? But if you are, there is a large benefit on 030+ to doing calculations that only depend on fast ram/CPU while chip store is completing

Apologies for my ignorance here - I'm not an Amiga expert. Are you suggesting writing to the bitplanes in Fast Memory, then performing a complete copy of those bitplanes to Chip Memory with some sort of parallel blitter routine, whilst the CPU goes on to process other things?

Right now almost everything in the engine is in Fast Memory, with the exception of the Bitplanes which are in Chip Memory. So the C2P has to deal with writing directly into chip memory.

paraj · 18 February 2024, 23:48

Quote:

Originally Posted by reassembler

Apologies for my ignorance here - I'm not an Amiga expert. Are you suggesting writing to the bitplanes in Fast Memory, then performing a complete copy of those bitplanes to Chip Memory with some sort of parallel blitter routine, whilst the CPU goes on to process other things?

Right now almost everything in the engine is in Fast Memory, with the exception of the Bitplanes which are in Chip Memory. So the C2P has to deal with writing directly into chip memory.

I only have 060 so that's what I'm most familiar with, so apologies in advance if the following is not accurate for 030/040, but I think basics should be right.

Writing to chip mem is slow. Even when you're nice and doing properly aligned longword writes you're limited to about 7 MB/s. 030+ accelerator cards with all bells and whistles enabled can complete the transfer "in the background" (on the 060 this is called the store buffer, with up to 4 pending transfers allowed; on the 030 there is 1, called something like a write pending buffer).

So in some cases, and with great effort, you can get "free" extra cycles by overlapping computations with the chip memory stores completing.

E.g.

Code:

  move.l d0,(a0)+ ; store to chip mem
  ; XXX

  move.l d1,(a0)+ ; store to chip mem again, has to wait for first store to complete

In above snippet the second store will generally have to wait for the first store to complete, but you can fit - with restrictions - some code "for free" into the XXX part as long as it doesn't disturb the chip write. On 060 this is how "copy speed" C2P is done - while the chipmem writes are completing the C2P is done "for free" (many caveats apply).

So basically if you have somewhere in your code where you're just copying data to chipmem, there might be an optimization opportunity (though it might be difficult to exploit).
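As a rough feel for how much room that leaves, the following back-of-the-envelope sketch divides CPU cycles by chip RAM bandwidth. The 7 MB/s figure is the approximate limit mentioned above; the 50 MHz clock is an assumed example, and the function name is hypothetical. The result is only an upper bound, since only cached, register-only instructions can actually overlap with a pending write.

```python
def cycles_per_chip_store(cpu_hz, chip_bytes_per_sec, bytes_per_store=4):
    """CPU cycles that elapse while one store to chip RAM completes."""
    return cpu_hz * bytes_per_store / chip_bytes_per_sec

# Assumed example: 50 MHz CPU, roughly 7 MB/s to chip RAM, longword stores.
budget = cycles_per_chip_store(50_000_000, 7_000_000)
print(f"about {budget:.1f} CPU cycles per longword store")
```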

reassembler · 19 February 2024, 10:29

Ah yes, I'd read about the above technique. The only other place I write to chip memory is updating the copper list for the background. I write around 448 words per frame. But I'll have a think about my C2P usage in general to see if I can figure out some smart approaches.

An obvious, but somewhat brutal optimization would be cropping the top of the display. A lot of the time there isn't an awful lot going on there and is arguably not C2P time well spent. You could probably remove at least 32 vertical rows without really making a massive difference to the experience. Just maybe one to think about once I've got the background layer in to avoid optimizing things too soon and rewriting.
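A quick sketch of what cropping would save, under stated assumptions: a 320x224 display (the arcade original's resolution, which may not match the port), 8 bitplanes (one byte per pixel), and the roughly 7 MB/s chip RAM write speed mentioned earlier in the thread. The function name is hypothetical.

```python
def crop_saving(width, height, cropped_rows, chip_bytes_per_sec=7_000_000):
    """Return (bytes saved, fraction of frame, ms of chip writes saved)."""
    saved_bytes = width * cropped_rows      # 8 bitplanes = 1 byte per pixel
    fraction = cropped_rows / height
    saved_ms = saved_bytes / chip_bytes_per_sec * 1000
    return saved_bytes, fraction, saved_ms

saved, fraction, ms = crop_saving(320, 224, 32)
print(f"{saved} bytes, {fraction:.1%} of the frame, about {ms:.2f} ms")
```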

Can anyone patiently explain to me why WinUAE speeds are so radically different? I've run SysInfo on my real Amiga. It correctly measures my CPU to be 50 MHz. My accelerator card is definitely working OK from what I can tell.

I can understand a bit of discrepancy, but I have to set WinUAE to vanilla A1200 speed (i.e. around 14 MHz) to really match what I'm getting on hardware (i.e. 4x slower than hardware), even for routines that don't touch chip memory. This seems to suggest I'm doing something wrong somewhere, either with my UAE setup or even my code. I'm not thinking I'm magically going to conjure up 4x speed out of nowhere on hardware. I'm just a bit baffled by this massive discrepancy. UAE, in cycle accurate mode, is presumably executing a defined number of clock cycles per scanline, so I wouldn't expect it to be 4x faster if I set the processor to, say, 030 @ 50 MHz with cycle accurate settings.

AestheticDebris · 19 February 2024, 11:06

Quote:

Originally Posted by reassembler

Can anyone patiently explain to me why WinUAE speeds are so radically different? I've run SysInfo on my real Amiga. It correctly measures my CPU to be 50 MHz. My accelerator card is definitely working OK from what I can tell.

AIUI, "cycle exact" is really only cycle exact for an A500, 68000 configuration. On higher spec CPUs it helps with things like chipram access timings, but isn't really emulating the CPU internals at a real speed.

jotd · 19 February 2024, 11:14

ok so you mean it would for instance be faster to do this (with A1 in chip)

Code:

  move.b  (a0)+,(a1)
  add.w #40,a1
  move.b  (a0)+,(a1)

than

Code:

  move.b  (a0)+,(a1)
  move.b  (a0)+,(40,a1)

? as the second instruction is longer to decode, while the "add" is virtually free?

paraj · 19 February 2024, 17:58

Quote:

Originally Posted by jotd

ok so you mean it would for instance be faster to do this (with A1 in chip)

Code:

  move.b  (a0)+,(a1)
  add.w #40,a1
  move.b  (a0)+,(a1)

than

Code:

  move.b  (a0)+,(a1)
  move.b  (a0)+,(40,a1)

? as the second instruction is longer to decode, while the "add" is virtually free?

Specifically what I'm referring to is this passage from MC68030UM §11.1:

Quote:

The MC68030 can overlap data writes with instruction cache reads, data cache reads, and/ or microsequencer execution. Instruction cache reads can be overlapped with data cache fills and/or microsequencer activity. Similarly, data cache reads can be overlapped with instruction cache fills and/or microsequencer activity. The execution of an instruction that only accesses on-chip registers can be overlapped entirely with a concurrent data write generated by a previous instruction, if prefetches generated by that instruction are resident in the instruction cache.

So it only works in specific circumstances. Your example may or may not be faster, but it uses more of the precious instruction cache, so probably not (but always time changes like that!).

Ideally you'd want the instructions between the writes to be in cache and only operate on registers. But I guess this is sort of OT for now.

reassembler · 20 February 2024, 17:56

Last night I was fried from programming, so I decided to chill out on the sofa and watch YouTube videos of people soldering capacitors.

Unfortunately, I had a brainwave whilst on the sofa about optimizing the sprite rendering further which involved a realization that OutRun never tried to render clipped sprites (i.e. ones that are partially off-screen) starting from an off-screen coordinate, but would always draw starting on-screen -> off-screen. I immediately had to try out some tweaks to see if I was imagining things. Mostly because there are 6 separate rendering routines by this point for speed already. But no it worked and shaved significant cycles from the sprite routines.
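The actual routines are not shown in this thread, so the following is only one reading of that observation, with hypothetical names. If drawing always starts on-screen and runs towards the edge, clipping reduces to shortening the pixel count once up front: no source pixels need to be skipped and the inner loop needs no bounds test.

```python
def draw_row_clipped(screen_row, x, pixels, step=1):
    """Draw pixels starting at on-screen x, moving by step (+1 or -1).

    Assumes 0 <= x < len(screen_row). Returns the number of pixels drawn.
    """
    if step > 0:
        room = len(screen_row) - x     # pixels until the right edge
    else:
        room = x + 1                   # pixels until the left edge
    count = min(len(pixels), room)     # clip once, outside the loop
    for i in range(count):
        screen_row[x] = pixels[i]
        x += step
    return count
```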

Today I messed around with the 68030 CACR register for data cache, instruction cache (on by default) and the various burst modes. I wish I could say I was smart enough to read the 030 manual and immediately identify the best combination to use. But, no, I just turned them on and off and measured the results.

Although there seem to be mixed reports online about the benefits of the data cache, for me, simply turning everything on (Instruction Cache + Burst, Data Cache + Burst) yielded the best performance and doesn't appear to cause any problems. Probably only 10-12% faster overall than just the default of the Instruction Cache. But I'll take that performance gain.
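For reference, the "everything on" combination corresponds to four enable bits in the 68030 CACR. The sketch below only computes the register value; the constant and function names are chosen for illustration, and the bit positions are given as I understand them from the 68030 documentation, so check them against the manual. On AmigaOS the usual route would be exec.library's `CacheControl()` rather than writing the register directly.

```python
# 68030 CACR enable bits (verify against the MC68030 user's manual).
CACR_EI = 1 << 0     # enable instruction cache
CACR_IBE = 1 << 4    # instruction burst enable
CACR_ED = 1 << 8     # enable data cache
CACR_DBE = 1 << 12   # data burst enable

def cacr_value(icache=True, iburst=False, dcache=False, dburst=False):
    """Build a CACR value from the four enable switches."""
    value = 0
    if icache:
        value |= CACR_EI
    if iburst:
        value |= CACR_IBE
    if dcache:
        value |= CACR_ED
    if dburst:
        value |= CACR_DBE
    return value

print(hex(cacr_value(True, True, True, True)))  # caches plus burst modes
```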

EDIT: On reflection and further tests I'm not entirely sure the CACR stuff has made any difference at all really. The performance differences I listed above seem to be more the gap between the best case and worst case of the various options I tried. The real difference was from the sprite optimizations I made at the start of this post. Still, it was worth tinkering with as a test on hardware at least.

The next thing to tackle is potentially rolling some of the most intense graphical loops to fit in the 256 byte instruction cache - 030 and above specific optimizations really. I also have other performance related ideas to try.

paraj · 20 February 2024, 18:49

Great to hear about your experiments, keep up the good work!
I doubt messing with CACR is really going to bring you much (except if you start w/o startup-sequence). Both instruction and data cache should be enabled if you start normally, and I think that should be fine for now. But by all means try stuff

Keep in mind that cache lines are 16-bytes (on 030), so you want to limit functions that have to be kept in cache to 256-16 bytes or make sure they're 16-byte aligned. You generally get 8-byte alignment from AmigaOS so you might be able to make that even 256-8 w/o doing too much extra work, but as always measure..
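The arithmetic behind those limits can be checked with a small sketch. It assumes the 68030's 256-byte instruction cache organised as 16 lines of 16 bytes; the function names are hypothetical. A routine stays fully cached only if the contiguous address range it occupies touches at most 16 lines.

```python
CACHE_BYTES = 256   # 68030 instruction cache size
LINE_BYTES = 16     # one cache line = four longwords

def lines_touched(start, size):
    """Number of cache lines covered by a routine at address start."""
    first = start // LINE_BYTES
    last = (start + size - 1) // LINE_BYTES
    return last - first + 1

def fits_in_icache(start, size):
    return lines_touched(start, size) <= CACHE_BYTES // LINE_BYTES

print(fits_in_icache(0, 256))   # 16-byte aligned: the full 256 bytes fit
print(fits_in_icache(8, 256))   # 8-byte aligned only: spills into a 17th line
print(fits_in_icache(8, 248))   # 256-8 bytes fits with 8-byte alignment
print(fits_in_icache(14, 240))  # 256-16 bytes fits at any even address
```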

reassembler · 20 February 2024, 21:13

Quote:

Originally Posted by paraj

Keep in mind that cache lines are 16-bytes (on 030), so you want to limit functions that have to be kept in cache to 256-16 bytes or make sure they're 16-byte aligned. You generally get 8-byte alignment from AmigaOS so you might be able to make that even 256-8 w/o doing too much extra work, but as always measure..

Got it. I read somewhere on an Atari forum that loops had to conclude with a DBRA to be cached. Is that actually true?! Sounds a bit random to me. Not sure if I should trust those Atari owners!

(Disclosure: I own two STs)

gimbal · 20 February 2024, 22:02

Quote:

Originally Posted by reassembler

Last night I was fried from programming, so I decided to chill out on the sofa and watch YouTube videos of people soldering capacitors.

Unfortunately, I had a brainwave whilst on the sofa

Nothing unfortunate about it, that's just the Tetris effect of programming and it has led to so much redundant code being scrapped in my own projects

At least one half of your brain just keeps working, awake or asleep.

There ain't no rest for the wicked.

reassembler · 21 February 2024, 00:16

Tried a variation where I reduced some of the most utilised sprite loops/routines to well under 256 bytes. Made no difference to overall speed - in fact, slightly slower, regardless of CACR settings also.

I think the most significant gains to be had will result from improving my code, as opposed to tinkering with this stuff. It doesn't seem a fruitful path to carry on down for now in terms of gaining any significant performance.

Don_Adan · 21 February 2024, 00:56

If I remember meynaf's tests on the 68030 correctly, a cached routine can be at most 240 bytes long.

hooverphonique · 21 February 2024, 09:38

Quote:

Originally Posted by reassembler

Got it. I read somewhere on an Atari forum that loops had to conclude with a DBRA to be cached. Is that actually true?!

Not true!

If you don't see much degradation when turning off caches, it suggests that you haven't adapted your code to take advantage of them, so there's probably something to be gained there ;-)
