English Amiga Board - OutRun: I started porting it!

- - OutRun: I started porting it! (https://eab.abime.net/showthread.php?t=116656)

Yep, the 'I don't know how to code myself, but I know how you should have done it and then it would run at 60hz with twice as many objects, colors, music channels and rainbows, unicorns and megagodzillaspacekittens!!!!eins!' crowd is strong these days.

Fortunately this doesn't happen often. Someone must have left the bozo gate open for a while.

Quote:

Originally Posted by jotd (Post 1684776)

Someone must have left the bozo gate open for a while.

:agree:laughing

Back to C2P, I figured I'd 'quickly' try the blitter assisted routine here:
https://github.com/Kalmalyzer/kalms-...1_8_c3b1_030.s

For whatever reason and despite having success with other C2P routines, I can't get this to work.

- What is the difference between 'c3' and 'c5' in the naming conventions of this routine?
- Does anyone understand the size the extra buffer has to be (passed in the a2 register) for this version? I've stored the buffer in chip memory.
- I can confirm I have enabled Blitter DMA.
- Am I doing something completely dumb when it comes to calling QBlit? Are there any circumstances where QBlit would fail?

Code:

move.l  S_GraBase,a6    ; graphics.library base, grabbed during startup (address appears OK)

jsr     _LVOQBlit(a6)   ; QBlit's library vector offset is -276

From some quick debugging, I'm presuming something isn't getting set up correctly, as the c2p routine stalls on its second execution, waiting for the blitter indefinitely.

Haven't used his blitter assisted C2P's, but:

c3b1 = 3 cpu passes, 1 blitter pass. c5 = 5 cpu passes
95% sure the extra blit buffer needs to be the same size as the screen.
Do you have blitter interrupts enabled? If not, that's probably the cause as that's what will be driving the blits. If you kill the system, you probably have to use a replacement for QBlit, and the repo might have that (EDIT: it does: https://github.com/Kalmalyzer/kalms-...hers/qblit.lha)
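
As a rough illustration of the naming convention and the buffer sizing guess above (Python purely for readability; `parse_passes` and `blit_buffer_bytes` are hypothetical helper names, and the 320x256 screen with 8 bitplanes is an assumption, not something stated in the thread):

```python
import re

def parse_passes(tag):
    """Split a tag such as 'c3b1' or 'c5' into (cpu_passes, blitter_passes)."""
    m = re.fullmatch(r"c(\d+)(?:b(\d+))?", tag)
    if m is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    # A missing 'bN' part means no blitter passes at all.
    return int(m.group(1)), int(m.group(2) or 0)

def blit_buffer_bytes(width, height, bitplanes):
    """Bytes needed for a buffer the same size as a planar screen."""
    return width * height * bitplanes // 8

print(parse_passes("c3b1"))            # (3, 1)
print(parse_passes("c5"))              # (5, 0)
print(blit_buffer_bytes(320, 256, 8))  # 81920 (assumed screen size)
```

If the "same size as the screen" guess holds, the buffer passed in a2 would be that many bytes of chip memory.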

@paraj - well spotted. I'll give that a go tomorrow and report back on hardware.

Right yes, that's 'working' albeit with a load of timing issues because the c2p is running in parallel as opposed to in series with the rest of the codebase. Will have to untangle that mess and see if it's yielded any exciting performance improvements!

Quote:

Originally Posted by reassembler (Post 1684799)

Great (and good progress BTW!), but yeah, it's unfortunately not a free win. While the blitter will handle 2 passes this way, the CPU will still do as many chip writes as before. Since the blitter can only work on 16-bit words and needs 3 chip accesses for each, it probably works best when you're limited by computation speed (and never on 040+).
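
To put a rough number on the blitter's bus cost, here is a back-of-the-envelope sketch. The 3 accesses per 16-bit word figure is from the post above; the 81920-byte buffer (320x256 at 8 bitplanes), the idea that one pass touches every word exactly once, and the function name are all assumptions for illustration:

```python
def blitter_chip_accesses(buffer_bytes, accesses_per_word=3):
    """Chip bus accesses for one blitter pass over the whole buffer."""
    words = buffer_bytes // 2          # the blitter works on 16-bit words
    return words * accesses_per_word

buffer_bytes = 81920                   # assumed: 320 x 256 x 8 bitplanes
print(blitter_chip_accesses(buffer_bytes))  # 122880
```

Those accesses happen on top of the CPU's own chip writes, which is why the win depends on the CPU having other work to do in the meantime.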

Looking forward to the results of your tests, but you probably shouldn't spend too much time on getting it "right" if the speed improvement isn't very noticeable.

Yeah, I haven't focused on pure Amiga hardware optimizations for some time so figured I'd revisit some options. The necessity to use chip memory kills a lot of ideas.

I do want to benchmark having dummy space either side of the chunky buffer on a horizontal line basis. This means some expensive clipping checks can be omitted. However, I'm unsure whether doing so would completely hose the c2p performance as the algorithm would have to skip bytes which might negate any advantage.

I do something internally like this for road rendering, which uses bucketloads of ram, but is lightning fast as a result. Basically avoiding checks and branch conditions in the most intense routines and granting the code permission to write all over the place is often a massive win. Being as dumb and simple as possible is often better than being smart.
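
A minimal sketch of the dummy-space idea, assuming a guard band on each side of every chunky line. The sizes and function names here are made up for illustration and this is not the engine's actual layout:

```python
VISIBLE_W = 320                    # assumed visible width in pixels
PAD = 64                           # assumed guard band on each side
HEIGHT = 4                         # tiny height, just for the demo
STRIDE = PAD + VISIBLE_W + PAD     # bytes per chunky line including padding

def make_buffer():
    return bytearray(STRIDE * HEIGHT)

def draw_span_unclipped(buf, y, x, pixels):
    """Write pixels at visible column x with no clipping test.

    Only safe while -PAD <= x and x + len(pixels) <= VISIBLE_W + PAD;
    anything landing in the guard band is simply never displayed.
    """
    start = y * STRIDE + PAD + x
    buf[start:start + len(pixels)] = pixels

def visible_line(buf, y):
    """What the c2p would read: it has to skip the padding on every line."""
    start = y * STRIDE + PAD
    return bytes(buf[start:start + VISIBLE_W])

buf = make_buffer()
draw_span_unclipped(buf, 1, -8, bytes([7]) * 16)  # half the span is off-screen
print(visible_line(buf, 1)[:10])  # eight 7s, then untouched zeros
```

The cost moves from the drawing routines (no branches) to the c2p, which has to step over 2 * PAD bytes per line. That is exactly the trade-off that would need benchmarking.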

Quote:

Originally Posted by reassembler (Post 1684594)

Alright, you asked for it - a video:

https://youtu.be/CLaedlbr4wg

Thanks a lot for the vid, it already looks spectacular.

Quote:

Originally Posted by reassembler (Post 1684594)

I'm not going to go wild implementing 'new' features in the Amiga version. I've really done that as part of CannonBall (modern machines) and OutRun Enhanced (runs on Arcade hardware). This is more of a 'just get it bloody running' effort. That being said, I have fixed a few of the simple bugs present in the original game.

Yes of course, I totally understand, maybe my feature request would better fit as option to CannonBall. Just in case you feel like revisiting CannonBall at some later point, it would be a nice addition.

But by all means, keep going on working at this port here first! It's simply amazing!

Quote:

Originally Posted by paraj (Post 1684805)

Looking forward to the results of your tests, but you probably shouldn't spend too much time on getting it "right" if the speed improvement isn't very noticeable.

OK, thanks to your help I implemented the C2P algorithm using the Blitter.

I measured the time it took the AI to drive the car from the start line to a particular point of the game with a stop-watch. The AI is deterministic so will behave the same way every time from a fresh boot. Due to the way in which the blitter runs in the background, I figured this was a better way of analyzing overall performance, as opposed to actual in-engine timings.

With the normal 030 c2p... it took 2m 11s. With the blitter assisted c2p it took 1m 59s. That's 131s vs. 119s, so about 10% faster overall for a real world situation.
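
For reference, the arithmetic on those two stopwatch times, worked through in a few lines (only the two timings come from the post; the helper name is made up):

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

baseline = to_seconds(2, 11)   # normal 030 c2p: 131 seconds
blitter = to_seconds(1, 59)    # blitter assisted c2p: 119 seconds

time_saved_pct = 100 * (baseline - blitter) / baseline
speedup_pct = 100 * (baseline / blitter - 1)

print(f"{time_saved_pct:.1f}% less time, {speedup_pct:.1f}% faster")
# prints "9.2% less time, 10.1% faster"
```

Stopwatch timing is only accurate to a second or so, so the last digit should not be taken too seriously.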

However... whilst this is a moderate speed-boost there are some caveats:

1/ The more intensive parts of the game are now obviously faster (because C2P is chugging away in the background)

2/ The lightweight parts of the game are now slower (because the C2P is still running when we get to the end of a frame, and we have to wait for it to finish).

3/ In order to get the benefit from Blitter C2P I have to start the C2P at the very beginning of the frame, effectively working on the previous frame's data, so that there's other stuff the game engine can be doing whilst the blitter slowly ploughs through the data.
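
Caveats 1 and 2 can be illustrated with a toy timing model. All the numbers below are invented units rather than measurements, the model ignores bus contention and vblank quantisation, and the function names are hypothetical:

```python
def serial_frame_time(work, c2p_cpu_only):
    """CPU-only c2p: game work, then the whole conversion, back to back."""
    return work + c2p_cpu_only

def overlapped_frame_time(work, c2p_cpu_part, blit_time):
    """Blitter assisted: the CPU passes run first, then the blitter runs in
    the background while the game does its work. The frame ends when both
    the game work and the blitter have finished."""
    return c2p_cpu_part + max(work, blit_time)

# Assumed costs: CPU-only c2p = 8, CPU part of assisted c2p = 5, blitter = 10.
for label, work in (("heavy frame", 20), ("light frame", 4)):
    serial = serial_frame_time(work, 8)
    overlapped = overlapped_frame_time(work, 5, 10)
    print(label, serial, overlapped)
# heavy frame 28 25  (blitter time is hidden behind the game work)
# light frame 12 15  (the CPU ends up waiting for the blitter)
```

Under these made-up costs the break-even point is wherever the game work just covers the blitter time.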

So originally we had:
Game Logic -> Render to chunky -> c2p -> swap screen buffer -> vblank stuff

Now we have:
c2p -> Game Logic -> Render to chunky -> check c2p has finished -> swap screen buffer -> vblank stuff

This means that the frame displayed is 1 frame older than previously. It also means I need to add some hacks/delays with various palette updates that ran in the vblank. So overall the code gets a little messier and harder to debug.
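
The one-frame lag can be seen in a small model of the two orderings. Frames are represented by their index only, and the function names are made up for this sketch:

```python
def serial_pipeline(num_frames):
    """Game Logic -> Render to chunky -> c2p -> swap."""
    shown = []
    for n in range(num_frames):
        chunky = n             # logic + render produce frame n
        planar = chunky        # c2p converts it straight away
        shown.append(planar)   # swap displays frame n
    return shown

def overlapped_pipeline(num_frames):
    """c2p -> Game Logic -> Render to chunky -> wait for c2p -> swap."""
    shown = []
    chunky = None              # nothing has been rendered before frame 0
    for n in range(num_frames):
        planar = chunky        # c2p starts first, on the previous frame's data
        chunky = n             # logic + render produce frame n meanwhile
        shown.append(planar)   # swap displays the older frame
    return shown

print(serial_pipeline(4))      # [0, 1, 2, 3]
print(overlapped_pipeline(4))  # [None, 0, 1, 2]
```

Anything tied to the displayed image, such as the palette updates in the vblank, has to be delayed by the same one frame to stay in step.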

I don't really know whether I love this approach. On the one hand, it's faster overall. On the other, it's kind of hacky and is very 030 specific. There's no way you'd want this on an 040. Plus there's the fact that I'm planning 030 optimizations/streamlining anyway. It kind of sucks if their benefit is diminished due to blitter waits.

I'll sleep on it, but I think if it had been significantly faster I would have welcomed it more. But I'm not so sure.

Edit: Not that it will tell you more than the above, but here's a quick video of it running:
https://www.youtube.com/watch?v=Sz-te2uKDLk

Looking really good so far, amazing work! Will keep an eye on this thread, if you ever want a build tested on my 060 just let me know :)

And how does it behave in the heavier parts, like the one with the arches?

Quote:

Originally Posted by reassembler (Post 1684681)

OutRun generally pushes over 256 colours simultaneously and uses many more over the course of some stages. There's already palette reduction going on for AGA. By the time you slice that down to 128 colours, you're really looking at reworking all the palettes. Which I'm not enthusiastic about as there are 255 * 16 colour palettes available at any one time. So over 4,000 colours. Losing a bitplane doesn't seem worth it.
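
The colour counts in that post are easy to check. The sketch below only restates the figures given there (8 vs. 7 bitplanes, 255 palettes of 16 colours); the function name is made up:

```python
def colours_for_bitplanes(bitplanes):
    """Number of simultaneous colour registers addressable by N bitplanes."""
    return 1 << bitplanes

palettes = 255
colours_per_palette = 16

print(colours_for_bitplanes(8))        # 256
print(colours_for_bitplanes(7))        # 128 after losing one bitplane
print(palettes * colours_per_palette)  # 4080 palette entries available
```

So dropping a bitplane halves the on-screen colours while the source material still draws from roughly 4,000 palette entries.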

I always ask myself, what are the pillars that make the game successful? For OutRun it's the vibrant palette and scenery, cool audio, fast frame-rate, simple yet highly nuanced gameplay. Ideally, I'll keep as many of those pillars intact where possible - with hopefully some tasteful sacrifices.

Maybe in the future I'll port PCE OutRun to A500! But even that has a pretty impressive colour palette vs. a typical 16/32 colour Amiga game.

Agermose is currently reworking all the OutRun art for his port. That's a massive job. Easy enough to get a single level of the game working ok. But the scenery needs to work across multiple stages in various combinations of colours. A daunting amount of work. And also why his project is interesting because of the different approach taken. :)

Technically speaking it is now Adrian who is doing the gfx. I dumped everything including a huge Excel sheet with all the patterns and sprite/palette usage in the 15 stages. He’s doing the monster task of converting to 32 colours.

Quote:

Originally Posted by reassembler (Post 1684673)

Maybe. That would need to be benchmarked in terms of Sprite DMA usage trade-off, amount of chip ram needed as the cached tilemaps are huge, the cost of accessing chip ram vs. fast ram etc. Plus I've already eaten the cost of chunky conversion anyway, so that computation wouldn't be recovered.

So the honest answer is, I don't really know without spending considerable time trying it. It's more of a case of considering the overall architecture of the engine, as opposed to a binary 'sprites are almost free' unfortunately (an imaginary quote not yours!)

I'd probably rather keep remaining chip ram for music and sound effects.

There would be a considerable saving to be made by merging the tile layer into one single layer with no transparency - as then I'm just movem.l'ing (is that a new expression?) data around fast memory. Plus native sprites wouldn't handle the parallax anyway - which is the expensive part of this.
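
A sketch of why a single opaque layer is cheaper. It assumes pen 0 is the transparent colour and stands in for the block copy with a plain slice; the names and data are invented for illustration:

```python
TRANSPARENT = 0  # assumed transparent pen

def composite_two_layers(back, front):
    """Two layers: every pixel needs a transparency test."""
    return bytes(f if f != TRANSPARENT else b for b, f in zip(back, front))

def copy_single_layer(merged, scroll, width):
    """One opaque layer: a straight block copy, no per-pixel decisions."""
    return merged[scroll:scroll + width]

back = bytes([1, 1, 1, 1])
front = bytes([0, 5, 0, 6])
print(list(composite_two_layers(back, front)))           # [1, 5, 1, 6]
print(list(copy_single_layer(bytes(range(10)), 3, 4)))   # [3, 4, 5, 6]
```

The catch is that a pre-merged layer only has one scroll position, so the parallax between the two layers is lost.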

I think Agermose was experimenting with sprites for the tile layers on his AGA port. He may have some insights into the pros and cons. Bear in mind I'm using a full 8 bitplanes, so I'm already hammering those DMA slots hard... My lack of Amiga coding experience makes it hard to anticipate how it would play out. From previous experiments with Blitter usage it was a pain in the arse / bottleneck on the 030.

Yes I’m using the hw sprites for the back tile layer, ditched the front tile layer. I don’t think it is suited for your project. Pros are the easy scrolling, and “free” layer. There are quite a few cons, mainly the 3 colour limit, and the huge (chip) memory use (which can be reduced somewhat, but with complicated code). PM me if you want more details.