English Amiga Board


Old 17 February 2024, 21:29   #81
jotd
This cat is no more
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,196
Yes, that's why "technology equivalent" ports are difficult. That's also why I personally only take on very old games (and even those can be a challenge: some 1983 games have a 256-color palette, and others have a selectable palette per sprite, with up to 64 sprites!)

An A1200/020 with fast mem is what most A1200 users have (with CF IDE + WHDLoad), so it's a reasonable target when you can't do vanilla A1200.
Old 18 February 2024, 09:31   #82
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by reassembler
However, once you're scaling and stretching sprites in real-time, there's no real performance advantage in pre-flipping. You're ultimately sampling the source data on a per pixel basis anyway. So the flipping comes for free as part of the process.

You might then logically conclude - why not pre-scale AND pre-flip everything. Well, maybe. OutRun has 1mb of sprite data. Let's say you pre-cache 50 scaled values of each sprite in memory, which is less granularity than you'd have doing it at runtime (OutRun actually has approx 300 potential scaled sizes per sprite). 50 scaled sprites would probably take around 75mb of memory given that a lot of the scaling makes things larger. Now double that 75mb again for the flipping and I suppose you're at 150mb ram just for the sprites.

There's an approach using less granularity. But I'd rather have smoother scaling. If I get to a point where I finish the engine to the point where there is the potential for a performant game, there are other level layout optimisations I'd make first.
It seems that in OutRun only a small subset of objects is shown during a stage, and those usually repeat. So would it make sense to cache objects during a stage? Objects would be rendered into a "stage buffer" first, and if an object has already been rendered at the desired zoom factor (or one that is close), it is copied from the buffer instead.

Also, with such an approach some sort of incremental scaling like the old blitter zoom would be possible: split the object into four parts that are simply copied, then insert one row/column from the original (or double a line if scale >1) for the next zoom step. Which of course also looks different from a real zoom.

Probably has a lot of issues too, like overhead for indirect drawing and alignment in memory, and would get pretty complex...
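A minimal sketch of the stage-buffer idea, written in Python for readability rather than as engine code. The names `StageCache`, `scale_fn` and the `step` quantisation are hypothetical; the only point is that a lookup keyed on (sprite, quantised zoom) lets a "close enough" zoom reuse an earlier render.

```python
class StageCache:
    """Cache of already-scaled sprites for the current stage (hypothetical)."""

    def __init__(self, scale_fn, step=4):
        self.scale_fn = scale_fn   # callable: (sprite_id, zoom) -> scaled pixels
        self.step = step           # number of zoom levels sharing one cache slot
        self.entries = {}          # (sprite_id, bucket) -> scaled pixels
        self.hits = 0
        self.misses = 0

    def get(self, sprite_id, zoom):
        bucket = zoom // self.step
        key = (sprite_id, bucket)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            # Render once, at the bucket's representative zoom level
            self.entries[key] = self.scale_fn(sprite_id, bucket * self.step)
        return self.entries[key]

    def clear(self):
        """Drop all cached renders, e.g. at a stage change."""
        self.entries.clear()
```

A larger `step` means fewer renders and less memory, but coarser zoom steps, which is the granularity trade-off already raised in the quoted post.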

BTW, fantastic work so far!
Old 18 February 2024, 10:17   #83
trixster
Guru Meditating
 
Join Date: Jun 2014
Location: England
Posts: 2,339
The latest screenshots look superb, amazing work so far. I’m glad I got the add4 board for my blizzard 1220/4, I know my 1200 is looking forward to this!
Old 18 February 2024, 11:12   #84
coder76
Registered User
 
Join Date: Dec 2016
Location: Finland
Posts: 168
Have you thought about doing the sprite scaling in bitmap mode, avoiding the C2P routine altogether?

I think it would be possible to use a table lookup, where you precompute the scaled graphics in 8-bit increments. Given 8 pixels of graphics data (1 byte) and a zoom level, the table lookup returns what the scaled data should look like. You then need a 256-byte table for each zoom level; scaling 16 pixels at a time would create tables that are probably too big (64K * scale levels). This could be combined with mip mapping to reduce distortion in sprite scaling.

But it needs to be thought out how to minimize memory accesses, accumulate scaled data in a 32-bit data register, and always write out a full longword to chip RAM. In bitmap mode the y-scaling is as simple as in chunky mode: just skip rows.
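To make the table idea concrete, here is a rough Python sketch of how one such 256-entry table could be generated offline. It assumes nearest-neighbour sampling, the leftmost pixel in the most significant bit, and shrinking only (1 to 8 output pixels); `build_shrink_table` is a made-up name, and enlarging would need entries wider than a byte.

```python
def build_shrink_table(out_width):
    """256-entry table: 8 source pixels (one bitplane byte) -> out_width pixels."""
    if not 1 <= out_width <= 8:
        raise ValueError("this sketch only shrinks: out_width must be 1..8")
    table = bytearray(256)
    for src in range(256):
        out = 0
        for i in range(out_width):
            src_pixel = (i * 8) // out_width      # nearest-neighbour sample
            bit = (src >> (7 - src_pixel)) & 1    # pixel 0 is the MSB
            out = (out << 1) | bit
        table[src] = out                          # result sits in the low out_width bits
    return table
```

With one table per output width, 8 zoom levels cost 8 * 256 = 2 KB per bitplane byte pattern set, whereas a 16-pixel version would need 65,536 entries per level, which is the size problem mentioned above.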
Old 18 February 2024, 22:50   #85
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
As promised, I made a video of it running on hardware.

Also the C2P routine I'm using is here:
https://github.com/Kalmalyzer/kalms-...1_8_c5_030_2.s

Is there a faster one?

[Embedded YouTube video]
Old 18 February 2024, 23:05   #86
copse
Registered User
 
Join Date: Jul 2009
Location: Lala Land
Posts: 522
Coo, just look at all the flocks of windsurfers in their natural habitat. Nowhere on Earth other than Outrunland has as high a population of this endangered species.

Great work, really looking good!
Old 18 February 2024, 23:06   #87
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Very nice progress!
Regarding C2P you're not likely to beat Kalms for general cases, but that doesn't mean there aren't ways to improve performance for specific uses.

What and how are you using the C2P routine?

Some things you may want to look into:
- Do you always use it for 8 bitplanes? (Maybe you can avoid some writes otherwise)
- Premade/unrolled versions for specific widths (or maybe even heights as well), so you avoid reinitializing + cache flush
- Combining C2P with other processing if possible.

I guess you're not doing C2P directly to chip RAM? But if you are, there is a large benefit on 030+ to doing calculations that only depend on fast RAM/CPU while the chip store is completing.
Old 18 February 2024, 23:14   #88
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Quote:
Originally Posted by paraj
I guess you're not doing C2P directly to chip RAM? But if you are, there is a large benefit on 030+ to doing calculations that only depend on fast RAM/CPU while the chip store is completing.
Apologies for my ignorance here - I'm not an Amiga expert. Are you suggesting writing to the bitplanes in Fast Memory, then performing a complete copy of those bitplanes to Chip Memory with some sort of parallel blitter routine, whilst the CPU goes on to process other things?

Right now almost everything in the engine is in Fast Memory, with the exception of the Bitplanes which are in Chip Memory. So the C2P has to deal with writing directly into chip memory.
Old 18 February 2024, 23:48   #89
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Quote:
Originally Posted by reassembler
Apologies for my ignorance here - I'm not an Amiga expert. Are you suggesting writing to the bitplanes in Fast Memory, then performing a complete copy of those bitplanes to Chip Memory with some sort of parallel blitter routine, whilst the CPU goes on to process other things?

Right now almost everything in the engine is in Fast Memory, with the exception of the Bitplanes which are in Chip Memory. So the C2P has to deal with writing directly into chip memory.
I only have 060 so that's what I'm most familiar with, so apologies in advance if the following is not accurate for 030/040, but I think basics should be right.

Writing to chip mem is slow. Even when you're nice and do properly aligned longword writes, you're limited to 7 MB/s. 030+ accelerator cards with all the bells and whistles enabled can do the transfer "in the background" (on the 060 this is the store buffer, which allows up to 4 pending transfers; the 030 allows 1, in something called [write pending buffer?]).

So in some cases, and with great effort, you can get "free" extra cycles by overlapping computations with the chip memory stores completing.

E.g.
Code:
  move.l d0,(a0)+ ; store to chip mem
  ; XXX

  move.l d1,(a0)+ ; store to chip mem again, has to wait for first store to complete
In the above snippet the second store will generally have to wait for the first store to complete, but you can fit (with restrictions) some code "for free" into the XXX part as long as it doesn't disturb the chip write. On the 060 this is how "copy speed" C2P is done: while the chip mem writes are completing, the C2P is done "for free" (many caveats apply).


So basically, if somewhere in your code you're just copying data to chip mem, there might be an optimization opportunity (though it might be difficult to exploit).
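For a feel of how big that "free" window could be, here is a back-of-the-envelope calculation in Python. It is only a rough upper bound under stated assumptions: the 7 MB/s figure from above taken as 7,000,000 bytes per second, a 50 MHz CPU clock (the figure reported for the real machine in this thread), and the chip bus as the only limit. `cpu_cycles_per_chip_write` is a made-up helper name.

```python
def cpu_cycles_per_chip_write(cpu_hz, chip_bytes_per_sec, bytes_per_write=4):
    """CPU clock cycles that elapse while one chip RAM write completes."""
    seconds_per_write = bytes_per_write / chip_bytes_per_sec
    return cpu_hz * seconds_per_write

# Assumed figures: 50 MHz CPU, 7 MB/s sustained longword writes to chip RAM
budget = cpu_cycles_per_chip_write(50_000_000, 7_000_000)
print(f"{budget:.1f} CPU cycles per longword store")
```

Under those assumptions that is roughly 28 to 29 CPU cycles between consecutive longword stores. How many of them are actually usable depends on the restrictions described above, so treat it as a ceiling and measure.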
Old 19 February 2024, 10:29   #90
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Ah yes, I'd read about the above technique. The only other place I write to chip memory is updating the copper list for the background. I write around 448 words per frame. But I'll have a think about my C2P usage in general to see if I can figure out some smart approaches.

An obvious, but somewhat brutal, optimization would be cropping the top of the display. A lot of the time there isn't an awful lot going on there, so it's arguably not C2P time well spent. You could probably remove at least 32 rows without making a massive difference to the experience. Maybe one to think about once I've got the background layer in, to avoid optimizing things too soon and then rewriting.
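As a rough estimate of what cropping would buy, the Python snippet below counts the bitplane bytes that would no longer be written each frame. The display size is an assumption, not something stated here: 320 pixels wide, 224 rows (the arcade original's resolution) and 8 bitplanes. `chip_bytes_saved` is a hypothetical helper.

```python
def chip_bytes_saved(rows_cropped, width_px=320, bitplanes=8):
    """Bytes of bitplane data no longer written per frame."""
    bytes_per_row_per_plane = width_px // 8   # 1 bit per pixel per plane
    return rows_cropped * bytes_per_row_per_plane * bitplanes

ASSUMED_ROWS = 224                            # assumed visible height
saved = chip_bytes_saved(32)
fraction = 32 / ASSUMED_ROWS
print(saved, "bytes per frame,", f"{fraction:.1%} of the C2P work")
```

Under those assumptions, cropping 32 rows removes 10,240 bytes of chip writes per frame, about a seventh of the C2P output.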

Can anyone patiently explain to me why WinUAE speeds are so radically different? I've run SysInfo on my real Amiga. It correctly measures my CPU at 50 MHz. My accelerator card is definitely working OK from what I can tell.

I can understand a bit of discrepancy, but I have to set WinUAE to vanilla A1200 speed (i.e. around 14 MHz, roughly 4x slower than my hardware's clock) to really match what I'm getting on hardware, even for routines that don't touch chip memory. This seems to suggest I'm doing something wrong somewhere, either with my UAE setup or even my code. I'm not thinking I'm magically going to conjure up 4x speed out of nowhere on hardware; I'm just a bit baffled by this massive discrepancy. UAE, in cycle-accurate mode, is presumably executing a defined number of clock cycles per scanline, so I wouldn't expect it to be 4x faster if I, say, set the processor to 030 @ 50 MHz with cycle-accurate settings.
Old 19 February 2024, 11:06   #91
AestheticDebris
Registered User
 
Join Date: May 2023
Location: Norwich
Posts: 376
Quote:
Originally Posted by reassembler
Can anyone patiently explain to me why WinUAE speeds are so radically different? I've run SysInfo on my real Amiga. It correctly measures my CPU at 50 MHz. My accelerator card is definitely working OK from what I can tell.
AIUI, "cycle exact" is really only cycle exact for an A500, 68000 configuration. On higher spec CPUs it helps with things like chipram access timings, but isn't really emulating the CPU internals at a real speed.
Old 19 February 2024, 11:14   #92
jotd
This cat is no more
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,196
ok so you mean it would for instance be faster to do this (with A1 in chip)

Code:
  move.b  (a0)+,(a1)
  add.w #40,a1
  move.b  (a0)+,(a1)
than

Code:
  move.b  (a0)+,(a1)
  move.b  (a0)+,(40,a1)
? as the second instruction is longer to decode, while the "add" is virtually free?
Old 19 February 2024, 17:58   #93
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Quote:
Originally Posted by jotd
ok so you mean it would for instance be faster to do this (with A1 in chip)

Code:
  move.b  (a0)+,(a1)
  add.w #40,a1
  move.b  (a0)+,(a1)
than

Code:
  move.b  (a0)+,(a1)
  move.b  (a0)+,(40,a1)
? as the second instruction is longer to decode, while the "add" is virtually free?
Specifically what I'm referring to is this passage from MC68030UM §11.1:
Quote:
The MC68030 can overlap data writes with instruction cache reads, data cache reads, and/or microsequencer execution. Instruction cache reads can be overlapped with data cache fills and/or microsequencer activity. Similarly, data cache reads can be overlapped with instruction cache fills and/or microsequencer activity. The execution of an instruction that only accesses on-chip registers can be overlapped entirely with a concurrent data write generated by a previous instruction, if prefetches generated by that instruction are resident in the instruction cache.
So it only works in specific circumstances. Your example may or may not be faster, but it uses more of the precious instruction cache, so probably not (but always time changes like that!).


Ideally you'd want the instructions between the writes to be in cache and only operate on registers. But I guess this is sort of OT for now.
Old 20 February 2024, 17:56   #94
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Last night I was fried from programming, so I decided to chill out on the sofa and watch YouTube videos of people soldering capacitors.

Unfortunately, I had a brainwave whilst on the sofa about optimizing the sprite rendering further which involved a realization that OutRun never tried to render clipped sprites (i.e. ones that are partially off-screen) starting from an off-screen coordinate, but would always draw starting on-screen -> off-screen. I immediately had to try out some tweaks to see if I was imagining things. Mostly because there are 6 separate rendering routines by this point for speed already. But no it worked and shaved significant cycles from the sprite routines.
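One way to read that observation, as a guess at the shape of the saving rather than the actual routines: if a sprite row is guaranteed to start on-screen, the inner loop never has to skip leading off-screen pixels and only needs a single clamp at the far edge. The Python sketch below uses a made-up `visible_span` helper and an assumed 320-pixel-wide screen.

```python
def visible_span(x, width, screen_width=320):
    """Pixels to draw for a sprite row that starts on-screen at column x."""
    if not 0 <= x < screen_width:
        raise ValueError("sketch assumes the start column is on-screen")
    # No leading clip is needed; just stop at the screen edge.
    return min(width, screen_width - x)
```

For example, a 64-pixel-wide row starting at column 300 draws only 20 pixels, and the loop can stop there without any per-pixel bounds checks.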

Today I messed around with the 68030 CACR register for data cache, instruction cache (on by default) and the various burst modes. I wish I could say I was smart enough to read the 030 manual and immediately identify the best combination to use. But, no, I just turned them on and off and measured the results.

Although there seem to be mixed reports online about the benefits of the data cache, for me, simply turning everything on (Instruction Cache + Burst, Data Cache + Burst) yielded the best performance and doesn't appear to cause any problems. Probably only 10-12% faster overall than just the default of the Instruction Cache. But I'll take that performance gain.

EDIT: On reflection and after further tests, I'm not entirely sure the CACR stuff has made any difference at all really. The performance differences I listed above seem to be more the gap between the best case and worst case of the various options I tried. The real difference came from the sprite optimizations I described at the start of this post. Still, it was worth tinkering with as a test on hardware at least.

The next thing to tackle is potentially rolling some of the most intense graphical loops to fit in the 256 byte instruction cache - 030 and above specific optimizations really. I also have other performance related ideas to try.

Last edited by reassembler; 20 February 2024 at 18:13.
Old 20 February 2024, 18:49   #95
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Great to hear about your experiments, keep up the good work!
I doubt messing with CACR is really going to bring you much (except if you start w/o startup-sequence). Both instruction and data cache should be enabled if you start normally, and I think that should be fine for now. But by all means try stuff.

Keep in mind that cache lines are 16 bytes (on 030), so you want to limit functions that have to be kept in cache to 256-16 bytes or make sure they're 16-byte aligned. You generally get 8-byte alignment from AmigaOS so you might be able to make that even 256-8 w/o doing too much extra work, but as always, measure.
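The alignment arithmetic can be checked with a few lines of Python. This sketch assumes the 030 instruction cache is 256 bytes organised as 16 lines of 16 bytes, and that a routine "fits" when it touches at most 16 consecutive lines; the function names are made up for illustration.

```python
LINE_BYTES = 16                      # 68030 cache line size
CACHE_BYTES = 256                    # 68030 instruction cache size
CACHE_LINES = CACHE_BYTES // LINE_BYTES

def lines_touched(start, length):
    """Number of 16-byte lines covered by [start, start + length)."""
    first = start // LINE_BYTES
    last = (start + length - 1) // LINE_BYTES
    return last - first + 1

def fits_in_icache(start, length):
    return lines_touched(start, length) <= CACHE_LINES

def max_safe_length(alignment):
    """Longest routine guaranteed to fit for any start with this alignment."""
    if alignment not in (2, 4, 8, 16):
        raise ValueError("alignment must be 2, 4, 8 or 16")
    worst_offset = LINE_BYTES - alignment   # furthest a start can sit into a line
    return CACHE_BYTES - worst_offset
```

With 8-byte alignment this gives 248 bytes (256-8), and with 16-byte alignment the full 256. For merely word-aligned code the bound works out to 242, so 256-16 = 240 is a safe round-down.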
Old 20 February 2024, 21:13   #96
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Quote:
Originally Posted by paraj
Keep in mind that cache lines are 16 bytes (on 030), so you want to limit functions that have to be kept in cache to 256-16 bytes or make sure they're 16-byte aligned. You generally get 8-byte alignment from AmigaOS so you might be able to make that even 256-8 w/o doing too much extra work, but as always, measure.
Got it. I read somewhere on an Atari forum that loops had to conclude with a DBRA to be cached. Is that actually true?! Sounds a bit random to me. Not sure if I should trust those Atari owners!

(Disclosure: I own two STs)
Old 20 February 2024, 22:02   #97
gimbal
cheeky scoundrel
 
Join Date: Nov 2004
Location: Spijkenisse/Netherlands
Age: 42
Posts: 6,917
Quote:
Originally Posted by reassembler
Last night I was fried from programming, so I decided to chill out on the sofa and watch YouTube videos of people soldering capacitors.

Unfortunately, I had a brainwave whilst on the sofa
Nothing unfortunate about it, that's just the Tetris effect of programming, and it has led to so much redundant code being scrapped in my own projects. At least one half of your brain just keeps working, awake or asleep.

There ain't no rest for the wicked.
Old 21 February 2024, 00:16   #98
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Tried a variation where I reduced some of the most heavily used sprite loops/routines to well under 256 bytes. It made no difference to overall speed; in fact, it was slightly slower, regardless of CACR settings.

I think the most significant gains to be had will result from improving my code, as opposed to tinkering with this stuff. It doesn't seem a fruitful path to carry on down for now in terms of gaining any significant performance.
Old 21 February 2024, 00:56   #99
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
If I remember meynaf's tests on the 68030 correctly, a routine that should stay in cache can be at most 240 bytes long.
Old 21 February 2024, 09:38   #100
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Quote:
Originally Posted by reassembler
Got it. I read somewhere on an Atari forum that loops had to conclude with a DBRA to be cached. Is that actually true?!
Not true!


If you don't see much degradation when turning off caches, it suggests that you haven't adapted your code to take advantage of them, so there's probably something to be gained there ;-)
 

