17 February 2024, 21:29 | #81 |
Moon 1969 = amiga 1985
Join Date: Apr 2007
Location: belgium
Age: 48
Posts: 3,914
|
great job
|
18 February 2024, 09:31 | #82 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 449
|
Quote:
Also, with such an approach some sort of incremental scaling like the old blitter zoom would be possible: splitting the object into four parts that are simply copied, then inserting one row/column from the original (or doubling a line if scale > 1) for the next zoom step. Which ofc also looks different from a real zoom. Probably has a lot of issues too, like overhead for indirect drawing and alignment in memory, and would get pretty complex... Fantastic work so far, by the way! |
|
18 February 2024, 10:17 | #83 |
Guru Meditating
Join Date: Jun 2014
Location: England
Posts: 2,367
|
The latest screenshots look superb, amazing work so far. I'm glad I got the add4 board for my Blizzard 1220/4; I know my 1200 is looking forward to this!
|
18 February 2024, 11:12 | #84 |
Registered User
Join Date: Dec 2016
Location: Finland
Posts: 169
|
Have you thought about doing the sprite scaling in bitmap mode, and avoiding a C2P routine altogether?
I think it would be possible to have some table lookup where you precompute the scaled graphics in 8-pixel increments. So given 8 pixels of graphics data (1 byte) and a zoom level, the table lookup returns what the scaled data should look like. You then need a 256-byte table for each zoom level; doing the scaling 16 pixels at a time would create tables that are probably too big (64K * scale levels).
This could be combined with mip mapping to reduce distortion in sprite scaling. But it needs to be thought out how to minimize memory accesses: accumulate the scaled data in a 32-bit data register and always write out a full longword into chip ram. In bitmap mode the y-scaling is as simple as in chunky mode, just skip rows. |
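To make the table idea concrete, here's a rough sketch of what the inner loop might look like for the simplest case: a fixed 2x horizontal zoom of a single bitplane, where each source byte expands to one word via a 256-entry table. All register assignments are made up and it's untested; it's only meant to illustrate the memory-access pattern (one aligned longword to chip ram per two table lookups):
Code:
; a0 = source row, a1 = chip-ram destination
; a2 = 512-byte table: entry i holds the 8 bits of i, each doubled
; d7 = output longwords per row, minus 1
.loop:  moveq   #0,d0
        move.b  (a0)+,d0        ; next 8 source pixels
        add.w   d0,d0           ; byte -> word table index
        move.w  (a2,d0.w),d1    ; 16 doubled pixels
        swap    d1              ; park them in the high half
        moveq   #0,d0
        move.b  (a0)+,d0        ; following 8 source pixels
        add.w   d0,d0
        move.w  (a2,d0.w),d1    ; low 16 pixels
        move.l  d1,(a1)+        ; one aligned longword write to chip ram
        dbf     d7,.loop
Arbitrary zoom levels are messier, since each source byte then produces a variable number of output bits; that's where accumulating in a data register and shifting comes in.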
18 February 2024, 22:50 | #85 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
As promised, I made a video of it running on hardware.
Also, the C2P routine I'm using is here: https://github.com/Kalmalyzer/kalms-...1_8_c5_030_2.s Is there a faster one? |
18 February 2024, 23:05 | #86 |
Registered User
Join Date: Jul 2009
Location: Lala Land
Posts: 608
|
Coo, just look at all the flocks of windsurfers in their natural habitat. Nowhere on Earth other than Outrunland has as high a population of this endangered species!
Great work, really looking good! |
18 February 2024, 23:06 | #87 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,317
|
Very nice progress!
Regarding C2P, you're not likely to beat Kalms for general cases, but that doesn't mean there aren't ways to improve performance for specific uses. What are you using the C2P routine for, and how? Some things you may want to look into:
- Do you always use it for 8 bitplanes? (Maybe you can avoid some writes otherwise)
- Premade/unrolled versions for specific widths (or maybe even heights as well), so you avoid reinitializing + cache flushing
- Combining C2P with other processing if possible. I guess you're not doing C2P directly to chip ram? But if you are, there is a large benefit on 030+ to doing calculations that only depend on fast ram/CPU while the chip store is completing |
18 February 2024, 23:14 | #88 | |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Quote:
Right now almost everything in the engine is in Fast Memory, with the exception of the Bitplanes, which are in Chip Memory. So the C2P has to deal with writing directly into chip memory. |
|
18 February 2024, 23:48 | #89 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,317
|
Quote:
Writing to chip mem is slow. Even when you're being nice and doing properly aligned longword writes, you're limited to about 7MB/s. 030+ accelerator cards with all bells and whistles enabled can do the transfer "in the background" (on the 060 this is called the store buffer, with up to 4 pending transfers allowed; the 030 has 1 and calls it something like a write pending buffer). So in some cases, and with great effort, you can get "free" extra cycles by overlapping computations with the chip memory stores completing. E.g.
Code:
move.l  d0,(a0)+   ; store to chip mem
; XXX              ; independent register/fast-ram work can run here "for free"
move.l  d1,(a0)+   ; store to chip mem again, has to wait for first store to complete
So basically, if you have somewhere in your code where you're just copying data to chipmem, there might be an optimization opportunity (though it might be difficult to exploit). |
|
19 February 2024, 10:29 | #90 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Ah yes, I'd read about the above technique. The only other place I write to chip memory is updating the copper list for the background. I write around 448 words per frame. But I'll have a think about my C2P usage in general to see if I can figure out some smart approaches.
An obvious, but somewhat brutal, optimization would be cropping the top of the display. A lot of the time there isn't an awful lot going on there, and it's arguably not C2P time well spent. You could probably remove at least 32 vertical rows without really making a massive difference to the experience. But that's maybe one to think about once I've got the background layer in, to avoid optimizing things too soon and rewriting.
Can anyone patiently explain to me why WinUAE speeds are so radically different? I've run SysInfo on my real Amiga. It correctly measures my CPU at 50 MHz, and my accelerator card is definitely working OK from what I can tell. I can understand a bit of discrepancy, but I have to set WinUAE to vanilla A1200 speed (i.e. around 14 MHz, a setting 4x slower than my real hardware) to really match what I'm getting on hardware, even for routines that don't touch chip memory.
This seems to suggest I'm doing something wrong somewhere, either with my UAE setup or even my code. I'm not expecting to magically conjure up 4x speed out of nowhere on hardware; I'm just a bit baffled by this massive discrepancy. WinUAE, in cycle-accurate mode, is presumably executing a defined number of clock cycles per scanline, so I wouldn't expect it to be 4x faster if I set the processor to an 030 @ 50 MHz with cycle-accurate settings. |
19 February 2024, 11:06 | #91 |
Registered User
Join Date: May 2023
Location: Norwich
Posts: 531
|
AIUI, "cycle exact" is really only cycle exact for an A500, 68000 configuration. On higher spec CPUs it helps with things like chipram access timings, but isn't really emulating the CPU internals at a real speed.
|
19 February 2024, 11:14 | #92 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,458
|
ok so you mean it would for instance be faster to do this (with A1 in chip)
Code:
move.b  (a0)+,(a1)
add.w   #40,a1
move.b  (a0)+,(a1)
than this?
Code:
move.b  (a0)+,(a1)
move.b  (a0)+,(40,a1) |
19 February 2024, 17:58 | #93 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,317
|
Quote:
Quote:
Ideally you'd want the instructions between the writes to be in cache and only operate on registers. But I guess this is sort of OT for now. |
||
20 February 2024, 17:56 | #94 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Last night I was fried from programming, so I decided to chill out on the sofa and watch YouTube videos of people soldering capacitors.
Unfortunately, I had a brainwave whilst on the sofa about optimizing the sprite rendering further. It involved a realization that OutRun never tried to render clipped sprites (i.e. ones that are partially off-screen) starting from an off-screen coordinate, but would always draw starting on-screen and heading off-screen. I immediately had to try out some tweaks to see if I was imagining things, mostly because there are 6 separate rendering routines by this point for speed already. But no, it worked, and shaved significant cycles off the sprite routines.
Today I messed around with the 68030 CACR register for the data cache, the instruction cache (on by default) and the various burst modes. I wish I could say I was smart enough to read the 030 manual and immediately identify the best combination to use. But no, I just turned them on and off and measured the results. Although there seem to be mixed reports online about the benefits of the data cache, for me, simply turning everything on (Instruction Cache + Burst, Data Cache + Burst) yielded the best performance and doesn't appear to cause any problems. Probably only 10-12% faster overall than the default of just the instruction cache, but I'll take that performance gain.
EDIT: On reflection and further tests, I'm not entirely sure the CACR stuff has made any difference at all really. The performance differences I listed above seem more like the spread between the best case/worst case of the various options I tried. The real difference came from the sprite optimizations I described at the start of this post. Still, it was worth tinkering with as a test on hardware at least.
The next thing to tackle is potentially rolling some of the most intense graphical loops to fit in the 256-byte instruction cache; these are really 030-and-above-specific optimizations. I also have other performance related ideas to try.
Last edited by reassembler; 20 February 2024 at 18:13. |
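For anyone wanting to repeat the cache experiment, the relevant enable bits all live in one register. A minimal sketch, which must run in supervisor mode (under AmigaOS the polite route is exec's CacheControl() rather than poking CACR directly):
Code:
; 68030 CACR enable bits (from the 68030 User's Manual):
; EI=$0001 icache, IBE=$0010 icache burst,
; ED=$0100 dcache, DBE=$1000 dcache burst, WA=$2000 write allocate
        move.l  #$3111,d0       ; everything on
        movec   d0,cacr
Clearing individual bits in that value and re-measuring is the on/off experiment described above.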
20 February 2024, 18:49 | #95 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,317
|
Great to hear about your experiments, keep up the good work!
I doubt messing with CACR is really going to bring you much (except if you start w/o startup-sequence). Both instruction and data cache should be enabled if you start normally, and I think that should be fine for now. But by all means try stuff!
Keep in mind that cache lines are 16 bytes (on the 030), so you want to limit functions that have to be kept in cache to 256-16 bytes, or make sure they're 16-byte aligned. You generally get 8-byte alignment from AmigaOS, so you might be able to make that even 256-8 w/o doing too much extra work. But, as always, measure. |
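For what it's worth, the alignment half of that is cheap to try: most assemblers can pad a hot loop onto a cache-line boundary (CNOP in Devpac/vasm syntax; the loop body here is just an illustrative fast-ram copy):
Code:
        CNOP    0,16              ; start on a 16-byte 030 cache line
.loop:  move.l  (a0)+,d0          ; example body, fast ram only
        move.l  d0,(a1)+
        dbf     d7,.loop          ; loop no longer straddles an extra line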
20 February 2024, 21:13 | #96 | |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Quote:
(Disclosure: I own two STs) |
|
20 February 2024, 22:02 | #97 | |
cheeky scoundrel
Join Date: Nov 2004
Location: Spijkenisse/Netherlands
Age: 43
Posts: 7,123
|
Quote:
There ain't no rest for the wicked. |
|
21 February 2024, 00:16 | #98 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Tried a variation where I reduced some of the most utilised sprite loops/routines to well under 256 bytes. It made no difference to overall speed; in fact, it was slightly slower, regardless of CACR settings.
I think the most significant gains to be had will result from improving my code, as opposed to tinkering with this stuff. It doesn't seem a fruitful path to carry on down for now in terms of gaining any significant performance. |
21 February 2024, 00:56 | #99 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,134
|
If I remember meynaf's tests on the 68030 correctly, a cached routine can be at most 240 bytes long.
|
21 February 2024, 09:38 | #100 | |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,666
|
Quote:
If you don't see much degradation when turning off caches, it suggests that you haven't adapted your code to take advantage of them, so there's probably something to be gained there ;-) |
|