English Amiga Board


Old 08 February 2021, 10:04   #1
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Tinyus Tech

For tech-savvy readers, here is some information about the inner workings of Tinyus:

The game (including the replayer) was coded in C99, except for two short asm routines that copy images from slowram to chipmem on demand.

Gfx
-The game runs at 256x224 (plus the top and bottom hud area).
-The game runs at 32 colors (5 planes).
-24 colors are shared among all levels. 8 colors are unique for each level.
-A single sprite is used for the ~70 stars in background. Updating them takes around 8 rasterlines each frame.
-In the last level sprites are used to create the large 'cage' enemy.
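
As a rough illustration of the single-sprite star field, here is a minimal C sketch that packs one-line-tall stars into the data list of one sprite DMA channel. The `Star` type and both function names are made up for this example, and this is not the actual Tinyus code. It assumes only the standard OCS sprite control word layout, and that the stars are sorted by y with at least one free line between them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical star record; x and y are in sprite (beam) coordinates. */
typedef struct { uint16_t x, y; } Star;

/* Pack the two control words (SPRxPOS / SPRxCTL layout) of an OCS sprite
 * that starts at line y and is `height` lines tall. */
static void sprite_control(uint16_t x, uint16_t y, uint16_t height,
                           uint16_t *pos, uint16_t *ctl)
{
    uint16_t vstop = (uint16_t)(y + height);
    *pos = (uint16_t)(((y & 0xFF) << 8) | ((x >> 1) & 0xFF));
    *ctl = (uint16_t)(((vstop & 0xFF) << 8)
                      | (((y >> 8) & 1) << 2)      /* VSTART bit 8 */
                      | (((vstop >> 8) & 1) << 1)  /* VSTOP bit 8  */
                      | (x & 1));                  /* HSTART bit 0 */
}

/* Write one 1-line sprite per star into `out` (4 words each), followed by
 * a 2-word end marker. Returns the number of words written, so `out`
 * must hold at least count * 4 + 2 words. */
static size_t build_star_sprite(const Star *stars, size_t count, uint16_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        /* The channel needs a free line to fetch the next control words. */
        assert(i == 0 || stars[i].y >= stars[i - 1].y + 2);
        sprite_control(stars[i].x, stars[i].y, 1, &out[n], &out[n + 1]);
        out[n + 2] = 0x8000;  /* plane 0 data: only the leftmost pixel set */
        out[n + 3] = 0x0000;  /* plane 1 data */
        n += 4;
    }
    out[n++] = 0;  /* end-of-list control words */
    out[n++] = 0;
    return n;
}
```

Moving the stars each frame then only means rewriting the control words of each entry, which would fit the small per-frame cost mentioned above.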

Audio
-Music was done with Pretracker. It contains 14 songs and 15 sfx.
-Music takes 24,542 bytes of RAM and 4,690 bytes of chip RAM.
-The music player is called by a copper IRQ on the line where the bg graphics start.

Blitting
-Blitting (with blitter priority on) starts after the last displayed line, so the blitter runs while no gfx DMA is happening, for maximum throughput.
-Blits are orchestrated by the CPU. In Tiny Bobble I used copper blits, which left more CPU time, but Tinyus has much more diverse blitter setups, so the overhead of generating a copperlist for them was not viable.
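
A minimal sketch of how a CPU-driven blit queue could be held back until a given raster line, assuming a plain polling loop on the beam position (the game may well use an interrupt instead). `beam_line` and `wait_for_line` are hypothetical names. The register pair is passed in as a pointer so the decoding can be tried off-target; on real hardware it would be VPOSR/VHPOSR at $DFF004.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the 9-bit vertical beam position from a 32-bit read of the
 * VPOSR/VHPOSR register pair: V8 is bit 0 of VPOSR, V7..V0 are the
 * high byte of VHPOSR. */
static uint16_t beam_line(uint32_t vposr_vhposr)
{
    return (uint16_t)((vposr_vhposr >> 8) & 0x1FF);
}

/* Busy-wait until the beam has reached `line`. On an Amiga the pointer
 * would be (const volatile uint32_t *)0xDFF004. */
static void wait_for_line(const volatile uint32_t *vposr, uint16_t line)
{
    assert(line < 313);  /* a PAL frame has at most 313 lines */
    while (beam_line(*vposr) < line) {
        /* spin; the blit queue is kicked off after this returns */
    }
}
```

With this, `wait_for_line(vposr, 274)` would return once the beam is at line 274 or later in the current frame.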

Scrolling
-Scrolling is achieved without copper splits to keep CPU overhead low (at the cost of much higher chipmem usage).
-The game uses 3 buffers: one for restore, two for double buffering.
-Each buffer is sized 288x448.
-Horizontal scrolling is achieved by adding another scratch buffer of 4096 pixels to each of the buffers and using hardware scroll plus plane offset.
-Vertical scrolling is achieved by plane offset plus duplicating all bg elements 256 pixels apart on y. If the y scrollPos goes over 256 it wraps.
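
The plane-offset part of the scrolling described above can be sketched roughly as follows. This is an illustration under assumptions, not the game's code: `ScrollPos` and `scroll_position` are hypothetical names, the planes are assumed to be non-interleaved with `row_bytes` bytes per row, and the mapping from the fine part to the actual hardware scroll register value is left out because it depends on the display fetch setup.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t byte_offset;  /* added to the start address of every bitplane */
    uint16_t fine;         /* remaining 0..15 pixels for the hardware scroll */
} ScrollPos;

/* Split a scroll position into a bitplane pointer offset and a fine part.
 * Because every background element also exists wrap_y lines further down,
 * the y position can simply wrap instead of needing a copper split. */
static ScrollPos scroll_position(uint32_t scroll_x, uint32_t scroll_y,
                                 uint32_t row_bytes, uint32_t wrap_y)
{
    assert(wrap_y > 0);
    assert(row_bytes % 2 == 0);  /* bitplane rows are word aligned */

    ScrollPos p;
    uint32_t y = scroll_y % wrap_y;    /* vertical wrap */
    uint32_t word = scroll_x >> 4;     /* coarse x step: one 16-pixel word */
    p.byte_offset = y * row_bytes + word * 2;
    p.fine = (uint16_t)(scroll_x & 15);
    return p;
}
```

For a 288 pixel wide buffer, `row_bytes` would be 36 and `wrap_y` would be 256. As x keeps growing, the offset walks past the end of the buffer, which is presumably what the extra scratch area is for.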
Old 08 February 2021, 10:48   #2
alpine9000
Registered User
 
Join Date: Mar 2016
Location: Australia
Posts: 881
I bet you were happy when you worked out you had enough free ram to avoid copper split blits!

Thanks for sharing the info!
Old 08 February 2021, 12:46   #3
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Quote:
Originally Posted by alpine9000 View Post
I bet you were happy when you worked out you had enough free ram to avoid copper split blits!

Thanks for sharing the info!
Hehe.. you got me.. I really tried to avoid them and simply decided at the start of the project to do so, whatever the cost may be...
Though such splits can be a big timesaver when you have a dynamic background. When an area changed in the backgrounds of Tinyus I needed to blit this area 5 times.. because everything was duplicated on the Y axis... it would be cheaper with splits.
Old 08 February 2021, 12:50   #4
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by pink^abyss View Post
Scrolling
-Scrolling is achieved without copper splits to keep CPU overhead low (at the cost of much higher chipmem usage).
-The game uses 3 buffers: one for restore, two for double buffering.
-Each buffer is sized 288x448.
-Horizontal scrolling is achieved by adding another scratch buffer of 4096 pixels to each of the buffers and using hardware scroll plus plane offset.
-Vertical scrolling is achieved by plane offset plus duplicating all bg elements 256 pixels apart on y. If the y scrollPos goes over 256 it wraps.
Hi pink, so the engine uses a similar idea:
http://eab.abime.net/showpost.php?p=...1&postcount=41
The difference is that it is bigger on y for the 256 wrap.

Well, if I'm not wrong on the numbers you use ~200KB. Definitely worth it, given the result.

Old 08 February 2021, 13:31   #5
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Quote:
Originally Posted by ross View Post
Hi pink, so the engine use a similar idea:
http://eab.abime.net/showpost.php?p=...1&postcount=41
Difference is that is bigger on y for the 256 wrap.

Well, if I'm not wrong on number you use ~200KB. Definitely worth it, given the result.


Yeah, that's the approach. It took around 240 KB of chipmem as I use 5 planes.
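
The figure can be checked from the numbers given in the first post (3 buffers of 288x448 with 5 planes). The helper name `buffer_bytes` is only for this example, and the extra scratch area for horizontal scrolling is not included.

```c
#include <assert.h>
#include <stdint.h>

/* Chip memory needed for one planar buffer, in bytes. */
static uint32_t buffer_bytes(uint32_t width_px, uint32_t height,
                             uint32_t planes)
{
    assert(width_px % 16 == 0);  /* bitplane rows are whole words */
    return (width_px / 8) * height * planes;
}

/* Total for several identical buffers. */
static uint32_t total_bytes(uint32_t buffers, uint32_t width_px,
                            uint32_t height, uint32_t planes)
{
    return buffers * buffer_bytes(width_px, height, planes);
}
```

Here `buffer_bytes(288, 448, 5)` is 80,640 bytes and `total_bytes(3, 288, 448, 5)` is 241,920 bytes, which is about 236 KiB and matches the "around 240" figure.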
Old 08 February 2021, 15:43   #6
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by pink^abyss View Post
Blitting
-Blitting (with priority on) starts after the last displayed line. So blitter starts running when no gfx DMA is happening for maximum throughput
-Blits are orchestrated by the CPU. In Tiny Bobble i used copper blits which left more CPU time, but Tinyus has much more diverse blitter setups and so the overhead for generating a copperlist for them was not viable.
As this is the Tinyus technical thread, here is a technical question.
I have also seen other engines where the blitter is started at the end of the bitplane DMA, probably for the same reasons as yours.
But I've never been convinced that it's the best way or that it actually brings benefits (I usually do it another way).

Let me explain, maybe something obvious escapes me.
Since we are in a double-buffered environment, it doesn't matter when I start filling the second buffer, because the previous one is always on display; it is sufficient that the new buffer is ready and the CPU has set it by the time the copper sets the new pointers (it could even be enough to do this in one of the first video lines, if I do it immediately upon arrival of IRQ3 and I am sure that the higher-priority IRQs execute quickly).

So why not avoid going through the blitting routines (IRQ or polling) at the bottom of the screen, but do it directly from the VBI?
In any case the time for the frame is the same and if I have to skip it, I skip it anyway if I start after the DMA of the bitplanes or if I start from the beginning of the frame..

Also, I don't understand why it would be better to execute the blitter queue when the bitplane DMA is not in use; if I use BLTPRI, the cycles are certainly all used (excluding the usual idle cycles). Moreover, if some memory access cycles are 'lost' because the 68k is working internally, there is the possibility that they are better used by other DMA channels during the active area of the video.
Obviously we are talking about very little difference between the two approaches, but it is only for an exchange of views, and I can be wrong about it.
Old 08 February 2021, 16:10   #7
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Well, I've done a ton of experiments with the CPU/Blitter and throughput over the past few years, and it has been my observation that in general you want the CPU to do its thing during the scanlines where the bitplane/sprite/etc DMA is active. This is more efficient than starting at VBL because the CPU can interleave its accesses with fewer losses while the bitplanes are fetching than the Blitter can. This is due to the CPU always having idle cycles, vs the Blitter usually not having any.

Now, the difference here is indeed small. But it's still a couple of % over CPU starting at start of VBL so it may be worthwhile.

However, I always try to start Blitting as soon as the CPU logic part is done, I never wait until any rasterline to do so (unless single buffering).
Old 08 February 2021, 16:41   #8
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Yes, starting immediately after the VBI is not the best, but for this very reason it is also a bit worse to start the blitter queue after the video DMA, where the bus is almost totally free.
What I usually do first is some clear operation where I overlap blitter and CPU. So when I start the blitter's queue I am very close to or within the zone with the most DMA traffic.

And yes, the 'after bitplanes DMA' approach is better suited (in a very limited manner) for single buffer mode.

In any case, given the quality of Tinyus, I would say that this way works great too!
Old 08 February 2021, 16:46   #9
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Quote:
Originally Posted by ross View Post
As this is the Tinyus technical thread, here is a technical question.
I have also seen in other engines cases that the blitter is started at the end of the bitplanes DMA, probably for the same reasoning as you.
But I've never been convinced that it's the best way or that it actually brings benefits (I usually do it another way).

Let me explain, maybe something obvious escapes me.
Since we are in a double buffer environment it doesn't matter when I start filling the second buffer since I always have the previous one in display; it is sufficient that for when the copper set the new pointers I have the new buffer ready and CPU set it (it could also be enough in one of the first video lines if I do it immediately upon arrival of IRQ3 if I am sure that the higher priority IRQs are of fast execution).

So why not avoid going through the blitting routines (IRQ or polling) at the bottom of the screen, but do it directly from the VBI?
In any case the time for the frame is the same and if I have to skip it, I skip it anyway if I start after the DMA of the bitplanes or if I start from the beginning of the frame..

Also I don't understand why it would be better to execute the blitter queue when the DMA of the bit planes is not used; if I use BLTPRI the cycles are certainly all used (excluding the usual idle cycles in case). Moreover, if I have some 'lost memory accesses' cycles because the 68k is working internally there is the possibility that are better used by other DMA channels if during the active area of the video.
Obviously we are talking about very little difference between the two approaches, but it is only for an exchange of views, and I can be wrong about it.

I saw waiting for DMA blank areas before blitting in a couple of other games too. I tested it for my game and it was also faster.

As a reference I made a real-world test in Tinyus to measure how many cycles are used by the game depending on the rasterwait position before blitting is started.
I accumulate CIA cycles for 64 frames (while the game is running, always from the same start point). I disabled the music interrupt to avoid any clutter in the measurement.
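
How exactly the CIA cycles were accumulated is not stated. One way such a measurement could look, as a sketch with made-up names (`Profile`, `cia_elapsed`, `profile_add`), assuming a free-running 16-bit CIA timer that counts down from $FFFF and is read once before and once after the measured work:

```c
#include <assert.h>
#include <stdint.h>

/* Running total over many frames. */
typedef struct {
    uint32_t total_ticks;
    uint32_t frames;
} Profile;

/* CIA timers count down, so elapsed = start - end. The 16-bit wraparound
 * handles one underflow, as long as fewer than 65536 ticks have passed. */
static uint16_t cia_elapsed(uint16_t start, uint16_t end)
{
    return (uint16_t)(start - end);
}

/* Add one frame's measurement to the running total. */
static void profile_add(Profile *p, uint16_t start, uint16_t end)
{
    assert(p != 0);
    p->total_ticks += cia_elapsed(start, end);
    p->frames++;
}
```

After 64 calls, `total_ticks` would be the kind of sum listed in the table that follows.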

Cycles
336236 - Start blitting at Rasterline 274 (when DMA is off)
387610 - Start blitting at Rasterline 174
400532 - Start blitting at Rasterline 74

It seems my game uses 20% more cycles if I blit when DMA is active.
I'm not sure if my test is correct, because these values look too good to me, but I'm not aware of anything wrong with my measurements.
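
For reference, the relative overhead of the two later start lines against the line-274 baseline can be computed from the three totals above. The function name is just for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Extra cost of `measured` relative to `baseline`, in percent. */
static double overhead_percent(uint32_t baseline, uint32_t measured)
{
    assert(baseline > 0);
    return 100.0 * ((double)measured - (double)baseline) / (double)baseline;
}
```

With the measured totals, `overhead_percent(336236, 387610)` comes out at roughly 15.3 (start at line 174) and `overhead_percent(336236, 400532)` at roughly 19.1 (start at line 74), which is where the "20%" comes from.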
Old 08 February 2021, 16:50   #10
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
I guess my first question would be: do you do anything else after blitting or is blitting the last step in a frame? And my second question would be: when does game logic start running? Start of VBL or some other time?

By the way, 20% difference does sound reasonable for blitting costs while DMA is running. The point here is to try and optimise when the CPU runs rather than when the Blitter runs.

Edit: the above was phrased a bit weirdly by me. What I mean is that a blit costing 20% if it starts and finishes during display DMA is reasonable, not that an overall 20% difference for blitting+logic is reasonable.

Last edited by roondar; 08 February 2021 at 17:02.
Old 08 February 2021, 17:01   #11
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by pink^abyss View Post
I saw waiting for DMA blank areas before blitting in a couple of other games too. I tested it for my game and it was also faster.

As reference i made a real world test in Tinyus to measure how many cycles are used by the game depending on the rasterwait position before blitting is started.
I accumulate CIA cycles for 64 frames (while the game is running, always at the same startpoint). I disabled the music interrupt to avoid any clutter in the measuring.

Cycles
336236 - Start blitting at Rasterline 274 (when DMA is off)
387610 - Start blitting at Rasterline 174
400532 - Start blitting at Rasterline 74

It seems my game uses 20% more cycles if i blit when DMA is active.
I'm not sure if my test is correct, because these values look too good to me but i'm not aware of anything wrong with my measurements.
Well, if these are absolute cycles then the one from raster line 74 is an excellent result (only a 20% loss when the bus is overloaded is great!)

Obviously you have to review the logic of the events a bit, but be aware how much you would then be able to do when the bus is completely free and you could potentially overlap the blitter and CPU (with large blitter's objects and CPU doing math/engine operations)
Old 08 February 2021, 18:28   #12
aros-sg
Registered User
 
Join Date: Nov 2015
Location: Italy
Posts: 191
Quote:
Originally Posted by pink^abyss View Post
Yeah, thats the approach. It took around 240kb chipmem as i use 5 planes.

No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.
Old 08 February 2021, 18:45   #13
aros-sg
Registered User
 
Join Date: Nov 2015
Location: Italy
Posts: 191
If you also want to avoid a split restore buffer (and split blits), you would need one additional normal sized buffer -> 4 buffers inside the master buffer -> 3 will always be non-splitting/wrapping.


Think about the whole thing like a ROL or ROR of a 32 bit value (0xAABBCCDD).
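
One way to read the master buffer idea, as a sketch with made-up names (`Slot`, `slot_position`) and not code from the linked post: `count` buffers of `height` rows each live back to back in a master buffer of `count * height` rows, and all of them rotate through it by the current y scroll position, just like the bytes of a rotated 32 bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t start_row;         /* first row of this buffer in the master buffer */
    uint32_t rows_before_wrap;  /* rows that fit before the bottom edge */
    bool     split;             /* true if the buffer wraps over to the top */
} Slot;

/* Where does buffer `index` (0..count-1) sit for a given y scroll value? */
static Slot slot_position(uint32_t index, uint32_t count, uint32_t height,
                          uint32_t scroll_y)
{
    assert(count > 0 && height > 0 && index < count);

    uint32_t master_rows = count * height;
    Slot s;
    s.start_row = (index * height + scroll_y) % master_rows;
    s.split = s.start_row + height > master_rows;
    s.rows_before_wrap = s.split ? master_rows - s.start_row : height;
    return s;
}
```

Since the buffers are exactly `height` rows apart, at most one of them can straddle the bottom edge at any time. With 3 buffers that leaves two unsplit ones, with 4 buffers three.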
Old 08 February 2021, 18:58   #14
Jobbo
Registered User
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
Quote:
Originally Posted by aros-sg View Post
No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

I presume the method you are thinking of would not require those double height buffers and would not need to duplicate tile drawing into the lower half, which is what Pink seems to be describing.


I presume of the three buffers that are present the one that is split is the one that gets kept as the restore buffer each frame, so the other two can be back/front buffers without the difficulty of a copper split.


So that would suggest the restore process needs to know about the split, making it slightly tricky, but better than dealing with a copper split.


I haven't tried to do anything like that. But I did try investigating the chip ram contents for Turrican2, and guessed it must be doing something like you describe.


I would be interested to know how other technically excellent 8-way scrolling games handle the challenge.
Old 08 February 2021, 19:33   #15
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by aros-sg View Post
No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.
But isn't it very similar?
Pink does duplicate blits because the buffers are separated and not 'enveloped' (I don't know how best to indicate it),
but 'jump and re-start' every 256 py (and this requires a double sized buffer).
In your case it's like a roller that runs on y.

Is there a working implementation? (I don't think I've ever seen a similar engine)
Old 08 February 2021, 19:41   #16
Jobbo
Registered User
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
Quote:
Originally Posted by ross View Post
But isn't it very similar?
Pink do duplicate blit because the buffers are separated and not 'enveloped' (I don't know how best to indicate it),
but 'jump and re-start' every 256 py (and this requires a double sized buffer).
In you case it's like a roller that runs on y.

Is there a working implementation? (I don't think I've ever seen a similar engine)

I may be wrong but I think Turrican 2 does it.
Old 08 February 2021, 19:47   #17
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by Jobbo View Post
I may be wrong but I think Turrican 2 does it.
No, just checked, it uses a y copper split (and the usual x corkscrew scroll).
Old 09 February 2021, 12:30   #18
zero
Registered User
 
Join Date: Jun 2016
Location: UK
Posts: 428
Interesting you used C. Is it just the case that modern C compilers for 68k are good at producing efficient code now? Back in the day there would have been big gains to be had from using assembler for some routines I think.

My day job is writing C for embedded systems so I've become quite familiar with how compilers produce inefficient code! Especially on less well supported platforms.
Old 09 February 2021, 13:14   #19
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Quote:
Originally Posted by aros-sg View Post
No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

If I understand right, this means:

- for updating the background tiles you would need 2 blits, instead of 4
- for restoring you would always need split blits
- for everything else you can do normal blits
- the restore buffer is always a 'split' buffer

Is this what you describe?

If yes, the pros and cons are
+The scrolling needs 50% less blitting (though it does not take much time anyway)
+You save 50% chipmem (that's good!)
-The restoring gets more complicated and may need more blits
Old 09 February 2021, 13:18   #20
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 408
Quote:
Originally Posted by zero View Post
Interesting you used C. Is it just the case that modern C compilers for 68k are good at producing efficient code now? Back in the day there would have been big gains to be had from using assembler for some routines I think.

My day job is writing C for embedded systems so I've become quite familiar with how compilers produce inefficient code! Especially on less well supported platforms.

Yeah, especially Bartman's GCC 11 compiler is efficient enough. However, often it does not matter so much whether C or ASM is used, but what your algorithms are. In asm projects you often tend to micro-optimize, while in C projects you simply try another algorithm.