Tinyus Tech

pink^abyss · 08 February 2021, 10:04

For tech-savvy readers, here is some information about the inner workings of Tinyus:

The game (including the replayer) was coded in C99, except for two short asm routines that copy images from slowram to chipmem on demand.

Gfx
-The game runs at 256x224 (plus the top and bottom hud area).
-The game runs at 32 colors (5 planes).
-24 colors are shared among all levels. 8 colors are unique for each level.
-A single sprite is used for the ~70 stars in background. Updating them takes around 8 rasterlines each frame.
-In the last level sprites are used to create the large 'cage' enemy.

Audio
-Music was done with Pretracker. It contains 14 songs and 15 sfx.
-Music takes 24,542 bytes of ram and 4,690 bytes of chipram.
-The music player is called by a copper irq on the line where the bg graphics start.

Blitting
-Blitting (with priority on) starts after the last displayed line, so the blitter starts running when no gfx DMA is happening, for maximum throughput.
-Blits are orchestrated by the CPU. In Tiny Bobble I used copper blits, which left more CPU time, but Tinyus has much more diverse blitter setups, so the overhead of generating a copperlist for them was not viable.
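
A minimal sketch of the raster check behind "start after the last displayed line", assuming the standard Amiga custom-chip layout in which bit 0 of VPOSR holds bit 8 of the beam line and the high byte of VHPOSR holds bits 7-0. The helper names are hypothetical and not taken from Tinyus. On real hardware the two words would be read from $DFF004 and $DFF006; here they are passed in as plain values so the logic can be checked on any host.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers, not Tinyus code. */

/* Combine VPOSR and VHPOSR register values into a beam line number (0..511). */
static uint16_t raster_line(uint16_t vposr, uint16_t vhposr)
{
    return (uint16_t)(((vposr & 1u) << 8) | (vhposr >> 8));
}

/* True once the beam has reached the line where blitting may start. */
static bool may_start_blitting(uint16_t vposr, uint16_t vhposr, uint16_t start_line)
{
    return raster_line(vposr, vhposr) >= start_line;
}

/* DMACON value that switches blitter priority on: SET (bit 15) | BLTPRI (bit 10). */
enum { DMACON_SET_BLTPRI = 0x8000 | 0x0400 };
```

A CPU-driven blit queue would poll `may_start_blitting()` (or use a copper interrupt on that line) and then write `DMACON_SET_BLTPRI` to DMACON before kicking off the first blit.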

Scrolling
-Scrolling is achieved without copper splits to have less CPU overhead (at the cost of much higher chipmem usage).
-The game uses 3 buffers: one for restore, two for double buffering.
-Each buffer is sized 288x448.
-Horizontal scrolling is achieved by adding another scratch buffer of 4096 pixels to each of the buffers and using hardware scroll plus plane offset.
-Vertical scrolling is achieved by plane offset plus duplicating all bg elements 256 pixels apart on y. If the y scrollPos goes over 256, it wraps.
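
A rough sketch of how "hardware scroll plus plane offset" can be split into register values. It is not the Tinyus code. It assumes non-interleaved planes (36 bytes per row for a 288-pixel-wide buffer), a display fetch that starts one word early, and a base bitplane pointer biased by one word to match. All names are illustrative. BPLCON1 delays the picture, so the fine scroll is inverted, and the same delay is written to both playfield nibbles.

```c
#include <stdint.h>

enum {
    BUF_WIDTH_PX  = 288,               /* buffer width from the post */
    BYTES_PER_ROW = BUF_WIDTH_PX / 8,  /* assumes non-interleaved planes */
    Y_WRAP        = 256                /* vertical wrap distance from the post */
};

typedef struct {
    uint32_t plane_offset;  /* byte offset added to each (biased) bitplane pointer */
    uint16_t bplcon1;       /* fine scroll delay, same value for both playfields */
} ScrollRegs;

/* Split a scroll position into a coarse plane offset and a fine BPLCON1 delay. */
static ScrollRegs scroll_regs(uint32_t x, uint32_t y)
{
    ScrollRegs r;
    uint32_t fine  = x & 15u;             /* pixel position inside the current word */
    uint32_t delay = (16u - fine) & 15u;  /* hardware delay that shows pixel x at the left edge */
    uint32_t word  = (x + 15u) >> 4;      /* coarse step, rounded up to the next word */
    uint32_t row   = y % Y_WRAP;          /* y wraps every 256 lines */

    r.plane_offset = row * BYTES_PER_ROW + word * 2u;
    r.bplcon1      = (uint16_t)((delay << 4) | delay);
    return r;
}
```

With this split the pointer advances by two bytes for every 16 pixels of horizontal travel, so 4096 pixels of travel moves it by 512 bytes per plane.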

alpine9000 · 08 February 2021, 10:48

I bet you were happy when you worked out you had enough free ram to avoid copper split blits!

Thanks for sharing the info!

pink^abyss · 08 February 2021, 12:46

Quote:

Originally Posted by alpine9000

I bet you were happy when you worked out you had enough free ram to avoid copper split blits!

Thanks for sharing the info!

Hehe.. you got me.. I really tried to avoid them and simply decided at the start of the project to do so, whatever the cost may be...
Though such splits can be a big time-saver when you have a dynamic background. When an area changed in the backgrounds of Tinyus, I needed to blit this area 5 times, because everything was duplicated on the Y axis. That would be cheaper with splits.

ross · 08 February 2021, 12:50

Quote:

Originally Posted by pink^abyss

Scrolling
-Scrolling is achieved without copper splits to have less CPU overhead (at the cost of much higher chipmem usage).
-The game uses 3 buffers: one for restore, two for double buffering.
-Each buffer is sized 288x448.
-Horizontal scrolling is achieved by adding another scratch buffer of 4096 pixels to each of the buffers and using hardware scroll plus plane offset.
-Vertical scrolling is achieved by plane offset plus duplicating all bg elements 256 pixels apart on y. If the y scrollPos goes over 256, it wraps.

Hi pink, so the engine uses a similar idea:
http://eab.abime.net/showpost.php?p=...1&postcount=41
The difference is that yours is bigger on y for the 256 wrap.

Well, if I'm not wrong on the numbers, you use ~200KB. Definitely worth it, given the result.

pink^abyss · 08 February 2021, 13:31

Quote:

Originally Posted by ross

Hi pink, so the engine uses a similar idea:
http://eab.abime.net/showpost.php?p=...1&postcount=41
The difference is that yours is bigger on y for the 256 wrap.

Well, if I'm not wrong on the numbers, you use ~200KB. Definitely worth it, given the result.

Yeah, that's the approach. It took around 240 KB of chipmem, as I use 5 planes.
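
The figure can be checked from the buffer dimensions given in the first post. A small sketch in C (names are illustrative, and the scratch areas used for horizontal scrolling are left out): 288x448 pixels at one bit per pixel is 16,128 bytes per plane, so three 5-plane buffers come to 241,920 bytes. The same layout with 4 planes would be 193,536 bytes, which happens to be close to the ~200KB estimate.

```c
#include <stdint.h>

enum { BUF_W = 288, BUF_H = 448, NUM_BUFFERS = 3 };

/* Bytes needed for all screen buffers at the given bitplane depth
   (scratch areas for horizontal scrolling are not included). */
static uint32_t chipmem_bytes(uint32_t planes)
{
    uint32_t bytes_per_plane = (BUF_W / 8u) * BUF_H;  /* 36 * 448 = 16128 */
    return bytes_per_plane * planes * NUM_BUFFERS;
}
```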

ross · 08 February 2021, 15:43

Quote:

Originally Posted by pink^abyss

Blitting
-Blitting (with priority on) starts after the last displayed line, so the blitter starts running when no gfx DMA is happening, for maximum throughput.
-Blits are orchestrated by the CPU. In Tiny Bobble I used copper blits, which left more CPU time, but Tinyus has much more diverse blitter setups, so the overhead of generating a copperlist for them was not viable.

As this is the Tinyus technical thread, here is a technical question.

I have also seen other engines start the blitter at the end of the bitplane DMA, probably for the same reason as you.
But I've never been convinced that it's the best way or that it actually brings benefits (I usually do it another way).

Let me explain, maybe something obvious escapes me.
Since we are in a double-buffered environment, it doesn't matter when I start filling the second buffer, because the previous one is always on display. It is sufficient that, by the time the copper sets the new pointers, the new buffer is ready and the CPU has set it (doing this in one of the first video lines could also be enough, if I do it immediately on arrival of IRQ3 and I am sure that the higher-priority IRQs execute quickly).

So why not skip triggering the blitting routines (by IRQ or polling) at the bottom of the screen and start them directly from the VBI instead?
In any case the time for the frame is the same, and if I have to skip a frame, I skip it either way, whether I start after the bitplane DMA or at the beginning of the frame.

Also, I don't understand why it would be better to execute the blitter queue when the bitplane DMA is not active; if I use BLTPRI, the cycles are certainly all used (excluding the usual idle cycles). Moreover, if some memory access slots are 'lost' because the 68k is working internally, they may be put to better use by other DMA channels during the active area of the video.
Obviously we are talking about a very small difference between the two approaches; this is only an exchange of views, and I may be wrong about it.

roondar · 08 February 2021, 16:10

Well, I've done a ton of experiments with the CPU/Blitter and throughput over the past few years, and it has been my observation that in general you want the CPU to do its thing during the scanlines where the bitplane/sprite/etc. DMA is active. This is more efficient than starting at VBL because the CPU can interleave with fewer losses while the bitplanes are fetching than the Blitter can. This is due to the CPU always having idle cycles, vs the Blitter usually not having any.

Now, the difference here is indeed small. But it's still a couple of % better than having the CPU start at the start of VBL, so it may be worthwhile.

However, I always try to start blitting as soon as the CPU logic part is done; I never wait for a particular rasterline to do so (unless single buffering).

ross · 08 February 2021, 16:41

Yes, starting immediately after the VBI is not the best, but for this very reason it is also a bit worse to start the blitter queue after the video DMA, where the bus is almost totally free.
What I usually do first is some clear operation where I overlap blitter and CPU. So when I start the blitter queue, I am very close to, or within, the zone with the most DMA traffic.

And yes, the 'after bitplanes DMA' approach is better suited (in a very limited manner) for single buffer mode.

In any case, given the quality of Tinyus, I would say that this way works great too!

pink^abyss · 08 February 2021, 16:46

Quote:

Originally Posted by ross

As this is the Tinyus technical thread, here is a technical question.

I have also seen other engines start the blitter at the end of the bitplane DMA, probably for the same reason as you.
But I've never been convinced that it's the best way or that it actually brings benefits (I usually do it another way).

Let me explain, maybe something obvious escapes me.
Since we are in a double-buffered environment, it doesn't matter when I start filling the second buffer, because the previous one is always on display. It is sufficient that, by the time the copper sets the new pointers, the new buffer is ready and the CPU has set it (doing this in one of the first video lines could also be enough, if I do it immediately on arrival of IRQ3 and I am sure that the higher-priority IRQs execute quickly).

So why not skip triggering the blitting routines (by IRQ or polling) at the bottom of the screen and start them directly from the VBI instead?
In any case the time for the frame is the same, and if I have to skip a frame, I skip it either way, whether I start after the bitplane DMA or at the beginning of the frame.

Also, I don't understand why it would be better to execute the blitter queue when the bitplane DMA is not active; if I use BLTPRI, the cycles are certainly all used (excluding the usual idle cycles). Moreover, if some memory access slots are 'lost' because the 68k is working internally, they may be put to better use by other DMA channels during the active area of the video.
Obviously we are talking about a very small difference between the two approaches; this is only an exchange of views, and I may be wrong about it.

I have seen a couple of other games wait for the DMA-free area before blitting too. I tested it for my game and it was also faster.

As a reference, I made a real-world test in Tinyus to measure how many cycles the game uses depending on the raster wait position before blitting starts.
I accumulate CIA cycles for 64 frames (while the game is running, always from the same start point). I disabled the music interrupt to avoid any noise in the measurement.

Cycles
336236 - Start blitting at Rasterline 274 (when DMA is off)
387610 - Start blitting at Rasterline 174
400532 - Start blitting at Rasterline 74

It seems my game uses up to about 20% more cycles if I blit while DMA is active.
I'm not sure if my test is correct, because these values look too good to me, but I'm not aware of anything wrong with my measurements.
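
For reference, the relative overhead can be worked out from the three totals above: starting at line 174 costs roughly 15% more than starting at line 274, and starting at line 74 roughly 19% more. A small sketch of that arithmetic (the function name is just for illustration):

```c
#include <stdint.h>

/* Extra cost of `cycles` relative to `baseline`, in tenths of a percent (rounded down). */
static uint32_t extra_permille(uint32_t cycles, uint32_t baseline)
{
    return (uint32_t)(((uint64_t)(cycles - baseline) * 1000u) / baseline);
}
```

Here `extra_permille(387610, 336236)` gives 152 (15.2%) and `extra_permille(400532, 336236)` gives 191 (19.1%).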

roondar · 08 February 2021, 16:50

I guess my first question would be: do you do anything else after blitting or is blitting the last step in a frame? And my second question would be: when does game logic start running? Start of VBL or some other time?

By the way, 20% difference does sound reasonable for blitting costs while DMA is running. The point here is to try and optimise when the CPU runs rather than when the Blitter runs.

Edit: the above was phrased a bit weirdly by me. What I mean is that a blit costing 20% if it starts and finishes during display DMA is reasonable, not that an overall 20% difference for blitting+logic is reasonable.

ross · 08 February 2021, 17:01

Quote:

Originally Posted by pink^abyss

I have seen a couple of other games wait for the DMA-free area before blitting too. I tested it for my game and it was also faster.

As a reference, I made a real-world test in Tinyus to measure how many cycles the game uses depending on the raster wait position before blitting starts.
I accumulate CIA cycles for 64 frames (while the game is running, always from the same start point). I disabled the music interrupt to avoid any noise in the measurement.

Cycles
336236 - Start blitting at Rasterline 274 (when DMA is off)
387610 - Start blitting at Rasterline 174
400532 - Start blitting at Rasterline 74

It seems my game uses up to about 20% more cycles if I blit while DMA is active.
I'm not sure if my test is correct, because these values look too good to me, but I'm not aware of anything wrong with my measurements.

Well, if these are absolute cycles, then the one from raster line 74 is an excellent result.

(Only a 20% loss when the bus is overloaded is great!)

Obviously you would have to rework the order of events a bit, but consider how much you could then do while the bus is completely free, potentially overlapping the blitter and the CPU (with the blitter drawing large objects and the CPU doing math/engine operations).

aros-sg · 08 February 2021, 18:28

Quote:

Originally Posted by pink^abyss

Yeah, that's the approach. It took around 240 KB of chipmem, as I use 5 planes.

No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

aros-sg · 08 February 2021, 18:45

If you also want to avoid a split restore buffer (and split blits), you would need one additional normal-sized buffer -> 4 buffers inside the master buffer -> 3 will always be non-splitting/wrapping.

Think about the whole thing like a ROL or ROR of a 32 bit value (0xAABBCCDD).
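
The rotate analogy can be sketched in C. This is only an illustration of the idea under simple assumptions (buffers stored back to back in one master buffer, all sliding down by the same number of lines); the function names and the way the slide amount is derived from the scroll position are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Rotate a 32-bit value left by n bits (n in 1..31), like the 68k ROL instruction. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32u - n));
}

/* First line of buffer `index` inside a master buffer holding `count` buffers
   of `height` lines each, after the buffers have slid down by `slide` lines. */
static uint32_t buffer_start(uint32_t index, uint32_t count, uint32_t height, uint32_t slide)
{
    return (index * height + slide) % (count * height);
}

/* True if that buffer crosses the bottom edge of the master buffer and wraps to the top. */
static bool buffer_is_split(uint32_t index, uint32_t count, uint32_t height, uint32_t slide)
{
    return buffer_start(index, count, height, slide) + height > count * height;
}
```

Just as `rol32(0xAABBCCDD, 8)` yields `0xBBCCDDAA` with only one byte wrapping around, at most one buffer is split for any slide amount. With `count` set to 4, that leaves three unsplit buffers, as described above.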

Jobbo · 08 February 2021, 18:58

Quote:

Originally Posted by aros-sg

No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

I presume the method you are thinking of would not require those double height buffers and would not need to duplicate tile drawing into the lower half, which is what Pink seems to be describing.

I presume of the three buffers that are present the one that is split is the one that gets kept as the restore buffer each frame, so the other two can be back/front buffers without the difficulty of a copper split.

So that would suggest the restore process needs to know about the split, making it slightly tricky, but better than dealing with a copper split.

I haven't tried to do anything like that. But I did try investigating the chip ram contents of Turrican 2, and guessed it must be doing something like you describe.

I would be interested to know how other technically excellent 8-way scrolling games handle the challenge.

ross · 08 February 2021, 19:33

Quote:

Originally Posted by aros-sg

No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

But isn't it very similar?
Pink does duplicate blits because the buffers are separate and not 'enveloped' (I don't know how best to describe it),
but 'jump and restart' every 256 pixels on y (and this requires a double-sized buffer).
In your case it's like a roller that runs on y.

Is there a working implementation? (I don't think I've ever seen a similar engine)

Jobbo · 08 February 2021, 19:41

Quote:

Originally Posted by ross

But isn't it very similar?
Pink does duplicate blits because the buffers are separate and not 'enveloped' (I don't know how best to describe it),
but 'jump and restart' every 256 pixels on y (and this requires a double-sized buffer).
In your case it's like a roller that runs on y.

Is there a working implementation? (I don't think I've ever seen a similar engine)

I may be wrong but I think Turrican 2 does it.

ross · 08 February 2021, 19:47

Quote:

Originally Posted by Jobbo

I may be wrong but I think Turrican 2 does it.

No, I just checked: it uses a y copper split (and the usual x corkscrew scroll).

zero · 09 February 2021, 12:30

Interesting you used C. Is it just the case that modern C compilers for 68k are good at producing efficient code now? Back in the day there would have been big gains to be had from using assembler for some routines I think.

My day job is writing C for embedded systems so I've become quite familiar with how compilers produce inefficient code! Especially on less well supported platforms.

pink^abyss · 09 February 2021, 13:14

Quote:

Originally Posted by aros-sg

No, with the approach in the link it would only take half of that. And no duplicate blits. I probably did not explain it well enough.

Think of 3 normal sized buffers inside a 3xheight master buffer where during y-scrolling the buffers move/travel down in memory (a bit similar to horizontal scrolling) and only ever one of them will cross the bottom edge of the master buffer and split/wrap over to the top of the master buffer.

If I understand correctly, this means:

- for updating the background tiles you would need 2 blits instead of 4
- for restoring you would always need split blits
- for everything else you can do normal blits
- the restore buffer is always a 'split' buffer

Is this what you describe?

If yes, the pros and cons are:
+The scrolling needs 50% less blitting (though it does not take much time anyway)
+You save 50% chipmem (that's good!)
-The restoring gets more complicated and may need more blits

pink^abyss · 09 February 2021, 13:18

Quote:

Originally Posted by zero

Interesting you used C. Is it just the case that modern C compilers for 68k are good at producing efficient code now? Back in the day there would have been big gains to be had from using assembler for some routines I think.

My day job is writing C for embedded systems so I've become quite familiar with how compilers produce inefficient code! Especially on less well supported platforms.

Yeah, Bartman's GCC 11 compiler especially is efficient enough. However, often it matters less whether C or ASM is used than what your algorithms are. In asm projects you often tend to micro-optimize, while in C projects you simply try another algorithm.
