English Amiga Board


Old 18 February 2020, 23:14   #1
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
CPU Assisted Blitting example + source

A while ago I had the idea to make the CPU and Blitter work together for blitting bobs on an unexpanded Amiga 1200. The idea was that the 68020's 32-bit access to chip memory, coupled with its instruction cache, might be capable of offering a boost in performance. Well, I've finally found the time to complete my example program.

And it works

In short, using the Blitter and CPU together can give you a modest, but definitely noticeable, performance increase for blitting bobs. I've created an example program to show this, as well as a YouTube video and an article for my website.

Naturally, I've included the full source code and a download of the program itself on my website. I'm especially happy with how easy it is to use this (now that it all works) and that the effect doesn't usually require additional memory*. Note that my example uses the startup code by Photon of Scoopex and a random number generator I found on EAB, written by Meynaf.

I hope this might be useful for Amiga coders aiming their code at the unexpanded A1200.

The article/source can be found here:
http://powerprograms.nl/amiga/cpu-blit-assist.html

Here's the YouTube video:
[embedded YouTube video]

*) Extra memory is only needed for bobs that are not multiples of 32 pixels wide. Such bobs require one additional word of data stored per bitplane per line.
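
The arithmetic behind this footnote can be sketched in a few lines of C. This is only an illustration, assuming bob widths are multiples of 16 pixels (whole words); the function name `bob_line_bytes` is made up for this example and is not taken from the article's source.

```c
#include <stdint.h>

/* Hypothetical helper: bytes stored per bitplane line when every line
   is padded up to a whole number of 32-bit longwords.
   Assumes width_px is a multiple of 16 (whole 16-bit words). */
static uint32_t bob_line_bytes(uint32_t width_px)
{
    uint32_t words  = width_px / 16u;       /* 16-bit words actually needed */
    uint32_t padded = (words + 1u) & ~1u;   /* round up to an even word count */
    return padded * 2u;                     /* 2 bytes per word */
}
```

Under these assumptions a 32 pixel wide bob needs no padding (4 bytes per line per bitplane), while a 48 pixel wide bob grows from 6 to 8 bytes per line per bitplane.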

Edit: in my haste to get this done yesterday, I forgot to mention in my post what the performance gain actually ended up being. While this can also be found in the article/video, I do think it's nice to add to my post. I ended up getting around 113% of the speed of using the Blitter by itself.

Last edited by roondar; 19 February 2020 at 09:45.
Old 19 February 2020, 00:35   #2
Antiriad_UK
OCS forever!
 
Join Date: Mar 2019
Location: Birmingham, UK
Posts: 418
Neat. Thanks for the detailed write up!
Old 19 February 2020, 01:56   #3
Hewitson
Registered User
 
Join Date: Feb 2007
Location: Melbourne, Australia
Age: 41
Posts: 3,772
Very interesting. The only thing that I don't understand is why no one else thought of this. Maybe they did, and decided the added code complexity wasn't worth the increase in performance?

Good to finally put a face to your name too. Maybe I'll post a pic or video of me one day, you'll all get a big surprise if I do.
Old 19 February 2020, 09:43   #4
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by Antiriad_UK View Post
Neat. Thanks for the detailed write up!
You're welcome!
I've got more ideas, so there will be more Amiga Tech episodes in the future.
Quote:
Originally Posted by Hewitson View Post
Very interesting. The only thing that I don't understand is why no one else thought of this. Maybe they did, and decided the added code complexity wasn't worth the increase in performance?
My personal hypothesis is that the A1200 was seen very differently from the A500. The A500 was a machine that most owners seemingly never expanded (not beyond the 512K internal expansion anyway). It was also on the market for a long time, so programmers had a lot of time to experiment. The A1200, meanwhile, was initially only on the market for 2 years, and many more people expanded their A1200, which is probably why there is so much more software for accelerated AGA Amigas out there.

And if you do target accelerated AGA Amigas, you're usually better off doing everything in fast memory and only copying the results to chip memory.
Quote:
Good to finally put a face to your name too. Maybe I'll post a pic or video of me one day, you'll all get a big surprise if I do.
To be honest, I've never ever seen anyone do a "face reveal" where I ended up thinking "that's exactly what I expected".
Old 19 February 2020, 10:07   #5
mcgeezer
Registered User
 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
As if you have read my mind roondar. I’m currently working on a bob routine for blitting objects into 6 bitplanes, the sizes vary.

What I would say is that a more practical application for me would be to have the CPU blit one or two bitplanes where only a NAND/OR function is required, as opposed to a cookie cut.

The example I have is a 16 colour 32x64 bob blitted into a 6 bitplane screen: I only cookie cut the first 4 planes, but then select 16 colour banks by NANDing or ORing the last two planes.
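
To make the difference between the two kinds of operation concrete, here is a minimal C sketch working on one 32-bit longword of a bitplane. The function names are invented for this illustration, and reading the "NAND" above as "AND with the inverted mask" is an assumption, not something stated in the thread.

```c
#include <stdint.h>

/* Cookie-cut: keep the background where the mask is 0,
   take the bob's pixels where the mask is 1. Needs bob data AND mask. */
static uint32_t cookie_cut(uint32_t background, uint32_t bob, uint32_t mask)
{
    return (background & ~mask) | (bob & mask);
}

/* Bank-select plane, bit set under the bob: a plain OR with the mask. */
static uint32_t plane_set(uint32_t background, uint32_t mask)
{
    return background | mask;
}

/* Bank-select plane, bit cleared under the bob: AND with the inverted
   mask (assumed here to be what "NAND" refers to). */
static uint32_t plane_clear(uint32_t background, uint32_t mask)
{
    return background & ~mask;
}
```

The two plane operations only need the mask and the background, so they touch less data per longword than the full cookie cut.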

I’ll be sure to try your theory out as it was on my ideas list for optimisation.

Real nice work!

Geezer
Old 19 February 2020, 10:15   #6
DanScott
Lemon. / Core Design
 
Join Date: Mar 2016
Location: Tier 5
Posts: 1,211
Quote:
Originally Posted by Hewitson View Post
The only thing that I don't understand is why no one else thought of this.
I am 100% sure that quite a few Amiga programmers did.

Even on the A500, you were / are always trying to maximise blitter and CPU concurrency... the classic example being the Blitter/CPU clear combo.
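
As a rough illustration of the clear combo idea, the sketch below splits a buffer so that both sides would finish at about the same time. It is only a back-of-the-envelope model: the function name and the two rate parameters are hypothetical, the rates would have to be measured, and in reality the CPU and Blitter compete for the same bus, so they are not independent.

```c
#include <stdint.h>

/* Hypothetical split: how many of total_longs the CPU should clear so
   that CPU and Blitter finish together, given measured throughput
   figures (longwords per unit of time) for each. */
static uint32_t cpu_share(uint32_t total_longs,
                          uint32_t cpu_rate, uint32_t blitter_rate)
{
    uint64_t sum = (uint64_t)cpu_rate + blitter_rate;
    if (sum == 0)
        return 0;   /* no rates known: leave everything to the Blitter */
    return (uint32_t)(((uint64_t)total_longs * cpu_rate) / sum);
}
```

With a Blitter that is three times as fast as the CPU, the CPU would take a quarter of the buffer and the Blitter the rest.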
Old 19 February 2020, 10:19   #7
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by mcgeezer View Post
As if you have read my mind roondar. I’m currently working on a bob routine for blitting objects into 6 bitplanes, the sizes vary.
Like they say, great (or crazy) minds think alike.
Quote:
What I would say is that a more practical application for me would be to have the CPU blit one or two bitplanes where only a NAND/OR function is required, as opposed to a cookie cut.
It's probably worth noting then that the limiting factor for the speed of this effect is purely the number of memory reads/writes per longword and the number of cycles the CPU gets between each memory access. Unless you go way overboard on them, the logical operations are essentially "invisible" due to the cache.

This can be best seen in the speed of the copy (restore) routine vs the cookie-cut routine. The latter is actually more of a speedup over the Blitter than the copy routine (~115% vs ~110%), even though the copy routine is much simpler and is run with extra idle cycles for the CPU to use.
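
The difference in memory traffic between the two routines can be sketched in C as follows. This is not the code from the article, just a minimal model with made-up function names that shows which accesses each loop performs per longword.

```c
#include <stddef.h>
#include <stdint.h>

/* Restore (copy): per longword, one read and one write. */
static void restore_copy(uint32_t *dst, const uint32_t *saved, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = saved[i];
}

/* Cookie-cut: per longword, three reads (mask, bob, background) and
   one write. The AND/OR operations work on registers and add no
   memory accesses of their own. */
static void cookie_cut_line(uint32_t *dst, const uint32_t *bob,
                            const uint32_t *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (dst[i] & ~mask[i]) | (bob[i] & mask[i]);
}
```

In this model the cookie-cut does twice the memory accesses of the copy per longword, while the extra logic stays inside the CPU.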
Quote:
The example I have is a 16 colour 32x64 bob blitted into a 6 bitplane screen: I only cookie cut the first 4 planes, but then select 16 colour banks by NANDing or ORing the last two planes.
That is quite clever. I'm looking forward to seeing the results.
Quote:
I’ll be sure to try your theory out as it was on my ideas list for optimisation.

Real nice work!

Geezer
Thanks, hope it ends up being useful
Old 19 February 2020, 11:31   #8
CFou!
Moderator
 
Join Date: Sep 2004
Location: France
Age: 50
Posts: 4,277
Quote:
Originally Posted by Hewitson View Post
Very interesting. The only thing that I don't understand is why no one else thought of this. Maybe they did, and decided the added code complexity wasn't worth the increase in performance?

Good to finally put a face to your name too. Maybe I'll post a pic or video of me one day, you'll all get a big surprise if I do.
While the Blitter is working, the CPU can be used to work on other necessary operations.

A Blitter interrupt can be used to optimize the use of the CPU and Blitter.
Old 19 February 2020, 13:38   #10
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by DanScott View Post
I am 100% sure that quite a few Amiga programmers did.

Even on the A500, you were / are always trying to maximise blitter and CPU concurrency... the classic example being the Blitter/CPU clear combo.
I'm certain programmers do try to optimize concurrency. I'm even a bit hopeful that this effect might help programmers achieve more of that.

That said, I am completely unaware of coders using Blitter+CPU to draw bobs on the A1200. I did look for something like this on EAB and other places, but couldn't find it. I did find several threads/posts on Blitter queues, Blitter interrupts, etc. But those tended to be quite negative on the prospects of such methods actually ending up with a better performing game (other than in the case of Blitter clearing+CPU clearing, as you rightly point out).

Do you know of any examples? I would love to see stuff like this in a "real" program rather than my tech demos/examples
Quote:
Originally Posted by CFOU! View Post
While the Blitter is working, the CPU can be used to work on other necessary operations.

A Blitter interrupt can be used to optimize the use of the CPU and Blitter.
Setting aside the overhead of handling many interrupts in a frame, you are obviously quite correct: Blitter interrupts can be used to optimize concurrent use of the CPU/Blitter. However, this is only true as long as the CPU has something useful to do other than blitting, which isn't always the case.

In fact, this was one of the reasons for me to do this. I've read a number of posts where coders point out it was quite difficult to keep the Blitter and CPU working concurrently in a useful fashion. At the same time, I've also read multiple times that using Blitter interrupts is not very efficient.

My hope is that this method helps make it easier to get that extra efficiency.
Old 19 February 2020, 14:39   #11
CFou!
Moderator
 
Join Date: Sep 2004
Location: France
Age: 50
Posts: 4,277
Quote:
Originally Posted by roondar View Post

In fact, this was one of the reasons for me to do this. I've read a number of posts where coders point out it was quite difficult to keep the Blitter and CPU working concurrently in a useful fashion. At the same time, I've also read multiple times that using Blitter interrupts is not very efficient.

My hope is that this method helps make it easier to get that extra efficiency
In any case, good work, and I agree it could prove useful in very specific cases. Whatever you want to do ...
Old 19 February 2020, 16:39   #12
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by CFOU! View Post
In any case, good work, and I agree it could prove useful in very specific cases. Whatever you want to do ...
It probably won't surprise you, but I'm a lot more optimistic. I'm actually rather convinced this method can be useful for far more than very specific cases
Old 19 February 2020, 16:48   #13
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Kudos for article and video
Old 19 February 2020, 17:01   #14
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by ross View Post
Kudos for article and video
That reminds me... We had a small discussion over PM about some of the weirdness I ran into when making this. Might be interesting to share some of it with a wider audience now. I'll add a bonus item so you'll have something to read as well

So there were some issues I ran into while coding this:
  • WinUAE was consistently somewhat faster than a real A1200. This is sadly unavoidable, considering the well-known difficulties in emulating the 68020. It's also something to keep in mind when programming this kind of effect: a real machine might not run at the same speed.
  • An odd thing was the difference in performance I got by moving some routines by two bytes in memory. Doing this could save up to 1% of the total cycle cost, which was strange, as the code should fit in the cache regardless of this change in alignment.
  • I experimented with bitfield instructions, but the performance of these instructions is extremely dependent on alignment. While it's predictable when they perform well and when they don't, in the end they were slower in more cases than they were faster.

Last edited by roondar; 19 February 2020 at 17:09.
Old 19 February 2020, 17:20   #15
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Item found
Quote:
Originally Posted by roondar View Post
  • I experimented with bitfield instructions, but the performance of these instructions is extremely dependent on alignment. While it's predictable when they perform well and when they don't, in the end they were slower in more cases than they were faster.
Yep, bitfield instructions are great, but unfortunately they are in many cases slower than direct access with shifts and masks. To be used only in specific cases.
If you look at my non-system FFS 020+ loader, you can see them used appropriately (maybe...).
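
For readers unfamiliar with the trade-off, this is roughly what "direct access with shifts and masks" looks like, written as a minimal C sketch. It mimics an unsigned bitfield extract (as BFEXTU does on the 68020) on a single 32-bit value, with the offset counted from the most significant bit. The function name is made up, and unlike the real instruction it does not handle fields that cross a longword boundary in memory.

```c
#include <stdint.h>

/* Extract `width` bits starting `offset` bits from the most
   significant bit of `value`.
   Requires width >= 1 and offset + width <= 32. */
static uint32_t extract_field(uint32_t value, unsigned offset, unsigned width)
{
    uint32_t shifted = value >> (32u - offset - width);  /* field to bit 0 */
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu
                                   : ((1u << width) - 1u);
    return shifted & mask;                               /* drop upper bits */
}
```

When the offset and width are known in advance, the shift count and mask become constants, which is part of why the hand-written version can beat the general-purpose instruction.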
Old 19 February 2020, 17:27   #16
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
Quote:
Originally Posted by ross View Post
Item found

Yep, bitfield instructions are great, but unfortunately they are in many cases slower than direct access with shifts and masks. To be used only in specific cases.
If you look at my non-system FFS 020+ loader, you can see them used appropriately (maybe...).
It's such a shame too, as they really are a great way to deal with graphics data like this. The bitfield version of the routine was much simpler. But all shall perish on the altar of performance

Anyway, I don't think all code must be hyper-optimized, so there are surely places in which they are useful
Old 19 February 2020, 17:32   #17
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by roondar View Post
  • An odd thing was the difference in performance I got by moving some routines by two bytes in memory. Doing this could save up to 1% of the total cycle cost, which was strange, as the code should fit in the cache regardless of this change in alignment.
That might be due to prefetch being aligned to 32 bit addresses:
Quote:
Originally Posted by MC68020 user manual
The MC68020/EC020 always prefetches long words. When an instruction prefetch falls on an odd-word boundary (e.g., due to a branch to an odd-word location), the MC68020/EC020 will read the even word associated with the long-word base address at the same time as (32-bit memory) or before (8- or 16-bit memory) the odd word is read. When an instruction prefetch falls on an even-word boundary (as would be the normal case), the MC68020/EC020 reads both words at the long-word address, thus effectively prefetching the next two words.
Also, IIRC the cache entries are aligned to 32 bits, meaning AFAIU that if your code is 256 bytes long, it will not fit into the cache if it starts at an odd-word (not divisible by 4) address.
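
A small C sketch of that last point, assuming the 68020's 256 byte instruction cache is organised as 64 long-word entries aligned to 4 byte addresses. The function names are invented for this illustration.

```c
#include <stdint.h>

enum { CACHE_ENTRIES = 64 };  /* 64 long words = 256 bytes */

/* Number of aligned long words touched by code that starts at
   address `start` and is `length` bytes long. */
static uint32_t longwords_spanned(uint32_t start, uint32_t length)
{
    if (length == 0)
        return 0;
    uint32_t first = start / 4u;                 /* first long word index */
    uint32_t last  = (start + length - 1u) / 4u; /* last long word index */
    return last - first + 1u;
}

/* Non-zero if the code can be held in the cache all at once. */
static int fits_in_cache(uint32_t start, uint32_t length)
{
    return longwords_spanned(start, length) <= CACHE_ENTRIES;
}
```

With these assumptions, 256 bytes of code starting at $1000 touch exactly 64 long words, while the same code starting at $1002 touches 65 and no longer fits.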
Old 19 February 2020, 17:45   #18
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
That's quite interesting. Normally not all that important I guess, but the A1200 has quite slow memory and running the Blitter makes it even slower (from the perspective of the CPU), so I guess it all adds up.
Old 19 February 2020, 19:23   #19
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,281
You can even skip the screen restore for bobs that don't overlap, so you can gain a lot of speed.
Old 19 February 2020, 20:37   #20
Retro1234
Phone Homer
 
Join Date: Jun 2006
Location: 5150
Posts: 5,773
Interesting. There was something for AMOS called PowerBobs that I believe is along the same lines.