English Amiga Board


Old 19 February 2020, 21:52   #21
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
I'm not aware of PowerBobs. However, it's pretty easy to check whether it uses the same principle: if it doesn't require an AGA machine or an A3000, then it very likely doesn't use the same idea. If it does require a 32-bit chip memory bus, it certainly might.

Is it still available somewhere? I would love to check it out.
Old 20 February 2020, 04:14   #22
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
Since cookie-cut isn't so CPU-friendly due to the lack of a barrel shifter on low-end 68k CPUs, why not do other blit operations that fit the CPU well?
Old 20 February 2020, 07:36   #23
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 670
Quote:
Originally Posted by sandruzzo View Post
Since cookie-cut isn't so CPU-friendly due to the lack of a barrel shifter on low-end 68k CPUs, why not do other blit operations that fit the CPU well?
The 68020 (you know, the minimum used in AGA machines, which is the subject of this thread) has a barrel shifter.
Old 20 February 2020, 10:14   #24
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
Quote:
Originally Posted by britelite View Post
The 68020 (you know, the minimum used in AGA machines, which is the subject of this thread) has a barrel shifter.
Yes, but it isn't as fast as the Blitter. I think doing the screen restore might help gain more speed.
Old 20 February 2020, 10:16   #25
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Yes indeed, this effect relies on features of the 68020 (and 32-bit access to chip memory) to work: shifting is only fast enough because of the 68020's barrel shifter. It also relies on running the parts of the cookie-cut logic that don't need to access memory from the instruction cache while the Blitter is busy accessing the bus.

It also needs 32-bit access to chip memory. If the CPU can only do 16 bits at a time, it can never be faster than the Blitter, which kind of defeats the point.
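For readers who want the cookie-cut logic spelled out, here is a minimal Python model of what happens to one 32-bit longword of bob data. It is only a sketch of the general technique (shift the bob and its mask, then combine them with the background), not the routine discussed in this thread; the function name `cookie_cut_longword` and its parameters are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def cookie_cut_longword(bob, mask, background, shift):
    """Blend one 32-bit bob longword into two adjacent background longwords.

    bob, mask  : 32-bit values (mask bit = 1 where the bob is opaque)
    background : tuple of two 32-bit values (left and right longword)
    shift      : horizontal pixel offset inside the longword, 0..31
    """
    # Place bob and mask in the upper half of a 64-bit window and shift
    # them right; this is the step the barrel shifter makes cheap.
    bob64 = (bob << 32) >> shift
    mask64 = (mask << 32) >> shift
    bg64 = (background[0] << 32) | background[1]

    # Cookie-cut: bob where the mask is set, background elsewhere.
    out = (bob64 & mask64) | (bg64 & ~mask64 & MASK64)
    return (out >> 32) & MASK32, out & MASK32


if __name__ == "__main__":
    left, right = cookie_cut_longword(0xA5A50000, 0xFFFF0000,
                                      (0x12345678, 0x9ABCDEF0), 8)
    print(f"{left:08X} {right:08X}")
```

Per output longword this needs the bob, the mask and the background as inputs plus one write of the result, which is where the "three reads and one write" figure used later in the thread comes from.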
Old 20 February 2020, 10:26   #26
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
@roondar

In fact, when you have to shift things with the CPU, you get extra work to do. I found it useful to do an optimized cookie-cut with the Blitter and the screen restoring with the CPU.
Old 20 February 2020, 10:26   #27
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 670
Quote:
Originally Posted by sandruzzo View Post
I think doing the screen restore might help gain more speed.
That's not really the topic of this thread, now is it? So why not start your own thread with concrete examples and benchmarks then?
Old 20 February 2020, 10:30   #28
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
@britelite

I used it this way in Proxima 3, with per-pixel collision detection done by the CPU too.
Old 20 February 2020, 10:33   #29
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 670
Quote:
Originally Posted by sandruzzo View Post
I used it this way in Proxima 3, with per-pixel collision detection done by the CPU too.
I'm sure your collision routine is the best ever.

EDIT: Enough off-topic on my part.

@sandruzzo: so how about you start that thread so we can discuss your ideas there?
Old 20 February 2020, 10:40   #30
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Quote:
Originally Posted by sandruzzo View Post
@roondar

In fact, when you have to shift things with the CPU, you get extra work to do. I found it useful to do an optimized cookie-cut with the Blitter and the screen restoring with the CPU.
My code also does screen restoring via Blitter/CPU (assuming you mean the restoring of bobs). That part is actually slower than the cookie-cut part. In fact, I checked all options: CPU-only copy, CPU-only cookie-cut, CPU+Blitter only for copy, CPU+Blitter only for cookie-cut, CPU+Blitter for both.

The CPU-only options were slower than the combined ones in all cases. That is, using my CPU+Blitter copy/restore routine gave me more of a performance boost than using just my CPU restore routine.

BTW, something I already pointed out: removing the shifts from the cookie-cut routine did not make it faster. I tried this to see what the maximum speed you could theoretically manage was. I saw no speed change, which means the shifts are fully absorbed by the Blitter cycles on the bus.

Last edited by roondar; 20 February 2020 at 10:43. Reason: Grammar...
Old 20 February 2020, 10:43   #31
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 670
Quote:
Originally Posted by roondar View Post
In fact, I checked all options: CPU only copy, CPU only cookie-cut, CPU+Blitter only for copy, CPU+Blitter only for cookie-cut.
You wouldn't happen to have any rough estimates on how the different methods compare against each other?
Old 20 February 2020, 10:48   #32
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Quote:
Originally Posted by britelite View Post
You wouldn't happen to have any rough estimates on how the different methods compare against each other?
I do happen to have those, yes. From memory:

CPU only copy: about 100% of Blitter speed
CPU+Blitter copy (unrolled loop once): about 110% of Blitter speed
CPU only cookie-cut: very slow, somewhere between 25-40% of Blitter speed
CPU+Blitter cookie-cut: about 115% of Blitter speed

Edit: just to be clear on this, it's certainly possible that some of my code is not fully optimized. As a coder, I'm always willing to consider better options. So if someone looks at my code and sees an obvious way to make things faster, do show me some code and I'll gladly test it out.

Last edited by roondar; 20 February 2020 at 10:53.
Old 20 February 2020, 11:01   #33
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 670
Quote:
Originally Posted by roondar View Post
Edit: just to be clear on this, it's certainly possible that some of my code is not fully optimized. As a coder, I'm always willing to consider better options. So if someone looks at my code and sees an obvious way to make things faster, do show me some code and I'll gladly test it out.
Let the BOB-wars begin! Who can draw the most bobs on a vanilla A1200 at 50fps?
Old 20 February 2020, 11:12   #34
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Quote:
Originally Posted by britelite View Post
Let the BOB-wars begin! Who can draw the most bobs on a vanilla A1200 at 50fps?
Precisely, bring out yer BOBs!
Old 20 February 2020, 11:15   #35
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
@roondar

If you don't use all 64 colours for the bobs, you could do a full cookie-cut only on the planes that are used and a fast cookie-cut (just making holes in the unused planes) on the rest, so you should be able to get them 25% faster. Maybe you can do those with the CPU.
Old 20 February 2020, 11:56   #36
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Quote:
Originally Posted by sandruzzo View Post
@roondar

If you don't use all 64 colours for the bobs, you could do a full cookie-cut only on the planes that are used and a fast cookie-cut (just making holes in the unused planes) on the rest, so you should be able to get them 25% faster. Maybe you can do those with the CPU.
I find it unlikely this method will give a 25% speed up. Here's why:

Cookie-cut is three reads and one write per longword output, or 4 accesses in total per longword. Mask is two reads and one write per longword output, or 3 accesses in total per longword.

With that info, we can make a little table showing the theoretical savings:
Code:
Bob depth    Mask depth   CC    MSK   Total   Cost
6 (64 col)   0            24     0    24      100.0%
5 (32 col)   1            20     3    23       95.8%
4 (16 col)   2            16     6    22       91.7%
3 (8 col)    3            12     9    21       87.5%
2 (4 col)    4             8    12    20       83.3%
1 (2 col)    5             4    15    19       79.2%

Table shows memory accesses for one longword of blitting at 6 bitplanes (CC = cookie-cut planes, MSK = mask-only planes).
So, for a reasonable trade-off of 16 colour bobs on a 64 colour background, the maximum saving is only 8.3%, not 25%. That may still be worthwhile, but it's also uncertain whether the changed code would still fit in the instruction cache (the current code does fit, but there's not a lot of space left). Exceeding the cache would lower the result.
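The table can be reproduced with a few lines of Python. This is just the arithmetic from the post (4 accesses per cookie-cut plane, 3 per mask-only plane, 6 bitplanes); the names `blit_cost`, `COOKIE_CUT_ACCESSES` and `MASK_ACCESSES` are invented for this sketch.

```python
PLANES = 6
COOKIE_CUT_ACCESSES = 4  # three reads + one write per longword
MASK_ACCESSES = 3        # two reads + one write per longword


def blit_cost(bob_depth, planes=PLANES):
    """Return (total accesses, cost in % of a full-depth cookie-cut)."""
    full = bob_depth * COOKIE_CUT_ACCESSES
    masked = (planes - bob_depth) * MASK_ACCESSES
    total = full + masked
    baseline = planes * COOKIE_CUT_ACCESSES
    return total, 100.0 * total / baseline


if __name__ == "__main__":
    for depth in range(PLANES, 0, -1):
        total, pct = blit_cost(depth)
        print(f"bob depth {depth}: {total} accesses, {pct:.1f}% of full cost")
```

Note that this only counts memory accesses. It says nothing about the overhead of setting up several blits instead of one, or about cache effects.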

There's also my experience making this program. It became very clear to me that theoretical results only rarely end up applying in "the real world". An example: my 32 pixel wide bob routine optimizes the reading of masks by only reading the mask once per 6 bitplanes instead of once per bitplane (like the Blitter does). This saves 20% of all memory accesses done and therefore should result in a speed up of the blit of around 20%.

But in reality, doing this only sped up the blitting process by about 5% over the generic routine for the bob size I'm using.
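The "20% of all memory accesses" figure follows from counting accesses per output longword, as in this small sketch. It assumes the same per-plane cost as above (bob read, background read, result write) plus a variable number of mask reads; the helper name `accesses_per_longword` is made up.

```python
def accesses_per_longword(planes, mask_reads):
    """Memory accesses to cookie-cut one longword across all bitplanes."""
    per_plane = 3  # read bob, read background, write result
    return mask_reads + planes * per_plane


generic = accesses_per_longword(planes=6, mask_reads=6)  # mask read per plane
shared = accesses_per_longword(planes=6, mask_reads=1)   # mask read once
saving_pct = 100.0 * (generic - shared) / generic

if __name__ == "__main__":
    print(generic, shared, round(saving_pct, 1))
```

So the access count drops from 24 to 19, roughly a fifth, while the measured gain reported above was only about 5%.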

Do note I don't mean this can't be tried or can't work. Nor do I mean to say it's a bad idea; I like (almost) all ideas. By all means, try everything, it can only help. All I mean is that expecting a 25% gain is unrealistic.
Old 20 February 2020, 14:00   #37
sandruzzo
Registered User
 
Join Date: Feb 2011
Location: Italy/Rome
Posts: 1,779
@roondar

I meant that, since you're no longer using 3 source channels, each fast cookie-cut would be 25% faster than a regular one for that plane, but not for the whole blit.

If you're using 16 colours, a full cookie-cut plane takes 8 cycles for each word; the other planes take only 6 cycles. That's 25% faster.
Old 20 February 2020, 14:16   #38
Chrille
Registered User

 
Join Date: Sep 2018
Location: Germany
Posts: 33
@roondar:
Interesting that you're trying a Blitter/processor combination even for copying memory. I guess that your copy routine is not optimized to the maximum.
I think your assumption that the processor can only have every second cycle is wrong. As far as I know, the processor can have every cycle to chip RAM.
I could be wrong, as this was more than 20 years ago...

I would assume that using only the processor should be faster on Amigas with 32-bit chip RAM, as the Blitter only has 16-bit accesses. So in theory the processor is twice as fast as the Blitter at just copying memory.

At least in my unfinished game, which I wrote in 1995, we used the processor for refreshing the bobs with graphic tiles. The graphic tiles were 32*16 (or 32*32, depending on the resolution we used). IIRC it was about twice as fast as using the Blitter, even on chip-RAM-only systems. It also ran faster if you could allocate fast RAM for the graphic tiles. So in theory the performance boost could be up to 300% on fast processors for refreshing graphics tiles.

At least I was thinking of replacing the Blitter routines with processor routines on fast Amigas.

Also, this might be interesting for you:
http://aminet.net/package/util/boot/CpuBlit98

Maybe my old source is interesting (just a copy and paste from my unfinished game, and sorry for it being incomplete, but I think that my code is a little bit faster):
Code:
RfrshCSet256:
	movea.l	AufScreen,a4
	adda.l	(YDestTab,d1*4),a4	; add screen offset for the Y position
	move.w	d0,d5
	move.w	D0,D2
	add.w	D2,D2
	add.w	D2,D2			; d2 = d0 * 4
	adda.w	D2,a4
	move.w	ScreenX,D6
	move.w	d1,d3			; copy of YPos to d3
	lsr.w	#4,d3			; YPos in CSets
	add.w	(CSetmYMap,d3*2),d5	; position in LevelMap to d5
	move.w	(LevelMap,d5*2),d3	; CSet to d3
	move.l	(CSetPointer,d3*4),a1	; CSet graphics to a1
	move.w	d1,d2			;
	and.w	#$f,d2			;
	adda.w	(CSetTab,d2*2),a1	;
	sub.w	#15,d2			; calculate ...
	neg.w	d2			; the lines that have to be copied
	btst	#1,ObjType+1(a0)	; is the BOB only 16 pixels high?
	beq.s	.loop			;
	cmp.w	#8,d2			; is the area to refresh larger than 8?
	blo.s	.loop			; no -> .loop
	moveq	#8,d2			; shorten to 8 lines
.loop:
	move.l	(A1)+,(a4)		; copy plane 1
	adda.w	D6,a4			; destination to plane 2
	move.l	(A1)+,(a4)		; copy plane 2
	adda.w	D6,a4			; destination to plane 3
	move.l	(A1)+,(a4)		; copy plane 3
	adda.w	D6,a4			; destination to plane 4
	move.l	(A1)+,(a4)		; copy plane 4
	adda.w	D6,a4			; destination to plane 5
	move.l	(A1)+,(a4)		; copy plane 5
	adda.w	D6,a4			; destination to plane 6
	move.l	(A1)+,(a4)		; copy plane 6
	adda.w	D6,a4			; destination to plane 7
	move.l	(A1)+,(a4)		; copy plane 7
	adda.w	D6,a4			; destination to plane 8
	move.l	(A1)+,(a4)		; copy plane 8
	adda.w	D6,a4			; destination to plane 1 of the next line
	dbf	D2,.loop		; until all lines are copied
	move.w	d1,d2			; Y position to d2
	and.w	#$F,d2			; keep only the low Y position bits
	btst	#1,ObjType+1(a0)	; is the BOB only 16 pixels high?
	beq.s	.next1			;
	subq.w	#8,d2			;
.next1:
	subq.w	#1,d2			; d2 - 1 = number of lines left to copy
	bmi	.end			; exit if no lines are left
	move.w	(LevelMap+40,d5*2),d3	; CSet from the next map row to d3
	move.l	(CSetPointer,d3*4),a1	; CSet graphics to a1
.loop2:
	move.l	(A1)+,(a4)		; copy plane 1
	adda.w	D6,a4			; destination to plane 2
	move.l	(A1)+,(a4)		; copy plane 2
	adda.w	D6,a4			; destination to plane 3
	move.l	(A1)+,(a4)		; copy plane 3
	adda.w	D6,a4			; destination to plane 4
	move.l	(A1)+,(a4)		; copy plane 4
	adda.w	D6,a4			; destination to plane 5
	move.l	(A1)+,(a4)		; copy plane 5
	adda.w	D6,a4			; destination to plane 6
	move.l	(A1)+,(a4)		; copy plane 6
	adda.w	D6,a4			; destination to plane 7
	move.l	(A1)+,(a4)		; copy plane 7
	adda.w	D6,a4			; destination to plane 8
	move.l	(A1)+,(a4)		; copy plane 8
	adda.w	D6,a4			; destination to plane 1 of the next line
	dbf	D2,.loop2		; until all lines are copied
.end
	rts
And don't be confused by the comments (originally in German): CSets are tiles. I called them CSet, for Character Set, because I learned assembler on the C64.

If I am wrong, then it has something to do with bitplanes and resolution: at the beginning we used 640*512@6BPL and later we used 640*256@8BPL. And if my memory is not completely wrong, the CPU was twice as fast as the Blitter at refreshing.

Last edited by Chrille; 20 February 2020 at 15:04.
Old 20 February 2020, 15:13   #39
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,723
Quote:
Originally Posted by sandruzzo View Post
@roondar

I meant that, since you're no longer using 3 source channels, each fast cookie-cut would be 25% faster than a regular one for that plane, but not for the whole blit.

If you're using 16 colours, a full cookie-cut plane takes 8 cycles for each word; the other planes take only 6 cycles. That's 25% faster.
Right, I see what you mean now. The overall blit will be about 8% faster, but the individual mask-only planes are 25% faster, assuming the extra overhead of running several blits rather than one doesn't get in the way.

Thanks for the explanation.

Quote:
Originally Posted by Chrille View Post
@roondar:
Interesting that you're trying a Blitter/processor combination even for copying memory. I guess that your copy routine is not optimized to the maximum.
Well, my copy routine is really no more than:
Code:
.lp	move.l	(a1)+,(a2)+	; One of these in the source for every longword of width to copy
	add.l	d4,a1		; Modulo
	add.l	d4,a2		; Modulo
	dbra 	d7,.lp

Optionally unrolled several times, optimum speed was seen at two copies of the move commands.
Three+ copies didn't help performance in any way.
Also tried removing the add.l statements during my experimentation. 
This didn't change performance either.
Note that this is basically identical to the copy part of the code you posted, only I use 2x unrolling (as I found no speed benefits for further unrolling) and you do 8x
Quote:
I think your assumption that the processor can only have every second cycle is wrong. As far as I know, the processor can have every cycle to chip RAM.
I could be wrong, as this was more than 20 years ago...
This isn't actually my assumption.

I got this info from Toni Wilen. See here: http://eab.abime.net/showpost.php?p=1333109&postcount=4. It seems to be correct too, as running bustest on an A1200 lists a peak write rate to chip memory of 7MB/sec, which is identical to what the Blitter does and exactly what you'd get if the processor accesses the bus every other cycle.
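As a rough sanity check of the 7MB/sec figure, here is the arithmetic as a Python sketch. It assumes a PAL machine with about 3,546,895 chip bus slots per second (one per colour clock) and ignores refresh, display and other DMA, so these are theoretical peaks only; the constant and function names are invented for this example.

```python
# Assumption: PAL colour clock, one chip memory slot per colour clock.
PAL_SLOT_RATE_HZ = 3_546_895


def peak_mb_per_sec(bytes_per_access, slots_per_access):
    """Theoretical peak transfer rate in MB/sec (1 MB = 1,000,000 bytes)."""
    accesses_per_sec = PAL_SLOT_RATE_HZ / slots_per_access
    return accesses_per_sec * bytes_per_access / 1_000_000


# Blitter: 16 bits per access, can use every slot.
blitter_peak = peak_mb_per_sec(bytes_per_access=2, slots_per_access=1)
# CPU: 32 bits per access, but only every other slot.
cpu_peak = peak_mb_per_sec(bytes_per_access=4, slots_per_access=2)

if __name__ == "__main__":
    print(f"Blitter: {blitter_peak:.2f} MB/sec, CPU: {cpu_peak:.2f} MB/sec")
```

Under these assumptions both come out at the same value of roughly 7.1MB/sec, which is in line with the bustest result quoted above.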
Quote:
I would assume that using only the processor should be faster on Amigas with 32-bit chip RAM, as the Blitter only has 16-bit accesses. So in theory the processor is twice as fast as the Blitter at just copying memory.
This is sadly not the case, as outlined above.
Quote:
At least in my unfinished game which I wrote in 1995, we used the processor for refreshing the bobs with graphic tiles. The graphic tiles were 32*16 (or 32*32 depending on the resolution we used). IIRC it was about twice as fast as using the blitter even on chip ram only systems.
I've heard several claims like these, but it really can't be true for chip-to-chip copies (assuming we run in VBLANK or with high fetch modes; if you run at, say, 1x fetch during 16 colour display DMA, then it does change to favour the CPU). The 68020 in the A1200 can't read from chip memory faster than 5.6MB/sec and it can't write faster than 7MB/sec (as per bustest). This gives a maximum copy speed of 3.15MB/sec.

The Blitter can copy at a maximum of 3.5MB/sec.
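The copy figures can be worked out as below. This is a sketch using only the bustest numbers quoted above; the function names are invented. The 3.15MB/sec value corresponds to averaging the read and write rates and halving the result. A stricter estimate that adds up the time spent reading and writing each byte comes out slightly lower, so the conclusion stays the same either way.

```python
CPU_READ_MB_S = 5.6      # chip RAM read rate, as quoted from bustest
CPU_WRITE_MB_S = 7.0     # chip RAM write rate, as quoted from bustest
BLITTER_BUS_MB_S = 7.0   # Blitter peak bus rate


def copy_rate_averaged(read, write):
    """Average the two rates, then halve: each byte is read and written."""
    return (read + write) / 2 / 2


def copy_rate_time_based(read, write):
    """Time per byte is the read time plus the write time."""
    return 1 / (1 / read + 1 / write)


cpu_copy_avg = copy_rate_averaged(CPU_READ_MB_S, CPU_WRITE_MB_S)
cpu_copy_time = copy_rate_time_based(CPU_READ_MB_S, CPU_WRITE_MB_S)
blitter_copy = BLITTER_BUS_MB_S / 2  # one read plus one write per word

if __name__ == "__main__":
    print(round(cpu_copy_avg, 2), round(cpu_copy_time, 2), blitter_copy)
```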
Quote:
Also it ran faster if you could allocate fast ram for the graphic tiles. So in theory the performance boost could be up to 300% on fast processors for refreshing graphics tiles.
Using fast RAM changes things, yes. This will definitely allow you a pretty nice boost, up to about 2x the speed of the Blitter in fact. But this idea is aimed at chip-only systems.
Quote:
At least I was thinking of replacing the blitter routines with processor routines on fast amigas.

Also this might be interesting for you:
http://aminet.net/package/util/boot/CpuBlit98

Maybe my old source is interesting (just a copy and paste from my unfinished game, and sorry for it being incomplete, but I think that my code is a little bit faster):
Code:
<snip>
And don't be confused by the German comments: CSets are tiles. I called them CSet, for Character Set, because I learned assembler on the C64.
Thanks for the code and link, I'll check them out in more detail later.
Quote:
If I am wrong, then it has something to do with bitplanes and resolution: at the beginning we used 640*512@6BPL and later we used 640*256@8BPL. And if my memory is not completely wrong, the CPU was twice as fast as the Blitter at refreshing.
Don't get me wrong, I'd love to be wrong on this, but the CPU really does not get more than half of the bus cycles. It's limited to a (theoretical) max of 7MB/sec, so it can never be 2x the speed of the Blitter, which also does 7MB/sec max.

Edit: I want to be clear here, this is not me trying to say I'm perfect and only my code is the best. It's just that several claims regarding A1200 CPU performance to chip memory have been floating around and they're not actually correct. I actually really appreciate the posts I've been seeing here.

Last edited by roondar; 20 February 2020 at 15:25. Reason: Removed some accidental all-caps
Old 20 February 2020, 15:22   #40
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 229
Quote:
Originally Posted by roondar View Post
Don't get me wrong, I'd love to be wrong on this, but the CPU really does not get more than half of the bus cycles. It's limited to a (theoretical) max of 7MB/sec, so it can never be 2x the speed of the Blitter, which also does 7MB/sec max.
But as you wrote in an earlier post: doesn't this depend on DMA usage? The CPU can do twice as much in one memory access compared to the Blitter (32 vs 16 bits), but can only get every second slot. But if at least half of the DMA slots* are used by display DMA, the CPU and the Blitter get the same number of slots, so the CPU is up to 2x faster. Chrille's display modes indicate that this may have been the case.


*Well, to be precise, at least every second slot. With AGA fetch modes that's probably more complicated, because fetches occur more in blocks.
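The slot argument can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not a cycle-exact simulation: it assumes the Blitter can use every slot that display DMA leaves free (2 bytes each), that the CPU needs at least one slot between its own accesses (4 bytes each), and that display DMA takes exactly every second slot in the busy case. The function name `bytes_moved` is made up.

```python
def bytes_moved(display_slots, total_slots=200):
    """Bytes the Blitter alone or the CPU alone could move in total_slots.

    display_slots: set of slot numbers taken by display DMA.
    Returns (blitter_bytes, cpu_bytes) for two separate scenarios;
    the two are not competing with each other here.
    """
    blitter = 0
    cpu = 0
    last_cpu_slot = -2
    for slot in range(total_slots):
        if slot in display_slots:
            continue  # slot is taken by display DMA
        blitter += 2  # Blitter: 16 bits in every free slot
        if slot - last_cpu_slot >= 2:
            cpu += 4  # CPU: 32 bits, at most every second slot
            last_cpu_slot = slot
    return blitter, cpu


if __name__ == "__main__":
    print("no display DMA:   ", bytes_moved(set()))
    print("every second slot:", bytes_moved(set(range(1, 200, 2))))
```

With no display DMA both move the same amount of data. With every second slot taken, the Blitter's throughput halves while the CPU's stays the same, which is the up-to-2x case described above. As the footnote says, real AGA fetch patterns are more block-like, so this is an upper bound on the effect at best.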