Performance of Akiko C2P on 030/50 CD32 systems - Page 11

lmimmfn · 01 June 2024, 01:47

Quote:

Originally Posted by Karlos

Just a warning: since getting my 030 UAE config in better shape, I've found that none of the CACR manipulation versions will run without the data cache enabled anyway; it just freezes up. I should've tested that more carefully. However, the plan is to run with the cache enabled, provided the CACR trick fixes the Akiko read issue.

I'm lurking on this chat but you really need to hail Tony.

Karlos · 01 June 2024, 01:50

Quote:

Originally Posted by lmimmfn

I'm lurking on this chat but you really need to hail Tony.

I don't know that it's a bug in UAE at all. It seems quite reasonable that you might be able to lock up the real 68030 doing silly things in supervisor mode while interrupts are disabled.

Lunda · 01 June 2024, 09:58

Quote:

Originally Posted by Karlos

@lunda if you do get to check this, it needs checking with the cache enabled as well as disabled. Each of the Akiko tests, including the verification, has a CACR bashing version.

For some reason my machine could not run the test without the data cache (crash, then reboot). It might be an issue with the beast. I tested with both SDRAM and SRAM.

edit: I found the reason after reading all new posts.

Karlos · 01 June 2024, 12:06

Well, it's fair to say that disabling write allocation fixes the Akiko read-back problem, but the Akiko routine is clearly far behind the software C2P on this machine: the software version beats it by a clear 25%.

It certainly seems to be a pure IO bandwidth limitation, i.e. the accepted wisdom. If the chip RAM writes were faster, the simplicity of the Akiko routine would theoretically allow it to beat the software conversion, since the software version relies on hiding its ALU effort behind the slow writes. We can actually test that by just doing C2P from fast to fast.

abu_the_monkey · 01 June 2024, 13:23

it would still be nice to know where the crossover point is between using Akiko vs CPU.
Is it an 030@50MHz or faster?

Karlos · 01 June 2024, 13:26

Quote:

Originally Posted by abu_the_monkey

it would still be nice to know where the crossover point is between using Akiko vs CPU.
Is it an 030@50MHz or faster?

Maybe Lunda can test different clock crystals?

abu_the_monkey · 01 June 2024, 13:39

I guess Akiko will still perform the same?

Karlos · 01 June 2024, 13:53

Quote:

Originally Posted by abu_the_monkey

I guess Akiko will still perform the same?

I think it'll take the same number of cycles but the cycles will be longer. The total chip ram delay is the only real invariant. So they might just both end up converging to the same speed, limited by chip write bandwidth.

paraj · 01 June 2024, 15:56

Unfortunately it looks to me like it's just not going to be worth it on 030 unless "normal" accelerator cards behave radically differently or some wizard comes up with a serious improvement to the instruction scheduling.

The time for "Naive (WA)" is still very close to "Null C2P + Akiko Limit (WA)", and Kalms - Null C2P is only 45314 ticks, so assuming just that part scales linearly with clock frequency, it'd start being faster at around 25MHz.
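The linear-scaling step in that estimate can be sketched as below. This is only a rough illustration: it assumes the 45314-tick figure was measured at 50 MHz (taken from the thread title, not stated in the post) and that a "tick" is a fixed unit of wall-clock time, so CPU-bound work costs proportionally more ticks on a slower clock. `scale_ticks` is a hypothetical helper name.

```python
def scale_ticks(ticks_at_ref, ref_mhz, target_mhz):
    """Ticks needed for purely CPU-bound work at a different clock speed."""
    return ticks_at_ref * ref_mhz / target_mhz

NULL_C2P_TICKS = 45314   # "Kalms - Null C2P" figure quoted in the post
REF_MHZ = 50.0           # assumption: measured on an 030 at 50 MHz

for mhz in (50.0, 40.0, 33.0, 25.0):
    ticks = scale_ticks(NULL_C2P_TICKS, REF_MHZ, mhz)
    print(f"{mhz:4.0f} MHz: {ticks:8.0f} ticks")
```

Halving the clock doubles the CPU-bound part, so under these assumptions the software conversion's ALU work alone would cost about 90628 ticks at 25 MHz. The Akiko figure it has to be compared against is not quoted in this post, so the crossover itself is left to the reader.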

Karlos · 01 June 2024, 16:53

Not having DMA output to chip ram. What a missed opportunity. It's not as if there's much that runs on 020/14 + Fast that can use C2P that isn't just faster using chunky copper screen tricks, so it was only ever going to be truly useful with a faster CPU in the first place.

I know it was "for free", but it's also a bit of a chocolate teapot without being able to get the data out of it faster.

abu_the_monkey · 01 June 2024, 17:04

yep.

still, it would be nice to have the numbers from a range of setups.
at least then it can be put to bed once and for all.

Karlos · 01 June 2024, 19:25

I think the bus is maxed out when talking to Akiko. If it's doing a transfer every 3 cycles and the bus is 14 MHz, that's 4*14/3 = 18.67 MB/s

The conversion does 9MB/s, but considering it's a write and read workload, that's your 18MB/s nommed up.
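As a quick check of that arithmetic, here is a small sketch using only the figures from the post (4 bytes per transfer, one transfer every 3 cycles, a 14 MHz bus). The helper name `bus_mb_per_s` is made up for illustration, and MB means 10^6 bytes.

```python
def bus_mb_per_s(bus_mhz, bytes_per_transfer, cycles_per_transfer):
    """Peak throughput in MB/s for a bus moving one transfer every N cycles."""
    return bytes_per_transfer * bus_mhz / cycles_per_transfer

peak = bus_mb_per_s(14.0, 4, 3)   # longword transfers, 3 cycles each
per_direction = peak / 2          # each longword is written to Akiko, then read back

print(f"peak bus rate:  {peak:.2f} MB/s")
print(f"converted data: {per_direction:.2f} MB/s")
```

Because every converted longword crosses the bus twice, the useful conversion rate is half the peak, which lines up with the roughly 9MB/s observed.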

abu_the_monkey · 01 June 2024, 19:42

random thunk.

Code:

; #############################################################################
; in:  a0 = chunky source, a1 = planar destination (plane 0)
; planes are taken to be 10240 bytes apart; d0/a0/a1 are trashed
    movem.l d1-d7/a2/a3/a6,-(sp)

    ; back up the inputs
    move.l  a0,a2
    move.l  a1,a3

    move.l  _SysBase,a6
    jsr     _LVOForbid(a6)
    jsr     _LVODisable(a6)

    move.l  #$00B80038,a0
    move.w  #2559-1,d0 ; was #2560-1, now an extra 1 less cos the last write falls through
    move.l  a3,a1

    ; a0 akiko
    ; a2 source
    ; a3 destination (current longword in plane 0)
    ; a1 write pointer, stepped from plane to plane
; #############################################################################
    ; prime akiko with the first 32 chunky pixels
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
; #############################################################################
.loop:
    ; write plane 0
    move.l  (a0),(a1)
    add.w   #10240,a1

    ; read planes 1-7 into registers
    move.l  (a0),d1
    move.l  (a0),d2
    move.l  (a0),d3
    move.l  (a0),d4
    move.l  (a0),d5
    move.l  (a0),d6
    move.l  (a0),d7

    ; feed akiko the next 32 chunky pixels
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)
    move.l  (a2)+,(a0)

    ; write planes 1-7 from the registers
    move.l  d1,(a1)
    add.w   #10240,a1

    move.l  d2,(a1)
    add.w   #10240,a1

    move.l  d3,(a1)
    add.w   #10240,a1

    move.l  d4,(a1)
    add.w   #10240,a1

    move.l  d5,(a1)
    add.w   #10240,a1

    move.l  d6,(a1)
    add.w   #10240,a1
    add.w   #4,a3       ; next longword in plane 0

    move.l  d7,(a1)
    add.w   #10240,a1
    move.l  a3,a1
    dbra    d0,.loop
; #############################################################################
    ; read out the final group, all 8 planes
    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1

    move.l  (a0),(a1)
    add.w   #10240,a1
    move.l  a3,a1
; #############################################################################
    jsr     _LVOEnable(a6)
    jsr     _LVOPermit(a6)

    movem.l (sp)+,d1-d7/a2/a3/a6
    rts
; #############################################################################

probably contains mistakes
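For anyone wanting to sanity-check the output of a routine like this, a plain software model of one 32-pixel step could look like the sketch below. It implements the usual chunky-to-planar definition (bit p of pixel i goes to plane p, leftmost pixel in the top bit). That Akiko returns its 8 longwords in exactly this order, plane 0 first, is an assumption based on the "write plane 0" comment in the code above. The 320x256 screen size is likewise inferred from the constants 10240 and 2560 rather than stated anywhere, and `c2p_32` is a hypothetical name.

```python
def c2p_32(chunky):
    """Convert 32 chunky 8-bit pixels into 8 planar longwords, plane 0 first."""
    if len(chunky) != 32:
        raise ValueError("expected exactly 32 pixels")
    planes = [0] * 8
    for i, pixel in enumerate(chunky):
        for p in range(8):
            if pixel & (1 << p):
                planes[p] |= 1 << (31 - i)  # leftmost pixel lands in the top bit
    return planes

# Geometry implied by the constants in the routine (assumed 320x256, 8 planes)
WIDTH, HEIGHT = 320, 256
PLANE_BYTES = WIDTH * HEIGHT // 8   # stride between planes: 10240
GROUPS = WIDTH * HEIGHT // 32       # 32-pixel groups per frame: 2560

print(PLANE_BYTES, GROUPS)
print([hex(v) for v in c2p_32(bytes([0xFF] + [0] * 31))])
```

The loop count of 2559 plus the final read-out after the loop accounts for all 2560 groups, and each `add.w #10240,a1` steps to the same longword in the next plane.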

NorthWay · 01 June 2024, 20:46

Would it be beneficial for some specs to do every other decode with Akiko and then cpu? You would have to interleave all 16 Akiko reads and writes in-between the cpu c2p (i.e. not blindly do one and then the other but both at the same time).

Karlos · 01 June 2024, 21:44

@abu_the_monkey

Try it. The only things to say are you don't need Forbid/Permit since Disable achieves the same thing regardless. You will probably want to disable write allocate before talking to Akiko too. The latest code does this but it's basically identical to what paraj posted a bit earlier.
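For reference, the bit arithmetic behind "disable write allocate" can be sketched as follows. On the 68030 the write-allocate flag is bit 13 of CACR and the data cache enable is bit 8 (the same values exec uses for its CACRF_* flags). The sample register value is made up for illustration, and actually changing CACR still needs `movec` in supervisor mode or exec's CacheControl(); this only shows the masking.

```python
# 68030 CACR bit positions
CACR_ENABLE_I       = 1 << 0
CACR_ENABLE_D       = 1 << 8
CACR_WRITE_ALLOCATE = 1 << 13

def without_write_allocate(cacr):
    """Return the CACR value with write allocation cleared, other bits untouched."""
    return cacr & ~CACR_WRITE_ALLOCATE & 0xFFFFFFFF

sample = 0x3111                        # hypothetical: caches, bursts and WA all on
patched = without_write_allocate(sample)

print(hex(sample), "->", hex(patched))
print("data cache still enabled:", bool(patched & CACR_ENABLE_D))
```

The point of masking rather than writing a fixed value is that the data cache stays enabled, which matters here since the CACR versions reportedly freeze without it.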

Thomas Richter · 01 June 2024, 21:47

Quote:

Originally Posted by NorthWay

Would it be beneficial for some specs to do every other decode with Akiko and then cpu? You would have to interleave all 16 Akiko reads and writes in-between the cpu c2p (i.e. not blindly do one and then the other but both at the same time).

Hardly. Akiko is synchronous, and the slow part is attempting to read from its registers as the CPU needs to wait for the relatively slow chip bus. For a conversion from fast mem to chip mem, the CPU does not need to wait for anything - it can retire chip bus accesses in its push buffer while continuing to work. That does not help for Akiko.

Karlos · 01 June 2024, 22:14

It has been fun but I think we've pretty effectively demonstrated the common wisdom. It's all in the bus, you just can't move the data around fast enough to beat code able to execute on the CPU behind pending writes.

abu_the_monkey · 01 June 2024, 22:29

Quote:

Originally Posted by Karlos

@abu_the_monkey

Try it. The only things to say are you don't need Forbid/Permit since Disable achieves the same thing regardless. You will probably want to disable write allocate before talking to Akiko too. The latest code does this but it's basically identical to what paraj posted a bit earlier.

WinUAE will not be a good gauge and I don't have real hardware to test on. it would still have the overhead of using the Akiko, just wondered if some of the reads/writes could be done just after (during) a write to chip ram.

Quote:

Originally Posted by Karlos

It has been fun but I think we've pretty effectively demonstrated the common wisdom. It's all in the bus, you just can't move the data around fast enough to beat code able to execute on the CPU behind pending writes.

yes, but where is the point/speed where it becomes better to use the cpu is something I really wanted to know.

Karlos · 01 June 2024, 22:35

I don't know - isn't the bus logic busy servicing the chip ram write? I don't think you can just go and do a read from somewhere else (unless in cache I suppose) while you are waiting for it.

This isn't my area of expertise mind.

abu_the_monkey · 01 June 2024, 22:38

nor mine, just a thought that popped in my tired noggin
