Performance of Akiko C2P on 030/50 CD32 systems - Page 3

Karlos · 22 May 2024, 17:39

Quote:

Originally Posted by alexh

tf360

There's flat out no way Akiko is going to be worth using on this though.

hitchhikr · 22 May 2024, 19:19

Quote:

You make 8 writes to it and then you read back from it 8 times.

Not necessarily 8, can be less.

EDIT: i mean, you don't need to read back the register 8 times if you don't use 8 bitplanes.

Karlos · 22 May 2024, 20:05

Quote:

Originally Posted by hitchhikr

Not necessarily 8, can be less.

EDIT: i mean, you don't need to read back the register 8 times if you don't use 8 bitplanes.

Sure, but we are in this case.

abu_the_monkey · 22 May 2024, 21:04

Quote:

Originally Posted by Karlos

There's flat out no way Akiko is going to be worth using on this though.

sure the 060 has no need for c2p hardware, but if it is not a hindrance then why not use it? the 060 is probably just waiting around for chip ram bus access, so it could equally just wait around for akiko and bus access and do even less work

paraj · 22 May 2024, 21:19

Quote:

Originally Posted by abu_the_monkey

sure the 060 has no need for c2p hardware, but if it is not a hindrance then why not use it? the 060 is probably just waiting around for chip ram bus access, so it could equally just wait around for akiko and bus access and do even less work

On 060 you can (mostly) overlap all C2P calculations while waiting for the chipmem writes to retire, so you're not just waiting around. Instead, you're fetching the next data from fast mem and converting it. For very simple stuff you can more or less render and C2P a complete frame just while waiting for chipmem!

Accessing the Akiko C2P hardware is not free: you need to write and read stuff back. It would need to be essentially free to compete on 060, and very fast on 030/50.

abu_the_monkey · 22 May 2024, 21:24

and I agree

that is why I said as long as it is not a hindrance.

the proof is in the testing, not just in the theory

paraj · 22 May 2024, 21:36

Quote:

Originally Posted by abu_the_monkey

and I agree

that is why I said as long as it is not a hindrance.

the proof is in the testing, not just in the theory

Ah yes, fully agree. Very annoying that even simple, raw numbers (R/W without dma/ints) aren't available. Hopefully this effort will bring them

Thomas Richter · 22 May 2024, 21:51

Quote:

Originally Posted by abu_the_monkey

sure the 060 has no need for c2p hardware, but if it is not a hindrance then why not use it?

The hindrance is that Akiko has relatively narrow 8-bit input and output registers, and each write and read access requires a full synchronization with the chip clock. With software C2P, by contrast, the CPU can essentially park four long words in its push buffer and continue working (provided chip mem is marked as "imprecise" by the MMU), and while the CPU keeps working, the push buffer is "retiring" the writes.

Karlos · 22 May 2024, 21:59

I feel like this is starting to go beyond what I originally intended.

Thinking about it, Photon makes an interesting suggestion: hybrid C2P. *If* the CPU is able to execute instructions while waiting on writes to Akiko and Chip memory, it's not beyond the realms of possibility that you might be able to craft a routine that uses both to perform C2P on different parts of the whole workload. A task for a hardcore optimisation expert

abu_the_monkey · 22 May 2024, 22:16

Quote:

Originally Posted by Karlos

I feel like this is starting to go beyond what I originally intended.

you do remember what site you are posting on

abu_the_monkey · 22 May 2024, 22:19

Quote:

Originally Posted by Karlos

Thinking about it, Photon makes an interesting suggestion: hybrid C2P. *If* the CPU is able to execute instructions while waiting on writes to Akiko and Chip memory, it's not beyond the realms of possibility that you might be able to craft a routine that uses both to perform C2P on different parts of the whole workload. A task for a hardcore optimisation expert

this might be something to explore, but for me, I would start with the simplest implementation, see what happens, and keep an open mind on possible improvements.

pipper · 22 May 2024, 22:20

Quote:

It's a single address that you can find via the graphics library. You make 8 writes to it and then you read back from it 8 times.

Yeah, this is like the weirdest/laziest interface they could come up with.
Nothing like a function call "GetAkikoInterface" or anything...

Cyprian · 22 May 2024, 22:21

Quote:

Originally Posted by paraj

From the schematics (https://www.amigawiki.org/doku.php?i...ice:schematics) it does look plausible that the same access restrictions as chipmem apply, and that you'd be able to do proper 32-bit accesses. It looks like it's clocked at 7MHz, but I'm not a HW person.

The doom attack source on aminet (http://aminet.net/game/shoot/DoomAttack_src.lha) has c2p routines, and they are very very simple, just write 8 longs to the chip, and read them back. From WinUAE source code I can see that the register in question is located at $b80038.

Would be interesting with measurements of the raw speed, i.e. interrupts and DMA off, and just

Code:

  rept 8
  move.l d0,(a0)
  endr
  rept 8
  move.l (a0),d0
  endr

in a loop as well as variations of the above reading from (chip/fast)/writing to chip.

It would be cool to see figures.
I wonder if accessing Akiko is similar to accessing the hardware registers: if I'm not mistaken, that is 2 cycles of 3.5MHz per access, or faster.
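To put a number on that guess, here is a minimal sketch of the upper bound it would imply. Everything in it is an assumption for illustration: the function name is hypothetical, the PAL colour clock of about 3.55MHz stands in for "3.5MHz", and the cost of reading fast mem and writing chip mem is ignored entirely. It only counts the 8 writes and 8 reads per 32 pixels described earlier in the thread.

```c
#include <assert.h>

/* PAL colour clock in Hz (about 3.55 MHz); used here as the "3.5MHz" clock. */
#define PAL_COLOUR_CLOCK_HZ 3546895.0

/*
 * Hypothetical estimate: seconds spent purely on Akiko register accesses
 * for one frame, if every access costs `clocks_per_access` colour clocks.
 * Fast mem reads and chip mem writes are NOT included.
 */
double akiko_frame_seconds(double clocks_per_access,
                           unsigned width, unsigned height)
{
    assert(clocks_per_access > 0.0);

    double units    = (double)width * (double)height / 32.0; /* 32-pixel units */
    double accesses = units * 16.0;           /* 8 writes + 8 reads per unit */

    return accesses * clocks_per_access / PAL_COLOUR_CLOCK_HZ;
}
```

With 2 colour clocks per access, a 320x256 frame comes out at roughly 23 ms of pure register traffic under these assumptions. Real figures could be better or worse, which is exactly why measurements are needed.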

pandy71 · 23 May 2024, 00:50

Quote:

Originally Posted by alexh

Just one 32-bit address : 0x00b8_0038

Oh... so you first write a DWORD 8 times to this address, and after that you just read a DWORD 8 times from the same address?
Strange - I would have used 8 registers, but perhaps there was an idea behind such an implementation.

Quote:

Originally Posted by alexh

I don't know. Looking at the CD32 schematic the Akiko must also contain the equivalent of the A1200 Budgie. It's a Zorro II FastRAM address but is it shared with accesses to the CHIP RAM bus? I'm not 100% sure.

Yes, I also checked the CD32 schematics, and obviously Akiko is accessible from CPU and CHIP by two independent 32-bit buses, so technically C2P could be clocked with a higher clock. Akiko also seems to use the CPU clock (made from XORed 7MHz and CDAC), i.e. 14MHz. Besides this, weirdly to me, it is accessible(?) from CPU reserved type space.

Quote:

Originally Posted by alexh

It is, but it is taking place in the CPU data cache at the CPU clock frequency (e.g. 50MHz).

Yes but you need to read data by CPU, perform shuffling (time costly), write data to CHIP or somewhere else.
Let's say Read and Write can be the same speed as Write and Read to and from Akiko; then data shuffling for sure will take more cycles than R/W.
And the Akiko C2P HW performs the data shuffling immediately, as it is hardwired.

alexh · 23 May 2024, 12:22

Quote:

Originally Posted by pandy71

Quote:

Originally Posted by alexh

Quote:

Originally Posted by pandy71

My assumption is that C2P on CPU is more than just writing and reading - some additional operations must be performed, like shift, mask etc., so more CPU cycles are required per pixel.

It is, but it is taking place in the CPU data cache at the CPU clock frequency (e.g. 50MHz).

Yes but you need to read data by CPU, perform shuffling (time costly), write data to CHIP

Yes.

Quote:

Originally Posted by pandy71

Let's say Read and Write can be the same speed as Write and Read to and from Akiko

Read from FastRAM into data cache which is the same. Write to ChipRAM which is the same (maybe?). One has a R/W to Akiko. The other has C2P. It's all down to the bandwidth from CPU cache to/from Akiko vs C2P running from processor cache.

Quote:

Originally Posted by pandy71

then data shuffling for sure will take more cycles than R/W.

I don't think so. Write and read to/from Akiko is at best 14MHz (but probably slower 3.5MHz) whereas the C2P is happening in cache on 030@50MHz. I think that gives C2P ~28 CPU cycles to break even with Akiko @14MHz. (Possibly more if the write to ChipRAM is more efficient)
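The break-even arithmetic can be sketched as below. The function name and its parameters are hypothetical, and the post does not say how the ~28 figure was derived. One reading that lands close to it is 8 accesses at one 14MHz clock each, seen from a 50MHz CPU; that reading is an assumption, not something stated in the thread.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many CPU cycles elapse while `accesses` Akiko
 * accesses complete, if each access takes `clocks_per_access` clocks of a
 * bus running at `akiko_mhz`. A software C2P that converts the same
 * 32 pixels in fewer CPU cycles would come out ahead.
 */
double breakeven_cpu_cycles(double cpu_mhz, double akiko_mhz,
                            unsigned accesses, double clocks_per_access)
{
    assert(cpu_mhz > 0.0 && akiko_mhz > 0.0);
    return (double)accesses * clocks_per_access * cpu_mhz / akiko_mhz;
}
```

Under these assumptions, `breakeven_cpu_cycles(50.0, 14.0, 8, 1.0)` gives about 28.6 cycles. Counting all 16 accesses doubles that, and a 3.5MHz access rate multiplies it by four again, so the budget for software C2P per 32 pixels depends heavily on numbers nobody in the thread has measured yet.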

hooverphonique · 23 May 2024, 14:17

The chip itself runs at the same clock as the cpu (~14MHz) as far as I can see from the schematic. What it does with this internally, I don't know.

All the bus arbitration stuff in the CD32 is handled by Akiko (i.e. it "controls" access to the rest of the custom chips, not the other way round), so you would need to know the internals of it to determine what the rules are for accessing the C2P register.

I would expect the rules are the same as for fastram, except that the address needs to be marked as non-cacheable.

Photon · 23 May 2024, 18:00

Quote:

Originally Posted by alexh

Akiko can't "do" anything. It is a slave. You write data into it and then read it back using the CPU.

Right. Sigh, you see some of the info sometimes, and then I forgot. This makes it "half-14MHz-copyspeed-with-some-extra-work-for-the-CPU". The reward is that hopefully the conversion only takes as long as the last write to the Akiko address.

Quote:

Originally Posted by alexh

This is to optimise the software C2P?

IIRC there's at least a way to get close to the write speed of the memory. The caching itself can't improve speed if you do an entire buffer conversion once per frame - you read each source address only once, and write each destination address only once.

Quote:

Originally Posted by alexh

I'm curious to know what this means?

It means that ideally before every write, have the CPU prepared to calculate internally immediately after, from already cached or register data, with instructions that won't have to read from memory, using instructions already in the cache and partially already in the pipeline.

This is just what would be ideal for a CPU that is several times faster than memory. The design of the individual model could deviate from the ideal for many reasons, or already detect write-throughs and defer them to not stall the pipeline.

Anyway, I thought you wrote an address to the Akiko register. Even if it completes a conversion in the time it takes to feed it data, it should assist soft C2P less than the Blitter.

I'm starting to think this extra chip is best used only if stock CD32 is detected. Possibly you could scatter a few move.l (a0),(a1) somewhere in a C2P routine without terrible consequences. Then it could help convert a few pixels per row maybe. But I think only if you get them virtually for free. And only for full 8-bit C2P, since you have to write all 8.

Karlos · 24 May 2024, 00:46

I may have hit a snag. I've written a small test C program that first tries to detect whether Akiko is present (it looks for the magic 0xCAFE ident at the hardware address, which I've just forgotten having turned off the computer). Turning UAE chipset extra to CD32 results in this detection working as expected and reporting Akiko exists. Reverting to A1200 chipset fails the test and reports no Akiko, as expected, so I'm pretty sure this is fine.

Next, I have a tiny ASM function to write 8 ULONG pointed to by a0 to the hardware address at $B80038 and then read them back to a buffer pointed to by a1. This is just for validation purposes so far and I used assembler to ensure nothing could be optimised away by the compiler here.

However, it seems all I'm getting back is zero (the destination buffer is prefilled with a different value).

As I'm hitting the (virtual) metal directly, I naively assumed that under emulation conditions, this would just work up to this stage. I've tried messing with the CPU cache on the Amiga side and various UAE settings in the emulator, but so far, no dice.
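For validating a test like this, it can help to know what the read-back values should be. Below is a minimal software model of the write-8/read-8 protocol, offered as a sketch under assumptions: the `akiko_model` type and its functions are hypothetical names, and the bit layout (pixel 0 in the top byte of the first longword written, leftmost pixel in the most significant bit of each plane longword) is the usual Amiga planar convention, not something confirmed against real hardware here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a single register that takes 8 writes, then 8 reads. */
typedef struct {
    uint32_t in[8];   /* 32 chunky pixels, 4 per big-endian longword */
    uint32_t out[8];  /* one longword per bitplane, plane 0 first    */
    unsigned wr;      /* next write slot (0..7) */
    unsigned rd;      /* next read slot (0..7)  */
} akiko_model;

/* Convert the 32 buffered chunky pixels into 8 planar longwords. */
static void akiko_model_convert(akiko_model *m)
{
    for (int plane = 0; plane < 8; plane++) {
        uint32_t bits = 0;
        for (int pixel = 0; pixel < 32; pixel++) {
            uint32_t lw = m->in[pixel / 4];
            uint8_t value = (uint8_t)(lw >> (24 - 8 * (pixel % 4)));
            bits |= (uint32_t)((value >> plane) & 1u) << (31 - pixel);
        }
        m->out[plane] = bits;
    }
}

void akiko_model_write(akiko_model *m, uint32_t value)
{
    assert(m != 0);
    m->in[m->wr] = value;
    m->wr = (m->wr + 1) & 7;
    m->rd = 0;                 /* a write restarts the read sequence */
}

uint32_t akiko_model_read(akiko_model *m)
{
    assert(m != 0);
    if (m->rd == 0)
        akiko_model_convert(m); /* convert on the first read after writing */
    uint32_t value = m->out[m->rd];
    m->rd = (m->rd + 1) & 7;
    m->wr = 0;                 /* a read restarts the write sequence */
    return value;
}
```

As a quick sanity check under this model, writing 0xFF000000 followed by seven zero longwords (only pixel 0 set, to colour 255) should read back 0x80000000 for every plane. Feeding the same input to the real or emulated register and comparing would show whether the zeros come from the access itself or from the data path around it.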

abu_the_monkey · 24 May 2024, 07:01

This is the c2p from Adoom, don't know if it works under emulation.

Code:

mc68020
		multipass
	if (_eval(DEBUG)&$8000)
		debug	on,lattice4
	endc

;void __asm c2p_akiko (register __a0 UBYTE *chunky_data,
;                      register __a1 PLANEPTR raster,
;                      register __a2 UBYTE *dirty_list,
;                      register __d1 ULONG plsiz,
;                      register __a5 UBYTE *akiko_address);

; a0 -> width*height chunky pixels in fastmem
; a1 -> contiguous bitplanes in chipmem
; a2 -> dirty list (1-byte flag for whether each 32 pixel "unit" needs updating)
; d1 = width*height/8   (width*height must be a multiple of 32)

	ifeq	depth-8
		xdef	_c2p_8_akiko
_c2p_8_akiko:
	else
	ifeq	depth-6
		xdef	_c2p_6_akiko
_c2p_6_akiko:
	else
		fail	"unsupported depth!"
	endc
	endc

		xref	_GfxBase

		movem.l	a2/a3/a6,-(sp)

		move.l	d1,d0		; plsiz
		lsl.l	#3,d0		; 8*plsiz
		lea	(a0,d0.l),a3	; a3 -> end of chunky data
		sub.l	d1,d0		; d0 = 7*plsiz
	ifle depth-6
		sub.l	d1,d0
		sub.l	d1,d0		; d0 = 5*plsiz if depth=6
	endc

		movem.l	d0/d1/a0/a1,-(sp)
		movea.l	(_GfxBase).l,a6
		jsr	(_LVOOwnBlitter,a6) ; gain exclusive use of Akiko
		movem.l	(sp)+,d0/d1/a0/a1

loop:		tst.b	(a2)+		; does next 32 pixel unit need updating?
		bne.b	c2p		; branch if yes

		adda.w	#32,a0		; skip 32 pixels on input
		addq.l	#4,a1		; skip 32 pixels on output

		cmpa.l	a3,a0
		bne.b	loop
		bra.b	exit		; exit if no changes

c2p:		move.l	(a0)+,(a5)	; write 32 pixels to akiko
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)
		move.l	(a0)+,(a5)

		move.l	(a5),(a1)	; plane 0
		adda.l	d1,a1
		move.l	(a5),(a1)	; plane 1
		adda.l	d1,a1
		move.l	(a5),(a1)	; plane 2
		adda.l	d1,a1
		move.l	(a5),(a1)	; plane 3
		adda.l	d1,a1
		move.l	(a5),(a1)	; plane 4
		adda.l	d1,a1
	ifgt depth-6
		move.l	(a5),(a1)	; plane 5
		adda.l	d1,a1
		move.l	(a5),(a1)	; plane 6
		adda.l	d1,a1
	endc
		move.l	(a5),(a1)+	; last plane

		suba.l	d0,a1		; -7*plsiz (or 5*plsiz) (or 3*plsiz)

		cmpa.l	a3,a0
		bne.b	loop

exit:		jsr	(_LVODisownBlitter,a6) ; free Akiko

		movem.l	(sp)+,a2/a3/a6
		rts

Maybe it helps?
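The routine above expects a dirty list in a2 but does not show how it is produced. One plausible way to build it, sketched in C, is to compare the current chunky frame against the previous one in 32-pixel units. This is an assumption for illustration only: `build_dirty_list` is a hypothetical name, and the code the routine was taken from may well generate its list differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: set dirty[u] to 1 if 32-pixel unit u differs between
 * the previous and current chunky buffers, else 0. Returns the number of
 * dirty units. `pixels` must be a multiple of 32, matching the constraint
 * stated in the routine's comments.
 */
size_t build_dirty_list(const unsigned char *prev, const unsigned char *cur,
                        size_t pixels, unsigned char *dirty)
{
    assert(pixels % 32 == 0);

    size_t changed = 0;
    for (size_t unit = 0; unit < pixels / 32; unit++) {
        int differs = memcmp(prev + unit * 32, cur + unit * 32, 32) != 0;
        dirty[unit] = (unsigned char)differs;
        changed += (size_t)differs;
    }
    return changed;
}
```

The comparison itself costs a full read of both buffers from fast mem, so whether it pays off depends on how much of the frame typically changes.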

alexh · 24 May 2024, 09:10

https://github.com/tonioni/WinUAE/blob/master/akiko.cpp

Line 300 for the code for the akiko emulation
