CIA access speed differs with Amiga model

patrik · 26 December 2021, 16:04

Hi,

Have been fiddling with a project which pokes the CIA 8520 parallel port. Did all work initially on an A3000, but later when testing it on an A1200, I noticed I got significantly lower speeds.

That was unexpected, so did some more thorough testing on more machines. On A500 (68000), A2000 (68030) or A1200 (68020, 68030, 68060), I got almost exactly half the speed when accessing CIA registers compared to an A3000 (68030) or A4000 (68060), and it was the same for a few repeated accesses or when hammering on it for seconds.

I always assumed that the CIA access speed would be the same on all Amigas, as they are all driven by the same E-clock at ~710kHz, only differing slightly between PAL and NTSC machines.

Figured out what I think is a quite good test to illustrate this difference - repeatedly reading the low byte register of one of the CIA timers in running mode. When the timer is running, this register will count down one step for each E-clock cycle, so if you do repeated reads of it and compare the values read, you can see how many E-clock cycles are required for each read.
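A minimal sketch of how such readings can be turned into cycle counts (Python, purely illustrative; the utility itself uses move.b and is not reproduced here). It assumes fewer than 256 E-clock cycles pass between two reads and that the timer does not reload between samples. The function name and the sample values are made up for this example.

```python
def e_cycles_between_reads(samples):
    """Return the E-clock cycles elapsed between consecutive timer low-byte reads.

    The register counts DOWN and is only 8 bits wide, so the difference is
    taken modulo 256 to survive a wrap from 0x00 to 0xFF.
    """
    return [(prev - cur) % 256 for prev, cur in zip(samples, samples[1:])]


# Hypothetical readings that cross the 8-bit wrap (3 -> 1 -> 255 -> 253).
print(e_cycles_between_reads([3, 1, 255, 253]))
```

A constant step of 1 would mean one read per E-clock cycle, a constant step of 2 one read every other cycle.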

Have written a utility which does this plus an interleaved read/write test, so it is possible to see if there is any difference for writes. It does this for all available timers (four in total, but usually only two are available). The reads and writes are repeated move.b to/from registers, done inside Disable()/Enable(), so they should be fast enough on all machines and not possible to disturb. Executable and source are included in the archive:
http://megaburken.net/~patrik/CiaAccessTests.lha

Runs on kickstart 1.2+.

Results on A3000 (68030):

Code:

10.Ram Disk:CiaAccessTests> CiaAccessTests 
ciaa.talo(BFE401) reads:
0:  89
1:  88
2:  87
3:  86
ciab.talo(BFD400) reads:
0: 161
1: 160
2: 159
3: 158
ciaa.talo(BFE401) reads interleaved with ciaa.ddrb(BFE301) writes:
0: 124
1: 122
2: 120
3: 118
ciab.talo(BFD400) reads interleaved with ciaa.ddrb(BFE301) writes:
0:  91
1:  89
2:  87
3:  85

One E-clock cycle between each read and two cycles between each read/write pair, so one cycle each for a read and a write on the A3000.

Results on A1200 (68060):

Code:

10.Ram Disk:CiaAccessTests> CiaAccessTests 
ciaa.talo(BFE401) reads:
0: 111
1: 109
2: 107
3: 105
ciab.talo(BFD400) reads:
0:   6
1:   4
2:   2
3:   0
ciaa.talo(BFE401) reads interleaved with ciaa.ddrb(BFE301) writes:
0: 233
1: 229
2: 225
3: 221
ciab.talo(BFD400) reads interleaved with ciaa.ddrb(BFE301) writes:
0: 199
1: 195
2: 191
3: 187

Two E-clock cycles between each read and four cycles between each read/write pair, so two cycles each for a read and a write on the A1200.
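The arithmetic behind these two conclusions can be sketched as follows (Python, illustrative only). The sample values are the ciaa.talo readings from the two outputs above; the function names are made up, and the write cost is derived on the assumption that one read/write pair costs exactly one read plus one write.

```python
def step(samples):
    """Return the constant countdown step between consecutive 8-bit readings."""
    deltas = {(prev - cur) % 256 for prev, cur in zip(samples, samples[1:])}
    if len(deltas) != 1:
        raise ValueError("readings do not have a constant step")
    return deltas.pop()


# ciaa.talo readings copied from the outputs above.
RESULTS = {
    "A3000 (68030)": {"reads": [89, 88, 87, 86], "interleaved": [124, 122, 120, 118]},
    "A1200 (68060)": {"reads": [111, 109, 107, 105], "interleaved": [233, 229, 225, 221]},
}


def access_costs(machine):
    """Return (read_cost, write_cost) in E-clock cycles for one machine."""
    read_cost = step(RESULTS[machine]["reads"])
    pair_cost = step(RESULTS[machine]["interleaved"])
    # Assumption: one read/write pair costs one read plus one write.
    return read_cost, pair_cost - read_cost


for name in RESULTS:
    print(name, access_costs(name))
```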

This significant difference perhaps is common knowledge, but it was unknown to me. Does anyone know any details about the reason for it?

kamelito · 26 December 2021, 16:39

I think I read in the 1993 DevCon that the CIAs differ. They said to use the OS, but also gave technical info about the differences.

SpeedGeek · 26 December 2021, 17:28

Thanks for making this benchmark tool available!

Unfortunately, I now have my GVP G-Force 030 installed in my A2000 (awaiting support for another long-delayed project). So I can't provide any immediate (E clock speedup mod) benchmark results:

https://eab.abime.net/showthread.php...92#post1523692

Niklas · 26 December 2021, 17:38

I believe the problem is not with the speed of CIA itself, but the speed of the processor that accesses CIA.

You should consider how the 68000 processor accesses the 6800 bus to which CIA is attached (the E, VPA, VMA signals that are described on page 185 of this document https://www.nxp.com/docs/en/referenc.../MC68000UM.pdf).

The E signal is low for six 7 MHz clock cycles, and then high for four 7 MHz clock cycles.

The access of CIA is done during the four cycles when E is high. If the E signal is high when an access to CIA starts then the processor will wait until the E signal is low again, and make the access during the next window when E goes high.

During the six cycles that E is low, the CPU needs to fetch and execute any instructions that are before the next instruction that is accessing CIA, and then fetch the instruction that accesses CIA.

For a normal 68000 processor, fetching one instruction takes at least 4 cycles, so if any instruction comes between the two instructions that access CIA then the second CIA access will miss the "E high window", and the processor stalls until the next E high window, so now the speed of CIA accesses will effectively halve.

When using the 030 processor on the other hand, the processor is much faster and can manage to do one access in every E high window.
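The rule described above can be turned into a toy timing model (Python, illustrative only). This is a sketch of the rule as stated in this post, not a cycle-accurate model of any real CPU or chipset. `gap_clocks`, the number of 7 MHz clocks the CPU needs between finishing one CIA access and starting the next, is a made-up parameter, and the model assumes that an access ends on the falling edge of E.

```python
E_PERIOD = 10  # 7 MHz clocks per E cycle
E_LOW = 6      # E is low for the first six clocks of each period, then high for four


def access_end(t_start):
    """Return the clock at which an access started at t_start completes."""
    phase = t_start % E_PERIOD
    period_start = t_start - phase
    if phase < E_LOW:
        # E is still low: the access uses the high window of this period.
        return period_start + E_PERIOD
    # E is already high: wait for it to go low and use the next high window.
    return period_start + 2 * E_PERIOD


def e_cycles_per_access(gap_clocks, accesses=8):
    """Return the sorted set of E-cycle spacings seen between back-to-back accesses."""
    t = 0  # falling edge of E, the previous access has just ended
    ends = []
    for _ in range(accesses):
        t = access_end(t + gap_clocks)
        ends.append(t)
    return sorted({(b - a) // E_PERIOD for a, b in zip(ends, ends[1:])})


for gap in (4, 8):
    print(gap, e_cycles_per_access(gap))
```

In this model a gap of 4 clocks (one instruction fetch on a 68000) still catches every window, while a gap of 8 clocks (two fetches) misses every other one.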

The fastest way to communicate over the parallel port that I could come up with is as follows: https://github.com/niklasekstrom/ami...i_low.asm#L170. This results in 2E speed (1 byte communicated on every other E cycle) on an Amiga with an accelerator that allows the processor to get an access in on every opportunity.

(For more background on the speed of protocols that communicate over the parallel port, see this page: https://lallafa.de/blog/2015/09/amig...st-can-you-go/.)
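To put the "2E speed" figure into numbers, here is a small sketch (Python, illustrative only). The E-clock is taken as 1/10th of the 7M clock; the 7M clock values used here (about 7.09379 MHz for PAL and 7.15909 MHz for NTSC) are the commonly quoted figures and are not taken from this thread. The function names are made up.

```python
PAL_7M_HZ = 7_093_790   # approximate PAL 7M clock, assumed value
NTSC_7M_HZ = 7_159_090  # approximate NTSC 7M clock, assumed value


def e_clock_hz(clock_7m_hz):
    """The E-clock runs at one tenth of the 7M clock."""
    return clock_7m_hz / 10


def bytes_per_second(clock_7m_hz, e_cycles_per_byte):
    """Upper bound on throughput when one byte takes e_cycles_per_byte E cycles."""
    return e_clock_hz(clock_7m_hz) / e_cycles_per_byte


for name, clock in (("PAL", PAL_7M_HZ), ("NTSC", NTSC_7M_HZ)):
    print(name, round(e_clock_hz(clock)), round(bytes_per_second(clock, 2)))
```

On a machine that only gets an access in on every other E cycle, the same protocol would need 4 E cycles per byte, so the bound halves.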

patrik · 26 December 2021, 17:49

@Niklas:
I think you are right in that the answer lies not in the speed of CIA itself, but in how it is accessed.

I can buy that the stock 68000 potentially might be busy fetching and executing the next instruction, missing the next E high window, however the 68030 A2000, the 68020, 68030 and 68060 A1200s get the same 2-cycle result as the 68000 A500. Only the 68030 A3000 and 68060 A4000 manage 1-cycle access.

ross · 26 December 2021, 18:20

Check this thread https://eab.abime.net/showthread.php?t=107908 from message #8 onward.
There are a lot of similar arguments and test code.

Niklas · 26 December 2021, 18:58

Quote:

Originally Posted by patrik

I can buy that the stock 68000 potentially might be busy fetching and executing the next instruction, missing the next E high window, however the 68030 A2000, the 68020, 68030 and 68060 A1200s get the same 2-cycle result as the 68000 A500. Only the 68030 A3000 and 68060 A4000 manage 1-cycle access.

I have observed 1E-cycle CIA accesses using a 50 MHz HC508 accelerator in an A500, so that is definitely possible.

Niklas · 27 December 2021, 00:06

More information that may be relevant to this thread... In Amigas whose processor doesn't have the 6800 interface logic, that logic is instead in Gayle. On page 6 of https://www.amigawiki.org/lib/exe/fe...cification.pdf it says: "_AS must be asserted by 3 clocks before E CLK goes high, or you wait until next time around". I'm not sure if the logic in 68000 works the same but I would guess so. Minimig seems to have the same idea: https://github.com/MiSTer-devel/Mini...8k_bridge.v#L3.
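The quoted Gayle rule can be sketched like this (Python, illustrative only). It assumes E rises at clock 6 of every 10-clock period, and it reads "by 3 clocks before" as "at least 3 clocks before, inclusive"; both the function name and that inclusive reading are assumptions, not something the document spells out.

```python
E_PERIOD = 10  # 7 MHz clocks per E cycle
E_RISE = 6     # E goes high at clock 6 of each period (low for six clocks first)


def rising_edge_used(t_as, setup=3):
    """Return the clock of the first E rising edge at least `setup` clocks after t_as."""
    earliest = t_as + setup
    # Ceiling division to find the first period whose rising edge is late enough.
    k = -(-(earliest - E_RISE) // E_PERIOD)
    return E_RISE + k * E_PERIOD


for t_as in (0, 3, 4):
    print(t_as, rising_edge_used(t_as))
```

With these assumptions, asserting _AS at clock 3 still catches the edge at clock 6, while asserting it one clock later means waiting for the edge at clock 16.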

patrik · 27 December 2021, 03:10

@Kamelito:
Made an attempt at finding it, but the closest things I could find were the 88 and 89 8520_Timing documents, and I don't think those are the right ones?

@Speedgeek:
Cheers, please do a run on the A2630 when you install it next time.

@ross:
Very interesting thread, thank you very much! Saved logs from a couple of machines from your and simion's tests:
http://megaburken.net/~patrik/ciatest/cia-speed_b/ http://megaburken.net/~patrik/ciatest/CIA_tests/

@Niklas:
In the A3000 and A4000, Fat Gary generates the E-clock and chip-selects for the 8520's. http://megaburken.net/~patrik/A3000/...cification.pdf (downloaded from http://amiga.serveftp.net/datasheets.html to not put unnecessary strain on his connection) mentions some interesting details:
"The ECLK signal is generated in GARY. It is a free running clock whose frequency is 1/10th of the 7M clock. Normally ECLK is low for six 7M clocks, and high for four 7M clocks. However, when the CIAs are accessed, the ECLK high time may be shorter than four 7M clocks. During writes to the CIAs, ECLK is high for only two 7M clocks. During reads ECLK stays high for a minimum of two 7M clocks, and a maximum of four 7M clocks. The frequency of ECLK does not change. If the ECLK high time is shortened during CIA access, the difference is made up by increasing the subsequent ECLK low time. Consequently, it is always ten 7M clocks from one rising edge of ECLK to the next."

However, there is no explanation of why it does this, but I assume it has something to do with why the A3000 and A4000 get consistent 1-cycle 8520 accesses. On that note, the 50MHz A500 accelerator you had seen 1E accesses on perhaps does some similar trick when generating the E-clock.
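The bookkeeping in the quoted Fat Gary text can be written down as a small sketch (Python, illustrative only). It only restates the quote: the high time is between two and four 7M clocks, and the low time stretches so that the period stays at ten. The function name is made up, and nothing here explains why the chip does this.

```python
E_PERIOD_7M = 10  # 7M clocks from one rising edge of ECLK to the next


def eclk_low_time(high_clocks):
    """Return the ECLK low time that keeps the period at ten 7M clocks."""
    if not 2 <= high_clocks <= 4:
        raise ValueError("the quoted text only allows a high time of 2 to 4 clocks")
    return E_PERIOD_7M - high_clocks


print("normal:", eclk_low_time(4))          # four clocks high
print("CIA write:", eclk_low_time(2))       # two clocks high
print("shortest read:", eclk_low_time(2))   # reads: two to four clocks high
```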

Niklas · 27 December 2021, 11:13

Quote:

Originally Posted by patrik

In the A3000 and A4000, Fat Gary generates the E-clock and chip-selects for the 8520's. (...) mentions some interesting details: (...)

That is interesting indeed.

Quote:

Originally Posted by patrik

However, there is no explanation of why it does this, but I assume it has something to do with why the A3000 and A4000 get consistent 1-cycle 8520 accesses.

I guess so too, it seems like they came up with an optimization of sorts.

Quote:

Originally Posted by patrik

On that note, the 50MHz A500 accelerator you had seen 1E accesses on perhaps does some similar trick when generating the E-clock.

Quite possibly. I don't have the accelerator here, but it would be interesting to see if that is the case.

One thing I'm curious about is that the Fat Gary document says "During reads ECLK stays high for a minimum of two 7M clocks, and a maximum of four 7M clocks.". I wonder under what conditions reads become two or four clocks long.

patrik · 27 December 2021, 12:25

Quote:

Originally Posted by Niklas

One thing I'm curious about is that the Fat Gary document says "During reads ECLK stays high for a minimum of two 7M clocks, and a maximum of four 7M clocks.". I wonder under what conditions reads become two or four clocks long.

Could this be a way to adaptively synchronize with the local 25MHz 030 bus, which runs asynchronously to the 7M clock (derived from the 28.x MHz "chipset clock")?

r.cade · 27 December 2021, 22:11

I always understood the A3000 and A4000 were the only fully 32-bit "path to chipset" machines, so this seems right.

The others are all only 16-bit, or is there another reason?

Jope · 28 December 2021, 11:19

Quote:

Originally Posted by r.cade

I always understood the A3000 and A4000 were the only fully 32-bit "path to chipset" machines, so this seems right.

The others are all only 16-bit, or is there another reason?

You're probably thinking about chip ram data bus width? The A3000 is the only ECS machine that has 32 bits there. All AGA machines have 32 bits to chip ram.

As for CIA access, the CIAs have 8-bit-wide registers, which are wired to the data bus so that together they take up 16 bits of it in parallel, so the bus width of the CPU doesn't make much difference.
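The register addresses that appear in the test output earlier in the thread fit this picture, as the following sketch shows (Python, illustrative only). The base addresses, the $100 register spacing and the register order follow the usual Amiga hardware documentation rather than anything stated in this thread, and the function names are made up.

```python
CIAA_BASE = 0xBFE001     # CIA-A registers sit on odd byte addresses
CIAB_BASE = 0xBFD000     # CIA-B registers sit on even byte addresses
REGISTER_STRIDE = 0x100  # consecutive registers are $100 apart

REGISTER_INDEX = {
    "pra": 0, "prb": 1, "ddra": 2, "ddrb": 3,
    "talo": 4, "tahi": 5, "tblo": 6, "tbhi": 7,
}


def cia_register_address(chip, register):
    """Return the byte address of a CIA register, e.g. ("ciaa", "talo")."""
    base = {"ciaa": CIAA_BASE, "ciab": CIAB_BASE}[chip]
    return base + REGISTER_INDEX[register] * REGISTER_STRIDE


def data_bus_lane(address):
    """On the big-endian 68000 bus, even bytes use D15-D8 and odd bytes D7-D0."""
    return "D7-D0" if address & 1 else "D15-D8"


for chip, reg in (("ciaa", "talo"), ("ciab", "talo"), ("ciaa", "ddrb")):
    addr = cia_register_address(chip, reg)
    print(f"{chip}.{reg} = {addr:06X} on {data_bus_lane(addr)}")
```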

Promilus · 28 December 2021, 11:45

Quote:

You're probably thinking about chip ram data bus width? The A3000 is the only ECS machine that has 32 bits there

That's not entirely true. Clever buffering allows the CPU to access CHIPRAM on the A3000 through a full-width 32-bit data interface, and CHIPRAM itself is organized as 32-bit, but the chipset can only access 16 bits at a time, so latching and proper control signals are used whenever Agnus accesses chipram.

Jope · 29 December 2021, 08:33

I was specifically thinking about cpu access to chip ram.
