Some 020 chipram access details.

Rst7 · 27 December 2022, 13:34

Hello.

While optimizing code for a stock A1200 and comparing its behavior on real hardware and in WinUAE, some details came up that may help clarify 68020 emulation in cycle-exact mode.

They concern the timing of chipram access cycles.

For simplicity, we will assume that all code is executed from the cache and that DMA does not interfere.

Let's look at the 68020 datasheet - https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf

According to the datasheet, the cycles of move.l (a0),d0 look like this:

Code:

68020 clock:
01234567890123456789....
ARRRMM

6 cycles total.
A - calculate effective address, 1 cycle (2 cycles according to 8.2.3, but one cycle is shared with the first cycle of the ram read, see Figure 8-5 for example)
RRR - read from ram. 3 cycles
MM - perform move, 2 cycles

Now, how does this fit into chipram cycles (checked with a logic analyzer on real hw):

There are 4 possible cases:

Code:

68020 clock:
01234567890123456789....
Color clock (chipram slots):
[00][01][02][03][04]....
ARRRRRRRMM
 ARRRRRRMM                            
  ARRRRRRRRRMM
   ARRRRRRRRMM

The R cycle is lengthened by additional waits.

Total execution time is 10/9 or 12/11 CPU clocks.
It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot. Note that this behavior applies to the write cycle too, but there the MM cycles overlap with the write, thanks to the pending write buffer.

Now let's look at the execution of two consecutive instructions with RAM reading:

Code:

move.l	(a0),d0		;M1
move.l	(a1),d1		;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRM1ARRRRRRRRRM2
  ARRRRRRRRRM1ARRRRRRRRRM2
   ARRRRRRRRM1ARRRRRRRRRM2

In WinUAE, these two instructions take 16 (2*8) cycles to execute, compared to 21+ on real hw. Moreover, the sequence

Code:

move.l	(a0),d0		;M1
move.l	d2,d3		;M2
move.l	d2,d3		;M3
move.l	(a1),d1		;M4

also takes 16 cycles in WinUAE, which does not match real hw. Expected behavior:

Code:

68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4

It looks as if WinUAE emulates memory reads in the same way as writes, with a pending buffer. This doesn't seem correct to me.

Perhaps this information will help improve the emulation. Sorry, but this is a stupid request about 020 cycle exact mode

Thanks

Rst7 · 27 December 2022, 18:51

Sorry, this is not correct:

It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot.

Correct version:
Assertion of the ~AS signal should not be later than 1/2 chipram cycle.

The ~AS signal is asserted on the second clock of the memory read/write cycle. Therefore, the correct diagrams look like this:

Code:

68020 clock:
01234567890123456789....
Color clock (chipram slots):
[00][01][02][03][04]....
ARRRRRRRMM
 ARRRRRRRRRRMM                            
  ARRRRRRRRRMM
   ARRRRRRRRMM

10 clocks min, 13 clocks max.

Examples from logic analyzer:

7.PNG: ~AS asserted time = 6 cycles, total chipram read 7 cycles
10_and_8.PNG: ~AS asserted time = 9/7 cycles, total chipram read 10/8 cycles
9.PNG: ~AS asserted time = 8 cycles, total chipram read 9 cycles

For move (a0),d0 add one cycle for address calculation and 2 cycles for performing the move. So the total execution time is 10...13 cycles.
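
The corrected rule can be written down as a small model. The Python sketch below is only one reading of the diagrams and not a verified description of the hardware. It assumes 4 CPU clocks per chipram slot (as drawn in the text diagrams), that ~AS is asserted on the second clock of the read, and that a read whose ~AS falls on clock 0..2 of a slot finishes at the end of the following slot, otherwise one slot later. The names `CLOCKS_PER_SLOT`, `read_end_clock` and `move_read_total` are made up for illustration.

```python
CLOCKS_PER_SLOT = 4  # assumption: one chipram slot spans 4 CPU clocks, as in the diagrams


def read_end_clock(start_clock):
    """Return the CPU clock at which the chipram read has finished.

    start_clock is the clock of the 'A' cycle in the text diagrams.
    """
    read_start = start_clock + 1   # first R clock follows the A clock
    as_assert = read_start + 1     # ~AS is asserted on the second R clock
    slot, phase = divmod(as_assert, CLOCKS_PER_SLOT)
    # Assumed rule: ~AS no later than half-way through the slot means the read
    # completes at the end of the next slot, otherwise one slot later.
    slots_needed = 2 if phase <= CLOCKS_PER_SLOT // 2 else 3
    return (slot + slots_needed) * CLOCKS_PER_SLOT


def move_read_total(start_clock):
    """Total clocks for move.l (a0),d0: A + stretched read + 2 perform clocks."""
    read_start = start_clock + 1
    read_len = read_end_clock(start_clock) - read_start
    return 1 + read_len + 2


for start in range(CLOCKS_PER_SLOT):
    print(start, move_read_total(start))
```

For start clocks 0 to 3 this gives 10, 13, 12 and 11 clocks, the same as the four diagram rows. The time ~AS stays asserted in this model (read end minus assertion clock) comes out as 6, 9, 8 and 7, in line with the logic analyzer captures listed above.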

All other cases should be corrected similarly.

I apologize for the incorrect information in the start post.

Rst7 · 27 December 2022, 19:10

Updated "text" diagrams:

Code:

move.l	(a0),d0		;M1
move.l	(a1),d1		;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRRRRRRM1ARRRRRRRRM2
  ARRRRRRRRRRM1ARRRRRRRRM2
   ARRRRRRRRRM1ARRRRRRRRM2

22-25-24-23 clocks

Code:

move.l	(a0),d0		;M1
move.l	d2,d3		;M2
move.l	d2,d3		;M3
move.l	(a1),d1		;M4

68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4

26-29-28-27 clocks
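
These totals can be cross-checked by summing per-instruction costs. A minimal Python sketch, assuming the cached-execution model used in this thread: 1 clock of address calculation, the stretched read, then 2 clocks to perform, with a register-to-register move costing only the 2 perform clocks. The read lengths are simply counted from the R runs in the diagram above, and the helper names are hypothetical.

```python
PERFORM = 2  # clocks to perform a move, as used in the diagrams above


def mem_read_move(read_len):
    """move.l (an),dn from cache: 1 clock EA + read (with waits) + perform."""
    return 1 + read_len + PERFORM


def reg_move():
    """move.l dn,dm from cache: perform only, no bus access."""
    return PERFORM


def sequence_total(first_read, second_read, reg_moves=2):
    """Clocks for: memory read move, reg_moves register moves, memory read move."""
    return (mem_read_move(first_read)
            + reg_moves * reg_move()
            + mem_read_move(second_read))


# First read length per diagram row (7, 10, 9, 8); the second read is 9 in every row.
totals = [sequence_total(r, 9) for r in (7, 10, 9, 8)]
print(totals)  # [26, 29, 28, 27]
```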

Toni Wilen · 28 December 2022, 19:46

The problem is not chip ram accesses (which are more or less like 68000 + write buffering by Budgie). The missing pieces are the order of accesses, when the 68020 does instruction prefetch (with or without caches), and other internal cycles. The 68020/030 documentation wants to hide the internal details.

The main question: how does everything work in general, not just when instructions are cached but under all possible conditions.

Another question (which I haven't tested yet): do all instructions always have the same memory access order, i.e. do the cache and the previous/next instruction not affect it (prefetches, memory writes, memory reads)? AFAIK the 68020 and 68030 are still fully microcoded CPUs, so it could be true, or there might be multiple paths depending on something internal (but on the other hand, that would probably make the micro rom too large). Emulation gets practically impossible if the order of accesses is not static.

Currently 68020 "cycle exact" plays it safe: it is too fast on purpose, because a CPU that is even a single cycle too slow can break more programs than a CPU that is too fast. Also, "overlapping" cycles are most likely not fully handled (which can also explain why it would be too slow without extra hacks).

EDIT: another problem is programs that don't have caches enabled at all. There, even a single instruction that is too slow can mess up timing (missed frames). Just getting the cache case working is useless if the non-cache case is wrong.

Note that I haven't examined any 68020-based hardware with a logic analyzer. 68000 first (which is more or less finally done).

Rst7 · 28 December 2022, 22:11

Quote:

Originally Posted by Toni Wilen

+ write buffering by Budgie

There are no write buffers between the CPU and chipram. On the waveforms from the logic analyzer, you can see that the ~RAS, ~CAS and ~WE signals are negated before the ~AS signal is negated. Therefore, the ~DSACK signal is asserted after the write to memory has ended.

More precisely, there are no external buffers that would allow the CPU to finish the write cycle before the actual write to RAM is completed.

But the CPU has its own buffer inside, and while the bus is busy with a write cycle, instructions from the cache can be executed. For an example, see Figure 8-6 in the datasheet.

Waccoon · 30 December 2022, 03:15

Are you sure about that? I thought the whole point of the Bridgette chip on the A4000 was to latch writes to chip ram, according to the official datasheet. Budgie is pretty similar to Bridgette but with some clock stuff and glue logic added.

Rst7 · 30 December 2022, 10:57

Quote:

Originally Posted by Waccoon

Are you sure about that?

Yes. For example, the attached file is a diagram of two write cycles (~WE active), the first with an ~AS length of 8, the second with 7. Each cycle ends with ~AS negated after ~RAS.

Cyprian · 09 February 2023, 21:19

Quote:

Originally Posted by Rst7

Updated "text" diagrams:

Code:

move.l    (a0),d0        ;M1
move.l    (a1),d1        ;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRRRRRRM1ARRRRRRRRM2
  ARRRRRRRRRRM1ARRRRRRRRM2
   ARRRRRRRRRM1ARRRRRRRRM2

22-25-24-23 clocks

Code:

move.l    (a0),d0        ;M1
move.l    d2,d3        ;M2
move.l    d2,d3        ;M3
move.l    (a1),d1        ;M4

68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4

26-29-28-27 clocks

Does M1/2/3/4 mean instruction prefetch?

Rst7 · 09 February 2023, 21:28

Quote:

Originally Posted by Cyprian

Does M1/2/3/4 mean instruction prefetch?

No. M1...M4 stand for performing the move instruction. You can see it in https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf from Figure 8-3 (page 8-5) onward. The perform stage lasts 2 clocks.

I am considering the case where all instructions are executed from the cache.

Cyprian · 09 February 2023, 21:44

ok, thanks for the clarification

Toni Wilen · 10 February 2023, 20:51

I removed the chipset "write buffer" part and my test statefiles didn't seem to break.

The main reason I probably appear very uninterested is that I don't want to touch the 68020 CE stuff unless I really have time to rewrite it again and again, because something always breaks. At this point it is better to have it working than accurate. Maybe after 5.0 I'll check it again, and only if there really is enough new information.

Pipeline operation, timing, instructions with multiple data reads and writes, multiple prefetches and more: all of this combined makes the documentation useless. (The 68030 timing documentation is slightly more useful because it tries to hide less information than the 68020 one.)

Btw, an interesting detail of the pipeline: if a partially prefetched instruction is any kind of unconditional branch/jump (including RTS etc.), the CPU stops prefetching (except if the jump is a single-word instruction, in which case the following word is still prefetched) until the full pipeline fill from the new PC.

This quite simple feature was also annoying to emulate.

Rst7 · 10 February 2023, 22:11

From my point of view, if we only consider code executed from the cache, the behavior of simple instructions on the 68020 is quite clear:

1. The source address is calculated.
2. The read cycle starts on the last cycle of the source address calculation. This last cycle is the first clock of the read cycle, during which ~AS is not yet asserted.
Apparently, the address has already been calculated by this time (i.e. the real calculation takes one cycle less than the specified "Fetch Effective Address").
3. While the read cycle is in progress, the calculation of the destination address starts, if necessary.
4. After the end of the read cycle, execution starts (for example, add takes 2 cycles).
5. One cycle before the end of execution, the write cycle starts.

The move instruction seems to have some kind of fast path within the 68020, because its data is ready one cycle early. For example, move (a0),(a0) takes 7 cycles while add d0,(a0) takes 4+4=8 cycles. Although both instructions perform the same number of memory access cycles, the second one has an idle clock in bus activity between the read and the write.

Code:

move (a0),(a0):

1234567
AS.....
.RRR...
..AD...
....PP.
....WWW
......next instruction

AS - calculate source EA
RRR - read memory
AD - calculate destination EA
PP - perform
WWW - write memory


add d0,(a0):

12345678
AS......
.RRR....
.....WWW
....PP..
......next instruction


move d0,(a0):

1234
AD..
.WWW
PP..
..next instruction


move (a0),d0:

123456
AD....
.RRR..
....PP
......next instruction


lsl #1,(a0):

123456789
AD.......
.RRR.....
....PPP..
......WWW
.......next instruction

And most importantly, what I wanted to convey in the first post:

The PP stage cannot start executing before the RRR stage has completed (including all wait states), but WinUAE appears to behave differently.
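
The five rules and the diagrams above can be condensed into a rough timing function. This is a Python sketch under the same assumption as the rest of the thread (all code runs from the cache), and it only covers the simple cases drawn above. The function name `simple_timing` and its parameters are hypothetical, and the one-clock shortcut for move is modelled as a plain flag because its real cause is unknown.

```python
def simple_timing(read=0, perform=2, write=0, ea=2, move_fast_path=False):
    """Return total clocks for a simple instruction executed from cache.

    read / write are bus cycle lengths in clocks (0 = no such access,
    3 = no wait states). Clocks are numbered from 1 as in the diagrams.
    """
    if read:
        read_start = ea                   # read begins on the last EA clock
        read_end = read_start + read - 1
        perform_start = read_end + 1      # PP waits for the whole read, waits included
    else:
        perform_start = 1                 # no read to wait for
    perform_end = perform_start + perform - 1
    if not write:
        return perform_end
    if read:
        # write begins on the last perform clock; move starts it one clock earlier
        write_start = perform_end - (1 if move_fast_path else 0)
    else:
        write_start = ea                  # write-only: bus cycle begins on the last EA clock
    write_end = write_start + write - 1
    return max(perform_end, write_end)


print(simple_timing(read=3, write=3, move_fast_path=True))  # move (a0),(a0)
print(simple_timing(read=3, write=3))                       # add d0,(a0)
print(simple_timing(write=3))                               # move d0,(a0)
print(simple_timing(read=3))                                # move (a0),d0
print(simple_timing(read=3, perform=3, write=3))            # lsl #1,(a0)
```

With a 3-clock read this reproduces the 7, 8, 4, 6 and 9 clocks of the diagrams. Stretching the read to 7...10 clocks for chipram pushes move (a0),d0 to 10...13 clocks, because the perform stage is scheduled only after the read has ended.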



