English Amiga Board


Old 27 December 2022, 13:34   #1
Rst7
Registered User
 
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
Some 020 chipram access details.

Hello.

When optimizing the code for A1200 (stock configuration) and comparing the behavior on real hardware and WinUAE, some details came up that may help clarify 68020 emulation in cycle exact mode.

These details concern the timing of chipram access cycles.

For simplicity, we will assume that all code is executed from the cache and that DMA does not interfere.


Let's look at the 68020 datasheet - https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf

According to the datasheet, the cycles for move.l (a0),d0 look like this:
Code:
68020 clock:
01234567890123456789....
ARRRMM
6 cycles total.
A - calculate effective address, 1 cycle (2 cycles according to section 8.2.3, but one cycle is shared with the first cycle of the RAM read; see Figure 8-5 for an example)
RRR - read from RAM, 3 cycles
MM - perform the move, 2 cycles


Now, here is how this fits into chipram cycles (checked with a logic analyzer on real hardware):

There are 4 possible cases:
Code:
68020 clock:
01234567890123456789....
Color clock (chipram slots):
[00][01][02][03][04]....
ARRRRRRRMM
 ARRRRRRMM                            
  ARRRRRRRRRMM
   ARRRRRRRRMM
The R phase is lengthened by additional wait states.

Total execution time is 10/9 or 12/11 CPU clocks.
It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot. Note that this behavior applies to write cycles too, but there the MM cycles overlap with the write, thanks to the pending write buffer.


Now let's look at the execution of two consecutive instructions that read from RAM:
Code:
move.l	(a0),d0		;M1
move.l	(a1),d1		;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRM1ARRRRRRRRRM2
  ARRRRRRRRRM1ARRRRRRRRRM2
   ARRRRRRRRM1ARRRRRRRRRM2
In WinUAE, these two instructions take 16 (2*8) cycles to execute, compared to 21+ on real hardware. Moreover, the sequence
Code:
move.l	(a0),d0		;M1
move.l	d2,d3		;M2
move.l	d2,d3		;M3
move.l	(a1),d1		;M4
also takes 16 cycles in WinUAE, which does not match real hardware. Expected behavior:
Code:
68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4
It looks as if WinUAE emulates memory reads the same way as writes, with a pending buffer. This doesn't seem correct to me.

Perhaps this information will help improve the emulation. Sorry if this is a silly request about 020 cycle-exact mode.

Thanks
Old 27 December 2022, 18:51   #2
Rst7
Sorry, this is not correct:

It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot.

Correct version:
The ~AS signal must be asserted no later than halfway through a chipram cycle.


The ~AS signal is asserted on the second clock of the memory read/write cycle. Therefore, the correct diagrams look like this:

Code:
68020 clock:
01234567890123456789....
Color clock (chipram slots):
[00][01][02][03][04]....
ARRRRRRRMM
 ARRRRRRRRRRMM                            
  ARRRRRRRRRMM
   ARRRRRRRRMM
10 clocks min, 13 clocks max.

Examples from logic analyzer:

7.PNG: ~AS asserted time = 6 cycles, total chipram read 7 cycles
10_and_8.PNG: ~AS asserted time = 9/7 cycles, total chipram read 10/8 cycles
9.PNG: ~AS asserted time = 8 cycles, total chipram read 9 cycles

For move (a0),d0, add one cycle for the address calculation and 2 cycles for performing the move. So the total execution time is 10...13 cycles.

All other cases should be corrected similarly.
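
As a rough cross-check of these numbers, here is a small Python sketch that turns the measured ~AS times into read lengths and instruction totals. The helper names are made up for this example and have nothing to do with WinUAE's code. It only assumes what is stated above: ~AS goes active one clock after the read starts, and move (a0),d0 adds 1 clock for the address calculation and 2 clocks for the perform stage.

```python
# Sketch only: helper names are hypothetical; the constants come from the post.
EA_CLOCKS = 1       # address calculation clock that is not shared with the read
PERFORM_CLOCKS = 2  # "MM" stage of move (a0),d0


def read_clocks_from_as(as_asserted_clocks):
    """Total chipram read length, given how long ~AS was asserted."""
    # ~AS is asserted on the second clock of the read cycle.
    return as_asserted_clocks + 1


def move_read_total(read_clocks):
    """Total clocks for move (a0),d0 executed from cache."""
    return EA_CLOCKS + read_clocks + PERFORM_CLOCKS


if __name__ == "__main__":
    # ~AS asserted times taken from 7.PNG, 10_and_8.PNG and 9.PNG
    for as_clocks in (6, 9, 7, 8):
        read = read_clocks_from_as(as_clocks)
        total = move_read_total(read)
        print(f"~AS {as_clocks} -> read {read} -> move (a0),d0 {total}")
```

The four measured cases give totals of 10, 13, 11 and 12 clocks, which is the 10...13 range quoted above.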

I apologize for the incorrect information in the start post.
Attached Thumbnails: 7.PNG, 10_and_8.PNG, 9.PNG
Old 27 December 2022, 19:10   #3
Rst7
Updated "text" diagrams:

Code:
move.l	(a0),d0		;M1
move.l	(a1),d1		;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRRRRRRM1ARRRRRRRRM2
  ARRRRRRRRRRM1ARRRRRRRRM2
   ARRRRRRRRRM1ARRRRRRRRM2

22-25-24-23 clocks
Code:
move.l	(a0),d0		;M1
move.l	d2,d3		;M2
move.l	d2,d3		;M3
move.l	(a1),d1		;M4

68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4

26-29-28-27 clocks
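
The totals above can be reproduced with a very small model. This is only a sketch under assumptions that go beyond the measurements: one color clock is taken as 4 CPU clocks (as drawn in the diagrams), the read length is assumed to depend only on the phase of the A cycle within the chipram slot (7, 10, 9 and 8 clocks for phases 0 to 3, from the corrected single-instruction diagrams), a register-to-register move is charged 2 clocks, and everything runs from cache with no DMA contention. All names are hypothetical.

```python
# Sketch only: a phase-based model of the diagrams, not WinUAE code.
CLOCKS_PER_SLOT = 4                                # CPU clocks per color clock
READ_CLOCKS_BY_PHASE = {0: 7, 1: 10, 2: 9, 3: 8}   # assumed, from the diagrams


def total_clocks(instructions, start_phase):
    """Sum the clocks of a cached instruction sequence.

    instructions: iterable of "read" (move.l (an),dn) or "reg" (move.l dn,dm)
    start_phase:  CPU clock offset of the first A cycle inside a chipram slot
    """
    clock = start_phase
    for kind in instructions:
        if kind == "read":
            phase = clock % CLOCKS_PER_SLOT
            # 1 clock address calculation + stretched read + 2 clocks perform
            clock += 1 + READ_CLOCKS_BY_PHASE[phase] + 2
        elif kind == "reg":
            clock += 2                             # perform stage only
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return clock - start_phase


if __name__ == "__main__":
    two_reads = ["read", "read"]
    with_reg_moves = ["read", "reg", "reg", "read"]
    print([total_clocks(two_reads, p) for p in range(4)])
    print([total_clocks(with_reg_moves, p) for p in range(4)])
```

With these assumptions the two sequences come out as 22, 25, 24, 23 and 26, 29, 28, 27 clocks for the four starting phases. The model only matches the totals; it says nothing about how the waits are split between the two reads.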
Old 28 December 2022, 19:46   #4
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
The problem is not the chip RAM accesses (which are more or less like the 68000, plus write buffering by Budgie). The missing pieces are the order of accesses, when the 68020 does instruction prefetch (with or without caches), and other internal cycles. The 68020/030 documentation tends to hide the internal details.

The main question is how everything works in general, not just when instructions are cached but in all possible conditions.

Another question (that I haven't tested yet): do all instructions always have the same memory access order, i.e. is it unaffected by the cache or by the previous/next instruction (prefetches, memory writes, memory reads)? AFAIK the 68020 and 68030 are still fully microcoded CPUs, so it could be true, or there might be multiple paths depending on some internal state (but on the other hand, that would probably make the micro ROM too large). It gets practically impossible if the order of accesses is not static.

Currently 68020 "cycle exact" plays it safe: it is too fast, because a CPU that is even a single cycle too slow can break more programs than one that is too fast. Also, "overlapping" cycles are most likely not fully handled (which may also explain why it would be too slow without extra hacks).

EDIT: another problem is programs that don't have caches enabled at all; there, even a single instruction that is too slow can mess up timing (missed frames). Getting only the cached case working is useless if the non-cached case is wrong.

Note that I haven't examined any 68020-based hardware with a logic analyzer. 68000 first (which is more or less finally done).
Old 28 December 2022, 22:11   #5
Rst7
Quote:
Originally Posted by Toni Wilen View Post
+ write buffering by Budgie
There are no write buffers between the CPU and chipram. On the waveforms from the logic analyzer, you can see that the ~RAS, ~CAS and ~WE signals are deasserted before the ~AS signal is deasserted. Therefore, the ~DSACK signal is asserted only after the write to memory has finished.

More precisely, there are no external buffers that would allow the CPU to finish the write cycle before the actual write to RAM is completed.

But the CPU has its own internal buffer, and while the bus is busy with a write cycle, instructions from the cache can be executed. For example, look at Figure 8-6 in the datasheet.
Old 30 December 2022, 03:15   #6
Waccoon
Registered User
 
Join Date: May 2022
Location: Boston / USA
Age: 46
Posts: 38
Are you sure about that? I thought the whole point of the Bridgette chip on the A4000 was to latch writes to chip ram, according to the official datasheet. Budgie is pretty similar to Bridgette but with some clock stuff and glue logic added.
Old 30 December 2022, 10:57   #7
Rst7
Quote:
Originally Posted by Waccoon View Post
Are you sure about that?
Yes. For example, the attached file is a diagram of two write cycles (~WE active): the first with an ~AS length of 8, the second with 7. Each cycle ends with ~AS being deasserted after ~RAS.
Attached Thumbnails: wr_cycles.PNG
Old 09 February 2023, 21:19   #8
Cyprian
Registered User
 
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 171
Quote:
Originally Posted by Rst7 View Post
Updated "text" diagrams:

Code:
move.l    (a0),d0        ;M1
move.l    (a1),d1        ;M2

68020 clock:
01234567890123456789012345678....
Color clock (chipram slots):
[00][01][02][03][04][05][06]
ARRRRRRRM1ARRRRRRRRRM2
 ARRRRRRRRRRRM1ARRRRRRRRM2
  ARRRRRRRRRRM1ARRRRRRRRM2
   ARRRRRRRRRM1ARRRRRRRRM2

22-25-24-23 clocks
Code:
move.l    (a0),d0        ;M1
move.l    d2,d3        ;M2
move.l    d2,d3        ;M3
move.l    (a1),d1        ;M4

68020 clock:
01234567890123456789012345678901....
Color clock (chipram slots):
[00][01][02][03][04][05][06][07]
ARRRRRRRM1M2M3ARRRRRRRRRM4
 ARRRRRRRRRRM1M2M3ARRRRRRRRRM4
  ARRRRRRRRRM1M2M3ARRRRRRRRRM4
   ARRRRRRRRM1M2M3ARRRRRRRRRM4

26-29-28-27 clocks

Do M1/2/3/4 mean instruction prefetch?
Old 09 February 2023, 21:28   #9
Rst7
Quote:
Originally Posted by Cyprian View Post
Do M1/2/3/4 mean instruction prefetch?
No. M1...M4 denote the "perform" stage of each move instruction. You can see it in https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf from Figure 8-3 (page 8-5) onward. The perform stage lasts 2 clocks.

I am considering the case where all instructions are executed from the cache.
Old 09 February 2023, 21:44   #10
Cyprian
ok, thanks for the clarification
Old 10 February 2023, 20:51   #11
Toni Wilen
I removed the chipset "write buffer" part and my test statefiles didn't seem to break.

The main reason I probably appear very uninterested is that I don't want to touch the 68020 CE code unless I really have time to rewrite it again and again, because something always breaks. At this point it is better to have it working than accurate. Maybe after 5.0 I'll look at it again, and only if there really is enough new information.

Pipeline operation, timing, instructions with multiple data reads and writes, multiple prefetches, and more: all of this combined makes the documentation useless. (The 68030 timing documentation is slightly more useful because it tries to hide less information than the 68020's.)

By the way, an interesting detail of the pipeline is that if a partially prefetched instruction is any kind of unconditional branch/jump (including RTS etc.), the CPU stops prefetching until the full pipeline fill from the new PC (except if the jump is a single-word instruction, in which case the following word is still prefetched).

This quite simple feature was also annoying to emulate.
Old 10 February 2023, 22:11   #12
Rst7
From my point of view, if we consider the execution of code from the cache, then the behavior of simple instructions on the 68020 is quite clear:

1. The source address is calculated.
2. The read cycle starts on the last cycle of the source address calculation. This last cycle corresponds to the part of the read cycle during which ~AS is not yet asserted.
Apparently, the address has already been calculated by this time (i.e. the real calculation takes one cycle less than the specified "Fetch Effective Address" time).
3. While the read cycle is in progress, the calculation of the destination address is started, if necessary.
4. After the end of the read cycle, execution starts (for example, add will take 2 cycles).
5. One cycle before the end of the execution, the write cycle starts.

The move instruction seems to have some kind of fast path within the 68020, because its data is ready one cycle early. For example, move (a0),(a0) takes 7 cycles, while add d0,(a0) takes 4+4=8 cycles. Although both of these instructions are equivalent in terms of the number of memory access cycles, the second instruction has an idle clock on the bus between the read and the write.

Code:
move (a0),(a0):

1234567
AS.....
.RRR...
..AD...
....PP.
....WWW
......next instruction

AS - calculate source EA
RRR - read memory
AD - calculate destination EA
PP - perform
WWW - write memory


add d0,(a0):

12345678
AS......
.RRR....
.....WWW
....PP..
......next instruction


move d0,(a0):

1234
AD..
.WWW
PP..
..next instruction


move (a0),d0:

123456
AD....
.RRR..
....PP
......next instruction


lsl #1,(a0):

123456789
AD.......
.RRR.....
....PPP..
......WWW
.......next instruction
And most importantly, what I wanted to convey in the first post:

The PP stage cannot start executing before the RRR stage has completed (with all wait states), but WinUAE appears to behave differently.
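
The rules and diagrams in this post can be condensed into a small scheduling sketch. This is only one reading of the diagrams, not a verified description of the 68020 or of WinUAE: the function and the fast_path flag (standing for the apparent move shortcut) are hypothetical, the destination address calculation is not modelled separately, and wait states are only applied to the read.

```python
# Sketch only: stage layout for simple cached instructions, clocks are 1-based.
def schedule(has_read, has_write, perform_clocks, fast_path=False, read_waits=0):
    """Return (stages, total); stages maps a stage name to (first, last) clock."""
    stages = {"EA": (1, 2)}      # effective address calculation, 2 clocks
    bus_free = 2                 # a bus cycle may start on the last EA clock
    perform_start = 1            # without a read, perform overlaps the EA
    if has_read:
        read_end = bus_free + 3 + read_waits - 1   # 3 clocks plus wait states
        stages["R"] = (bus_free, read_end)
        bus_free = read_end + 1
        perform_start = read_end + 1   # perform waits for the complete read
    perform_end = perform_start + perform_clocks - 1
    stages["P"] = (perform_start, perform_end)
    total = perform_end
    if has_write:
        # The write starts on the last perform clock; with the assumed move
        # fast path the data is ready one clock earlier.
        wanted = perform_end - (1 if fast_path else 0)
        write_start = max(bus_free, wanted)
        stages["W"] = (write_start, write_start + 2)
        total = max(total, write_start + 2)
    return stages, total


if __name__ == "__main__":
    cases = {
        "move (a0),(a0)": dict(has_read=True, has_write=True, perform_clocks=2, fast_path=True),
        "add d0,(a0)": dict(has_read=True, has_write=True, perform_clocks=2),
        "move d0,(a0)": dict(has_read=False, has_write=True, perform_clocks=2, fast_path=True),
        "move (a0),d0": dict(has_read=True, has_write=False, perform_clocks=2),
        "lsl #1,(a0)": dict(has_read=True, has_write=True, perform_clocks=3),
    }
    for name, args in cases.items():
        stages, total = schedule(**args)
        print(f"{name:16} {total} clocks {stages}")
```

Because the perform stage is placed strictly after the end of the read, adding read wait states pushes the whole instruction back: move (a0),d0 with 4 to 7 wait states comes out at 10 to 13 clocks, in line with the chipram measurements earlier in the thread.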