27 December 2022, 13:34 | #1 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
Some 020 chipram access details.
Hello.
When optimizing code for the A1200 (stock configuration) and comparing its behavior on real hardware and in WinUAE, some details came up that may help clarify 68020 emulation in cycle exact mode. They concern chipram access cycles. For simplicity, we will assume that all code is executed from the cache and that DMA does not interfere. Let's look at the 68020 datasheet - https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf
The cycles for move.l (a0),d0 according to the datasheet look like this:
Code:
68020 clock: 01234567890123456789....
             ARRRMM
A - calculate effective address, 1 cycle (2 cycles according to 8.2.3, but one cycle is shared with the first cycle of the RAM read, see Figure 8-5 for example)
RRR - read from RAM, 3 cycles
MM - perform move, 2 cycles
Now how does this fit into chipram cycles (checked with a logic analyzer on real hardware)? There are 4 possible cases:
Code:
68020 clock:                 01234567890123456789....
Color clock (chipram slots): [00][01][02][03][04]....
                             ARRRRRRRMM
                              ARRRRRRMM
                               ARRRRRRRRRMM
                                ARRRRRRRRMM
Total execution time is 10/9 or 12/11 CPU clocks. It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot. Note that this behavior applies to the write cycle too, but the MM cycles are shared with the write, thanks to the write pending buffer.
Now let's look at the execution of two consecutive instructions that read RAM:
Code:
move.l (a0),d0 ;M1
move.l (a1),d1 ;M2

68020 clock:                 01234567890123456789012345678....
Color clock (chipram slots): [00][01][02][03][04][05][06]
                             ARRRRRRRM1ARRRRRRRRRM2
                              ARRRRRRM1ARRRRRRRRRM2
                               ARRRRRRRRRM1ARRRRRRRRRM2
                                ARRRRRRRRM1ARRRRRRRRRM2
Code:
move.l (a0),d0 ;M1
move.l d2,d3 ;M2
move.l d2,d3 ;M3
move.l (a1),d1 ;M4
Code:
68020 clock:                 01234567890123456789012345678901....
Color clock (chipram slots): [00][01][02][03][04][05][06][07]
                             ARRRRRRRM1M2M3ARRRRRRRRRM4
                              ARRRRRRM1M2M3ARRRRRRRRRM4
                               ARRRRRRRRRM1M2M3ARRRRRRRRRM4
                                ARRRRRRRRM1M2M3ARRRRRRRRRM4
Perhaps this information will help improve the emulation.
Sorry, but this is a stupid request about 020 cycle exact mode
Thanks |
27 December 2022, 18:51 | #2 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
Sorry, this is not correct:
It looks like the start of the CPU read cycle should not be later than 1/2 CHIPRAM cycle to successfully use the current and next access slot.
Correct version: assertion of the ~AS signal should not be later than 1/2 chipram cycle. The ~AS signal is asserted on the second clock of the memory read/write cycle. Therefore, the correct diagrams look like this:
Code:
68020 clock:                 01234567890123456789....
Color clock (chipram slots): [00][01][02][03][04]....
                             ARRRRRRRMM
                              ARRRRRRRRRRMM
                               ARRRRRRRRRMM
                                ARRRRRRRRMM
Examples from the logic analyzer:
7.PNG: ~AS asserted time = 6 cycles, total chipram read 7 cycles
10_and_8.PNG: ~AS asserted time = 9/7 cycles, total chipram read 10/8 cycles
9.PNG: ~AS asserted time = 8 cycles, total chipram read 9 cycles
For move (a0),d0 add one cycle for address calculation and 2 cycles to perform the move. So the total execution time is 10...13 cycles. All other cases should be corrected similarly. I apologize for the incorrect information in the first post. |
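The corrected rule can be written down as a small model. The Python sketch below is only an illustration and not WinUAE code. It assumes 4 CPU clocks per chipram slot (as drawn in the diagrams), that ~AS is asserted on the second clock of the read, and that "not later than 1/2 chipram cycle" means clock 0, 1 or 2 within a slot. The function names are made up for this example.

```python
CLOCKS_PER_SLOT = 4  # assumption: 4 CPU clocks per chipram slot, as in the diagrams

def chipram_read_clocks(read_start):
    """CPU clocks taken by a chipram read whose first clock is `read_start`."""
    as_clock = read_start + 1  # ~AS is asserted on the second clock of the read
    slot, phase = divmod(as_clock, CLOCKS_PER_SLOT)
    # Assumption: phase 0..2 counts as "not later than 1/2 chipram cycle".
    # If ~AS is in time, the read ends with the next slot, otherwise one slot later.
    slots_used = 2 if phase <= CLOCKS_PER_SLOT // 2 else 3
    read_end = (slot + slots_used) * CLOCKS_PER_SLOT  # first clock after the read
    return read_end - read_start

def move_from_chipram_clocks(start):
    """Total clocks for move.l (a0),d0 when the instruction starts at clock `start`."""
    ea = 1       # A: address calculation
    perform = 2  # MM: perform move
    return ea + chipram_read_clocks(start + ea) + perform

for start in range(CLOCKS_PER_SLOT):
    print(start, chipram_read_clocks(start + 1), move_from_chipram_clocks(start))
```

For start clocks 0 to 3 this gives read lengths of 7, 10, 9 and 8 clocks and totals of 10, 13, 12 and 11 clocks, which matches the four rows of the corrected diagram.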
27 December 2022, 19:10 | #3 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
Updated "text" diagrams:
Code:
move.l (a0),d0 ;M1
move.l (a1),d1 ;M2

68020 clock:                 01234567890123456789012345678....
Color clock (chipram slots): [00][01][02][03][04][05][06]
                             ARRRRRRRM1ARRRRRRRRRM2
                              ARRRRRRRRRRRM1ARRRRRRRRM2
                               ARRRRRRRRRRM1ARRRRRRRRM2
                                ARRRRRRRRRM1ARRRRRRRRM2
22-25-24-23 clocks
Code:
move.l (a0),d0 ;M1
move.l d2,d3 ;M2
move.l d2,d3 ;M3
move.l (a1),d1 ;M4

68020 clock:                 01234567890123456789012345678901....
Color clock (chipram slots): [00][01][02][03][04][05][06][07]
                             ARRRRRRRM1M2M3ARRRRRRRRRM4
                              ARRRRRRRRRRM1M2M3ARRRRRRRRRM4
                               ARRRRRRRRRM1M2M3ARRRRRRRRRM4
                                ARRRRRRRRM1M2M3ARRRRRRRRRM4
26-29-28-27 clocks |
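The clock totals quoted for these two sequences can be cross-checked with a short simulation. This is a hedged sketch, not emulator code: it assumes 4 CPU clocks per chipram slot, ~AS asserted on the second clock of each read, ~AS "in time" when it falls on clock 0, 1 or 2 of a slot, and 2 clocks for every perform stage (including the register-to-register moves). The names `read_end` and `run` and the "mem"/"reg" labels are made up for this example.

```python
CLOCKS_PER_SLOT = 4  # assumption: 4 CPU clocks per chipram slot

def read_end(read_start):
    """First CPU clock after a chipram read that starts at clock `read_start`."""
    slot, phase = divmod(read_start + 1, CLOCKS_PER_SLOT)  # clock where ~AS is asserted
    # Assumption: ~AS in phase 0..2 still ends with the next slot, later costs one more slot.
    return (slot + (2 if phase <= 2 else 3)) * CLOCKS_PER_SLOT

def run(program, start):
    """Total clocks for `program`, a list of "mem" (move.l (an),dn) or "reg" (move.l dn,dn)."""
    clock = start
    for kind in program:
        if kind == "mem":
            clock = read_end(clock + 1)  # 1 clock for A, then the chipram read
        clock += 2                       # perform move
    return clock - start

two_reads = ["mem", "mem"]
with_reg_moves = ["mem", "reg", "reg", "mem"]
print([run(two_reads, s) for s in range(4)])
print([run(with_reg_moves, s) for s in range(4)])
```

Under these assumptions the four start phases give 22, 25, 24, 23 clocks for the first sequence and 26, 29, 28, 27 clocks for the second, the same totals as listed under the diagrams.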
28 December 2022, 19:46 | #4 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,553
|
The problem is not chip RAM accesses (which are more or less like 68000 + write buffering by Budgie). The missing piece is the order of accesses, when the 68020 does instruction prefetch (with or without caches), and other internal cycles. The 68020/030 documentation wants to hide the internal details.
The main question: how does everything work in general, not just when instructions are cached but in all possible conditions?
Another question (which I haven't tested yet): do all instructions always have the same memory access order, i.e. the cache or the previous/next instruction does not affect it (prefetches, memory writes, memory reads)? AFAIK the 68020 and 68030 are still fully microcoded CPUs, so it could be true, or there might be multiple paths depending on something internal (but on the other hand, that would probably make the micro ROM too large). It gets practically impossible if the order of accesses is not static..
Currently 68020 "cycle exact" plays it safe: it is too fast, because a CPU that is even a single cycle too slow can break more programs than a CPU that is too fast. Also, "overlapping" cycles are most likely not fully handled (which can also explain why it would be too slow without extra hacks).
EDIT: another problem is programs that don't have caches enabled at all; there even a single too-slow instruction can mess up timing (missed frames). Just getting the cached case working is useless if the non-cached case is wrong.
Note that I haven't examined any 68020-based hardware with a logic analyzer. 68000 first (which is more or less finally done). |
28 December 2022, 22:11 | #5 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
There are no write buffers between the CPU and chipram. On the waveforms from the logic analyzer, you can see that the ~RAS, ~CAS and ~WE signals are removed before the ~AS signal is removed. Therefore, the ~DSACK signal is asserted after the write to memory has finished.
More precisely, there are no external buffers that would allow the CPU to finish the write cycle before the actual write to RAM is completed. But the CPU has its own internal buffer, and while the bus is busy with a write cycle, instructions from the cache can be executed. For an example, look at Figure 8-6 in the datasheet. |
30 December 2022, 03:15 | #6 |
Registered User
Join Date: May 2022
Location: Boston / USA
Age: 46
Posts: 39
|
Are you sure about that? I thought the whole point of the Bridgette chip on the A4000 was to latch writes to chip ram, according to the official datasheet. Budgie is pretty similar to Bridgette but with some clock stuff and glue logic added.
|
30 December 2022, 10:57 | #7 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
Yes. For example, the attached file is a diagram of two write cycles (~WE active): the first with an ~AS length of 8, the second with 7. Each cycle ends with ~AS being removed after ~RAS.
|
09 February 2023, 21:19 | #8 | |
Registered User
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 192
|
Quote:
Do M1/2/3/4 mean instruction prefetch? |
|
09 February 2023, 21:28 | #9 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
No. M1...M4 mean performing the move instruction. You can see it in https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf from Figure 8-3 (page 8-5) onwards. The perform stage takes 2 clocks.
I am considering the case where all instructions are executed from the cache |
09 February 2023, 21:44 | #10 |
Registered User
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 192
|
ok, thanks for the clarification
|
10 February 2023, 20:51 | #11 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,553
|
I removed the chipset "write buffer" part and my test statefiles didn't seem to break.
The main reason I probably appear very uninterested is that I don't want to touch the 68020 CE stuff unless I really have time to rewrite it again and again, because something always breaks. It is better to have it working than accurate at this point.. Maybe after 5.0 I'll check it again, and only if there really is enough new information.
Pipeline operation, timing, instructions having multiple data reads and writes and multiple prefetches, and more: all this combined makes the documentation useless. (The 68030 timing documentation is slightly more useful because it tries to hide less information than the 68020 one.)
btw, an interesting detail in the pipeline is that if a partially prefetched instruction is any kind of unconditional branch/jump (including RTS etc.), the CPU stops prefetches until the full pipeline fill from the new PC (except if the jump is a single-word instruction; in that case the following word is still prefetched). This quite simple feature was also annoying to emulate. |
10 February 2023, 22:11 | #12 |
Registered User
Join Date: Jan 2022
Location: Kharkiv
Posts: 48
|
From my point of view, if we consider the execution of code from the cache, then the behavior of simple instructions in the 68020 is quite clear:
1. The source address is calculated.
2. The read cycle starts on the last cycle of the source address calculation. This last cycle corresponds to the read clock during which ~AS is not yet asserted. Apparently, the address has already been calculated by this time (i.e. the real calculation takes one cycle less than the specified "Fetch Effective Address").
3. While the read cycle is in progress, the calculation of the destination address is started, if necessary.
4. After the end of the read cycle, execution starts (for example, add will take 2 cycles).
5. One cycle before the end of the execution, the write cycle starts.
The move instruction seems to have some kind of fast path within the 68020, because its data is ready one cycle early. For example, move (a0),(a0) takes 7 cycles while add d0,(a0) takes 4+4=8 cycles. Although both of these instructions are equivalent in terms of the number of memory access cycles, the second instruction has an empty clock in bus activity between the read and the write.
Code:
move (a0),(a0):
1234567
AS.....
.RRR...
..AD...
....PP.
....WWW
......next instruction
AS - calculate source EA
RRR - read memory
AD - calculate destination EA
PP - perform
WWW - write memory

add d0,(a0):
12345678
AS......
.RRR....
.....WWW
....PP..
......next instruction

move d0,(a0):
1234
AD..
.WWW
PP..
..next instruction

move (a0),d0:
123456
AD....
.RRR..
....PP
......next instruction

lsl #1,(a0):
123456789
AD.......
.RRR.....
....PPP..
......WWW
.......next instruction
The PP stage cannot start executing before the RRR stage has completed (with all wait states), but WinUAE behavior looks different. |
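The per-instruction totals and the "empty clock" on the bus can be read directly off the stage rows. The following Python sketch is just one way to check that: it copies the rows from the text diagrams above (one character per CPU clock, without the "next instruction" row) and counts clocks. The helper names are made up for this example.

```python
# Stage rows copied from the text diagrams; one character per CPU clock.
DIAGRAMS = {
    "move (a0),(a0)": ["AS.....", ".RRR...", "..AD...", "....PP.", "....WWW"],
    "add d0,(a0)":    ["AS......", ".RRR....", ".....WWW", "....PP.."],
    "move d0,(a0)":   ["AD..", ".WWW", "PP.."],
    "move (a0),d0":   ["AD....", ".RRR..", "....PP"],
    "lsl #1,(a0)":    ["AD.......", ".RRR.....", "....PPP..", "......WWW"],
}

def clocks_with(rows, stage):
    """1-based clocks in which `stage` ('R' or 'W') is active."""
    return [i + 1 for row in rows for i, ch in enumerate(row) if ch == stage]

def total_clocks(rows):
    """Length of the instruction in clocks (all rows have the same width)."""
    return len(rows[0])

def idle_bus_gap(rows):
    """Idle bus clocks between the last read clock and the first write clock,
    or None if the instruction does not both read and write memory."""
    reads, writes = clocks_with(rows, "R"), clocks_with(rows, "W")
    if not reads or not writes:
        return None
    return min(writes) - max(reads) - 1

for name, rows in DIAGRAMS.items():
    print(f"{name:16} {total_clocks(rows)} clocks, idle gap: {idle_bus_gap(rows)}")
```

With these rows, move (a0),(a0) has no idle bus clock between read and write, add d0,(a0) has one, and lsl #1,(a0) has two, which is the difference described in the post.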