English Amiga Board


Old 14 October 2016, 20:12   #1
PiCiJi
Registered User
 
 
Join Date: Sep 2003
Location: germany
Age: 45
Posts: 402
68k timing

Lately I have been thinking about 68k timing again.

What happens to internal operations during wait states?

e.g. ASL Dx,Dy
sequence: prefetch n* n (* means repeated once per shift count; n = 2 cycles; prefetch = 4 cycles: 2 cycles to put the address on the bus, then the next 2 cycles are repeated while the bus is stalled)

Assuming the prefetch is stalled by wait states, is the internal register shifting stalled too? It could happen concurrently with the bus wait-state cycles.

Or, like instruction overlap, is that only possible on 68020 and higher CPUs?

Last edited by PiCiJi; 14 October 2016 at 20:23.
Old 15 October 2016, 19:49   #2
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Nothing happens during wait states with 68000.

Only 68020+ can do memory access(es) while CPU does internal operations.
Old 16 October 2016, 13:56   #3
PiCiJi
Registered User
 
 
Join Date: Sep 2003
Location: germany
Age: 45
Posts: 402
That seems true for immediate and register ASL, because the sequence shows the shifting after the prefetch.

ASL (An)

sequence: nr (read from An), np (prefetch), nw (write shifted result)
The shifting and the decoding of the next opcode happen during the prefetch.

There are no additional 2 clocks for shifting.

Last edited by PiCiJi; 16 October 2016 at 14:01.
Old 16 October 2016, 14:14   #4
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
I didn't say other things can't happen during memory access (microcode can do ALU operations, condition code setting etc simultaneously) but if memory access takes longer than normal 4 cycles (wait states added), nothing happens during those extra wait states.
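
The cycle sequences discussed so far can be put into a small Python sketch. It is only a simplified model built on what this thread states: a 68000 bus access takes 4 clocks plus any wait states, an internal "n" step takes 2 clocks, and nothing else advances during wait states. The function names are made up for illustration.

```python
# Simplified 68000 cycle model using the notation from this thread.
# Assumptions: each bus access (np / nr / nw) is 4 clocks plus wait states,
# each internal "n" step is 2 clocks, and wait states stall everything.

BUS_CYCLE = 4   # one bus access with no wait states
INTERNAL = 2    # one internal "n" step


def asl_reg_cycles(shift_count, wait_states=0):
    """ASL Dx,Dy: prefetch, then one n per shift count, then one more n."""
    prefetch = BUS_CYCLE + wait_states
    return prefetch + INTERNAL * shift_count + INTERNAL


def asl_mem_cycles(wait_states=0):
    """ASL (An): nr, np, nw. The shift overlaps the prefetch, so no extra n."""
    return 3 * (BUS_CYCLE + wait_states)


if __name__ == "__main__":
    print(asl_reg_cycles(3))      # 12
    print(asl_reg_cycles(3, 2))   # 14: the 2 wait states are simply added
    print(asl_mem_cycles())       # 12
    print(asl_mem_cycles(2))      # 18: 2 wait states on each of 3 accesses
```

In this model wait states only ever add to the total, which is the point made above: the shifting cannot hide behind them.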
Old 16 October 2016, 15:14   #5
PiCiJi
Registered User
 
 
Join Date: Sep 2003
Location: germany
Age: 45
Posts: 402
Thanks for the clarification.

I am trying to understand prefetches on the 68020.

It says a memory access costs 3 cycles. It seems a long word access is one bus cycle instead of 2 as on the 68000.

A few questions:
1. Can a long word be read/written within 3 cycles (no wait states)?
2. Does every odd prefetch consume no bus cycles, because a long word is prefetched from the external bus or cache?
3. How many cycles does a cache hit consume?
4. Does a cache miss consume additional cycles besides the external bus access?

Last edited by PiCiJi; 16 October 2016 at 19:56.
Old 16 October 2016, 20:30   #6
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by PiCiJi View Post
Thanks for the clarification.

I am trying to understand prefetches on the 68020.

It says a memory access costs 3 cycles. It seems a long word access is one bus cycle instead of 2 as on the 68000.
Yes, 68020+ is fully 32-bit. 68000 is internally 16-bit.

Quote:
A few questions:
1. Can a long word be read/written within 3 cycles (no wait states)?
2. Does every odd prefetch consume no bus cycles, because a long word is prefetched from the external bus or cache?
3. How many cycles does a cache hit consume?
4. Does a cache miss consume additional cycles besides the external bus access?
1: Yes, if the access is long aligned and the bus is 32 bits wide (for example, custom chipset registers are always 16-bit).
2: Yes, prefetch always loads long-aligned long words. Even if it is not cached (caches off), the long word goes to the 32-bit prefetch buffer and the next word comes from the buffer (while the CPU can already start the next long prefetch read). So it is better to jump to long-aligned addresses to get the most out of it.
3: I think a cache hit is free.
4: It depends. If the CPU has something else to do, a miss may not cause any extra cycles, for example during a longer logic operation or while prefetching/decoding a word from the prefetch buffer (or instruction cache). This makes accurate emulation practically impossible without more knowledge of CPU internals.

1 and 2 are quite clearly documented. With 3 and 4 it becomes quite fuzzy.
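
Answers 1 and 2 can be illustrated with a rough Python sketch. It assumes a simplified bus model in which one bus cycle transfers at most one aligned port-width unit and costs 3 clocks with no wait states (the figure quoted in this thread). The function names and example addresses are hypothetical, and cache and wait-state effects are ignored.

```python
# Rough 68020 bus model: one bus cycle moves at most one aligned unit of the
# port width. Assumption: 3 clocks per bus cycle with no wait states.

CLOCKS_PER_BUS_CYCLE = 3


def bus_cycles(address, size, port_width):
    """Bus cycles to transfer `size` bytes at `address` over a port that is
    `port_width` bytes wide (4 = 32-bit, 2 = 16-bit)."""
    first_unit = address // port_width
    last_unit = (address + size - 1) // port_width
    return last_unit - first_unit + 1


def prefetch_long_reads(start, word_count):
    """Long-aligned long reads needed to supply `word_count` consecutive
    instruction words beginning at address `start`."""
    return bus_cycles(start, 2 * word_count, 4)


if __name__ == "__main__":
    print(bus_cycles(0x1000, 4, 4))          # 1: aligned long, 32-bit port
    print(bus_cycles(0x1002, 4, 4))          # 2: long not long aligned
    print(bus_cycles(0x1000, 4, 2))          # 2: long over a 16-bit port
    print(prefetch_long_reads(0x1000, 4))    # 2: long-aligned entry point
    print(prefetch_long_reads(0x1002, 4))    # 3: same code, misaligned entry
```

The last two lines show why a long-aligned jump target helps: the same four instruction words need one prefetch read fewer.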
Old 17 October 2016, 11:19   #7
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Quote:
Originally Posted by Toni Wilen View Post
Yes, 68020+ is fully 32-bit. 68000 is internally 16-bit.
Don't you mean "externally" ?
Old 17 October 2016, 11:47   #8
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
Quote:
Originally Posted by hooverphonique View Post
Don't you mean "externally" ?
The data bus is indeed 16bit. Most operations can be 32bit, but internally they're handled in steps of 16bit. Things like the ALU are 16bit.
Old 17 October 2016, 12:07   #9
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Toni Wilen View Post
3: I think cache hit is free.
I have my doubts on that one. LEA (A0),A1 is faster than MOVE.L (A0),A1 even if the data is in dcache (that is, for 030).
Old 17 October 2016, 12:18   #10
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by meynaf View Post
I have my doubts on that one. LEA (A0),A1 is faster than MOVE.L (A0),A1 even if the data is in dcache (that is, for 030).
Possibly. But it can't be proved either way without more internal information. For example instruction decoding may be different and so on.

Quote:
Originally Posted by Thorham View Post
The data bus is indeed 16bit. Most operations can be 32bit, but internally they're handled in steps of 16bit. Things like the ALU are 16bit.
Exactly, all internal operations are done in one or more 16-bit-sized pieces.

I didn't mention external because it is obvious, just count data pins
Old 17 October 2016, 12:53   #11
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Toni Wilen View Post
Possibly. But it can't be proved either way without more internal information. For example instruction decoding may be different and so on.
I know this from observed instruction timings, and it was just one example among many: a data cache hit always has a cost (which seems to be 3 clocks).
Where exactly it comes from is, of course, another story.

Btw. I would like to have more accurate 020/030 timings in WinUAE for code optimization work. For now the "cycle exact" timing is about as wrong as max speed with JIT active. It doesn't need to be 100% cycle exact, just better than what we have now.
Old 17 October 2016, 13:11   #12
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Impossible without more information. The 68020/030 documentation is useless for internal timing purposes, and extremely useless for mul/div timing. No one knows the algorithm.

The only "hidden" info that seems to be true is that each prefetch pipeline state change takes 2 cycles (when the word comes from the prefetch buffer = no extra wait states).
Old 17 October 2016, 14:01   #13
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Why would you need internal timing ? Isn't it only externally observable timing that counts ?
Old 17 October 2016, 14:31   #14
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by meynaf View Post
Why would you need internal timing ? Isn't it only externally observable timing that counts ?
It depends on internal timing. Everything depends on internal timing! (You can't track the CPU's internal state just by watching the outside world; there are far too many internal variables, like cache, prefetch buffer, instruction decoding pipeline, bus sequencer state etc. All of them need to be exactly right to get correct external timing.)

It is just like that on the 68000, except that the 68000 is very simple compared to the 68020: timing is always the same, and the previous or next instruction does not change the timing of the current instruction. 68000 internal timing is emulated practically 100% accurately.

The exact time at which each memory access happens needs to be 100% accurate, and the only way to make it accurate is to emulate all internal cycles. Even a 1 cycle difference can become a difference of multiple cycles in the outside world, because CPU memory accesses need to be aligned to Amiga bus cycles (especially when accessing chip RAM, chip registers or CIA). Even a tiny error will become huge.
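
A toy Python model shows how a one cycle internal error can grow once accesses must line up with bus slots. Everything numeric here is an assumption chosen for illustration (a slot every 8 CPU clocks, 4 clocks per access); these are not real Amiga figures, and the function names are invented.

```python
# Toy model: the CPU does some internal work, then makes one memory access
# that may only start on a slot boundary. Slot period and access length are
# illustrative assumptions, not measured Amiga values.

def next_slot(t, period):
    """First slot boundary at or after clock t (slots at multiples of period)."""
    return -(-t // period) * period


def loop_time(internal_cycles, iterations, period=8, access_cycles=4):
    """Total clocks for a loop of internal work followed by one slotted access."""
    t = 0
    for _ in range(iterations):
        t += internal_cycles                      # internal work
        t = next_slot(t, period) + access_cycles  # wait for slot, then access
    return t


if __name__ == "__main__":
    print(loop_time(4, 100))   # 804
    print(loop_time(5, 100))   # 1596: one extra internal cycle per iteration
```

With these assumed numbers, getting the internal work wrong by a single cycle makes every access miss its slot, so the loop takes almost twice as long.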

It gets even worse with variable cycle instructions like MUL or DIV (the cycle count depends on both operands, so without knowing the algorithm it is impossible to emulate them accurately).

The only thing that is simpler in 68020+ vs 68000 is shifts.

(I am quite sure I have said something similar about n times already)
Old 17 October 2016, 14:43   #15
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
I explicitly wrote that it didn't need to be 100% accurate...
Old 17 October 2016, 15:27   #16
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by meynaf View Post
I explicitly wrote that it didn't need to be 100% accurate...
Not being 100% accurate makes it very inaccurate. Those extra or missing internal cycles make a huge difference.

All memory accesses are already accurately emulated. With non-100% internal timing, the result is not accurate at all.

The 68000 is almost accurate even without internal timing because most instructions' cycle times are the same as their memory cycles (the main differences being shifts, mul and div and some EA calculations). The 68020 is something totally different.
Old 17 October 2016, 15:41   #17
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Toni Wilen View Post
Not being 100% accurate makes it very inaccurate. Those extra or missing internal cycles make a huge difference.
I don't get it. As we (asm programmers) can manually count clocks for a specified routine (without knowing the cpu's internals), why wouldn't the emulator be able to do the same ?
Old 17 October 2016, 18:03   #18
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Because it is not that difficult to optimize some loop for the best case, especially if most of the code fits in cache, or when only fast RAM accesses are done.

It is not going to work in the general case; most code is not optimized that way. It must work in all situations. For example, this kind of emulation would not help with the worst-case situation where code and data are in chip RAM (= unexpanded A1200/CD32 demos and games, the most important reason for me).

And it still does not help with MUL or DIV.
Old 17 October 2016, 18:41   #19
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Toni Wilen View Post
Because it is not that difficult to optimize some loop for the best case, especially if most of the code fits in cache, or when only fast RAM accesses are done.

It is not going to work in the general case; most code is not optimized that way. It must work in all situations. For example, this kind of emulation would not help with the worst-case situation where code and data are in chip RAM (= unexpanded A1200/CD32 demos and games, the most important reason for me).
It doesn't need to be adapted to suboptimal code. For example, assuming all code fits in the icache and there is no dcache would lead to results that are "good enough".

The current 68030 approximate +0% speed is between 3 times slower and 2 times faster than a 50 MHz 68030, depending on what's done. Such a lack of accuracy makes WinUAE unsuitable for asm cross-development (and it's a pity considering the speeeed at which phxass can assemble stuff there!).


Quote:
Originally Posted by Toni Wilen View Post
And it still does not help with MUL or DIV.
You can use worst case timing (e.g. 28c for mul.w). Should be enough.
Old 17 October 2016, 18:47   #20
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
List a few cache-fitting routines with cycle counts included and I'll check if there is something obviously wrong. (No MULs or DIVs!)

Quote:
You can use worst case timing (e.g. 28c for mul.w). Should be enough.
It won't work. It would make many demos with MULs or DIVs run too slowly and skip frames (worst cases are not that common, and no static cycle count is common enough). They need to be accurate. Unfortunately 68020+ MUL and DIV are "too fast" to make any easy guesses about the algorithm used (vs the 68000, where the microcode listing also helped).
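
To show why a static worst-case count overshoots, here is a small Python sketch using the plain 68000, where the timing is documented: MULU.W takes 38 + 2n clocks, with n the number of 1 bits in the 16-bit source operand (effective address time not included). This is only an analogy. The 68020+ algorithm is unknown, as stated above, and the sample operands are arbitrary.

```python
# 68000 MULU.W timing as documented by Motorola: 38 + 2n clocks, where n is
# the number of 1 bits in the 16-bit source operand (EA time excluded).
# Used here only to illustrate data-dependent timing; this is NOT the
# 68020+ algorithm, which is unknown.

def mulu_68000_cycles(source):
    """Clocks for 68000 MULU.W with the given 16-bit source operand."""
    return 38 + 2 * bin(source & 0xFFFF).count("1")


WORST_CASE = mulu_68000_cycles(0xFFFF)  # 70 clocks


if __name__ == "__main__":
    operands = [0x0000, 0x0001, 0x00FF, 0x0101]  # arbitrary sample values
    actual = sum(mulu_68000_cycles(op) for op in operands)
    static = WORST_CASE * len(operands)
    print(actual)   # 174
    print(static)   # 280
```

For these sample operands the worst-case estimate is about 60% too high, which is the kind of error that would make MUL-heavy code run too slowly in an emulator.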