68k timing

PiCiJi · 14 October 2016, 20:12

Recently I have been thinking about 68k timing again.

What happens to internal operations during wait states?

e.g. ASL Dx, Dy
sequence: prefetch n* n (* means repeated per shift count; n -> 2 cycles; prefetch -> 4 cycles: 2 cycles to put the address on the bus, then the next 2 cycles are repeated when stalled)

Assuming the prefetch is stalled by wait states, does that mean the internal register shifting is stalled too? It could happen concurrently with the bus wait-state cycles.

Or, like instruction overlap, is this only possible on 68020 and higher CPUs?

Toni Wilen · 15 October 2016, 19:49

Nothing happens during wait states with 68000.

Only 68020+ can do memory access(es) while CPU does internal operations.

PiCiJi · 16 October 2016, 13:56

That seems true for immediate and register ASL, because the sequence describes the shifting after the prefetch.

ASL (An)

sequence: nr (read from An), np (prefetch), nw (write shifted result)
Shifting and decoding of the next opcode happen during the prefetch.

There are no additional 2 clocks for shifting.

Toni Wilen · 16 October 2016, 14:14

I didn't say other things can't happen during memory access (microcode can do ALU operations, condition code setting etc simultaneously) but if memory access takes longer than normal 4 cycles (wait states added), nothing happens during those extra wait states.
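The two ASL sequences discussed so far can be put into numbers. This is only a minimal sketch: it assumes a 68000 bus access takes 4 cycles, an internal "n" step takes 2 cycles, and wait states simply add whole cycles to the access they stall, with nothing else progressing meanwhile. The function names and the `waits` parameters are made up for illustration.

```python
BUS_CYCLE = 4   # one 68000 bus access without wait states
INTERNAL = 2    # one internal "n" step

def asl_reg_cycles(shift_count, prefetch_waits=0):
    """ASL Dx,Dy (byte/word): prefetch, then n, then one n per shift count.

    Wait states stretch only the prefetch; the shifting starts afterwards,
    so nothing is hidden behind the wait states (assumption from the thread).
    """
    prefetch = BUS_CYCLE + prefetch_waits
    return prefetch + INTERNAL + INTERNAL * shift_count

def asl_mem_cycles(waits=(0, 0, 0)):
    """ASL (An): nr, np, nw. The shift happens during the prefetch access,
    so there are no extra internal cycles, only three bus accesses."""
    return sum(BUS_CYCLE + w for w in waits)

if __name__ == "__main__":
    print(asl_reg_cycles(3))        # 6 + 2*3 cycles
    print(asl_reg_cycles(3, 2))     # same, plus 2 wait-state cycles
    print(asl_mem_cycles())         # three plain bus accesses
```

With no wait states this matches the usual 6+2n figure for a word-sized register shift and 12 cycles for the memory form.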

PiCiJi · 16 October 2016, 15:14

Thanks for the clarification.

I am trying to understand prefetches for 68020.

It says a memory access costs 3 cycles. It seems a long word access is one bus cycle instead of 2 as on the 68000.

a few questions
1. Can a long word be read/written within 3 cycles (no wait states)?
2. Does each odd prefetch consume no bus cycles, because a long word is prefetched from the external bus or cache?
3. How many cycles does a cache hit consume?
4. Does a cache miss consume additional cycles besides the external bus access?

Toni Wilen · 16 October 2016, 20:30

Quote:

Originally Posted by PiCiJi

Thanks for the clarification.

I am trying to understand prefetches for 68020.

It says a memory access costs 3 cycles. It seems a long word access is one bus cycle instead of 2 as on the 68000.

Yes, 68020+ is fully 32-bit. 68000 is internally 16-bit.

Quote:

a few questions
1. Can a long word be read/written within 3 cycles (no wait states)?
2. Does each odd prefetch consume no bus cycles, because a long word is prefetched from the external bus or cache?
3. How many cycles does a cache hit consume?
4. Does a cache miss consume additional cycles besides the external bus access?

1: Yes, if the access is long aligned and the bus is 32 bits wide (for example, custom chipset registers are always 16-bit).
2: Yes, prefetch always loads long-aligned long words. Even if it is not cached (caches off), the long word goes to the 32-bit prefetch buffer and the next word comes from the buffer (while the CPU can already start the next long prefetch read). So it is better to jump to long-aligned addresses to make the best of it.

3: I think cache hit is free.
4: It depends. If the CPU has something else to do, it may not cause any extra cycles, for example a longer logic operation, or prefetching/decoding a prefetch-buffered word (or instruction cache). This makes accurate emulation practically impossible without more knowledge of CPU internals.

1 and 2 are quite clearly documented. With 3 and 4 it becomes quite fuzzy.
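The advice about long-aligned jump targets in answer 2 can be illustrated by counting bus fetches. This is a rough sketch, assuming a 32-bit bus, strictly sequential code, and that every fetch is a long-aligned long word as described above. It ignores the instruction cache and any prefetch past the last word; `long_fetches` is a made-up helper name.

```python
def long_fetches(start_addr, word_count):
    """Count the long-aligned 32-bit fetches needed to supply `word_count`
    consecutive 16-bit instruction words starting at `start_addr`."""
    if word_count <= 0:
        return 0
    if start_addr & 1:
        raise ValueError("instruction words must start at an even address")
    first_long = start_addr & ~3                 # long holding the first word
    last_word = start_addr + 2 * (word_count - 1)
    last_long = last_word & ~3                   # long holding the last word
    return (last_long - first_long) // 4 + 1

if __name__ == "__main__":
    # Same 4 instruction words, aligned vs. misaligned entry point.
    print(long_fetches(0x1000, 4))
    print(long_fetches(0x1002, 4))
```

Entering at an address that is only word aligned wastes half of the first fetched long word, so the same amount of code needs one more bus access.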

hooverphonique · 17 October 2016, 11:19

Quote:

Originally Posted by Toni Wilen

Yes, 68020+ is fully 32-bit. 68000 is internally 16-bit.

Don't you mean "externally" ?

Thorham · 17 October 2016, 11:47

Quote:

Originally Posted by hooverphonique

Don't you mean "externally" ?

The data bus is indeed 16-bit. Most operations can be 32-bit, but internally they're handled in steps of 16 bits. Things like the ALU are 16-bit.

meynaf · 17 October 2016, 12:07

Quote:

Originally Posted by Toni Wilen

3: I think cache hit is free.

I have my doubts on that one. LEA (A0),A1 is faster than MOVE.L (A0),A1 even if the data is in dcache (that is, for 030).

Toni Wilen · 17 October 2016, 12:18

Quote:

Originally Posted by meynaf

I have my doubts on that one. LEA (A0),A1 is faster than MOVE.L (A0),A1 even if the data is in dcache (that is, for 030).

Possibly. But it can't be proved either way without more internal information. For example instruction decoding may be different and so on.

Quote:

Originally Posted by Thorham

The data bus is indeed 16-bit. Most operations can be 32-bit, but internally they're handled in steps of 16 bits. Things like the ALU are 16-bit.

Exactly, all internal operations are done in one or more 16-bit-sized pieces.

I didn't mention external because it is obvious, just count data pins

meynaf · 17 October 2016, 12:53

Quote:

Originally Posted by Toni Wilen

Possibly. But it can't be proved either way without more internal information. For example instruction decoding may be different and so on.

I know this from observed instruction timings, and it was just one example among many: a data cache hit always has a cost (which seems to be 3 clocks).
Where exactly it comes from is of course another story.

Btw. I would like to have more accurate 020/030 timings under WinUAE for suitable code optimizations, as for now the "cycle exact" timing is about as wrong as max speed with JIT active. It doesn't need to be 100% cycle exact, just better than what we have now.

Toni Wilen · 17 October 2016, 13:11

Impossible without more information. 68020/030 documentation is useless for internal timing purposes. Extremely useless for mul/div timing. No one knows the algorithm.

The only "hidden" info that seems to be true is that each prefetch pipeline state change takes 2 cycles (when it comes from the prefetch buffer = no extra wait states).

meynaf · 17 October 2016, 14:01

Why would you need internal timing ? Isn't it only externally observable timing that counts ?

Toni Wilen · 17 October 2016, 14:31

Quote:

Originally Posted by meynaf

Why would you need internal timing ? Isn't it only externally observable timing that counts ?

It depends on internal timing. Everything depends on internal timing! (You can't track the CPU's internal state by just watching the outside world. There are far too many internal variables, like cache, prefetch buffer, instruction decoding pipeline, bus sequencer state etc. All of them need to be exactly right to get correct external timing.)

Just like it does on 68000, only that 68000 is very simple compared to 68020: timing is always the same, and the previous or next instruction does not change the timing of the current instruction. 68000 internal timing is practically 100% accurately emulated.

The exact time when a memory access happens needs to be 100% accurate, and the only way to make it accurate is to emulate all internal cycles. Even a 1 cycle difference can make a multiple cycle difference in the outside world when a CPU memory access needs to be aligned to Amiga bus cycles (especially when accessing chip RAM, chip registers or CIA). Even a tiny error will become huge.
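How a 1 cycle internal error turns into a multi-cycle external error can be shown with a toy alignment model. This is only a sketch: it assumes the bus accepts a CPU access only at fixed slot boundaries, and the slot period of 4 CPU cycles is a hypothetical value chosen for illustration, not a measured Amiga figure. `next_slot` is a made-up helper.

```python
def next_slot(request_cycle, slot_period):
    """Return the first slot boundary at or after `request_cycle`
    (ceiling division, done with integers only)."""
    return -(-request_cycle // slot_period) * slot_period

if __name__ == "__main__":
    SLOT = 4  # hypothetical slot period in CPU cycles
    on_time = next_slot(8, SLOT)   # internal timing emulated exactly
    late = next_slot(9, SLOT)      # internal timing off by one cycle
    print(on_time, late, late - on_time)
```

In this model a request that is one cycle late just misses its slot and waits for the next one, so the visible error is a full slot period, and every later access inherits it.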

It gets even worse with variable cycle instructions like MUL or DIV (the cycle count depends on both parameters, so without knowing the algorithm it is impossible to emulate them accurately).

The only thing that is simpler in 68020+ vs 68000 is shifts.

(I am quite sure I have said something similar about n times already)

meynaf · 17 October 2016, 14:43

I explicitly wrote that it didn't need to be 100% accurate...

Toni Wilen · 17 October 2016, 15:27

Quote:

Originally Posted by meynaf

I explicitly wrote that it didn't need to be 100% accurate...

Not being 100% accurate makes it very inaccurate. Those extra or missing internal cycles make a huge difference.

All memory accesses are already accurately emulated. Non-100% internal timing: result is not accurate at all.

68000 is almost accurate even without internal timing because most instructions' cycle time is the same as their memory cycles (main differences being shifts, mul and div and some EA calculations). 68020 is something totally different.

meynaf · 17 October 2016, 15:41

Quote:

Originally Posted by Toni Wilen

Not being 100% accurate makes it very inaccurate. Those extra or missing internal cycles make a huge difference.

I don't get it. As we (asm programmers) can manually count clocks for a specified routine (without knowing the cpu's internals), why wouldn't the emulator be able to do the same ?

Toni Wilen · 17 October 2016, 18:03

Because it is not that difficult to optimize some loop for best case, especially if most of the code fits in cache. Or when only fast RAM accesses are done.

It is not going to work with generic situation, most of code is not optimized that way. It must work in all situations. For example this kind of emulation would not help with the worst case situation where code and data is in chip RAM (=unexpanded A1200/CD32 demos and games. The most important reason for me.)

And it still does not help with MUL or DIV.

meynaf · 17 October 2016, 18:41

Quote:

Originally Posted by Toni Wilen

Because it is not that difficult to optimize some loop for best case, especially if most of the code fits in cache. Or when only fast RAM accesses are done.

It is not going to work with generic situation, most of code is not optimized that way. It must work in all situations. For example this kind of emulation would not help with the worst case situation where code and data is in chip RAM (=unexpanded A1200/CD32 demos and games. The most important reason for me.)

It doesn't need to be adapted to suboptimal code. For example, assuming that all code fits in the icache and that there is no dcache would lead to results that are "good enough".

The current 68030 approximate +0% speed runs between 3 times slower and 2 times faster than a 50 MHz 68030, depending on what's done. Such a lack of accuracy makes WinUAE unsuitable for asm cross-development (and it's a pity considering the speeeed at which phxass can assemble stuff there!).

Quote:

Originally Posted by Toni Wilen

And it still does not help with MUL or DIV.

You can use worst case timing (e.g. 28c for mul.w). Should be enough.

Toni Wilen · 17 October 2016, 18:47

List a few cache-fitting routines with cycle counts included and I'll check if there is something obviously wrong. (no MULs or DIVs!)

Quote:

You can use worst case timing (e.g. 28c for mul.w). Should be enough.

It won't work. It would make many demos with MULs or DIVs run too slowly and skip frames. (Worst cases are not that common, and no static cycle count is common enough.) They need to be accurate. Unfortunately 68020+ MUL and DIV are "too fast" to make any easy guesses about the algorithm used. (vs 68000, where the microcode listing also helped)
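For contrast, the 68000 case is documented: the 68000 User's Manual gives MULU as 38+2n clocks, where n is the number of one bits in the 16-bit source operand (effective address time not included). The sketch below only shows how far typical operands can sit from the worst case on that CPU. No equivalent formula is known for 68020+, which is the problem described above. `mulu_68000_cycles` is a made-up helper name.

```python
def mulu_68000_cycles(source):
    """68000 MULU cycle count per the documented 38+2n rule,
    n = number of one bits in the 16-bit source operand.
    Effective address calculation time is not included."""
    ones = bin(source & 0xFFFF).count("1")
    return 38 + 2 * ones

if __name__ == "__main__":
    worst = mulu_68000_cycles(0xFFFF)
    for operand in (0x0000, 0x0101, 0x00FF, 0xFFFF):
        cycles = mulu_68000_cycles(operand)
        print(f"{operand:#06x}: {cycles} cycles, {worst - cycles} below worst case")
```

A static worst-case count would overcharge every multiply whose source has few set bits, which is the "runs too slow and skips frames" effect in miniature.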
