Write waitstates on the 68020+

dissident · 04 November 2019, 15:13

I know that write-accesses to CHIP memory or CUSTOM chip registers incur wait-states. So on the 68020+ it is advised to put other instructions between two write accesses. The CPU can execute while results are being written to memory:

A0=CHIP memory / CUSTOM chip memory

Code:

move.l  d0,(a0)+        ; store 1st value
add.l   d2,d0           ; increase 1st value
move.l  d1,(a0)+        ; store 2nd value
add.l   d3,d1           ; increase 2nd value

Fine. But what happens on a 68030+, if I write to FAST memory between two writes to CHIP memory / CUSTOM CHIP registers? It could look like this:

A0=CHIP memory / CUSTOM chip memory
A1=FAST memory

Code:

move.l  d0,(a0)+        ; store 1st value
not.b   2(a1)           ; change Flag
move.l  d1,(a0)+        ; store 2nd value

To my mind, the NOT command will be executed during the wait states of the first write command, because two different memory bus systems are used.

Is that correct, or am I totally wrong here?

grond · 04 November 2019, 16:31

I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.

This may look like what you wrote but it is really a different mechanism.

The fastmem access between the two chipmem accesses is similar: on the 020/030 it will stall until the chipmem write is completed. Furthermore, being a NOT instruction, it will cause a read and a subsequent write. On 030+ the read can cause a burst read of four consecutive longs from fastmem to the 030's data cache. This may take too long to stuff between two chipmem writes. On 040/060 the write to fastmem should finish very quickly, which is why on the 060 doing c2p to fastmem and then copying the planar data from fastmem to chipmem in some unrelated work routine working in fastmem makes sense.

SpeedGeek · 04 November 2019, 16:45

Actually, any external memory access (read or write) will incur wait states. But the most wait states typically occur in Chip RAM because that particular RAM was (by design) the slowest RAM in the system.

Motorola/Freescale added features to advanced 68K CPUs to try to improve performance for the condition of external memory access wait states. These features are an instruction cache, data cache, write buffer, copyback, store buffer and non-sequential pipeline execution (Note - These features vary with CPU model).

The general idea here is to prevent or at least reduce the occurrence of an execution pipeline stall. If the execution pipeline is kept busy doing things like instruction decode or an effective address calculation (while external memory access is pending or preferably avoiding the external access completely with a cache hit) then overall performance is improved.

The CPU only sees one bus. The difference in access speed for different address spaces on the bus is determined in hardware. The custom chips also see one bus which just happens to be a small part of the larger CPU bus.

Thomas Richter · 04 November 2019, 16:45

Quote:

Originally Posted by grond

I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

That is (almost) correct. To be more precise: The 68040 and the 68060 have a "push buffer" into which they can perform a write while the bus is busy, and execute another instruction while the write is being piled up.

However, to activate this push buffer, the caching mode of the chip memory has to be set accordingly, namely to "non-serialized" on the 040 and "imprecise" on the 68060.

If the caching mode is "cache inhibited", then writes will also stall the 68040 and 68060 as it then guarantees purely sequential operation.
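
For reference, a rough sketch of the cache-mode encodings this refers to, assuming the two-bit CM field found in the transparent translation registers and page descriptors of the 68040/68060. The equate names are hypothetical, and on an Amiga these bits are normally set up by the CPU/MMU support libraries rather than by application code:

Code:

; Hypothetical equate names. CM is assumed to be the 2-bit
; cache-mode field at bits 6-5 of a TTR or page descriptor.
CM_SHIFT        equ     5
CM_WRITETHROUGH equ     %00     ; cachable, write-through
CM_COPYBACK     equ     %01     ; cachable, copyback
CM_SERIALIZED   equ     %10     ; 040: noncachable, serialized / 060: cache-inhibited, precise
CM_NONSERIAL    equ     %11     ; 040: noncachable, non-serialized / 060: cache-inhibited, imprecise

; Mode that, per the description above, lets chip memory writes be buffered
CHIP_CM_BITS    equ     CM_NONSERIAL<<CM_SHIFT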

meynaf · 04 November 2019, 17:01

Quote:

Originally Posted by grond

I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.

If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.
Writes to fast also give a few extra cycles.

While the 020/030 do not have a true push buffer, the CPU is still able to execute instructions while waiting for a memory write to complete.

grond · 04 November 2019, 17:09

Quote:

Originally Posted by meynaf

If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.

As I wrote, on 030 the read will probably cause a read burst from fastmem and this will likely take more cycles than what is available between two chipmem accesses, no?

meynaf · 04 November 2019, 17:18

Quote:

Originally Posted by grond

As I wrote, on 030 the read will probably cause a read burst from fastmem and this will likely take more cycles than what is available between two chipmem accesses, no?

No. The read causes a read burst only if data burst is active - and on my A1230 it usually wasn't.

roondar · 04 November 2019, 17:20

I can confirm that on my Blizzard 1230MK IV data burst is normally off. I've tried to manually activate it and found no performance benefit whatsoever, so maybe it's off for the reason grond mentioned (i.e. to make chip-fast-chip transfers faster)?

grond · 04 November 2019, 17:23

How does the 030 then fill a 16-byte data cache line? BTW, I also seem to remember finding that activating burst reads didn't seem to make any difference but I think I concluded that, since this didn't fit the theory, my tests were wrong...

Kalms · 04 November 2019, 17:38

Check out 68030um, section 11.2.5.2, 'Write pending buffer'. The 030 can pass a single write operation to this subsystem in the CPU and get on with processing while the bus microcontroller talks to whatever is connected on the other side (fastmem, chipmem, customchip regs).

Another memory access request while the current is in flight will cause the rest of the 030 core to pause until the in-progress memory access completes.

On 030/50, a write takes 2c (or was it 4c?) to complete in the core, and then it goes to the write pending buffer. Outside the CPU, the mem interface for chipmem can accept (start) a new write every 28 cycles. The bus microcontroller will wait until the next such period begins. Then it spends that period performing the transfer.

This is why chip/fast/chip accesses are costly - even though the fast access takes less than 28c, the CPU will miss out on a full 28c chipbus 'slot'.
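
A minimal sketch of how this plays out in code, assuming a0 points to chip memory, a 50 MHz 030, and the 28-cycle slot figure above (28 cycles / 50 MHz = about 0.56 microseconds per slot). The comments describe the expected behaviour rather than measured timings:

Code:

        move.l  d0,(a0)+        ; chip write #1: handed to the write pending buffer
        add.l   d2,d0           ; register-only work, no bus access needed,
        add.l   d3,d1           ;   so the core keeps running during the slot
        move.l  d1,(a0)+        ; chip write #2: waits only until write #1 has completed

How many register-only instructions fit depends on the clock rate and on whether the instructions themselves hit the instruction cache, so the padding is best tuned by measurement.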

grond · 04 November 2019, 17:44

Interesting! Thanks for the explanation. Do you also happen to know how the 030 will treat filling a dcache line for an uncached fastmem address?

Edited to add: is the 020 equal to the 030 with regards to the "write pending buffer"?

meynaf · 04 November 2019, 19:03

Quote:

Originally Posted by grond

How does the 030 then fill a 16-byte data cache line?

The 030 does not need to read full cache lines, longwords inside a cache line are independent (i.e. separately seen as valid or not).

Quote:

Originally Posted by grond

BTW, I also seem to remember finding that activating burst reads didn't seem to make any difference but I think I concluded that, since this didn't fit the theory, my tests were wrong...

Data burst often makes no difference.
In many cases it can be slower due to extra memory accesses. This is why it's not enabled by default.
In some cases it can be faster, because burst accesses take (slightly) fewer clocks than normal accesses.

It is possible to optimise for dburst. You have to do serial mem access, and insert register-only instructions in between.
It means that :

Code:

 move.l (a0)+,d0
 and.l d5,d0
 move.l (a0)+,d1
 and.l d5,d1

is faster than :

Code:

 move.l (a0)+,d0
 move.l (a0)+,d1
 and.l d5,d0
 and.l d5,d1

but only if dburst is active (else, exact same timing).

Quote:

Originally Posted by grond

Edited to add: is the 020 equal to the 030 with regards to the "write pending buffer"?

IIRC, yes. But the A1200's EC020 has far fewer wait states because of its lower clock rate, so not many instructions can fit between writes.

dissident · 04 November 2019, 19:15

Quote:

Originally Posted by grond

I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

My main target is not the 68020; it's more the 68040/60.

Quote:

Originally Posted by grond

The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.

Yes, you are right. I noticed this phenomenon on the 68020. If I put too many commands between the two CHIP memory writes, my routine wasted more raster time. Two commands seem to be the break-even point. With three commands you'll lose.

Quote:

Originally Posted by grond

The fastmem access between the two chipmem accesses is similar: on the 020/030 it will stall until the chipmem write is completed. Furthermore, being a NOT instruction, it will cause a read and a subsequent write. On 030+ the read can cause a burst read of four consecutive longs from fastmem to the 030's data cache. This may take too long to stuff between two chipmem writes. On 040/060 the write to fastmem should finish very quickly, which is why on the 060 doing c2p to fastmem and then copying the planar data from fastmem to chipmem in some unrelated work routine working in fastmem makes sense.

Okay, the NOT command is a bad example. A MOVE command to FAST memory would be much clearer.
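
A sketch of that revised example, assuming A0 = CHIP memory and A1 = FAST memory as before; d2 holding the new flag value is an assumption. The comments follow the write pending buffer behaviour described earlier in the thread for the 030 and are not measured:

Code:

        move.l  d0,(a0)+        ; store 1st value (CHIP), write goes pending
        move.b  d2,2(a1)        ; set flag (FAST): a second bus access, so the
                                ;   core waits here until the 1st write completes
        move.l  d1,(a0)+        ; store 2nd value (CHIP), may have to wait for the next slot

Unlike NOT, the MOVE only writes, so there is no read (and no possible burst fill) in between, but it still needs the bus while the first write is in flight.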

dissident · 04 November 2019, 19:21

Quote:

Originally Posted by SpeedGeek

Actually, any external memory access (read or write) will incur wait states. But the most wait states typically occur in Chip RAM because that particular RAM was (by design) the slowest RAM in the system.

Motorola/Freescale added features to advanced 68K CPUs to try to improve performance for the condition of external memory access wait states. These features are an instruction cache, data cache, write buffer, copyback, store buffer and non-sequential pipeline execution (Note - These features vary with CPU model).

The general idea here is to prevent or at least reduce the occurrence of an execution pipeline stall. If the execution pipeline is kept busy doing things like instruction decode or an effective address calculation (while external memory access is pending or preferably avoiding the external access completely with a cache hit) then overall performance is improved.

The CPU only sees one bus. The difference in access speed for different address spaces on the bus is determined in hardware. The custom chips also see one bus which just happens to be a small part of the larger CPU bus.

Thanks for your detailed explanation, SpeedGeek. Your last paragraph is the most interesting statement for me.

dissident · 04 November 2019, 19:36

Quote:

Originally Posted by Thomas Richter

That is (almost) correct. To be more precise: The 68040 and the 68060 have a "push buffer" into which they can perform a write while the bus is busy, and execute another instruction while the write is being piled up.

However, to activate this push buffer, the caching mode of the chip memory has to be set accordingly, namely to "non-serialized" on the 040 and "imprecise" on the 68060.

If the caching mode is "cache inhibited", then writes will also stall the 68040 and 68060 as it then guarantees purely sequential operation.

Yes, I know the push buffer of the 68060. Your statement about the 68040/60 cache modes confirms what I know about how to configure the cache modes for CHIP memory. For proper use, it would be a disaster if CHIP memory had the write-through cache mode.

grond · 04 November 2019, 19:39

In order to do burst accesses, I guess one would want 16-byte-aligned addresses, similar to the 040's move16 instruction?

dissident · 04 November 2019, 19:41

Quote:

Originally Posted by meynaf

If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.
Writes to fast also give a few extra cycles.

For the 68030 you are right, but how about the 68060 with its push buffer? It seems I have to test this on a real 68060, but sadly I don't have such a machine at hand at the moment.

dissident · 04 November 2019, 19:44

Quote:

Originally Posted by Kalms

Check out 68030um, section 11.2.5.2, 'Write pending buffer'. The 030 can pass a single write operation to this subsystem in the CPU and get on with processing while the bus microcontroller talks to whatever is connected on the other side (fastmem, chipmem, customchip regs).

Another memory access request while the current is in flight will cause the rest of the 030 core to pause until the in-progress memory access completes.

On 030/50, a write takes 2c (or was it 4c?) to complete in the core, and then it goes to the write pending buffer. Outside the CPU, the mem interface for chipmem can accept (start) a new write every 28 cycles. The bus microcontroller will wait until the next such period begins. Then it spends that period performing the transfer.

This is why chip/fast/chip accesses are costly - even though the fast access takes less than 28c, the CPU will miss out on a full 28c chipbus 'slot'.

Okay, a good explanation, thanks Kalms.

meynaf · 04 November 2019, 19:48

Quote:

Originally Posted by dissident

For the 68030 you are right, but how about the 68060 with its push buffer? It seems I have to test this on a real 68060, but sadly I don't have such a machine at hand at the moment.

As the 68060 has a "true" push buffer, normally it should be able to access memory between writes, without blocking.
However, if said memory isn't in cache then it may experience wait states if there is a write currently being done (as it has a single bus).
The whole question here is when exactly the push buffer will flush its data to memory. And having no 060, I can't answer.

dissident · 04 November 2019, 19:54

Quote:

Originally Posted by meynaf

It is possible to optimise for dburst. You have to do serial mem access, and insert register-only instructions in between.
It means that :

Code:

 move.l (a0)+,d0
 and.l d5,d0
 move.l (a0)+,d1
 and.l d5,d1

is faster than :

Code:

 move.l (a0)+,d0
 move.l (a0)+,d1
 and.l d5,d0
 and.l d5,d1

but only if dburst is active (else, exact same timing).

That's why I generally try to put register-only instructions between commands which access memory. No matter if it is a read from memory or a write to memory. If I remember right, the blitter without the nasty bit set also has an impact on the CPU's reads from CHIP memory and may cause CPU wait states.
