English Amiga Board


04 November 2019, 15:13   #1
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Write waitstates on the 68020+

I know that write accesses to CHIP memory or CUSTOM chip registers incur wait states. So on the 68020+ it is advised to put other instructions between two write accesses, so that the CPU can keep executing while the result is being written to memory:

A0 = CHIP memory / CUSTOM chip registers

Code:
move.l  d0,(a0)+        ; store 1st value
add.l   d2,d0           ; increase 1st value
move.l  d1,(a0)+        ; store 2nd value
add.l   d3,d1           ; increase 2nd value
Fine. But what happens on a 68030+, if I write to FAST memory between two writes to CHIP memory / CUSTOM CHIP registers? It could look like this:


A0 = CHIP memory / CUSTOM chip registers
A1 = FAST memory

Code:
move.l  d0,(a0)+        ; store 1st value
not.b   2(a1)           ; change Flag
move.l  d1,(a0)+        ; store 2nd value
To my mind, the NOT instruction will be executed during the wait states of the first write, because two different memory bus systems are used.


Is that correct, or am I totally wrong here?
04 November 2019, 16:31   #2
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready, and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.

This may look like what you wrote but it is really a different mechanism.

The fastmem access between the two chipmem accesses is similar: on the 020/030 it will stall until the chipmem write is completed. Furthermore, being a NOT instruction, it will cause a read and a subsequent write. On 030+ the read can cause a burst read of four consecutive longs from fastmem into the 030's data cache. This may take too long to fit between two chipmem writes. On 040/060 the write to fastmem should finish very quickly, which is why on the 060 it makes sense to do c2p to fastmem and then copy the planar data from fastmem to chipmem in some unrelated work routine working in fastmem.
04 November 2019, 16:45   #3
SpeedGeek
Moderator
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
Actually, any external memory access (read or write) will incur wait states. But the most wait states typically occur in Chip RAM because that particular RAM was (by design) the slowest RAM in the system.

Motorola/Freescale added features to advanced 68K CPUs to try to improve performance for the condition of external memory access wait states. These features are an instruction cache, data cache, write buffer, copyback, store buffer and non-sequential pipeline execution (Note - These features vary with CPU model).

The general idea here is to prevent or at least reduce the occurrence of an execution pipeline stall. If the execution pipeline is kept busy doing things like instruction decode or an effective address calculation (while external memory access is pending or preferably avoiding the external access completely with a cache hit) then overall performance is improved.

The CPU only sees one bus. The difference in access speed for different address spaces on the bus is determined in hardware. The custom chips also see one bus which just happens to be a small part of the larger CPU bus.

Last edited by SpeedGeek; 04 November 2019 at 17:23.
04 November 2019, 16:45   #4
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,216
Quote:
Originally Posted by grond View Post
I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.
That is (almost) correct. To be more precise: the 68040 and the 68060 have a "push buffer" into which they can perform a write while the bus is busy, and they can execute another instruction while the write is still pending.

However, to activate this push buffer, the caching mode of the chip memory has to be set accordingly, namely to "non-serialized" on the 040 and "imprecise" on the 68060.

If the caching mode is "cache inhibited", then writes will also stall the 68040 and 68060 as it then guarantees purely sequential operation.
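
As a rough illustration of the cache modes named above: in the 68040 and 68060 user's manuals the cache mode is a two-bit CM field in bits 6-5 of a page descriptor or transparent translation register. The Python sketch below decodes that field. The function name `cache_mode` and the sample descriptor values are made up for illustration, and the encodings should be checked against the manuals before relying on them.

```python
# Cache-mode (CM) field encodings, bits 6-5 of a page descriptor or TTR.
# Mode names follow the Motorola user's manuals; verify before relying on them.
CM_NAMES = {
    "68040": {
        0b00: "cachable, write-through",
        0b01: "cachable, copyback",
        0b10: "noncachable, serialized",
        0b11: "noncachable (non-serialized)",
    },
    "68060": {
        0b00: "cachable, write-through",
        0b01: "cachable, copyback",
        0b10: "cache-inhibited, precise",
        0b11: "cache-inhibited, imprecise",
    },
}

def cache_mode(descriptor: int, cpu: str) -> str:
    """Return the cache mode named by bits 6-5 of a descriptor value."""
    cm = (descriptor >> 5) & 0b11
    return CM_NAMES[cpu][cm]

# Hypothetical descriptor values; only the CM bits matter here.
print(cache_mode(0x00000060, "68060"))
print(cache_mode(0x00000040, "68040"))
```

Under this reading, CM = 11 is the mode that lets writes go through the push buffer, while CM = 10 is the serialized/precise mode in which writes stall the CPU.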
04 November 2019, 17:01   #5
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready, and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.
If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.
Writes to fast also give a few extra cycles.

While the 020/030 does not have a true push buffer, the CPU is still able to execute instructions while waiting for a memory write to complete.
04 November 2019, 17:09   #6
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.
As I wrote, on 030 the read will probably cause a read burst from fastmem and this will likely take more cycles than what is available between two chipmem accesses, no?
04 November 2019, 17:18   #7
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
As I wrote, on 030 the read will probably cause a read burst from fastmem and this will likely take more cycles than what is available between two chipmem accesses, no?
No. The read causes a read burst only if data burst is active, and on my A1230 it usually wasn't.
04 November 2019, 17:20   #8
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
I can confirm that on my Blizzard 1230MK IV data burst is normally off. I've tried to manually activate it and found no performance benefit whatsoever, so maybe it's off for the reason grond mentioned (i.e. to make chip-fast-chip transfers faster)?
04 November 2019, 17:23   #9
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
How does the 030 then fill a 16-byte data cache line? BTW, I also seem to remember finding that activating burst reads didn't make any difference, but I think I concluded that, since this didn't fit the theory, my tests were wrong...
04 November 2019, 17:38   #10
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Check out 68030um, section 11.2.5.2, 'Write pending buffer'. The 030 can pass a single write operation to this subsystem in the CPU and get on with processing while the bus microcontroller talks to whatever is connected on the other side (fastmem, chipmem, customchip regs).

Another memory access request while the current is in flight will cause the rest of the 030 core to pause until the in-progress memory access completes.

On 030/50, a write takes 2c (or was it 4c?) to complete in the core, and then it goes to the write pending buffer. Outside the CPU, the mem interface for chipmem can accept (start) a new write every 28 cycles. The bus microcontroller will wait until the next such period begins. Then it spends that period performing the transfer.

This is why chip/fast/chip accesses are costly - even though the fast access takes less than 28c, the CPU will miss out on a full 28c chipbus 'slot'.
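
The slot arithmetic above can be sketched as a toy model. This is only a sketch under simplifying assumptions that are not taken from the manual: chip slots start at multiples of 28 CPU cycles, a write occupies the core for 2 cycles, a fast memory access occupies the bus for 6 cycles (a made-up figure), and any new memory access waits until the pending one has completed. The name `total_cycles` is hypothetical.

```python
SLOT = 28        # CPU cycles per chip slot on an 030/50 (figure from the post)
CORE_WRITE = 2   # cycles a write occupies the core (the post says 2 or 4)
FAST_BUS = 6     # assumed bus time of a fast memory access (made-up value)

def total_cycles(ops):
    """Return the cycle at which all accesses in ops have completed.

    ops is a list of ("chip", 0), ("fast", 0) or ("reg", n) entries, where
    n is the cost of register-only work in CPU cycles.
    """
    t = 0          # current core time
    bus_done = 0   # time at which the pending bus access completes
    for kind, cycles in ops:
        if kind == "reg":
            t += cycles                    # runs while the write is pending
            continue
        t = max(t, bus_done)               # stall until the bus is free
        if kind == "chip":
            start = -(-t // SLOT) * SLOT   # round up to the next slot start
            bus_done = start + SLOT        # the transfer takes the whole slot
        else:                              # "fast"
            bus_done = t + FAST_BUS
        t += CORE_WRITE                    # hand the write to the buffer
    return max(t, bus_done)

padded = total_cycles([("chip", 0), ("reg", 20), ("chip", 0)])
mixed = total_cycles([("chip", 0), ("fast", 0), ("chip", 0)])
print(padded, mixed)
```

In this model the padded sequence finishes at cycle 56 and the chip/fast/chip sequence at cycle 84, i.e. one full 28-cycle slot later, even though the fast access itself is short.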
04 November 2019, 17:44   #11
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Interesting! Thanks for the explanation. Do you also happen to know how the 030 will treat filling a dcache line for an uncached fastmem address?

Edited to add: is the 020 the same as the 030 with regard to the "write pending buffer"?

Last edited by grond; 04 November 2019 at 17:54.
04 November 2019, 19:03   #12
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
How does the 030 then fill a 16-byte data cache line?
The 030 does not need to read full cache lines, longwords inside a cache line are independent (i.e. separately seen as valid or not).
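
To make the per-longword validity concrete, here is a small, simplified sketch of a 68030-style data cache: 16 lines of four longwords, each longword with its own valid bit. The address split (bits 7-4 select the line, bits 3-2 the longword) is an assumption based on the 256-byte cache organisation described in the 68030 user's manual, the function-code bits that belong to the real tag are left out, and the names `Line`, `split` and `fill` are hypothetical.

```python
class Line:
    """One data cache line: a tag plus four per-longword valid bits."""
    def __init__(self):
        self.tag = None
        self.valid = [False] * 4

def split(addr):
    """Split an address into (tag, line index, longword index)."""
    return addr >> 8, (addr >> 4) & 0xF, (addr >> 2) & 0x3

def fill(cache, addr, burst=False):
    """Model a read miss: validate one longword, or all four on a burst."""
    tag, index, lw = split(addr)
    line = cache[index]
    if line.tag != tag:            # different tag: old contents are stale
        line.tag = tag
        line.valid = [False] * 4
    if burst:
        line.valid = [True] * 4    # burst fill validates the whole line
    else:
        line.valid[lw] = True      # single-entry fill validates one longword
    return line.valid

cache = [Line() for _ in range(16)]
print(fill(cache, 0x12345678))               # single-entry fill
print(fill(cache, 0x12345678, burst=True))   # burst fill of the same line
```

Without burst, a line can sit in the cache with only some of its longwords valid, which is why a full 16-byte read is not required.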

Quote:
Originally Posted by grond View Post
BTW, I also seem to remember finding that activating burst reads didn't seem to make any difference but I think I concluded that, since this didn't fit the theory, my tests were wrong...
Data burst often makes no difference.
In many cases it can be slower due to extra memory accesses. This is why it's not enabled by default.
In some cases it can be faster, because burst accesses take (slightly) fewer clocks than normal accesses.

It is possible to optimise for dburst. You have to do serial mem access, and insert register-only instructions in between.
It means that :
Code:
 move.l (a0)+,d0
 and.l d5,d0
 move.l (a0)+,d1
 and.l d5,d1
is faster than :
Code:
 move.l (a0)+,d0
 move.l (a0)+,d1
 and.l d5,d0
 and.l d5,d1
but only if dburst is active (else, exact same timing).


Quote:
Originally Posted by grond View Post
Edited to add: is the 020 equal to the 030 with regards to the "write pending buffer"?
IIRC, yes. But the A1200's EC020 has far fewer wait states because of its lower clock rate, so not many instructions can fit between writes.
04 November 2019, 19:15   #13
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by grond View Post
I think you are making a wrong assumption. AFAIK the 020 will NOT continue execution of the instruction stream while the last written data is waiting for the chipmem to take it. The 060 is the only Motorola 68k that can do that.

My main target is not the 68020; it's more the 68040/60.


Quote:
Originally Posted by grond View Post
The reason to place instructions between two consecutive chipmem writes is that the first chipmem access will stall the CPU until chipmem is ready, and then you get some CPU cycles synchronised with chipmem in such a way that the instructions will complete before the next chipmem slot opens. If you then do a second chipmem access, you will waste fewer cycles waiting for chipmem. Of course, if you stuff too many instructions between the two chipmem accesses, you will waste both a chipmem cycle and a lot of CPU cycles waiting for the next chipmem slot after the one you've just missed.

Yes, you are right. I noticed the phenomenon on the 68020. If I put too many instructions between the two CHIP memory writes, my routine wasted more raster time. Two instructions seem to be the break-even point. With three instructions you lose.


Quote:
Originally Posted by grond View Post
The fastmem access between the two chipmem accesses is similar: on the 020/030 it will stall until the chipmem write is completed. Furthermore, being a NOT instruction, it will cause a read and a subsequent write. On 030+ the read can cause a burst read of four consecutive longs from fastmem into the 030's data cache. This may take too long to fit between two chipmem writes. On 040/060 the write to fastmem should finish very quickly, which is why on the 060 it makes sense to do c2p to fastmem and then copy the planar data from fastmem to chipmem in some unrelated work routine working in fastmem.

Okay, the NOT instruction is a bad example. A MOVE instruction to FAST memory would be much clearer.
04 November 2019, 19:21   #14
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by SpeedGeek View Post
Actually, any external memory access (read or write) will incur wait states. But the most wait states typically occur in Chip RAM because that particular RAM was (by design) the slowest RAM in the system.

Motorola/Freescale added features to advanced 68K CPUs to try to improve performance for the condition of external memory access wait states. These features are an instruction cache, data cache, write buffer, copyback, store buffer and non-sequential pipeline execution (Note - These features vary with CPU model).

The general idea here is to prevent or at least reduce the occurrence of an execution pipeline stall. If the execution pipeline is kept busy doing things like instruction decode or an effective address calculation (while external memory access is pending or preferably avoiding the external access completely with a cache hit) then overall performance is improved.

The CPU only sees one bus. The difference in access speed for different address spaces on the bus is determined in hardware. The custom chips also see one bus which just happens to be a small part of the larger CPU bus.

Thanks for your detailed explanation, SpeedGeek. Your last paragraph is the most interesting statement for me.
04 November 2019, 19:36   #15
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by Thomas Richter View Post
That is (almost) correct. To be more precise: The 68040 and the 68060 have a "push buffer" into which they can perform a write while the bus is busy, and execute another instruction while the write is being piled up.

However, to activate this push buffer, the caching mode of the chip memory has to be set accordingly, namely to "non-serialized" on the 040 and "imprecise" on the 68060.

If the caching mode is "cache inhibited", then writes will also stall the 68040 and 68060 as it then guarantees purely sequential operation.
Yes, I know the push buffer of the 68060. Your statement about the 68040/60 cache modes confirms what I know about configuring the cache modes for CHIP memory. For proper use it would be a disaster if CHIP memory had the write-through cache mode.
04 November 2019, 19:39   #16
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
In order to do burst accesses, I guess one would want 16-byte-aligned addresses, similar to the 040's move16 instruction?
04 November 2019, 19:41   #17
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by meynaf View Post
If this were true, then we would be able to pad reads from fast between writes to chip and still not miss any access.
But my experience with 030 shows that it doesn't work this way. I could pad up to 22 clock cycles after a write to chip, but these must not contain any memory access or it will not work anymore.
Writes to fast also give a few extra cycles.
For the 68030 you are right, but how about the 68060 with its push buffer? It seems I have to test this on a real 68060, but sadly I don't have such a machine at hand at the moment.
04 November 2019, 19:44   #18
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by Kalms View Post
Check out 68030um, section 11.2.5.2, 'Write pending buffer'. The 030 can pass a single write operation to this subsystem in the CPU and get on with processing while the bus microcontroller talks to whatever is connected on the other side (fastmem, chipmem, customchip regs).

Another memory access request while the current is in flight will cause the rest of the 030 core to pause until the in-progress memory access completes.

On 030/50, a write takes 2c (or was it 4c?) to complete in the core, and then it goes to the write pending buffer. Outside the CPU, the mem interface for chipmem can accept (start) a new write every 28 cycles. The bus microcontroller will wait until the next such period begins. Then it spends that period performing the transfer.

This is why chip/fast/chip accesses are costly - even though the fast access takes less than 28c, the CPU will miss out on a full 28c chipbus 'slot'.

Okay, a good explanation, thanks Kalms.
04 November 2019, 19:48   #19
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by dissident View Post
For the 68030 you are right, but how about the 68060 with its push buffer? It seems I have to test this on a real 68060, but sadly I don't have such a machine at hand at the moment.
As the 68060 has a "true" push buffer, normally it should be able to access memory between writes, without blocking.
However, if said memory isn't in the cache, then it may experience wait states if a write is currently in progress (as the CPU has a single bus).
The whole question here is when exactly the push buffer flushes its data to memory. And having no 060, I can't answer that.
04 November 2019, 19:54   #20
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Quote:
Originally Posted by meynaf View Post
It is possible to optimise for dburst. You have to do serial mem access, and insert register-only instructions in between.
It means that :
Code:
 move.l (a0)+,d0
 and.l d5,d0
 move.l (a0)+,d1
 and.l d5,d1
is faster than :
Code:
 move.l (a0)+,d0
 move.l (a0)+,d1
 and.l d5,d0
 and.l d5,d1
but only if dburst is active (else, exact same timing).
That's why I generally try to put register-only instructions between instructions that access memory, no matter whether it is a read from memory or a write to memory. If I remember correctly, the blitter, even without the nasty bit set, also has an impact on the CPU's reads from CHIP memory and may cause CPU wait states.