24 October 2011, 17:43 | #1 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
How not to flush caches.
AROS m68k CopyBack random hang bug aka How Not To Flush Caches.
68k CacheClearU() executed the following code if the CPU is an 040 or 060 (I left out the Supervisor() call and other minor things; they aren't important here):

CPUSHA BC
CINVA BC

Question: why is this a horribly bad idea? (No, I didn't write this code, and I didn't see the problem until a few days ago..) |
25 October 2011, 04:26 | #2 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
How did this only hang without the Supervisor call? No crash? I would expect it to work in Supervisor, although it doesn't make much sense. The CPUSHA BC is all that is needed. CacheClearU() is a pretty simple function, and there aren't many ways to write it the "correct" way. Did you check CacheClearE() also?
CacheClearU:
        move.l  a5,a0
        lea     (.super_push,pc),a5
        jsr     (Supervisor,a6)
        move.l  a0,a5
        rts
.super_push:
        cpusha  bc
        rte
|
25 October 2011, 07:56 | #3 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
You misunderstood (and yes, a single cpusha bc fixes it). I now know what the problem is (was); I'm just wondering if anyone else can see the problem more quickly than I did.
Bonus question: why did it work fine (usually) without copyback cache? |
25 October 2011, 14:09 | #4 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Just a quick guess: since the 040 is pipelined (and the 060 superscalar), CINVA starts executing before the push of the data cache is over. So some data cache entry is invalidated before being pushed (and thus is not pushed back to RAM).
Without copyback there are no problems, because RAM is consistent with the caches (nothing needs pushing back, indeed). |
25 October 2011, 14:25 | #5 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
That's a good guess, but both the 040 and 060 do a pipeline synchronization before executing a CPUSHA or CINVA, similar to having a NOP before it. However, there could be bugs in some processors (check the CPU errata). Writethrough shouldn't have to push any cache, as data is pushed to memory when written; the cache and memory are already consistent.
Last edited by matthey; 25 October 2011 at 14:43. |
25 October 2011, 15:37 | #6 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Yeah, mine was too easy an explanation; Toni would have thought of it in 1 nanosecond.
Also, not doing a pipeline synchronization before executing that kind of instruction would have been a Stupid x86 Thing, unworthy of the wonderful 68k! :-D So, unless there is some CPU bug, I don't know. |
25 October 2011, 15:45 | #7 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Something can happen between those two instructions..
|
25 October 2011, 15:56 | #8 |
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
|
An interrupt could occur between the two instructions, or even during the CPUSHA, but I can't see the full picture here. Tell us!
|
25 October 2011, 16:16 | #9 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Well, the interrupt handler may write to the data cache some information which is needed outside the handler. CINVA invalidates the cache entry without writing the handler's information back to RAM. So the information is lost, if the cache is in copyback mode.
Toni said there are sometimes problems in write-through mode too, though; I don't know why (maybe in some circumstance the write-through does not happen?). By the way, the CINVA instruction is inherently dangerous for this reason; it should be used with very special care. |
25 October 2011, 18:57 | #10 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Interrupt between CPUSHA and CINVA was the problem. CPUSHA can take hundreds of cycles if there is lots of data to push (with CopyBack enabled).
I didn't think this code was wrong because it has been like this since ages ago. (I guess it was actually never used until m68k AROS was resurrected about a year ago..) It also caused very confusing side effects: 1230scsi.device (1260 + Blizzard SCSI Kit) worked fine for 1-10 seconds before it hung. Interrupts and other tasks still kept working fine, which pointed to a task scheduling problem.. It gets even stranger: 1230scsi.device stopped detecting any drives if the system was reset after a hang. How are you supposed to debug something like this? With CINVA removed, the problem disappeared completely. |
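To make the race concrete, here is a small Python toy model of a copyback cache (not 68k code; the class, addresses, and values are invented purely for illustration). A write that lands between the push (CPUSHA) and the invalidate (CINVA) is silently thrown away:

```python
# Toy model of a copyback data cache. An interrupt that dirties a
# cache line between CPUSHA (push) and CINVA (invalidate) loses
# that write, because CINVA discards dirty lines without pushing them.

class CopybackCache:
    def __init__(self):
        self.mem = {}       # stands in for RAM
        self.lines = {}     # addr -> (value, dirty flag)

    def write(self, addr, value):
        # copyback mode: only the cache line is updated, not RAM
        self.lines[addr] = (value, True)

    def cpusha(self):
        # push every dirty line back to RAM; lines stay valid
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.mem[addr] = value
                self.lines[addr] = (value, False)

    def cinva(self):
        # invalidate ALL lines WITHOUT pushing dirty ones first
        self.lines.clear()

cache = CopybackCache()
cache.write(0x100, "task data")
cache.cpusha()                                 # dirty data safely in RAM...
cache.write(0x200, "interrupt handler data")   # interrupt fires here!
cache.cinva()                                  # ...handler's write discarded

print(cache.mem.get(0x100))  # task data survived the flush
print(cache.mem.get(0x200))  # None - the handler's write never reached RAM
```

With a single `cpusha bc` (push and leave lines valid, or push-and-invalidate atomically per line as the instruction actually does) there is no window in which a dirty line can be discarded.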
26 October 2011, 05:15 | #11 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
CPUSHA <=5394 (0/512) cycles
CINVA <=17 (0/0) cycles

The CPUSHA instruction time assumes fast 2-1-1-1 memory too; slower memory adds cycles very fast. That's with an 8k instruction and 8k data cache. The Natami will have several times that amount of cache and will be even slower at flushing the whole cache, despite having very fast memory. That's why it's important to have a working CacheClearE() and to encourage its use, so that only the cache that needs to be flushed is flushed. |
|
26 October 2011, 10:28 | #12 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Agreed. Anyway, I would expect that CPUSHA only pushes the dirty cache lines to memory, i.e. 512 is just the upper bound on the number of memory accesses.
|
26 October 2011, 21:04 | #13 | |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Quote:
CacheClearE() was also bad (bad as in it caused unnecessary performance loss): it only called CacheClearU(). Fixed today; now it uses CPUSHP if the flushed region is small enough (a megabyte or so, which prevents buggy programs from slowly flushing the whole memory space page by page...) and it also flushes only the requested cache type(s). |
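A rough sketch in Python of the strategy described above (the helper names, the 4 KB page size, and the exact 1 MB cutoff are assumptions for illustration; the real AROS code is 68k assembly): per-page CPUSHP for small regions, whole-cache CPUSHA as the fallback for large ones.

```python
# Sketch of a CacheClearE()-style size check: push individual pages
# with CPUSHP when the region is small, fall back to a single
# whole-cache CPUSHA when per-page flushing would be slower.
PAGE_SIZE = 4096
REGION_LIMIT = 1 << 20   # "a megabyte or so"

pushed_pages = []        # records per-page flushes for the demo
flushed_all = []         # records whole-cache flushes for the demo

def cpushp(page):
    # stand-in for the CPUSHP instruction (push/invalidate one page)
    pushed_pages.append(page)

def cpusha():
    # stand-in for the CPUSHA instruction (push/invalidate everything)
    flushed_all.append(True)

def cache_clear_e(addr, length):
    if length <= REGION_LIMIT:
        start = addr & ~(PAGE_SIZE - 1)          # align to page start
        for page in range(start, addr + length, PAGE_SIZE):
            cpushp(page)
    else:
        cpusha()

cache_clear_e(0x1000, 8192)    # small region: exactly two pages pushed
cache_clear_e(0, 16 << 20)     # huge region: one whole-cache flush
print(len(pushed_pages), len(flushed_all))  # 2 1
```

The cutoff is the interesting design choice: without it, a buggy program asking to flush the whole address space would crawl through it page by page instead of taking the one fast CPUSHA.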
|
27 October 2011, 08:21 | #14 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
@TheDarkCoder
Yes, pushing only dirty (written) cache lines sounds correct. Pushing all the cache lines would be rare, but so is 2-1-1-1 memory on the Amiga. I would not be surprised if cpusha bc takes over a thousand cycles on average on Amiga 68060 accelerators. @Toni Wilen Thanks. That's the kind of support that projects like the Natami need in order to make a future possible.
27 October 2011, 08:31 | #15 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
Is it related to football? |
|
27 October 2011, 11:31 | #16 |
Oldtimer
Join Date: Nov 2010
Location: VXO / Sweden
Posts: 153
|
2-1-1-1 is a way of showing the access speed of memory in burst mode.
The numbers represent the clock cycles it takes to fetch each piece of data. In the above example, the whole burst of 4 "words" takes 5 cycles: the first in 2 cycles, each following one in 1 additional cycle. 2-2-2-2 means that burst mode is no faster than non-burst mode. Old SDRAM is at best 5-1-1-1, IIRC. |
27 October 2011, 11:50 | #17 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
@matthey (or others): so, are you saying that Amiga accelerators do not have 2-1-1-1 access speed in fast RAM? What are the usual access speeds? Are the speeds of the widespread accelerators (Cyberstorm, various Blizzards, Apollo, etc.) known? |
|
27 October 2011, 18:13 | #18 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
3640 7-7-7-7 (no burst)
WarpEngine 040 4-2-2-2
Cyberstorm MKIII/PPC 5-1-1-1
Atari CT60 5-1-1-1 reads, 3-1-1-1 writes

The Natami has an SRAM cache that will likely do 3-1-1-1 or 2-1-1-1. Its regular DDR2 memory will likely be similar to the CT60 or a little better. |
|
28 October 2011, 10:05 | #19 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Thanks! :-)
Probably the numbers vary with the type of memory installed. |