24 October 2011, 17:43 | #1 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
How not to flush caches.
AROS m68k CopyBack random hang bug aka How Not To Flush Caches.
68k CacheClearU() executed the following code if the CPU is an 040 or 060 (I left out the Supervisor() call and other minor things; they aren't important here):

CPUSHA BC
CINVA BC

Question: why is this a horribly bad idea? (No, I didn't write this code, and I didn't see the problem until a few days ago..) |
25 October 2011, 04:26 | #2 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
How did this only hang without the Supervisor call? No crash? I would expect it to work in Supervisor, although it doesn't make much sense. The CPUSHA BC is all that is needed. CacheClearU() is a pretty simple function, and there aren't many ways to write it the "correct" way. Did you check CacheClearE() also?
CacheClearU:
        move.l  a5,a0
        lea     (.super_push,pc),a5
        jsr     (Supervisor,a6)
        move.l  a0,a5
        rts
.super_push:
        cpusha  bc
        rte
|
25 October 2011, 07:56 | #3 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
You misunderstood (and yes, a single cpusha bc fixes it). I now know what the problem is (was); I'm just wondering if anyone else can see the problem more quickly than I did.
Bonus question: why did it work fine (usually) without copyback cache? |
25 October 2011, 14:09 | #4 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Just a quick guess: since the 040 is pipelined (and the 060 superscalar), CINVA starts executing before the push of the data cache is over. So some data cache entry is invalidated before being pushed (and thus is not pushed back to RAM).
Without copyback there are no problems, because RAM is consistent with the caches (nothing needs pushing back, indeed). |
25 October 2011, 14:25 | #5 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
That's a good guess, but both the 040 and 060 do a pipeline synchronization before executing a CPUSHA or CINVA, similar to having a NOP before it. However, there could be bugs in some processors (check the CPU errata). Writethrough shouldn't have to push any cache, as data is pushed to memory when written; the cache and memory are already consistent.
Last edited by matthey; 25 October 2011 at 14:43. |
25 October 2011, 15:37 | #6 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Yeah, mine was too easy an explanation; Toni would have thought of it in 1 nanosecond.
Also, not doing a pipeline synchronization before executing that kind of instruction would have been a Stupid x86 Thing, unworthy of the wonderful 68k! :-D So, unless there is some CPU bug, I don't know. |
25 October 2011, 15:45 | #7 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Something can happen between those two instructions..
|
25 October 2011, 15:56 | #8 |
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
|
An interrupt could occur between the two instructions, or even during the CPUSHA, but I can't see the full picture here. Tell us!
|
25 October 2011, 16:16 | #9 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Well, the interrupt handler may write to the data cache some information which is needed outside the handler. CINVA invalidates the cache entry without writing the handler's information back to RAM. So the information is lost, if the cache is in copyback mode.
Toni said there are sometimes problems in write-through mode too, though; I don't know why (maybe in some circumstance the write-through does not happen?). By the way, the CINVA instruction is inherently dangerous for this reason; it should be used with very special care. |
25 October 2011, 18:57 | #10 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Interrupt between CPUSHA and CINVA was the problem. CPUSHA can take hundreds of cycles if there is lots of data to push (with CopyBack enabled).
I didn't think this code was wrong because it has been like this since ages ago. (I guess it was actually never used until m68k AROS was resurrected about a year ago..) It also caused very confusing side effects: 1230scsi.device (1260 + Blizzard SCSI Kit) worked fine for 1-10 seconds before it hung. Interrupts and other tasks still kept working fine, which pointed to a task scheduling problem.. It gets even stranger: 1230scsi.device stopped detecting any drives if the system was reset after a hang. How are you supposed to debug something like this? With CINVA removed, the problem disappeared completely. |
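To make the race concrete, here is a small Python toy model of a copyback cache (not 68k code; the class, addresses, and values are invented purely for illustration). A write that lands between the push (CPUSHA) and the invalidate (CINVA) is silently thrown away:

```python
# Toy model of a copyback data cache. An interrupt that dirties a
# cache line between CPUSHA (push) and CINVA (invalidate) loses
# that write, because CINVA discards dirty lines without pushing them.

class CopybackCache:
    def __init__(self):
        self.mem = {}       # stands in for RAM
        self.lines = {}     # addr -> (value, dirty flag)

    def write(self, addr, value):
        # copyback mode: only the cache line is updated, not RAM
        self.lines[addr] = (value, True)

    def cpusha(self):
        # push every dirty line back to RAM; lines stay valid
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.mem[addr] = value
                self.lines[addr] = (value, False)

    def cinva(self):
        # invalidate ALL lines WITHOUT pushing dirty ones first
        self.lines.clear()

cache = CopybackCache()
cache.write(0x100, "task data")
cache.cpusha()                                 # dirty data safely in RAM...
cache.write(0x200, "interrupt handler data")   # interrupt fires here!
cache.cinva()                                  # ...handler's write discarded

print(cache.mem.get(0x100))  # task data survived the flush
print(cache.mem.get(0x200))  # None - the handler's write never reached RAM
```

With a single `cpusha bc` (push and leave lines valid, or push-and-invalidate atomically per line as the instruction actually does) there is no window in which a dirty line can be discarded.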
26 October 2011, 05:15 | #11 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
CPUSHA <=5394 (0/512) cycles
CINVA <=17 (0/0) cycles

The CPUSHA instruction time assumes fast 2-1-1-1 memory too; slower memory adds cycles very fast. That's with an 8k instruction and 8k data cache. The Natami will have several times that amount of cache and will be even slower at flushing the whole cache, despite having very fast memory. That's why it's important to have a working CacheClearE() and to encourage its use, so that only the cache that needs to be flushed is flushed. |
|
26 October 2011, 10:28 | #12 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Agreed. Anyway, I would expect that CPUSHA only pushes the dirty cache lines to memory, i.e. 512 is just the upper bound on the number of memory accesses.
|
26 October 2011, 21:04 | #13 | |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,517
|
Quote:
CacheClearE() was also bad (bad as in it caused unnecessary performance loss): it only called CacheClearU(). Fixed today; now it uses CPUSHP if the flushed region is small enough (a megabyte or so, which prevents buggy programs from slowly flushing the whole memory space page by page...) and it also flushes only the requested cache type(s). |
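A rough sketch in Python of the strategy described above (the helper names, the 4 KB page size, and the exact 1 MB cutoff are assumptions for illustration; the real AROS code is 68k assembly): per-page CPUSHP for small regions, whole-cache CPUSHA as the fallback for large ones.

```python
# Sketch of a CacheClearE()-style size check: push individual pages
# with CPUSHP when the region is small, fall back to a single
# whole-cache CPUSHA when per-page flushing would be slower.
PAGE_SIZE = 4096
REGION_LIMIT = 1 << 20   # "a megabyte or so"

pushed_pages = []        # records per-page flushes for the demo
flushed_all = []         # records whole-cache flushes for the demo

def cpushp(page):
    # stand-in for the CPUSHP instruction (push/invalidate one page)
    pushed_pages.append(page)

def cpusha():
    # stand-in for the CPUSHA instruction (push/invalidate everything)
    flushed_all.append(True)

def cache_clear_e(addr, length):
    if length <= REGION_LIMIT:
        start = addr & ~(PAGE_SIZE - 1)          # align to page start
        for page in range(start, addr + length, PAGE_SIZE):
            cpushp(page)
    else:
        cpusha()

cache_clear_e(0x1000, 8192)    # small region: exactly two pages pushed
cache_clear_e(0, 16 << 20)     # huge region: one whole-cache flush
print(len(pushed_pages), len(flushed_all))  # 2 1
```

The cutoff is the interesting design choice: without it, a buggy program asking to flush the whole address space would crawl through it page by page instead of taking the one fast CPUSHA.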
|
27 October 2011, 08:21 | #14 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
@TheDarkCoder
Yes, pushing only dirty (written) cache lines sounds correct. Pushing all the cache lines would be rare, but so is 2-1-1-1 memory on the Amiga. I would not be surprised if cpusha bc takes over a thousand cycles on average on Amiga 68060 accelerators. @Toni Wilen Thanks. That's the kind of support that projects like the Natami need in order to make a future possible.
27 October 2011, 08:31 | #15 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
Is it related to football? |
|
27 October 2011, 11:31 | #16 |
Oldtimer
Join Date: Nov 2010
Location: VXO / Sweden
Posts: 153
|
2-1-1-1 is a way of showing the access speed of memory in burst mode.
The numbers represent the clock cycles it takes to fetch each piece of data. In the above example, the whole burst of 4 "words" takes 5 cycles: the first in 2 cycles, each following one in 1 additional cycle. 2-2-2-2 means that burst mode is no faster than non-burst mode. Old SDRAM is at best 5-1-1-1, IIRC. |
27 October 2011, 11:50 | #17 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
@matthey (or others): so, are you saying that Amiga accelerators do not have 2-1-1-1 access speed in fast RAM? What are the usual access speeds? Are the speeds of the widespread accelerators (Cyberstorm, various Blizzards, Apollo, etc.) known? |
|
27 October 2011, 18:13 | #18 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
3640 7-7-7-7 (no burst)
WarpEngine 040 4-2-2-2
Cyberstorm MKIII/PPC 5-1-1-1
Atari CT60 5-1-1-1 reads, 3-1-1-1 writes

The Natami has an SRAM cache that will likely do 3-1-1-1 or 2-1-1-1. Its regular DDR2 memory will likely be similar to the CT60 or a little better. |
|
28 October 2011, 10:05 | #19 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Thanks! :-)
Probably the numbers vary with the type of memory installed. |