English Amiga Board


Old 19 July 2019, 14:14   #21
ross
Per aspera ad astra

 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 1,823
Quote:
Originally Posted by Toni Wilen View Post
Don't attempt to use 68020+ emulation for anything accurate. It won't work.

EDIT: This is exactly the information that is missing: internal 68020 cycle usage.
Sure
WinUAE results are only for reference.
The aim is instead to see what the results are in a real machine.


Quote:
Originally Posted by grond View Post
Does the 68020 instruction cache cache instructions that are in chipmem? Data obviously can change due to the blitter and so on but code should not (but might anyway).
Of course yes.
And that's why I put the read/write speed code in an unrolled loop of less than 256 bytes.
Old 19 July 2019, 14:46   #22
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 634
Quote:
Originally Posted by ross View Post
Of course yes.
And that's why I put the read/write speed code in an unrolled loop of less than 256 bytes.
Do you know of any alignment restrictions for the 256-byte instruction cache? AFAIK the 256 bytes are organised in 16 cache lines of 16 bytes each. I assume a 256-byte loop should thus be aligned to a 16-byte boundary? If it is not, the cache will have to reload all the time.
Old 19 July 2019, 15:10   #23
ross
Per aspera ad astra

 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 1,823
Quote:
Originally Posted by grond View Post
Do you know of any alignment restrictions for the 256-byte instruction cache? AFAIK the 256 bytes are organised in 16 cache lines of 16 bytes each. I assume a 256-byte loop should thus be aligned to a 16-byte boundary? If it is not, the cache will have to reload all the time.
Yes, I should look at the manuals but from what I remember the organization of the 020 and 030 caches is different.
However, I considered the worst situation, that is the one you described, and I structured the code to stay completely in the cache.
There are 236 total bytes, divided into 3 sections: 4 (prologue, internal loop setup) + 224 (central, reads or writes) + 8 (2 dbf loops).
Even in the worst case, where the code occupies only the last two bytes of the first cache line, it still ends within the last (16th) line:
[..............xx] [14 cache lines] [xxxxxxxxxx......]
You can see it in the source.

So theoretically the code should run completely from the cache
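
The worst-case layout above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 16 x 16-byte organisation discussed at this point in the thread and word-aligned (even) start addresses, and the helper name lines_touched is made up for the illustration.

```python
LINE_SIZE = 16             # bytes per cache line (assumed 16 x 16 layout)
NUM_LINES = 16             # 16 lines x 16 bytes = 256 bytes of cache
CODE_SIZE = 4 + 224 + 8    # prologue + unrolled body + two dbf loops = 236

def lines_touched(start_offset, size, line_size=LINE_SIZE):
    """Number of cache lines covered by `size` bytes starting at `start_offset`."""
    first = start_offset // line_size
    last = (start_offset + size - 1) // line_size
    return last - first + 1

# 68k code is word aligned, so only even offsets inside a line are possible.
worst = max(lines_touched(off, CODE_SIZE) for off in range(0, LINE_SIZE, 2))
print(worst, worst <= NUM_LINES)
```

With the code starting in the last two bytes of a line (offset 14) it covers 2 + 224 + 10 bytes, i.e. exactly 16 lines, which matches the diagram. The sketch ignores any prefetch past the end of the loop.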
Old 19 July 2019, 15:18   #24
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 44
Posts: 22,943
68020 cache is 64 x 4, 68030 cache is 16 x 16.

Quote:
The aim is instead to see what the results are in a real machine.
Results are still almost worthless without knowledge of internal operation.

Different instructions can have different fetch sequences and internal idle cycles, and the result also depends on the addressing mode, on how further opcode words are prefetched, on pipeline state and refills (branches), and so on.

EDIT: in other words, you can't test just a single instruction (as you can on the 68000), because the next and previous instructions affect how the instruction under test behaves.

Even a single cycle of difference can make a 2x speed difference when accessing chip RAM.

Last edited by Toni Wilen; 19 July 2019 at 15:27.
Old 19 July 2019, 15:59   #25
ross
Per aspera ad astra

 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 1,823
Quote:
Originally Posted by Toni Wilen View Post
Results are still almost worthless without knowledge of internal operation.
Very true.

Quote:
Different instructions can have different fetch sequences and internal idle cycles, and the result also depends on the addressing mode, on how further opcode words are prefetched, on pipeline state and refills (branches), and so on.
And it is for this reason that I am trying to help the OP find the best conditions for using all the cycles available to the CPU on the internal bus.
And I suppose there is no better way than using the simplest possible instruction for a continuous fetch from memory: an aligned full speed sequence of move.l (ax)+,dx
I realize that it is perfectly useless, but one day someone will be able to make a decap of a 68020 and a perfect emulation will be possible

Quote:
EDIT: in other words, you can't test just a single instruction (as you can on the 68000), because the next and previous instructions affect how the instruction under test behaves.

Even a single cycle of difference can make a 2x speed difference when accessing chip RAM.
Or you should try all the possible combinations
(I'm just kidding)


However, thinking back to the read figure roondar reported from bustest, the 5.6MB/s does not seem random at all: 5.6*1.25=7...
It appears that the test code regularly loses 1 access cycle to the bus.
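
The ratio can be written out explicitly. The snippet below is just the arithmetic from the post, using the rounded 5.6 and 7 MB/s figures; reading the result as "one bus slot in five is lost" is an interpretation and not something measured in the thread.

```python
from fractions import Fraction

measured = Fraction(56, 10)   # 5.6 MB/s read speed reported by bustest
peak = Fraction(7)            # 7 MB/s, the round figure used in the post

used = measured / peak        # share of the peak rate actually achieved
lost = 1 - used               # share that goes missing
print(used, lost)             # 4/5 1/5
```

So the measured read speed is 4/5 of the round 7 MB/s figure, which is what you would see if the test loop regularly missed one access out of every five.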
Old 19 July 2019, 16:06   #26
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 634
Quote:
Originally Posted by Toni Wilen View Post
68020 cache is 64 x 4, 68030 cache is 16 x 16.
64 cache "lines" of 4 bytes or 4 sets of 64 bytes? In other words, would a 256-byte loop have to be aligned to long words or to 64 bytes? The latter would be pretty bad.
Old 19 July 2019, 16:34   #27
ross
Per aspera ad astra

 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 1,823
Quote:
Originally Posted by grond View Post
64 cache "lines" of 4 bytes or 4 sets of 64 bytes? In other words, would a 256-byte loop have to be aligned to long words or to 64 bytes? The latter would be pretty bad.
The MC68020/EC020 on-chip instruction cache is a direct-mapped cache of 64 long-word
entries. Each cache entry consists of a tag field (A31–A8 and FC2), one valid bit, and 32
bits (two words) of instruction data.


64 cache lines of longword entries, so granularity is even better than 030.
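
To make the mapping concrete, here is a small sketch of how a direct-mapped cache of 64 long-word entries picks an entry. It assumes A7..A2 select the entry, since the manual text quoted above gives A31..A8 as the tag. FC2 and any prefetch past the end of the loop are left out, and the function names are made up for the illustration.

```python
def entry_index(addr):
    """A7..A2 select one of the 64 long-word entries (assumed index bits)."""
    return (addr >> 2) & 0x3F

def tag(addr):
    """A31..A8 form the tag; FC2 is ignored in this sketch."""
    return addr >> 8

def conflicts(start, size):
    """True if two long words of [start, start+size) map to the same entry."""
    seen = set()
    for a in range(start & ~3, start + size, 4):   # every long word touched
        idx = entry_index(a)
        if idx in seen:
            return True
        seen.add(idx)
    return False

# The 236-byte test loop fits at any even start address ...
print(any(conflicts(s, 236) for s in range(0x1000, 0x1100, 2)))   # False
# ... while a full 256-byte loop only fits if it starts on a long word.
print(conflicts(0x1000, 256), conflicts(0x1002, 256))             # False True
```

In this model a loop that starts on an odd word touches one extra long word, so only a loop of the full 256 bytes needs long-word alignment; anything up to 254 bytes fits regardless of where it starts.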
Old 19 July 2019, 17:06   #28
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 234
Quote:
Originally Posted by grond View Post
64 cache "lines" of 4 bytes or 4 sets of 64 bytes? In other words, would a 256-byte loop have to be aligned to long words or to 64 bytes? The latter would be pretty bad.
There is no alignment constraint. Whenever the CPU fetches a new instruction word, whether it sits in the even or the odd word of a long word, it fetches the complete long word from the external bus and replaces the complete cache entry, regardless of alignment.
Old 19 July 2019, 17:16   #29
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 234
Quote:
Originally Posted by grond View Post
Does the 68020 instruction cache cache instructions that are in chipmem? Data obviously can change due to the blitter and so on but code should not (but might anyway).
The 68020 does not have a data cache, so the second question does not apply. As far as the instruction cache is concerned, there is an external input (\CDIS) which can be used to selectively disable the cache, quite similar to \CI on the 68030.

For the A1200, however, \CDIS is not connected, so the 68020 can cache instructions in chip mem on this system with the on-board CPU. That does not mean that this has to hold for all turbo-boards for this machine.

For many if not all 68030-based turbo boards, \CI is connected to logic that marks (at least) chip mem accesses and custom chip accesses as non-cacheable. Or rather, that is what it should do, bearing in mind that \CI of the 68030 has no effect on write accesses.
Old 19 July 2019, 18:06   #30
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 44
Posts: 22,943
Slightly more detailed answer:

Quote:
Originally Posted by grond View Post
64 cache "lines" of 4 bytes or 4 sets of 64 bytes? In other words, would a 256-byte loop have to be aligned to long words or to 64 bytes? The latter would be pretty bad.
The 68030 has 4 long words in each cache line, but it still has a separate valid/invalid bit for each long word. Each cache line also has a single common "base" address, whereas on the 68020 each cached long word is completely independent. This cache structure allows optional burst filling.

It also means that if a loop is almost the size of the cache, it may not fit in the cache unless it is 16-byte aligned.

68040+ needs to fill complete cache line. 68030-like partially filled cache lines are not supported.
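
The 68030 alignment effect can be shown with a toy model. The sketch assumes 16 lines selected by A7..A4, one tag (A31..A8) per line and one valid bit per long word. It has no burst fill, ignores function codes and prefetch, and simply fetches every long word of the loop in order. The class and function names are invented for the example.

```python
class ICache030:
    """Toy 68030-style instruction cache: shared tag per line, valid bit per long."""

    def __init__(self):
        self.tags = [None] * 16
        self.valid = [[False] * 4 for _ in range(16)]

    def fetch(self, addr):
        """Fetch the long word containing addr; return True on a cache hit."""
        line = (addr >> 4) & 0xF
        long_ = (addr >> 2) & 0x3
        t = addr >> 8
        if self.tags[line] == t and self.valid[line][long_]:
            return True
        if self.tags[line] != t:
            # A new tag throws away all four long words of the line.
            self.tags[line] = t
            self.valid[line] = [False] * 4
        self.valid[line][long_] = True
        return False

def misses_on_second_pass(start, size):
    """Run the loop once to warm the cache, then count misses on the next pass."""
    cache = ICache030()
    longs = range(start & ~3, start + size, 4)
    for a in longs:
        cache.fetch(a)
    return sum(not cache.fetch(a) for a in longs)

print(misses_on_second_pass(0x1000, 252))   # 16-byte aligned: 0
print(misses_on_second_pass(0x100C, 252))   # straddles 17 lines: 3
```

In this model the unaligned 252-byte loop wraps around into line 0 with a different tag, so the first and last lines of the loop keep evicting each other on every pass. The same 63 long words would all fit in a 68020-style cache, where each long word has its own tag.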
Old 19 July 2019, 18:09   #31
roondar
Registered User

 
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,241
Quote:
Originally Posted by ross View Post
WinUAE results (A1200 quickconfig, base non-expanded, CE full):
7064,51KB/s
No differences for read or write.
Ah, if only that'd be the case on the real machine... But I don't think so, bustest disagrees
Quote:
Originally Posted by ross View Post
Very true.
And it is for this reason that I am trying to help the OP find the best conditions for using all the cycles available to the CPU on the internal bus.
And I suppose there is no better way than using the simplest possible instruction for a continuous fetch from memory: an aligned full speed sequence of move.l (ax)+,dx
I realize that it is perfectly useless, but one day someone will be able to make a decap of a 68020 and a perfect emulation will be possible
Much appreciated
That said, won't a movem.l do better (even in cache)?

Thinking out loud, I'd say that you might want to 'interleave' any non-memory accessing instructions with any that do access memory. Might help alleviate the 'half speed' bus. Not that you need that many of those non-memory access instructions for copying data
Quote:
Or you should try all the possible combinations
(I'm just kidding)
Given 64 instructions or so of cache space and assuming, oh I don't know... Let's say a thousand or so different instruction/EA combinations you'd only need on the order of 1000^64 tests. Eminently reasonable
Quote:
However, thinking back to the read figure roondar reported from bustest, the 5.6MB/s does not seem random at all: 5.6*1.25=7...
It appears that the test code regularly loses 1 access cycle to the bus.
Bustest runs while the OS is running and leaves the screen active. It doesn't appear to shut down interrupts either. This might also affect results (it reports 6.9MB/s for writing rather than 7MB/s for instance).
Old 22 July 2019, 10:19   #32
zero
Registered User

 
Join Date: Jun 2016
Location: UK
Posts: 326
There are other factors besides the CPU type. Not all 68030 accelerators are equal!

I'm not sure of the exact reason but it must have something to do with the way the CPU interface to the chip memory bus is implemented.
Old 22 July 2019, 10:36   #33
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 46
Posts: 3,501
Quote:
Originally Posted by Toni Wilen View Post
It also means that if a loop is almost the size of the cache, it may not fit in the cache unless it is 16-byte aligned.
My tests on a B1230-IV have shown that the loop always fits in the cache if it is 240 bytes or less. Aligning it to 16 bytes doesn't help much; it can only reach 242 bytes that way. I suppose this is linked to the way the cache is filled during code execution.
Old 22 July 2019, 10:51   #34
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 44
Posts: 22,943
Quote:
Originally Posted by meynaf View Post
My tests on a B1230-IV have shown that the loop always fits in the cache if it is 240 bytes or less. Aligning it to 16 bytes doesn't help much; it can only reach 242 bytes that way. I suppose this is linked to the way the cache is filled during code execution.
Do you get different results if you end your loop with an unconditional branch? It should keep the CPU from unnecessarily prefetching opcodes after the branch.

Quote:
Originally Posted by zero View Post
There are other factors besides the CPU type. Not all 68030 accelerators are equal!

I'm not sure of the exact reason but it must have something to do with the way the CPU interface to the chip memory bus is implemented.
One difference is sync vs async. Sync = the board is clocked by the Amiga's main 28.somethingMHz clock crystal. Async boards have a separate on-board clock crystal.

Synchronous boards generally can access any free slot, async boards can miss some free slots.
Old 22 July 2019, 11:14   #35
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 46
Posts: 3,501
Quote:
Originally Posted by Toni Wilen View Post
Do you get different results if you end your loop with an unconditional branch? It should keep the CPU from unnecessarily prefetching opcodes after the branch.
I didn't try this explicitly, but I would bet the 68030 starts fetching long before the loop instruction is decoded.
Old 22 July 2019, 13:46   #36
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 634
Quote:
Originally Posted by meynaf View Post
I didn't try this explicitly, but I would bet the 68030 starts fetching long before the loop instruction is decoded.
I find your test results very interesting! And yes, it seems very plausible that the 030 trashes an entire cache line just by fetching one unneeded instruction after the branch at the end of the loop. I would assume that if your code is a few bytes smaller than the 240 bytes, aligning it to a 16-byte boundary can make the difference between running from cache or not.
Old 22 July 2019, 18:31   #37
zero
Registered User

 
Join Date: Jun 2016
Location: UK
Posts: 326
Quote:
Originally Posted by Toni Wilen View Post
One difference is sync vs async. Sync = the board is clocked by the Amiga's main 28.somethingMHz clock crystal. Async boards have a separate on-board clock crystal.

Synchronous boards generally can access any free slot, async boards can miss some free slots.
Ah, that's probably it. It was a long time ago but I was developing the MHI driver for DCR's parallel port device when I first noticed it. Some 030s would barely see any noticeable load, some would be very slow.

The 50MHz Blizzard I was using for primary development was one of the fast ones. I don't remember what the slow one for the A1200 was... I had a slow one for the Amiga 600 too, which I think was the one that ended up in my car as a head unit.
Old 22 July 2019, 21:59   #38
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 914
Quote:
Originally Posted by zero View Post
I had a slow one for the Amiga 600 too, which I think was the one that ended up in my car as a head unit.
You have an Amiga-based head unit in your car?
Old 23 July 2019, 09:49   #39
zero
Registered User

 
Join Date: Jun 2016
Location: UK
Posts: 326
Quote:
Originally Posted by hooverphonique View Post
You have an Amiga-based head unit in your car?
Used to, many years ago.
 

