English Amiga Board    


Old 24 January 2008, 17:29   #1
meynaf
68k wisdom
 
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
68030 pipelines

Perhaps I can introduce people here to something weird I've found.

Tests done on a 50 MHz '030 (Blizzard 1230-IV) with 60ns memory, no databurst.

You know that the cpu can continue executing instructions after a memory write without waiting for the write to complete. But it won't do that for reads, of course. Adjusting memory reads will do no good, eh ? Well, let's see...

Provided that a0 is somewhere in fastmem, and a1 goes in chipmem (no mystery if in fastmem), can someone explain why this :
Code:
.loop
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l d1,d0
 move.l d1,d0
 move.l d3,(a1)+
 subq.l #1,d7
 bne.s .loop
isn't slower than this :
Code:
.loop
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l d3,(a1)+
 subq.l #1,d7
 bne.s .loop
(of course that code is useless, it's just to show the mystery)
__________________
He who insults the other in a discussion is the one who's wrong.
Old 24 January 2008, 17:45   #2
Wepl
Moderator
 
Join Date: Nov 2001
Location: Germany
Posts: 504
It could be that the write to chip mem is the thing that limits the execution speed here, so the loop uses all possible write cycles to chip mem. The time left over can be filled with any instructions.
Old 24 January 2008, 17:51   #3
meynaf
Of course it's the write to chipmem which messes up things here.
But what exactly are "all possible write cycles to chip mem" ?
Old 24 January 2008, 20:03   #4
ganralf
Registered User
 
Join Date: May 2006
Location: Germany
Posts: 97
IIRC the 68030 can only continue execution when there's no pending write on the same cache line. On the first loop it can continue, but on the second it needs to wait for the first move to chip mem to finish (provided chip mem is slow enough). On the third loop it waits for the second move, and so on. The waits seem to give the extra time used to execute the additional moves.
If I'm right, it should still be a bit slower, as there is no need to wait when you cross a cache line.
Old 25 January 2008, 10:47   #5
meynaf
Quote:
Originally Posted by ganralf
IIRC the 68030 can only continue execution when there's no pending write on the same cache line. On the first loop it can continue, but on the second it needs to wait for the first move to chip mem to finish (provided chip mem is slow enough). On the third loop it waits for the second move, and so on. The waits seem to give the extra time used to execute the additional moves.
If I'm right, it should still be a bit slower, as there is no need to wait when you cross a cache line.
But when using fastmem instead of chipmem, a read right after a write stalls the pipeline as expected (and the addresses are very different so they're not in the same cache lines).
The two timings of the code I gave are exactly identical... and longer than what one single move in chipmem would be.

Furthermore, if you add a few more moves, you'll get it very slightly slower, but nothing compared to the real timing of the moves.
Old 25 January 2008, 17:17   #6
Wepl
Quote:
Originally Posted by meynaf View Post
Of course it's the write to chipmem which messes up things here.
But what exactly are "all possible write cycles to chip mem" ?
I don't know it exactly
My understanding is that your cpu is running at 50 MHz but the memory is much slower, so it can only perform a memory access every 30 cycles (for example). Therefore your loop will wait for the memory access each time. You can also fill this wait time with some other instructions without changing the speed of the loop.
Old 25 January 2008, 17:29   #7
meynaf
Quote:
Originally Posted by Wepl
I don't know it exactly
My understanding is that your cpu is running at 50 MHz but the memory is much slower, so it can only perform a memory access every 30 cycles (for example). Therefore your loop will wait for the memory access each time. You can also fill this wait time with some other instructions without changing the speed of the loop.
Are you telling me the chipmem is so slow that the cpu has enough time to take a coffee (well, execute a bunch of instructions) before a new memory access can occur ?

Knowing what happens exactly (cycle by cycle) would help me optimize some code ; how to exactly pipeline things when accessing the chip mem.
Apparently no-one knows...
Old 26 January 2008, 09:55   #8
TheDarkCoder
Registered User
 
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 114
@meynaf

I don't know the exact timing. Try to measure how long a sequence of writes to chip takes, compared to the same sequence writing to fast.
Anyway, I completely agree with Wepl and ganralf; their explanation of the "mystery" is the right one.
Old 26 January 2008, 16:35   #9
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
Could you (meynaf or someone else with a Blizzard 1230-IV at 50MHz) test the following code sequences please and tell how long they take?

; should run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 subq.l #1,d7
 bne.s .loop
; should run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 subq.l #1,d7
 bne.s .loop
; should run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop
; should run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop
; might run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop
; should not run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop
; might run at full speed
Code:
 
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 subq.l #1,d7
 bne.s .loop
; might run at full speed
Code:
 
.loop
 
 move.l (a0)+,d0
 move.l d3,(a1)+
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 subq.l #1,d7
 bne.s .loop
Performance figures from these tests will make it easier to guess how fastmem/chipmem accesses interact on the bus.
Old 28 January 2008, 10:01   #10
meynaf
I can't do that right now, but be sure I will (you should get the result on next monday 'coz I don't access my Miggy during the week). I'll give you the number of clock cycles for each version.

What I expect is that (but you never know) :
- test #1 will run at full speed (tested long ago)
- test #2 should also run at full speed
- test #3 won't (can't pipeline the read)
- test #4 speed may be identical to test #3 (unsure)
- test #5 slower than test #4 but unsure
- test #6 twice the time from test #1
- test #7 ???
- test #8 ???
Old 28 January 2008, 12:23   #11
Kalms
(The text below assumes that all DMA is turned off, and that MOVEs to/from memory take 0 cycles, except for the time spent waiting for the memory accesses themselves.)

Ok, this is how I think it works:

All writes go through the Write Pending Buffer (see 68030UM, table 6-1, and section 11.2.5.2).
When the CPU wants to write a value, it first waits until the Write Pending Buffer is empty, and then it dispatches the value to the buffer.
If the Write Pending Buffer is full, then it also means that the bus controller is busy (it is performing a memory write).
When the CPU wants to read a value, it first waits until the bus controller is available, and then it performs a read operation.

The CPU runs at a core clock frequency of 50MHz, so the cycle time for the CPU is 20ns.
The chipset has a clock frequency of ~3.57MHz, which yields a cycle time of 280ns.
Therefore, 1 chipset bus cycle ("bus cycle") == 14 CPU cycles ("cycles").
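
As a quick sanity check of that ratio, here is the arithmetic as a small Python sketch. The constant names are made up for illustration, and 3.57 MHz is the rounded figure quoted above (the exact chipset clock differs slightly between PAL and NTSC machines).

```python
CPU_HZ = 50_000_000      # 50 MHz 68030 on the accelerator board
CHIPSET_HZ = 3_570_000   # ~3.57 MHz chipset clock (rounded figure from the post)

cpu_cycle_ns = 1e9 / CPU_HZ        # length of one CPU cycle in ns
bus_cycle_ns = 1e9 / CHIPSET_HZ    # length of one chipset bus cycle in ns
cpu_cycles_per_bus_cycle = bus_cycle_ns / cpu_cycle_ns

# Rounded to whole numbers: 20 ns, 280 ns, 14 CPU cycles per bus cycle
print(round(cpu_cycle_ns), round(bus_cycle_ns), round(cpu_cycles_per_bus_cycle))
```

The unrounded ratio is a little over 14, which is one reason to expect measured timings that are not exact integers.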

Now, the CPU and the chipset run pretty much asynchronously. When the CPU wants to write to chipmem, the bus controller will place a write request onto the bus, and then the chipset will acknowledge the write at some point in the future. The CPU has no idea/expectation of how long the write will take.
Since the CPU peaks at 1 chipmem-write per 28 CPU-cycles, and it is not possible to squeeze in any fastmem reads between two chipmem writes, it seems that the bus controller needs to spend two full bus cycles on the chipmem write.
To be a bit more precise: the chipset will look for a write request at the beginning of an even bus cycle, and it will complete the request at the end of an odd bus cycle.
The above gives the following scenario for a write:
1) the bus controller places a write request onto the bus.
2) 0..27 cycles pass, until the beginning of an even buscycle occurs.
3) 28 cycles pass, during which the chipset is servicing the write request.
4) the write is acknowledged to the bus controller.
If writes are chained back-to-back, then the delay in step 2 will be exactly 0 cycles. A gap of less than 28 cycles between each instruction that writes to memory ensures max throughput. However, if the gap is 30 cycles, then the code will start missing some of the available buscycles.
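
The four steps above can be sketched in Python for a single, isolated write. This is only a model of the scenario as described, not measured behaviour; the function name and the convention that time 0 is the start of an even buscycle are assumptions made for the example.

```python
SLOT = 28  # CPU cycles per chipmem write (two 14-cycle buscycles)

def isolated_write_latency(request_time):
    """Cycles from placing the request on the bus until it is acknowledged.

    request_time is in CPU cycles, counted from the start of an even buscycle.
    """
    wait_for_even_buscycle = (-request_time) % SLOT  # step 2: 0..27 cycles
    return wait_for_even_buscycle + SLOT             # step 3: 28 cycles of service

for t in (0, 1, 14, 27):
    print(t, isolated_write_latency(t))
```

A request that arrives exactly on an even buscycle boundary costs 28 cycles; one that arrives a cycle late costs 27 + 28 = 55.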

So, what does this mean in practice? well:
Code:
 move.l d0,(a1)+
 move.l d0,(a1)+
... when the CPU completes the above sequence, the second memory write has just been dispatched to the bus controller. Because the two writes are back-to-back (less than 28 cycles apart), the second write had to wait for the first write to relinquish the bus before it requested its own write. Therefore, you know that the chipset bus is at the beginning of an even cycle right now, and that the bus controller will be busy for the next 28 cycles.
This is a good time to do any sort of work that only touches registers and caches.
If you perform a fastmem access, it will wait until the bus is available. That is, until the end of the current 28-cycle period.
The fastmem access itself will cause 3 cycles of bus activity.
If you make a chipmem write directly after the fastmem access, then the CPU instruction will complete immediately, but it will take (28 - 3) cycles until the next even buscycle is begun, and another 28 cycles until the following buscycle has completed; therefore, the bus controller will be busy for 25+28 cycles. Don't access memory during those cycles, or you will stall the CPU.
If you make two chipmem writes directly after the fastmem access, this will be even more obvious: the second write will stall until the end of the 25+28 cycle period, and then the bus will be busy for 28 cycles after the second chipmem write instruction has finished execution.

So, what to do?
- group chipmem writes together, ensure that they are less than 28 cycles apart
- do CPU work after each chipmem write
- if you need to touch fastmem, touch several memory locations at once; you should be able to squeeze in 2 or 3 fastmem reads during a 28-cycle period

Ah, and you need to take into account the execution times of the instructions as well. I think that a MOVE.L d0,(a0)+ takes 3 cycles, and a MOVE.L (a0)+,d0 (from fastmem) takes 8/9 cycles (for 60ns/70ns memory).
Old 28 January 2008, 13:25   #12
meynaf
I suspected something like that, however when I wanted to check it with tests I discovered that the execution times were not multiples of 28 cycles...

Also, I've found that the fastmem does not exhibit such behaviors.
Old 28 January 2008, 17:23   #13
Kalms
If you have code that writes less often than once every 28 cycles, like the loop below:

Code:
    ; write to chipmem
    ; write to chipmem
    ; wait 32 cycles
    ; write to chipmem
    ; wait 32 cycles
    ; write to chipmem
    ; wait 32 cycles
    ; write to chipmem
    ; wait 32 cycles
    ; write to chipmem
    ; wait 100 cycles
... then you will get the following behaviour:


The first write will not stall at all.

The second write will stall for (0..27) + 28 cycles, as explained in the previous post. CPU is now aligned with the chipbus; chipbus is at the beginning of an even buscycle.

The chipbus immediately accepts the CPU's write. CPU waits for 28 cycles. The chipbus transaction completes. CPU waits for another 4 cycles. The chipbus is now 4 cycles into an even buscycle. The bus interface (and Write Pending Buffer) is free.

The third write will not stall, since the bus interface is idle. The write goes into the Write Pending Buffer.

CPU waits for 24 cycles. The chipbus accepts the CPU's write, and starts processing it. CPU waits for another 8 cycles. The chipbus is now busy, and it's 8 cycles into an even buscycle.

The fourth write will stall for 20 cycles, waiting for the write buffer to become available. The fourth write will behave like the second write (align with the chipbus, etc).

The fifth write will not stall.

The sixth write will stall for 20 cycles.

Etc.

On the chipbus, it will look as if the following buscycles were being used for writing:

-Y-Y-N-Y-Y-N-Y-Y

Y = yes, N = no
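
The walkthrough above can be replayed with a small Python simulation. It is a sketch of this model only, under the stated assumptions: one pending-write buffer that frees up when the chipset finishes the write, a write instruction that costs 0 cycles, and a chipset that only starts a write on a 28-cycle boundary. The function and variable names are invented for the example, and the simulation starts in the aligned state (i.e. at the second write of the listing).

```python
SLOT = 28  # CPU cycles per chipmem write slot (two 14-cycle buscycles)

def simulate_padded_writes(padding, count=1001):
    """Model a loop of 'write to chipmem, then `padding` cycles of other work'.

    Returns (average cycles between writes, list of stall cycles per write).
    """
    cpu_time = 0         # time at which the CPU reaches the next write
    buffer_free_at = 0   # time at which the pending write has completed
    dispatch_times = []
    stalls = []
    for _ in range(count):
        dispatch = max(cpu_time, buffer_free_at)  # stall while the buffer is full
        stalls.append(dispatch - cpu_time)
        start = -(-dispatch // SLOT) * SLOT       # next slot boundary (ceiling)
        buffer_free_at = start + SLOT             # write completes 28 cycles later
        cpu_time = dispatch + padding             # do the padding work
        dispatch_times.append(dispatch)
    average = (dispatch_times[-1] - dispatch_times[0]) / (count - 1)
    return average, stalls

average, stalls = simulate_padded_writes(32)
print(average, stalls[:6])
```

With 32 cycles of padding the stalls alternate between 0 and 20 cycles and the average works out to 42 cycles per write, i.e. 2/3 of the 28-cycle peak. With 20 cycles of padding the same model gives 28 cycles per write.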


If the above model is correct, then the padding will occasionally (not always) cause chipbus cycles to be missed:

* for 0..28 cycles of padding, loop should run at full speed
* for 29..42 cycles of padding, loop should run at 2/3 speed
* for 43..56 cycles of padding, loop should run at 1/2 speed

etc. So each block of 14 cycles of padding makes the code lose out on one chipmem slot.


So one thing which would be interesting to test is to see, when adding 28 -> 30 -> 32 -> 34 cycles of padding, whether or not the execution time rises linearly or if there is a discrete jump.

If there is a linear rise, it might be because there is a write buffer between the CPU's bus interface and the chipmem interface on the Blizzard board which we're not taking into account.





Regarding fastmem: it is not connected to the chipbus at all, and therefore all the sync-to-buscycle rules etc. do not apply to it. Think of it as if the CPU's bus interface is a 3-way connection, which ties together CPU, fastmem and chipbus interface. The CPU can communicate with one of the two memory subsystems at any given time. Chipmem runs on a slow clock, but fastmem runs unclocked with 60ns access cycles.
Old 28 January 2008, 17:34   #14
meynaf
I can tell you I have observed something that looked like a discrete jump.

So, if I understand well, to optimise for chipmem accesses it's better if you fit non-chipmem accesses in blocks of 14 cpu cycles ? And the overall time will be a multiple of 14 ?

Hmmm... looks too simple. I'll perform more tests this week-end...
Old 29 January 2008, 11:16   #15
Kalms
I did some timing tests on Blizzard 1260 today. See attached files.

When the 060 runs with Store Buffer disabled, there is no decoupling mechanism for writes. The CPU pipeline waits until the bus transfer has completed.

Pages 4-5 of the attached document exhibit the 'staircasing' phenomenon quite well.

When the 060 runs with the Store Buffer enabled, there is a series of 4 buffers decoupling CPU from bus interface. In most cases, these buffers hide the staircasing. Right now, I'm not 100% sure how. (Pages 1-3 in the document)

The 030 should correspond to an 060 with a small Store Buffer (just 1 entry). It might be that the 030's single buffer is enough to hide most of the staircasing, too. I have no clear explanation for the phenomena right now.
Attached Files
File Type: pdf Blizzard1260ReadWriteStalls2.pdf (37.0 KB, 98 views)
Old 29 January 2008, 12:02   #16
meynaf
I can't imagine the hell it must be with the 040 copyback cache...
Old 04 February 2008, 10:28   #17
meynaf
Tests done !

None of them gives integer values
('xcept my reference test)

Code:
; should run at full speed (it does)
.loop
 move.l d3,(a1)+
 subq.l #1,d7
 bne.s .loop            ; 28.76 -> 28.33

; should run at full speed (it does too)
.loop
 move.l d3,(a1)+
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 move.l d0,d1
 subq.l #1,d7
 bne.s .loop            ; 28.76 -> 28.33

; should run at full speed (doesn't !)
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop            ; 57.50 -> 56.65

; should run at full speed (doesn't)
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop            ; 58.78 -> 57.91

; might run at full speed (oh no !)
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop            ; 75.00 -> 73.89

; should not run at full speed (obviously it doesn't...)
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
 subq.l #1,d7
 bne.s .loop            ; 75.18 -> 74.07

; might run at full speed (doesn't)
.loop
 move.l d3,(a1)+
 move.l (a0)+,d0
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 subq.l #1,d7
 bne.s .loop            ; 58.62 -> 57.75

; might run at full speed (doesn't)
.loop
 move.l (a0)+,d0
 move.l d3,(a1)+
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 move.l d1,d2
 subq.l #1,d7
 bne.s .loop            ; 45.36 -> 44.69

; added as a reference (28 real clock cycles) :
.loop
 move.l d1,d2
 move.l d1,d2
 lsr.w #1,d4
 lsr.w #1,d4
 lsr.w #1,d4
 lsr.w #1,d4
 subq.l #1,d7
 bne.s .loop            ; 28.42 -> 28
Note : the test gave clock cycles * 1.015 (1.5% for interrupt/os cpu use). I give both values : original and recomputed one.
The timings are very stable, the little differences always show up the same way.
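
For anyone who wants to redo the correction, here is the recomputation as a short Python sketch. The measured values are the ones listed in the code comments above; the dictionary keys simply number the tests in the order they were posted.

```python
OVERHEAD = 1.015  # measured figure = real cycles * 1.015 (interrupts / OS)

measured = {
    "test 1": 28.76, "test 2": 28.76, "test 3": 57.50, "test 4": 58.78,
    "test 5": 75.00, "test 6": 75.18, "test 7": 58.62, "test 8": 45.36,
    "reference": 28.42,
}

# Divide out the 1.5% overhead and round to two decimals
corrected = {name: round(value / OVERHEAD, 2) for name, value in measured.items()}

for name, value in corrected.items():
    print(f"{name}: {value:.2f}")
```

This reproduces the second number given on each line (28.33, 56.65, 57.91, and so on), with the reference loop coming out at 28.00.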
Old 04 February 2008, 12:07   #18
Kalms
Ok. First off, one possible reason why the tests aren't showing integer results is that if you have any system stuff running at all, then the CPU will lose a few of its buscycles to the chipset. Also, if the accelerator board's frequency is not an *exact* multiple of the bus frequency, that will mean that the number of buscycles between chipwrites varies (for instance: 99% of writes take 28 cycles, 1% of writes take 29 cycles, after quantization to integer CPU cycles).

That being said, if we recalculate the figures you gave into units of 28 cycles (one chipmem write slot, i.e. two 14-cycle buscycles):

28 -> 1
56 -> 2
58 -> 2.07
74 -> 2.64
45 -> 1.60

That looks like pretty good quantization to me. I'm hoping to get hold of a 1230 board tonight (not Blizzard but it should do), and I'll post my results later on.
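
The same conversion can be redone from the corrected timings rather than the rounded integers. A small Python sketch, with variable names chosen here for illustration:

```python
WRITE_SLOT = 28  # CPU cycles per back-to-back chipmem write

# Corrected timings from the test results (the five distinct cases listed above)
corrected_cycles = [28.33, 56.65, 57.91, 73.89, 44.69]

in_slots = [round(cycles / WRITE_SLOT, 2) for cycles in corrected_cycles]
print(in_slots)
```

The results land within a couple of hundredths of the figures in the list.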
Old 04 February 2008, 12:18   #19
meynaf
The OS / interrupts sure take some clock cycles, but it's very constant (1.5%) so I gave the recomputed value.

I also suspected the clocks (chip & cpu) to not be a multiple of each other.
But those little steps of less than 1 clock cycle are quite weird IMO.

If you get a 030 I'm sure you're in for some surprises in the timings
Old 05 February 2008, 03:39   #20
Kalms
Ok, I received the Microbotics 1230 board tonight. I've attached results in .odf format.

There is a write buffer in the 68030 by design (see 68030UM, table 6-1). If a write is requested and the write buffer is empty, then the write operation gets deferred to the write buffer and the CPU core continues executing instructions happily. If a read/write is requested and the write buffer is nonempty, then the read/write operation stalls until the write buffer is emptied.

I think that there is a second buffer with the same semantics on the MX1230 board (and probably on board the Blizzard 1230 as well). With fewer than two buffers between the CPU's instruction pipeline and chipram, there should be staircasing when doing straight chipram writes (Sheet4).

More analysis coming later.
Attached Files
File Type: zip MicroboticsMX1230ReadWriteStalls.zip (28.2 KB, 60 views)
Old 12 February 2008, 20:59   #21
Kalms
So far, I haven't been able to think up a buffer configuration that would give the expected behaviour. With that said, there is still a lesson or two that can be drawn from the graphs:

* Chained chipmem writes peak at 1 write per 2 buscycles. If there are more than 2 buscycles of non-bus work in between each chipmem write, the write performance scales accordingly (no staircasing).
* A chipmem write closely followed by a fastmem read causes quantization (staircasing). Quantization causes delays in integer multiples of buscycles. Therefore, the following code between two chipmem writes:

move.l (a0),d0 ; 10c on MX1230

... will cause on average one buscycle to be missed.

So that code "takes" the same amount of time as the following code:

move.l (a0),d0 ; 10c on MX1230
move.l d0,d1 ; 2c
move.l d0,d1 ; 2c

Whereas this code (dcache turned off):

move.l (a0),d0 ; 10c on MX1230
move.l 4(a0),d0 ; 11c? on MX1230

would performance-wise be equivalent to a piece of code where the second move is replaced with pure register-register operations.

So to utilize the bus efficiently, make sure to overlap the buswrite portion of chipwrites with reg-reg work, and if you do fastmem access closely after a chipwrite (say less than 28c) expect quantization and aim for a multiple of 14c of work between start of the first fastmem access and start of the next chipwrite.
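
As a rough illustration of the "multiple of 14c" rule of thumb, the Python sketch below works out how many cycles of reg-reg work would fit before the next buscycle boundary. The helper name is invented for the example, the 14-cycle figure assumes a 50 MHz CPU, and the cycle counts are the MX1230 figures quoted above.

```python
BUSCYCLE = 14  # CPU cycles per chipset buscycle at 50 MHz

def padding_to_next_buscycle(work_cycles):
    """Cycles of extra work that fit before the next buscycle boundary."""
    return (-work_cycles) % BUSCYCLE

print(padding_to_next_buscycle(10))       # one 10c fastmem read
print(padding_to_next_buscycle(10 + 11))  # two fastmem reads, 10c + 11c
```

For the single 10c read this gives 4 cycles, which is exactly the two 2c register moves in the example; for the two reads (21c) it gives 7 cycles.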

Cluster multiple chipwrites together and multiple fastmem reads together, that way you minimize the number of synchronization points.

Again, the above is not 100% accurate, but it's the best I can do for now.
Old 13 February 2008, 10:02   #22
meynaf
Thanks for the explanations

What I can add is that if you have 60ns fastmem, move.l (a0),d0 is only 8c and drops to 4 (or 5 ?) if in data cache.
Also, my tests gave 2 more cycles for instructions such as move.l 4(a0),d0. Those two cycles can be done during the previous write.

I still can't predict the time some code can take, but I'll know in which direction to go - thanks again.

At least we know why Toni didn't implement cycle exact emulation for anything but a 68000
Old 13 February 2008, 12:12   #23
Kalms
Do note that the "after fastread, try to consume a multiple of 14 cycles" is a very fragile strategy (depends on CPU model and MHz). Clustering, and overlapping nearly all reg-reg work with the chipwrites, is much more solid.

Most 030 boards have good throughput to chipram (close to 100% efficiency, ~7MB/s write speed). Some common 060 boards (most notably the Blizzard 1260) do not, however; the B1260 gets about 5.7MB/s (35 cycles per write on average). This is with all store buffers enabled and it seems to be a design mistake with the card itself.
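
Those throughput figures follow directly from the cycles per write. A minimal Python sketch, assuming both boards run the CPU at 50 MHz, longword (4-byte) writes, and 1 MB = 10^6 bytes; the function name is made up for the example:

```python
def chip_write_mb_per_s(cycles_per_write, cpu_hz=50_000_000, bytes_per_write=4):
    """Longword write throughput in MB/s (1 MB = 10**6 bytes here)."""
    writes_per_second = cpu_hz / cycles_per_write
    return writes_per_second * bytes_per_write / 1e6

print(round(chip_write_mb_per_s(28), 2))  # 28 cycles per write (030 boards)
print(round(chip_write_mb_per_s(35), 2))  # 35 cycles per write (B1260 average)
```

This gives roughly 7.14 MB/s and 5.71 MB/s, in line with the ~7MB/s and 5.7MB/s quoted.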

040/060 always fetch whole cachelines from fastram. If you do one aligned read, the CPU will stall until the first longword has been read; then, it will fetch the next three in the background. This means that a single fastmem access will always take more than 14c (so will cause two buscycles to be missed) on such CPUs. If you want to convince the CPU to just fetch a longword at a time, you have to disable cache or modify MMU tables, and that is hardly useful.

68030 with DBURST on behaves in the same way as 040/060 with regard to reads... on some boards. On others it is simply ignored. (On B1230-IV yes, on MX1230 no) Consider using dburst, it will help performance when you're not jumping around randomly in memory.
Why? If you have code like:
Code:
 move.l (a0)+,d0
 <some reg-reg work>
 move.l (a0)+,d0
 move.l (a0)+,d0
 move.l (a0)+,d0
... and DBURST is enabled, and a0 is 16-byte-aligned, then the first memory access will stall the same amount of time as a non-DBURST read would take, and the latter three memory accesses will be datacache hits. You just need to be more careful about when you hit memory (bus saturation happens more easily, especially with writes that always go onto the bus).
Old 13 February 2008, 12:14   #24
Kalms
You mention that on your machine, MOVE.L (a0),d0 is 8c on datacache miss and 4/5c on datacache hit. Let's break that figure down a bit.
According to chapter 11.6.6, MOVE EA,Dn begins with a "Fetch Effective Address" EA calculation. Chapter 11.6.1 gives the FEA time for (An) as 3 cycles (1 heads, 1 tails). Back to 11.6.6 again, the MOVE itself is 2c (0 heads, 0 tails).

Now let's assume: 0 wait states memory, no datacache hit. Combine EA and operation time together, and the final MOVE will be 5c (1 heads, 0 tails).

What about if there are wait states? Chapter 11.5 describes that. Put briefly, N wait states will cause a simple read to add N cycles to the CC-time and the tail figure for the EA calculation. So, with 3 wait states, the FEA time for (An) is 6 cycles (1 heads, 4 tails). Combining EA and operation time, this gives a total execution time of 8c (1 heads, 0 tails).

====> MOVE.L (An),Dn takes 5c + wait states on datacache miss.

====> your accelerator board has 3 wait states, mine has 5.

3 wait states is good, that's exactly 60ns (theoretical minimum given the memory type installed). The 60/70ns switch on the B1230 switches between 3 and 4 wait states.

Now let's look at datacache hits. Chapter 11.4 describes what happens. FEA for (An) before adjusting for datacache hit is 3 cycles (1 heads, 1 tails). After adjusting (rule 1b) FEA for (An) is 2 cycles (1 heads, 0 tails). Combining EA and operation time gives 4 cycles (1 heads, 0 tails).

=====> MOVE.L (An),Dn takes 4c on datacache hit.
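
The breakdown above boils down to a couple of additions. Here it is as a Python sketch; the constants are the figures quoted from the manual in this post, while the function and constant names are just for illustration.

```python
FEA_AN = 3       # Fetch Effective Address for (An): 0 wait states, cache miss
FEA_AN_HIT = 2   # same, after the datacache-hit adjustment
MOVE_EA_DN = 2   # MOVE EA,Dn operation time

def move_l_an_dn_cycles(wait_states=0, cache_hit=False):
    """Estimated cycles for MOVE.L (An),Dn under the rules quoted above."""
    if cache_hit:
        return FEA_AN_HIT + MOVE_EA_DN          # no bus access, no wait states
    return FEA_AN + wait_states + MOVE_EA_DN    # each wait state adds one cycle

print(move_l_an_dn_cycles(3))               # 3 wait states
print(move_l_an_dn_cycles(5))               # 5 wait states
print(move_l_an_dn_cycles(cache_hit=True))  # datacache hit
```

That gives 8c, 10c and 4c respectively, matching the 8c and 10c figures reported for the two boards and the 4c datacache-hit figure.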

This is all quite messy. I'm glad I have moved over to other CPU generations/architectures.
Old 13 February 2008, 12:29   #25
meynaf
Quote:
Originally Posted by Kalms View Post
This is all quite messy. I'm glad I have moved over to other CPU generations/architectures.
What are those "other CPU generations/architectures" ?
If I assume you meant more recent ones, then they have even more unpredictable timings.
Old 13 February 2008, 12:58   #26
Kalms
RISC based architectures like 68060, SH4, MIPS, PPC, PS2-VU, Cell SPEs etc.

I find it to be easier to memorize a 6-7 stage execution pipeline and the function of each stage, than to memorize page after page of instruction execution timings.

But then again, it's all about memory access patterns on most modern machines, and the only way to be sure there is to profile.
Old 13 February 2008, 13:35   #27
meynaf
I'm not sure that knowing the timings on a RISC cpu is very useful, as they're built to run compiled code.
I've even read that on a cpu such as the PPC, the compiler will do a better job than the ASM programmer (hard to believe though).
__________________
He who insults the other in a discussion is the one who's wrong.
meynaf is offline   Reply With Quote
Old 13 February 2008, 14:19   #28
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
The 68060 is internally a RISC. Does that mean that it is built to run compiled code, and such lowlevel arcana as instruction execution timing is hardly useful when writing code for that chip?

I could well imagine that a PPC compiler will do a better job than an _average_ assembly programmer. However, a _good_ programmer should be able to outperform the compiler simply because he knows more about the input data than what the compiler does... and there is rarely any way of hinting this knowledge to the compiler.

Besides, compilers are normally geared for ordinary CPU/memory models. Some RISC CPUs (most notably the PS2 VUs) have instruction sets which simply don't map well to C code. Others (PS3 SPEs) have strange stalls which you only learn about by reading the processor manual. These architectures share the trait that the designers of the processors did not want to spend enough time/transistors on making the CPUs run arbitrary C code well, and it is too hard for the compiler writers to make a really good compiler for the architectures.

A good compromise these days is to learn how to _read_ assembly for the platforms you're working on, and if the need arises, either re-shape your C code such that it matches better with the assembly instructions on the platform, or use compiler intrinsics (these are special functions, supplied by the compiler manufacturer, which translate directly into one assembly instruction each, like int __addlong(int x, int y) => ADD.L)
Old 13 February 2008, 15:00   #29
meynaf
Quote:
Originally Posted by Kalms View Post
The 68060 is internally a RISC. Does that mean that it is built to run compiled code, and such lowlevel arcana as instruction execution timing is hardly useful when writing code for that chip?
The 68060 is seen by the programmer as a CISC. It has a CISC ISA. I don't think its timings are as regular as 1 cycle per instruction, are they ?
If you consider the 68060 as a RISC, then any cpu (read : x86) nowadays is a RISC.

Quote:
Originally Posted by Kalms View Post
I could well imagine that a PPC compiler will do a better job than an _average_ assembly programmer. However, a _good_ programmer should be able to outperform the compiler simply because he knows more about the input data than the compiler does... and there is rarely any way of hinting this knowledge to the compiler.
Asm rulez forever then
Well, who's gonna write a lot of asm on a ppc anyway...

Quote:
Originally Posted by Kalms View Post
Besides, compilers are normally geared for ordinary CPU/memory models. Some RISC CPUs (most notably the PS2 VUs) have instruction sets which simply don't map well to C code. Others (PS3 SPEs) have strange stalls which you only learn about by reading the processor manual. These architectures share the trait that the designers of the processors did not want to spend enough time/transistors on making the CPUs run arbitrary C code well, and it is too hard for the compiler writers to make a really good compiler for the architectures.
I don't know a thing about the architectures you describe here. Do you have some links about them so I can look for more info ?

Quote:
Originally Posted by Kalms View Post
A good compromise these days is to learn how to _read_ assembly for the platforms you're working on, and if the need arises, either re-shape your C code so that it matches the platform's assembly instructions better, or use compiler intrinsics (special functions supplied by the compiler manufacturer, each of which translates directly into one assembly instruction, like int __addlong(int x, int y) => ADD.L).
I once tried to use some compiler intrinsics on GCC but they aren't easy things to create.

You seem to know a lot on the subject, man
Old 13 February 2008, 18:47   #30
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
Quote:
Originally Posted by meynaf View Post
The 68060 is seen by the programmer as a CISC. It has a CISC ISA. I don't think its timings are as regular as 1 cycle per instruction, are they ?
Actually, most "simple" operations (move add sub and or xor lsl lsr asl asr rol ror and some others) do take 1 cycle each, and the 060 can execute them two at a time. The 68000-style addressing modes are also free.

Random code snippet:

Code:
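.y
  ; a4/a5 = left/right edge x in fixed point: take the integer parts
  ; into d4/d5 and step both edges to the next scanline
  move.l a5,d5
  moveq #16,d6
  move.l a4,d4
  add.l ldxdy(a6),a4
  asr.l d6,d4
  add.l rdxdy(a6),a5
  asr.l d6,d5
  ; clamp the span to the horizontal clip window
  cmp.w channel_clipXMin(a6),d4
  bge.s .nClipLeft
  move.w channel_clipXMin(a6),d4
  ext.l d4
.nClipLeft
  cmp.w channel_clipXMax(a6),d5
  ble.s .nClipRight
  move.w channel_clipXMax(a6),d5
  ext.l d5
.nClipRight
  ; skip the scanline if the clipped span is empty
  cmp.w d4,d5
  ble.s .skipLine
  ; a3 = first pixel, a2 = end of span; save d7/d3/d2 for the next scanline
  move.l a0,a3
  move.l a0,a2
  move.l d7,-(sp)
  add.l d4,a3
  move.l d3,-(sp)
  add.l d5,a2
  move.l d2,-(sp)
  ; pre-step u (d2) and v (d3) by left x times dudx/dvdx
  move.l d4,d5
  mulu.l channel_dudx(a6),d4
  mulu.l channel_dvdx(a6),d5
  add.l d4,d2
  add.l d5,d3
  ; repack u/v for the inner loop, load the OR masks into d4/d5
  ; and compute the first texel offset in d6
  move.w d2,d7
  move.w d3,d2
  move.w channel_vIntLSL(a6),d4
  move.w d7,d3
  rol.l d6,d2
  rol.l d6,d3
  lsl.w d4,d3
  move.l channel_uORMask(a6),d4
  move.l d1,d6
  move.l channel_vORMask(a6),d5
  or.l d4,d2
  clr.w d6
  or.l d5,d3
  add.l d6,d3
  moveq #0,d6
  move.w d3,d6
  and.w d2,d6
  addx.l d0,d2
  addx.l d1,d3
  or.l d4,d2
.pix
  ; inner loop: fetch the texel at offset d6, store it, work out the
  ; next offset and step u/v with ADDX until a3 reaches a2
  or.l d5,d3
  move.b (a1,d6.l),d7
  move.w d3,d6
  and.w d2,d6
  move.b d7,(a3)+
  addx.l d0,d2
  addx.l d1,d3
  or.l d4,d2
  cmp.l a3,a2
  bhi.s .pix
  ; restore the values saved before the span
  move.l (sp)+,d2
  move.l (sp)+,d3
  move.l (sp)+,d7
.skipLine
  ; step u/v and the target pointer to the next scanline, count lines in d7
  add.l channel_dudy(a6),d2
  add.l channel_dvdy(a6),d3
  add.l channel_targetPitch(a6),a0
  subq.w #1,d7
  bne .y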
The above block of code renders half a texturemapped triangle. The MULUs take 2 cycles. All other instructions in there take 1 cycle each. In addition, most of the instructions in there can be run in pairs.
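
For readers who don't read 68k assembly, the per-scanline work can be sketched in C roughly as follows. This is not a line-by-line translation: it assumes plain 16.16 fixed-point u/v, a 256x256 texture of 8-bit texels and a non-negative left edge, and it leaves out the register packing, OR masks and ADDX carry tricks of the real loop. The name `draw_span` and its parameters are made up for the example.

```c
#include <stdint.h>

/* Fill the pixels [x_left, x_right) of one 8-bit scanline from a
 * 256x256 texture. u and v are 16.16 fixed-point texture coordinates
 * valid at x = 0 of this scanline; dudx/dvdx are the per-pixel steps.
 * Assumes 0 <= x_left (the span has already been clipped). */
void draw_span(uint8_t *row, int x_left, int x_right,
               const uint8_t *texture,
               uint32_t u, uint32_t v,
               uint32_t dudx, uint32_t dvdx)
{
    /* Pre-step u/v from x = 0 to the left edge of the span
     * (the job of the two MULU.L instructions in the assembly). */
    u += (uint32_t)x_left * dudx;
    v += (uint32_t)x_left * dvdx;

    for (int x = x_left; x < x_right; x++) {
        uint32_t ui = (u >> 16) & 0xFFu;   /* integer part of u, wrapped */
        uint32_t vi = (v >> 16) & 0xFFu;   /* integer part of v, wrapped */
        row[x] = texture[(vi << 8) | ui];  /* fetch texel, store pixel */
        u += dudx;                         /* step to the next pixel */
        v += dvdx;
    }
}
```

The assembly version does the same job with fewer instructions per pixel by keeping the fractional and integer parts in a layout where the carry from one add feeds the next.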

Optimizing for a RISC is more about keeping up instruction throughput and avoiding stalls than about using archaic instructions in ways the CPU designers never thought of.
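
A small C sketch of what "keeping up throughput" can mean in practice: splitting one long dependency chain into two independent ones, so that a CPU able to issue two instructions per cycle has something to pair. The function names are made up for the example, and whether the second version is really faster depends on the CPU and on what the compiler does with it.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous one. */
uint32_t sum_single(const uint32_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Two accumulators: the two adds in the loop body are independent of
 * each other, so they can potentially execute side by side. */
uint32_t sum_dual(const uint32_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < n)          /* odd element left over */
        s0 += p[i];
    return s0 + s1;
}
```

Because unsigned addition wraps, both functions always return the same value; only the shape of the dependency chain differs.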

Quote:
Originally Posted by meynaf View Post
If you consider the 68060 as a RISC, then any cpu (read : x86) nowadays is a RISC.
Exactly -- the x86 and 68k might take as input a CISC instruction set, but that is just an intermediate form. All their performance characteristics reflect how the RISC core operates.

Quote:
Originally Posted by meynaf View Post
Quote:
Originally Posted by kalms
blah blah ps2 blah blah ps3
I don't know a thing about the architectures you describe here. Do you have some links about them so I can look for more info ?
See http://www.research.scea.com/researc...C05/index.html for an introduction to the CELL CPU (which is the one in the PS3). With today's high clock speeds, the key to high performance lies in utilizing memory bandwidth efficiently. In the case of the CELL, every SPE has its own local memory, so offload as much work as possible to the SPEs and feed them with data in suitably large chunks.
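
The "feed them in chunks" pattern can be sketched in plain C as below. This is only an illustration of the access pattern: `memcpy` stands in for the DMA transfers, the `local` array stands in for an SPE's local memory, and no real CELL SDK call is used. The chunk size and the function name are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_ELEMS 1024   /* assumed chunk size, not a CELL constant */

/* Multiply every element of data[] by k, working on one chunk at a
 * time in a small local buffer instead of touching main memory
 * element by element. */
void scale_in_chunks(uint32_t *data, size_t n, uint32_t k)
{
    uint32_t local[CHUNK_ELEMS];   /* stand-in for local memory */

    for (size_t base = 0; base < n; base += CHUNK_ELEMS) {
        size_t count = n - base;
        if (count > CHUNK_ELEMS)
            count = CHUNK_ELEMS;

        /* "transfer in": one large copy from main memory */
        memcpy(local, data + base, count * sizeof local[0]);

        /* work entirely on the local copy */
        for (size_t i = 0; i < count; i++)
            local[i] *= k;

        /* "transfer out": one large copy back to main memory */
        memcpy(data + base, local, count * sizeof local[0]);
    }
}
```

On real hardware the next chunk would typically be fetched while the current one is being processed, which this sketch does not attempt.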

Quote:
Originally Posted by meynaf View Post
I once tried to use some compiler intrinsics on GCC but they aren't easy things to create.
Indeed not.
Old 15 February 2008, 12:03   #31
meynaf
68k wisdom
 
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
Thanks for all the info. You detailed more than I needed !