#1 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
68030 pipelines
Perhaps I can introduce people here with something weird I've found.
Tests done on a 50 MHz '030 (Blizzard 1230-IV) with 60ns memory, no databurst. You know that the CPU can continue executing instructions after a memory write without waiting for the write to complete. But it won't do that for reads, of course. So adjusting memory reads will do no good, eh? Well, let's see... Provided that a0 is somewhere in fastmem, and a1 goes in chipmem (no mystery if in fastmem), can someone explain why this:
Code:
.loop   move.l (a0)+,d1
        move.l (a0)+,d2
        move.l d1,d0
        move.l d1,d0
        move.l d3,(a1)+
        subq.l #1,d7
        bne.s .loop
takes exactly the same time as this:
Code:
.loop   move.l (a0)+,d1
        move.l (a0)+,d2
        move.l d3,(a1)+
        subq.l #1,d7
        bne.s .loop
__________________
He who insults the other in a discussion is the one who's wrong.
#2 |
|
Moderator
|
It could be that the write to the chip mem is the thing which limits the execution speed here. So it uses all possible write cycles to chip mem. The time left can be filled with any instructions.
#3 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Of course it's the write to chipmem which messes up things here.
But what exactly are "all possible write cycles to chip mem" ?
#4 |
|
Registered User
Join Date: May 2006
Location: Germany
Posts: 97
|
IIRC the 68030 can only continue execution when there's no pending write on the same cache line. On the first loop it can continue, but on the second it needs to wait for the first move to chip mem to finish (provided chip mem is slow enough). On the third loop it waits for the second move, and so on. The waits seem to give the extra time used to execute the additional moves.
If I'm right, it should still be a bit slower, as there is no need to wait when you cross a cache line.
#5 | |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Quote:
The two timings of the code I gave are exactly identical... and longer than what one single move in chipmem would be. Furthermore, if you add a few more moves, you'll get it very slightly slower, but nothing compared to the real timing of the moves.
#6 | |
|
Moderator
|
Quote:
My understanding is that your CPU is running at 50 MHz but the memory is much slower, so it can only perform a memory access every 30 cycles (for example). Therefore your loop will wait for the memory access each time. You can also fill this wait time with some other instructions without changing the speed of the loop.
#7 | |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Quote:
Knowing exactly what happens (cycle by cycle) would help me optimize some code: how exactly to pipeline things when accessing chip mem. Apparently no-one knows...
#8 |
|
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 114
|
@meynaf
I don't know the exact timing. Try to measure how long a sequence of writes to chip takes, compared to the same sequence writing to fast. Anyway, I completely agree with Wepl and Ganral, their explanation of the "mystery" is the right one.
#9 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
Could you (meynaf or someone else with a Blizzard 1230-IV at 50MHz) test the following code sequences please and tell how long they take?
Code:
; should run at full speed
.loop   move.l d3,(a1)+
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        subq.l #1,d7
        bne.s .loop
Code:
.loop   move.l (a0)+,d0
        move.l d3,(a1)+
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        subq.l #1,d7
        bne.s .loop
#10 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
I can't do that right now, but be sure I will (you should get the result next Monday 'coz I don't access my Miggy during the week). I'll give you the number of clock cycles for each version.
What I expect is this (but you never know):
- test #1 will run at full speed (tested long ago)
- test #2 should also run at full speed
- test #3 won't (can't pipeline the read)
- test #4 speed may be identical to test #3 (unsure)
- test #5 slower than test #4 but unsure
- test #6 twice the time from test #1
- test #7 ???
- test #8 ???
#11 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
(The text below assumes that all DMA is turned off, and that MOVEs to/from memory take 0 cycles, except for the time spent waiting for the memory accesses themselves.)
Ok, this is how I think it works:
All writes go through the Write Pending Buffer (see 68030UM, table 6-1, and section 11.2.5.2). When the CPU wants to write a value, it first waits until the Write Pending Buffer is empty, and then it dispatches the value to the buffer. If the Write Pending Buffer is full, then it also means that the bus controller is busy (it is performing a memory write). When the CPU wants to read a value, it first waits until the bus controller is available, and then it performs a read operation.
The CPU runs at a core clock frequency of 50MHz, so the cycle time for the CPU is 20ns. The chipset has a clock frequency of ~3.57MHz, which yields a cycle time of 280ns. Therefore, 1 chipset bus cycle ("bus cycle") == 14 CPU cycles ("cycles").
Now, the CPU and the chipset run pretty much asynchronously. When the CPU wants to write to chipmem, the bus controller will place a write request onto the bus, and then the chipset will acknowledge the write at some point in the future. The CPU has no idea/expectation of how long the write will take.
Since the CPU peaks at 1 chipmem-write per 28 CPU-cycles, and it is not possible to squeeze in any fastmem reads between two chipmem writes, it seems that the bus controller needs to spend two full bus cycles on the chipmem write. To be a bit more precise: the chipset will look for a write request at the beginning of an even bus cycle, and it will complete the request at the end of an odd bus cycle.
The above gives the following scenario for a write:
1) the bus controller places a write request onto the bus.
2) 0..27 cycles pass, until the beginning of an even buscycle occurs.
3) 28 cycles pass, during which the chipset is servicing the write request.
4) the write is acknowledged to the bus controller.
If writes are chained back-to-back, then the delay in step 2 will be exactly 0 cycles. A gap of less than 28 cycles between each instruction that writes to memory ensures max throughput. However, if the gap is 30 cycles, then the code will start missing some of the available buscycles.
So, what does this mean in practice? Well:
Code:
move.l d0,(a1)+
move.l d0,(a1)+
This is a good time to do any sort of work that only touches registers and caches.
If you perform a fastmem access, it will wait until the bus is available. That is, until the end of the current 28-cycle period. The fastmem access itself will cause 3 cycles of bus activity.
If you make a chipmem write directly after the fastmem access, then the CPU instruction will complete immediately, but it will take (28 - 3) cycles until the next even buscycle is begun, and another 28 cycles until the following buscycle has completed; therefore, the bus controller will be busy for 25+28 cycles. Don't access memory during those cycles, or you will stall the CPU.
If you make two chipmem writes directly after the fastmem access, this will be even more obvious: the second write will stall until the end of the 25+28 cycle period, and then the bus will be busy for 28 cycles after the second chipmem write instruction has finished execution.
So, what to do?
- group chipmem writes together, ensure that they are less than 28 cycles apart
- do CPU work after each chipmem write
- if you need to touch fastmem, touch several memory locations at once; you should be able to squeeze in 2 or 3 fastmem reads during a 28-cycle period
Ah, and you need to take into account the execution times of the instructions as well. I think that a MOVE.L d0,(a0)+ takes 3 cycles, and a MOVE.L (a0)+,d0 (from fastmem) takes 8/9 cycles (for 60ns/70ns memory).
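The clock arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the figures quoted above (20ns CPU cycle, 280ns chipset cycle, two buscycles per chipmem write); the constant and function names are made up for illustration and are not part of any tool.

```python
# Figures taken from the post; names are illustrative only.
CPU_CYCLE_NS = 20                                    # 50MHz CPU core
BUS_CYCLE_NS = 280                                   # ~3.57MHz chipset clock
CYCLES_PER_BUSCYCLE = BUS_CYCLE_NS // CPU_CYCLE_NS   # 14 CPU cycles
CYCLES_PER_CHIP_WRITE = 2 * CYCLES_PER_BUSCYCLE      # even + odd buscycle = 28


def wait_for_even_buscycle(request_cycle):
    """CPU cycles between a write request and the next even-buscycle start (0..27)."""
    return -request_cycle % CYCLES_PER_CHIP_WRITE


def bus_busy_cycles(request_cycle):
    """Cycles the bus controller stays busy: alignment wait plus 28 cycles of service."""
    return wait_for_even_buscycle(request_cycle) + CYCLES_PER_CHIP_WRITE


if __name__ == "__main__":
    # A write requested 3 cycles into a period (after a 3-cycle fastmem access)
    # keeps the bus controller busy for 25 + 28 cycles, as described above.
    print(bus_busy_cycles(3))
```

The `request_cycle` argument is measured from the start of an even buscycle, which is an assumption of this sketch; the real hardware does not expose that phase directly.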
#12 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
I suspected something like that, however when I wanted to check it with tests I discovered that the execution times were not multiples of 28 cycles...
Also, I've found that the fastmem does not exhibit such behaviors.
#13 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
If you have code that writes less frequently than 28 cycles, like the loop below:
Code:
; write to chipmem
; write to chipmem
; wait 32 cycles
; write to chipmem
; wait 32 cycles
; write to chipmem
; wait 32 cycles
; write to chipmem
; wait 32 cycles
; write to chipmem
; wait 100 cycles
The first write will not stall at all.
The second write will stall for (0..27) + 28 cycles, as explained in the previous post. CPU is now aligned with the chipbus; chipbus is at the beginning of an even buscycle. The chipbus immediately accepts the CPU's write. CPU waits for 28 cycles. The chipbus transaction completes. CPU waits for another 4 cycles. The chipbus is now 4 cycles into an even buscycle. The bus interface (and Write Pending Buffer) is free.
The third write will not stall, since the bus interface is idle. The write goes into the Write Pending Buffer. CPU waits for 24 cycles. The chipbus accepts the CPU's write, and starts processing it. CPU waits for another 8 cycles. The chipbus is now busy, and it's 8 cycles into an even buscycle.
The fourth write will stall for 20 cycles, waiting for the write buffer to become available. The fourth write will behave like the second write (align with the chipbus, etc).
The fifth write will not stall. The sixth write will stall for 20 cycles. Etc.
On the chipbus, it will look as if the following buscycles were being used for writing:
-Y-Y-N-Y-Y-N-Y-Y
Y = yes, N = no
If the above model is correct, then the padding will occasionally (not always) cause chipbus cycles to be missed:
* for 0..28 cycles of padding, loop should run at full speed
* for 29..42 cycles of padding, loop should run at 2/3 speed
* for 43..56 cycles of padding, loop should run at 1/2 speed
etc. So each block of 14 cycles of padding makes the code lose out on one chipmem slot.
So one thing which would be interesting to test is to see, when adding 28 -> 30 -> 32 -> 34 cycles of padding, whether the execution time rises linearly or there is a discrete jump. If there is a linear rise, it might be because there is a write buffer between the CPU's bus interface and the chipmem interface on the Blizzard board which we're not taking into account.
Regarding fastmem: it is not connected to the chipbus at all, and therefore all the sync-to-buscycle rules etc. do not apply to it. Think of it as if the CPU's bus interface is a 3-way connection, which ties together CPU, fastmem and chipbus interface. The CPU can communicate with one of the two memory subsystems at any given time. Chipmem runs on a slow clock, but fastmem runs unclocked with 60ns access cycles.
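The walk-through above can be turned into a toy simulation. The sketch below assumes exactly the model described in this post and nothing more: one write buffer entry that stays occupied until the chipbus has finished the write, write slots that start every 28 CPU cycles, and write instructions that themselves take 0 cycles. It also starts with the CPU already aligned to the chipbus, so the initial (0..27) alignment stall is skipped. All names are hypothetical.

```python
SLOT = 28  # CPU cycles per chipmem write slot (one even + one odd buscycle)


def issue_times(padding, n_writes):
    """Return the CPU cycle at which each write enters the write buffer."""
    t = 0            # current CPU time in cycles
    buffer_free = 0  # cycle at which the single write buffer is empty again
    times = []
    for _ in range(n_writes):
        t = max(t, buffer_free)          # stall while the previous write is pending
        times.append(t)
        accepted = -(-t // SLOT) * SLOT  # next slot start at or after t
        buffer_free = accepted + SLOT    # the write completes one slot later
        t += padding                     # register-only work between writes
    return times


def cycles_per_write(padding, warmup=10, count=60):
    """Average spacing between writes once the pattern has settled."""
    times = issue_times(padding, warmup + count + 1)
    return (times[warmup + count] - times[warmup]) / count


if __name__ == "__main__":
    for padding in (0, 28, 32, 56):
        print(padding, cycles_per_write(padding))
```

With 32 cycles of padding the sketch alternates between an unstalled write and one that stalls for 20 cycles, giving 42 cycles per write on average, which is the 2/3 speed case described above. Other padding values are worth running through the sketch rather than assuming; a toy model like this is only as good as its assumptions about the buffer.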
#14 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
I can tell you I have observed something that looked like a discrete jump.
So, if I understand well, to optimise for chipmem accesses it's better if you fit non-chipmem accesses in blocks of 14 cpu cycles ? And the overall time will be a multiple of 14 ? Hmmm... looks too simple. I'll perform more tests this week-end...
#15 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
I did some timing tests on Blizzard 1260 today. See attached files.
When the 060 runs with Store Buffer disabled, there is no decoupling mechanism for writes. The CPU pipeline waits until the bus transfer has completed. Pages 4-5 of the attached document exhibit the 'staircasing' phenomenon quite well.
When the 060 runs with the Store Buffer enabled, there is a series of 4 buffers decoupling CPU from bus interface. In most cases, these buffers hide the staircasing. Right now, I'm not 100% sure how. (Pages 1-3 in the document)
The 030 should correspond to an 060 with a small Store Buffer (just 1 entry). It might be that the 030's single buffer is enough to hide most of the staircasing, too. I have no clear explanation for the phenomena right now.
#16 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
I can't imagine the hell it must be with the 040 copyback cache...
#17 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Tests done !
None of them gives integer values ('xcept my reference test)
Code:
; should run at full speed (it does)
.loop   move.l d3,(a1)+
        subq.l #1,d7
        bne.s .loop
; 28.76 -> 28.33

; should run at full speed (it does too)
.loop   move.l d3,(a1)+
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        move.l d0,d1
        subq.l #1,d7
        bne.s .loop
; 28.76 -> 28.33

; should run at full speed (doesn't !)
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
; 57.50 -> 56.65

; should run at full speed (doesn't)
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
; 58.78 -> 57.91

; might run at full speed (oh no !)
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
; 75.00 -> 73.89

; should not run at full speed (obviously it doesn't...)
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        move.l (a0)+,d0
        subq.l #1,d7
        bne.s .loop
; 75.18 -> 74.07

; might run at full speed (doesn't)
.loop   move.l d3,(a1)+
        move.l (a0)+,d0
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        subq.l #1,d7
        bne.s .loop
; 58.62 -> 57.75

; might run at full speed (doesn't)
.loop   move.l (a0)+,d0
        move.l d3,(a1)+
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        move.l d1,d2
        subq.l #1,d7
        bne.s .loop
; 45.36 -> 44.69

; added as a reference (28 real clock cycles) :
.loop   move.l d1,d2
        move.l d1,d2
        lsr.w #1,d4
        lsr.w #1,d4
        lsr.w #1,d4
        lsr.w #1,d4
        subq.l #1,d7
        bne.s .loop
; 28.42 -> 28
The timings are very stable, the little differences always show up the same way.
#18 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
Ok. First off, one possible reason why the tests aren't showing integer results is that if you have any system stuff running at all, then the CPU will lose a few of its buscycles to the chipset. Also, if the accelerator board's frequency is not an *exact* multiple of the bus frequency, that will mean that the number of buscycles between chipwrites varies (for instance: 99% of writes take 28 cycles, 1% of writes take 29 cycles, after quantization to integer CPU cycles).
That being said, if we recalculate the figures you gave into buscycles (assuming one buscycle is 28 cycles):
28 -> 1
56 -> 2
58 -> 2.07
74 -> 2.64
45 -> 1.60
That looks like pretty good quantization to me. I'm hoping to get hold of a 1230 board tonight (not Blizzard but it should do), and I'll post my results later on.
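The conversion above is just a division by 28. A minimal sketch, using the corrected timings from the test results earlier in the thread as input; the function name and the 28-cycle unit are taken from this post's assumption, not from any measurement tool.

```python
WRITE_SLOT_CYCLES = 28  # the post's working assumption for one chipmem write slot


def to_write_slots(cpu_cycles):
    """Express a loop time in units of 28-cycle write slots, rounded to 2 places."""
    return round(cpu_cycles / WRITE_SLOT_CYCLES, 2)


if __name__ == "__main__":
    # Corrected loop timings (CPU cycles per iteration) from the earlier test post.
    for cycles in (28.33, 56.65, 57.91, 73.89, 44.69):
        print(cycles, "->", to_write_slots(cycles))
```

Values close to a whole number or to a half (one 14-cycle buscycle) are what the quantization argument predicts; the 73.89 figure is the one that fits least well.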
#19 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
The OS / interrupts sure take some clock cycles, but it's very constant (1.05%) so I gave the recomputed value.
I also suspected the clocks (chip & cpu) to not be a multiple of each other. But those little steps of less than 1 clock cycle are quite weird IMO. If you get a 030 I'm sure you're in for some surprises in the timings.
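For readers wondering how the second figure in each "raw -> recomputed" pair relates to the first: the numbers in the test post appear consistent with scaling every raw measurement by the ratio observed on the reference loop (28.42 measured for 28 real cycles). That reading is an inference from the published figures, not something stated in the thread, and the names below are hypothetical.

```python
REFERENCE_RAW = 28.42   # measured time of the register-only reference loop
REFERENCE_TRUE = 28     # its known cost in CPU cycles


def recompute(raw_cycles):
    """Scale a raw measurement by the overhead ratio seen on the reference loop."""
    return round(raw_cycles * REFERENCE_TRUE / REFERENCE_RAW, 2)


if __name__ == "__main__":
    for raw in (28.76, 57.50, 58.78, 75.00, 75.18, 58.62, 45.36):
        print(raw, "->", recompute(raw))
```

Calibrating against a loop of known length removes any constant proportional overhead (interrupts, DMA slots) without needing to know where it comes from.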
#20 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
Ok, I received the Microbotics 1230 board tonight. I've attached results in .odf format.
There is a write buffer in the 68030 by design (see 68030UM, table 6-1). If a write is requested and the write buffer is empty, then the write operation gets deferred to the write buffer and the CPU core continues executing instructions happily. If a read/write is requested and the write buffer is nonempty, then the read/write operation stalls until the write buffer is emptied.
I think that there is a second buffer with the same semantics on the MX1230 board (and probably on board the Blizzard 1230 as well). With less than two buffers between the CPU's instruction pipeline and chipram, there should be staircasing when doing straight chipram writes (Sheet4).
More analysis coming later.
#21 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
So far, I haven't been able to think up a buffer configuration that would give the expected behaviour. With that said, there is still a lesson or two that can be drawn from the graphs:
* chained chipmem writes peak at 1 write per 2 buscycles. If there is more than 2 buscycles of non-bus work in between each chipmem write, the write performance scales accordingly (no staircasing).
* A chipmem write closely followed by a fastmem read causes quantization (staircasing).
Quantization causes delays in integer multiples of buscycles. Therefore, the following code between two chipmem writes:
move.l (a0),d0 ; 10c on MX1230
... will cause on average one buscycle to be missed. So that code "takes" the same amount of time as the following code:
move.l (a0),d0 ; 10c on MX1230
move.l d0,d1 ; 2c
move.l d0,d1 ; 2c
Whereas this code (dcache turned off):
move.l (a0),d0 ; 10c on MX1230
move.l 4(a0),d0 ; 11c? on MX1230
would performance-wise be equivalent to a piece of code where the second move is replaced with pure register-register operations.
So to utilize the bus efficiently, make sure to overlap the buswrite portion of chipwrites with reg-reg work, and if you do fastmem access closely after a chipwrite (say less than 28c) expect quantization and aim for a multiple of 14c of work between start of the first fastmem access and start of the next chipwrite.
Cluster multiple chipwrites together and multiple fastmem reads together, that way you minimize the number of synchronization points.
Again, the above is not 100% accurate, but it's the best I can do for now.
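The quantization rule of thumb above amounts to rounding the work between a fastmem read and the next chipwrite up to a whole number of 14-cycle buscycles. A minimal sketch under that assumption, using the cycle counts quoted for the MX1230; the function name is made up, and the post itself stresses that the rule is not 100% accurate.

```python
import math

BUSCYCLE = 14  # CPU cycles per chipset buscycle at 50MHz


def effective_cycles(work_cycles):
    """Round work between a fastmem read and the next chipwrite up to whole buscycles."""
    return math.ceil(work_cycles / BUSCYCLE) * BUSCYCLE


if __name__ == "__main__":
    print(effective_cycles(10))          # one fastmem read (10c on MX1230)
    print(effective_cycles(10 + 2 + 2))  # same read plus two reg-reg moves
    print(effective_cycles(10 + 11))     # two fastmem reads, dcache off
```

Under this rule a lone 10c read costs as much as 14c of work, so the two extra register moves are free, and two reads (21c) leave about 7c of room for register work before the next buscycle is lost.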
#22 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Thanks for the explanations.
What I can add is that if you have 60ns fastmem, move.l (a0),d0 is only 8c and drops to 4 (or 5 ?) if in data cache. Also, my tests gave 2 more cycles for instructions such as move.l 4(a0),d0. Those two cycles can be done during the previous write.
I still can't predict the time some code can take, but I'll know in which direction to go - thanks again.
At least we know why Toni didn't implement cycle exact emulation for anything but a 68000.
#23 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
Do note that the "after fastread, try to consume a multiple of 14 cycles" is a very fragile strategy (depends on CPU model and MHz). Clustering, and overlapping nearly all reg-reg work with the chipwrites, is much more solid.
Most 030 boards have good throughput to chipram (close to 100% efficiency, ~7MB/s write speed). Some common 060 boards (most notably the Blizzard 1260) do not, however; the B1260 gets about 5.7MB/s (35 cycles per write on average). This is with all store buffers enabled and it seems to be a design mistake with the card itself.
040/060 always fetch whole cachelines from fastram. If you do one aligned read, the CPU will stall until the first longword has been read; then, it will fetch the next three in the background. This means that a single fastmem access will always take more than 14c (so will cause two buscycles to be missed) on such CPUs. If you want to convince the CPU to just fetch a longword at a time, you have to disable cache or modify MMU tables, and that is hardly useful.
68030 with DBURST on behaves in the same way as 040/060 with regard to reads... on some boards. On others it is simply ignored. (On B1230-IV yes, on MX1230 no)
Consider using dburst, it will help performance when you're not jumping around randomly in memory. Why? If you have code like:
Code:
move.l (a0)+,d0
<some reg-reg work>
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
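The chipram throughput figures quoted earlier in this post follow directly from the cycles spent per longword write. A small sketch, assuming a 50MHz CPU clock on both boards and MB meaning 10^6 bytes (both are assumptions of this example, and the function name is hypothetical):

```python
def chip_write_mb_per_s(cycles_per_write, cpu_mhz=50, bytes_per_write=4):
    """Longword write throughput in MB/s (10^6 bytes) for a given cost per write."""
    # cpu_mhz is in millions of cycles per second, so the result is in millions of bytes.
    return bytes_per_write * cpu_mhz / cycles_per_write


if __name__ == "__main__":
    print(round(chip_write_mb_per_s(28), 1))  # one write per 28 cycles (typical 030 board)
    print(round(chip_write_mb_per_s(35), 1))  # 35 cycles per write (Blizzard 1260 figure)
```

One write every 28 cycles gives roughly 7.1MB/s and one every 35 cycles roughly 5.7MB/s, in line with the numbers above.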
#24 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
You mention that on your machine, MOVE.L (a0),d0 is 8c on datacache miss and 4/5c on datacache hit. Let's break that figure down a bit.
According to chapter 11.6.6, MOVE EA,Dn begins with a "Fetch Effective Address" EA calculation. Chapter 11.6.1 gives the FEA time for (An) as 3 cycles (1 heads, 1 tails). Back to 11.6.6 again, the MOVE itself is 2c (0 heads, 0 tails).
Now let's assume: 0 wait states memory, no datacache hit. Combine EA and operation time together, and the final MOVE will be 5c (1 heads, 0 tails).
What about if there are wait states? Chapter 11.5 describes that. Put briefly, N wait states will cause a simple read to add N cycles to the CC-time and the tail figure for the EA calculation. So, with 3 wait states, the FEA time for (An) is 6 cycles (1 heads, 4 tails). Combining EA and operation time, this gives a total execution time of 8c (1 heads, 0 tails).
====> MOVE.L (An),Dn takes 5c + wait states on datacache miss.
====> your accelerator board has 3 wait states, mine has 5.
3 wait states is good, that's exactly 60ns (theoretical minimum given the memory type installed). The 60/70ns switch on the B1230 switches between 3 and 4 wait states.
Now let's look at datacache hits. Chapter 11.4 describes what happens. FEA for (An) before adjusting for datacache hit is 3 cycles (1 heads, 1 tails). After adjusting (rule 1b) FEA for (An) is 2 cycles (1 heads, 0 tails). Combining EA and operation time gives 4 cycles (1 heads, 0 tails).
=====> MOVE.L (An),Dn takes 4c on datacache hit.
This is all quite messy. I'm glad I have moved over to other CPU generations/architectures.
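The conclusions of this breakdown fit in a tiny helper. It only encodes the two results derived above (5c plus wait states on a datacache miss, 4c on a hit) for MOVE.L (An),Dn; it is not a general 68030 timing calculator, and the names are invented for the example.

```python
CPU_CYCLE_NS = 20  # 50MHz CPU


def move_l_an_dn_cycles(wait_states=0, dcache_hit=False):
    """Cycle count for MOVE.L (An),Dn according to the breakdown in this post."""
    if dcache_hit:
        return 4            # FEA 2c after the cache-hit adjustment + 2c for the MOVE
    return 5 + wait_states  # FEA 3c + 2c for the MOVE + one cycle per wait state


if __name__ == "__main__":
    print(move_l_an_dn_cycles(wait_states=3))   # 60ns fastmem on the B1230-IV
    print(move_l_an_dn_cycles(wait_states=5))   # the MX1230 figure
    print(move_l_an_dn_cycles(dcache_hit=True))
    print(3 * CPU_CYCLE_NS, "ns of wait states")
```

This reproduces the 8c and 10c miss timings reported for the two boards and the 4c hit timing.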
#25 | |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Quote:
If I assume you meant more recent ones, then they have even more unpredictable timings.
#26 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
RISC based architectures like 68060, SH4, MIPS, PPC, PS2-VU, Cell SPEs etc.
I find it to be easier to memorize a 6-7 stage execution pipeline and the function of each stage, than to memorize page after page of instruction execution timings. But then again, it's all about memory access patterns on most modern machines, and the only way to be sure there is to profile.
#27 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
I'm not sure that knowing the timings on a RISC cpu is very useful, as they're built to run compiled code.
I've even read that on a cpu such as the PPC, the compiler will do a better job than the ASM programmer (hard to believe though).
#28 |
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
The 68060 is internally a RISC. Does that mean that it is built to run compiled code, and that such lowlevel arcana as instruction execution timing are hardly useful when writing code for that chip?
I could well imagine that a PPC compiler will do a better job than an _average_ assembly programmer. However, a _good_ programmer should be able to outperform the compiler simply because he knows more about the input data than the compiler does... and there is rarely any way of hinting this knowledge to the compiler.
Besides, compilers are normally geared for ordinary CPU/memory models. Some RISC CPUs (most notably the PS2 VUs) have instruction sets which simply don't map well to C code. Others (PS3 SPEs) have strange stalls which you only learn about by reading the processor manual. These architectures share the trait that the designers of the processors did not want to spend enough time/transistors on making the CPUs run arbitrary C code well, and it is too hard for the compiler writers to make a really good compiler for the architectures.
A good compromise these days is to learn how to _read_ assembly for the platforms you're working on, and if the need arises, either re-shape your C code such that it matches better with the assembly instructions on the platform, or use compiler intrinsics (these are special functions which translate directly into one assembly instruction each, like int __addlong(int x, int y) => ADD.L, and are supplied by the compiler manufacturer).
#29 | ||||
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Quote:
If you consider the 68060 as a RISC, then any cpu (read : x86) nowadays is a RISC.
Quote:
Well, who's gonna write a lot of asm on a ppc anyway...
Quote:
Quote:
You seem to know a lot on the subject, man.
#30 | ||||
|
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 121
|
Quote:
Random code snippet:
Code:
.y      move.l a5,d5
        moveq #16,d6
        move.l a4,d4
        add.l ldxdy(a6),a4
        asr.l d6,d4
        add.l rdxdy(a6),a5
        asr.l d6,d5
        cmp.w channel_clipXMin(a6),d4
        bge.s .nClipLeft
        move.w channel_clipXMin(a6),d4
        ext.l d4
.nClipLeft
        cmp.w channel_clipXMax(a6),d5
        ble.s .nClipRight
        move.w channel_clipXMax(a6),d5
        ext.l d5
.nClipRight
        cmp.w d4,d5
        ble.s .skipLine
        move.l a0,a3
        move.l a0,a2
        move.l d7,-(sp)
        add.l d4,a3
        move.l d3,-(sp)
        add.l d5,a2
        move.l d2,-(sp)
        move.l d4,d5
        mulu.l channel_dudx(a6),d4
        mulu.l channel_dvdx(a6),d5
        add.l d4,d2
        add.l d5,d3
        move.w d2,d7
        move.w d3,d2
        move.w channel_vIntLSL(a6),d4
        move.w d7,d3
        rol.l d6,d2
        rol.l d6,d3
        lsl.w d4,d3
        move.l channel_uORMask(a6),d4
        move.l d1,d6
        move.l channel_vORMask(a6),d5
        or.l d4,d2
        clr.w d6
        or.l d5,d3
        add.l d6,d3
        moveq #0,d6
        move.w d3,d6
        and.w d2,d6
        addx.l d0,d2
        addx.l d1,d3
        or.l d4,d2
.pix    or.l d5,d3
        move.b (a1,d6.l),d7
        move.w d3,d6
        and.w d2,d6
        move.b d7,(a3)+
        addx.l d0,d2
        addx.l d1,d3
        or.l d4,d2
        cmp.l a3,a2
        bhi.s .pix
        move.l (sp)+,d2
        move.l (sp)+,d3
        move.l (sp)+,d7
.skipLine
        add.l channel_dudy(a6),d2
        add.l channel_dvdy(a6),d3
        add.l channel_targetPitch(a6),a0
        subq.w #1,d7
        bne .y
Optimizing for a RISC is more about keeping up instruction throughput and avoiding stalls, than about using archaic instructions in ways the CPU designers never thought of.
Quote:
Quote:
Indeed not.
#31 |
|
68k wisdom
Join Date: Nov 2007
Location: Lyon (France)
Age: 40
Posts: 979
|
Thanks for all the infos. You detailed more than I needed !