03 April 2011, 20:40 | #1 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
memory access speed question
Hi,
First, I'm sorry if this has already been asked; I tried to find it and couldn't. I have read somewhere (I don't remember where) that this code here:
Code:
        move.w  someMemoryAdr,d1
        move.w  someMemoryAdr2,d2
        addq    #6,d3
is slower than this:
Code:
        move.w  someMemoryAdr,d1
        addq    #6,d3
        move.w  someMemoryAdr2,d2
And now my main question: does the memory write waitstate affect memory reads, or do they have separate waits? For example, is this:
Code:
        move.w  someMemoryAdr,d1
        move.w  d2,someMemoryAdr2
        addq    #6,d3
slower than this:
Code:
        move.w  someMemoryAdr,d1
        addq    #6,d3
        move.w  d2,someMemoryAdr2
|
03 April 2011, 21:39 | #2 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
What is your target system? What CPU? Are you reading/writing to chipmem or fastmem? The question is way too broad to give a simple answer. All that can be said from your description is that "generally, the latter will be as fast as or faster than the former".
|
03 April 2011, 21:49 | #3 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
I didn't think this depended on the type of memory.
I thought this was just general for the 68020+ processors; now I'm even more confused. |
04 April 2011, 01:36 | #4 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
If you're targeting fastmem and you're hitting the cache on 68040+ then both alternatives will be equally fast.
If you're targeting fastmem and you're not hitting the cache on 68040+ then there is a bunch of cycles after the 1st read during which the bus interface is busy (this is due to the CPU fetching the entire cacheline). Any reads/writes which generate bus traffic during that period will stall until the first cacheline fetch has completed. You can see the same effect on 68030 with DBURST on. So the 2nd alternative will be faster under those circumstances.
If you're targeting fastmem and you're on 68020, or 68030 with DBURST off, then they should be equally fast.
If you're targeting chipmem then it depends a lot on the exact timing, i.e. how your CPU instructions align to the chipbus cycle boundaries. The alignment requirements for optimal performance also differ between accelerator boards. |
04 April 2011, 04:29 | #5 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
OK, I found the source where I read it. It was not exactly as I remembered: it talked only about writes, not reads.
Here is the article: http://www.mways.co.uk/amiga/howtoco...80x0issues.php It's under "A1200 speed issues".
Edit: OK, I understand better now how this works. If you're interested, you can look at this:
Last edited by Lord Riton; 04 April 2011 at 12:40. |
04 April 2011, 22:35 | #6 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Chipmem and fastmem accesses are different. To be precise, chipmem accesses are uncached (so they behave largely the same way on all 68020+ systems). Also, chipmem is very slow compared to the CPU clockrate.
If you read from a chipmem location, the CPU will stall during the entire duration of the memory read operation. This is because the CPU needs the value stored in the memory location before the read operation can be completed.
If you write, however, in most system configurations the write will get chucked into a buffer, and then the CPU continues processing other stuff while the bus interface is busy. (On most accelerator boards there is such a write buffer on the board itself. In addition, the 68060 has a 4-slot write buffer internally in the CPU.) If any subsequent instruction tries to hit the bus while there are still pending writes, then the CPU will stall until the bus is available again.
For 50MHz accelerator boards, the bus will typically remain busy for 26-28 cycles after you have performed a chipmem write. During that period, don't touch the bus. |
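As a rough illustration of the write-buffer advice above, here is a minimal sketch (not code from the thread) that puts register-only work between chipmem writes. It assumes a1 points into chipmem, d7 holds the loop count minus one, and the loop runs from the instruction cache so that instruction fetches do not need the bus. The label name and the filler arithmetic are made up for the example; how many filler instructions are needed to cover the 26-28 cycles depends on the CPU and board and would have to be counted for the real code.

```asm
; Hypothetical sketch, not code from the thread.
; Assumes: a1 -> chipmem destination, d7.w = number of longwords - 1,
; and the loop is running from the instruction cache.
spaced_chip_writes:
.loop:
        move.l  d0,(a1)+        ; chipmem write, taken over by the write buffer
        add.l   d1,d0           ; register-only filler work: no bus access,
        rol.l   #8,d0           ;   so the CPU can keep running while the
        eor.l   d2,d0           ;   buffered write drains to chipmem
        add.l   d3,d0
        dbf     d7,.loop        ; next write stalls only if the bus is still busy
        rts
```

Two `move.l dN,(a1)+` instructions placed back to back would instead leave the second write waiting for the first to finish.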
04 April 2011, 22:55 | #7 | |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
Quote:
And I bet this is also the problem with the new c2p code I just wrote for it: it should be faster than my old one, but it isn't. The only reason I can see why my old c2p is faster is that it's faster to use fast RAM for the conversion first and then simply copy the whole screen from fast RAM into chip RAM with fat movem.l's .. I'm a bit desperate, I feel I'll soon abandon the Amiga again and just go to easy C++ PC programming .. now I'm really going to play some Mass Effect 1 on my Xbox 360 to forget this...
|
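For readers unfamiliar with the "fat movem.l" copy mentioned above, a minimal sketch could look like the following. It is not the poster's routine: the label, the COPY_BYTES constant and the register choice are assumptions made for the example, with a0 pointing to the fast RAM source and a1 to the chip RAM destination.

```asm
; Hypothetical sketch: copy COPY_BYTES from fast RAM (a0) to chip RAM (a1).
; COPY_BYTES is an assumed size and must be a multiple of 48.
COPY_BYTES      equ     48*200

copy_fast_to_chip:
        movem.l d2-d7/a2-a6,-(sp)       ; save the registers the loop clobbers
        move.w  #COPY_BYTES/48-1,d7     ; dbf runs count+1 times
.copy:
        movem.l (a0)+,d0-d6/a2-a6       ; 12 longwords (48 bytes) from fast RAM
        movem.l d0-d6/a2-a6,(a1)        ; the same 12 longwords to chip RAM
        lea     48(a1),a1               ; movem to (a1) does not advance a1
        dbf     d7,.copy
        movem.l (sp)+,d2-d7/a2-a6       ; restore saved registers
        rts
```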
04 April 2011, 23:14 | #8 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Yup. There are two things you can try which are practical:
1) Only do c2p outside of screen display. If you have a 200-line-high display window then you'd still have 112 lines during which the display DMA isn't touching chipram. It will take you multiple frames to complete the c2p conversion.
2) Find a way (specific to your application) which requires less overall memory access than reading the entire fastmem buffer and writing the entire chipmem buffer.
The standard c2ps are (from a performance perspective) equivalent to a fast-to-chip copy on 68040@40 and faster CPUs, i.e. the actual c2p transformation logic is done while the chipbus is busy. So if you want higher performance, you need to do something about the memory accesses. |
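One way option 1 might be approached is to poll the beam position and start the chipmem writes only once the beam has left the display window (a PAL frame has about 312 lines, so 312 - 200 = 112 lines remain). The sketch below is an illustration under assumptions, not code from the thread: it assumes a non-interlaced PAL screen whose display window starts at line 44 ($2c), and it reads the custom chip registers VPOSR/VHPOSR at $dff004 as one longword. The label and constant names are made up.

```asm
; Hypothetical sketch: wait until the beam is below the display window.
VPOSR           equ     $dff004         ; VPOSR/VHPOSR read as one longword
DISPLAY_START   equ     44              ; assumed first display line ($2c)
DISPLAY_LINES   equ     200
DISPLAY_END     equ     DISPLAY_START+DISPLAY_LINES

wait_display_end:
.wait:
        move.l  VPOSR,d0                ; bits 16-8 = vertical beam position
        and.l   #$0001ff00,d0           ; keep only the line number
        cmp.l   #DISPLAY_END<<8,d0
        bcs.b   .wait                   ; loop while beam line < DISPLAY_END
        rts
```

Note that this simple test also waits if it is called during the blank lines at the top of the frame, and that polling a custom register is itself a chip bus access.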
05 April 2011, 01:48 | #9 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
I should not play Mass Effect in hardcore mode as a relaxing game.. this got me even more frustrated..
I tested the engine without any chip memory writes (I commented them all out), and it's still exactly as slow !?!? There must be something else.. will look at that tomorrow.. Edit: I'll just post the code here, maybe someone sees something wrong or suspicious that I didn't see myself. Code:
        move.l  ptr_dess_vue,a0         ; a0 = source chunky screen
        move.w  offset_image,d0         ; d0 = screen offset
        and.l   #$ffff,d0
        add.l   _bitp,d0
        move.l  d0,a1                   ; a1 = adr bitplane 0 of destination screen
        move.w  long_x_3,d0
        move.w  d0,d1
        lsr.w   #5,d0
        and.l   #$ffff,d0
        move.l  d0,a3                   ; a3 = number of 32pixel parts per line
        move.l  a3,a2
        lsr.w   #3,d1                   ; /32 *4
        neg.w   d1
        ext.l   d1
        add.l   #80,d1
        move.l  d1,a4                   ; a4 = offset to add to end of line till next line
        move.w  long_y,d0
        and.l   #$ffff,d0
        move.l  d0,a5                   ; a5 = y counter
affdv_do_a_screen_line
affdv_do_32_pixels
        move.l  #8,a6                   ; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
        move.l  (a0)+,d6                ; get 4 chunky pixels
        moveq.l #4,d7                   ; 4 pixels counter
affdv_do_1_pixel
        lsl.l   #3,d6                   ; we don't need the 2 ham8 control bits (7+6)
        addx.l  d5,d5                   ; bit 5 of a pixel to bitplane 5 (0-5)
        add.l   d6,d6
        addx.l  d4,d4                   ; bit 4 of a pixel to bitplane 4 (0-5)
        add.l   d6,d6
        addx.l  d3,d3                   ; bit 3 of a pixel to bitplane 3 (0-5)
        add.l   d6,d6
        addx.l  d2,d2                   ; bit 2 of a pixel to bitplane 2 (0-5)
        add.l   d6,d6
        addx.l  d1,d1                   ; bit 1 of a pixel to bitplane 1 (0-5)
        add.l   d6,d6
        addx.l  d0,d0                   ; bit 0 of a pixel to bitplane 0 (0-5)
        subq.l  #1,d7
        bne.b   affdv_do_1_pixel
        subq.l  #1,a6
        cmpa.l  #0,a6
        bne.b   affdv_do_4_pixels
        move.l  d0,(a1)                 ; set bitplan 0 of 32 pixels
        add.l   #80*256,a1
        move.l  d1,(a1)                 ; set bitplan 1 of 32 pixels
        add.l   #80*256,a1
        move.l  d2,(a1)                 ; set bitplan 2 of 32 pixels
        add.l   #80*256,a1
        move.l  d3,(a1)                 ; set bitplan 3 of 32 pixels
        add.l   #80*256,a1
        move.l  d4,(a1)                 ; set bitplan 4 of 32 pixels
        add.l   #80*256,a1
        move.l  d5,(a1)+                ; set bitplan 5 of 32 pixels
        sub.l   #5*80*256,a1
        sub.l   #1,a2
        cmpa.l  #0,a2
        bne.b   affdv_do_32_pixels
        move.l  a3,a2                   ; reset 32pixel counter
        add.l   a4,a1                   ; put a1 on start of next screen line
        sub.l   #1,a5
        cmpa.l  #0,a5
        bne.b   affdv_do_a_screen_line
        movem.l (sp)+,d0-d7/a0-a6
        rts
Last edited by Lord Riton; 05 April 2011 at 01:55. |
05 April 2011, 08:36 | #10 |
Registered User
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,344
|
We could even use the horizontal blanking, couldn't we?
|
05 April 2011, 10:22 | #11 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
It's not even the chip write access that slows it down. When I comment out all the "move.l d0,(a1) ; set bitplan 0 of 32 pixels" lines, it's about the same speed. I guess my method is just too slow compared to my old one.. Or maybe it's WinUAE that gives false results; I should try on my real Amiga, but it's a pain to transfer stuff from my PC to it.
Edit: if you want to help test this, you can do so here: http://eab.abime.net/showthread.php?t=58617
Last edited by Lord Riton; 05 April 2011 at 16:11. |
05 April 2011, 16:17 | #12 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
How about estimating how many CPU cycles the computational work would take? That should give you an idea if it is the computations that overshadow the time spent in the memory accesses.
On a 50MHz system, 1 frame = 1 million cycles (50,000,000 cycles per second / 50 frames per second on PAL). The chipwrites ought to occupy the CPU's bus interface for about 0.5 frames (most of which can be overlapped with the computations) and the fastreads stall the entire CPU for about 0.2 frames (some of which can be overlapped with the computations on 68030+). |
05 April 2011, 17:55 | #13 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
I'm pretty sure my old c2p routine is faster because it uses a lot more RAM accesses, mainly fast RAM, and I guess WinUAE does not emulate the RAM's real speed, so those accesses come out much faster than on a real Amiga.
As for computing the total CPU cycles both routines take, that's probably a bit beyond my knowledge. I am not too sure how long each instruction takes. In the 68020 manual I have, there are 3 different cycle counts for each instruction (best, in cache, worst). |
05 April 2011, 18:12 | #14 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,582
|
68020 CE emulates only memory access speeds cycle-exactly (chip, fast, ROM, CIA etc.). The instruction cache is also emulated.
CPU internal timing emulation is usually "immediate" (because it is very complex compared to the simple 68000). Fortunately it is good enough for most purposes; the limit is almost always the Agnus bus (chip RAM, custom registers). |
05 April 2011, 22:12 | #15 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
OK, it seems people with a real Amiga have confirmed it: my new c2p routine is faster than my old one. All is good, finally.
So far there is just one guy who found the old Qon version faster; he is also the only one with an 040, maybe that's why (?) |
06 April 2011, 01:36 | #16 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
regarding estimating performance: sure you can. Start out small. Assume in-cache for all instructions. Ignore any instructions that access memory because those are much more complicated to compute. Write the number of cycles for the instruction in the right-hand column.
Example: Code:
.loop:
        move.l  (a0)+,d0        ; 0 [because it's too complicated to look up]
        add.l   d1,d0           ; <look this up in manual>
        add.l   d2,d0           ; <look this up in manual>
        add.l   d3,d0           ; <look this up in manual>
        add.l   (a1)+,d0        ; 0 [because it's too complicated to look up]
        move.l  d0,(a2)+        ; 0 [because it's too complicated to look up]
        dbf     d7,.loop        ; <look this up in manual>
                                ; = <sum of the above instructions>
|
06 April 2011, 04:42 | #17 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
I don't know much about chunky to planar (I use a gfx card), but the code could use some optimization. This should run better on 68020-68060...
Code:
        moveq   #0,d0
        move.l  ptr_dess_vue,a0         ; a0 = source chunky screen
        move.w  offset_image,d0         ; d0 = screen offset
        move.l  _bitp,a1
        add.l   d0,a1
        move.w  long_x_3,d0
        move.w  #80,a4
        move.l  d0,d1
        lsr.l   #5,d0
        move.l  d0,a3                   ; a3 = number of 32pixel parts per line
        move.l  d0,a2
        lsr.l   #3,d1                   ; /32 *4
        neg.l   d1
        moveq   #80,d0
        add.l   d0,d1
        move.w  long_y,d0
        move.l  d1,a4                   ; a4 = offset to add to end of line till next line
        move.l  d0,a5                   ; a5 = y counter
affdv_do_a_screen_line
affdv_do_32_pixels
        move.w  #8,a6                   ; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
        move.l  (a0)+,d6                ; get 4 chunky pixels
        moveq.l #4,d7                   ; 4 pixels counter
affdv_do_1_pixel
        lsl.l   #3,d6                   ; we don't need the 2 ham8 control bits (7+6)
        addx.l  d5,d5                   ; bit 5 of a pixel to bitplane 5 (0-5)
        add.l   d6,d6
        addx.l  d4,d4                   ; bit 4 of a pixel to bitplane 4 (0-5)
        add.l   d6,d6
        addx.l  d3,d3                   ; bit 3 of a pixel to bitplane 3 (0-5)
        add.l   d6,d6
        addx.l  d2,d2                   ; bit 2 of a pixel to bitplane 2 (0-5)
        add.l   d6,d6
        addx.l  d1,d1                   ; bit 1 of a pixel to bitplane 1 (0-5)
        add.l   d6,d6
        addx.l  d0,d0                   ; bit 0 of a pixel to bitplane 0 (0-5)
        subq.l  #1,d7
        bne.b   affdv_do_1_pixel
        subq.l  #1,a6
        tst.l   a6
        bne.b   affdv_do_4_pixels
        move.l  d0,(a1)                 ; set bitplan 0 of 32 pixels
        add.w   #80*256,a1
        move.l  d1,(a1)                 ; set bitplan 1 of 32 pixels
        add.w   #80*256,a1
        move.l  d2,(a1)                 ; set bitplan 2 of 32 pixels
        add.w   #80*256,a1
        move.l  d3,(a1)                 ; set bitplan 3 of 32 pixels
        add.w   #80*256,a1
        move.l  d4,(a1)                 ; set bitplan 4 of 32 pixels
        add.w   #80*256,a1
        move.l  d5,(a1)                 ; set bitplan 5 of 32 pixels
        sub.l   #5*80*256-4,a1
        subq.l  #1,a2
        tst.l   a2
        bne.b   affdv_do_32_pixels
        move.l  a3,a2                   ; reset 32pixel counter
        add.l   a4,a1                   ; put a1 on start of next screen line
        subq.l  #1,a5
        tst.l   a5
        bne.b   affdv_do_a_screen_line
Last edited by matthey; 06 April 2011 at 14:54. Reason: fix |
06 April 2011, 10:34 | #18 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,865
|
Ahem.
Ahem. Ahem. This code will not work, you might want to check your 680x0 manual. |
06 April 2011, 10:44 | #19 |
Registered User
Join Date: Jan 2011
Location: France
Age: 52
Posts: 507
|
That kind of optimization will not work, because:
SUBQ Subtract Quick SUBQ (M68000 Family)
Operation: Destination – Immediate Data → Destination
Assembler Syntax: SUBQ #<data>,<ea>
Attributes: Size = (Byte, Word, Long)
Description: Subtracts the immediate data (1 – 8) from the destination operand. The size of the operation is specified as byte, word, or long. Only word and long operations can be used with address registers, and the condition codes are not affected. ....
Edit: lol, I swear StingRay's post was not there when I started to write this.
Edit2: Anyway, I will change the code some more, because the chip RAM write waitstates are not "absorbed" at all (just by an "add.w"). I will have to write out words instead of longwords, so this makes twice as many writes, but their waitstates should be "absorbed" much better by the following instructions, so it should be faster in the end.
Last edited by Lord Riton; 06 April 2011 at 11:06. |
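To show the point from the manual excerpt side by side, here is a small sketch (labels and loop counts are made up, the loop bodies are left empty). With an address register as the counter, SUBQ leaves the condition codes alone, so an explicit compare is needed before the branch; with a data register, SUBQ sets the Z flag itself.

```asm
; Hypothetical sketch, not code from the thread.
count_with_an:
        move.w  #8,a6           ; movea.w sign-extends: a6 = 8
.loop_a:
        ; ... loop body ...
        subq.l  #1,a6           ; SUBQ to An: condition codes unchanged
        cmpa.l  #0,a6           ; explicit compare sets Z when a6 = 0
        bne.b   .loop_a
        rts

count_with_dn:
        moveq   #8,d7           ; d7 = 8
.loop_d:
        ; ... loop body ...
        subq.l  #1,d7           ; SUBQ to Dn: sets Z when d7 reaches 0
        bne.b   .loop_d
        rts
```

On 68020 and later, `tst.l a6` can stand in for the `cmpa.l #0,a6`, but the separate test instruction is still required.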
06 April 2011, 15:09 | #20 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
@StingRay & Lord Riton
You're correct. I was thinking math on an address register set the CC but not movea for some reason. Must have been because it was late. Motorola should have made address register operations set the CC like data registers. Anyway, I corrected the code above using tst.l instead of cmp.l, as that is allowed on address registers on 68020+. |