English Amiga Board


Old 03 April 2011, 20:40   #1
Lord Riton
memory access speed question

Hi,

First, I'm sorry if this has already been asked; I tried to find it but couldn't.

I have read somewhere (I don't remember where) that this code here:

Code:
move.w someMemoryAdr,d1
move.w someMemoryAdr2,d2
addq   #6,d3
is actually slower than this one:

Code:
move.w someMemoryAdr,d1
addq   #6,d3
move.w someMemoryAdr2,d2
I guess it's the same for writing to memory, for example if I replace "someMemoryAdr,d1" with "d1,someMemoryAdr"?

And now my main question: do the wait states from a memory write also affect a following memory read, or are reads and writes stalled separately?

For example, is this:

Code:
move.w someMemoryAdr,d1
move.w d2,someMemoryAdr2
addq   #6,d3
slower than this:

Code:
move.w someMemoryAdr,d1
addq   #6,d3
move.w d2,someMemoryAdr2
Old 03 April 2011, 21:39   #2
Kalms
What is your target system? What CPU? Are you reading/writing to chipmem or fastmem? The question is waaay too broad to give a simple answer. All that can be said from your description is that, generally, the latter will be at least as fast as the former.
Old 03 April 2011, 21:49   #3
Lord Riton
I didn't think this depended on the type of memory.

I thought it was just general behaviour for 68020+ processors; now I'm even more confused.
Old 04 April 2011, 01:36   #4
Kalms
If you're targeting fastmem and you're hitting the cache on 68040+, then both alternatives will be equally fast.

If you're targeting fastmem and you're not hitting the cache on 68040+, then there is a bunch of cycles after the 1st read during which the bus interface is busy (this is due to the CPU fetching the entire cacheline). Any reads/writes which generate bus traffic during that period will stall until the first cacheline fetch has completed. You can see the same effect on a 68030 with DBURST on. So the 2nd alternative will be faster under those circumstances.

If you're targeting fastmem and you're on a 68020, or a 68030 with DBURST off, then they should be equally fast.

If you're targeting chipmem then it depends a lot on the exact timing, i.e. how your CPU instructions align to the chipbus cycle boundaries. And the alignment requirements for optimal performance differ between accelerator boards.
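To make the "bus busy during a cacheline fill" case concrete, here is a minimal sketch of my own (not from the post above); the register use and the filler instructions are arbitrary assumptions:

Code:
	; 68040+ (or 68030 with DBURST on), fastmem, cache-missing read:
	move.l	(a0)+,d1	; miss: the bus starts fetching the whole cacheline
	mulu.w	#320,d3		; register-only work generates no bus traffic,
	add.l	d4,d3		;   so it overlaps with the ongoing line fill
	move.l	(a0)+,d2	; by now the line is (mostly) in the cache: little or no stall
The point is simply that the instructions placed between the two memory accesses are ones the CPU can execute without touching the bus.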
Old 04 April 2011, 04:29   #5
Lord Riton
OK, I found the source where I read it. It was not exactly as I remembered; it only talked about writes, not reads.

Here is the article: http://www.mways.co.uk/amiga/howtoco...80x0issues.php

It's under "A1200 speed issues".


Edit: OK, I understand better now how this works. If you're interested, have a look at this:

[Attached image: 68020processorActivity.JPG]

Last edited by Lord Riton; 04 April 2011 at 12:40.
Old 04 April 2011, 22:35   #6
Kalms
Chipmem and fastmem accesses are different. To be precise, chipmem accesses are uncached (so they behave largely the same way on all 68020+ systems). Also, chipmem is very slow compared to the CPU clockrate.

If you read from a chipmem location, the CPU will stall for the entire duration of the memory read operation. This is because the CPU needs the value stored in that location before the read operation can complete.

If you write, however, in most system configurations the write gets placed in a buffer, and the CPU continues processing other stuff while the bus interface is busy. (On most accelerator boards there is such a write buffer on the board itself; in addition, the 68060 has a 4-slot write buffer internally in the CPU.) If any subsequent instruction tries to hit the bus while there are still pending writes, the CPU will stall until the bus is available again.

For 50MHz accelerator boards, the bus will typically remain busy for 26-28 cycles after you have performed a chipmem write. During that period, don't touch the bus.
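As a concrete illustration of "don't touch the bus" (my own sketch, not from the post above; the instruction mix and registers are just an example):

Code:
	move.l	d0,(a1)+	; chipmem write: lands in the write buffer, the CPU
				;   carries on, but the bus stays busy for a while
	lsl.l	#3,d6		; register-only work runs for free in the meantime
	addx.l	d5,d5
	add.l	d6,d6
	addx.l	d4,d4
	move.l	d1,(a1)+	; next bus access: stalls here if the buffer hasn't drained yet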
Old 04 April 2011, 22:55   #7
Lord Riton
Quote:
Originally Posted by Kalms
For 50MHz accelerator boards, the bus will typically remain busy for 26-28 cycles after you have performed a chipmem write. During that period, don't touch the bus.
I guess it can vary a lot more, especially if you use a hires screen (like I do with my QON game) and the display is building up the screen at the moment of the chip memory write.
And I bet that is also the problem with the new c2p code I just made for it: it should be faster than my old one, but it isn't. The only reason I can see why my old c2p is faster is that it is quicker to do the conversion into fast RAM first and then simply copy the whole screen from fast RAM into chip RAM with fat movem.l's. I'm a bit desperate; I feel I'll soon abandon the Amiga again and just go back to easy C++ PC programming.

... now I'm really going to play some Mass Effect 1 on my Xbox 360 to forget this...
Old 04 April 2011, 23:14   #8
Kalms
Yup. There are two practical things you can try:
1) Only do the c2p outside of the screen display - if you have a 200-line-high display window then you still have 112 lines per frame during which the display DMA isn't touching chipram (see the beam-wait sketch at the end of this post). It will take you multiple frames to complete the c2p conversion.
2) Find a way (specific to your application) that needs less overall memory traffic than reading the entire fastmem buffer and writing the entire chipmem buffer.

The standard c2p routines are (from a performance perspective) equivalent to a fast-to-chip copy on a 68040@40 and faster CPUs, i.e. the actual c2p transformation logic is done while the chipbus is busy. So if you want higher performance, you need to do something about the memory accesses.
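For option 1 above, here is a rough sketch (my own, not from the post) of waiting until the beam has left the display window before starting a chipmem-heavy pass. The threshold line of 244 is an assumption for a 200-line PAL display starting at line $2c; adjust it to your own setup:

Code:
.waitbeam:
	move.l	$dff004,d0	; read VPOSR/VHPOSR in one go
	lsr.l	#8,d0
	and.l	#$1ff,d0	; d0 = 9-bit vertical beam position
	cmp.l	#244,d0		; still inside the display window?
	blo.b	.waitbeam	; yes: display DMA owns chipram, keep waiting
	; ...now do (part of) the chipmem-heavy work...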
Old 05 April 2011, 01:48   #9
Lord Riton
I should not play Mass Effect in hardcore mode as a relaxing game... that got me even more frustrated.

I tested the engine without any chip memory writes (I commented them all out), and it's still exactly as slow!? There must be something else... I'll look at it tomorrow.

Edit:

I'll just post the code here; maybe someone sees something wrong or suspect that I didn't see myself.

Code:
	move.l	ptr_dess_vue,a0	; a0 = source chunky screen

	move.w	offset_image,d0	; d0 = screen offset
	and.l	#$ffff,d0
	add.l	_bitp,d0
	move.l	d0,a1		; a1 = adr bitplane 0 of destination screen

	move.w	long_x_3,d0
	move.w	d0,d1
	lsr.w	#5,d0
	and.l	#$ffff,d0
	move.l	d0,a3		; a3 = number of 32pixel parts per line
	move.l	a3,a2

	lsr.w	#3,d1		; /32  *4
	neg.w	d1
	ext.l	d1
	add.l	#80,d1
	move.l	d1,a4 	; a4 = offset to add to end of line till next line
	move.w	long_y,d0
	and.l	#$ffff,d0
	move.l	d0,a5		; a5 = y counter

affdv_do_a_screen_line
affdv_do_32_pixels
	move.l	#8,a6		; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
	move.l	(a0)+,d6	; get 4 chunky pixels
	moveq.l	#4,d7		; 4 pixels counter
affdv_do_1_pixel
	lsl.l	#3,d6		; we don't need the 2 ham8 control bits (7+6)
	addx.l	d5,d5		; bit 5 of a pixel to bitplane 5 (0-5)
	add.l	d6,d6
	addx.l	d4,d4		; bit 4 of a pixel to bitplane 4 (0-5)
	add.l	d6,d6
	addx.l	d3,d3		; bit 3 of a pixel to bitplane 3 (0-5)
	add.l	d6,d6
	addx.l	d2,d2		; bit 2 of a pixel to bitplane 2 (0-5)
	add.l	d6,d6
	addx.l	d1,d1		; bit 1 of a pixel to bitplane 1 (0-5)
	add.l	d6,d6
	addx.l	d0,d0		; bit 0 of a pixel to bitplane 0 (0-5)

	subq.l	#1,d7
	bne.b	affdv_do_1_pixel

	subq.l	#1,a6
	cmpa.l	#0,a6
	bne.b	affdv_do_4_pixels

	move.l	d0,(a1)		; set bitplan 0 of 32 pixels
	add.l	#80*256,a1
	move.l	d1,(a1)		; set bitplan 1 of 32 pixels
	add.l	#80*256,a1
	move.l	d2,(a1)		; set bitplan 2 of 32 pixels
	add.l	#80*256,a1
	move.l	d3,(a1)		; set bitplan 3 of 32 pixels
	add.l	#80*256,a1
	move.l	d4,(a1)		; set bitplan 4 of 32 pixels
	add.l	#80*256,a1
	move.l	d5,(a1)+	; set bitplan 5 of 32 pixels
	sub.l	#5*80*256,a1

	sub.l	#1,a2
	cmpa.l	#0,a2
	bne.b	affdv_do_32_pixels

	move.l	a3,a2		; reset 32pixel counter
	add.l	a4,a1		; put a1 on start of next screen line
	sub.l	#1,a5
	cmpa.l	#0,a5
	bne.b	affdv_do_a_screen_line
		
	movem.l	(sp)+,d0-d7/a0-a6
	rts
Tomorrow I'll wake up and some nice fairy will have fixed it while I slept, let's hope

Last edited by Lord Riton; 05 April 2011 at 01:55.
Old 05 April 2011, 08:36   #10
sandruzzo
We could even use the horizontal blanking, couldn't we?
Old 05 April 2011, 10:22   #11
Lord Riton
Quote:
Originally Posted by sandruzzo
We could even use the horizontal blanking, couldn't we?
It's not even the chip write access that slows it down. When I comment out all the "move.l d0,(a1) ; set bitplan 0 of 32 pixels" lines, it's about the same speed. I guess my method is just too slow compared to my old one... Or maybe WinUAE gives false results; I should try it on my real Amiga, but it's a pain to transfer stuff from my PC to it.


Edit: if you want to help test this, you can do so here: http://eab.abime.net/showthread.php?t=58617

Last edited by Lord Riton; 05 April 2011 at 16:11.
Old 05 April 2011, 16:17   #12
Kalms
How about estimating how many CPU cycles the computational work would take? That should give you an idea of whether it is the computations that overshadow the time spent in the memory accesses.
On a 50MHz system, 1 frame = 1 million cycles.
The chip writes ought to occupy the CPU's bus interface for about 0.5 frames (most of which can be overlapped with the computations), and the fastmem reads stall the entire CPU for about 0.2 frames (some of which can be overlapped with the computations on 68030+).
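Spelling out the arithmetic behind those figures (assuming a PAL refresh of 50 frames per second):

Code:
50,000,000 cycles/s  /  50 frames/s  =  1,000,000 cycles/frame
~0.5 frame of chipmem writes  ->  ~500,000 cycles of bus-interface time
~0.2 frame of fastmem reads   ->  ~200,000 cycles of CPU stall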
Old 05 April 2011, 17:55   #13
Lord Riton
I'm pretty sure my old c2p routine is faster because it does a lot more RAM accesses, mainly to fast RAM, and I guess WinUAE does not emulate the RAM's real speed, which makes those accesses look much faster than on a real Amiga.

As for counting the total CPU cycles both routines take, that's probably a bit beyond my knowledge. I'm not too sure how many cycles each instruction takes; in the 68020 manual I have there are three different cycle counts for each instruction (best case, in cache, worst case).
Old 05 April 2011, 18:12   #14
Toni Wilen
WinUAE developer
 
68020 CE mode only emulates memory access speeds cycle-exactly (chip, fast, ROM, CIA etc.); the instruction cache is also emulated.

CPU internal timing emulation is usually "immediate" (because it is very complex compared to the simple 68000). Fortunately it is good enough for most purposes; the limit is almost always the Agnus bus (chip RAM, custom registers).
Old 05 April 2011, 22:12   #15
Lord Riton
OK, it seems people with real Amigas have confirmed that my new c2p routine is faster than my old one, so all is good after all
So far there is just one guy who found the old QON version faster; he is also the only one with an 040, maybe that's why (?)
Old 06 April 2011, 01:36   #16
Kalms
Regarding estimating performance: sure you can. Start out small. Assume in-cache execution for all instructions. Ignore any instructions that access memory, because their timing is much more complicated to compute. Write the number of cycles for each instruction in the right-hand column.

Example:

Code:
.loop:
	move.l	(a0)+,d0		; 0 [because it's too complicated to look up]
	add.l	d1,d0			; <look this up in manual>
	add.l	d2,d0			; <look this up in manual>
	add.l	d3,d0			; <look this up in manual>
	add.l	(a1)+,d0		; 0 [because it's too complicated to look up]
	move.l	d0,(a2)+		; 0 [because it's too complicated to look up]
	dbf	d7,.loop		; <look this up in manual>
					; = <sum of the above instructions>
The reason I suggest this is that you will quickly get an intuitive understanding of the relative speed of different instructions. That helps when guessing how fast a piece of code will run on the target hardware.
Old 06 April 2011, 04:42   #17
matthey
I don't know much about chunky-to-planar conversion (I use a gfx card), but the code could use some optimization. This should run better on the 68020-68060...

Code:
    moveq    #0,d0
    move.l    ptr_dess_vue,a0    ; a0 = source chunky screen
    move.w   offset_image,d0    ; d0 = screen offset
    move.l    _bitp,a1
    add.l      d0,a1

    move.w   long_x_3,d0
    move.w  #80,a4
    move.l    d0,d1
    lsr.l        #5,d0
    move.l    d0,a3        ; a3 = number of 32pixel parts per line
    move.l    d0,a2

    lsr.l        #3,d1        ; /32  *4
    neg.l      d1
    moveq    #80,d0
    add.l      d0,d1
    move.w   long_y,d0
    move.l    d1,a4        ; a4 = offset to add to end of line till next line
    move.l    d0,a5        ; a5 = y counter

affdv_do_a_screen_line
affdv_do_32_pixels
    move.w    #8,a6        ; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
    move.l     (a0)+,d6    ; get 4 chunky pixels
    moveq.l    #4,d7        ; 4 pixels counter
affdv_do_1_pixel
    lsl.l       #3,d6        ; we don't need the 2 ham8 control bits (7+6)
    addx.l    d5,d5        ; bit 5 of a pixel to bitplane 5 (0-5)
    add.l    d6,d6
    addx.l    d4,d4        ; bit 4 of a pixel to bitplane 4 (0-5)
    add.l    d6,d6
    addx.l    d3,d3        ; bit 3 of a pixel to bitplane 3 (0-5)
    add.l    d6,d6
    addx.l    d2,d2        ; bit 2 of a pixel to bitplane 2 (0-5)
    add.l    d6,d6
    addx.l    d1,d1        ; bit 1 of a pixel to bitplane 1 (0-5)
    add.l    d6,d6
    addx.l    d0,d0        ; bit 0 of a pixel to bitplane 0 (0-5)

    subq.l    #1,d7
    bne.b    affdv_do_1_pixel

    subq.l    #1,a6
    tst.l      a6
    bne.b    affdv_do_4_pixels

    move.l    d0,(a1)        ; set bitplan 0 of 32 pixels
    add.w    #80*256,a1
    move.l    d1,(a1)        ; set bitplan 1 of 32 pixels
    add.w    #80*256,a1
    move.l    d2,(a1)        ; set bitplan 2 of 32 pixels
    add.w    #80*256,a1
    move.l    d3,(a1)        ; set bitplan 3 of 32 pixels
    add.w    #80*256,a1
    move.l    d4,(a1)        ; set bitplan 4 of 32 pixels
    add.w    #80*256,a1
    move.l    d5,(a1)        ; set bitplan 5 of 32 pixels
    sub.l      #5*80*256-4,a1

    subq.l     #1,a2
    tst.l       a2
    bne.b     affdv_do_32_pixels

    move.l    a3,a2        ; reset 32pixel counter
    add.l      a4,a1        ; put a1 on start of next screen line
    subq.l    #1,a5
    tst.l      a5
    bne.b    affdv_do_a_screen_line
These are just the obvious optimizations; several of them could be done automatically by an optimizing assembler. There are places where the sign extension into an address register could be used, provided the upper bit of the unsigned word is never set. 68060 performance would improve further by using long operations more and by better scheduling. Feel free to ask any questions about an optimization or to post any errors. Of course, a good algorithm and taking the slow chip mem into account matter more in this case. That is not my area of expertise, so I'll leave it to the assembler programmers who have it down to an art.
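For example (my own illustration of the sign-extension point, assuming long_y always fits in 15 bits):

Code:
; form used in the code above (relies on d0's upper word already being zero):
	move.w	long_y,d0
	move.l	d0,a5
; one-instruction alternative: movea.w sign-extends into the full address
; register, which equals zero-extension as long as long_y < 32768:
	movea.w	long_y,a5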

Last edited by matthey; 06 April 2011 at 14:54. Reason: fix
Old 06 April 2011, 10:34   #18
StingRay
Quote:
Originally Posted by matthey
subq.l #1,a6
bne.b affdv_do_4_pixels
Ahem.

Quote:
Originally Posted by matthey
subq.l #1,a2
bne.b affdv_do_32_pixels
Ahem.

Quote:
Originally Posted by matthey
subq.l #1,a5
bne.b affdv_do_a_screen_line
Ahem.

This code will not work; you might want to check your 680x0 manual.
Old 06 April 2011, 10:44   #19
Lord Riton
Quote:
Originally Posted by matthey
Code:
    subq.l    #1,a6
    bne.b    affdv_do_4_pixels
That kind of optimization will not work, because:

SUBQ - Subtract Quick (M68000 Family)

Operation: Destination - Immediate Data → Destination

Assembler Syntax: SUBQ #<data>,<ea>

Attributes: Size = (Byte, Word, Long)

Description: Subtracts the immediate data (1-8) from the destination operand. The size of the operation is specified as byte, word, or long. Only word and long operations can be used with address registers, and the condition codes are not affected. ...

Edit: lol, I swear StingRay's post was not there when I started writing this

Edit2: Anyway, I will change the code some more, because the wait states of the chip RAM writes are not "absorbed" at all (there is only a single "add.w" after each one). I will write out words instead of longwords; that doubles the number of writes, but their wait states should be much better "absorbed" by the following instructions, so it should end up faster.
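A rough sketch of that idea (my own illustration; the interleaved instructions are placeholders, and whether it actually wins depends on how much register-only work can be slotted in between):

Code:
	; instead of a single  move.l d0,(a1)  with nothing to hide its wait states:
	swap	d0
	move.w	d0,(a1)		; write the high word of the bitplane data
	lsl.l	#3,d6		; ...register-only conversion work in between...
	addx.l	d5,d5
	add.l	d6,d6
	swap	d0
	move.w	d0,2(a1)	; write the low word afterwards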

Last edited by Lord Riton; 06 April 2011 at 11:06.
Old 06 April 2011, 15:09   #20
matthey
@StingRay & Lord Riton
You're correct. For some reason I was thinking that arithmetic on an address register sets the CC and that only movea doesn't. Must have been because it was late. Motorola should have made address register operations set the CC like data register operations do. Anyway, I corrected the code above, using tst.l instead of cmp.l since testing an address register is allowed on the 68020+.
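For reference, the two working counter idioms side by side (register choice is illustrative only; the TST.L An form needs a 68020 or better):

Code:
; counter kept in an address register (the fix used above):
	subq.l	#1,a2		; subq to An does not set the condition codes...
	tst.l	a2		; ...so test explicitly (tst.l An is 68020+ only)
	bne.b	affdv_do_32_pixels

; counter kept in a data register instead (works on any 68000):
	subq.l	#1,d7
	bne.b	affdv_do_32_pixels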
 

