English Amiga Board


Old 03 April 2011, 20:40   #1
Lord Riton
memory access speed question

Hi,

First, I'm sorry if this has already been asked; I tried to find it but couldn't.

I have read somewhere (I don't remember where) that this code here:

Code:
move.w someMemoryAdr,d1
move.w someMemoryAdr2,d2
addq   #6,d3
is actually slower than this one:

Code:
move.w someMemoryAdr,d1
addq   #6,d3
move.w someMemoryAdr2,d2
I guess it's the same for writing to memory, for example if I replace "someMemoryAdr,d1" with "d1,someMemoryAdr"?

And now my main question: do the wait states from a memory write also affect a following memory read, or are reads and writes stalled separately?

For example, is this:

Code:
move.w someMemoryAdr,d1
move.w d2,someMemoryAdr2
addq   #6,d3
slower than this:

Code:
move.w someMemoryAdr,d1
addq   #6,d3
move.w d2,someMemoryAdr2
Old 03 April 2011, 21:39   #2
Kalms
What is your target system? What CPU? Are you reading/writing to chipmem or fastmem? The question is waaay too broad to give a simple answer. All that can be said from your description is that, generally, the latter will be at least as fast as the former.
Old 03 April 2011, 21:49   #3
Lord Riton
I didn't think this depended on the type of memory.

I thought it was just general behaviour for 68020+ processors; now I'm even more confused.
Old 04 April 2011, 01:36   #4
Kalms
If you're targeting fastmem and you're hitting the cache on 68040+, then both alternatives will be equally fast.

If you're targeting fastmem and you're not hitting the cache on 68040+, then there is a bunch of cycles after the 1st read during which the bus interface is busy (this is due to the CPU fetching the entire cacheline). Any reads/writes which generate bus traffic during that period will stall until the first cacheline fetch has completed. You can see the same effect on a 68030 with DBURST on. So the 2nd alternative will be faster under those circumstances.

If you're targeting fastmem and you're on a 68020, or a 68030 with DBURST off, then they should be equally fast.

If you're targeting chipmem then it depends a lot on the exact timing, i.e. how your CPU instructions align to the chipbus cycle boundaries. And the alignment requirements for optimal performance differ between accelerator boards.
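To make the "bus busy during a cacheline fill" case concrete, here is a minimal sketch of my own (not from the post above); the register use and the filler instructions are arbitrary assumptions:

Code:
	; 68040+ (or 68030 with DBURST on), fastmem, cache-missing read:
	move.l	(a0)+,d1	; miss: the bus starts fetching the whole cacheline
	mulu.w	#320,d3		; register-only work generates no bus traffic,
	add.l	d4,d3		;   so it overlaps with the ongoing line fill
	move.l	(a0)+,d2	; by now the line is (mostly) in the cache: little or no stall
The point is simply that the instructions placed between the two memory accesses are ones the CPU can execute without touching the bus.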
Old 04 April 2011, 04:29   #5
Lord Riton
OK, I found the source where I read it. It was not exactly as I remembered; it only talked about writes, not reads.

Here is the article: http://www.mways.co.uk/amiga/howtoco...80x0issues.php

It's under "A1200 speed issues".


Edit: OK, I understand better now how this works. If you're interested, have a look at this:

[Attached image: 68020processorActivity.JPG]

Last edited by Lord Riton; 04 April 2011 at 12:40.
Old 04 April 2011, 22:35   #6
Kalms
Chipmem and fastmem accesses are different. To be precise, chipmem accesses are uncached (so they behave largely the same way on all 68020+ systems). Also, chipmem is very slow compared to the CPU clockrate.

If you read from a chipmem location, the CPU will stall for the entire duration of the memory read operation. This is because the CPU needs the value stored in that location before the read operation can complete.

If you write, however, in most system configurations the write gets placed in a buffer, and the CPU continues processing other stuff while the bus interface is busy. (On most accelerator boards there is such a write buffer on the board itself; in addition, the 68060 has a 4-slot write buffer internally in the CPU.) If any subsequent instruction tries to hit the bus while there are still pending writes, the CPU will stall until the bus is available again.

For 50MHz accelerator boards, the bus will typically remain busy for 26-28 cycles after you have performed a chipmem write. During that period, don't touch the bus.
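As a concrete illustration of "don't touch the bus" (my own sketch, not from the post above; the instruction mix and registers are just an example):

Code:
	move.l	d0,(a1)+	; chipmem write: lands in the write buffer, the CPU
				;   carries on, but the bus stays busy for a while
	lsl.l	#3,d6		; register-only work runs for free in the meantime
	addx.l	d5,d5
	add.l	d6,d6
	addx.l	d4,d4
	move.l	d1,(a1)+	; next bus access: stalls here if the buffer hasn't drained yet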
Old 04 April 2011, 22:55   #7
Lord Riton
Quote:
Originally Posted by Kalms
For 50MHz accelerator boards, the bus will typically remain busy for 26-28 cycles after you have performed a chipmem write. During that period, don't touch the bus.
I guess it can vary a lot more, especially if you use a hires screen (like I do with my QON game) and the display is building up the screen at the moment of the chip memory write.
And I bet that is also the problem with the new c2p code I just made for it: it should be faster than my old one, but it isn't. The only reason I can see why my old c2p is faster is that it is quicker to do the conversion into fast RAM first and then simply copy the whole screen from fast RAM into chip RAM with fat movem.l's. I'm a bit desperate; I feel I'll soon abandon the Amiga again and just go back to easy C++ PC programming.

... now I'm really going to play some Mass Effect 1 on my Xbox 360 to forget this...
Old 04 April 2011, 23:14   #8
Kalms
Yup. There are two practical things you can try:
1) Only do the c2p outside of the screen display - if you have a 200-line-high display window then you still have 112 lines per frame during which the display DMA isn't touching chipram (see the beam-wait sketch at the end of this post). It will take you multiple frames to complete the c2p conversion.
2) Find a way (specific to your application) that needs less overall memory traffic than reading the entire fastmem buffer and writing the entire chipmem buffer.

The standard c2p routines are (from a performance perspective) equivalent to a fast-to-chip copy on a 68040@40 and faster CPUs, i.e. the actual c2p transformation logic is done while the chipbus is busy. So if you want higher performance, you need to do something about the memory accesses.
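For option 1 above, here is a rough sketch (my own, not from the post) of waiting until the beam has left the display window before starting a chipmem-heavy pass. The threshold line of 244 is an assumption for a 200-line PAL display starting at line $2c; adjust it to your own setup:

Code:
.waitbeam:
	move.l	$dff004,d0	; read VPOSR/VHPOSR in one go
	lsr.l	#8,d0
	and.l	#$1ff,d0	; d0 = 9-bit vertical beam position
	cmp.l	#244,d0		; still inside the display window?
	blo.b	.waitbeam	; yes: display DMA owns chipram, keep waiting
	; ...now do (part of) the chipmem-heavy work...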
Old 05 April 2011, 01:48   #9
Lord Riton
I should not play Mass Effect in hardcore mode as a relaxing game... that got me even more frustrated.

I tested the engine without any chip memory writes (I commented them all out), and it's still exactly as slow!? There must be something else... I'll look at it tomorrow.

Edit:

I'll just post the code here; maybe someone sees something wrong or suspect that I didn't see myself.

Code:
	move.l	ptr_dess_vue,a0	; a0 = source chunky screen

	move.w	offset_image,d0	; d0 = screen offset
	and.l	#$ffff,d0
	add.l	_bitp,d0
	move.l	d0,a1		; a1 = adr bitplane 0 of destination screen

	move.w	long_x_3,d0
	move.w	d0,d1
	lsr.w	#5,d0
	and.l	#$ffff,d0
	move.l	d0,a3		; a3 = number of 32pixel parts per line
	move.l	a3,a2

	lsr.w	#3,d1		; /32  *4
	neg.w	d1
	ext.l	d1
	add.l	#80,d1
	move.l	d1,a4 	; a4 = offset to add to end of line till next line
	move.w	long_y,d0
	and.l	#$ffff,d0
	move.l	d0,a5		; a5 = y counter

affdv_do_a_screen_line
affdv_do_32_pixels
	move.l	#8,a6		; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
	move.l	(a0)+,d6	; get 4 chunky pixels
	moveq.l	#4,d7		; 4 pixels counter
affdv_do_1_pixel
	lsl.l	#3,d6		; we don't need the 2 ham8 control bits (7+6)
	addx.l	d5,d5		; bit 5 of a pixel to bitplane 5 (0-5)
	add.l	d6,d6
	addx.l	d4,d4		; bit 4 of a pixel to bitplane 4 (0-5)
	add.l	d6,d6
	addx.l	d3,d3		; bit 3 of a pixel to bitplane 3 (0-5)
	add.l	d6,d6
	addx.l	d2,d2		; bit 2 of a pixel to bitplane 2 (0-5)
	add.l	d6,d6
	addx.l	d1,d1		; bit 1 of a pixel to bitplane 1 (0-5)
	add.l	d6,d6
	addx.l	d0,d0		; bit 0 of a pixel to bitplane 0 (0-5)

	subq.l	#1,d7
	bne.b	affdv_do_1_pixel

	subq.l	#1,a6
	cmpa.l	#0,a6
	bne.b	affdv_do_4_pixels

	move.l	d0,(a1)		; set bitplan 0 of 32 pixels
	add.l	#80*256,a1
	move.l	d1,(a1)		; set bitplan 1 of 32 pixels
	add.l	#80*256,a1
	move.l	d2,(a1)		; set bitplan 2 of 32 pixels
	add.l	#80*256,a1
	move.l	d3,(a1)		; set bitplan 3 of 32 pixels
	add.l	#80*256,a1
	move.l	d4,(a1)		; set bitplan 4 of 32 pixels
	add.l	#80*256,a1
	move.l	d5,(a1)+	; set bitplan 5 of 32 pixels
	sub.l	#5*80*256,a1

	sub.l	#1,a2
	cmpa.l	#0,a2
	bne.b	affdv_do_32_pixels

	move.l	a3,a2		; reset 32pixel counter
	add.l	a4,a1		; put a1 on start of next screen line
	sub.l	#1,a5
	cmpa.l	#0,a5
	bne.b	affdv_do_a_screen_line
		
	movem.l	(sp)+,d0-d7/a0-a6
	rts
Tomorrow I'll wake up and some nice fairy will have fixed it while I slept, let's hope

Last edited by Lord Riton; 05 April 2011 at 01:55.
Old 05 April 2011, 08:36   #10
sandruzzo
We could even use the horizontal blanking, couldn't we?
Old 05 April 2011, 10:22   #11
Lord Riton
Quote:
Originally Posted by sandruzzo
We could even use the horizontal blanking, couldn't we?
It's not even the chip write access that slows it down. When I comment out all the "move.l d0,(a1) ; set bitplan 0 of 32 pixels" lines, it's about the same speed. I guess my method is just too slow compared to my old one... Or maybe WinUAE gives false results; I should try it on my real Amiga, but it's a pain to transfer stuff from my PC to it.


Edit: if you want to help test this, you can do so here: http://eab.abime.net/showthread.php?t=58617

Last edited by Lord Riton; 05 April 2011 at 16:11.
Old 05 April 2011, 16:17   #12
Kalms
How about estimating how many CPU cycles the computational work would take? That should give you an idea of whether it is the computations that overshadow the time spent in the memory accesses.
On a 50MHz system, 1 frame = 1 million cycles.
The chip writes ought to occupy the CPU's bus interface for about 0.5 frames (most of which can be overlapped with the computations), and the fastmem reads stall the entire CPU for about 0.2 frames (some of which can be overlapped with the computations on 68030+).
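Spelling out the arithmetic behind those figures (assuming a PAL refresh of 50 frames per second):

Code:
50,000,000 cycles/s  /  50 frames/s  =  1,000,000 cycles/frame
~0.5 frame of chipmem writes  ->  ~500,000 cycles of bus-interface time
~0.2 frame of fastmem reads   ->  ~200,000 cycles of CPU stall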
Old 05 April 2011, 17:55   #13
Lord Riton
I'm pretty sure my old c2p routine is faster because it does a lot more RAM accesses, mainly to fast RAM, and I guess WinUAE does not emulate the RAM's real speed, which makes those accesses look much faster than on a real Amiga.

As for counting the total CPU cycles both routines take, that's probably a bit beyond my knowledge. I'm not too sure how many cycles each instruction takes; in the 68020 manual I have there are three different cycle counts for each instruction (best case, in cache, worst case).
Old 05 April 2011, 18:12   #14
Toni Wilen
WinUAE developer
 
68020 CE mode only emulates memory access speeds cycle-exactly (chip, fast, ROM, CIA etc.); the instruction cache is also emulated.

CPU internal timing emulation is usually "immediate" (because it is very complex compared to the simple 68000). Fortunately it is good enough for most purposes; the limit is almost always the Agnus bus (chip RAM, custom registers).
Old 05 April 2011, 22:12   #15
Lord Riton
OK, it seems people with real Amigas have confirmed that my new c2p routine is faster than my old one, so all is good after all
So far there is just one guy who found the old QON version faster; he is also the only one with an 040, maybe that's why (?)
Old 06 April 2011, 01:36   #16
Kalms
Regarding estimating performance: sure you can. Start out small. Assume in-cache execution for all instructions. Ignore any instructions that access memory, because their timing is much more complicated to compute. Write the number of cycles for each instruction in the right-hand column.

Example:

Code:
.loop:
	move.l	(a0)+,d0		; 0 [because it's too complicated to look up]
	add.l	d1,d0			; <look this up in manual>
	add.l	d2,d0			; <look this up in manual>
	add.l	d3,d0			; <look this up in manual>
	add.l	(a1)+,d0		; 0 [because it's too complicated to look up]
	move.l	d0,(a2)+		; 0 [because it's too complicated to look up]
	dbf	d7,.loop		; <look this up in manual>
					; = <sum of the above instructions>
The reason I suggest this is that you will quickly get an intuitive understanding of the relative speed of different instructions. That helps when guessing how fast a piece of code will run on the target hardware.
Old 06 April 2011, 04:42   #17
matthey
I don't know much about chunky-to-planar conversion (I use a gfx card), but the code could use some optimization. This should run better on the 68020-68060...

Code:
    moveq    #0,d0
    move.l    ptr_dess_vue,a0    ; a0 = source chunky screen
    move.w   offset_image,d0    ; d0 = screen offset
    move.l    _bitp,a1
    add.l      d0,a1

    move.w   long_x_3,d0
    move.w  #80,a4
    move.l    d0,d1
    lsr.l        #5,d0
    move.l    d0,a3        ; a3 = number of 32pixel parts per line
    move.l    d0,a2

    lsr.l        #3,d1        ; /32  *4
    neg.l      d1
    moveq    #80,d0
    add.l      d0,d1
    move.w   long_y,d0
    move.l    d1,a4        ; a4 = offset to add to end of line till next line
    move.l    d0,a5        ; a5 = y counter

affdv_do_a_screen_line
affdv_do_32_pixels
    move.w    #8,a6        ; 8 packs (of 4 pixels each) counter
affdv_do_4_pixels
    move.l     (a0)+,d6    ; get 4 chunky pixels
    moveq.l    #4,d7        ; 4 pixels counter
affdv_do_1_pixel
    lsl.l       #3,d6        ; we don't need the 2 ham8 control bits (7+6)
    addx.l    d5,d5        ; bit 5 of a pixel to bitplane 5 (0-5)
    add.l    d6,d6
    addx.l    d4,d4        ; bit 4 of a pixel to bitplane 4 (0-5)
    add.l    d6,d6
    addx.l    d3,d3        ; bit 3 of a pixel to bitplane 3 (0-5)
    add.l    d6,d6
    addx.l    d2,d2        ; bit 2 of a pixel to bitplane 2 (0-5)
    add.l    d6,d6
    addx.l    d1,d1        ; bit 1 of a pixel to bitplane 1 (0-5)
    add.l    d6,d6
    addx.l    d0,d0        ; bit 0 of a pixel to bitplane 0 (0-5)

    subq.l    #1,d7
    bne.b    affdv_do_1_pixel

    subq.l    #1,a6
    tst.l      a6
    bne.b    affdv_do_4_pixels

    move.l    d0,(a1)        ; set bitplan 0 of 32 pixels
    add.w    #80*256,a1
    move.l    d1,(a1)        ; set bitplan 1 of 32 pixels
    add.w    #80*256,a1
    move.l    d2,(a1)        ; set bitplan 2 of 32 pixels
    add.w    #80*256,a1
    move.l    d3,(a1)        ; set bitplan 3 of 32 pixels
    add.w    #80*256,a1
    move.l    d4,(a1)        ; set bitplan 4 of 32 pixels
    add.w    #80*256,a1
    move.l    d5,(a1)        ; set bitplan 5 of 32 pixels
    sub.l      #5*80*256-4,a1

    subq.l     #1,a2
    tst.l       a2
    bne.b     affdv_do_32_pixels

    move.l    a3,a2        ; reset 32pixel counter
    add.l      a4,a1        ; put a1 on start of next screen line
    subq.l    #1,a5
    tst.l      a5
    bne.b    affdv_do_a_screen_line
These are just the obvious optimizations; several of them could be done automatically by an optimizing assembler. There are places where the sign extension into an address register could be used, provided the upper bit of the unsigned word is never set. 68060 performance would improve further by using long operations more and by better scheduling. Feel free to ask any questions about an optimization or to post any errors. Of course, a good algorithm and taking the slow chip mem into account matter more in this case. That is not my area of expertise, so I'll leave it to the assembler programmers who have it down to an art.
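For example (my own illustration of the sign-extension point, assuming long_y always fits in 15 bits):

Code:
; form used in the code above (relies on d0's upper word already being zero):
	move.w	long_y,d0
	move.l	d0,a5
; one-instruction alternative: movea.w sign-extends into the full address
; register, which equals zero-extension as long as long_y < 32768:
	movea.w	long_y,a5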

Last edited by matthey; 06 April 2011 at 14:54. Reason: fix
Old 06 April 2011, 10:34   #18
StingRay
Quote:
Originally Posted by matthey
subq.l #1,a6
bne.b affdv_do_4_pixels
Ahem.

Quote:
Originally Posted by matthey
subq.l #1,a2
bne.b affdv_do_32_pixels
Ahem.

Quote:
Originally Posted by matthey
subq.l #1,a5
bne.b affdv_do_a_screen_line
Ahem.

This code will not work; you might want to check your 680x0 manual.
Old 06 April 2011, 10:44   #19
Lord Riton
Quote:
Originally Posted by matthey
Code:
    subq.l    #1,a6
    bne.b    affdv_do_4_pixels
That kind of optimization will not work, because:

SUBQ - Subtract Quick (M68000 Family)

Operation: Destination - Immediate Data → Destination

Assembler Syntax: SUBQ #<data>,<ea>

Attributes: Size = (Byte, Word, Long)

Description: Subtracts the immediate data (1-8) from the destination operand. The size of the operation is specified as byte, word, or long. Only word and long operations can be used with address registers, and the condition codes are not affected. ...

Edit: lol, I swear StingRay's post was not there when I started writing this

Edit2: Anyway, I will change the code some more, because the wait states of the chip RAM writes are not "absorbed" at all (there is only a single "add.w" after each one). I will write out words instead of longwords; that doubles the number of writes, but their wait states should be much better "absorbed" by the following instructions, so it should end up faster.
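A rough sketch of that idea (my own illustration; the interleaved instructions are placeholders, and whether it actually wins depends on how much register-only work can be slotted in between):

Code:
	; instead of a single  move.l d0,(a1)  with nothing to hide its wait states:
	swap	d0
	move.w	d0,(a1)		; write the high word of the bitplane data
	lsl.l	#3,d6		; ...register-only conversion work in between...
	addx.l	d5,d5
	add.l	d6,d6
	swap	d0
	move.w	d0,2(a1)	; write the low word afterwards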

Last edited by Lord Riton; 06 April 2011 at 11:06.
Old 06 April 2011, 15:09   #20
matthey
@StingRay & Lord Riton
You're correct. For some reason I was thinking that arithmetic on an address register sets the CC and that only movea doesn't. Must have been because it was late. Motorola should have made address register operations set the CC like data register operations do. Anyway, I corrected the code above, using tst.l instead of cmp.l since testing an address register is allowed on the 68020+.
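For reference, the two working counter idioms side by side (register choice is illustrative only; the TST.L An form needs a 68020 or better):

Code:
; counter kept in an address register (the fix used above):
	subq.l	#1,a2		; subq to An does not set the condition codes...
	tst.l	a2		; ...so test explicitly (tst.l An is 68020+ only)
	bne.b	affdv_do_32_pixels

; counter kept in a data register instead (works on any 68000):
	subq.l	#1,d7
	bne.b	affdv_do_32_pixels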
 

