General question about efficiency and WORDs vs LONGs

Ernst Blofeld · 01 November 2020, 13:57

I've moved onto the next step of my little toy project and I'm now drawing polygons. I've rasterized them using Bresenham storing the edge points in an array and I'm drawing lines between matching points, currently using a LineTo function that calls my SetPixel function over and over.

I know that's bad, and I've started writing a DrawLineFaster function for the special case of horizontal lines. This will write either a WORD or a LONG at a time. My gut feeling is that LONGs will be quicker, same number of writes to memory but only half as many instruction fetches.

The tradeoff, as far as I can see, is just that the lookup tables I think I'll need for the bits at the start and end that don't fill a whole WORD / LONG will have to be twice as big for the LONG version.

Does this sound sensible, or does it sound like I've not understood something?

Edit: OCS, standard Amiga 500, deliberately writing in C only, deliberately not using the blitter.
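
On the table-size point: a WORD table needs 16 entries of 2 bytes (32 bytes) and a LONG table 32 entries of 4 bytes (128 bytes), so the LONG version is about four times the bytes rather than twice, though still small. As a rough sketch, the same end masks can also be computed with a shift instead of a table. The names left_mask and right_mask are hypothetical, and the code assumes the leftmost pixel is the most significant bit. On a plain 68000 a register shift takes longer the further it shifts, so a table lookup may well be the quicker option; only measuring would settle it.

```c
#include <assert.h>
#include <stdint.h>

/* Bits from pixel `bit` (0 = leftmost, the MSB) to the right edge of the long. */
static uint32_t left_mask(unsigned bit) {
    assert(bit < 32);
    return UINT32_C(0xffffffff) >> bit;
}

/* Bits from the left edge of the long up to and including pixel `bit`. */
static uint32_t right_mask(unsigned bit) {
    assert(bit < 32);
    return (uint32_t)~(UINT32_C(0x7fffffff) >> bit);
}
```

A span that starts and ends inside the same long would use left_mask(x0 & 31) & right_mask(x1 & 31).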

Samurai_Crow · 01 November 2020, 14:43

You haven't told us your system configuration. An AGA machine has 32-bit busses. OCS and ECS chipsets have 16-bit Chip RAM. While the blitter-based polygon filler is not as fast as a modern system, filling multiple polygons in one step can help it gain some time.

Ernst Blofeld · 01 November 2020, 14:48

Quote:

Originally Posted by Samurai_Crow

You haven't told us your system configuration. An AGA machine has 32-bit busses. OCS and ECS chipsets have 16-bit Chip RAM. While the blitter-based polygon filler is not as fast as a modern system, filling multiple polygons in one step can help it gain some time.

OCS, standard Amiga 500, deliberately writing in C only, deliberately not using the blitter.

Samurai_Crow · 01 November 2020, 14:55

The 32 bit write won't be any faster or slower than 16 bit writes because the Chip RAM bus accesses give the same number of 16 bit writes regardless of the size being written to them and will delay the second word just as long as if there was another instruction there.

Ernst Blofeld · 01 November 2020, 15:12

Quote:

Originally Posted by Samurai_Crow

The 32 bit write won't be any faster or slower than 16 bit writes because the Chip RAM bus accesses give the same number of 16 bit writes regardless of the size being written to them and will delay the second word just as long as if there was another instruction there.

Ok, thanks, I'll stick to words.

chb · 01 November 2020, 15:41

Quote:

Originally Posted by Ernst Blofeld

I know that's bad, and I've started writing a DrawLineFaster function for the special case of horizontal lines. This will write either a WORD or a LONG at a time. My gut feeling is that LONGs will be quicker, same number of writes to memory but only half as many instruction fetches.

Yes, correct IMHO.

Quote:

Originally Posted by Samurai_Crow

The 32 bit write won't be any faster or slower than 16 bit writes because the Chip RAM bus accesses give the same number of 16 bit writes regardless of the size being written to them and will delay the second word just as long as if there was another instruction there.

No, because the number of instructions fetched is lower, as EB wrote. To write out two words from a register (a typical solid polygon span filler), the following

Code:

move.l d0,(a0)+

needs only 75% of the memory accesses (1 fetch/2 data) compared to a

Code:

move.w d0,(a0)+
move.w d0,(a0)+

which is (2/2).
The fastest way would be using the movem instruction for long horizontal lines - then the instruction fetch overhead word vs. long also vanishes. But I do not know how you can trick the compiler into using movem.
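
Since the project is C only, here is a minimal sketch of how the two variants might look at source level. The function names fill_words and fill_longs are hypothetical, and whether a particular compiler turns each store into a single move.w or move.l d0,(a0)+ is an assumption that would need checking in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `count` 16-bit words: one store (and one loop pass) per word. */
static void fill_words(uint16_t *dst, size_t count, uint16_t pattern) {
    assert(dst != NULL);
    while (count--)
        *dst++ = pattern;
}

/* Store `count` 32-bit longs holding the same 16-bit pattern twice:
   half as many stores and loop passes for the same number of bytes. */
static void fill_longs(uint32_t *dst, size_t count, uint16_t pattern) {
    assert(dst != NULL);
    uint32_t both = ((uint32_t)pattern << 16) | pattern;
    while (count--)
        *dst++ = both;
}
```

Both functions leave the same bytes in memory for the same area; the long version simply gets there with half as many store instructions.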

Samurai_Crow · 01 November 2020, 15:59

Quote:

Originally Posted by chb

Yes, correct IMHO.

No, because the number of instructions fetched is lower, as EB wrote. To write out two words from a register (a typical solid polygon span filler), the following

Code:

move.l d0,(a0)+

needs only 75% of the memory accesses (1 fetch/2 data) compared to a

Code:

move.w d0,(a0)+
move.w d0,(a0)+

which is (2/2).
The fastest way would be using the movem instruction for long horizontal lines - then the instruction fetch overhead word vs. long also vanishes. But I do not know how you can trick the compiler into using movem.

I guess it depends if you use chip RAM to hold your code in.

Ernst Blofeld · 01 November 2020, 16:02

Quote:

Originally Posted by chb

Yes, correct IMHO.

No, because the number of instructions fetched is lower, as EB wrote. To write out two words from a register (a typical solid polygon span filler), the following

Code:

move.l d0,(a0)+

needs only 75% of the memory accesses (1 fetch/2 data) compared to a

Code:

move.w d0,(a0)+
move.w d0,(a0)+

which is (2/2).
The fastest way would be using the movem instruction for long horizontal lines - then the instruction fetch overhead word vs. long also vanishes. But I do not know how you can trick the compiler into using movem.

There will also be the loop control instructions, which will be executed twice as often for words vs longs, and the special case of the start and end being the same address won't happen as often. I know I can unroll the loop into a switch statement with fall throughs to remove the loop control instructions, but I don't know how significant the special case will be.

roondar · 01 November 2020, 17:50

As far as I've been able to tell, using longwords is faster than using words for 32 bit writes, even on the A500. It doesn't appear to matter if the code is in chip memory or fast memory, though if DMA is really busy and code is in fast memory the speed difference does diminish.

There's two reasons for longwords being faster. The first is, as both you and chb pointed out, instruction fetching taking time. The second is that fast memory does not actually make the 68000 run faster, it merely doesn't slow it down in case of heavy DMA use (such as more than 4 bitplanes on the screen). What helps to understand why longwords are still faster in that case is that any slowdown due to chip memory being busy will affect both word and longword writes equally, but the longword writes require fewer instructions fetched so still end up faster.

For reference, here's the cycle count + memory access count

Code:

68000 move instruction cycle use / memory accesses
cycles instruction               memory accesses
  8    move.w d0,(a0)+            2 (1r/1w)
 12    move.l d0,(a0)+            3 (1r/2w)
 64    movem.w d0-d7/a1-a6,(a0)  16 (2r/14w)
120    movem.l d0-d7/a1-a6,(a0)  30 (2r/28w)

Note that the movem.l is still slightly faster than the movem.w for the same 
amount of memory, though it's not by much.

Now, it is certainly possible that there are scenarios in which longword reads/writes to memory are less efficient than word ones, but I've not found them myself to date.
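
To make the comparison concrete, the table above can be turned into cycles for a fixed amount of memory. This is only a sketch: the struct and function names are hypothetical, the byte counts are derived from the register lists (14 registers, so 28 bytes for movem.w and 56 for movem.l), and the figures ignore loop overhead and any DMA contention.

```c
#include <assert.h>

/* One row of the table above: cycles per instruction and bytes it stores. */
struct store_cost {
    const char *instruction;
    unsigned cycles;
    unsigned bytes;
};

static const struct store_cost costs[] = {
    { "move.w d0,(a0)+",            8,  2 },
    { "move.l d0,(a0)+",           12,  4 },
    { "movem.w d0-d7/a1-a6,(a0)",  64, 28 },
    { "movem.l d0-d7/a1-a6,(a0)", 120, 56 },
};

/* Cycles to store `total_bytes` using only instruction `idx`;
   `total_bytes` must be a whole multiple of that instruction's size. */
static unsigned cycles_for(unsigned idx, unsigned total_bytes) {
    assert(idx < sizeof costs / sizeof costs[0]);
    assert(total_bytes % costs[idx].bytes == 0);
    return (total_bytes / costs[idx].bytes) * costs[idx].cycles;
}
```

For 56 bytes this works out to 224 cycles with move.w, 168 with move.l, 128 with two movem.w and 120 with one movem.l, which matches the ordering described above.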

Ernst Blofeld · 01 November 2020, 18:08

Thanks everyone, I've now written it and it seems to work. Not surprisingly it is at least 10 x the speed of the version that set each pixel individually.

Code:

static const ULONG leftEnd [] = {
    0xffffffff, 0x7fffffff, 0x3fffffff, 0x1fffffff, 0x0fffffff, 0x07ffffff, 0x03ffffff, 0x01ffffff,
    0x00ffffff, 0x007fffff, 0x003fffff, 0x001fffff, 0x000fffff, 0x0007ffff, 0x0003ffff, 0x0001ffff,
    0x0000ffff, 0x00007fff, 0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff,
    0x000000ff, 0x0000007f, 0x0000003f, 0x0000001f, 0x0000000f, 0x00000007, 0x00000003, 0x00000001
};

static const ULONG rightEnd [] = {
    0x80000000, 0xc0000000, 0xe0000000, 0xf0000000, 0xf8000000, 0xfc000000, 0xfe000000, 0xff000000,
    0xff800000, 0xffc00000, 0xffe00000, 0xfff00000, 0xfff80000, 0xfffc0000, 0xfffe0000, 0xffff0000,
    0xffff8000, 0xffffc000, 0xffffe000, 0xfffff000, 0xfffff800, 0xfffffc00, 0xfffffe00, 0xffffff00,
    0xffffff80, 0xffffffc0, 0xffffffe0, 0xfffffff0, 0xfffffff8, 0xfffffffc, 0xfffffffe, 0xffffffff
};

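/* Fill the span x0..x1 (inclusive, x0 <= x1) on row y, one bitplane at a time. */
static void DrawHorizontalLine(const UWORD y, const UWORD x0, const UWORD x1) {
	/* Index of the first and last LONG touched within the row. */
	UWORD start = x0 >> 5;
	UWORD end = x1 >> 5;

	/* Partial masks for the first and last LONG. */
	ULONG left = leftEnd[x0 & 0x001f];
	ULONG right = rightEnd[x1 & 0x001f];

	ULONG * p = ((ULONG *) currentBuffer) + y * ROW_SIZE_IN_LONGS + start;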

	if (start == end) {
		ULONG m = left & right;

		for (UWORD i = 1; i < DISPLAY_NUM_COLOURS; i += i, p += DISPLAY_WIDTH_IN_LONGS) {
			if (pen.colour & i)
				*p |= m;
			else
				*p &= ~m;
		}
	} else {
		for (UWORD i = 1; i < DISPLAY_NUM_COLOURS; i += i, p += DISPLAY_WIDTH_IN_LONGS) {
			ULONG * q = p;

			if (pen.colour & i) {
				*q++ |= left;
				/* Unrolled fill of the whole LONGs between the two ends; falls through on purpose. */
				switch (end - start) {
					case 11: *q++ = 0xffffffff;
					case 10: *q++ = 0xffffffff;
					case  9: *q++ = 0xffffffff;
					case  8: *q++ = 0xffffffff;
					case  7: *q++ = 0xffffffff;
					case  6: *q++ = 0xffffffff;
					case  5: *q++ = 0xffffffff;
					case  4: *q++ = 0xffffffff;
					case  3: *q++ = 0xffffffff;
					case  2: *q++ = 0xffffffff;
				}
				*q |= right;
			} else {
				*q++ &= ~left;
				switch (end - start) {
					case 11: *q++ = 0x00000000;
					case 10: *q++ = 0x00000000;
					case  9: *q++ = 0x00000000;
					case  8: *q++ = 0x00000000;
					case  7: *q++ = 0x00000000;
					case  6: *q++ = 0x00000000;
					case  5: *q++ = 0x00000000;
					case  4: *q++ = 0x00000000;
					case  3: *q++ = 0x00000000;
					case  2: *q++ = 0x00000000;
				}
				*q &= ~right;
			}
		}
	}
}
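
One thing to note about the function above: the switch covers at most 10 whole longs between the two ends, so 12 longs in total, which is enough for a 320-pixel-wide display (10 longs) but not for a wider one. A loop-based variant for a single bitplane might look like the sketch below. The names fill_span and pixel_is_set are hypothetical, it only sets bits (no clearing, no colour handling), and it gives up the unrolling, so it is likely slower on a 68000 than the switch version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set pixels x0..x1 (inclusive) in one row of a single bitplane.
   `row` points at the first long of the row; bit 31 of each long
   is the leftmost pixel. */
static void fill_span(uint32_t *row, unsigned x0, unsigned x1) {
    assert(row != NULL && x0 <= x1);
    uint32_t left  = UINT32_C(0xffffffff) >> (x0 & 31);
    uint32_t right = (uint32_t)~(UINT32_C(0x7fffffff) >> (x1 & 31));
    uint32_t *p    = row + (x0 >> 5);
    uint32_t *end  = row + (x1 >> 5);

    if (p == end) {              /* span starts and ends in the same long */
        *p |= left & right;
        return;
    }
    *p++ |= left;                /* partial long at the start */
    while (p < end)              /* whole longs in the middle */
        *p++ = UINT32_C(0xffffffff);
    *p |= right;                 /* partial long at the end */
}

/* Reads back one pixel, for checking the result pixel by pixel. */
static int pixel_is_set(const uint32_t *row, unsigned x) {
    return (int)((row[x >> 5] >> (31 - (x & 31))) & 1u);
}
```

Checking every pixel of a row with pixel_is_set after a call to fill_span is a cheap way to confirm the end masks are right before moving on to the faster unrolled version.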

alkis · 01 November 2020, 20:43

I don't know if you are already aware, but the source code for Deluxe Paint I has been released.
Maybe you'd pick up an idea or two from there.
https://computerhistory.org/blog/ele...ly-source-code
