Optimizing polygonfill bitcopy - Page 2

TCH · 25 November 2019, 18:15

Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this.

I've tried your new code. The speed once again has been significantly boosted, but yet again it is still slower than the C code. The difference is insignificant (~0.1-0.2%), but stable. It is very weird.

Don_Adan · 25 November 2019, 18:29

Quote:

Originally Posted by TCH

Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this.

I've tried your new code. The speed once again has been significantly boosted, but yet again it is still slower than the C code. The difference is insignificant (~0.1-0.2%), but stable. It is very weird.

For which CPU? Reads or writes from chip memory?

TCH · 25 November 2019, 18:41

68000. Both. (Stock 1M A500.)

a/b · 25 November 2019, 19:07

Interesting...
Is the rest of the code the same? For example, you have both functions (c and asm) in the executable, and then say call the c one 100 times and then the asm one 100 times again. Or have you also replaced some other parts of the code as well?
In any case, maybe if you look at the compiled code and see what's different.. Or just post the executable here.
The asm version should be pretty decent. I don't like the innermost loop though, because I always keep thinking there might be some eor trick to speed it up that the compiler can see and I can't (and in general, those kind of optimizations are on my weaker side, I just can't spot them right away).

Don_Adan · 25 November 2019, 19:31

Quote:

Originally Posted by TCH

68000. Both. (Stock 1M A500.)

Perhaps some code can be changed from longword to word, f.e asl.l to asl.w, but i dont know maximum values for inputs.

TCH · 25 November 2019, 21:29

@a/b:
Not entirely. I am running the call of the "parent" function (DrawPolygon2DCPU) 64 times and measure the time with my AMTime tool (GNU 'time' replacement). And i either call your ASM code and put my C code in a comment block, or i do this the call to your ASM routine and let the C code run.

Code:

	DestArea = (rp->BitMap->Planes[0] + RectOffset);
	PlaneSize >>= 2;
	RowSize <<= 2;
	Modulo = PlaneSize - w;

/* Either i comment this... */
	PolygonBitmapToPlanes32((void *)DestArea, (void *)TempArea, h, w, Modulo, RowSize, D, (void *)cpattern32);

/* ...or this. */
	th = h;
	while (--th >= 0)
	{
		dstptr32 = (ULONG *)DestArea;
		i = D;
		while (--i >= 0)
		{
			srcptr32 = (ULONG *)TempArea;
			pattern32 = cpattern32[th & 1][i];
			t = w;
			while (--t >= 0)
			{
				read32 = *srcptr32++;
				*dstptr32 &= ~read32;
				*dstptr32++ |= (read32 & pattern32);
			}
			dstptr32 += Modulo;
		}
		TempArea += RowSize;
		DestArea += RowSize;
	}

Here are the asm results after compiling:
http://oscomp.hu/depot/polygon_ab_asm.s
http://oscomp.hu/depot/polygon_tch_c.s

@Don_Adan:
Perhaps, but with that we will only gain a few cycles.

ross · 25 November 2019, 22:52

Quote:

Originally Posted by a/b

.. because I always keep thinking there might be some eor trick to speed it up that the compiler can see ...

Code:

.L101:
	move.l (a0),d2
	move.l d3,d0
	eor.l d2,d0
	and.l (a1)+,d0
	eor.l d2,d0
	move.l d0,(a0)+
	dbra d1,.L101

Great compiler work here

a/b · 25 November 2019, 23:37

Yup, called it

.

So now you can change the inner loop to:

Code:

c_w:	move.l	(a0),d7
	move.l	d6,d4
	eor.l	d7,d4
	and.l	(a3)+,d4
	eor.l	d7,d4
	move.l	d4,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);

The rest is not as good. It's interesting how it managed to optimize some of the code, but then on the other hand there are also some stinkers in there as well, like tst/bcc/move just before the inner loop.

Don_Adan · 26 November 2019, 02:20

You can check this version, can be a few cycles fastest, if works. I optimised a few a/b version.

Code:

_PolygonBitmapToPlanes32: 
	movem.l	d2-d7/a2-a6,-(a7) 
	
        move.l	d2,a6 
	Add.l   a6,a6
	Add.l	a6,a6		; a6 = Modulo<<2 = BitplaneSize-Width<<2 
	add.w	d1,d2 
	mulu.w	d4,d2 
	Lsl.l	#2,d2		; longwords to bytes 
	sub.l	d3,d2 
	move.l	d2,-(a7)	; (a7) = Depth*BitplaneSize-RowSize 
	subq.w	#1,d0		; Height--;
	subq.w	#1,d1		; Width--;
	subq.w	#1,d4		; Depth--; 
        Swap    D4
        Move.w  d1,d4

	moveq	#0,d2 
c_h:
        Move.l a2,a4
        Add.w d2,a4
	eor.w	#8<<2,d2	; alternate between 0 and 8<<2 
	move.l	d4,d1
        Swap d1		    ; PlaneCounter = Depth; 
c_p:
	movea.l	a1,a3		; SrcPtr = TempArea; 
	move.l	(a4)+,A5	; CurrentPattern
	move.w	D4,d5		; WidthCounter = Width-1;
c_w:
	move.l	(a0),d7 
	move.l	A5,d6 
	eor.l	d7,d6 
	and.l	(a3)+,d6 
	eor.l	d7,d6 
	move.l	d6,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
	dbf	d5,c_w		; if (--WidthCounter >= 0) goto c_w;

	adda.l	a6,a0		; DestArea += BitplaneSize-Width<<2;
	dbf	d1,c_p		; if (--PlaneCounter >= 0) goto c_p;

	suba.l	(a7),a0		; DestArea += RowSize-Depth*BitplaneSize;
	adda.l	d3,a1		; TempArea += RowSize; 
	dbf	d0,c_h		; if (--Height >= 0) goto c_h;

; 	addq.l	#4,a7 
	movem.l	(a7)+,d1-d7/a2-a6
        rts

a/b · 26 November 2019, 04:01

Nice catch, it's a 4 cycles gain in the outer loop.

About lea vs. move/adda (after c_h label):
- 000/010: it's the same, 12 vs. 4+8 cycles
- 020/030: guessing it's the same
- 040: lea is faster
- 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2
All in all, I'd use lea.

And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c).

ross · 26 November 2019, 11:46

But I would like to understand what are the steps to go from:

y=(a&c)|(b&~a);

to:

y=b^(a&(b^c));

Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner?
(minimize equations using ^ is hard, even more with sequential partials..)

Don_Adan · 26 November 2019, 14:24

Quote:

Originally Posted by a/b

Nice catch, it's a 4 cycles gain in the outer loop.

About lea vs. move/adda (after c_h label):
- 000/010: it's the same, 12 vs. 4+8 cycles
- 020/030: guessing it's the same
- 040: lea is faster
- 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2
All in all, I'd use lea.

And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c).

I never tested lea vs move/adda, my 68k optimisation mentor told me that in real move/adda version is fastest. Anyway I checked my assembler book and also
Move.w d2,A4
add.l a2,a4
version can be used. Perhaps 2 cycles fastest, but again not tested by me.
(SP) access can be removed too, but it needs 2 extra swap commands, perhaps speed for 68000 will be same.

a/b · 26 November 2019, 14:48

Quote:

Originally Posted by ross

But I would like to understand what are the steps to go from:

y=(a&c)|(b&~a);

to:

y=b^(a&(b^c));

Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner?
(minimize equations using ^ is hard, even more with sequential partials..)

I just memorized: (a&mask)|(b&!mask) = ((a^b)&mask)^b
So you eor a with b before and after masking it.

It reminded me of c2p code, but that was after I posted the code and went to sleep and kept thinking about it, and figured I better check some c2p docs tomorrow to try finding any similarities, but when I woke up it was gone.
I'd guess anyone who spent a bunch of time working on c2p converters would spot it from a mil.... kilometer (die imperial scum ;p, and rip SW btw) away.

ross · 26 November 2019, 14:58

Quote:

Originally Posted by Don_Adan

Anyway I checked my assembler book and also
Move.w d2,A4
add.l a2,a4
version can be used.

Same (12) cycles on bare 68k as

lea (a2,d2.x),a4

I also always prefer

lea

in similar cases (you also have a free 8bits offset

).

Quote:

Originally Posted by a/b

It reminded me of c2p code, ..

Well, waiting some c2p expert because this equivalence was not simple for me

a/b · 26 November 2019, 14:59

adda.x reg,areg is 8 cycles, even addq.x #y,areg is 8 on 68000. It's just slower because it's always 32-bit. You'd have to make a destination true 16-bit (e.g. add.w reg,dx) to make it work in 4 cycles.
All move(a).x reg,reg are 4, though.

Don_Adan · 26 November 2019, 16:53

Quote:

Originally Posted by ross

Same (12) cycles on bare 68k as

lea (a2,d2.x),a4

I also always prefer

lea

in similar cases (you also have a free 8bits offset

).

Well, waiting some c2p expert because this equivalence was not simple for me

In my assembler book
Add.w Dx,Ax is 8 cycles
But
Add.l Ax,Ax is 6 cycles only.
I never tested this in real, then im not sure.

Antiriad_UK · 26 November 2019, 17:04

From the page I use below. I read it as 8 (well I have been!

)
http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML

op<ea>,An

ADD byte,word 8(1/0) +
long 6(1/0) +**

+ Add effective address calculation time
** The base time of six clock periods is increased to eight
if the effective address mode is register direct or
immediate (effective address time should also be added)

ross · 26 November 2019, 17:10

Yes, I also saw these 6 cycles in some documents.
They simply misinterpret <ea>,An not adding 2 cycles for register direct (to base time of six clock cycles).

ross · 26 November 2019, 17:37

Found this paper: "Rule-Based Optimization of AND-XOR Expressions".
Abstract: The problem of finding a minimum AND-XOR expression for a given boolean function is known to be very hard. In this paper we investigate whether a rule-based approach can help minimizing AND-XOR expressions for functions which are too large to be handled by algorithmic-based approaches.

I just need to find a basic version of this algorithm/rule to see if I can apply it by hand without having to go through a compiler

Don_Adan · 26 November 2019, 17:56

Quote:

Originally Posted by ross

Yes, I also saw these 6 cycles in some documents.
They simply misinterpret <ea>,An not adding 2 cycles for register direct (to base time of six clock cycles).

No, my book is different. 0 cycles for registers using. When/if i will back to life, i will check this

25 November 2019, 21:29	#26
TCH Newbie Amiga programmer Join Date: Jun 2012 Location: Front of my A500+ Age: 38 Posts: 372	@a/b: Not entirely. I am running the call of the "parent" function (DrawPolygon2DCPU) 64 times and measure the time with my AMTime tool (GNU 'time' replacement). And i either call your ASM code and put my C code in a comment block, or i do this the call to your ASM routine and let the C code run. Code: DestArea = (rp->BitMap->Planes[0] + RectOffset); PlaneSize >>= 2; RowSize <<= 2; Modulo = PlaneSize - w; /* Either i comment this... / PolygonBitmapToPlanes32((void )DestArea, (void )TempArea, h, w, Modulo, RowSize, D, (void )cpattern32); /* ...or this. / th = h; while (--th >= 0) { dstptr32 = (ULONG )DestArea; i = D; while (--i >= 0) { srcptr32 = (ULONG )TempArea; pattern32 = cpattern32[th & 1][i]; t = w; while (--t >= 0) { read32 = srcptr32++; dstptr32 &= ~read32; dstptr32++ \|= (read32 & pattern32); } dstptr32 += Modulo; } TempArea += RowSize; DestArea += RowSize; } Here are the asm results after compiling: http://oscomp.hu/depot/polygon_ab_asm.s http://oscomp.hu/depot/polygon_tch_c.s @Don_Adan: Perhaps, but with that we will only gain a few cycles.

25 November 2019, 23:37	#28
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Yup, called it . So now you can change the inner loop to: Code: c_w: move.l (a0),d7 move.l d6,d4 eor.l d7,d4 and.l (a3)+,d4 eor.l d7,d4 move.l d4,(a0)+ ; DestPtr++ = (DestPtr&~Temp)\|(CurrentPattern&Temp); The rest is not as good. It's interesting how it managed to optimize some of the code, but then on the other hand there are also some stinkers in there as well, like tst/bcc/move just before the inner loop.

26 November 2019, 11:46	#31
ross Defendit numerus Join Date: Mar 2017 Location: Crossing the Rubicon Age: 53 Posts: 4,468	But I would like to understand what are the steps to go from: y=(a&c)\|(b&~a); to: y=b^(a&(b^c)); Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner? (minimize equations using ^ is hard, even more with sequential partials..)

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Optimizing HAM8 renderer.	Thorham	Coders. Asm / Hardware	5	22 June 2017 18:29
NetSurf AGA optimizing	arti	Coders. Asm / Hardware	199	10 November 2013 14:36
Layered tile engine optimizing.	Thorham	Coders. General	0	30 September 2011 20:43
Benching and optimizing CF-IDE speed	Photon	support.Hardware	12	15 July 2009 01:48
For people who like optimizing 680x0 code.	Thorham	Coders. General	5	28 May 2008 11:48

25 November 2019, 18:15	#21
TCH Newbie Amiga programmer Join Date: Jun 2012 Location: Front of my A500+ Age: 38 Posts: 372	Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this. I've tried your new code. The speed once again has been significantly boosted, but yet again it is still slower than the C code. The difference is insignificant (~0.1-0.2%), but stable. It is very weird.

25 November 2019, 18:41	#23
TCH Newbie Amiga programmer Join Date: Jun 2012 Location: Front of my A500+ Age: 38 Posts: 372	68000. Both. (Stock 1M A500.)

25 November 2019, 19:07	#24
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Interesting... Is the rest of the code the same? For example, you have both functions (c and asm) in the executable, and then say call the c one 100 times and then the asm one 100 times again. Or have you also replaced some other parts of the code as well? In any case, maybe if you look at the compiled code and see what's different.. Or just post the executable here. The asm version should be pretty decent. I don't like the innermost loop though, because I always keep thinking there might be some eor trick to speed it up that the compiler can see and I can't (and in general, those kind of optimizations are on my weaker side, I just can't spot them right away).

26 November 2019, 04:01	#30
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Nice catch, it's a 4 cycles gain in the outer loop. About lea vs. move/adda (after c_h label): - 000/010: it's the same, 12 vs. 4+8 cycles - 020/030: guessing it's the same - 040: lea is faster - 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2 All in all, I'd use lea. And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c).

26 November 2019, 14:59	#35
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	adda.x reg,areg is 8 cycles, even addq.x #y,areg is 8 on 68000. It's just slower because it's always 32-bit. You'd have to make a destination true 16-bit (e.g. add.w reg,dx) to make it work in 4 cycles. All move(a).x reg,reg are 4, though.

26 November 2019, 17:04	#37
Antiriad_UK OCS forever! Join Date: Mar 2019 Location: Birmingham, UK Posts: 418	From the page I use below. I read it as 8 (well I have been! ) http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML op<ea>,An ADD byte,word 8(1/0) + long 6(1/0) + + Add effective address calculation time The base time of six clock periods is increased to eight if the effective address mode is register direct or immediate (effective address time should also be added)

26 November 2019, 17:10	#38
ross Defendit numerus Join Date: Mar 2017 Location: Crossing the Rubicon Age: 53 Posts: 4,468	Yes, I also saw these 6 cycles in some documents. They simply misinterpret <ea>,An not adding 2 cycles for register direct (to base time of six clock cycles).

26 November 2019, 17:37	#39
ross Defendit numerus Join Date: Mar 2017 Location: Crossing the Rubicon Age: 53 Posts: 4,468	Found this paper: "Rule-Based Optimization of AND-XOR Expressions". Abstract: The problem of finding a minimum AND-XOR expression for a given boolean function is known to be very hard. In this paper we investigate whether a rule-based approach can help minimizing AND-XOR expressions for functions which are too large to be handled by algorithmic-based approaches. I just need to find a basic version of this algorithm/rule to see if I can apply it by hand without having to go through a compiler

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)