English Amiga Board


Go Back   English Amiga Board > Coders > Coders. General

 
 
Thread Tools
Old 25 November 2019, 19:15   #21
TCH
Newbie Amiga programmer

 
Join Date: Jun 2012
Location: Front of my A500+
Age: 33
Posts: 165
Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this.

I've tried your new code. The speed once again has been significantly boosted, but yet again it is still slower than the C code. The difference is insignificant (~0.1-0.2%), but stable. It is very weird.
TCH is offline  
Old 25 November 2019, 19:29   #22
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
Quote:
Originally Posted by TCH View Post
Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this.

I've tried your new code. The speed once again has been significantly boosted, but yet again it is still slower than the C code. The difference is insignificant (~0.1-0.2%), but stable. It is very weird.
For which CPU? Reads or writes from chip memory?
Don_Adan is offline  
Old 25 November 2019, 19:41   #23
TCH
Newbie Amiga programmer

 
Join Date: Jun 2012
Location: Front of my A500+
Age: 33
Posts: 165
68000. Both. (Stock 1M A500.)
TCH is offline  
Old 25 November 2019, 20:07   #24
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 174
Interesting...
Is the rest of the code the same? For example, you have both functions (c and asm) in the executable, and then say call the c one 100 times and then the asm one 100 times again. Or have you also replaced some other parts of the code as well?
In any case, maybe if you look at the compiled code and see what's different.. Or just post the executable here.
The asm version should be pretty decent. I don't like the innermost loop though, because I always keep thinking there might be some eor trick to speed it up that the compiler can see and I can't (and in general, those kind of optimizations are on my weaker side, I just can't spot them right away).
a/b is offline  
Old 25 November 2019, 20:31   #25
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
Quote:
Originally Posted by TCH View Post
68000. Both. (Stock 1M A500.)
Perhaps some code can be changed from longword to word, f.e asl.l to asl.w, but i dont know maximum values for inputs.
Don_Adan is offline  
Old 25 November 2019, 22:29   #26
TCH
Newbie Amiga programmer

 
Join Date: Jun 2012
Location: Front of my A500+
Age: 33
Posts: 165
@a/b:
Not entirely. I am running the call of the "parent" function (DrawPolygon2DCPU) 64 times and measure the time with my AMTime tool (GNU 'time' replacement). And i either call your ASM code and put my C code in a comment block, or i do this the call to your ASM routine and let the C code run.
Code:
	DestArea = (rp->BitMap->Planes[0] + RectOffset);
	PlaneSize >>= 2;
	RowSize <<= 2;
	Modulo = PlaneSize - w;

/* Either i comment this... */
	PolygonBitmapToPlanes32((void *)DestArea, (void *)TempArea, h, w, Modulo, RowSize, D, (void *)cpattern32);

/* ...or this. */
	th = h;
	while (--th >= 0)
	{
		dstptr32 = (ULONG *)DestArea;
		i = D;
		while (--i >= 0)
		{
			srcptr32 = (ULONG *)TempArea;
			pattern32 = cpattern32[th & 1][i];
			t = w;
			while (--t >= 0)
			{
				read32 = *srcptr32++;
				*dstptr32 &= ~read32;
				*dstptr32++ |= (read32 & pattern32);
			}
			dstptr32 += Modulo;
		}
		TempArea += RowSize;
		DestArea += RowSize;
	}
Here are the asm results after compiling:
http://oscomp.hu/depot/polygon_ab_asm.s
http://oscomp.hu/depot/polygon_tch_c.s

@Don_Adan:
Perhaps, but with that we will only gain a few cycles.
TCH is offline  
Old 25 November 2019, 23:52   #27
ross
Per aspera ad astra

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 2,422
Quote:
Originally Posted by a/b View Post
.. because I always keep thinking there might be some eor trick to speed it up that the compiler can see ...


Code:
.L101:
	move.l (a0),d2
	move.l d3,d0
	eor.l d2,d0
	and.l (a1)+,d0
	eor.l d2,d0
	move.l d0,(a0)+
	dbra d1,.L101
Great compiler work here
ross is offline  
Old 26 November 2019, 00:37   #28
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 174
Yup, called it .

So now you can change the inner loop to:
Code:
c_w:	move.l	(a0),d7
	move.l	d6,d4
	eor.l	d7,d4
	and.l	(a3)+,d4
	eor.l	d7,d4
	move.l	d4,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
The rest is not as good. It's interesting how it managed to optimize some of the code, but then on the other hand there are also some stinkers in there as well, like tst/bcc/move just before the inner loop.
a/b is offline  
Old 26 November 2019, 03:20   #29
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
You can check this version, can be a few cycles fastest, if works. I optimised a few a/b version.
Code:
_PolygonBitmapToPlanes32: 
	movem.l	d2-d7/a2-a6,-(a7) 
	
        move.l	d2,a6 
	Add.l   a6,a6
	Add.l	a6,a6		; a6 = Modulo<<2 = BitplaneSize-Width<<2 
	add.w	d1,d2 
	mulu.w	d4,d2 
	Lsl.l	#2,d2		; longwords to bytes 
	sub.l	d3,d2 
	move.l	d2,-(a7)	; (a7) = Depth*BitplaneSize-RowSize 
	subq.w	#1,d0		; Height--;
	subq.w	#1,d1		; Width--;
	subq.w	#1,d4		; Depth--; 
        Swap    D4
        Move.w  d1,d4

	moveq	#0,d2 
c_h:
        Move.l a2,a4
        Add.w d2,a4
	eor.w	#8<<2,d2	; alternate between 0 and 8<<2 
	move.l	d4,d1
        Swap d1		    ; PlaneCounter = Depth; 
c_p:
	movea.l	a1,a3		; SrcPtr = TempArea; 
	move.l	(a4)+,A5	; CurrentPattern
	move.w	D4,d5		; WidthCounter = Width-1;
c_w:
	move.l	(a0),d7 
	move.l	A5,d6 
	eor.l	d7,d6 
	and.l	(a3)+,d6 
	eor.l	d7,d6 
	move.l	d6,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
	dbf	d5,c_w		; if (--WidthCounter >= 0) goto c_w;

	adda.l	a6,a0		; DestArea += BitplaneSize-Width<<2;
	dbf	d1,c_p		; if (--PlaneCounter >= 0) goto c_p;

	suba.l	(a7),a0		; DestArea += RowSize-Depth*BitplaneSize;
	adda.l	d3,a1		; TempArea += RowSize; 
	dbf	d0,c_h		; if (--Height >= 0) goto c_h;

; 	addq.l	#4,a7 
	movem.l	(a7)+,d1-d7/a2-a6
        rts
Don_Adan is offline  
Old 26 November 2019, 05:01   #30
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 174
Nice catch, it's a 4 cycles gain in the outer loop.

About lea vs. move/adda (after c_h label):
- 000/010: it's the same, 12 vs. 4+8 cycles
- 020/030: guessing it's the same
- 040: lea is faster
- 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2
All in all, I'd use lea.

And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c).
a/b is offline  
Old 26 November 2019, 12:46   #31
ross
Per aspera ad astra

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 2,422
But I would like to understand what are the steps to go from:
y=(a&c)|(b&~a);

to:
y=b^(a&(b^c));


Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner?
(minimize equations using ^ is hard, even more with sequential partials..)
ross is offline  
Old 26 November 2019, 15:24   #32
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
Quote:
Originally Posted by a/b View Post
Nice catch, it's a 4 cycles gain in the outer loop.

About lea vs. move/adda (after c_h label):
- 000/010: it's the same, 12 vs. 4+8 cycles
- 020/030: guessing it's the same
- 040: lea is faster
- 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2
All in all, I'd use lea.

And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c).
I never tested lea vs move/adda, my 68k optimisation mentor told me that in real move/adda version is fastest. Anyway I checked my assembler book and also
Move.w d2,A4
add.l a2,a4
version can be used. Perhaps 2 cycles fastest, but again not tested by me.
(SP) access can be removed too, but it needs 2 extra swap commands, perhaps speed for 68000 will be same.

Last edited by Don_Adan; 26 November 2019 at 15:41.
Don_Adan is offline  
Old 26 November 2019, 15:48   #33
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 174
Quote:
Originally Posted by ross View Post
But I would like to understand what are the steps to go from:
y=(a&c)|(b&~a);

to:
y=b^(a&(b^c));


Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner?
(minimize equations using ^ is hard, even more with sequential partials..)
I just memorized: (a&mask)|(b&!mask) = ((a^b)&mask)^b
So you eor a with b before and after masking it.

It reminded me of c2p code, but that was after I posted the code and went to sleep and kept thinking about it, and figured I better check some c2p docs tomorrow to try finding any similarities, but when I woke up it was gone.
I'd guess anyone who spent a bunch of time working on c2p converters would spot it from a mil.... kilometer (die imperial scum ;p, and rip SW btw) away.
a/b is offline  
Old 26 November 2019, 15:58   #34
ross
Per aspera ad astra

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 2,422
Quote:
Originally Posted by Don_Adan View Post
Anyway I checked my assembler book and also
Move.w d2,A4
add.l a2,a4
version can be used.
Same (12) cycles on bare 68k as
lea (a2,d2.x),a4

I also always prefer
lea
in similar cases (you also have a free 8bits offset ).


Quote:
Originally Posted by a/b View Post
It reminded me of c2p code, ..
Well, waiting some c2p expert because this equivalence was not simple for me
ross is offline  
Old 26 November 2019, 15:59   #35
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 174
adda.x reg,areg is 8 cycles, even addq.x #y,areg is 8 on 68000. It's just slower because it's always 32-bit. You'd have to make a destination true 16-bit (e.g. add.w reg,dx) to make it work in 4 cycles.
All move(a).x reg,reg are 4, though.
a/b is offline  
Old 26 November 2019, 17:53   #36
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
Quote:
Originally Posted by ross View Post
Same (12) cycles on bare 68k as
lea (a2,d2.x),a4

I also always prefer
lea
in similar cases (you also have a free 8bits offset ).



Well, waiting some c2p expert because this equivalence was not simple for me
In my assembler book
Add.w Dx,Ax is 8 cycles
But
Add.l Ax,Ax is 6 cycles only.
I never tested this in real, then im not sure.
Don_Adan is offline  
Old 26 November 2019, 18:04   #37
Antiriad_UK
OCS forever!

Antiriad_UK's Avatar
 
Join Date: Mar 2019
Location: Birmingham, UK
Posts: 170
From the page I use below. I read it as 8 (well I have been! )
http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML

op<ea>,An

ADD byte,word 8(1/0) +
long 6(1/0) +**

+ Add effective address calculation time
** The base time of six clock periods is increased to eight
if the effective address mode is register direct or
immediate (effective address time should also be added)
Antiriad_UK is offline  
Old 26 November 2019, 18:10   #38
ross
Per aspera ad astra

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 2,422
Yes, I also saw these 6 cycles in some documents.
They simply misinterpret <ea>,An not adding 2 cycles for register direct (to base time of six clock cycles).
ross is offline  
Old 26 November 2019, 18:37   #39
ross
Per aspera ad astra

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 49
Posts: 2,422
Found this paper: "Rule-Based Optimization of AND-XOR Expressions".
Abstract: The problem of finding a minimum AND-XOR expression for a given boolean function is known to be very hard. In this paper we investigate whether a rule-based approach can help minimizing AND-XOR expressions for functions which are too large to be handled by algorithmic-based approaches.


I just need to find a basic version of this algorithm/rule to see if I can apply it by hand without having to go through a compiler
ross is offline  
Old 26 November 2019, 18:56   #40
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 51
Posts: 1,184
Quote:
Originally Posted by ross View Post
Yes, I also saw these 6 cycles in some documents.
They simply misinterpret <ea>,An not adding 2 cycles for register direct (to base time of six clock cycles).
No, my book is different. 0 cycles for registers using. When/if i will back to life, i will check this
Don_Adan is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Optimizing HAM8 renderer. Thorham Coders. Asm / Hardware 5 22 June 2017 19:29
NetSurf AGA optimizing arti Coders. Asm / Hardware 199 10 November 2013 15:36
Layered tile engine optimizing. Thorham Coders. General 0 30 September 2011 21:43
Benching and optimizing CF-IDE speed Photon support.Hardware 12 15 July 2009 02:48
For people who like optimizing 680x0 code. Thorham Coders. General 5 28 May 2008 12:48

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 12:13.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, vBulletin Solutions Inc.
Page generated in 0.09755 seconds with 15 queries