25 November 2019, 18:15 | #21 |
Newbie Amiga programmer
Join Date: Jun 2012
Location: Front of my A500+
Age: 38
Posts: 372
|
Well, AFAIK "modulo" means "remainder (until the next)" and "delta" means "change/difference (between two)". Both can be applied to this.
I've tried your new code. The speed has once again improved significantly, but it is still slower than the C code. The difference is tiny (~0.1-0.2%), but consistent. It is very weird. |
25 November 2019, 18:29 | #22 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Quote:
|
|
25 November 2019, 18:41 | #23 |
Newbie Amiga programmer
Join Date: Jun 2012
Location: Front of my A500+
Age: 38
Posts: 372
|
68000. Both. (Stock 1M A500.)
|
25 November 2019, 19:07 | #24 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Interesting...
Is the rest of the code the same? For example, you have both functions (C and asm) in the executable, and then, say, call the C one 100 times and then the asm one 100 times. Or have you replaced some other parts of the code as well? In any case, maybe look at the compiled code and see what's different. Or just post the executable here. The asm version should be pretty decent. I don't like the innermost loop though, because I keep thinking there might be some eor trick to speed it up that the compiler can see and I can't (and in general, those kinds of optimizations are my weaker side, I just can't spot them right away). |
25 November 2019, 19:31 | #25 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
|
25 November 2019, 21:29 | #26 |
Newbie Amiga programmer
Join Date: Jun 2012
Location: Front of my A500+
Age: 38
Posts: 372
|
@a/b:
Not entirely. I run the call of the "parent" function (DrawPolygon2DCPU) 64 times and measure the time with my AMTime tool (a GNU 'time' replacement). I either call your ASM code and put my C code in a comment block, or I comment out the call to your ASM routine and let the C code run. Code:
DestArea = (rp->BitMap->Planes[0] + RectOffset);
PlaneSize >>= 2;
RowSize <<= 2;
Modulo = PlaneSize - w;

/* Either i comment this... */
PolygonBitmapToPlanes32((void *)DestArea, (void *)TempArea, h, w, Modulo, RowSize, D, (void *)cpattern32);

/* ...or this. */
th = h;
while (--th >= 0)
{
    dstptr32 = (ULONG *)DestArea;
    i = D;
    while (--i >= 0)
    {
        srcptr32 = (ULONG *)TempArea;
        pattern32 = cpattern32[th & 1][i];
        t = w;
        while (--t >= 0)
        {
            read32 = *srcptr32++;
            *dstptr32 &= ~read32;
            *dstptr32++ |= (read32 & pattern32);
        }
        dstptr32 += Modulo;
    }
    TempArea += RowSize;
    DestArea += RowSize;
}
http://oscomp.hu/depot/polygon_ab_asm.s
http://oscomp.hu/depot/polygon_tch_c.s
@Don_Adan: Perhaps, but with that we will only gain a few cycles. |
25 November 2019, 22:52 | #27 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
|
25 November 2019, 23:37 | #28 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Yup, called it.
So now you can change the inner loop to: Code:
c_w:    move.l  (a0),d7
        move.l  d6,d4
        eor.l   d7,d4
        and.l   (a3)+,d4
        eor.l   d7,d4
        move.l  d4,(a0)+        ; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
|
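For comparison, the same merge can be written in C. This is only a minimal sketch, not code taken from either linked source: `merge_two_step` mirrors the two read-modify-write statements of the C loop in post #26, `merge_eor` mirrors the eor/and/eor sequence of the inner loop above, and both function names are made up for illustration.

```c
#include <stdint.h>

/* Two-step form, as in the C loop: clear the masked bits,
   then OR in the masked pattern bits. */
static uint32_t merge_two_step(uint32_t dst, uint32_t mask, uint32_t pattern)
{
    dst &= ~mask;
    dst |= (mask & pattern);
    return dst;
}

/* eor form, as in the asm loop: eor the pattern with dst,
   mask the result, then eor with dst again. */
static uint32_t merge_eor(uint32_t dst, uint32_t mask, uint32_t pattern)
{
    return dst ^ (mask & (dst ^ pattern));
}
```

Where the mask bit is 0 the result keeps the destination bit, and where it is 1 the result takes the pattern bit, so both functions should return the same value for any inputs.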
26 November 2019, 02:20 | #29 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
You can check this version; it may be a few cycles faster, if it works. I optimised the a/b version a little.
Code:
_PolygonBitmapToPlanes32:
        movem.l d2-d7/a2-a6,-(a7)
        move.l  d2,a6
        Add.l   a6,a6
        Add.l   a6,a6           ; a6 = Modulo<<2 = BitplaneSize-Width<<2
        add.w   d1,d2
        mulu.w  d4,d2
        Lsl.l   #2,d2           ; longwords to bytes
        sub.l   d3,d2
        move.l  d2,-(a7)        ; (a7) = Depth*BitplaneSize-RowSize
        subq.w  #1,d0           ; Height--;
        subq.w  #1,d1           ; Width--;
        subq.w  #1,d4           ; Depth--;
        Swap    D4
        Move.w  d1,d4
        moveq   #0,d2
c_h:    Move.l  a2,a4
        Add.w   d2,a4
        eor.w   #8<<2,d2        ; alternate between 0 and 8<<2
        move.l  d4,d1
        Swap    d1              ; PlaneCounter = Depth;
c_p:    movea.l a1,a3           ; SrcPtr = TempArea;
        move.l  (a4)+,A5        ; CurrentPattern
        move.w  D4,d5           ; WidthCounter = Width-1;
c_w:    move.l  (a0),d7
        move.l  A5,d6
        eor.l   d7,d6
        and.l   (a3)+,d6
        eor.l   d7,d6
        move.l  d6,(a0)+        ; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
        dbf     d5,c_w          ; if (--WidthCounter >= 0) goto c_w;
        adda.l  a6,a0           ; DestArea += BitplaneSize-Width<<2;
        dbf     d1,c_p          ; if (--PlaneCounter >= 0) goto c_p;
        suba.l  (a7),a0         ; DestArea += RowSize-Depth*BitplaneSize;
        adda.l  d3,a1           ; TempArea += RowSize;
        dbf     d0,c_h          ; if (--Height >= 0) goto c_h;
;       addq.l  #4,a7
        movem.l (a7)+,d1-d7/a2-a6
        rts
|
26 November 2019, 04:01 | #30 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Nice catch, it's a 4 cycles gain in the outer loop.
About lea vs. move/adda (after c_h label):
- 000/010: it's the same, 12 vs. 4+8 cycles
- 020/030: guessing it's the same
- 040: lea is faster
- 060: guessing lea is faster since all those should be "1 cycle", and 1 < 2
All in all, I'd use lea. And yeah, -(a7) at the end. I left that intentionally, the whole movem situation has to be looked over since, probably, you don't have to preserve d2-d7/a2-a6 (the original code didn't, but I put there all the regs because it's mixed with c). |
26 November 2019, 11:46 | #31 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
But I would like to understand what are the steps to go from:
y=(a&c)|(b&~a);
to:
y=b^(a&(b^c));
Yes, surely with some boolean algebra optimization you can get here, but how in a human 'intuitive' manner? (minimizing equations using ^ is hard, even more so with sequential partials..) |
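One possible route by hand, written as a sequence of C functions so that each step can be checked separately. This is a sketch of one derivation, not necessarily how a compiler arrives at the result, and the function names are hypothetical. The intuition is that each bit of `a` selects between `c` (when 1) and `b` (when 0).

```c
#include <stdint.h>

/* Start: y = (a & c) | (b & ~a) */
static uint32_t mux_or(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) | (b & ~a);
}

/* Step 1: the two terms can never both have the same bit set
   (one needs a=1, the other a=0), so | can be replaced by ^. */
static uint32_t mux_step1(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) ^ (b & ~a);
}

/* Step 2: b & ~a is b with the bits of (a & b) removed,
   which is the same as b ^ (a & b). */
static uint32_t mux_step2(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) ^ (a & b) ^ b;
}

/* Step 3: & distributes over ^, so (a & c) ^ (a & b) == a & (b ^ c). */
static uint32_t mux_xor(uint32_t a, uint32_t b, uint32_t c)
{
    return b ^ (a & (b ^ c));
}
```

Calling all four with a = 0xF0F0F0F0, b = 0xCCCCCCCC, c = 0xAAAAAAAA covers every combination of the three input bits at once, so equal results there confirm the identity bit by bit.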
26 November 2019, 14:24 | #32 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Quote:
Move.w d2,A4
add.l a2,a4
This version can be used. Perhaps 2 cycles faster, but again not tested by me. The (SP) access can be removed too, but it needs 2 extra swap instructions, so the speed on 68000 will perhaps be the same. Last edited by Don_Adan; 26 November 2019 at 14:41. |
|
26 November 2019, 14:48 | #33 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Quote:
So you eor a with b before and after masking it. It reminded me of c2p code, but that was after I posted the code and went to sleep and kept thinking about it, and figured I better check some c2p docs tomorrow to try finding any similarities, but when I woke up it was gone. I'd guess anyone who spent a bunch of time working on c2p converters would spot it from a mil.... kilometer (die imperial scum ;p, and rip SW btw) away. |
|
26 November 2019, 14:58 | #34 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
lea (a2,d2.x),a4
I also always prefer lea in similar cases (you also get a free 8-bit offset). Well, waiting for some c2p expert, because this equivalence was not simple for me.
|
26 November 2019, 14:59 | #35 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
adda.x reg,areg is 8 cycles, even addq.x #y,areg is 8 on 68000. It's just slower because it's always 32-bit. You'd have to make the destination truly 16-bit (e.g. add.w reg,dx) to make it work in 4 cycles.
All move(a).x reg,reg are 4, though. |
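The "always 32-bit" point can be modelled in C. This is a minimal sketch with hypothetical helper names, not an emulator: a word-sized add to an address register sign-extends the source and updates the whole register, while a word-sized add to a data register only touches the low word.

```c
#include <stdint.h>

/* Model of ADDA.W src,An: the 16-bit source is sign-extended
   to 32 bits and the full address register is updated. */
static uint32_t adda_w(uint32_t an, uint16_t src)
{
    int32_t ext = (src & 0x8000u) ? (int32_t)src - 0x10000 : (int32_t)src;
    return an + (uint32_t)ext;
}

/* Model of ADD.W src,Dn: only the low word of the data register
   changes, the high word is left alone. */
static uint32_t add_w(uint32_t dn, uint16_t src)
{
    uint16_t low = (uint16_t)((uint16_t)dn + src);
    return (dn & 0xFFFF0000u) | low;
}
```

With a source of 0xFFFF, `adda_w` subtracts 1 from the whole register, whereas `add_w` wraps only the low word.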
26 November 2019, 16:53 | #36 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Quote:
Add.w Dx,Ax is 8 cycles
But Add.l Ax,Ax is 6 cycles only. I never tested this on real hardware, so I'm not sure. |
|
26 November 2019, 17:04 | #37 |
OCS forever!
Join Date: Mar 2019
Location: Birmingham, UK
Posts: 418
|
From the page I use below. I read it as 8 (well, I have been!)
http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML
                    op<ea>,An
ADD   byte,word     8(1/0) +
      long          6(1/0) +**
+ Add effective address calculation time
** The base time of six clock periods is increased to eight if the effective address mode is register direct or immediate (effective address time should also be added) |
26 November 2019, 17:10 | #38 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Yes, I also saw these 6 cycles in some documents.
They simply misread the <ea>,An column and forget to add the 2 cycles for register direct to the base time of six clock cycles.
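The rule quoted in post #37 can be written down as a small C helper. This is a sketch under assumptions: the base times and the +2 rule come from the table quoted above, while the effective address times used here (0 for register direct, 4 for a word and 8 for a long with (An) or immediate) are taken from the standard 68000 timing tables and are not quoted in this thread. The names `ea_mode`, `ea_cycles` and `adda_cycles` are made up for illustration, and only three addressing modes are covered.

```c
/* Hypothetical subset of 68000 source addressing modes. */
enum ea_mode { EA_REG_DIRECT, EA_INDIRECT, EA_IMMEDIATE };

/* Effective address calculation time in cycles
   (assumed values from the standard 68000 tables). */
static int ea_cycles(enum ea_mode mode, int is_long)
{
    switch (mode) {
    case EA_REG_DIRECT: return 0;
    case EA_INDIRECT:   return is_long ? 8 : 4;
    case EA_IMMEDIATE:  return is_long ? 8 : 4;
    }
    return 0;
}

/* ADD <ea>,An: base 8 for byte/word, base 6 for long; the long base
   rises to 8 when the source is register direct or immediate. */
static int adda_cycles(enum ea_mode mode, int is_long)
{
    int base = is_long ? 6 : 8;
    if (is_long && (mode == EA_REG_DIRECT || mode == EA_IMMEDIATE))
        base += 2;
    return base + ea_cycles(mode, is_long);
}
```

Under these assumptions a register-to-address-register add comes out at 8 cycles for both word and long, which matches the reading in posts #35 and #38.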
26 November 2019, 17:37 | #39 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Found this paper: "Rule-Based Optimization of AND-XOR Expressions".
Abstract: The problem of finding a minimum AND-XOR expression for a given boolean function is known to be very hard. In this paper we investigate whether a rule-based approach can help minimizing AND-XOR expressions for functions which are too large to be handled by algorithmic-based approaches.
I just need to find a basic version of this algorithm/rule to see if I can apply it by hand without having to go through a compiler. |
26 November 2019, 17:56 | #40 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
|