23 April 2017, 23:35 | #141 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something?
I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY*4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter? Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach. EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast Code:
; d0 *= d1, assumes a5=square table, d6 used as scratch ; Instruction to beat: muls.w d1, d0 38-70(1/0) move.w d0, d6 ; 4(1/0) add.w d1, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) move.l (a5,d0.w), d0 ; 18(4/0) d0 = (A+B)^2 sub.w d1, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) sub.l (a5,d6.w), d0 ; 20(3/0) d0 = (A+B)^2 - (A-B)^2 ; see edit asr.l #2, d0 ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4 ;-------- ; 66(14/0) ; ... dc.l 9>>2 dc.l 4>>2 dc.l 1>>2 squares: ; <- a5 points here dc.l 0>>2 dc.l 1>>2 dc.l 4>>2 ; ... Last edited by paraj; 24 April 2017 at 00:00. |
24 April 2017, 00:47 | #142 | |
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 8,986
|
Quote:
|
|
24 April 2017, 01:11 | #143 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
2 x add.w is 8(2/0), lsl is 6+2*2=10(1/0) (according to the instruction references I'm currently looking at). So lsl is probably better when DMA is contended (and it's shorter), but it might be close. I mostly did it because it's mentioned in the very first post of this thread and I didn't want to be scolded
|
24 April 2017, 07:51 | #144 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead. 2x14 fetch + 2x4 align = 36 Something like: Code:
add.w d0,d0 add.w d1,d1 move.w d0,d6 add.w d1,d0 move.w (a5,d0.w),d0 sub.w d1,d6 sub.w (a5,d6.w),d0 |
24 April 2017, 09:43 | #145 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization). I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), M?? can be premultiplied by 2/4 (the pointer to the square table could also be added). And as VX/VY/VZ are used 3 times, we can save 2*2 scalings for those. Something like this, but again the gain seems marginal at best: Code:
; assume \1.l is premultipled by 4 (amortized cost=2*4/3) ; \2 is premultipled by 4 and added with a5 (amortized cost=3*9*4/npoints) ; \3 dest ; a6 scratch ; \1 datareg ; \2 any ea ; \3 should be datareg move.l \2, a6 ; 4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers] move.l (a6,\1.w), \3 ; 18(4/0) suba.w \1, a6 ; 8(1/0) sub.l (a6), \3 ; 14(3/0) ; ------- ; 44(9/0) + ea + (8/3 + 108/npoints ~ 3) |
|
24 April 2017, 11:53 | #146 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Hi paraj, long time ago I came across the same question...
My results? Use mulu/muls, but whit care. In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as: MULU: n = the number of ones in the source MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source. So: destination must have max bit resolution (16), source can be minor. Suppose a 16x16 multiply but in source 8 bit precision suffice. With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54! [worst case MULU dx=#$FF00->1111111100000000->38+8*2, worst case MULS dx=#$5500->0101010100000000->38+4*2+4*2] Cheers, ross |
24 April 2017, 12:43 | #147 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
Too bad there isn't an easy way to speed up multiplication like there is with division at least it's not too slow... |
|
25 April 2017, 08:32 | #148 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
|
26 April 2017, 20:16 | #149 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved. My transform loop has the following structure: Code:
; a0 = dest (3 x cnt x 8.8 fixpoint words) ; a1 = src ( --- "" ---) ; a2 = mat ( 3 x 4 x 8.8 fixpoint words) ; d7 = cnt-1 .loop: move.l a2, a3 ; reload matrix pointer move.w (a0)+, d0 ; 8(2/0) d0 = X move.w (a0)+, d1 ; 8(2/0) d1 = Y move.w (a0)+, d2 ; 8(2/0) d2 = Z ; Process X ; ... ; Process Y ; ... ; Process Z (can trash d0-d2 here) ; ... dbf d7, .loop Code:
move.w d0, d4 ; 4(1/0) d4 = X muls.w (a3)+, d4 ; 42(2/0)+ d4 = X*XX (16.16 fixpoint) move.w d1, d5 ; 4(1/0) d5 = Y muls.w (a3)+, d5 ; 42(2/0)+ d5 = Y*XY add.l d5, d4 ; 8(1/0) d4 = X*XX + Y*XY move.w d2, d5 ; 4(1/0) d5 = Z muls.w (a3)+, d5 ; 42(2/0)+ d5 = Z*XZ add.l d5, d4 ; 8(1/0) d4 = X*XX + Y*XY + Z*XZ asr.l #8, d4 ; 24(1/0) convert back to 8.8 move.w (a3)+, d5 ; 8(2/0) d5 = XW (translation) addx.w d5, d4 ; 4(1/0) d4 = X*XX + Y*XY + Z*XZ + XW move.w d4, (a0)+ ; 8(1/1) store transformed coordinate ; -------- ; 198+ Code:
XX.w XY.w XZ.w XW.w YX.w YY.w YZ.w YW.w ZX.w ZY.w ZZ.w ZW.w Code:
&squares[XX].l &squares[XY].l &squares[XZ].l XW.w &squares[YX].l ... Processing X could be done like this: Code:
move.l (a3)+, a4 ; 12(3/0) a4 = &squares[XX] move.w (a4,d0.w), d4 ; 14(3/0) d4 = (XX+X)^2 neg.w d0 ; 4(1/0) sub.w (a4,d0.w), d4 ; 14(3/0) d4 = (XX+X)^2 - (XX-X)^2 = X*XX neg.w d0 ; 4(1/0) restore d0 for processing Y move.l (a3)+, a4 ; 12(3/0) a4 = &squares[XY] add.w (a4,d1.w), d4 ; 14(3/0) d4 = X*XX + (XY+Y)^2 neg.w d1 ; 4(1/0) sub.w (a4,d1.w), d4 ; 14(3/0) d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY neg.w d1 ; 4(1/0) restore d1 for processing Y move.l (a3)+, a4 ; 12(3/0) a4 = &squares[XZ] add.w (a4,d2.w), d4 ; 14(3/0) d4 = X*XX + Y*XY + (XZ+Z)^2 neg.w d2 ; 4(1/0) sub.w (a4,d2.w), d4 ; 14(3/0) d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ neg.w d2 ; 4(1/0) restore d2 for processing Y add.w (a3)+, d4 ; 14(3/0) d4 = X*XX + Y*XY + Z*XZ + XW move.w d4, (a0)+ ; 8(1/1) ;--------- ; 166 Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against. EDIT: while redoing the cycle timing i also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y.. Last edited by paraj; 27 April 2017 at 01:50. Reason: Messed up instruction timing, gave it another try |
|
28 April 2017, 02:00 | #150 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.
Last edited by Kalms; 28 April 2017 at 02:12. |
01 May 2017, 11:00 | #151 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
I have to admit I don't quite understand what you're getting at with 'data parallel computations' though. Do you mean like using both the upper and lower word simultaneously? For ayone curious with the above tricks and the square table in low memory (16-bit addressable) the following loop is possible: Code:
movem.w (a1)+, a2-a4 ; 24(7/0) ; X .offXXp: move.w $1234(a2), d0 ; 12(3/0) .offXXm: sub.w $1234(a2), d0 ; 12(3/0) .offXYp: add.w $1234(a3), d0 ; 12(3/0) .offXYm: sub.w $1234(a3), d0 ; 12(3/0) .offXZp: add.w $1234(a4), d0 ; 12(3/0) .offXZm: sub.w $1234(a4), d0 ; 12(3/0) .offXW: add.w #$1234, d0 ; 8(2/0) move.w d0, (a0)+ ; 8(1/1) ; ------- ; 88 ; Y & Z handled similarly EDIT: My simple 1-plane transform/project/putpixel test went from being able to handle 140 points to 237 in one frame just by applying the above optimization - pretty decent gain! And of course the square table should be used for projection as well - &squares[$10000/z].w seems like a nice format for the reciprocal table. Last edited by paraj; 01 May 2017 at 13:22. Reason: Added some stats |
|
11 May 2017, 12:05 | #152 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
As this thread is for generic 68000 optimizations, here is another.
(Actually this one is valid for just about any cpu ) You can replace : Code:
not.w d0 and.w d0,d1 not.w d0 Code:
or.w d0,d1 eor.w d0,d1 |
11 May 2017, 20:31 | #153 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Sorry if the earlier MUL discussion ended up being too specific, that's the opposite of my intend, but I guess it's hard to be generic when tables are involved.
And nice little bitwise identity I hadn't seen before! |
05 February 2018, 01:10 | #154 |
Registered User
Join Date: Oct 2007
Location: Bratislava / Slovakia
Posts: 34
|
My humble contribution.
Sometimes I need to isolate the carry: Code:
subx.b d1,d1 ; neg.b d1 ; add.b d1,d0 ; add carry do d0's lower byte Code:
subx.w d1,d1 ; negx.b d1 ; sub.w d1,d0 ; add carry to d0's upper byte Last edited by mikro; 16 February 2018 at 11:12. |
16 February 2018, 14:11 | #155 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,294
|
Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks:
multiply d0 by 29: Code:
move.l d0, d1 lsl.l #4, d0 sub.l d1, d0 add.l d0, d0 sub.l d1, d0 Code:
move.l d0, d1 lsl.l #2, d0 add.l d1, d0 lsl.l #3, d0 sub.l d1, d0 lsl.l #4, d0 add.l d1, d0 Code:
move.l d0, d1 lsl.l #2, d1 add.l d1, d0 lsl.l #5, d0 sub.l d1, d0 |
21 November 2018, 10:20 | #156 | |
Registered User
Join Date: Sep 2015
Location: Germany
Posts: 256
|
Quote:
|
|
02 December 2018, 06:27 | #157 | |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
|
Quote:
Which reminded me of something; once upon a time there was a thing called the GNU(gcc?) super-optimizer that supposedly knew about all sorts of of tricks to get things done in as few instructions as possible (and I think IBM was very proud of the Power instruction set as it had been proven to be possible to do 'all' of the instruction optimization equivalents in a single opcode). Does anyone know if that is still a thing, or if it has been folded into the mainline product? I'm guessing there would have to be a description of what it did for each and every architecture - possibly some interesting stuff there. |
|
27 April 2019, 00:02 | #158 | |
Registered User
Join Date: Feb 2015
Location: Copehagen
Posts: 36
|
sorry to wake up this old thread,
i use it from time to time, getting to old to remember, so i have to see these tips every time i start coding again. Quote:
|
|
27 April 2019, 00:39 | #159 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
lsl #8you can: Code:
lea spare(pc),a0 clr.w (a0) .... move.b d0,(a0) ;8 move.w (a0),d0 ;8 .... spare ds.w 1 |
|
27 April 2019, 00:57 | #160 |
Registered User
Join Date: Feb 2015
Location: Copehagen
Posts: 36
|
Many Thanks trying to optimize an old Atari ST game.
it was a shi..y arcade conversion(outrun) for both the ST and the Amiga. just want to see how it would run if it had got some more love and it uses a lot of lsr #8 and Lsl #8 so the about is just perfect Last edited by PeterJ; 27 April 2019 at 01:04. |
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
68000 boot code | billt | Coders. General | 15 | 05 May 2012 20:13 |
Wasted Dreams on 68000 | sanjyuubi | support.Games | 5 | 27 May 2011 17:11 |
680x0 to 68000 | Counia | Hardware mods | 1 | 01 March 2011 10:18 |
quitting on 68000? | Hungry Horace | project.WHDLoad | 60 | 19 December 2006 20:17 |
3D code and/or internet code for Blitz Basic 2.1 | EdzUp | Retrogaming General Discussion | 0 | 10 February 2002 11:40 |
|
|