English Amiga Board 68000 code optimisations
 Register Amiga FAQ Rules & Help Members List  /  Moderators List Today's Posts Mark Forums Read

 24 April 2017, 00:35 #141 paraj Registered User   Join Date: Feb 2017 Location: Denmark Posts: 78 Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something? I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY*4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter? Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach. EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast Code: ```; d0 *= d1, assumes a5=square table, d6 used as scratch ; Instruction to beat: muls.w d1, d0 38-70(1/0) move.w d0, d6 ; 4(1/0) add.w d1, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) move.l (a5,d0.w), d0 ; 18(4/0) d0 = (A+B)^2 sub.w d1, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) sub.l (a5,d6.w), d0 ; 20(3/0) d0 = (A+B)^2 - (A-B)^2 ; see edit asr.l #2, d0 ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4 ;-------- ; 66(14/0) ; ... dc.l 9>>2 dc.l 4>>2 dc.l 1>>2 squares: ; <- a5 points here dc.l 0>>2 dc.l 1>>2 dc.l 4>>2 ; ...``` Last edited by paraj; 24 April 2017 at 01:00.
24 April 2017, 01:47   #142
Going nowhere

Join Date: Oct 2001
Location: United Kingdom
Age: 45
Posts: 7,070
Quote:
 Originally Posted by paraj Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something? I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY*4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter? Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach. EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast Code: ```; d0 *= d1, assumes a5=square table, d6 used as scratch ; Instruction to beat: muls.w d1, d0 38-70(1/0) move.w d0, d6 ; 4(1/0) add.w d1, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) move.l (a5,d0.w), d0 ; 18(4/0) d0 = (A+B)^2 sub.w d1, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) sub.l (a5,d6.w), d0 ; 20(3/0) d0 = (A+B)^2 - (A-B)^2 ; see edit asr.l #2, d0 ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4 ;-------- ; 66(14/0) ; ... dc.l 9>>2 dc.l 4>>2 dc.l 1>>2 squares: ; <- a5 points here dc.l 0>>2 dc.l 1>>2 dc.l 4>>2 ; ...```
Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0 and the same again for the 2x add.w d6,d6 being LSL.W #2,d6 ?

24 April 2017, 02:11   #143
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by Galahad/FLT Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0 and the same again for the 2x add.w d6,d6 being LSL.W #2,d6 ?
2 x add.w is 8(2/0), lsl is 6+2*2=10(1/0) (according to the instruction references I'm currently looking at). So lsl is probably better when DMA is contended (and it's shorter), but it might be close. I mostly did it because it's mentioned in the very first post of this thread and I didn't want to be scolded

 24 April 2017, 08:51 #144 a/b Registered User   Join Date: Jun 2016 Location: europe Posts: 77 muls is 56 on average 2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead. 2x14 fetch + 2x4 align = 36 Something like: Code: ``` add.w d0,d0 add.w d1,d1 move.w d0,d6 add.w d1,d0 move.w (a5,d0.w),d0 sub.w d1,d6 sub.w (a5,d6.w),d0``` That's 5x4 + 2x14 = 48 so it could work. In general case (16-bit input data) you would need a rather large table so doing a muls instead is a good compromise.
24 April 2017, 10:43   #145
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by a/b muls is 56 on average 2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead. 2x14 fetch + 2x4 align = 36 Something like: Code: ``` add.w d0,d0 add.w d1,d1 move.w d0,d6 add.w d1,d0 move.w (a5,d0.w),d0 sub.w d1,d6 sub.w (a5,d6.w),d0``` That's 5x4 + 2x14 = 48 so it could work. In general case (16-bit input data) you would need a rather large table so doing a muls instead is a good compromise.
Yeah, that could work, but if the input is limited to \$100 won't the worst case for MULS.W be (38+8*2) 54 cycles (46 average), in which case we're back to muls being hard to beat?

I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization).

I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), M?? can be premultiplied by 2/4 (the pointer to the square table could also be added). And as VX/VY/VZ are used 3 times, we can save 2*2 scalings for those.

Something like this, but again the gain seems marginal at best:

Code:
```        ; assume \1.l is premultipled by 4 (amortized cost=2*4/3)
;        \2 is premultipled by 4 and added with a5 (amortized cost=3*9*4/npoints)
;        \3 dest
;        a6 scratch
;        \1 datareg
;        \2 any ea
;        \3 should be datareg

move.l  \2, a6          ;  4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers]
move.l  (a6,\1.w), \3   ; 18(4/0)
suba.w  \1, a6          ;  8(1/0)
sub.l   (a6), \3        ; 14(3/0)
; -------
; 44(9/0) + ea + (8/3 + 108/npoints ~ 3)```

 24 April 2017, 12:53 #146 ross Sum, ergo Cogito   Join Date: Mar 2017 Location: Crossing the Rubicon Age: 48 Posts: 1,363 Hi paraj, long time ago I came across the same question... My results? Use mulu/muls, but whit care. In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as: MULU: n = the number of ones in the source MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source. So: destination must have max bit resolution (16), source can be minor. Suppose a 16x16 multiply but in source 8 bit precision suffice. With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54! [worst case MULU dx=#\$FF00->1111111100000000->38+8*2, worst case MULS dx=#\$5500->0101010100000000->38+4*2+4*2] Cheers, ross
24 April 2017, 13:43   #147
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by ross Hi paraj, long time ago I came across the same question... My results? Use mulu/muls, but whit care. In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as: MULU: n = the number of ones in the source MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source. So: destination must have max bit resolution (16), source can be minor. Suppose a 16x16 multiply but in source 8 bit precision suffice. With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54! [worst case MULU dx=#\$FF00->1111111100000000->38+8*2, worst case MULS dx=#\$5500->0101010100000000->38+4*2+4*2] Cheers, ross
A yeah, that's a good point. For non-scaling transformations the matrix elements will usually have a magnitude of <= 1 (\$100 for 8.8 fixpoint), while the vectors also probably have a known range. So it should be easy to decide the best order.

Too bad there isn't an easy way to speed up multiplication like there is with division at least it's not too slow...

 25 April 2017, 09:32 #148 Kalms Registered User   Join Date: Nov 2006 Location: Stockholm, Sweden Posts: 198 Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
26 April 2017, 21:16   #149
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by Kalms Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
Hi Kalms,

I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved.

My transform loop has the following structure:
Code:
```; a0 = dest (3 x cnt x 8.8 fixpoint words)
; a1 = src   ( --- "" ---)
; a2 = mat  ( 3 x 4 x 8.8 fixpoint words)
; d7 = cnt-1
.loop:
move.l  a2, a3          ; reload matrix pointer

move.w  (a0)+, d0       ;  8(2/0) d0 = X
move.w  (a0)+, d1       ;  8(2/0) d1 = Y
move.w  (a0)+, d2       ;  8(2/0) d2 = Z
; Process X
; ...
; Process Y
; ...
; Process Z (can trash d0-d2 here)
; ...
dbf d7, .loop```
Focusing on the "Process X" part my code currently looks as follows (should be fairly standard I think):
Code:
```        move.w  d0, d4          ;   4(1/0)      d4 = X
muls.w  (a3)+, d4       ;  42(2/0)+     d4 = X*XX (16.16 fixpoint)
move.w  d1, d5          ;   4(1/0)      d5 = Y
muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Y*XY
add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY
move.w  d2, d5          ;   4(1/0)      d5 = Z
muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Z*XZ
add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY + Z*XZ
asr.l   #8, d4          ;  24(1/0)      convert back to 8.8
move.w  (a3)+, d5       ;   8(2/0)      d5 = XW (translation)
addx.w  d5, d4          ;   4(1/0)      d4 = X*XX + Y*XY + Z*XZ + XW
move.w  d4, (a0)+       ;   8(1/1)      store transformed coordinate
; --------
; 198+```
Now pre-multiplying the input vectors by 2 (cheap even if they're not static) and changing the matrix format from:
Code:
```XX.w XY.w XZ.w XW.w
YX.w YY.w YZ.w YW.w
ZX.w ZY.w ZZ.w ZW.w```
to:
Code:
```&squares[XX].l &squares[XY].l &squares[XZ].l XW.w
&squares[YX].l ...```
where squares contains (n*n)>>10

Processing X could be done like this:
Code:
```        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XX]
move.w  (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2
neg.w   d0              ;   4(1/0)
sub.w   (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2 - (XX-X)^2 = X*XX
neg.w   d0              ;   4(1/0)      restore d0 for processing Y
move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XY]
add.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2
neg.w   d1              ;   4(1/0)
sub.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY
neg.w   d1              ;   4(1/0)      restore d1 for processing Y
move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XZ]
add.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2
neg.w   d2              ;   4(1/0)
sub.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ
neg.w   d2              ;   4(1/0)      restore d2 for processing Y
add.w   (a3)+, d4       ;  14(3/0)      d4 = X*XX + Y*XY + Z*XZ + XW
move.w  d4, (a0)+       ;   8(1/1)
;---------
; 166```
And it's starting to look OK in terms of cycles won vs. loss of precision!

Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against.

EDIT: while redoing the cycle timing i also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y..

Last edited by paraj; 27 April 2017 at 02:50. Reason: Messed up instruction timing, gave it another try

 28 April 2017, 03:00 #150 Kalms Registered User   Join Date: Nov 2006 Location: Stockholm, Sweden Posts: 198 Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well. Last edited by Kalms; 28 April 2017 at 03:12.
01 May 2017, 12:00   #151
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by Kalms Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.
Once again, massive thanks for the hints! Too many years of self modifying code being verbotten has crippled my optimization skills I think

I have to admit I don't quite understand what you're getting at with 'data parallel computations' though. Do you mean like using both the upper and lower word simultaneously?

For ayone curious with the above tricks and the square table in low memory (16-bit addressable) the following loop is possible:

Code:
```                movem.w (a1)+, a2-a4    ; 24(7/0)

; X
.offXXp:        move.w  \$1234(a2), d0   ; 12(3/0)
.offXXm:        sub.w   \$1234(a2), d0   ; 12(3/0)
.offXYp:        add.w   \$1234(a3), d0   ; 12(3/0)
.offXYm:        sub.w   \$1234(a3), d0   ; 12(3/0)
.offXZp:        add.w   \$1234(a4), d0   ; 12(3/0)
.offXZm:        sub.w   \$1234(a4), d0   ; 12(3/0)
.offXW:         add.w   #\$1234, d0      ;  8(2/0)
move.w  d0, (a0)+       ;  8(1/1)
; -------
; 88
; Y & Z handled similarly```
with the points prescaled by 2 and the offset at .offXXp being &squares[XX].w and .offXXm &squares[-XX].w and so on.

EDIT: My simple 1-plane transform/project/putpixel test went from being able to handle 140 points to 237 in one frame just by applying the above optimization - pretty decent gain!

And of course the square table should be used for projection as well - &squares[\$10000/z].w seems like a nice format for the reciprocal table.

Last edited by paraj; 01 May 2017 at 14:22. Reason: Added some stats

 11 May 2017, 13:05 #152 meynaf son of 68k   Join Date: Nov 2007 Location: Lyon / France Age: 45 Posts: 3,339 As this thread is for generic 68000 optimizations, here is another. (Actually this one is valid for just about any cpu ) You can replace : Code: ``` not.w d0 and.w d0,d1 not.w d0``` By : Code: ``` or.w d0,d1 eor.w d0,d1``` (Makes "bit clear" / "and not" instruction proposals a lot less sexy doesn't it ?)
11 May 2017, 21:31   #153
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
 Originally Posted by meynaf As this thread is for generic 68000 optimizations,
Sorry if the earlier MUL discussion ended up being too specific, that's the opposite of my intend, but I guess it's hard to be generic when tables are involved.

And nice little bitwise identity I hadn't seen before!

 05 February 2018, 02:10 #154 mikro Registered User   Join Date: Oct 2007 Location: Bratislava / Slovakia Posts: 31 My humble contribution. Sometimes I need to isolate the carry: Code: ```subx.b d1,d1 ; neg.b d1 ; add.b d1,d0 ; add carry do d0's lower byte``` Code: ```subx.w d1,d1 ; negx.b d1 ; sub.w d1,d0 ; add carry to d0's upper byte``` Nice feature is that you can use use long size if needed, making it adda/suba friendly. Last edited by mikro; 16 February 2018 at 12:12.
 16 February 2018, 15:11 #155 Gorf Registered User   Join Date: May 2017 Location: Munich/Bavaria Posts: 551 Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks: multiply d0 by 29: Code: ```move.l d0, d1 lsl.l #4, d0 sub.l d1, d0 add.l d0, d0 sub.l d1, d0``` multiply d0 by 625: Code: ```move.l d0, d1 lsl.l #2, d0 add.l d1, d0 lsl.l #3, d0 sub.l d1, d0 lsl.l #4, d0 add.l d1, d0``` multiply d0 by 156: Code: ```move.l d0, d1 lsl.l #2, d1 add.l d1, d0 lsl.l #5, d0 sub.l d1, d0```
21 November 2018, 11:20   #156
dissident
Registered User

Join Date: Sep 2015
Location: Germany
Posts: 174
Quote:
 Originally Posted by Gorf Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks: multiply d0 by 29: Code: ```move.l d0, d1 lsl.l #4, d0 sub.l d1, d0 add.l d0, d0 sub.l d1, d0``` multiply d0 by 625: Code: ```move.l d0, d1 lsl.l #2, d0 add.l d1, d0 lsl.l #3, d0 sub.l d1, d0 lsl.l #4, d0 add.l d1, d0``` multiply d0 by 156: Code: ```move.l d0, d1 lsl.l #2, d1 add.l d1, d0 lsl.l #5, d0 sub.l d1, d0```
I've attached my collection of these multiply tricks based on the MC68000 execution times (including these three cases above) as macros, so they can be easily used in every sourcecode.
Attached Files
 MULU-Macros.s (25.1 KB, 20 views)

02 December 2018, 07:27   #157
NorthWay
Registered User

Join Date: May 2013
Posts: 523
Quote:
 Originally Posted by pmc Code: ``` lea vals(pc),An movem.w (An),Dn-Dn``` gets you the ext.l's for free

Which reminded me of something; once upon a time there was a thing called the GNU(gcc?) super-optimizer that supposedly knew about all sorts of of tricks to get things done in as few instructions as possible (and I think IBM was very proud of the Power instruction set as it had been proven to be possible to do 'all' of the instruction optimization equivalents in a single opcode).
Does anyone know if that is still a thing, or if it has been folded into the mainline product? I'm guessing there would have to be a description of what it did for each and every architecture - possibly some interesting stuff there.

 Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)

 Similar Threads Thread Thread Starter Forum Replies Last Post billt Coders. General 15 05 May 2012 21:13 sanjyuubi support.Games 5 27 May 2011 18:11 Counia Hardware mods 1 01 March 2011 11:18 Hungry Horace project.WHDLoad 60 19 December 2006 21:17 EdzUp Retrogaming General Discussion 0 10 February 2002 12:40

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home News Main     Amiga scene     Retrogaming General Discussion     Nostalgia & memories Support     New to Emulation or Amiga scene         Member Introductions     support.WinUAE     support.WinFellow     support.OtherUAE     support.FS-UAE         project.AmigaLive     support.Hardware         Hardware mods         Hardware pics     support.Games     support.Demos     support.Apps     support.Amiga Forever     support.Amix     support.Other Requests     request.UAE Wishlist     request.Old Rare Games     request.Demos     request.Apps     request.Modules     request.Music     request.Other     Looking for a game name ?     Games images which need to be WHDified abime.net - Hall Of Light     HOL news     HOL suggestions and feedback     HOL data problems     HOL contributions abime.net - Amiga Magazine Rack     AMR news     AMR suggestions and feedback     AMR data problems     AMR contributions abime.net - Home Projects     project.Amiga Lore     project.EAB     project.IRC     project.Mods Jukebox     project.Wiki abime.net - Hosted Projects     project.aGTW     project.APoV     project.ClassicWB     project.Jambo!     project.Green Amiga Alien GUIDES     project.Maptapper     project.Sprites     project.WinUAE - Kaillera Other Projects     project.Amiga Demo DVD     project.Amiga Game Factory     project.CARE     project.EAB File Server     project.CD32 Conversion     project.Game Cover Art         GCA.Feedback and Suggestions         GCA.Work in Progress         GCA.Cover Requests         GCA.Usefull Programs         GCA.Helpdesk     project.KGLoad     project.MAGE     project.Missing Full Shareware Games     project.SPS (was CAPS)     project.TOSEC (amiga only)     project.WHDLoad         project.Killergorilla's WHD packs Misc     Amiga websites reviews     MarketPlace         Swapshop     Collections     EAB's competition Coders     Coders. General         Coders. Releases         Coders. Tutorials     Coders. Asm / Hardware     Coders. System         Coders. Scripting         Coders. Nextgen     Coders. Language         Coders. C/C++         Coders. AMOS         Coders. Blitz Basic     Coders. Contest         Coders. Entries Off Topic     OT - General     OT - Entertainment     OT - Sports     OT - Technical     OT - Gaming

All times are GMT +2. The time now is 15:59.

 -- EAB3 skin ---- EAB2 skin ---- Mobile skin Archive - Top