Quickest way to test collisions - Page 7

Daedalus · 25 June 2018, 17:33

Most functions in Blitz will automatically (and silently) cast numeric variables from one type to another when required - I'm struggling to think of any that don't. This can be useful, but can also lead to bugs later on with overflows, loss of precision and all that good stuff. I don't know how much CPU time that casting actually takes, but I can't imagine it's for free.

I also suspect there might be a more pronounced difference in performance on a real 68000 system, where conversions involving 32-bit variables (like a quick) will be slower than 16-bit word-specific versions.

E-Penguin · 25 June 2018, 22:20

With the most bare-bones ASM I can think of (am I missing a trick?) I can't beat QABS. It must use voodoo.

-- edited to add --
Trying the shift, XOR, mask approach here gives me the same results as ABS, leading me to think that must be how it's done internally. I don't think it can be beaten.

Code:

WBStartup
DEFTYPE.w

ResetTimer
For i=0 To 9999
  a =  Rnd(500)
  b =  Rnd(500)
  c = Abs(a - b)
Next i
NPrint "ABS " , Ticks

ResetTimer
For i=0 To 9999
  a =  Rnd(500)
  b =  Rnd(500)

  GetReg d0, a
  GetReg d1, b
  SUB.w d1, d0
  BMI adIsNeg2
  JMP ed
adIsNeg2:
  NEG.w d0
ed:
  PutReg d0, a
Next i
NPrint "AbsDiff " , Ticks

ResetTimer
For i=0 To 9999
  a =  Rnd(500)
  b =  Rnd(500)
  c = QAbs(a - b)
Next i
NPrint "QABS " , Ticks

VWait 500
End

Daedalus · 25 June 2018, 23:38

What's also interesting is that the speed of Abs() and QAbs() is the same... What CPU was that run on? I don't know about the relative speeds, but perhaps Abs() and QAbs() are using a bit test and EOR internally instead, gaining some cycles that way?

Edit: sorry, didn't see your edit adding that you've already tried it

E-Penguin · 26 June 2018, 00:12

Abs and QAbs are usually within a tick of each other; I'm putting that down to variance in the Rnd command (I avoided literals to ensure they weren't optimised away). Standard A1200 WinUAE config.

QAbs and Abs can't be doing any branching; it's too slow. I suppose I could try using the shiny new debugger in WinUAE 4 and step through the ASM, but ain't nobody got time for that.

Summary: Abs/QAbs are more or less equivalent, and there's little-to-no scope for optimisation.

Daedalus · 26 June 2018, 10:52

I wonder what difference WinUAE might be making... If I have time I might try it out on a 68000 machine later today to see how it goes. I don't think the A1200 config is fully cycle-exact, which means there could be shortcuts taken in calculations that are more or less 1:1 with x86 equivalents for example, and the 16-bit bus of the 68000 won't be slowing things down either...

idrougge · 26 June 2018, 11:56

Quote:

Originally Posted by E-Penguin

Trying the shift, XOR, mask approach here gives me the same results as ABS, leading me to think that must be how it's done internally. I don't think it can be beaten.

Shifting on the plain 68000 is a half-expensive operation, at least if you shift that many steps.

E-Penguin · 26 June 2018, 13:19

I guess it's a matter of shift vs a conditional branch + jmp. They look about the same order of duration.

Obviously this could be done very quickly with a lookup table if one doesn't mind creating an array of 128 KB... (that's not necessarily a silly suggestion if you have a bit of Fast RAM going spare).
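
For illustration, here is a minimal sketch of the lookup-table idea in Python rather than Blitz or 68k assembly. The names `build_abs_table`, `ABS_TABLE` and `table_abs` are hypothetical. It shows where the 128 KB figure comes from: one 2-byte entry for each of the 65,536 possible 16-bit values.

```python
from array import array

def build_abs_table():
    """Build a table of unsigned 16-bit words, indexed by the raw 16-bit pattern."""
    table = array("H", [0] * 65536)  # "H" = unsigned 16-bit entries
    for raw in range(65536):
        # Reinterpret the raw bit pattern as a signed 16-bit word.
        signed = raw - 65536 if raw >= 32768 else raw
        table[raw] = abs(signed)
    return table

ABS_TABLE = build_abs_table()

def table_abs(x):
    """Absolute value of a signed 16-bit x using a single table lookup."""
    return ABS_TABLE[x & 0xFFFF]

# 65536 entries * 2 bytes each
print(len(ABS_TABLE) * ABS_TABLE.itemsize, "bytes")
print(table_abs(-500), table_abs(123))
```

On real hardware the lookup would be a single indexed MOVE.w, so the trade-off is memory against the compare-and-negate work.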

idrougge · 26 June 2018, 13:57

Here is a branchless solution I found. It might not be any faster on a non-pipelined CPU, though. https://gist.github.com/cahirwpz/19c...f03025874530fc

Master484 · 26 June 2018, 14:53

Also, the different versions of Blitz are one factor that can affect speed. ABS and QABS may give different results on AmiBlitz and Classic Blitz 2.1, because the code might be different, and some AmiBlitz commands use the FPU, although I don't know if ABS/QABS are among them.

But I only use Classic Blitz, and I tested ABS vs QABS on 4 different WinUAE configurations, using this code:

Code:

loop=0
Repeat
 a = RND (100)
 b = ABS (a)
 loop + 1
Until loop = 1000

And these were the results, with Cycle Exact ON:

A500, No Fast RAM
ABS : Frame 11, VPOS at 14
QABS : Frame 5, VPOS at 275

A500 + Fast RAM
ABS : Frame 9, VPOS at 200
QABS : Frame 4, VPOS at 300

A1200, No Fast RAM
ABS : Frame 4, VPOS at 250
QABS : Frame 2, VPOS at 275

A1200 + Fast RAM
ABS : Frame 3, VPOS at 130
QABS : Frame 2, VPOS at 50
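
To compare these readings more easily, the frame and VPOS pairs can be turned into a single raster-line count. This is a rough Python sketch under two assumptions that are not stated in the post: a PAL display with 313 lines per frame, and a frame counter that starts at 0. `LINES_PER_FRAME` and `total_lines` are hypothetical names.

```python
LINES_PER_FRAME = 313  # assumption: PAL, non-interlaced

# (frame, vpos) readings copied from the results above
RESULTS = {
    "A500, No Fast RAM":  {"ABS": (11, 14),  "QABS": (5, 275)},
    "A500 + Fast RAM":    {"ABS": (9, 200),  "QABS": (4, 300)},
    "A1200, No Fast RAM": {"ABS": (4, 250),  "QABS": (2, 275)},
    "A1200 + Fast RAM":   {"ABS": (3, 130),  "QABS": (2, 50)},
}

def total_lines(frame, vpos):
    """Elapsed raster lines for a (frame, vpos) reading."""
    return frame * LINES_PER_FRAME + vpos

for config, timing in RESULTS.items():
    abs_lines = total_lines(*timing["ABS"])
    qabs_lines = total_lines(*timing["QABS"])
    ratio = abs_lines / qabs_lines
    print(f"{config}: ABS {abs_lines} lines, QABS {qabs_lines} lines, ratio {ratio:.2f}")
```

Under those assumptions, ABS takes more raster lines than QABS on every configuration listed.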

---

I also tested E-Penguin's code:

Code:

ResetTimer
For i=0 To 9999
  a =  Rnd(500)
  b =  Rnd(500)
  c = Abs(a - b)
Next i

And got these results:

A1200, No Fast
ABS: 51 Ticks
QABS: 25 Ticks

A1200 + Fast RAM
ABS: 32 Ticks
QABS: 19 Ticks

A500, No Fast
ABS: 136 Ticks
QABS: 87 Ticks

So in all cases QABS was faster than ABS. The Blitz manual also says that because QABS handles only Quick variables, it improves the command's speed "quite dramatically", although it doesn't say how this speed increase happens.

So if you got results where the speeds of ABS and QABS are the same, then maybe this is the case on AmiBlitz only, but not on Classic Blitz 2.1?

E-Penguin · 26 June 2018, 17:01

I was using 2.1, but didn't have cycle exact on. Maybe it makes a difference in this case. I'll code up an ASM function per idrougge's link when I get a chance.

Niklas · 26 June 2018, 17:46

Quote:

Originally Posted by idrougge

Here is a branchless solution I found. It might not be any faster on a non-pipelined CPU, though. https://gist.github.com/cahirwpz/19c...f03025874530fc

That's a pretty clever solution. Still (as you point out) on a 68000 CPU the branching solution is quite a bit faster:

Code:

    move.l   d0,d1  ; 4
    add.l    d1,d1  ; 8
    subx.l   d1,d1  ; 8
    eor.l    d1,d0  ; 8
    sub.l    d1,d0  ; 8
                    ; =36 cycles

Code:

    tst.l    d0     ; 4
    bpl.b    done   ; 10 if taken, 8 if not
    neg.l    d0     ; 6
done:
                    ; =14 or 18 cycles, depending on the sign of the input value
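
To make it easier to follow what the first (branchless) sequence does, here is a step-by-step model of it in Python. This is a sketch, not 68k code: registers are modelled as 32-bit integers with explicit masking, and `branchless_abs32` is a hypothetical name. The cycle figures are the ones quoted in the listing above.

```python
MASK32 = 0xFFFFFFFF

def branchless_abs32(d0):
    """Model the five-instruction branchless sequence on a 32-bit register value."""
    d0 &= MASK32
    d1 = d0                      # move.l d0,d1
    x = (d1 >> 31) & 1           # add.l d1,d1 : the sign bit shifts out into X
    d1 = (d1 + d1) & MASK32
    d1 = (d1 - d1 - x) & MASK32  # subx.l d1,d1 : 0 if X clear, $FFFFFFFF if X set
    d0 ^= d1                     # eor.l d1,d0 : invert every bit when negative
    d0 = (d0 - d1) & MASK32      # sub.l d1,d0 : subtracting -1 adds 1
    return d0

# Per-instruction cycle counts as given in the listing above
CYCLES = [4, 8, 8, 8, 8]

print(branchless_abs32(-5), branchless_abs32(7), sum(CYCLES))
```

The trick is that d1 becomes an all-ones or all-zeros mask from the sign bit, so the EOR and SUB together perform a two's-complement negate only when the input was negative.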

E-Penguin · 27 June 2018, 09:55

Quote:

Originally Posted by Niklas

Code:

    tst.l    d0     ; 4
    bpl.b    done   ; 10 if taken, 8 if not
    neg.l    d0     ; 6
done:
                    ; =14 or 18 cycles, depending on the sign of the input value

I tried with the logic flipped (BMI rather than BPL) and it was slower than the built-in function. I'll give it a go with things that way round. Maybe it's the overhead of the statement call

clenched · 27 June 2018, 15:11

Quote:

Originally Posted by E-Penguin

I tried with the logic flipped (BMI rather than BPL) and it was slower than the built-in function. I'll give it a go with things that way round. Maybe it's the overhead of the statement call

What is happening is that the machine code part is actually running more BASIC statements than the other two. There are a few things to be done. Hopefully they are commented well enough in the snippet. Before and after timings were taken with the latest WinUAE, stock A1200, cycle-exact (CE).

Code:

 
ResetTimer
For i=0 To 9999
  ;switch order so D0 is loaded with last variable
  b =  Rnd(500) 
  a =  Rnd(500) 
  ;GetReg d0, a
  ;GetReg d1, b
  ; Here 2(a2)=a 4(a2)=b 6(a2)=c
  ; d0 is already loaded with a
  
  MOVE.w 4(a2),d1 ;b to d1
  SUB.w d1, d0
  BMI adIsNeg2  ;this part could be adjusted
  JMP ed
adIsNeg2:
  NEG.w d0
ed:
  ;PutReg d0, a
  MOVE.w d0,6(a2) ;d0 to c  - changed from a for consistency
Next i
NPrint "AbsDiff " , Ticks

Code:

 
before           after  
=========================
ABS 147          ABS 134
AbsDiff 151      AbsDiff 87
QABS 111         QABS 110
 
ABS 144          ABS 135
AbsDiff 136      AbsDiff 88
QABS 110         QABS 108
 
ABS 145          ABS 142
AbsDiff 137      AbsDiff 88
QABS 110         QABS 109
 
ABS 135          ABS 141
AbsDiff 137      AbsDiff 87
QABS 112         QABS 109
 
ABS 136          ABS 140
AbsDiff 136      AbsDiff 91
QABS 117         QABS 111
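
The five runs above can be averaged to make the change easier to see. A small Python sketch, with the tick values copied from the table (`BEFORE`, `AFTER` and `mean` are just illustrative names):

```python
# Tick counts copied from the before/after table, five runs each
BEFORE = {
    "ABS":     [147, 144, 145, 135, 136],
    "AbsDiff": [151, 136, 137, 137, 136],
    "QABS":    [111, 110, 110, 112, 117],
}
AFTER = {
    "ABS":     [134, 135, 142, 141, 140],
    "AbsDiff": [87, 88, 88, 87, 91],
    "QABS":    [110, 108, 109, 109, 111],
}

def mean(values):
    return sum(values) / len(values)

for name in BEFORE:
    print(f"{name}: before {mean(BEFORE[name]):.1f}, after {mean(AFTER[name]):.1f}")
```

On these figures AbsDiff drops from an average of about 139 ticks to about 88, while Abs and QAbs barely move, which fits the explanation that the GetReg/PutReg statements were the extra cost.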

E-Penguin · 27 June 2018, 17:00

Nice. Instructive about how the variables are mapped to the data registers too. Thanks

idrougge · 27 June 2018, 21:55

What is located at 0(A2)?

clenched · 27 June 2018, 23:12

Quote:

Originally Posted by idrougge

What is located at 0(A2)?

Offhand I would say that is i from the For/Next loop.
To check, splice in move.w (a2),$200 somewhere.
When the program finishes, $200 contains $270f (9999).

E-Penguin: replace the first two ML lines with this one for a slight reduction:
SUB.w 4(a2),d0
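
As a way of picturing the variable layout discussed here, the following Python sketch models the block of memory that a2 points at as four big-endian 16-bit words. The layout (i at 0, a at 2, b at 4, c at 6) is only what was observed in this thread for this particular program, not a documented guarantee, and `OFFSETS`, `read_word` and `write_word` are hypothetical helpers.

```python
import struct

# Assumed layout, as observed in this thread with DEFTYPE.w:
OFFSETS = {"i": 0, "a": 2, "b": 4, "c": 6}

def read_word(mem, offset):
    """Models MOVE.w offset(a2),dN : read a signed big-endian word."""
    return struct.unpack_from(">h", mem, offset)[0]

def write_word(mem, offset, value):
    """Models MOVE.w dN,offset(a2) : write a signed big-endian word."""
    struct.pack_into(">h", mem, offset, value)

mem = bytearray(8)                    # the variable block a2 points at
write_word(mem, OFFSETS["i"], 9999)   # loop counter on the final pass
write_word(mem, OFFSETS["a"], 120)
write_word(mem, OFFSETS["b"], 450)

d0 = read_word(mem, OFFSETS["a"])     # a is already in d0 after the assignment
d0 -= read_word(mem, OFFSETS["b"])    # SUB.w 4(a2),d0
if d0 < 0:                            # BMI ... NEG.w d0
    d0 = -d0
write_word(mem, OFFSETS["c"], d0)     # MOVE.w d0,6(a2)

print(mem.hex())
```

The first two bytes come out as 270f, matching the $270f (9999) that turns up at $200 in the check described above.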

E-Penguin · 27 June 2018, 23:51

I'm beginning to think that the art of 68k programming lies in the mastery of the various addressing modes.
