68000 code optimisations - Page 12

ross · 09 June 2022, 00:52

Quote:

Originally Posted by a/b

How about this (020+, as you mentioned)?

This bfins version is nice.
Now someone needs to time it on a real machine versus the generic one.

Quote:

Originally Posted by phx

EDIT: Wow... ross and me posted in the same minute again. How likely is that?

phx · 09 June 2022, 00:59

Quote:

Originally Posted by a/b

How about this (020+, as you mentioned)?

Interesting. I wonder if something like that is also possible for more than 32 bit shifts?
Maybe jotd wants to replace LSR by ASR, unless he wants an unsigned shift.

ross · 09 June 2022, 01:04

EDIT: removed the nocturnal nonsense about temporal coincidences

Quote:

Originally Posted by phx

Interesting. I wonder if something like that is also possible for more than 32 bit shifts?

Yes, but inserting D1:D0 in memory --> slow...

So this is better:

Code:

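        moveq   #32,d2
        sub.l   d3,d2           ; d2 = 32 - n
        bgt.b   .1              ; n < 32: merge with bfins
        move.l  d1,d0           ; n >= 32: low result comes from d1
        neg.l   d2              ; d2 = n - 32
        moveq   #0,d1
        lsr.l   d2,d0
        rts
.1:     bfins   d1,d0{d2:d3}    ; low n bits of d1 -> low n bits of d0
        rol.l   d2,d0           ; rotate them up to the top of d0
        lsr.l   d3,d1
        rts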

Provided that in fact bfins is faster than the 3 separate instructions ..
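
For anyone who wants to check the logic before timing it, here is a rough Python model of the routine above. It is only a sketch: the helper names (`bfins`, `rol32`, `lsr64`) are made up for illustration, registers are modelled as unsigned 32-bit integers, and shift counts are assumed to be 1..63 (a count of 0 is left out because bitfield widths of 0 are special-cased by the CPU).

```python
MASK32 = 0xFFFFFFFF

def bfins(src, dst, offset, width):
    """Insert the low `width` bits of src into dst; offset counts from the MSB.

    Assumes 1 <= width and offset + width <= 32 (no wrap-around).
    """
    shift = 32 - offset - width
    field_mask = ((1 << width) - 1) << shift
    return (dst & ~field_mask & MASK32) | ((src << shift) & field_mask)

def rol32(value, count):
    """32-bit rotate left."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32

def lsr64(d1, d0, n):
    """Logical right shift of the pair D1:D0 by n (1..63). Returns (d1, d0)."""
    d2 = 32 - n                        # moveq #32,d2 / sub.l d3,d2
    if d2 > 0:                         # bgt.b .1
        d0 = bfins(d1, d0, d2, n)      # low n bits of d1 replace low n bits of d0
        d0 = rol32(d0, d2)             # rotate them up to the top of d0
        d1 >>= n                       # lsr.l d3,d1
        return d1, d0
    d0 = d1 >> (-d2)                   # move.l d1,d0 / neg.l d2 / lsr.l d2,d0
    return 0, d0                       # moveq #0,d1
```

Comparing `lsr64` against a plain 64-bit shift for every count is a cheap way to gain confidence in the instruction sequence, but it says nothing about speed.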

meynaf · 09 June 2022, 09:02

You don't necessarily need to do 32-n.
What about :

Code:

 moveq #-1,d2
 lsl.l d3,d2      ; d2 = mask with the low n bits clear
 eor.l d1,d0
 and.l d2,d0
 eor.l d1,d0      ; d0 = high bits of d0, low n bits of d1
 ror.l d3,d0
 asr.l d3,d1
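
The EOR/AND/EOR merge can be sketched in Python as follows. This is an illustration only: the function names are invented, registers are unsigned 32-bit integers, and the shift count is assumed to be 0..31, which is the range the sequence appears to be written for.

```python
MASK32 = 0xFFFFFFFF

def ror32(value, count):
    """32-bit rotate right."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & MASK32

def asr32(value, count):
    """32-bit arithmetic shift right (bit 31 is replicated)."""
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> count) & MASK32

def asr64(d1, d0, n):
    """Arithmetic right shift of the pair D1:D0 by n (0..31). Returns (d1, d0)."""
    d2 = (MASK32 << n) & MASK32   # moveq #-1,d2 / lsl.l d3,d2
    d0 ^= d1                      # eor.l d1,d0
    d0 &= d2                      # and.l d2,d0
    d0 ^= d1                      # eor.l d1,d0: where the mask is 0, d1's bit wins
    d0 = ror32(d0, n)             # ror.l d3,d0
    d1 = asr32(d1, n)             # asr.l d3,d1
    return d1, d0
```

The identity doing the work is `((a ^ b) & m) ^ b`, which takes bits from `a` where `m` is 1 and from `b` where `m` is 0, so no `32-n` is ever computed.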

ross · 09 June 2022, 09:09

Quote:

Originally Posted by meynaf

You don't necessarily need to do 32-n.

Nice, I like the EOR tricks.

Speed is probably the same.

dansalvato · 02 August 2022, 04:27

I'm pretty new and have a fear of being outclassed by the veterans, but I wanted to share this optimized sin/cos function I came up with: LINK

It takes your byte- or word-length angle and returns a word-length sin in d0, and cos in d1. It returns values from -256 (0xff00) to 255 (0x00ff). At 6 instructions, it's highly suitable for inline ASM to avoid the overhead of a subroutine. It takes 44 cycles and 10 memory reads. The lookup table is 256 bytes.

Code:

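; This snippet assumes the lookup table pointer is in a0
moveq     #64,d1
add.b     d0,d1           ; d1 = angle + 64 (cos is sin a quarter turn ahead)
ext.w     d1              ; sign-extend both indices
ext.w     d0
move.b    (a0,d1.w),d1    ; the table byte replaces the low byte only
move.b    (a0,d0.w),d0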

And if you don't need cos, it's only 2 instructions:

Code:

ext.w     d0
move.b    (a0,d0.w),d0

The "trick" is less in the code above and more in the lookup table, and how it leverages signed values. Your input value is sign-extended so that angles 128 to 255 are negative (-128 to -1). Sine values are also negative in this range, so the high byte of your sign-extended input value is also used for the resulting sin/cos. So you can think of the lookup table as being word-length signed values, but the upper byte has been discarded from each entry, because it's provided by your input. The lookup table pointer is actually the center of the table, rather than the top, so that your signed index gets the correct data in either direction. As a bonus, the sign extension doubles as a convenient way of clearing the upper byte of your angle, something you'd otherwise have to do manually to avoid indexing out of bounds.

The tradeoff here is that you might want your resulting data to be in a different or more precise range than this provides, such as -0x8000 to 0x7fff. It kind of depends on how you need to shift/multiply the data after you retrieve it. I find this works especially well if you're using it for objects whose positions have 4- or 8-bit fractional values.
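
To make the table layout concrete, here is a small Python sketch that builds such a table and models the lookup. The exact rounding and clamping used in the original table are not given in the post, so the choices below (round to nearest, clamp the positive half to 0..255 and the negative half to -256..-1) are assumptions, and all function names are invented.

```python
import math

def build_table():
    """256 bytes; table[i + 128] holds the entry for signed index i (-128..127)."""
    table = bytearray(256)
    for i in range(-128, 128):
        value = round(256 * math.sin(2 * math.pi * i / 256))
        if i >= 0:
            value = max(0, min(255, value))     # high byte will be $00
        else:
            value = max(-256, min(-1, value))   # high byte will be $FF
        table[i + 128] = value & 0xFF           # keep only the low byte
    return table

def sext8(value):
    """ext.w on a byte: 128..255 become -128..-1."""
    value &= 0xFF
    return value - 256 if value & 0x80 else value

def lookup(table, angle):
    """Models ext.w d0 / move.b (a0,d0.w),d0 with a0 at the table's center."""
    index = sext8(angle)
    high = 0xFF00 if index < 0 else 0x0000      # left behind by the sign extension
    word = high | table[index + 128]
    return word - 0x10000 if word & 0x8000 else word

def sincos(table, angle):
    """Returns (sin, cos); cos is sin a quarter turn (64 steps) ahead."""
    return lookup(table, angle), lookup(table, angle + 64)
```

With these assumptions the result never strays more than 1 from `256*sin`, the worst cases being the clamped peak (255 instead of 256) and the zero crossing on the negative side (-1 instead of 0).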

paraj · 02 August 2022, 07:49

If you extend the table to cover 90 degrees more (repeat the 64 first entries at the end) you can do the double lookup in 32 cycles (and 7 memory accesses):

Code:

    ext.w     d0
    move.b    64(a0,d0.w),d1
    move.b    (a0,d0.w),d0

Doesn't extend as nicely to larger tables (or word sized values), but the idea of having an extra pi/2 values at the end (or start) of a sin/cos table can often be used for a slight speed-up at the cost of extra memory usage.
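
A generic sketch of the extended-table idea in Python, using a plain table of full signed values rather than the byte-plus-sign scheme discussed above. The table size, the amplitude of 127 and the names are arbitrary choices for illustration.

```python
import math

SIZE = 256              # entries per full turn
QUARTER = SIZE // 4     # 90 degrees

# One full period plus an extra quarter period at the end.
table = [round(127 * math.sin(2 * math.pi * i / SIZE)) for i in range(SIZE + QUARTER)]

def sincos(angle):
    """Both values come from one index; cos just reads QUARTER entries further on."""
    a = angle % SIZE
    return table[a], table[a + QUARTER]
```

The extra 64 entries are what remove the need to wrap `a + QUARTER` back into range.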

dansalvato · 02 August 2022, 09:16

Quote:

Originally Posted by paraj

If you extend the table to cover 90 degrees more (repeat the 64 first entries at the end) you can do the double lookup in 32 cycles (and 7 memory accesses):

It doesn't work for this specific implementation, because the sign of the angle is used as the sign of the resulting value, which is how I only retrieve a byte but have a range of -256 to 255. So, d1 needs to be separately sign-extended before fetching the result.

Even if I traded a bit of precision and made the table values signed bytes (-128 to 127), I'd have to sign-extend the result before I can use it in further calculations, so there's no gain. Another option is giving the table full word-sized values, but then there's the overhead of clearing the upper byte from your angle and doubling it to fetch from the table—plus, the cos value is then at an inconvenient displacement of 128.
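
The objection can be checked with a few lines of Python: count the angles for which the sign byte of the angle differs from the sign byte of angle + 64. The helper name is made up; it only models what `ext.w` leaves in the high byte.

```python
def sign_byte(angle):
    """High byte that ext.w produces for a byte-sized angle."""
    return 0xFF if (angle & 0xFF) >= 128 else 0x00

# Angles where a shared sign byte would give the cos value the wrong upper byte.
mismatches = [a for a in range(256) if sign_byte(a) != sign_byte(a + 64)]
```

Half of all angles (64..127 and 192..255) end up in `mismatches`, which is why the cos index needs its own sign extension in this scheme.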

paraj · 02 August 2022, 17:21

D'oh, of course you're right. Brain fart on my side. Apologies.

Photon · 07 August 2022, 02:52

@dansalvato, it's a bad feeling to feel that way ("great code has already been written"), but also a good challenge, because you could be great: not all code has been written. And it's a good feeling that someone cares about things like this nowadays, when one line of code gets translated to a kilobyte of code with megabytes and more of dependencies to even run at all. Pure binary math is very fun.

Current size record for sine calc is 32b (Raylight/PWL recently beat my Bhaskara-based 36b) and you can easily make an interpolating one in 50b (2008).

A nice goal is func(angle, ampl) real-time in a few cycles within a not too bad aberration.
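
For readers who haven't met it, Bhaskara I's approximation can be sketched in floating-point Python as below. This is only the textbook formula, not the 36-byte routine mentioned above, which is not shown in the thread.

```python
import math

def bhaskara_sin(x):
    """Bhaskara I's rational approximation of sin(x), for 0 <= x <= pi (radians)."""
    p = x * (math.pi - x)
    return 16 * p / (5 * math.pi ** 2 - 4 * p)

# Worst absolute error over a fine grid of the half period.
max_error = max(
    abs(bhaskara_sin(math.pi * i / 1000) - math.sin(math.pi * i / 1000))
    for i in range(1001)
)
```

The formula is exact at 0, pi/6, pi/2, 5*pi/6 and pi, and needs only a multiply and a divide per value, which is what makes it attractive for tiny table generators.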

a/b · 07 August 2022, 03:24

Quote:

Originally Posted by Photon

Current size record for sine calc is 32b (Raylight/PWL recently beat my Bhaskara-based 36b) and you can easily make an interpolating one in 50b (2008).

What are the sine parameters and error constraints? We did a compo a while ago here on EAB with: 1024 entries, 16384 amplitude, max. 5% error.
My 2nd order parabolic is 24 bytes, or 38 bytes very accurate (<1% error): http://eab.abime.net/showthread.php?t=106304

Photon · 07 August 2022, 03:31

@dansalvato, you see?

@a/b cool, what's the aberration for the 24b one?

a/b · 07 August 2022, 04:34

Jobbo did some number crunching and put all the data into this spreadsheet: https://docs.google.com/spreadsheets...it?usp=sharing ...
Column D is a reference sine, my stuff is in V (24, lower acc) and Z (38, higher acc).
The lower accuracy version is probably not accurate enough, the higher accuracy version is probably too accurate :P as we also wanted to break the 1% threshold (so a better size vs. accuracy trade off is possible, reason why I asked about error constraints).

NorthWay · 22 February 2023, 14:45

While working on a log2(pow2?) function I realized that the bitfield instructions can be used to test and clear a register in one instruction:

Code:

bfclr d0{#0:#32}
bne NotClear

The alternative would be to store, clear and re-test the value (might be faster in some cases, but I like the compactness).
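
What the two instructions do can be modelled in Python like this. It is a sketch with an invented function name: the register is an unsigned 32-bit integer and only the Z and N flags are modelled.

```python
def bfclr_full(d0):
    """Model of bfclr d0{#0:#32}: flags reflect the old field, then it is cleared.

    Returns (new_d0, z_flag, n_flag).
    """
    old = d0 & 0xFFFFFFFF
    z_flag = old == 0                  # Z set if the field was all zeroes
    n_flag = bool(old & 0x80000000)    # N is the field's most significant bit
    return 0, z_flag, n_flag

new_d0, z_flag, n_flag = bfclr_full(0x00001234)
branch_taken = not z_flag              # bne NotClear
```

So `bne` is taken exactly when the register held something non-zero before it was wiped.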

My NextPowerOf2 function ended up being 6 instructions (valid for values 2 -> 2^31) or 6 non-branching instructions (valid for values 2 -> 2^31-1). (If the answer to 1 as input is 1 then that can be considered working too.)

hooverphonique · 22 February 2023, 16:20

Quote:

Originally Posted by NorthWay

While working on a log2(pow2?) function I realized that the bitfield instructions can be used to test and clear a register in one instruction:

I just realized the bset/bchg/bclr instructions also test before modifying the destination

phx · 22 February 2023, 17:37

Quote:

Originally Posted by hooverphonique

I just realized the bset/bchg/bclr instructions also test before modifying the destination

Better late than never...

NorthWay · 02 March 2023, 11:25

If you want a conditional rotate by 1 and prefer it to be branch free then sCC is your friend:

Code:

(needs already cleared upper 3 bytes of d0) (no it doesn't)
sne d0
rol.l d0,d1

Though 68000 might be doing all the rotates and not only 5 bits worth of it?

meynaf · 02 March 2023, 12:16

Quote:

Originally Posted by NorthWay

(needs already cleared upper 3 bytes of d0)

It does not. Upper bytes are unimportant here.

Quote:

Originally Posted by NorthWay

Though 68000 might be doing all the rotates and not only 5 bits worth of it?

Even 68000 should do only 31 rotates. However these 31 shifts are gonna take very long, much longer than what a branch would.
Actually, on 020-030 the above isn't faster than a branch either.

a/b · 02 March 2023, 19:56

It's only the low 6 bits that matter (modulo 64, so 0-63 shifts/rotates). And yes, it actually does 63 operations if the low 6 bits are all set to 1.
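
Putting the pieces of this exchange together, a Python sketch of the `sne`/`rol.l` trick might look like this. The names are invented, and the model only reproduces the result, not the cycle cost of the 63 internal steps.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count_register):
    """rol.l Dx,Dy: only the low 6 bits of the count register are used."""
    count = count_register & 63
    effective = count % 32             # a 32-bit rotate repeats every 32 steps
    value &= MASK32
    return ((value << effective) | (value >> (32 - effective))) & MASK32

def conditional_rotate(d0, d1, condition):
    """sne d0 / rol.l d0,d1, with `condition` standing in for the NE test."""
    # sne only writes the low byte: $FF if the condition holds, $00 otherwise.
    d0 = (d0 & 0xFFFFFF00) | (0xFF if condition else 0x00)
    return rol32(d1, d0)
```

With the low byte at `$FF` the count is 63, which lands in the same place as a rotate left by 31, i.e. a rotate right by one bit. The upper three bytes of `d0` never enter into it.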

remz · 04 March 2023, 15:59

Curiosity: Why were 6 bits allocated to shift&rotate instructions instead of 5?
Is there any practicality to rotate more than 32?

