out of registers

meynaf · 26 August 2023, 10:19

If you like code that's totally out of registers, here's a nice (?) example.
Not only does it use all regs, it also uses 4 imaginary regs (d8,d9,a8,a9) because there aren't enough! And it could use even more.

The whole routine is much larger than that, but this is the critical part. Nonexistent registers are currently mapped to memory (ds.l).

It's working code that I'd like to clean up and optimize.
I can do it myself, but I'd like to see how others handle this kind of case.

If you really want to know, that code is the loop for remapping 2bpp bitmaps to other colors, computing the transparency plane in the process. It also supports inverting the data for selection rendering. A call to BltMaskBitMapRastPort follows (on a normal wb window's rp). It's for a GUI library I'm developing.

Code:

; a0-a1 = input planes (a0=a1 for 1bpp), a2=output, a3=transp., d8=0/-1 (sel'd render)
; a4=plane loop / color 1, a5=xloop/color 2, a6=color 3
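.yloop
 move.l a5,d5          ; x counter (high word) + color 2 bits
.xloop
 move.l (a0)+,d1       ; 32 pixels from the first input plane
 move.l (a1)+,d2       ; 32 pixels from the second input plane
 move.l d8,d0          ; 0, or -1 for selected rendering
 eor.l d0,d1           ; invert both planes when selected
 eor.l d0,d2
 move.l d1,d0
 or.l d2,d0            ; d0 = pixels set in either plane
 move.l d0,(a3)+       ; store them in the transparency plane
 move.l d1,d3
 and.l d2,d3           ; d3 = pixels set in both planes (color 3)
 move.l d1,d0
 not.l d0              ; d0 = not first plane
 or.l d2,d1
 eor.l d2,d1           ; d1 = first plane only (color 1)
 and.l d0,d2           ; d2 = second plane only (color 2)
 move.l a4,d4          ; plane counter (high word) + color 1 bits
 move.w a5,d5          ; reload color 2 bits, keep the x counter
 move.w a6,d6          ; reload color 3 bits, keep the y counter
.ploop
 lsr.b #1,d4           ; X = next bit of color 1
 subx.l d7,d7          ; d7 = 0 or -1 depending on X
 and.l d1,d7           ; color 1 pixels, if this plane has the bit set
 lsr.b #1,d5           ; same for color 2
 subx.l d0,d0
 and.l d2,d0
 or.l d0,d7
 lsr.b #1,d6           ; same for color 3
 subx.l d0,d0
 and.l d3,d0
 or.l d0,d7
 move.l d7,(a2)        ; write 32 pixels of this output plane
 add.l a8,a2           ; step to the next output plane
 subi.l #$10000,d4     ; count planes down in the high word
 bcc .ploop            ; until the counter borrows
 sub.l a9,a2           ; step back across the output planes
 subi.l #$10000,d5     ; count longwords down in the high word
 bcc .xloop
 add.l d9,a0           ; advance both inputs by d9 for the next row
 add.l d9,a1
 subi.l #$10000,d6     ; count rows down in the high word
 bcc .yloop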

Isn't it strange that such a small code block can use so many registers ?
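
For anyone following along, here is a small worked example of the decode step at the top of .xloop. It is only a sketch with made-up input: 8 pixels in the top byte, values 0 1 2 3 3 2 1 0 from left to right, assuming the plane read through a0 holds the low bit of each pixel and that d8 is 0 (no inversion).

```asm
; Illustrative values only, not taken from the real routine
 move.l #$5A000000,d1   ; first plane:  %01011010
 move.l #$3C000000,d2   ; second plane: %00111100
 move.l d1,d0
 or.l d2,d0             ; d0 = $7E000000, the transparency mask
 move.l d1,d3
 and.l d2,d3            ; d3 = $18000000, pixels of color 3
 move.l d1,d0
 not.l d0               ; d0 = $A5FFFFFF
 or.l d2,d1             ; d1 = $7E000000
 eor.l d2,d1            ; d1 = $42000000, pixels of color 1
 and.l d0,d2            ; d2 = $24000000, pixels of color 2
```

The three masks never overlap, which is why .ploop can simply OR the three masked colors together.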

jotd · 26 August 2023, 10:30

The subs on dx could use swap / sub 1 / swap.

Also, maybe load d9 and add it to the address regs with add.w.
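
One thing to watch with swap / sub / swap, sketched here for the d5 counter: swap clears C, so a bcc placed after the second swap would always branch. Swap does set N from bit 31 of the result, so bpl can stand in for bcc, assuming the counter always stays below $8000. This fragment is a sketch, not tested against the full routine.

```asm
; Would replace "subi.l #$10000,d5 / bcc .xloop"
 swap d5            ; counter (high word) comes down to the low word
 subq.w #1,d5       ; 0 wraps to $FFFF on the last pass
 swap d5            ; counter back up; C is cleared, N = bit 31
 bpl .xloop         ; loop while the counter has not gone negative
```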

Don_Adan · 26 August 2023, 12:15

At first look, you can reduce the size of this routine, but I don't know if it will be faster.
At second look, perhaps one register at most can be freed, but it needs extra swaps, which you don't like.
Then perhaps you must use the stack for the extra registers.
But I will check it more later.

Code:

 move.l d7,(a2)
 moveq #1,d7
 swap d7
 add.l a8,a2 ; a8
 sub.l d7,d4
 bcc .ploop
 sub.l a9,a2 ;a9
 sub.l d7,d5
 bcc .xloop
 add.l d9,a0 ;d9
 add.l d9,a1 ;d9
 sub.l d7,d6
 bcc .yloop

a/b · 26 August 2023, 14:14

Things you probably are aware of and are not a major concern here:
1. assuming there's a move.l a6,d6 just before .yloop label
2. -4c/2b: sub.l #1<<16,d4 to sub.w #1<<8,d4 (plus prep a4 as d8:d8) since 8 bits suffices for the bitplanes
3. -12c: convert 3x lsr.b to add.b by reordering the bits
4. 2x -4c: convert the other 2x sub.l #1<<16,dx to swap/subq.w/swap and pre-swap the ax regs
5. you are not using all registers, a7 is not used (assuming this runs 100% in user mode)

Now to the relevant part, using #5 you could do this:

stack: d8.l, a6.w, a9.l, (a7 is here), d9.l

y_in:

x_in:
lea (-10,a7),a7
d8.l => move.l (a7)+,d0
a6.w => move.w (a7)+,d6

pl:
a8.l => a6.l (innermost loop 100% reg based)

x_out:
a9.l => sub.l (a7)+,a2

y_out:
d9.l => move.l (a7),d0 | add.l d0,a0 | add.l d0,a1

Don_Adan · 26 August 2023, 14:26

Ok Phil, I have one idea for you. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order.
This register will look like:
D4=$33xx2211
and the code could look like this:

Code:

.ploop
add.b d4,d4
subx.l d7,d7
lsr.b #1,d4
and.l d1,d7
add.w d4,d4
subx.l d0,d0
lsr.w #1,d4
and.l d2,d0
or.l d0,d7
add.l d4,d4
subx.l d0,d0
and.l d3,d0
or.l d0,d7
move.l d7,(a2)

If there are no bugs of mine, perhaps it can work OK. My head doesn't work very well (I'm too old), but the code looks OK to me.
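
To see how the three fields advance, here is one pass traced with an illustrative starting value (the and/or lines are left out). The add.b/lsr.b and add.w/lsr.w pairs only clear the bit they consumed; it is the final add.l that shifts all three fields up by one for the next pass. The value $80004080 is made up for the trace.

```asm
; byte 3 = color 3, byte 1 = color 2, byte 0 = color 1, MSB first
 move.l #$80004080,d4
 add.b d4,d4        ; byte 0: $80 -> $00, X=1 (color 1 bit = 1)
 subx.l d7,d7       ; d7 = -1
 lsr.b #1,d4        ; byte 0 back in place, bit 7 cleared: d4 = $80004000
 add.w d4,d4        ; low word: $4000 -> $8000, X=0 (color 2 bit = 0)
 subx.l d0,d0       ; d0 = 0
 lsr.w #1,d4        ; low word back to $4000: d4 = $80004000
 add.l d4,d4        ; bit 31 out, X=1 (color 3 bit = 1): d4 = $00008000
 subx.l d0,d0       ; d0 = -1
; color 2's next bit now sits in bit 15, ready for the next pass
```

With at most 8 passes, whatever sits in the xx byte never reaches bit 31 in time to be read, so its content should not matter.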

Thomas Richter · 26 August 2023, 15:20

Quote:

Originally Posted by meynaf

If you like code that's totally out of registers, here's a nice (?) example.

That also happens in several low-level bit-fiddling operations in P96. Only look at the inner loops, place the rest on the stack. For example, the horizontal loop is typically critical, the vertical iteration does not matter too much - only the parts of the loop that are most often executed contribute.

Quote:

Originally Posted by meynaf

Not only does it use all regs, it also uses 4 imaginary regs (d8,d9,a8,a9) because there aren't enough! And it could use even more.

As this "d8" has only two values, the most sensible approach would be to have two separate functions, one for each value, and you get rid of a stack access in an inner loop.

a/b · 26 August 2023, 15:38

Quote:

Originally Posted by Don_Adan

Ok Phil, I have one idea for you. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order.
This register will look like:
D4=$33xx2211
and the code could look like this:

Code:

.ploop
add.b d4,d4
subx.l d7,d7
lsr.b #1,d4
and.l d1,d7
add.w d4,d4
subx.l d0,d0
lsr.w #1,d4
and.l d2,d0
or.l d0,d7
add.l d4,d4
subx.l d0,d0
and.l d3,d0
or.l d0,d7
move.l d7,(a2)

If there are no bugs of mine, perhaps it can work OK. My head doesn't work very well (I'm too old), but the code looks OK to me.

a4/d4 could be initialized to contain abcabc..abc0..0 and then use 3x add.[b|w|l] (depending on maximum number of bits there could be).
If there's enough room you could use 4 bits per iteration (abcXabcX... X = stop flag) and do add.[b|w|l] d4,d4 + bcc.b .ploop instead of sub.[w|l] #.
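
A rough sketch of what .ploop could look like with the 4-bits-per-plane packing, using add.l since up to 8 planes need all 32 bits. Assumptions: d4 holds the groups packed from bit 31 downwards, the stop flag is set only in the last group, d1/d2/d3 are the color masks from the original code, and a6 stands in for the imaginary a8 as the plane stride.

```asm
.ploop
 add.l d4,d4        ; X = a (color 1 bit for this plane)
 subx.l d7,d7
 and.l d1,d7
 add.l d4,d4        ; X = b (color 2 bit)
 subx.l d0,d0
 and.l d2,d0
 or.l d0,d7
 add.l d4,d4        ; X = c (color 3 bit)
 subx.l d0,d0
 and.l d3,d0
 or.l d0,d7
 move.l d7,(a2)
 adda.l a6,a2       ; adda leaves the flags alone
 add.l d4,d4        ; C = stop flag
 bcc.s .ploop       ; flag clear: another plane follows
```

The plane counter disappears entirely, so d4 no longer needs its high word for counting.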

meynaf · 26 August 2023, 16:03

Quote:

Originally Posted by a/b

1. assuming there's a move.l a6,d6 just before .yloop label

No, d6 is prepared directly. High word of a6 is currently unused.
This is because d6 loop does not need to be repeated, value of d6 at exit does not matter (it's the outermost loop).

Quote:

Originally Posted by a/b

2. -4c/2b: sub.l #1<<16,d4 to sub.w #1<<8,d4 (plus prep a4 as d8:d8) since 8 bits suffices for the bitplanes

Right.

Quote:

Originally Posted by a/b

3. -12c: convert 3x lsr.b to add.b by reordering the bits

Hmm.

Quote:

Originally Posted by a/b

4. 2x -4c: convert the other 2x sub.l #1<<16,dx to swap/subq.w/swap and pre-swap the ax regs

That would be 68000 only. For 020+ this is slower. I'd rather avoid that.

Quote:

Originally Posted by a/b

5. you are not using all registers, a7 is not used (assuming this runs 100% in user mode)

This is normal user code.

Quote:

Originally Posted by a/b

Now to the relevant part, using #5 you could do this:

stack: d8.l, a6.w, a9.l, (a7 is here), d9.l

y_in:

x_in:
lea (-10,a7),a7
d8.l => move.l (a7)+,d0
a6.w => move.w (a7)+,d6

pl:
a8.l => a6.l (innermost loop 100% reg based)

x_out:
a9.l => sub.l (a7)+,a2

y_out:
d9.l => move.l (a7),d0 | add.l d0,a0 | add.l d0,a1

I would prefer keeping the multitask running if possible...

Quote:

Originally Posted by Don_Adan

Ok Phil, I have one idea for you. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order.
This register will look like:
D4=$33xx2211

It can work.
Two register halves at the price of two extra shifts and a lot of bit fiddling in the prep part. Hmm...

Quote:

Originally Posted by Thomas Richter

Only look at the inner loops, place the rest on the stack. For example, the horizontal loop is typically critical, the vertical iteration does not matter too much - only the parts of the loop that are most often executed contribute.

I know.

Quote:

Originally Posted by Thomas Richter

As this "d8" has only two values, the most sensible approach would be to have two separate functions, one for each value, and you get rid of a stack access in an inner loop.

That would duplicate quite a lot of code. Let's keep that for later if no better solution is found.

Quote:

Originally Posted by a/b

a4/d4 could be initialized to contain abcabc..abc0..0 and then use 3x add.[b|w|l] (depending on maximum number of bits there could be).

Number of target bitplanes is that of the public screen the window has opened on, so anything from 1 to 8.
Max ends up 24 bits, so add.l. Complicated to init, but nice.

Thomas Richter · 26 August 2023, 16:34

Quote:

Originally Posted by meynaf

That would duplicate quite a lot of code. Let's keep that for later if no better solution is found.

In P96, the work is usually split in two parts, the general control logic and the actual executor - which is often generated by macros. Thus, while the binary contains duplicated code, the source code does not.
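
As an illustration of that idea applied to the d8 case, a minimal sketch in Devpac/vasm-style Motorola syntax. The macro name, its parameter and the two labels are hypothetical, and this is not how P96 does it; it only shows one source body expanding into two variants.

```asm
; \1 = 0 for normal rendering, 1 for selected (inverted) rendering
REMAP_LOAD MACRO
 move.l (a0)+,d1
 move.l (a1)+,d2
 IFNE \1
 not.l d1           ; same effect as eor.l with d8 = -1
 not.l d2
 ENDC
 ENDM

remap_normal
 REMAP_LOAD 0       ; expands to the two moves only
 ; ... rest of the loop body ...

remap_selected
 REMAP_LOAD 1       ; expands to the moves plus the two not.l
 ; ... rest of the loop body ...
```

The selected variant also loses the d8 load and one eor per plane compared with the generic code, at the price of a second copy of the loop in the binary.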

meynaf · 26 August 2023, 16:45

Quote:

Originally Posted by Thomas Richter

In P96, the work is usually split in two parts, the general control logic and the actual executor - which is often generated by macros. Thus, while the binary contains duplicated code, the source code does not.

Of course, but in my case the resulting code would be included in every program using it, so i'd like to keep it small.

Don_Adan · 26 August 2023, 16:49

If you use my idea, then you can replace:

Code:

 move.l d8,d0

with

Code:

 
 move.l a4,d0
 swap d0
 extb.l d0 ; or ext.w d0 and ext.l d0

Of course, you must prepare the A4 register and place the D8 value (0 or -1) in the unused space.

meynaf · 26 August 2023, 17:44

I'm now wondering if it wouldn't be easier to try to inflict an interleaved bitmap on BltMaskBitMapRastPort.
That would allow writing to (a2)+ and remove the need for a8/a9.
But would this function accept it ? Isn't it v39+ ?

Don_Adan · 26 August 2023, 17:45

I rethought the code a bit, and for me the optimal A4 input would be:

A4= $221133d8

Code:

.xloop
 move.l (a0)+,d1
 move.l (a1)+,d2
; move.l d8,d0

 move.l a4,d4
 add.b d4,d4
 subx.l d0,d0
 swap d4

 eor.l d0,d1
 eor.l d0,d2
 move.l d1,d0
 or.l d2,d0
 move.l d0,(a3)+
 move.l d1,d3
 and.l d2,d3
 move.l d1,d0
 not.l d0
 or.l d2,d1
 eor.l d2,d1
 and.l d0,d2
; move.l a4,d4
; move.w a5,d5
; move.w a6,d6
.ploop
 add.b d4,d4
 subx.l d7,d7
 lsr.b #1,d4
 and.l d1,d7
 add.w d4,d4
 subx.l d0,d0
 lsr.w #1,d4
 and.l d2,d0
 or.l d0,d7
 add.l d4,d4
 subx.l d0,d0
 and.l d3,d0
 or.l d0,d7
 move.l d7,(a2)

Same number of instructions between xloop and ploop.
2 instructions more for ploop, but faster (add vs lsr).
And now You have enough free registers, i think.

meynaf · 26 August 2023, 19:33

Quote:

Originally Posted by a/b

If there's enough room you could use 4 bits per iteration (abcXabcX... X = stop flag) and do add.[b|w|l] d4,d4 + bcc.b .ploop instead of sub.[w|l] #.

I missed that edit.

Nice trick (it would fit, worst case is 8 times abc), the only problem is now to prepare the data...
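
A possible way to prepare that data, as a sketch only. The register assignment is an assumption made for this example: d1.b/d2.b/d3.b hold the pens for colors 1/2/3, d0.w the screen depth (1 to 8), the result comes back in d4, and d0-d3/d7 are trashed.

```asm
 moveq #8,d7
 sub.w d0,d7        ; d7 = unused 4-bit groups (0..7)
 lsl.w #2,d7        ; times 4 = bits needed to left-justify
 subq.w #1,d0       ; dbf wants count-1
 moveq #0,d4
.prep
 lsr.b #1,d1        ; X = next bit of pen 1, lowest plane first
 addx.l d4,d4       ; shift it into d4
 lsr.b #1,d2        ; same for pen 2
 addx.l d4,d4
 lsr.b #1,d3        ; same for pen 3
 addx.l d4,d4
 add.l d4,d4        ; stop flag = 0 for now
 dbf d0,.prep
 addq.l #1,d4       ; set the stop flag of the last group
 lsl.l d7,d4        ; first plane's group ends up in the top nibble
; e.g. depth 2 with pens 1, 2, 3 gives d4 = $A7000000
```

This runs once per call rather than once per longword, so the extra precalc is paid outside the loops.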

Quote:

Originally Posted by Don_Adan

I rethought the code a bit, and for me the optimal A4 input would be:

A4= $221133d8

<snip>

Same number of instructions between xloop and ploop.
2 instructions more for ploop, but faster (add vs lsr).

That would be good.
And 2 instructions less if d6 is free.
Makes me hesitate, a/b's trick removes those 2 lsr but requires a lot more precalc.

Quote:

Originally Posted by Don_Adan

And now You have enough free registers, i think.

I'll never have enough.

Don_Adan · 26 August 2023, 19:37

Yes, I prefer the easiest methods; perhaps I would use a small table for bit reversing.

Karlos · 26 August 2023, 21:23

Quote:

Originally Posted by meynaf

I'll never have enough.

Sounds like you need to try a load/store machine. They tend to have plenty of user accessible registers. PPC has 32


meynaf · 26 August 2023, 21:33

Quote:

Originally Posted by Karlos

Sounds like you need to try a load/store machine. They tend to have plenty of user accessible registers. PPC has 32


I prefer to have 16 and a coder-friendly asm language.

Thomas Richter · 26 August 2023, 21:45

Quote:

Originally Posted by meynaf

I'm now wondering if it wouldn't be easier to try to inflict an interleaved bitmap on BltMaskBitMapRastPort.

BltMaskBitMapRastPort() is historically a rather slow function because it uses two blits, not one. The P96 version uses a single-pass CPU driven function.

Quote:

Originally Posted by meynaf

But would this function accept it ? Isn't it v39+ ?

Interleaved destinations work I believe from v39 onwards. Interleaved sources from v45 onwards (they are broken in v39 and v40).

Karlos · 26 August 2023, 21:49

Quote:

Originally Posted by meynaf

I prefer to have 16 and a coder-friendly asm language.

Hmm. Some sort of 68K style programming model but with completely general registers. Maybe expanded out to 64 bit.

meynaf · 26 August 2023, 22:38

Quote:

Originally Posted by Thomas Richter

BltMaskBitMapRastPort() is historically a rather slow function because it uses two blits, not one. The P96 version uses a single-pass CPU driven function.

It is apparently the only blit function able to do transparency, so I have little choice but to use it.

Quote:

Originally Posted by Thomas Richter

Interleaved destinations work I believe from v39 onwards. Interleaved sources from v45 onwards (they are broken in v39 and v40).

That's what i feared.

Quote:

Originally Posted by Karlos

Hmm. Some sort of 68K style programming model but with completely general registers. Maybe expanded out to 64 bit.

My VM doesn't struggle with that routine and it has 16 32-bit registers.
But ok, it has a few more facilities for using the high parts of them...



