English Amiga Board


Old 26 August 2023, 10:19   #1
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
out of registers

If you like code that's totally out of registers, here's a nice (?) example.
Not only does it use all regs, it also uses 4 imaginary regs (d8, d9, a8, a9) because there aren't enough! And it could use even more.

The whole routine is much larger than that, but this is the critical part. Nonexistent registers are currently mapped to memory (ds.l).

It's code that works and that I wish to clean up and optimize.
I can do it myself but i'd like to see how others handle this kind of case.

If you really want to know, that code is the loop for remapping 2bpp bitmaps to other colors, computing transparency plane in the process. It also supports inverting the data for selection rendering. A call to BltMaskBitMapRastPort follows (on a normal wb window's rp). It's for a GUI library i'm developing.

Code:
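; a0-a1 = input planes (a0=a1 for 1bpp), a2=output, a3=transp., d8=0/-1 (sel'd render)
; a4=plane loop / color 1, a5=xloop/color 2, a6=color 3
.yloop
 move.l a5,d5           ; reload x counter (high word) and color 2 (low word)
.xloop
 move.l (a0)+,d1        ; d1 = 32 pixels of input plane 0
 move.l (a1)+,d2        ; d2 = 32 pixels of input plane 1
 move.l d8,d0           ; d0 = 0 or -1 (selected rendering)
 eor.l d0,d1            ; invert both planes when selected
 eor.l d0,d2
 move.l d1,d0
 or.l d2,d0             ; d0 = pixels with a non-zero color
 move.l d0,(a3)+        ; store transparency mask
 move.l d1,d3
 and.l d2,d3            ; d3 = pixels of color 3 (both planes set)
 move.l d1,d0
 not.l d0               ; d0 = not plane 0
 or.l d2,d1
 eor.l d2,d1            ; d1 = pixels of color 1 (plane 0 and not plane 1)
 and.l d0,d2            ; d2 = pixels of color 2 (plane 1 and not plane 0)
 move.l a4,d4           ; reload plane counter and color 1
 move.w a5,d5           ; reload color 2, keep x counter
 move.w a6,d6           ; reload color 3, keep y counter
.ploop
 lsr.b #1,d4            ; X = next bit of color 1
 subx.l d7,d7           ; d7 = 0 or -1
 and.l d1,d7            ; keep color 1 pixels if that bit is set
 lsr.b #1,d5            ; same for color 2
 subx.l d0,d0
 and.l d2,d0
 or.l d0,d7
 lsr.b #1,d6            ; same for color 3
 subx.l d0,d0
 and.l d3,d0
 or.l d0,d7
 move.l d7,(a2)         ; write one output plane longword
 add.l a8,a2            ; advance to the next output plane
 subi.l #$10000,d4      ; plane counter lives in the high word
 bcc .ploop
 sub.l a9,a2            ; step back to the first plane, at the next longword
 subi.l #$10000,d5      ; x counter lives in the high word
 bcc .xloop
 add.l d9,a0            ; add per-row offset to both sources
 add.l d9,a1
 subi.l #$10000,d6      ; y counter lives in the high word
 bcc .yloop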
Isn't it strange that such a small code block can use so many registers?
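For readers who find the register juggling hard to follow, here is a minimal C sketch of what one .xloop pass appears to compute for a single longword (32 pixels). It is a reference model for checking results, not meynaf's implementation: the names remap_word and RemapWord are invented here, and the loop counters, strides and the imaginary registers a8/a9/d9 are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Result for one 32-pixel longword: transparency mask plus up to 8 planes. */
typedef struct {
    uint32_t mask;
    uint32_t plane[8];
} RemapWord;

/* p0, p1 : longwords from the two input planes (pass p0 twice for 1bpp)
   sel    : 0 or 0xFFFFFFFF, the "d8" value of the original code
   c1..c3 : target pens for source colors 1..3 (hypothetical parameter names)
   depth  : number of output planes, 1..8 */
static RemapWord remap_word(uint32_t p0, uint32_t p1, uint32_t sel,
                            uint8_t c1, uint8_t c2, uint8_t c3, int depth)
{
    RemapWord r = {0};
    assert(depth >= 1 && depth <= 8);

    p0 ^= sel;                    /* eor.l d0,d1 */
    p1 ^= sel;                    /* eor.l d0,d2 */
    r.mask = p0 | p1;             /* written through (a3)+ */

    uint32_t m1 = p0 & ~p1;       /* pixels of source color 1 */
    uint32_t m2 = p1 & ~p0;       /* pixels of source color 2 */
    uint32_t m3 = p0 & p1;        /* pixels of source color 3 */

    for (int p = 0; p < depth; p++) {
        uint32_t out = 0;
        if ((c1 >> p) & 1u) out |= m1;   /* lsr.b / subx.l / and.l */
        if ((c2 >> p) & 1u) out |= m2;
        if ((c3 >> p) & 1u) out |= m3;
        r.plane[p] = out;
    }
    return r;
}
```

Calling remap_word(0xA, 0xC, 0, 1, 2, 5, 3) describes four pixels of colors 3, 2, 1 and 0 (from bit 3 down to bit 0) remapped to pens 5, 2 and 1 on a 3-plane screen.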
Old 26 August 2023, 10:30   #2
jotd
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,376
The subs on the dx registers could use swap / subq #1 / swap instead.

Also, maybe load d9 once and add.w it to the address regs.

Last edited by jotd; 26 August 2023 at 11:06.
Old 26 August 2023, 12:15   #3
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
At first glance, you can reduce the size of this routine, but I don't know if it will be faster.
On second look, perhaps one register at most can be freed, but it needs extra swaps, which you don't like.
Then perhaps you must use the stack for the extra registers.
I will check it more later.

Code:
 move.l d7,(a2)
 moveq #1,d7
 swap d7
 add.l a8,a2 ; a8
 sub.l d7,d4
 bcc .ploop
 sub.l a9,a2 ;a9
 sub.l d7,d5
 bcc .xloop
 add.l d9,a0 ;d9
 add.l d9,a1 ;d9
 sub.l d7,d6
 bcc .yloop
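All three loops rely on the same trick: the loop count sits in the high word of a register whose low word carries data, and subtracting $10000 borrows exactly when the high word was zero. A small C model of that behavior follows, as a sketch only; pack_counter, sub_high_word and loop_iterations are names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a loop count (1..65536) into the high word, data into the low word.
   The high word holds count-1 because the loop ends on the borrow. */
static uint32_t pack_counter(unsigned count, uint16_t low)
{
    assert(count >= 1 && count <= 0x10000u);
    return ((uint32_t)(count - 1) << 16) | low;
}

/* Models "sub.l #$10000,dn" (or "sub.l d7,dn" with d7 = $10000).
   Returns 1 when the subtraction borrows, i.e. when "bcc" falls through. */
static int sub_high_word(uint32_t *reg)
{
    int borrow = *reg < 0x10000u;
    *reg -= 0x10000u;             /* the low word is never touched */
    return borrow;
}

/* Number of times a "body / sub / bcc" loop runs for a given register. */
static unsigned loop_iterations(uint32_t reg)
{
    unsigned n = 0;
    do {
        n++;                      /* loop body */
    } while (!sub_high_word(&reg));
    return n;
}
```

The low word survives every decrement, which is why the original code only needs move.w to refresh the color bits while the counter keeps running.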
Old 26 August 2023, 14:14   #4
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,066
Things you probably are aware of and are not a major concern here:
1. assuming there's a move.l a6,d6 just before .yloop label
2. -4c/2b: sub.l #1<<16,d4 to sub.w #1<<8,d4 (plus prep a4 as d8:d8) since 8 bits suffices for the bitplanes
3. -12c: convert 3x lsr.b to add.b by reordering the bits
4. 2x -4c: convert the other 2x sub.l #1<<16,dx to swap/subq.w/swap and pre-swap the ax regs
5. you are not using all registers, a7 is not used (assuming this runs 100% in user mode)

Now to the relevant part, using #5 you could do this:

stack: d8.l, a6.w, a9.l, (a7 is here), d9.l

y_in:

x_in:
lea (-10,a7),a7
d8.l => move.l (a7)+,d0
a6.w => move.w (a7)+,d6

pl:
a8.l => a6.l (innermost loop 100% reg based)

x_out:
a9.l => sub.l (a7)+,a2

y_out:
d9.l => move.l (a7),d0 | add.l d0,a0 | add.l d0,a1
Old 26 August 2023, 14:26   #5
Don_Adan
OK Phil, I have one idea for you. Perhaps you can free 2 registers.
You must put all colors in one data register, in reversed bit order.
This register will look like:
D4=$33xx2211
and the code can look like this:

Code:
.ploop
add.b d4,d4
subx.l d7,d7
lsr.b #1,d4
and.l d1,d7
add.w d4,d4
subx.l d0,d0
lsr.w #1,d4
and.l d2,d0
or.l d0,d7
add.l d4,d4
subx.l d0,d0
and.l d3,d0
or.l d0,d7
move.l d7,(a2)
If I made no mistakes, it should work. My head doesn't work very well (I'm too old), but the code looks OK to me.
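To check the idea on paper, here is a C sketch of the D4=$33xx2211 layout and of one .ploop pass over it. The function names are invented for this illustration, and the xx byte is assumed to be zero. Each add/lsr pair is modeled by its net effect: the top bit of the byte or word is delivered in X and then cleared, and the final add.l shifts the whole register left by one.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of one byte, so plane 0 ends up in bit 7. */
static uint8_t reverse8(uint8_t b)
{
    b = (uint8_t)((b & 0xF0u) >> 4 | (b & 0x0Fu) << 4);
    b = (uint8_t)((b & 0xCCu) >> 2 | (b & 0x33u) << 2);
    b = (uint8_t)((b & 0xAAu) >> 1 | (b & 0x55u) << 1);
    return b;
}

/* Build D4 = $33xx2211 with xx = 0 (colors 1..3 in reversed bit order). */
static uint32_t pack_colors(uint8_t c1, uint8_t c2, uint8_t c3)
{
    return (uint32_t)reverse8(c3) << 24 |
           (uint32_t)reverse8(c2) << 8  |
           (uint32_t)reverse8(c1);
}

/* One .ploop pass: bit[0..2] receive the current plane bit of colors 1..3. */
static void next_plane_bits(uint32_t *d4, unsigned bit[3])
{
    uint32_t v = *d4;
    assert(d4 != 0 && bit != 0);

    bit[0] = (v >> 7) & 1u;    /* add.b d4,d4 : X = bit 7            */
    v &= 0xFFFFFF7Fu;          /* add.b + lsr.b #1 : bit 7 cleared   */
    bit[1] = (v >> 15) & 1u;   /* add.w d4,d4 : X = bit 15           */
    v &= 0xFFFF7FFFu;          /* add.w + lsr.w #1 : bit 15 cleared  */
    bit[2] = v >> 31;          /* add.l d4,d4 : X = bit 31           */
    v <<= 1;                   /* all three fields move up one bit   */
    *d4 = v;
}
```

Because the top bits of the two lower fields are cleared before the next wider shift, only zeros leak from one field into the next, so eight passes return the eight plane bits of each color in order.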
Old 26 August 2023, 15:20   #6
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,317
Quote:
Originally Posted by meynaf View Post
If you like code that's totally out of registers, here's a nice (?) example.
That also happens in several low-level bit-fiddling operations in P96. Only look at the inner loops, place the rest on the stack. For example, the horizontal loop is typically critical, the vertical iteration does not matter too much - only the parts of the loop that are most often executed contribute.



Quote:
Originally Posted by meynaf View Post
Not only it uses all regs, but it also uses 4 imaginary regs (d8,d9,a8,a9) because there aren't enough ! And it would have the use for more.
As this "d8" has only two values, the most sensible approach would be to have two separate functions, one for each value, and you get rid of a stack access in an inner loop.
Old 26 August 2023, 15:38   #7
a/b
Quote:
Originally Posted by Don_Adan View Post
Ok, Phil i have one idea for You. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order.
This register will be looks as:
D4=$33xx2211
and code can be looks next:

Code:
.ploop
add.b d4,d4
subx.l d7,d7
lsr.b #1,d4
and.l d1,d7
add.w d4,d4
subx.l d0,d0
lsr.w #1,d4
and.l d2,d0
or.l d0,d7
add.l d4,d4
subx.l d0,d0
and.l d3,d0
or.l d0,d7
move.l d7,(a2)
If no my bugs, perhaps it can works Ok. My head dont works very good (I'm too old), but code looks Ok for me.
a4/d4 could be initialized to contain abcabc..abc0..0 and then use 3x add.[b|w|l] (depending on maximum number of bits there could be).
If there's enough room you could use 4 bits per iteration (abcXabcX... X = stop flag) and do add.blw d4,d4 + bcc.b .ploop instead of sub.wl #.
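The abcX layout with a stop flag can be prepared with a short loop. The sketch below shows one possible packing and the matching unpacking in C; pack_abcx and unpack_abcx are hypothetical names, and the choice of putting plane 0 in the topmost nibble is an assumption that follows from shifting left with add.l.

```c
#include <assert.h>
#include <stdint.h>

/* Pack three pens as abcX nibbles, plane 0 in the top nibble.
   a, b, c = plane bit of colors 1, 2, 3; X = 1 on the last plane only. */
static uint32_t pack_abcx(uint8_t c1, uint8_t c2, uint8_t c3, int depth)
{
    uint32_t v = 0;
    assert(depth >= 1 && depth <= 8);

    for (int p = 0; p < depth; p++) {
        uint32_t nib = ((uint32_t)((c1 >> p) & 1u) << 3) |
                       ((uint32_t)((c2 >> p) & 1u) << 2) |
                       ((uint32_t)((c3 >> p) & 1u) << 1) |
                       (uint32_t)(p == depth - 1);
        v |= nib << (28 - 4 * p);
    }
    return v;
}

/* Walk the nibbles the way the loop would: three "add.l d4,d4" to fetch
   a, b, c, then a fourth one whose carry ends the loop ("bcc .ploop").
   Rebuilds the pens and returns the number of planes seen. */
static int unpack_abcx(uint32_t d4, uint8_t pen[3])
{
    int planes = 0;
    unsigned stop;

    pen[0] = pen[1] = pen[2] = 0;
    do {
        for (int k = 0; k < 3; k++) {
            pen[k] |= (uint8_t)((d4 >> 31) << planes);
            d4 <<= 1;
        }
        stop = d4 >> 31;          /* the X bit */
        d4 <<= 1;
        planes++;
    } while (!stop && planes < 8);
    return planes;
}
```

With 8 planes the four bits per plane fill the 32-bit register exactly, which matches the remark that the worst case still fits.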

Last edited by a/b; 26 August 2023 at 15:48.
Old 26 August 2023, 16:03   #8
meynaf
Quote:
Originally Posted by a/b View Post
1. assuming there's a move.l a6,d6 just before .yloop label
No, d6 is prepared directly. High word of a6 is currently unused.
This is because d6 loop does not need to be repeated, value of d6 at exit does not matter (it's the outermost loop).


Quote:
Originally Posted by a/b View Post
2. -4c/2b: sub.l #1<<16,d4 to sub.w #1<<8,d4 (plus prep a4 as d8:d8) since 8 bits suffices for the bitplanes
Right.


Quote:
Originally Posted by a/b View Post
3. -12c: convert 3x lsr.b to add.b by reordering the bits
Hmm.


Quote:
Originally Posted by a/b View Post
4. 2x -4c: convert the other 2x sub.l #1<<16,dx to swap/subq.w/swap and pre-swap the ax regs
That would be 68000 only. For 020+ this is slower. I'd rather avoid that.


Quote:
Originally Posted by a/b View Post
5. you are not using all registers, a7 is not used (assuming this runs 100% in user mode)
This is normal user code.


Quote:
Originally Posted by a/b View Post
Now to the relevant part, using #5 you could do this:

stack: d8.l, a6.w, a9.l, (a7 is here), d9.l

y_in:

x_in:
lea (-10,a7),a7
d8.l => move.l (a7)+,d0
a6.w => move.w (a7)+,d6

pl:
a8.l => a6.l (innermost loop 100% reg based)

x_out:
a9.l => sub.l (a7)+,a2

y_out:
d9.l => move.l (a7),d0 | add.l d0,a0 | add.l d0,a1
I would prefer to keep multitasking running if possible...



Quote:
Originally Posted by Don_Adan View Post
Ok, Phil i have one idea for You. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order.
This register will be looks as:
D4=$33xx2211
It can work.
Two register halves at the price of two extra shifts and a lot of bit fiddling in the prep part. Hmm...



Quote:
Originally Posted by Thomas Richter View Post
Only look at the inner loops, place the rest on the stack. For example, the horizontal loop is typically critical, the vertical iteration does not matter too much - only the parts of the loop that are most often executed contribute.
I know.


Quote:
Originally Posted by Thomas Richter View Post
As this "d8" has only two values, the most sensible approach would be to have two separate functions, one for each value, and you get rid of a stack access in an inner loop.
That would duplicate quite a lot of code. Let's keep that for later if no better solution is found.



Quote:
Originally Posted by a/b View Post
a4/d4 could be initialized to contain abcabc..abc0..0 and then use 3x add.[b|w|l] (depending on maximum number of bits there could be).
Number of target bitplanes is that of the public screen the window has opened on, so anything from 1 to 8.
Max ends up 24 bits, so add.l. Complicated to init, but nice.
Old 26 August 2023, 16:34   #9
Thomas Richter
Quote:
Originally Posted by meynaf View Post
That would duplicate quite a lot of code. Let's keep that for later if no better solution is found.

In P96, the work is usually split into two parts, the general control logic and the actual executor, which is often generated by macros. Thus, while the binary contains duplicate code, the source code does not.
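As an illustration of that split, and not of how P96 is actually written, here is a C sketch in which a macro stamps out one specialized worker per "d8" value while a small dispatcher holds the control logic. All names (DEFINE_MASK_PASS, mask_pass and so on) are hypothetical, and only the transparency-mask step is shown to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The loop body is written once; each expansion bakes in a constant SEL,
   so the inner loop no longer has to fetch the selection value. */
#define DEFINE_MASK_PASS(name, SEL)                                   \
    static void name(const uint32_t *p0, const uint32_t *p1,          \
                     uint32_t *mask, size_t n)                        \
    {                                                                 \
        for (size_t i = 0; i < n; i++)                                \
            mask[i] = (p0[i] ^ (SEL)) | (p1[i] ^ (SEL));              \
    }

DEFINE_MASK_PASS(mask_pass_normal,   0x00000000u)
DEFINE_MASK_PASS(mask_pass_selected, 0xFFFFFFFFu)

/* Control logic: pick the specialized worker once, outside the loops. */
static void mask_pass(int selected, const uint32_t *p0, const uint32_t *p1,
                      uint32_t *mask, size_t n)
{
    assert(p0 != NULL && p1 != NULL && mask != NULL);
    if (selected)
        mask_pass_selected(p0, p1, mask, n);
    else
        mask_pass_normal(p0, p1, mask, n);
}
```

The trade-off is the one discussed in the thread: the source stays single, but the binary carries both copies of the loop.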
Old 26 August 2023, 16:45   #10
meynaf
Quote:
Originally Posted by Thomas Richter View Post
In P96, the work is usually split in two parts, the general control logic and the actual executer - which is often generated by macros. Thus, while the binary contains duplicate codes, the source code does not.
Of course, but in my case the resulting code would be included in every program using it, so i'd like to keep it small.
Old 26 August 2023, 16:49   #11
Don_Adan
If you use my idea, then you can replace:

Code:
 move.l d8,d0
with

Code:
 
 move.l a4,d0
 swap d0
 extb.l d0 ; or ext.w d0 and ext.l d0
Of course, you must prepare the A4 register and place the D8 value (0 or -1) in the unused space.
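In C terms, the three instructions take the unused byte (bits 16 to 23 of A4, the xx in $33xx2211) and sign-extend it to a full longword. A minimal sketch, with sel_from_a4 as an invented name:

```c
#include <assert.h>
#include <stdint.h>

/* Models: move.l a4,d0 / swap d0 / extb.l d0
   The xx byte must hold $00 or $FF for the result to be 0 or -1. */
static uint32_t sel_from_a4(uint32_t a4)
{
    uint32_t d0 = (a4 << 16) | (a4 >> 16);   /* swap d0: xx is now the low byte */
    uint32_t low = d0 & 0xFFu;

    assert(low == 0x00u || low == 0xFFu);    /* only the two documented values */
    /* extb.l d0: copy bit 7 of the low byte into bits 8..31 */
    return (low & 0x80u) ? (0xFFFFFF00u | low) : low;
}
```

The result can then be used exactly like the old d8 value in the two eor.l instructions.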
Old 26 August 2023, 17:44   #12
meynaf
I'm now wondering if it wouldn't be easier to try to inflict an interleaved bitmap on BltMaskBitMapRastPort.
It would allow writing to (a2)+ and remove the need for a8/a9.
But would this function accept it? Isn't it v39+?
Old 26 August 2023, 17:45   #13
Don_Adan
I rethought the code a bit, and for me the optimal layout would be A4 as follows:

A4= $221133d8

Code:
.xloop
 move.l (a0)+,d1
 move.l (a1)+,d2
; move.l d8,d0

 move.l a4,d4
 add.b d4,d4
 subx.l d0,d0
 swap d4

 eor.l d0,d1
 eor.l d0,d2
 move.l d1,d0
 or.l d2,d0
 move.l d0,(a3)+
 move.l d1,d3
 and.l d2,d3
 move.l d1,d0
 not.l d0
 or.l d2,d1
 eor.l d2,d1
 and.l d0,d2
; move.l a4,d4
; move.w a5,d5
; move.w a6,d6
.ploop
 add.b d4,d4
 subx.l d7,d7
 lsr.b #1,d4
 and.l d1,d7
 add.w d4,d4
 subx.l d0,d0
 lsr.w #1,d4
 and.l d2,d0
 or.l d0,d7
 add.l d4,d4
 subx.l d0,d0
 and.l d3,d0
 or.l d0,d7
 move.l d7,(a2)
Same number of instructions between xloop and ploop.
2 more instructions in ploop, but faster (add vs lsr).
And now you have enough free registers, I think.
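The new .xloop prologue does two jobs at once: it pulls the selection flag out of the low byte of A4 and then swaps the halves so the color fields land in the $33xx2211 positions used by .ploop. Here is a C sketch of those four instructions, with xloop_prologue as a made-up name:

```c
#include <assert.h>
#include <stdint.h>

/* Models: move.l a4,d4 / add.b d4,d4 / subx.l d0,d0 / swap d4
   a4 is laid out as $221133d8, with the d8 byte being $00 or $FF.
   Returns the new d4; *sel receives the 0 / -1 value that ends up in d0. */
static uint32_t xloop_prologue(uint32_t a4, uint32_t *sel)
{
    uint32_t d4 = a4;
    assert(sel != 0);

    unsigned x = (d4 >> 7) & 1u;                    /* add.b: X = bit 7 */
    d4 = (d4 & 0xFFFFFF00u) | ((d4 << 1) & 0xFFu);  /* low byte doubled */
    *sel = x ? 0xFFFFFFFFu : 0u;                    /* subx.l d0,d0     */
    return (d4 << 16) | (d4 >> 16);                 /* swap d4          */
}
```

After the swap the leftover d8 byte ($FE or $00) sits in bits 16 to 23. Its top bit needs eight left shifts to reach bit 31, so it would first be read on a ninth .ploop pass and cannot disturb color 3 with at most 8 planes.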

Last edited by Don_Adan; 26 August 2023 at 17:52.
Old 26 August 2023, 19:33   #14
meynaf
Quote:
Originally Posted by a/b View Post
If there's enough room you could use 4 bits per iteration (abcXabcX... X = stop flag) and do add.blw d4,d4 + bcc.b .ploop instead of sub.wl #.
I missed that edit.
Nice trick (it would fit, worst case is 8 times abc), the only problem is now to prepare the data...



Quote:
Originally Posted by Don_Adan View Post
I rethinked a few code, and for me optimal will be A4 as next input:

A4= $221133d8

<snip>

Same number of instructions between xloop and ploop.
2 instructions more for ploop, but faster (add vs lsr).
That would be good.
And 2 instructions less if d6 is free.
Makes me hesitate, a/b's trick removes those 2 lsr but requires a lot more precalc.


Quote:
Originally Posted by Don_Adan View Post
And now You have enough free registers, i think.
I'll never have enough.
Old 26 August 2023, 19:37   #15
Don_Adan
Yes, I prefer the easiest methods; perhaps I would use a small table for the bit reversing.
Old 26 August 2023, 21:23   #16
Karlos
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,480
Quote:
Originally Posted by meynaf View Post
I'll never have enough.
Sounds like you need to try a load/store machine. They tend to have plenty of user-accessible registers. PPC has 32.
Old 26 August 2023, 21:33   #17
meynaf
Quote:
Originally Posted by Karlos View Post
Sounds like you need to try a load/store machine. They tend to have plenty of user accessible registers. PPC has 32.
I prefer to have 16 and a coder-friendly asm language.
Old 26 August 2023, 21:45   #18
Thomas Richter
Quote:
Originally Posted by meynaf View Post
I'm now wondering if it wouldn't be easier to try to inflict an interleaved bitmap to BltMaskBitMapRastPort.
BltMaskBitMapRastPort() is historically a rather slow function because it uses two bits, not one. The P96 version uses a single-pass CPU driven function.



Quote:
Originally Posted by meynaf View Post
But would this function accept it ? Isn't it v39+ ?
Interleaved destinations work I believe from v39 onwards. Interleaved sources from v45 onwards (they are broken in v39 and v40).
Old 26 August 2023, 21:49   #19
Karlos
Quote:
Originally Posted by meynaf View Post
I prefer to have 16 and a coder-friendly asm language.
Hmm. Some sort of 68K style programming model but with completely general registers. Maybe expanded out to 64 bit.
Old 26 August 2023, 22:38   #20
meynaf
Quote:
Originally Posted by Thomas Richter View Post
BltMaskBitMapRastPort() is historically a rather slow function because it uses two bits, not one. The P96 version uses a single-pass CPU driven function.
It is apparently the only blt function able to do transparency, so I have little choice but to use it.


Quote:
Originally Posted by Thomas Richter View Post
Interleaved destinations work I believe from v39 onwards. Interleaved sources from v45 onwards (they are broken in v39 and v40).
That's what i feared.



Quote:
Originally Posted by Karlos View Post
Hmm. Some sort of 68K style programming model but with completely general registers. Maybe expanded out to 64 bit.
My VM doesn't struggle for that routine and it has 16 32-bit registers.
But ok, it has a little bit more facilities to use high parts of them...