26 August 2023, 10:19 | #1 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
out of registers
If you like code that's totally out of registers, here's a nice (?) example.
Not only does it use all regs, it also uses 4 imaginary regs (d8, d9, a8, a9) because there aren't enough! And it could use even more. The whole routine is much larger than this, but this is the critical part. Nonexistent registers are currently mapped to memory (ds.l).
It's code that works and that I wish to clean up and optimize. I can do it myself, but I'd like to see how others handle this kind of case.
If you really want to know, this code is the loop for remapping 2bpp bitmaps to other colors, computing the transparency plane in the process. It also supports inverting the data for selection rendering. A call to BltMaskBitMapRastPort follows (on a normal WB window's rp). It's for a GUI library I'm developing.
Code:
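; a0-a1 = input planes (a0=a1 for 1bpp), a2=output, a3=transp., d8=0/-1 (sel'd render)
; a4=plane loop / color 1, a5=xloop/color 2, a6=color 3
.yloop
    move.l a5,d5            ; reload xloop counter (high word)
.xloop
    move.l (a0)+,d1
    move.l (a1)+,d2
    move.l d8,d0
    eor.l d0,d1             ; invert both planes when d8=-1
    eor.l d0,d2
    move.l d1,d0
    or.l d2,d0
    move.l d0,(a3)+         ; transparency = OR of both planes
    move.l d1,d3
    and.l d2,d3             ; d3 = pixels with both bits set
    move.l d1,d0
    not.l d0
    or.l d2,d1
    eor.l d2,d1             ; d1 = pixels with only the a0 bit set
    and.l d0,d2             ; d2 = pixels with only the a1 bit set
    move.l a4,d4            ; reload plane counter + color 1
    move.w a5,d5            ; reload color 2 (low word only)
    move.w a6,d6            ; reload color 3 (low word only)
.ploop
    lsr.b #1,d4             ; next bit of color 1 -> X
    subx.l d7,d7            ; d7 = 0 or -1
    and.l d1,d7
    lsr.b #1,d5             ; next bit of color 2 -> X
    subx.l d0,d0
    and.l d2,d0
    or.l d0,d7
    lsr.b #1,d6             ; next bit of color 3 -> X
    subx.l d0,d0
    and.l d3,d0
    or.l d0,d7
    move.l d7,(a2)
    add.l a8,a2
    subi.l #$10000,d4
    bcc .ploop
    sub.l a9,a2
    subi.l #$10000,d5
    bcc .xloop
    add.l d9,a0
    add.l d9,a1
    subi.l #$10000,d6
    bcc .yloop
|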
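For anyone checking an optimised variant against the original, here is a minimal C reference model of one .xloop iteration. It is a sketch under assumptions, not the author's code: it assumes a0 points to plane 0 and a1 to plane 1, and the names remap_long, is1, is2 and is3 are made up for illustration.

```c
#include <stdint.h>

/* Reference model of one .xloop pass: 32 pixels, 2 input planes.
 * p0, p1  : longwords read from (a0)+ and (a1)+
 * sel     : 0 or 0xFFFFFFFF, the "d8" selection flag
 * c1..c3  : target colors for source pixel values 1, 2 and 3
 * depth   : number of output planes (1..8)
 * out[]   : one longword per output plane, plane 0 first
 * returns : the transparency longword written to (a3)+            */
uint32_t remap_long(uint32_t p0, uint32_t p1, uint32_t sel,
                    uint8_t c1, uint8_t c2, uint8_t c3,
                    int depth, uint32_t out[])
{
    p0 ^= sel;                        /* eor.l d0,d1 */
    p1 ^= sel;                        /* eor.l d0,d2 */
    uint32_t mask = p0 | p1;          /* opaque wherever a bit is set */
    uint32_t is3  = p0 & p1;          /* d3: both bits set */
    uint32_t is1  = p0 & ~p1;         /* d1: only the p0 bit set */
    uint32_t is2  = p1 & ~p0;         /* d2: only the p1 bit set */

    for (int p = 0; p < depth; p++) {
        /* lsr.b #1 + subx.l turn color bit p into 0 or all ones */
        uint32_t m1 = 0u - ((uint32_t)(c1 >> p) & 1u);
        uint32_t m2 = 0u - ((uint32_t)(c2 >> p) & 1u);
        uint32_t m3 = 0u - ((uint32_t)(c3 >> p) & 1u);
        out[p] = (m1 & is1) | (m2 & is2) | (m3 & is3);
    }
    return mask;
}
```

With a0=a1 (the 1bpp case) is1 and is2 are always zero, so every set pixel takes color 3, which matches the register comments in the listing.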
26 August 2023, 10:30 | #2 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,376
|
The subs on dx could use swap / sub 1 / swap.
Also, maybe load d9 into a register and add.w it to the address regs. Last edited by jotd; 26 August 2023 at 11:06. |
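The swap suggestion can be sanity-checked with a tiny C model of a data register plus the carry flag. This is only a sketch with made-up names (Reg, op_swap, again_swap and so on). One detail it makes visible: SWAP always clears C, so the bcc has to test the flags between the subq.w and the second swap, or the counter has to be kept in the low word in the first place.

```c
#include <stdint.h>
#include <stdbool.h>

/* One data register plus the carry flag, enough to model the loop test. */
typedef struct { uint32_t d; bool c; } Reg;

/* subi.l #$10000,dn : borrows (C=1) when the high word was 0 */
Reg op_subi_l_10000(Reg r)
{
    r.c = r.d < 0x10000u;
    r.d -= 0x10000u;
    return r;
}

/* swap dn : exchanges the two words and always clears C */
Reg op_swap(Reg r)
{
    r.d = (r.d << 16) | (r.d >> 16);
    r.c = false;
    return r;
}

/* subq.w #1,dn : touches the low word only, borrows when it was 0 */
Reg op_subq_w_1(Reg r)
{
    r.c = (r.d & 0xFFFFu) == 0;
    r.d = (r.d & 0xFFFF0000u) | ((r.d - 1u) & 0xFFFFu);
    return r;
}

/* Original loop test: returns true when bcc would branch back. */
bool again_subi(uint32_t *d)
{
    Reg r = { *d, false };
    r = op_subi_l_10000(r);
    *d = r.d;
    return !r.c;
}

/* swap / subq.w / swap: the decision is sampled before the second swap. */
bool again_swap(uint32_t *d)
{
    Reg r = { *d, false };
    r = op_subq_w_1(op_swap(r));
    bool branch = !r.c;      /* bcc must test the flags here */
    r = op_swap(r);          /* C is now 0 whatever the counter was */
    *d = r.d;
    return branch;
}
```

Both paths leave the same value in the register and agree on when the loop ends; only the point at which the flags are valid differs.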
26 August 2023, 12:15 | #3 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
|
At first glance, you can reduce the size of this routine, but I don't know if it will be faster.
On a second look, perhaps one register at most can be freed, but it needs extra swaps, which you don't like. Then perhaps you must use the stack for the extra registers. I will check it more later. Code:
    move.l d7,(a2)
    moveq #1,d7
    swap d7
    add.l a8,a2     ; a8
    sub.l d7,d4
    bcc .ploop
    sub.l a9,a2     ; a9
    sub.l d7,d5
    bcc .xloop
    add.l d9,a0     ; d9
    add.l d9,a1     ; d9
    sub.l d7,d6
    bcc .yloop
|
26 August 2023, 14:14 | #4 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,066
|
Things you probably are aware of and are not a major concern here:
1. assuming there's a move.l a6,d6 just before .yloop label
2. -4c/2b: sub.l #1<<16,d4 to sub.w #1<<8,d4 (plus prep a4 as d8:d8) since 8 bits suffices for the bitplanes
3. -12c: convert 3x lsr.b to add.b by reordering the bits
4. 2x -4c: convert the other 2x sub.l #1<<16,dx to swap/subq.w/swap and pre-swap the ax regs
5. you are not using all registers, a7 is not used (assuming this runs 100% in user mode)
Now to the relevant part, using #5 you could do this:
stack: d8.l, a6.w, a9.l, (a7 is here), d9.l
y_in:
x_in:   lea (-10,a7),a7
        d8.l => move.l (a7)+,d0
        a6.w => move.w (a7)+,d6
pl:     a8.l => a6.l (innermost loop 100% reg based)
x_out:  a9.l => sub.l (a7)+,a2
y_out:  d9.l => move.l (a7),d0 | add.l d0,a0 | add.l d0,a1
|
26 August 2023, 14:26 | #5 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
|
OK Phil, I have one idea for you. Perhaps you can free 2 registers.
You must put all colors in 1 data register, in reversed bit order. This register will look like D4=$33xx2211, and the code could look like this: Code:
.ploop
    add.b d4,d4
    subx.l d7,d7
    lsr.b #1,d4
    and.l d1,d7
    add.w d4,d4
    subx.l d0,d0
    lsr.w #1,d4
    and.l d2,d0
    or.l d0,d7
    add.l d4,d4
    subx.l d0,d0
    and.l d3,d0
    or.l d0,d7
    move.l d7,(a2)
|
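To see why the add/lsr pairs work on a register laid out as $33xx2211, the sequence can be modelled in C. This is a sketch with hypothetical helper names (rev8_loop, pack_d4, ploop_step). Each lsr clears the top bit that the preceding add just consumed, so it cannot carry into the next field when the wider add shifts everything left. Bits of the xx byte do creep into the low end of the color 3 byte, but in this model they only reach bit 31 after 8 passes, so up to 8 planes are unaffected.

```c
#include <stdint.h>

/* Reverse the 8 bits of a byte so that plane 0 ends up in bit 7. */
uint8_t rev8_loop(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* Build D4 = $33xx2211 from the three colors and a free byte xx. */
uint32_t pack_d4(uint8_t c1, uint8_t c2, uint8_t c3, uint8_t xx)
{
    return ((uint32_t)rev8_loop(c3) << 24) | ((uint32_t)xx << 16) |
           ((uint32_t)rev8_loop(c2) << 8)  |  (uint32_t)rev8_loop(c1);
}

/* One .ploop pass. Returns the current plane's bit of color 1, 2 and 3
 * in bits 0, 1 and 2, and advances *d4 exactly like the listing does. */
unsigned ploop_step(uint32_t *d4)
{
    uint32_t d = *d4;

    unsigned x1 = (d >> 7) & 1u;                    /* add.b d4,d4: X = bit 7  */
    d = (d & 0xFFFFFF00u) | ((d << 1) & 0xFEu);
    d = (d & 0xFFFFFF00u) | ((d & 0xFFu) >> 1);     /* lsr.b #1,d4: bit 7 = 0  */

    unsigned x2 = (d >> 15) & 1u;                   /* add.w d4,d4: X = bit 15 */
    d = (d & 0xFFFF0000u) | ((d << 1) & 0xFFFEu);
    d = (d & 0xFFFF0000u) | ((d & 0xFFFFu) >> 1);   /* lsr.w #1,d4: bit 15 = 0 */

    unsigned x3 = d >> 31;                          /* add.l d4,d4: X = bit 31 */
    d <<= 1;

    *d4 = d;
    return x1 | (x2 << 1) | (x3 << 2);
}
```

Calling ploop_step eight times on pack_d4(c1, c2, c3, xx) should return the same bits that three lsr.b #1 instructions would produce from the unreversed colors, plane 0 first.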
26 August 2023, 15:20 | #6 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,317
|
Quote:
As this "d8" has only two values, the most sensible approach would be to have two separate functions, one for each value, and you get rid of a stack access in an inner loop. |
|
26 August 2023, 15:38 | #7 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,066
|
Quote:
If there's enough room you could use 4 bits per iteration (abcXabcX... X = stop flag) and do add.b/w/l d4,d4 + bcc.b .ploop instead of sub.w/l #. Last edited by a/b; 26 August 2023 at 15:48. |
|
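The stop-flag idea can be sketched in C as follows. The exact bit assignment is an assumption made for this example (a = color 1, b = color 2, c = color 3, X set only on the last plane, first plane in the top nibble), and pack_abcx, shift_out_msb and walk_planes are hypothetical names. Each shift_out_msb call stands for one add.l d4,d4 followed by reading the carry.

```c
#include <stdint.h>

/* Pack one abcX nibble per plane, first plane in the top nibble.
 * depth must be 1..8; X is 1 only in the last plane's nibble. */
uint32_t pack_abcx(uint8_t c1, uint8_t c2, uint8_t c3, int depth)
{
    uint32_t d = 0;
    for (int p = 0; p < depth; p++) {
        uint32_t nib = (((uint32_t)(c1 >> p) & 1u) << 3) |
                       (((uint32_t)(c2 >> p) & 1u) << 2) |
                       (((uint32_t)(c3 >> p) & 1u) << 1) |
                       (p == depth - 1 ? 1u : 0u);
        d |= nib << (28 - 4 * p);
    }
    return d;
}

/* Model of add.l d4,d4: returns the bit shifted out (the carry). */
unsigned shift_out_msb(uint32_t *d)
{
    unsigned c = *d >> 31;
    *d <<= 1;
    return c;
}

/* Walk the planes the way the loop would. Stores each plane's color bits
 * (a in bit 0, b in bit 1, c in bit 2) and returns the number of planes. */
int walk_planes(uint32_t d4, uint8_t bits[8])
{
    int n = 0;
    unsigned stop;
    do {
        unsigned a = shift_out_msb(&d4);
        unsigned b = shift_out_msb(&d4);
        unsigned c = shift_out_msb(&d4);
        bits[n++] = (uint8_t)(a | (b << 1) | (c << 2));
        stop = shift_out_msb(&d4);   /* bcc.b .ploop loops while this is 0 */
    } while (!stop);
    return n;
}
```

With this layout the plane counter disappears entirely: the same register supplies both the color bits and the loop condition, at the cost of a more involved setup.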
26 August 2023, 16:03 | #8 | ||||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
No, d6 is prepared directly. The high word of a6 is currently unused.
This is because the d6 loop does not need to be repeated; the value of d6 at exit does not matter (it's the outermost loop).
Quote:
Hmm. Quote:
Quote:
Quote:
Quote:
Two register halves at the price of two extra shifts and a lot of bit fiddling in the prep part. Hmm... Quote:
Quote:
Quote:
Max ends up 24 bits, so add.l. Complicated to init, but nice. |
||||||||
26 August 2023, 16:34 | #9 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,317
|
Quote:
In P96, the work is usually split into two parts, the general control logic and the actual executor, which is often generated by macros. Thus, while the binary contains duplicate code, the source code does not. |
|
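As an illustration of that split (not P96's actual code), here is a minimal C sketch in which a macro stamps out one executor per value of the selection flag and a small control function picks one before any loop runs. DEFINE_MASK_ROW and the mask_row names are invented for the example, and only the transparency computation is shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Stamp out one executor per value of the selection flag. Because SEL is
 * a constant inside each copy, no flag has to be fetched in the loop. */
#define DEFINE_MASK_ROW(name, SEL)                                    \
    void name(const uint32_t *p0, const uint32_t *p1,                 \
              uint32_t *mask, size_t n)                               \
    {                                                                 \
        for (size_t i = 0; i < n; i++)                                \
            mask[i] = (p0[i] ^ (SEL)) | (p1[i] ^ (SEL));              \
    }

DEFINE_MASK_ROW(mask_row_normal,   0u)
DEFINE_MASK_ROW(mask_row_selected, 0xFFFFFFFFu)

/* Control logic: choose the executor once, outside every loop. */
void mask_row(int selected, const uint32_t *p0, const uint32_t *p1,
              uint32_t *mask, size_t n)
{
    (selected ? mask_row_selected : mask_row_normal)(p0, p1, mask, n);
}
```

The source contains the loop once, while the binary contains it twice, which is exactly the size trade-off being weighed in this thread.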
26 August 2023, 16:45 | #10 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Of course, but in my case the resulting code would be included in every program using it, so I'd like to keep it small.
|
26 August 2023, 16:49 | #11 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
|
If you use my idea, then you can replace:
Code:
    move.l d8,d0
with:
Code:
    move.l a4,d0
    swap d0
    extb.l d0       ; or ext.w d0 and ext.l d0
|
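What the replacement computes can be written out in C. This is a sketch that assumes a4 holds the $33xx2211 layout with the flag byte xx equal to $00 or $FF, and sel_from_a4 is a made-up name. Note that extb.l exists only on 68020 and later, which is why the ext.w + ext.l pair is offered as the alternative.

```c
#include <stdint.h>

/* move.l a4,d0 / swap d0 / extb.l d0
 * After the swap the flag byte sits in bits 0..7; sign-extending it
 * gives 0 for $00 and $FFFFFFFF for $FF. */
uint32_t sel_from_a4(uint32_t a4)
{
    uint32_t d0 = (a4 << 16) | (a4 >> 16);   /* swap d0 */
    uint32_t b  = d0 & 0xFFu;                /* low byte only */
    return (b ^ 0x80u) - 0x80u;              /* sign-extend byte to long */
}
```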
26 August 2023, 17:44 | #12 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
I'm now wondering if it wouldn't be easier to try to inflict an interleaved bitmap on BltMaskBitMapRastPort.
That would allow writing to (a2)+ and remove the need for a8/a9. But would this function accept it? Isn't it v39+? |
26 August 2023, 17:45 | #13 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
|
I rethought the code a bit, and for me the optimal input would be A4 laid out as follows:
A4 = $221133d8
Code:
.xloop
    move.l (a0)+,d1
    move.l (a1)+,d2
;   move.l d8,d0
    move.l a4,d4
    add.b d4,d4
    subx.l d0,d0
    swap d4
    eor.l d0,d1
    eor.l d0,d2
    move.l d1,d0
    or.l d2,d0
    move.l d0,(a3)+
    move.l d1,d3
    and.l d2,d3
    move.l d1,d0
    not.l d0
    or.l d2,d1
    eor.l d2,d1
    and.l d0,d2
;   move.l a4,d4
;   move.w a5,d5
;   move.w a6,d6
.ploop
    add.b d4,d4
    subx.l d7,d7
    lsr.b #1,d4
    and.l d1,d7
    add.w d4,d4
    subx.l d0,d0
    lsr.w #1,d4
    and.l d2,d0
    or.l d0,d7
    add.l d4,d4
    subx.l d0,d0
    and.l d3,d0
    or.l d0,d7
    move.l d7,(a2)
2 more instructions in ploop, but faster (add vs lsr). And now you have enough free registers, I think.
Last edited by Don_Adan; 26 August 2023 at 17:52. |
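The four instructions that replace move.l d8,d0 can be modelled in C to check what ends up where. This is a sketch assuming A4 = $221133d8 with the flag byte equal to $00 or $FF, and unpack_a4 is a hypothetical name.

```c
#include <stdint.h>

/* move.l a4,d4 / add.b d4,d4 / subx.l d0,d0 / swap d4
 * sel receives 0 or $FFFFFFFF (the old "d8"); d4 receives the packed
 * colors in the $33xx2211 order expected by .ploop. */
void unpack_a4(uint32_t a4, uint32_t *sel, uint32_t *d4)
{
    uint32_t d = a4;                              /* move.l a4,d4 */
    uint32_t x = (d >> 7) & 1u;                   /* add.b d4,d4: X = bit 7 */
    d = (d & 0xFFFFFF00u) | ((d << 1) & 0xFEu);
    *sel = 0u - x;                                /* subx.l d0,d0 */
    *d4 = (d << 16) | (d >> 16);                  /* swap d4 */
}
```

The shifted flag byte is left behind in the xx position, where .ploop never reads it.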
26 August 2023, 19:33 | #14 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Nice trick (it would fit, worst case is 8 times abc); the only problem now is preparing the data... Quote:
And 2 instructions less if d6 is free. Makes me hesitate, a/b's trick removes those 2 lsr but requires a lot more precalc. I'll never have enough. |
||
26 August 2023, 19:37 | #15 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,048
|
Yes, I prefer the easiest methods; perhaps I will use a small table for bit reversing.
|
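A small table for bit reversal could look like the following C sketch: a 16-entry nibble table, with the two halves of the byte looked up and exchanged. The names rev4 and reverse_bits8 are illustrative only; a 68k version would index the same 16 bytes.

```c
#include <stdint.h>

/* rev4[i] is i with its 4 bits in reverse order. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte: the reversed low nibble becomes the high nibble
 * and the reversed high nibble becomes the low nibble. */
uint8_t reverse_bits8(uint8_t b)
{
    return (uint8_t)((rev4[b & 0x0Fu] << 4) | rev4[b >> 4]);
}
```

A full 256-entry table would save the second lookup, at the cost of more data in every program that includes the routine.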
26 August 2023, 21:23 | #16 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,480
|
|
26 August 2023, 21:33 | #17 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
|
26 August 2023, 21:45 | #18 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,317
|
Quote:
Interleaved destinations work I believe from v39 onwards. Interleaved sources from v45 onwards (they are broken in v39 and v40). |
|
26 August 2023, 21:49 | #19 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,480
|
|
26 August 2023, 22:38 | #20 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Quote:
But ok, it has a little bit more facilities to use high parts of them... |
|||