Optimizing the 68020+ 32-bit math - Page 16

a/b · 12 June 2021, 12:18

This is not equivalent:

Code:

         move.l #(65536-(ra-start))/7,D7 ; D7=maxn
 ;        move.l #$10000-(ra-start),d7
 ;        divu.w #7*4,D7
 ;        lsl.l #2,D7    ; d7.w=maxn

Also missing ext.l between div and lsl.

8/7 = 1
(8/28)<<2 = 0
7777/7 = 1111
(7777/28)<<2 = 1108

It should be written as either of these:

Code:

	move.l	#((65536-(ra-start))/7)&(~3),D7		; d7=maxn
	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
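
The arithmetic above can be checked with a small Python sketch. This is only a model of the assembler-time expressions, not part of the program; `free` is a hypothetical stand-in for 65536-(ra-start), and the function names are made up for the illustration.

```python
def plain(free):
    # move.l #free/7,d7 : plain assembler-time integer division
    return free // 7

def runtime(free):
    # divu.w #7*4 followed by lsl #2, looking at the quotient only
    return (free // 28) << 2

def masked(free):
    # (free/7) & ~3 : the /7 result rounded down to a multiple of 4
    return (free // 7) & ~3

# Values for which the plain constant differs from the runtime sequence
mismatches = [f for f in range(65537) if plain(f) != runtime(f)]
# The two suggested replacements agree with the runtime sequence everywhere
agree = all(masked(f) == runtime(f) for f in range(65537))

print(plain(8), runtime(8), plain(7777), runtime(7777))
print(len(mismatches), agree)
```

Both replacement forms round the result down to a multiple of 4, which is what the divu/lsl pair did implicitly.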

Bruce Abbott · 12 June 2021, 12:21

Quote:

Originally Posted by meynaf

But this is bringing us quite far from the original topic of 32-bit division...

This thread is supposed to be about division on 68020/030 only, but discussing differences between them and other CPUs is not totally out of place. Also the OP was interested in performance on 'real iron', so the effect of different types of memory etc. could be relevant (e.g. a base model A1200 vs. one with FastRAM or an accelerator card).

Here are a few more interesting timings. First, a straight copy from FastRAM to ChipRAM, which took 46 clock cycles per loop.

Code:

  lea     fastram,a0      ; a0 = pointer to fastram
  lea     chipram,a1      ; a1 = pointer to chipram
  move.w  #1000-1,d5      ; repeat inner loop code 1000 times
; -- inner loop --
loop:
  move.l  (a0)+,(a1)+     ; copy longword from fastram to chipram
  dbf     d5,loop

That's about 4.3MB per second, which isn't particularly impressive. Curiously however, copying the data through a register was just as fast despite having an extra instruction...

Code:

loop:
  move.l  (a0)+,d0        ; read longword from next fastram address
  move.l  d0,(a1)+        ; write longword to next chipram address
  dbf     d5,loop

So how many instructions can we add without increasing the copy time? The answer is, a lot!

Code:

loop:
  move.l  (a0)+,d0        ; read longword from next fastram address
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d0,(a1)+        ; write longword to next chipram address
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  dbf     d5,loop

That's 13 'free' instructions that could be used to manipulate the data while copying it, or for some other purpose.

I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.

Maybe this analysis is a bit off topic, but it shows that when dealing with slow memory it pays to interleave data memory accesses with internal operations. The pi-spigot code has mostly register to register instructions and no consecutive data memory accesses in its inner loop, so it (fortunately?) has nothing to gain from this principle.
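
As a rough cross-check of the 4.3MB/s figure, the copy rate follows directly from the cycles per loop. The sketch below is Python rather than 68k code, and it assumes a 50 MHz CPU clock; the clock is not stated in this post, but it is the value the quoted numbers imply. The function name is made up for the illustration.

```python
def copy_rate_mb_per_s(cycles_per_loop, clock_hz=50_000_000, bytes_per_loop=4):
    """Bytes copied per second, in units of 10**6 bytes.

    clock_hz is an assumption (50 MHz); each loop iteration moves one longword.
    """
    loops_per_second = clock_hz / cycles_per_loop
    return loops_per_second * bytes_per_loop / 1e6

rate = copy_rate_mb_per_s(46)
print(f"{rate:.2f} MB/s")
```

Since the loop still takes 46 cycles with the extra register moves added, the rate stays the same; only the amount of useful work done per copied longword changes.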

Don_Adan · 12 June 2021, 12:41

Quote:

Originally Posted by a/b

This is not equivalent:

Code:

         move.l #(65536-(ra-start))/7,D7 ; D7=maxn
 ;        move.l #$10000-(ra-start),d7
 ;        divu.w #7*4,D7
 ;        lsl.l #2,D7    ; d7.w=maxn

Also missing ext.l between div and lsl.

8/7 = 1
(8/28)<<2 = 0
7777/7 = 1111
(7777/28)<<2 = 1108

It should be written as either of these:

Code:

	move.l	#((65536-(ra-start))/7)&(~3),D7		; d7=maxn
	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn

Ok, fixed, thanks.
ext.l is not necessary for this version, because D7 (later D5) is handled as a word only.
ext.l is only needed for litwr's version of the PR0000 routine, which uses divu.w; for the sub.w version it can be left out.
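
To illustrate why a word-only use gets away without cleaning the register, here is a small Python model of what divu.w leaves behind (remainder in the upper word, quotient in the lower word). It is a sketch only: `divu_w_reg` and `lsl_l` are made-up helper names, and the value of `free` is a hypothetical example for $10000-(ra-start).

```python
def divu_w_reg(d_reg, divisor):
    """Model of DIVU.W on a data register.

    Returns (new_register_value, overflow). The result holds the remainder
    in bits 31-16 and the quotient in bits 15-0. On overflow the register
    is left unchanged.
    """
    quotient, remainder = divmod(d_reg & 0xFFFFFFFF, divisor & 0xFFFF)
    if quotient > 0xFFFF:
        return d_reg, True
    return (remainder << 16) | quotient, False

def lsl_l(d_reg, count):
    """Model of LSL.L: 32-bit logical shift left."""
    return (d_reg << count) & 0xFFFFFFFF

free = 0x10000 - 1234              # hypothetical $10000-(ra-start)
d7, overflow = divu_w_reg(free, 7 * 4)
d7 = lsl_l(d7, 2)
low_word = d7 & 0xFFFF             # what cmp.w / move.w will see

print(hex(d7), low_word, (free // 28) << 2)
```

The low word matches (free/28)<<2, while the full longword still carries the shifted remainder in its upper half, so only code that reads D7 as a longword would need the extra cleanup.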

meynaf · 12 June 2021, 12:45

Quote:

Originally Posted by Bruce Abbott

This thread is supposed to be about division on 68020/030 only, but discussing differences between them and other CPUs is not totally out of place. Also the OP was interested in performance on 'real iron', so the effect of different types of memory etc. could be relevant (e.g. a base model A1200 vs. one with FastRAM or an accelerator card).

Ok then. Let's go for it.

Quote:

Originally Posted by Bruce Abbott

I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.

I suppose the effect for the reads is the one mentioned by grond. It doesn't explain everything but it exists.
For the writes, we know why already.

Now perhaps it's possible to do better. What about:

Code:

loop
 move.l (a0)+,d0
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l d0,(a1)+
 move.l d1,(a1)+
 move.l d2,(a1)+
 dbf d5,loop

You could also attempt to enable/disable data burst, to see if this has a significant impact.

SpeedGeek · 12 June 2021, 14:31

@Thread

I've edited the thread title so the 040 and 060 can be included as on-topic here. As you can see, the topic diversity is something to think about when the thread is created.

litwr · 18 June 2021, 20:21

First, thanks to the people who helped to optimize my code. I have just made a commit with some of Don_Adan's suggestions. However, I must note that I was invited to start this thread by meynaf.

off topic removed - Bippym

roondar · 19 June 2021, 20:05

Quote:

Originally Posted by Bruce Abbott

I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.

I've noticed while coding for 68020 (but I'm assuming this holds for 68030+ as well to at least some extent) that execution time of code often doesn't increase or decrease when you add/remove instructions that only involve registers. I've always understood this to be due to the scheduler's way of being able to execute certain code while a read or write operation is in progress (at least, that's how I understood the Motorola 68020 manual).

How much of that you can do seems to depend at least on the clock speed of the CPU vs the speed of the RAM and whether or not the code is in the cache (i.e. on a Chip RAM only A1200 the effect can be quite extreme).

One thing that may also be worth considering is that some of the instructions involved in the code here write words (or even bytes) to memory. Combining these into a single, bigger write can be a lot faster on at least 68020/68030 (I think this also goes for 68040+, but I'm not too sure), especially if the writes are longword aligned. I did some tests on this and found that, for my code and word-based results at least, it still usually gave a notable speed increase, despite the overhead of an extra register to hold the half-results and the extra code needed to keep track of them properly.

Note however, I also found that the slower the RAM, the bigger the speed increase from doing this. If you have zero-wait-state RAM, the difference will be a lot smaller than say writing to Chip RAM.
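
The idea of combining word writes can be sketched as follows. This is a Python model under stated assumptions, not measured code: it only counts memory writes and checks that the resulting bytes are identical (big-endian, as on the 68k). The function names are made up for the illustration.

```python
def write_words(values):
    """One 16-bit write per result. Returns (bytes written, number of writes)."""
    out = bytearray()
    writes = 0
    for v in values:
        out += (v & 0xFFFF).to_bytes(2, "big")
        writes += 1
    return bytes(out), writes

def write_longs(values):
    """Collect two results in a 'register', then emit one 32-bit write per pair."""
    out = bytearray()
    writes = 0
    for i in range(0, len(values) - 1, 2):
        reg = ((values[i] & 0xFFFF) << 16) | (values[i + 1] & 0xFFFF)
        out += reg.to_bytes(4, "big")
        writes += 1
    if len(values) % 2:            # an odd leftover still needs a word write
        out += (values[-1] & 0xFFFF).to_bytes(2, "big")
        writes += 1
    return bytes(out), writes

data = [1, 2, 3, 4, 5]
print(write_words(data)[1], write_longs(data)[1])
```

The memory contents are the same either way; the number of bus writes roughly halves, which matters most when each write to slow RAM is expensive.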

Don_Adan · 19 June 2021, 23:17

End code reworked, but untested.
Edit: it's buggy for now. The time string must be reversed.
Edit 2: perhaps fixed now. It can still be optimised by a few bytes, perhaps using litwr's idea.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)

	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn

.l20 
         move.l (A4),D1    ; cout
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D4
         lea string(PC),A3
         moveq #10,D1
         move.b D1,(A3)+  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #$30,D0
;         move.l d5,d6


 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5

        add.b D0,D5
         move.b d5,(a3)+
         divu.w d1,d7
         swap d7
        add.b D0,D7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.',(a3)+      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D4
         divu.w d1,d7
         swap d7
        add.b D0,D7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,(A3)+           ; space
         move.l   A3,D2

         moveq #1,D3
.next
         move.l (A4),D1            ; cout
         subq.l #1,D2
         jsr Write(a6) 
         subq.l #1,D4
         bne.b .next

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
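
For readers following the .longdiv path in the listing above: it is taken when divu.w signals overflow (bvs) and builds a 32-bit quotient from two 16-bit divide steps. The Python sketch below models that technique only; it is not a translation of the register shuffling, and the function names are made up for the illustration.

```python
def divu_w(dividend, divisor):
    """DIVU.W model: (quotient, remainder), or None if the quotient
    does not fit in 16 bits (the V flag case)."""
    q, r = divmod(dividend, divisor)
    return None if q > 0xFFFF else (q, r)

def div32_by_16(dividend, divisor):
    """32-bit quotient and 16-bit remainder using only 16-bit divide steps."""
    fast = divu_w(dividend, divisor)
    if fast is not None:                     # bvs not taken
        return fast
    hi, lo = dividend >> 16, dividend & 0xFFFF
    # Step 1: divide the upper word. hi < 65536, so this cannot overflow.
    q_hi, r_hi = divu_w(hi, divisor)
    # Step 2: divide (remainder:lower word). r_hi < divisor, so the
    # quotient of this step always fits in 16 bits.
    q_lo, r = divu_w((r_hi << 16) | lo, divisor)
    return (q_hi << 16) | q_lo, r

print(div32_by_16(0x12345678, 7), divmod(0x12345678, 7))
```

The fast path costs a single divide; the slow path costs two more, so it only pays off because overflow is the rare case in the inner loop.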

Don_Adan · 20 June 2021, 14:57

BTW, perhaps the rasteri counter can be changed too.

rasteri
addq.l #1,(a1)
moveq #0,d0
rts

With addq.l #2,(a1), the 50 Hz case needs one instruction fewer, but the 60 Hz version gets longer. Maybe there is a value that shortens the 60 Hz case and is short for 50 Hz too, but at present I don't have an idea.
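
For reference, the conversion that the end code applies to the tick counter (with the counter advancing by 1 per vertical blank) can be modelled like this. It is a Python sketch of the arithmetic only; the function name is made up, and any frequency other than 50 is treated as 60 Hz, as in the listing.

```python
def ticks_to_centiseconds(ticks, vblank_hz):
    """Convert vertical-blank ticks to 1/100 s, mirroring the listing's steps."""
    if vblank_hz == 50:
        return ticks * 2                          # add.l d5,d5
    # 60 Hz: 100/60 = 5/3, so compute ticks*5 and divide by 3
    quotient, remainder = divmod(ticks * 5, 3)    # divu.w #3,d5
    # lsr.w #2 on the remainder sets X only when the remainder is 2,
    # and negx/neg then add X to the quotient (round to nearest)
    return quotient + (1 if remainder == 2 else 0)

print(ticks_to_centiseconds(50, 50), ticks_to_centiseconds(60, 60))
```

Changing the increment in rasteri moves work between the two branches of this function, which is why a gain for 50 Hz tends to cost something for 60 Hz.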

a/b · 20 June 2021, 17:12

Unless it's one of those 'it works in this version but it's needed for that version', d5/d7 are only used as a word (where this is relevant, up to label .l7):

Code:

;	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
	move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
...
;	move.l d7,d5
	move.w	d7,d5

2 bytes shorter.

Another option to make it shorter is to place these three instructions just before the label .longdiv:

Code:

.write	move.l	(a4),d1
	add.l	a4,d2
	jmp	Write(a6)

.longdiv
...

and then replace d1/d2 init + jsr Write(a6) with bsr.b .write four times (in one case it'll be bsr.w, even if you do the initialization of d7=maxn prior ;\).
4*8-(3*2+1*4)-8 = 14 bytes shorter
This is all done early and it doesn't affect the speed.
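
The byte count can be spelled out as a small Python calculation. The instruction sizes are the standard 68k encodings for these addressing modes; the function name is made up for the illustration.

```python
INLINE_CALL = 2 + 2 + 4    # move.l (a4),d1 / add.l a4,d2 / jsr Write(a6)
SUBROUTINE = 2 + 2 + 4     # move.l (a4),d1 / add.l a4,d2 / jmp Write(a6)
BSR_B, BSR_W = 2, 4        # short and word-sized branch to subroutine

def bytes_saved(short_calls, long_calls):
    """Bytes saved by replacing inline call sequences with bsr to .write."""
    sites = short_calls + long_calls
    before = sites * INLINE_CALL
    after = short_calls * BSR_B + long_calls * BSR_W + SUBROUTINE
    return before - after

print(bytes_saved(3, 1))
```

If all four call sites could reach .write with bsr.b, the saving would be 2 bytes larger.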

edit: further size reduction...

Don_Adan · 21 June 2021, 03:45

More size optimisations from a/b, and litwr's idea is used for the time too.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         moveq #msg4-msg1,d3
         bsr .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3+1,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D4
         lea string(PC),A3
         move.b #10-$30,(A3)+  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #10,D1 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5

         move.b d5,(a3)+
         divu.w d1,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-$30,(a3)+      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D4
         divu.w d1,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32-$30,(A3)+           ; space (stored with the -$30 bias)
         moveq #1,D3
.next
         move.l (A4),D1            ; cout
         add.b #$30,-(A3)
         move.l A3,D2
         jsr Write(a6) 
         subq.l #1,D4
         bne.b .next

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)

Thomas Richter · 21 June 2021, 09:24

Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.

Don_Adan · 21 June 2021, 13:36

Quote:

Originally Posted by Thomas Richter

Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.

The code between the "; DMA off" and "; DMA on" comments is optimised for speed.
The code before "; DMA off" and after "; DMA on" is optimised only (or mostly) for size.
Single-character writes are not efficient, I know. In my previous version I wanted to use a single Write for the whole end text, but that text (the time value) has to be reversed first, and I don't see a short enough routine to reverse it. If any Amiga dos.library routine can display text in reverse order, the code can be changed. But I don't think the end-text code will get any shorter if another dos.library routine is used.
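
One possible way to avoid a separate reversal pass, sketched here in Python rather than 68k assembly, is to fill a fixed buffer from its end with a predecrementing pointer, so the digits come out least significant first but land in reading order. The function name and the buffer width of 8 are assumptions made for the illustration.

```python
def format_centiseconds(value, width=8):
    """Build: space, seconds, dot, two hundredths digits, newline.

    The buffer is filled from the end, so no reversal is needed.
    """
    buf = bytearray(width)
    pos = width

    def push(byte):
        nonlocal pos
        if pos == 0:
            raise ValueError("time value does not fit in the buffer")
        pos -= 1                   # predecrement, like move.b dN,-(a3)
        buf[pos] = byte

    push(ord("\n"))
    for _ in range(2):             # two fractional digits, lowest first
        value, digit = divmod(value, 10)
        push(ord("0") + digit)
    push(ord("."))
    while value:                   # integer digits; none if value is 0
        value, digit = divmod(value, 10)
        push(ord("0") + digit)
    push(ord(" "))
    return bytes(buf[pos:])        # start pointer and length for one write

print(format_centiseconds(1234))
```

The returned slice corresponds to a start address and a length, so the whole string could go out in a single Write call.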

Don_Adan · 21 June 2021, 14:10

Inspired by Thomas Richter, maybe even a few bytes shorter, if it works.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         moveq #msg4-msg1,d3
         bsr .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3+1,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D3
         lea string+8(PC),A3
         moveq #10,D1
         move.b D1,-(A3)  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #$30,D0 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5
        add.b D0,D5
         move.b d5,-(a3)
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         move.b #'.',-(a3)      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D3
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,-(A3)           ; space
         move.l (A4),D1            ; cout
         move.l A3,D2
         jsr Write(a6) 

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)

Cyprian · 21 June 2021, 16:19

@Don_Adan

I see the following error message:

Code:

error 2029 in line 33: branch destination out of range
>         bsr.b .write

Don_Adan · 21 June 2021, 20:17

Quote:

Originally Posted by Cyprian

@Don_Adan

I see the following error message:

Code:

error 2029 in line 33: branch destination out of range
>         bsr.b .write

Ok, thanks. By my manually calculated code length it is 2 bytes too long. Maybe some other small optimisation will be possible; for now I changed this branch to bsr.

Don_Adan · 21 June 2021, 20:25

Perhaps the first .write call is now within bsr.b range.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #-4,D4
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         moveq #msg4-msg1,d3
         bsr.b .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w D4,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D3
         lea string+8(PC),A3
         moveq #10,D1
         move.b D1,-(A3)  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8
         moveq #$30,D0
;        moveq #0,d7    ; not necessary D7 highword is already cleared
         divu.w d1,d5
         bvc.b .div32no

         swap d5
         move.w d5,d7
         divu.w d1,d7
         swap d7
         move d7,d5
         swap d5
         divu.w d1,d5
.div32no
         move.w d5,d7
         swap d5
         add.b D0,D5
         move.b d5,-(a3)
         divu.w d1,d7
         swap d7
         add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         move.b #'.',-(a3)      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D3
         divu.w d1,d7
         swap d7
         add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,-(A3)           ; space
         move.l (A4),D1            ; cout
         move.l A3,D2
         jsr Write(a6) 

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,buf-cout(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code


 cnop 0,4

cout dc.l 0
buf ds.b 4
time dc.l 0

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed',10
msg2
      even
getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
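
The `.longdiv` fallback in the listing above handles the case where `divu.w` overflows (quotient wider than 16 bits) by dividing the high word first and then feeding its remainder into a second 16-bit division. A minimal Python model of that idea, written only to check the arithmetic; the function names `divu_w` and `long_divide` are illustrative and not part of the program:

```python
def divu_w(dividend32, divisor16):
    """Model of 68k DIVU.W: 32-bit / 16-bit -> (quotient, remainder).

    Returns None when the quotient does not fit in 16 bits, which is the
    case where the real instruction sets the V flag.
    """
    q, r = divmod(dividend32 & 0xFFFFFFFF, divisor16 & 0xFFFF)
    if q > 0xFFFF:
        return None
    return q, r


def long_divide(dividend32, divisor16):
    """32-bit / 16-bit division giving a full 32-bit quotient.

    Fast path: one DIVU.W. Slow path (the .longdiv idea): divide the high
    word, then divide (remainder:low word). The second step cannot
    overflow because the carried remainder is smaller than the divisor.
    """
    fast = divu_w(dividend32, divisor16)
    if fast is not None:
        return fast
    hi, lo = dividend32 >> 16, dividend32 & 0xFFFF
    q_hi, r_hi = divu_w(hi, divisor16)
    q_lo, rem = divu_w((r_hi << 16) | lo, divisor16)
    return (q_hi << 16) | q_lo, rem


if __name__ == "__main__":
    print(long_divide(0x12345678, 7))
```

The model assumes a non-zero divisor, as the assembly does (the divisor there is always an odd number of at least 1).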

a/b · 21 June 2021, 21:41

-2 bytes due to alignment (extra zero at the end is never used):

Code:

;         moveq #msg2-msg3+1,d3
         moveq #msg2-msg3,d3
...
;msg3 dc.b ' digits will be printed'
;msg2 dc.b 10,0
msg3 dc.b ' digits will be printed',10
msg2
	even

Since there's some code in between that could be activated (dma on/off)... I'm not familiar with vasm syntax, in asm-one I'd do (and yes, you can do AO there and it will auto-correct it):

Code:

	IFGT	.write-*-128
	bsr.w	.write
	ELSE
	bsr.b	.write
	ENDC	; IFGT
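
A rough Python check of the range test in the `IFGT` above, assuming `*` evaluates to the address of the `bsr` instruction itself. On the 68k a short branch stores an 8-bit signed displacement measured from the instruction address plus 2, and a displacement byte of zero is reserved to mean "16-bit displacement follows". The helper names are made up for this sketch:

```python
def short_branch_fits(branch_addr, target_addr):
    """True if bsr.b at branch_addr can reach target_addr."""
    disp = target_addr - (branch_addr + 2)   # displacement is relative to PC+2
    return disp != 0 and -128 <= disp <= 127


def ifgt_picks_word(branch_addr, target_addr):
    """Mirror of 'IFGT .write-*-128': a positive value selects bsr.w."""
    return (target_addr - branch_addr - 128) > 0


if __name__ == "__main__":
    here = 0x1000
    for distance in (126, 128, 130, 200):
        print(distance, ifgt_picks_word(here, here + distance))
```

For forward branches to even addresses the two functions agree, so the conditional picks `bsr.w` exactly when the short form would be out of range. Backward branches are not covered by the expression in the post.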

Bruce Abbott · 21 June 2021, 23:39

Quote:

Originally Posted by Don_Adan

More size optimisations from a/b. And used litwr idea for time too.

This version is the same speed as the earlier one I tested (8.9 seconds without printing, 10 seconds with printing) but dramatically smaller at 700 bytes vs. 804. Well done!

But I see my work isn't done. I will test your latest version tonight.

Quote:

Originally Posted by Thomas Richter

Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.

That may be so, but it is interesting to note that printing 3000 digits takes ~1.1 seconds on my machine, which is only 11% of the total execution time. On a slower machine it should be an even smaller percentage because a large part of that time is taken up rendering to ChipRAM, which is proportionally slower on a faster machine.
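
The figures quoted in this post can be cross-checked with a few lines of Python. The timings and sizes are the ones stated above; the `slowdown` parameter is a purely hypothetical knob that assumes printing time stays fixed while compute time scales with CPU speed:

```python
COMPUTE_ONLY_S = 8.9      # seconds without printing (from the post)
WITH_PRINTING_S = 10.0    # seconds with printing (from the post)
OLD_SIZE, NEW_SIZE = 804, 700   # executable sizes in bytes (from the post)


def printing_share(compute_s, print_s):
    """Percentage of total run time spent printing."""
    return 100.0 * print_s / (compute_s + print_s)


def share_on_slower_cpu(slowdown):
    """Hypothetical: compute time grows by 'slowdown', print time does not."""
    print_s = WITH_PRINTING_S - COMPUTE_ONLY_S
    return printing_share(COMPUTE_ONLY_S * slowdown, print_s)


if __name__ == "__main__":
    print_s = WITH_PRINTING_S - COMPUTE_ONLY_S
    print(f"printing: {print_s:.1f} s, {printing_share(COMPUTE_ONLY_S, print_s):.0f}% of total")
    print(f"size saved: {OLD_SIZE - NEW_SIZE} bytes")
    print(f"share with a 2x slower CPU: {share_on_slower_cpu(2.0):.1f}%")
```

Under that assumption the printing share can only shrink as the CPU gets slower, which matches the argument in the post.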

Don_Adan · 22 June 2021, 00:28

Quote:

Originally Posted by a/b

-2 bytes due to alignment (extra zero at the end is never used):

Code:

;         moveq #msg2-msg3+1,d3
         moveq #msg2-msg3,d3
...
;msg3 dc.b ' digits will be printed'
;msg2 dc.b 10,0
msg3 dc.b ' digits will be printed',10
msg2
	even

Since there's some code in between that could be activated (dma on/off)... I'm not familiar with vasm syntax, in asm-one I'd do (and yes, you can do AO there and it will auto-correct it):

Code:

	IFGT	.write-*-128
	bsr.w	.write
	ELSE
	bsr.b	.write
	ENDC	; IFGT

OK, changed. Most assemblers automatically reassemble bsr.b as bsr.w. I prefer simple code and don't like automatic optimisations.

litwr · 18 June 2021, 20:21

First, thanks to the people who helped to optimize my code. I have just made a commit with some of Don_Adan's suggestions. However, I must note that I was invited to start this thread by meynaf.

[off topic removed - BippyM]

a/b · 20 June 2021, 17:12

Unless it's one of those 'it works in this version but it's needed for that version', d5/d7 are only used as a word (where this is relevant, up to label .l7):

Code:

;	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
	move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
...
;	move.l	d7,d5
	move.w	d7,d5

2 bytes shorter. Another option to make it shorter is to place these three just before label .longdiv:

Code:

.write
	move.l	(a4),d1
	add.l	a4,d2
	jmp	Write(a6)
.longdiv
...

and then replace d1/d2 init + jsr Write(a6) with bsr.b .write four times (in one case it'll be bsr.w, even if you do the initialization of d7=maxn prior ;)).

4*8-(3*2+1*4)-8 = 14 bytes shorter

This is all done early and it doesn't affect the speed.
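
The byte count in the post above can be re-derived from the instruction sizes. A small Python tally, assuming the usual 68k encodings (2 bytes for `move.l (a4),d1`, `add.l a4,d2` and `bsr.b`; 4 bytes for `jsr`/`jmp` with a 16-bit displacement and for `bsr.w`); the function name is illustrative:

```python
# Instruction sizes in bytes (standard 68k encodings, stated as assumptions)
MOVE_L_IND = 2      # move.l (a4),d1
ADD_L_REG = 2       # add.l a4,d2
JSR_D16 = 4         # jsr Write(a6)
JMP_D16 = 4         # jmp Write(a6)
BSR_B = 2           # bsr.b .write
BSR_W = 4           # bsr.w .write


def bytes_saved(call_sites=4, far_sites=1):
    """Bytes saved by sharing one .write routine across all call sites."""
    inline = call_sites * (MOVE_L_IND + ADD_L_REG + JSR_D16)
    calls = (call_sites - far_sites) * BSR_B + far_sites * BSR_W
    routine = MOVE_L_IND + ADD_L_REG + JMP_D16
    return inline - calls - routine


if __name__ == "__main__":
    print(bytes_saved())
```

With four call sites, one of which needs the word-sized branch, this reproduces the 14-byte saving.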



SpeedGeek · 12 June 2021, 14:31

@Thread I've edited the thread title so the 040 and 060 can be included as on-topic here. As you can see, the topic diversity is something to think about when the thread is created.

Don_Adan · 20 June 2021, 14:57

BTW, perhaps the rasteri counter can be changed too.

Code:

rasteri
      addq.l #1,(a1)
      moveq #0,d0
      rts

With addq.l #2,(a1) there is one instruction less for 50 Hz, but the 60 Hz version gets longer. Maybe there is a value that shortens the 60 Hz path and is still short for 50 Hz, but at present I have no idea.
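
For reference, a Python model of what the current timing code does with the vertical-blank count before printing it: at 50 Hz the count is doubled to get hundredths of a second, and at 60 Hz it is multiplied by 5, divided by 3 and rounded up when the remainder is 2 (the `lsr.w #2` / `negx.l` / `neg.l` sequence). The function name is made up, and the model assumes the quotient fits in 16 bits as `divu.w` requires:

```python
def ticks_to_centiseconds(ticks, vblank_hz):
    """Convert a vertical-blank tick count to hundredths of a second.

    Mirrors the listing: anything other than 50 Hz is treated as 60 Hz.
    """
    if vblank_hz == 50:
        return ticks * 2                 # add.l d5,d5
    q, r = divmod(ticks * 5, 3)          # (4*t + t) / 3 via divu.w #3
    return q + (1 if r == 2 else 0)      # remainder 2 sets X, which is added back


if __name__ == "__main__":
    print(ticks_to_centiseconds(500, 50))   # 10 seconds at 50 Hz
    print(ticks_to_centiseconds(600, 60))   # 10 seconds at 60 Hz
```

Counting in steps of 2 would make the 50 Hz branch a no-op, as the post says, while the 60 Hz branch would then need a factor of 5/6 instead of 5/3.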

Thomas Richter · 21 June 2021, 09:24

Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.
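
The effect of buffering on the number of low-level write calls can be illustrated with Python's own `io` layer. This is only an analogy for the Write() versus buffered-call distinction, not AmigaDOS code; the 4-byte chunk matches the four digits the program prints per call, and the 1024-byte buffer size and the class and function names are arbitrary choices for the sketch:

```python
import io


class CountingSink(io.RawIOBase):
    """Raw stream that counts how often write() is called on it."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)


def emit_digits(total_digits=3000, chunk=4, buffered=False):
    """Write total_digits bytes in chunk-sized pieces; return (raw calls, data)."""
    sink = CountingSink()
    out = io.BufferedWriter(sink, buffer_size=1024) if buffered else sink
    for _ in range(0, total_digits, chunk):
        out.write(b"1415"[:chunk])
    out.flush()
    return sink.calls, bytes(sink.data)


if __name__ == "__main__":
    print("unbuffered raw writes:", emit_digits(buffered=False)[0])
    print("buffered raw writes:  ", emit_digits(buffered=True)[0])
```

The output bytes are identical in both cases; only the number of calls that reach the underlying stream changes.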

Cyprian · 21 June 2021, 16:19

@Don_Adan I see the following error message:

Code:

error 2029 in line 33: branch destination out of range
> bsr.b .write

