English Amiga Board


Old 12 June 2021, 12:18   #301
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
This is not equivalent:
Code:
         move.l #(65536-(ra-start))/7,D7 ; D7=maxn
 ;        move.l #$10000-(ra-start),d7
 ;        divu.w #7*4,D7
 ;        lsl.l #2,D7    ; d7.w=maxn
Also missing ext.l between div and lsl.

8/7 = 1
(8/28)<<2 = 0
7777/7 = 1111
(7777/28)<<2 = 1108

It should be written as either of these:
Code:
	move.l	#((65536-(ra-start))/7)&(~3),D7		; d7=maxn
	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
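The arithmetic above can be checked with a short Python model (Python is used here only so the numbers can be run; the function names are made up for this sketch). It compares the plain division by 7 with the divide-by-28-then-shift sequence, and confirms that the two suggested replacements agree with each other for every 16-bit input.

```python
def maxn_plain(x):
    # What the single move.l line computes: x / 7
    return x // 7

def maxn_divide_then_shift(x):
    # What divu.w #7*4 followed by lsl.l #2 computes: (x / 28) << 2
    return (x // 28) << 2

def maxn_masked(x):
    # First suggested replacement: (x / 7) & ~3
    return (x // 7) & ~3

for x in (8, 7777):
    print(x, maxn_plain(x), maxn_divide_then_shift(x), maxn_masked(x))
```

Clearing the low two bits of x/7 is the same as dividing by 28 and multiplying by 4, which is why either replacement matches the commented-out sequence.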
Old 12 June 2021, 12:21   #302
Bruce Abbott
Registered User
 
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
Quote:
Originally Posted by meynaf View Post
But this is bringing us quite far from the original topic of 32-bit division...
This thread is supposed to be about division on 68020/030 only, but discussing the differences between them and other CPUs is not totally out of place. Also, the OP was interested in performance on 'real iron', so the effect of different types of memory etc. could be relevant (e.g. a base model A1200 vs. one with FastRAM or an accelerator card).

Here are a few more interesting timings. First, a straight copy from FastRAM to ChipRAM, which took 46 clock cycles per loop.

Code:
  lea     fastram,a0      ; a0 = pointer to fastram
  lea     chipram,a1      ; a1 = pointer to chipram
  move.w  #1000-1,d5      ; repeat inner loop code 1000 times
; -- inner loop --
loop:
  move.l  (a0)+,(a1)+     ; copy longword from fastram to chipram
  dbf     d5,loop
That's about 4.3MB per second, which isn't particularly impressive. Curiously, however, copying the data through a register was just as fast despite the extra instruction:
Code:
loop:
  move.l  (a0)+,d0        ; read longword from next fastram address
  move.l  d0,(a1)+        ; write longword to next chipram address
  dbf     d5,loop
So how many instructions can we add without increasing the copy time? The answer is, a lot!
Code:
loop:
  move.l  (a0)+,d0        ; read longword from next fastram address
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d0,(a1)+        ; write longword to next chipram address
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  move.l  d2,d2
  dbf     d5,loop
That's 13 'free' instructions that could be used to manipulate the data while copying it, or for some other purpose.

I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.

Maybe this analysis is a bit off topic, but it shows that when dealing with slow memory it pays to interleave data memory accesses with internal operations. The pi-spigot code has mostly register to register instructions and no consecutive data memory accesses in its inner loop, so it (fortunately?) has nothing to gain from this principle.
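As a rough cross-check of the 4.3MB/s figure, here is the arithmetic as a small Python sketch. The CPU clock is not stated in this post; the 50 MHz used below is an assumption, chosen only because it reproduces the quoted rate from 46 cycles per longword.

```python
def copy_rate_bytes_per_second(bytes_per_loop, cycles_per_loop, clock_hz):
    # One loop iteration moves bytes_per_loop bytes and takes cycles_per_loop clocks.
    return bytes_per_loop * clock_hz / cycles_per_loop

ASSUMED_CLOCK_HZ = 50_000_000  # assumption, not given in the post

rate = copy_rate_bytes_per_second(4, 46, ASSUMED_CLOCK_HZ)
print(f"{rate / 1e6:.2f} MB/s")
```

Because the loop time stays at 46 cycles with the 13 extra register moves, the same function gives the same rate for that version: the added instructions overlap with the memory access and do not lengthen the loop.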
Old 12 June 2021, 12:41   #303
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by a/b View Post
This is not equivalent:
Code:
         move.l #(65536-(ra-start))/7,D7 ; D7=maxn
 ;        move.l #$10000-(ra-start),d7
 ;        divu.w #7*4,D7
 ;        lsl.l #2,D7    ; d7.w=maxn
Also missing ext.l between div and lsl.

8/7 = 1
(8/28)<<2 = 0
7777/7 = 1111
(7777/28)<<2 = 1108

It should be written as either of these:
Code:
	move.l	#((65536-(ra-start))/7)&(~3),D7		; d7=maxn
	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
OK, fixed, thanks.
Ext.l is not necessary in this version, because D7 (and later D5) is handled as a word only.
Ext.l is only needed in litwr's version of the PR0000 routine, which uses divu.w; for the sub.w version it can be omitted.
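A small Python model may help show why the ext.l only matters when the register is later read as a longword. It assumes the usual DIVU.W result layout (remainder in the high word, quotient in the low word); the helper names are hypothetical and overflow is simply asserted away.

```python
def divu_w(dividend32, divisor16):
    # Model of DIVU.W: remainder ends up in the high word, quotient in the low word.
    quotient, remainder = divmod(dividend32 & 0xFFFFFFFF, divisor16 & 0xFFFF)
    assert quotient <= 0xFFFF, "overflow is not modelled in this sketch"
    return (remainder << 16) | quotient

def lsl_l(value, count):
    # Model of LSL.L: 32-bit logical shift left.
    return (value << count) & 0xFFFFFFFF

def low_word(value):
    return value & 0xFFFF

d7 = divu_w(7777, 28)            # quotient 277, remainder 21
shifted = lsl_l(d7, 2)           # remainder bits are still in the high word
cleaned = lsl_l(low_word(d7), 2) # high word cleared first, as ext.l does for a quotient below $8000

print(hex(shifted), low_word(shifted), cleaned)
```

The low word is 1108 either way, so code that only reads d7.w is unaffected; only a longword read would see the leftover remainder.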
Old 12 June 2021, 12:45   #304
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
This thread is supposed to be about division on 68020/030 only, but discussing the differences between them and other CPUs is not totally out of place. Also, the OP was interested in performance on 'real iron', so the effect of different types of memory etc. could be relevant (e.g. a base model A1200 vs. one with FastRAM or an accelerator card).
Ok then. Let's go for it.


Quote:
Originally Posted by Bruce Abbott View Post
I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.
I suppose the effect for the reads is the one mentioned by grond. It doesn't explain everything but it exists.
For the writes, we know why already.

Now perhaps it's possible to do better. What about :
Code:
loop
 move.l (a0)+,d0
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l d0,(a1)+
 move.l d1,(a1)+
 move.l d2,(a1)+
 dbf d5,loop
You could also attempt to enable/disable data burst, to see if this has a significant impact.
Old 12 June 2021, 14:31   #305
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
@Thread

I've edited the thread title so the 040 and 060 can be included as on-topic here. As you can see, the topic diversity is something to think about when the thread is created.
Old 18 June 2021, 20:21   #306
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
First, thanks to the people who helped to optimize my code. I have just made a commit with some of Don_Adan's suggestions. However, I must note that I was invited to start this thread by meynaf.


off topic removed - Bippym

Last edited by BippyM; 20 June 2021 at 14:57.
Old 19 June 2021, 20:05   #307
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Quote:
Originally Posted by Bruce Abbott View Post
I don't know where this effect is coming from, but it certainly could be useful. 4.3MB/s may not be so much of a bottleneck if you can combine it with some other processing.
I've noticed while coding for 68020 (but I'm assuming this holds for 68030+ as well to at least some extent) that execution time of code often doesn't increase or decrease when you add/remove instructions that only involve registers. I've always understood this to be due to the scheduler's way of being able to execute certain code while a read or write operation is in progress (at least, that's how I understood the Motorola 68020 manual).

How much of that you can do seems to depend at least on the clock speed of the CPU vs the speed of the RAM and whether or not the code is in the cache (i.e. on a Chip RAM only A1200 the effect can be quite extreme).

One thing that may also be worth considering is that some of the instructions involved in the code here write words (or even bytes) to memory. Combining these into a single, bigger write can be a lot faster on at least 68020/68030 (I think this also goes for 68040+, but I'm not too sure), especially if the writes are long word aligned. I did some tests on this and found that, for my code and word based results at least, the extra overhead of needing a register to store the half-results in and the extra need for some extra code to keep track of the half results properly still usually ended up with a notable speed increase.

Note however, I also found that the slower the RAM, the bigger the speed increase from doing this. If you have zero-wait-state RAM, the difference will be a lot smaller than say writing to Chip RAM.
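The idea of merging two word results into one longword write can be sketched in Python as below. This only models the resulting memory contents on a big-endian machine like the 68k; the function name is made up for the illustration and nothing here measures speed.

```python
import struct

def combine_words(first, second):
    # The word that belongs at the lower address goes into the high half,
    # because the 68k stores longwords big-endian.
    return ((first & 0xFFFF) << 16) | (second & 0xFFFF)

two_word_writes = struct.pack(">HH", 0x1234, 0x5678)
one_long_write = struct.pack(">I", combine_words(0x1234, 0x5678))

print(two_word_writes.hex(), one_long_write.hex())
```

Both byte strings are identical, so the only cost of combining is the spare register and the bookkeeping mentioned above.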

Last edited by roondar; 19 June 2021 at 20:14.
Old 19 June 2021, 23:17   #308
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
End code reworked, but untested.
Edit: it's buggy for now. The time string must be reversed.
Edit 2: perhaps fixed now. It can still be optimised by a few bytes, perhaps using litwr's idea.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)

	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn

.l20 
         move.l (A4),D1    ; cout
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D4
         lea string(PC),A3
         moveq #10,D1
         move.b D1,(A3)+  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #$30,D0
;         move.l d5,d6


 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5

        add.b D0,D5
         move.b d5,(a3)+
         divu.w d1,d7
         swap d7
        add.b D0,D7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.',(a3)+      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D4
         divu.w d1,d7
         swap d7
        add.b D0,D7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,(A3)+           ; space
         move.l   A3,D2

         moveq #1,D3
.next
         move.l (A4),D1            ; cout
         subq.l #1,D2
         jsr Write(a6) 
         subq.l #1,D4
         bne.b .next

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
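For readers following the .longdiv fallback in the listing above, here is a Python sketch of the same idea: try a single 32/16 divide first, and if the quotient does not fit in 16 bits, divide the high word and then the low word with the first remainder carried in. The helper names are hypothetical and the registers are not modelled one by one.

```python
def divu_w(dividend, divisor):
    # Model of DIVU.W: returns None when the quotient overflows 16 bits
    # (the real instruction sets V and leaves the register unchanged).
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        return None
    return quotient, remainder

def divide_32_by_16(dividend, divisor):
    fast = divu_w(dividend, divisor)
    if fast is not None:
        return fast
    high, low = dividend >> 16, dividend & 0xFFFF
    q_high, r_high = divu_w(high, divisor)               # cannot overflow: high <= $FFFF
    q_low, remainder = divu_w((r_high << 16) | low, divisor)  # cannot overflow: r_high < divisor
    return (q_high << 16) | q_low, remainder

print(divide_32_by_16(0xFFFFFFFF, 3))
```

The second divide is safe because the remainder of the first one is always smaller than the divisor, which keeps its quotient below 65536.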

Last edited by Don_Adan; 20 June 2021 at 14:40.
Old 20 June 2021, 14:57   #309
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
BTW, perhaps the rasteri counter can be changed too.

rasteri
addq.l #1,(a1)
moveq #0,d0
rts

With addq.l #2,(a1) instead, the 50 Hz path needs one instruction less, but the 60 Hz version gets longer. Maybe there is an increment that shortens the 60 Hz path while keeping the 50 Hz path short too, but at present I don't have an idea for one.
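To make the trade-off concrete, here is a Python sketch of the tick conversion as I read it from the listings: vblank ticks become hundredths of a second, times 2 at 50 Hz and times 5/3 (rounded up when the remainder is 2) at 60 Hz. The second function is a hypothetical variant for a counter that advances by 2 per interrupt; both names are invented for this sketch.

```python
def ticks_to_centiseconds(ticks, vblank_hz):
    # Counter incremented by 1 per vblank.
    if vblank_hz == 50:
        return ticks * 2
    quotient, remainder = divmod(ticks * 5, 3)   # 100/60 = 5/3
    return quotient + (1 if remainder == 2 else 0)

def doubled_counter_to_centiseconds(counter, vblank_hz):
    # Hypothetical variant: counter incremented by 2 per vblank.
    if vblank_hz == 50:
        return counter                            # no doubling needed any more
    quotient, remainder = divmod(counter * 5, 6)  # now 5/6 instead of 5/3
    return quotient + (1 if remainder == 4 else 0)

print(ticks_to_centiseconds(90, 60), doubled_counter_to_centiseconds(180, 60))
```

The 50 Hz path loses its doubling step, while the 60 Hz path has to divide by 6 instead of 3, which matches the observation that one path gets shorter and the other longer.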
Old 20 June 2021, 17:12   #310
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Unless it's one of those 'it works in this version but it's needed for that version', d5/d7 are only used as a word (where this is relevant, up to label .l7):
Code:
;	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
	move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn
...
;	move.l d7,d5
	move.w	d7,d5
2 bytes shorter.

Another option to make it shorter is to place these three just before label .longdiv:
Code:
.write	move.l	(a4),d1
	add.l	a4,d2
	jmp	Write(a6)

.longdiv
...
and then replace d1/d2 init + jsr Write(a6) with bsr.b .write four times (in one case it'll be bsr.w, even if you do the initialization of d7=maxn prior ;\).
4*8-(3*2+1*4)-8 = 14 bytes shorter
This is all done early and it doesn't affect the speed.
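The byte count above can be reproduced with a few lines of Python. The instruction sizes are the standard 68k encodings for these addressing modes; the constant and function names are only for this sketch.

```python
MOVE_L_IND_TO_D1 = 2   # move.l (a4),d1
ADD_L_A4_D2 = 2        # add.l a4,d2
JSR_WRITE = 4          # jsr Write(a6)
JMP_WRITE = 4          # jmp Write(a6)
BSR_B = 2              # bsr.b .write
BSR_W = 4              # bsr.w .write

def bytes_saved(call_sites=4, long_branches=1):
    inline_sequence = MOVE_L_IND_TO_D1 + ADD_L_A4_D2 + JSR_WRITE
    before = call_sites * inline_sequence
    shared_routine = MOVE_L_IND_TO_D1 + ADD_L_A4_D2 + JMP_WRITE
    after = (call_sites - long_branches) * BSR_B + long_branches * BSR_W + shared_routine
    return before - after

print(bytes_saved())
```

If every call site can reach .write with bsr.b, calling bytes_saved(long_branches=0) shows the saving grows by another 2 bytes.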

edit: further size reduction...

Last edited by a/b; 20 June 2021 at 18:03.
Old 21 June 2021, 03:45   #311
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
More size optimisations from a/b, and litwr's idea is now used for the time output too.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         moveq #msg4-msg1,d3
         bsr .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3+1,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D4
         lea string(PC),A3
         move.b #10-$30,(A3)+  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #10,D1 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5

         move.b d5,(a3)+
         divu.w d1,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-$30,(a3)+      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D4
         divu.w d1,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32-$30,(A3)+           ; space
         moveq #1,D3
.next
         move.l (A4),D1            ; cout
         add.b #$30,-(A3)
         move.l A3,D2
         jsr Write(a6) 
         subq.l #1,D4
         bne.b .next

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
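The PR000N routine in the listing above turns a value from 0 to 9999 into four ASCII digits by repeated subtraction, with the digit counters packed into one register that starts at $2f3a2f2f. The Python model below follows that register trick step by step; the helper names are invented for the sketch and only the low 16 bits of d5 are modelled.

```python
def pr000n_digits(value):
    d5 = value & 0xFFFF
    d0 = 0x2F3A2F2F

    def add_low_word(reg, amount):
        return (reg & 0xFFFF0000) | ((reg + amount) & 0xFFFF)

    def add_low_byte(reg, amount):
        return (reg & 0xFFFFFF00) | ((reg + amount) & 0xFF)

    def swap(reg):
        return ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF

    def count_down(reg, d5, step, bump):
        # Bump the digit once per subtraction until the subtraction borrows.
        while True:
            reg = bump(reg)
            borrow = d5 < step
            d5 = (d5 - step) & 0xFFFF
            if borrow:
                return reg, d5

    d0, d5 = count_down(d0, d5, 1000, lambda r: add_low_word(r, 0x0100))
    d5 = (d5 + 1000) & 0xFFFF                    # undo the last subtraction
    d0, d5 = count_down(d0, d5, 100, lambda r: add_low_byte(r, 1))
    d5 = (d5 + 100) & 0xFFFF
    d0 = swap(d0)
    d0, d5 = count_down(d0, d5, 10, lambda r: add_low_word(r, 0x0100))
    d0 = add_low_byte(d0, d5 & 0xFF)             # $3a + (units - 10) = '0' + units
    return d0.to_bytes(4, "big")

print(pr000n_digits(1234))
```

Starting the digit bytes at $2f (one below '0') absorbs the extra loop pass, and starting the units byte at $3a absorbs the final subtraction that is never added back.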

Last edited by Don_Adan; 21 June 2021 at 20:17.
Old 21 June 2021, 09:24   #312
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.
Old 21 June 2021, 13:36   #313
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by Thomas Richter View Post
Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.
The code between the "; DMA off" and "; DMA on" comments is optimised for speed.
The code before "; DMA off" and after "; DMA on" is optimised for size only (or mostly).
Single-character writes are not efficient, I know. In my previous version I wanted to use a single Write for the whole end text, but that text (the time value) first has to be reversed, and I don't see a short enough routine to reverse it. If any dos.library routine can display text in reverse order, the code can be changed, but I don't think the end-text code would get any shorter by using another dos.library routine.
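One way around the reversal, sketched here in Python rather than 68k code, is to fill the buffer from its end towards its start, so the digits that come out least significant first land in reading order and a single write suffices. The buffer size and function name are assumptions for the illustration.

```python
def format_time_backwards(centiseconds, size=16):
    buf = bytearray(size)
    pos = size

    def push(byte):
        # Equivalent of a predecrement store: move back one byte, then write.
        nonlocal pos
        if pos == 0:
            raise ValueError("buffer too small for this value")
        pos -= 1
        buf[pos] = byte

    push(10)                                  # newline goes last in the text
    value, digit = divmod(centiseconds, 10)
    push(0x30 + digit)                        # hundredths, low digit
    value, digit = divmod(value, 10)
    push(0x30 + digit)                        # hundredths, high digit
    push(ord("."))
    while value:                              # whole seconds, if any
        value, digit = divmod(value, 10)
        push(0x30 + digit)
    push(32)                                  # leading space
    return bytes(buf[pos:])

print(format_time_backwards(1234))
```

Note that with this loop shape a time below one second prints no digit before the dot.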
Old 21 June 2021, 14:10   #314
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Inspired by Thomas Richter, this is maybe even a few bytes shorter, if it works.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         moveq #msg4-msg1,d3
         bsr .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3+1,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D3
         lea string+8(PC),A3
         moveq #10,D1
         move.b D1,-(A3)  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #$30,D0 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5
        add.b D0,D5
         move.b d5,-(a3)
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         move.b #'.',-(a3)      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D3
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,-(A3)           ; space
         move.l (A4),D1            ; cout
         move.l A3,D2
         jsr Write(a6) 

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 ;msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
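The getnum routine in the listing above parses the typed number with a byte add onto $00D0, so that any non-digit character produces a value above 9 and is rejected with one unsigned compare. A Python sketch of that parsing loop follows; the function name is hypothetical, and the input is assumed to be what Read() returned, including the trailing newline.

```python
def getnum_parse(line):
    if not line:
        raise ValueError("this sketch expects at least one byte from Read()")
    remaining = len(line)
    result = 0
    pos = 0
    while True:
        remaining -= 1
        if remaining == 0:          # the last byte (the newline) is not parsed
            return result
        # add.b onto the low byte of $00D0: (char - '0') modulo 256, high byte stays 0
        digit = (0x100 - 0x30 + line[pos]) & 0xFF
        pos += 1
        if digit > 9:               # unsigned compare catches both '/' and ':' sides
            return 0
        result = (result * 10 + digit) & 0xFFFF

print(getnum_parse(b"100\n"))
```

Returning 0 on error fits the caller, which treats a zero digit count as invalid and asks again.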

Last edited by Don_Adan; 21 June 2021 at 20:17.
Old 21 June 2021, 16:19   #315
Cyprian
Registered User
 
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 171
@Don_Adan

I see following error message:
Code:
error 2029 in line 33: branch destination out of range
>         bsr.b .write
Old 21 June 2021, 20:17   #316
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by Cyprian View Post
@Don_Adan

I see following error message:
Code:
error 2029 in line 33: branch destination out of range
>         bsr.b .write
OK, thanks. By my manually calculated code length the target is 2 bytes out of range; maybe some other small optimisation will be possible. For now I changed this branch to bsr.
Old 21 June 2021, 20:25   #317
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Perhaps the first .write is now in bsr.b range.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.w	#((65536-(ra-start))/(7<<2))<<2,D7	; d7.w=maxn (moved here)
;call Write(stdout,buff,size)
         moveq #-4,D4
         moveq #msg1-cout,D2 ; check that this offset fits in moveq range (-128..127); the longest text can be moved to the end
         moveq #msg4-msg1,d3
         bsr.b .write
.l20 
         moveq #msg4-cout,D2
         moveq #msg5-msg4,d3
         bsr.b .write
         move.w d7,d5
         bsr.w PR0000
         moveq #msg5-cout,D2
         moveq #msg3-msg5,d3
         bsr.b .write
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20
         move.w d5,d1
         beq.b .l20
         addq.w #3,d5
         and.w D4,d5
         cmp.b #10,(a0)
         bne.b .l21
         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7
.l21
         bsr.w PR0000
         moveq #msg3-cout,D2
         moveq #msg2-msg3,d3
         bsr.b .write

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.write
         move.l (A4),D1 ; cout
         add.l A4,D2
         jmp Write(a6)

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1+3+1,D3
         lea string+8(PC),A3
         moveq #10,D1
         move.b D1,-(A3)  ; newline

         move.l d5,d0
         add.l D5,D5
         cmp.b #50,VBlankFrequency(a5)
         beq.b .l8

         add.l D5,D5      ;60 Hz
         add.l d0,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8   
         moveq #$30,D0 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d1,d5
     bvc.b .div32no

     swap d5
     move.w d5,d7
     divu.w d1,d7
     swap d7
     move d7,d5
     swap d5
     divu.w d1,d5
.div32no
     move.w d5,d7
     swap d5
        add.b D0,D5
         move.b d5,-(a3)
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         move.b #'.',-(a3)      ; dot
.l12     tst.w d7
         beq .l11

         addq.l #1,D3
         divu.w d1,d7
         swap d7
       add.b D0,D7
         move.b d7,-(a3)
         clr.w d7
         swap d7
         bra .l12

.l11
         move.b #32,-(A3)           ; space
         move.l (A4),D1            ; cout
         move.l A3,D2
         jsr Write(a6) 

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,buf-cout(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code


 cnop 0,4

 cout dc.l 0
 buf ds.b 4
 time dc.l 0

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed',10
msg2
      even
getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
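To make the inner loop easier to follow, here is a Python model of the spigot and of the .longdiv fallback. It is only a sketch under stated assumptions: the names `divu_w`, `long_div` and `pi_digits` are made up for illustration, and the array initialisation follows the well-known compact C spigot program (top slot zero), not the exact fill loop of the listing.

```python
def divu_w(dividend, divisor):
    """Model of DIVU.W: 32-bit / 16-bit -> (quotient, remainder).

    Returns None when the quotient does not fit in 16 bits (V flag set).
    Assumes the dividend fits in 32 bits and the divisor is non-zero.
    """
    q, r = divmod(dividend, divisor)
    return None if q > 0xFFFF else (q, r)


def long_div(dividend, divisor):
    """Full 32-bit quotient from two DIVU.W steps, as .longdiv does."""
    hi, lo = dividend >> 16, dividend & 0xFFFF
    q_hi, r = divu_w(hi, divisor)              # cannot overflow: hi < 65536
    q_lo, r = divu_w((r << 16) | lo, divisor)  # cannot overflow: r < divisor
    return (q_hi << 16) | q_lo, r


def pi_digits(digits):
    """Return decimal digits of pi, rounded up to a multiple of 4."""
    groups = (digits + 3) // 4
    n = groups * 14                 # N = 7*D/2 terms, 14 per 4-digit group
    r = [2000] * n + [0]            # assumption: layout of the compact C version
    out, carry = [], 0
    while n > 0:
        d = 0
        for i in range(n, 0, -1):
            b = 2 * i - 1                        # odd divisor
            d += r[i] * 10000                    # mulu.w d1,d0 / add.l d0,d5
            step = divu_w(d, b)                  # divu.w d4,d0
            q, rem = step if step is not None else long_div(d, b)  # bvs .longdiv
            r[i] = rem
            if i > 1:
                d = (d - q - rem) >> 1           # equals q*(i-1), no multiply needed
            else:
                d = q
        out.append("%04d" % (carry + d // 10000))
        carry = d % 10000
        n -= 14
    return "".join(out)
```

The subtract-and-shift step works because d - rem = q*b, so d - rem - q = q*(b-1) = q*(2i-2), and halving that gives q*(i-1). Calling `pi_digits(800)` reproduces the classic 800-digit run.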

Last edited by Don_Adan; 22 June 2021 at 00:26.
Old 21 June 2021, 21:41   #318
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
-2 bytes due to alignment (extra zero at the end is never used):
Code:
;         moveq #msg2-msg3+1,d3
         moveq #msg2-msg3,d3
...
;msg3 dc.b ' digits will be printed'
;msg2 dc.b 10,0
msg3 dc.b ' digits will be printed',10
msg2
	even
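The saving can be checked with a few lines of Python. This is just arithmetic on the string lengths, assuming the data starts on an even address (it follows `cnop 0,4` and three longwords); `padded` is a made-up helper that models the `even` directive.

```python
def padded(size):
    """Size after an `even` directive: round up to a multiple of 2."""
    return (size + 1) & ~1

common = [
    b"dos.library\0",
    b"number pi calculator v13\n",
    b"number of digits (up to ",
    b")? ",
]
before = common + [b" digits will be printed", b"\n\0"]  # msg3, then msg2 dc.b 10,0
after = common + [b" digits will be printed\n"]          # newline folded into msg3

size_before = padded(sum(len(s) for s in before))  # 89 bytes, padded to 90
size_after = padded(sum(len(s) for s in after))    # 88 bytes, already even
saved = size_before - size_after
```

Dropping the unused zero removes one byte of data, and that in turn removes the pad byte, hence 2 bytes in total.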
Since there's some code in between that could be activated (DMA on/off), the branch distance may change. I'm not familiar with vasm syntax; in Asm-One I'd do this (and yes, you can use AO there and it will auto-correct it):
Code:
	IFGT	.write-*-128
	bsr.w	.write
	ELSE
	bsr.b	.write
	ENDC	; IFGT
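For reference, the range rule behind that conditional can be modelled in Python. This is a sketch of the 68000 short-branch encoding, with `short_branch_ok` being a hypothetical helper name: the 8-bit displacement is signed, is measured from the instruction address plus 2, and the value 0 is reserved to signal that a 16-bit displacement follows.

```python
def short_branch_ok(instr_addr, target_addr):
    """True if a Bcc.B/BSR.B at instr_addr can reach target_addr."""
    disp = target_addr - (instr_addr + 2)  # displacement is relative to PC+2
    # 8-bit signed range; 0 means "word displacement follows", so it is excluded
    return -128 <= disp <= 127 and disp != 0
```

For forward branches to even addresses this agrees with the `.write-*-128` test: a distance of 128 bytes from the instruction still fits (displacement 126), while 130 does not.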
Old 21 June 2021, 23:39   #319
Bruce Abbott
Registered User
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
Quote:
Originally Posted by Don_Adan View Post
More size optimisations from a/b. And used litwr idea for time too.
This version is the same speed as the earlier one I tested (8.9 seconds without printing, 10 seconds with printing) but dramatically smaller at 700 bytes vs. 804. Well done!

But I see my work isn't done. I will test your latest version tonight.
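As a side note on how the listing reports its time: it counts vertical-blank interrupts and converts the count to hundredths of a second. The conversion can be modelled roughly as below; `ticks_to_centiseconds` is a made-up name, and anything other than 50 Hz is treated as 60 Hz, as in the code.

```python
def ticks_to_centiseconds(ticks: int, hz: int) -> int:
    """Convert a VBlank tick count to hundredths of a second."""
    if hz == 50:
        return 2 * ticks                  # 1 tick = 2/100 s
    q, rem = divmod(5 * ticks, 3)         # 60 Hz: 1 tick = 5/3 hundredths
    return q + (1 if rem == 2 else 0)     # round to nearest, as lsr.w/negx/neg do
```

In the assembly, `lsr.w #2` on the remainder leaves bit 1 in the X flag, so the `negx.l` / `neg.l` pair adds 1 to the quotient exactly when the remainder is 2.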

Quote:
Originally Posted by Thomas Richter
Do you really run byte output over Write()? This is not advisable, Write() makes a context switch for every single call. Please see FPutC/Printf/FPrintf or related *buffered* calls from the dos.library that are much more efficient for single-character output.
That may be so, but it is interesting to note that printing 3000 digits takes ~1.1 seconds on my machine, which is only 11% of the total execution time. On a slower machine it should be an even smaller percentage because a large part of that time is taken up rendering to ChipRAM, which is proportionally slower on a faster machine.
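A quick back-of-the-envelope check of those figures, as a sketch only. It assumes the roughly 1.1 s is spent entirely in Write() and that digits go out four per call, as PR000N does in the listing; both are simplifications.

```python
total_s = 10.0          # run time with printing
printing_s = 1.1        # time attributed to printing 3000 digits
digits = 3000
digits_per_write = 4    # assumption: one Write() per 4-digit group

calls = digits // digits_per_write          # number of Write() calls
share = printing_s / total_s                # fraction of run time spent printing
per_call_ms = printing_s / calls * 1000.0   # average cost of one Write() call

print(f"{calls} calls, {share:.0%} of run time, {per_call_ms:.2f} ms per call")
```

Under these assumptions each call costs about a millisecond and a half, which includes the console rendering and not just the context switch.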
Old 22 June 2021, 00:28   #320
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by a/b View Post
-2 bytes due to alignment (extra zero at the end is never used):
Code:
;         moveq #msg2-msg3+1,d3
         moveq #msg2-msg3,d3
...
;msg3 dc.b ' digits will be printed'
;msg2 dc.b 10,0
msg3 dc.b ' digits will be printed',10
msg2
	even
Since there's some code in between that could be activated (DMA on/off), the branch distance may change. I'm not familiar with vasm syntax; in Asm-One I'd do this (and yes, you can use AO there and it will auto-correct it):
Code:
	IFGT	.write-*-128
	bsr.w	.write
	ELSE
	bsr.b	.write
	ENDC	; IFGT
OK, changed. Most assemblers automatically reassemble bsr.b as bsr.w. I prefer simple code and don't like automatic optimisations.
 

