Optimizing the 68020+ 32-bit math - Page 15

Don_Adan · 08 June 2021, 14:24

Quote:

Originally Posted by Bruce Abbott

Since this thread was originally supposed to be about optimizing 32 bit division (and because I am sick of seeing 300 digits of pi on my screen) I decided that for initial comparisons I would measure execution times without printing the digits. This saves the hassle of having to set the CLI window up exactly the same for each run (to avoid possible variations due to scrolling time etc.). Not printing the digits only made the code run about 8% faster, so any optimization in this area will probably make little difference.

I tested 3 code bases, litwr's V1 and V4 written in 2018 and Don_Adan's V12[BETA3] from post #265, on my A1200 with WB3.0 and 50MHz Blizzard 1230-IV. Results are rounded to the nearest 0.1 second.

litwr V1: 9.7 seconds
litwr V4: 8.9 seconds
Don_Adan V12b3: 8.9 seconds

This suggests that there is little opportunity for further significant improvement in execution speed of the core algorithm.

Just for fun I also tested it under different operating conditions. Normally I run The Enforcer on my system to warn me about programs trashing low memory. This can noticeably slow down programs that do a lot of legitimate low memory access. With The Enforcer running litwr's V4 code took 9.0 seconds to execute, which is ~1% slower.

I also tried disabling CPU caches, and executing from ChipRAM. The results were a little surprising. Disabling the data cache had little effect, but disabling the instruction cache increased execution time to 12.4 seconds or ~28% slower. This shows that even with the fast 60ns RAM on the Blizzard 1230-IV, getting critical code to fit inside the instruction cache can greatly speed it up.

But what really surprised me was the effect of running from Chip RAM. I expected a massive slowdown, but it wasn't that bad - 12.7 seconds or ~30% slower, not much worse than running from FastRAM with CPU caches off. Unless The Enforcer was running, then execution time ballooned out to 49.6 seconds whether the instruction cache was on or off. That's 5.57 times slower!

The version from post 265 is almost the same as litwr's version, because the small speed optimisations were cancelled out by replacing dbra with subq.l/bne.b in the init part. Only the version from post 269 (with the new PR0000 routine) can be faster, but not necessarily. To me the main loop looks good; only the full loop (with PR0000) can be a little faster.

SpeedGeek · 08 June 2021, 15:50

Quote:

Originally Posted by meynaf

Design bug, not really. That pin can not be changed by the underlying hardware before it has decoded the target address, and by that time the 68030 is probably already executing other instructions so it does not know about the posted write anymore - keeping track of it would probably have been too complex or costly.
To handle this - it's not usually a problem in real life code - the 68030 has DC_WA (Write Allocate) bit in CACR.

Within the limits of "Natural" off-topic discussion (and this thread is already pushing those limits), IMO the 68030 Write-Allocate mode is certainly required for Supervisor mode compatibility.

However, the 68030 CIIN feature/bug is over-rated and rarely causes any problems. Therefore, it makes sense to implement any performance reducing CIIN fixes only when absolutely required (e.g. BridgeBoards and the tiny number of Zorro bus I/O cards which failed to provide their own hardware or software solutions for the problem).

Thomas Richter · 08 June 2021, 16:24

It goes beyond that. Various graphics cards are affected as well. Look into the CVision3D manual, for example. It asks users to run "Enforcer" (back then) to have an operational card on the 68030.

Bruce Abbott · 10 June 2021, 11:49

Quote:

Originally Posted by Don_Adan

Only the version from post 269 (with the new PR0000 routine) can be faster, but not necessarily. To me the main loop looks good; only the full loop (with PR0000) can be a little faster.

Speed appears to be identical to your earlier code. Doesn't look like we are going to squeeze any more blood out of this stone.

However, the file size of your version is significantly smaller than litwr's, and possibly could be made even smaller. You tried replacing move.l #msgxx,d2 with moveq #msg1-cout,D2 : add.l A4,D2 , but this didn't work because (except in one case) the addresses are too far apart. I replaced it with lea msgxx(pc),A0 : move.l a0,d2 which is the same size as the original code but saves a reloc32 entry.

I see a few other places where a few bytes could be saved, but I can't be bothered. It is only 804 bytes now (down from the original 924) which is not bad for what it does.
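
To make the size trade-off concrete, here is a rough sketch of the three ways of getting a message address into D2 mentioned above. msgxx stands for any of the message labels and A4 is assumed to hold the address of cout, as in the listings in this thread. The byte counts are for the instructions alone. Treat this as an illustration, not the exact code that was used.

Code:

; (a) absolute address: 6 bytes of code, plus one reloc32 entry in the executable
         move.l #msgxx,d2

; (b) base-relative: 4 bytes, no reloc entry, but only works if
;     msgxx-cout fits in moveq's -128..127 range
         moveq #msgxx-cout,d2
         add.l a4,d2              ;a4 = address of cout

; (c) PC-relative: 6 bytes, no reloc entry, msgxx can be up to 32K away,
;     at the cost of using a0 as a scratch register
         lea msgxx(pc),a0
         move.l a0,d2

Variant (b) is the smallest, which is presumably why it was tried first. Variant (c) costs the same code bytes as (a) and only wins by dropping the relocation.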

Don_Adan · 10 June 2021, 16:00

Test version, added a/b optimisation. More changes after dinner.

Moved the D2 and D3 setup out of the full loop, so it may be a little faster now.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)
         move.l #$10000-(ra-start),D7
         divu.w #7*4,D7
         lsl.l #2,D7    ; d7.w=maxn

.l20 
;    move.l cout(pc),d1
         move.l (A4),D1    ; cout
;         move.l #msg4,d2
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
;         move.l #msg5,d2
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21     bsr.w PR0000
         move.l (A4),D1 ; cout
;         move.l #msg3,d2
         moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2      sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1,d3
         move.l cout(pc),d1
         move.l #msgx,d2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l #1,d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l #1,d5      ;60 Hz
         add.l d3,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         moveq.l #10,d4
         move.l d5,d6

;div32x16 macro    ;D7=D6/D4, D6=D6%D4
 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d4,d6
     bvc.b .div32no

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move d7,d6
     swap d6
     divu.w d4,d6
.div32no
     move.w d6,d7
;     clr.w d6 ;not necessary
     swap d6

         move.b d6,(a3)+
         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst.w d7
         beq .l11

         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
         move.l cout(pc),d1
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

         move.l cout(pc),d1
         move.l #msgx+1,d2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
 lea $100.W,A0
 move.l #$303A3030,D0
 move.w #1000,D1
b1000
 sub.w D1,D5
 bcs.b n100
 add.w A0,D0
 bra.b b1000

n100
 add.w D1,D5
 moveq #100,D1
b100
 sub.w D1,D5
 bcs.b n10
 addq.b #1,D0
 bra.b b100

n10
 add.w D1,D5
 swap D0
 moveq #10,D1
b10
 sub.w D1,D5
 bcs.b n1
 add.w A0,D0
 bra.b b10
n1
 add.b D5,D0

 move.l D0,4(A4) ; buf
 move.l (A4),D1    ; cout
 jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra

getnum   jsr Input(a6)          ;get stdin
;         move.l #string,d2     ;set by previous call

         moveq #msg1-cout,D2
         add.l A4,D2
         move.l d0,d1
         moveq #5,d3     ;+ newline
         jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13'
  dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10
Buffy
     dcb.b 65536-(Buffy-start)

a/b · 10 June 2021, 17:33

Any particular reason why DCB.B 65536-... and not DS.B instead? As far as I can see the buffer is filled with 2000s, so why not make the executable significantly shorter?

Also, this is shorter (and faster, not that it matters much here):

Code:

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
	move.w	#$0100,a0
	move.l	#$2f3a2f2f,d2
	move.w	#1000,d3
.b1000	add.w	a0,d2
	sub.w	d3,d5
	bcc.b	.b1000
	add.w	d3,d5

	moveq	#100,d3
.b100	addq.b	#1,d2
	sub.w	d3,d5
	bcc.b	.b100
	add.w	d3,d5

	swap	d2
	moveq	#10,d3
.b10	add.w	a0,d2
	sub.w	d3,d5
	bcc.b	.b10
	add.b	d5,d2

	lea	cout(pc),a0
...

Don_Adan · 10 June 2021, 22:52

Added more a/b optimisations. DCB.B was used in the GitHub version. I made the changes (cleaning up the code) step by step. Now ds.b is used.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)
         move.l #$10000-(ra-start),D7
         divu.w #7*4,D7
         lsl.l #2,D7    ; d7.w=maxn

.l20 
;    move.l cout(pc),d1
         move.l (A4),D1    ; cout
;         move.l #msg4,d2
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
;         move.l #msg5,d2
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21
         bsr.w PR0000
         move.l (A4),D1 ; cout
;         move.l #msg3,d2
         moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2      sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1,d3
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
         move.l #msgx,d2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l #1,d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l #1,d5      ;60 Hz
         add.l d3,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         moveq.l #10,d4
         move.l d5,d6

;div32x16 macro    ;D7=D6/D4, D6=D6%D4
 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d4,d6
     bvc.b .div32no

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move d7,d6
     swap d6
     divu.w d4,d6
.div32no
     move.w d6,d7
;     clr.w d6 ;not necessary
     swap d6

         move.b d6,(a3)+
         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst.w d7
         beq .l11

         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
         move.l #msgx+1,d2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra

getnum   jsr Input(a6)          ;get stdin
;         move.l #string,d2     ;set by previous call

         moveq #msg1-cout,D2
         add.l A4,D2
         move.l d0,d1
         moveq #5,d3     ;+ newline
         jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13'
  dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10

Buffy
     ds.b 65536-(Buffy-start)

Bruce Abbott · 11 June 2021, 00:01

Quote:

Originally Posted by grond

I believe this possibility only exists in the 060 but I would be happy to be corrected.

Here's a little puzzle for you. The following code takes 28 seconds to execute on my A1200 with 50MHz 68030 (Blizzard 1230-IV), equating to ~28 clock cycles per inner loop. Without the move.l d0,(a0)+ it takes 6 seconds, suggesting a loop overhead of 6 cycles for the dbf plus 22 cycles for the move.l d0,(a0)+ to ChipRAM.

Code:

 section 1,code
 opt d+
 MC68020
 output ram:profile

_main:
  move.w  #50000-1,d6     ; 50000*1000 = 1 second per clk at 50MHz
; -- outer loop --
oloop:
  lea     chipram,a0      ; a0 = pointer to chipram
  move.l  #$12345678,d0   ; d0 = value to write
  move.w  #1000-1,d5      ; repeat inner loop code 1000 times
; -- inner loop --
loop:
  move.l  d0,(a0)+        ; write longword to next chipram address
  dbf     d5,loop
; -- outer loop --
  dbf     d6,oloop
  moveq   #0,d0
  rts

  align.l

fastram:
  ds.l    10000

  section 2,DATA_C

chipram:
  ds.l    10000

Now, how long do you think this code takes to execute:-

Code:

 section 1,code
 opt d+
 MC68020
 output ram:profile

_main:
  move.w  #50000-1,d6     ; 50000*1000 = 1 second per clk at 50MHz
; -- outer loop --
oloop:
  lea     chipram,a0      ; a0 = pointer to chipram
  move.l  #$12345678,d0   ; d0 = value to write
  move.w  #1000-1,d5      ; repeat inner loop code 1000 times
; -- inner loop --
loop:
  move.l  d0,(a0)+        ; write longword to next chipram address
  move.l  d0,d1
  move.l  d1,d2
  move.l  d2,d3
  move.l  d3,d4
  move.l  d4,d3
  move.l  d3,d2
  move.l  d2,d1
  move.l  d1,d0
  move.l  d0,d1
  move.l  d1,d2
  dbf     d5,loop
; -- outer loop --
  dbf     d6,oloop
  moveq   #0,d0
  rts

  align.l

fastram:
  ds.l    10000

  section 2,DATA_C

chipram:
  ds.l    10000

Thorham · 11 June 2021, 00:54

Quote:

Originally Posted by Bruce Abbott

Now, how long do you think this code takes to execute:

Still 22 cycles? Or not much more? The extra moves should take 20, but get executed during the write.

grond · 11 June 2021, 08:29

The extra moves don't make it slower which hasn't anything to do with the processor continuing with the work while the chipmem write is still pending. The extra moves don't seem to consume time because after the first chipmem write the processor and chipmem are in sync meaning the processor can waste cycles on something else before chipmem is even ready to take the next access. While this may appear to be the same as continuing work while the chipmem is pending, it is not. Writing 3.5 or 7 MB to chipmem on the 030 will still block 50M processor cycles on the 030 while the 060 can make those chipmem writes for (almost) free. The 060 can do it because it simply leaves handling the write to a unit that does not block the execution pipeline.

Bruce Abbott · 11 June 2021, 08:42

Quote:

Originally Posted by Thorham

Still 22 cycles? Or not much more? The extra moves should take 20, but get executed during the write.

Yes, it still takes only 28 seconds or 28 cycles per loop, which means those 10 extra instructions are executing out of cache (using only 2 clocks per instruction) while the external bus is completing the write.

Imagine you had a program that reads some data from FastRAM, performs several operations on it in registers, then writes it to ChipRAM. If you parse all the data first and then write it to ChipRAM as a block, the CPU will have to wait for ~20 cycles during each write. But if you interleave the ChipRAM writes with other code you could execute up to 10 instructions 'for free' while each write is in progress - so long as the instructions and data being worked on are in the cache. You could have code that appears to be less efficient (because it needs more instructions to interleave the operations) but actually runs much faster.

The effect is much less when accessing only FastRAM, but in some situations the order of instructions could still make a difference, particularly on machines with relatively slow 'fast' RAM.

So what does this mean for pi-spigot? Firstly it explains why the performance hit from running in ChipRAM is not nearly as much as you might expect. Secondly, reducing the number of instructions and/or using what appear to be 'faster' instruction sequences may not necessarily result in the fastest code.
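
A minimal sketch of the interleaving idea, assuming the source buffer is in FastRAM, the destination is in ChipRAM and the loop runs out of the instruction cache. The labels src and dst, the COUNT constant, the mask in d2 and the rol/eor 'processing' are all made up for illustration. The only point is the ordering: the FastRAM read comes before the ChipRAM write, and the register-only work comes straight after it.

Code:

 section 1,code

COUNT = 1000                    ; hypothetical number of longwords (must be >= 2)

interleave:
  lea     src,a0                ; a0 = source, assumed to be in FastRAM
  lea     dst,a1                ; a1 = destination in ChipRAM
  move.l  #$00ff00ff,d2         ; hypothetical mask used by the 'processing'
  move.w  #COUNT-2,d7           ; dbf then runs the loop COUNT-1 times
  move.l  (a0)+,d0              ; fetch and process the first longword
  rol.l   #8,d0
  eor.l   d2,d0
.next:
  move.l  (a0)+,d1              ; FastRAM read, done before the write is issued
  move.l  d0,(a1)+              ; start the slow ChipRAM write
  rol.l   #8,d1                 ; register-only work, meant to overlap the write
  eor.l   d2,d1
  move.l  d1,d0                 ; processed value becomes the next one to write
  dbf     d7,.next
  move.l  d0,(a1)+              ; write the last processed longword
  rts

src:
  ds.l    COUNT

 section 2,DATA_C

dst:
  ds.l    COUNT

Going by the measurements above, the three register instructions after the write should cost nothing extra, and there would be room for several more. Whether a real routine gains anything depends on how much register-only work it has available between writes.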

meynaf · 11 June 2021, 08:52

Quote:

Originally Posted by grond

The extra moves don't make it slower which hasn't anything to do with the processor continuing with the work while the chipmem write is still pending. The extra moves don't seem to consume time because after the first chipmem write the processor and chipmem are in sync meaning the processor can waste cycles on something else before chipmem is even ready to take the next access. While this may appear to be the same as continuing work while the chipmem is pending, it is not. Writing 3.5 or 7 MB to chipmem on the 030 will still block 50M processor cycles on the 030 while the 060 can make those chipmem writes for (almost) free. The 060 can do it because it simply leaves handling the write to a unit that does not block the execution pipeline.

If this is true, then a fastmem access should be doable for free too, not only register access. But it's not. Any memory access, even near the end of the loop, completely blocks until the write is completed.

Bruce Abbott · 11 June 2021, 09:24

Quote:

Originally Posted by grond

While this may appear to be the same as continuing work while the chipmem is pending, it is not. Writing 3.5 or 7 MB to chipmem on the 030 will still block 50M processor cycles on the 030

The actual write to ChipRAM must occur at the bus speed of 560ns between CPU slots, so writing a large block of data cannot be done any faster. The question is can a fast CPU use some of that time to execute other instructions as well, and the answer is yes, it can. That means if you have more work to do than just writing a block of data to ChipRAM, it may pay to interleave it with the block write rather than doing one after the other.

Quote:

while the 060 can make those chipmem writes for (almost) free. The 060 can do it because it simply leaves handling the write to a unit that does not block the execution pipeline.

I can't see how it can be 'free' if the ChipRAM itself cannot accept more than one write per 560ns. If you write a large block then eventually it will have to slow down to that speed (is that what you meant by 'almost' free?).

What I don't know is whether there are any accelerator cards which latch the write data on the ChipRAM side of the bus (perhaps even before the CPU slot is ready) and release the CPU side so it can continue accessing its own FastRAM at full speed. I suspect the Blizzard 1230-IV doesn't do this, but I will do more tests to confirm it.

grond · 11 June 2021, 12:08

Quote:

Originally Posted by meynaf

If this is true, then a fastmem access should be doable for free too, not only register access. But it's not. Any memory access, even near the end of the loop, completely blocks until the write is completed.

What "this" are you referring to? I wrote several things and I'm not sure which one you are addressing.

All Motorola 68k CPUs will use some cycles for processing the instruction (probably just 1 cycle for the 060 and something like 4 cycles for the 030). My point is the 060 can completely hide the extra cycles for the external bus (which are many for chipmem writes) from the instruction pipeline while the 030 cannot.

grond · 11 June 2021, 12:18

Quote:

Originally Posted by Bruce Abbott

The actual write to ChipRAM must occur at the bus speed of 560ns between CPU slots, so writing a large block of data cannot be done any faster. The question is can a fast CPU use some of that time to execute other instructions as well, and the answer is yes, it can. That means if you have more work to do than just writing a block of data to ChipRAM, it may pay to interleave it with the block write rather than doing one after the other.

The real question is whether the 030 has the same capacity as the 060, i.e. can it hide slow memory access times? No, it cannot. The 030 is stalled for more cycles if RAM is slow while the 060 is not stalled (as long as it doesn't need the memory bus again before the slow memory is ready to take more data). The amount of processor cycles it takes an 060 to write 7MB per second to chipmem is the same as that for writing 7MB per second to fastmem if the writes are evenly spaced.

Quote:

I can't see how it can be 'free' if the ChipRAM itself cannot accept more than one write per 560ns. If you write a large block then eventually it will have to slow down to that speed (is that what you meant by 'almost' free?).

On the 060 it can be free (except for the processor cycle to execute the move-instruction). This is used in all those shiny 060 demos, they execute the c2p in fastmem and then just intersperse move.l (fastmem-Ax)+,(chipmem-Ay)+ instructions in a work routine. Since the CPU is doing real work, the chipmem writes become virtually free and the 060 can move 7MB to chipmem per second and still do 99% of CPU work. On the 030 this doesn't work because writing 7MB to chipmem will eat almost all usable CPU time as the CPU does not continue processing instructions for the time the bus cycle takes to complete. Remember only every other chipmem cycle goes to the CPU, hence, you can naturally execute processor instructions on the 030 for about half of the CPU clock cycles.

meynaf · 11 June 2021, 12:28

Quote:

Originally Posted by grond

What "this" are you referring to? I wrote several things and I'm not sure which one you are addressing.

All Motorola 68k CPUs will use some cycles for processing the instruction (probably just 1 cycle for the 060 and something like 4 cycles for the 030). My point is the 060 can completely hide the extra cycles for the external bus (which are many for chipmem writes) from the instruction pipeline while the 030 cannot.

You said the 68030 is not able to continue execution while a write is pending and that it's just timing sync. From my experience -- not the case.
The proof is that if you attempt to access memory during these cycles, even from fastmem (actually even from data cache !), it will stop until the write is complete (i.e. you can't hide any memory access).

Quote:

Originally Posted by grond

On the 030 this doesn't work because writing 7MB to chipmem will eat almost all usable CPU time as the CPU does not continue processing instructions for the time the bus cycle takes to complete. Remember only every other chipmem cycle goes to the CPU, hence, you can naturally execute processor instructions on the 030 for about half of the CPU clock cycles.

This is not true. The only condition is to only work in registers.
A 50MHz 030 takes 28 clocks to perform an access to chipmem. Of these, at least 22 are free for use. Far from your "about half".
With 60ns fastmem you can hide 4 cycles out of 8.

grond · 11 June 2021, 13:25

Quote:

Originally Posted by meynaf

You said the 68030 is not able to continue execution while a write is pending and that it's just timing sync. From my experience -- not the case.
The proof is that if you attempt to access memory during these cycles, even from fastmem (actually even from data cache !), it will stop until the write is complete (i.e. you can't hide any memory access).

This is interesting and it looks like I was wrong. The 030 not being able to read from data cache may be the key aspect here. A move.l (fastmem-Ax)+,(chipmem-An)+ on an 030 doesn't leave any available CPU cycles because of the fastmem reads even though the 030 should be able to burst read cache lines from fastmem. This may be the important difference when compared to the 060 (and perhaps 040).

meynaf · 11 June 2021, 13:46

Quote:

Originally Posted by grond

This is interesting and it looks like I was wrong. The 030 not being able to read from data cache may be the key aspect here. A move.l (fastmem-Ax)+,(chipmem-An)+ on an 030 doesn't leave any available CPU cycles because of the fastmem reads even though the 030 should be able to burst read cache lines from fastmem. This may be the important difference when compared to the 060 (and perhaps 040).

The advantage of the 040 and 060 is that they are able to continue even if there are other memory accesses, and the push buffer (at least on the 060) can contain several items.
You can however make things better on 030 by grouping fastmem reads together.
And if you use data burst you can even insert register-only instructions between the reads and get a little speed gain.

But this is bringing us quite far from the original topic of 32-bit division...

Thomas Richter · 11 June 2021, 20:18

Quote:

Originally Posted by meynaf

The advantage of the 040 and 060 is that they can continue even if there are other memory accesses, and the push buffer (at least on the 060) can hold several items.

As said before, the 040 also has a push buffer, same as the 060. It is one cache line (4 longwords) in size, but is only used for pushing dirty cache lines. In addition, the 040 has a three-stage pipeline in which data written out is buffered (WB3 to WB1). These buffers are used for cached and non-serialized write accesses. If non-serialized, a read can overtake a write. That is *not* the case for the 060, where reads and writes are always in strict order. The 060 has an imprecise mode, though.

Don_Adan · 12 June 2021, 05:13

Small size optimisations. Still to do: the end part, i.e. the routine that converts VBI ticks to time.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

start
         lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         moveq #msg1-cout,D2 ; offset must fit in moveq range (-128..127); the longest text can be moved to the end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)
 ;        move.l #$10000-(ra-start),d7
 ;        divu.w #7*4,D7
 ;        lsl.l #2,D7    ; d7.w=maxn
	move.l	#((65536-(ra-start))/(7<<2))<<2,D7	; d7=maxn

.l20 
         move.l (A4),D1    ; cout
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21
         bsr.w PR0000
         move.l (A4),D1 ; cout
         moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2
         moveq #4,D3
         moveq #buf-cout,D2
         add.l  A4,D2 ; buf

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d0
         move.w d0,d7
         divu.w d4,d7
         swap d7
         move.w d7,d0
         swap d0
         divu.w d4,d0

         move.w d0,d7
         exg d0,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2
         sub.l d0,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d0
         divu.w d4,d0
         bvs.s .longdiv

         move.w d0,d7
         clr.w d0
         swap d0
         move.w d0,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         bsr.w PR000N

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1,d3
         move.l (A4),D1 ; cout
;         move.l #msgx,d2

         moveq #msgx-cout,d2
         add.l  A4,D2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l #1,d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l #1,d5      ;60 Hz
         add.l d3,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         moveq.l #10,d4
         move.l d5,d6

;div32x16 macro    ;D7=D6/D4, D6=D6%D4
 
;     moveq #0,d7    ; not necessary D7 highword is already cleared
     divu.w d4,d6
     bvc.b .div32no

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move d7,d6
     swap d6
     divu.w d4,d6
.div32no
     move.w d6,d7
;     clr.w d6 ;not necessary
     swap d6

         move.b d6,(a3)+
         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst.w d7
         beq .l11

         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
 ;        move.l cout(pc),d1

        move.l (A4),D1 ; cout
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

;         move.l cout(pc),d1

          move.l (A4),D1 ; cout
;         move.l #msgx+1,d2
         moveq #msgx+1-cout,d2
         add.l A4,D2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
      moveq #4,D3
      moveq #buf-cout,D2
      add.l  A4,D2 ; buf
PR000N
        move.w	#$0100,a0
	move.l	#$2f3a2f2f,d0
	move.w	#1000,d1
.b1000	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b1000
	add.w	d1,d5

	moveq	#100,d1
.b100	addq.b	#1,d0
	sub.w	d1,d5
	bcc.b	.b100
	add.w	d1,d5

	swap	d0
	moveq	#10,d1
.b10	add.w	a0,d0
	sub.w	d1,d5
	bcc.b	.b10
	add.b	d5,d0

        move.l D0,4(A4) ; buf
        move.l (A4),D1    ; cout
        jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4

; Overwritten code/data start here. 
ra
string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v13',10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10,0
      even

getnum
        jsr Input(a6)          ;get stdin
        moveq #msg1-cout,D2
        add.l A4,D2
        move.l d0,d1
        moveq #5,d3     ;+ newline
        jsr Read(a6)
 
        move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts

Buffy
     ds.b 65536-(Buffy-start)
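
For readers who do not want to trace the registers, here is a C model of what the main loop (.l0/.l4) and the .longdiv fallback appear to compute. It is a sketch, not a translation of the listing: `divu_w`, `div32x16`, `pi_digits` and the `MAX_DIGITS` limit are names and values chosen here for illustration, and the digits go to a string instead of being printed with Write.

```c
#include <assert.h>
#include <stdint.h>

/* Model of DIVU.W: 32-bit dividend, 16-bit divisor, 16-bit quotient and
   remainder. Returns 0 and writes nothing when the quotient does not fit
   in 16 bits (the case where the CPU sets the V flag). */
int divu_w(uint32_t n, uint16_t d, uint16_t *quot, uint16_t *rem)
{
    assert(d != 0);
    if (n / d > 0xFFFFu)
        return 0;
    *quot = (uint16_t)(n / d);
    *rem = (uint16_t)(n % d);
    return 1;
}

/* 32/16 division with a 32-bit quotient: try one divide first, then fall
   back to two 16-bit steps, high word first (the idea behind .longdiv). */
uint32_t div32x16(uint32_t n, uint16_t d, uint16_t *rem)
{
    uint16_t qhi, qlo, rhi;

    if (divu_w(n, d, &qlo, rem))            /* common case: quotient fits */
        return qlo;
    divu_w(n >> 16, d, &qhi, &rhi);         /* dividend < 65536, cannot overflow */
    divu_w(((uint32_t)rhi << 16) | (n & 0xFFFFu), d, &qlo, rem); /* rhi < d, so it fits */
    return ((uint32_t)qhi << 16) | qlo;
}

enum { MAX_DIGITS = 1000 };                 /* arbitrary limit for this sketch */

/* Writes `digits` decimal digits of pi (a multiple of 4) plus a NUL to out. */
void pi_digits(unsigned digits, char *out)
{
    static uint16_t r[7 * MAX_DIGITS / 2 + 1];
    unsigned terms, i, k, group, carry = 0;
    int j;

    assert(digits > 0 && digits % 4 == 0 && digits <= MAX_DIGITS);
    terms = 7 * digits / 2;                 /* N = 7*D/2, as in the listing */
    for (i = 1; i <= terms; i++)
        r[i] = 2000;

    for (k = terms; k > 0; k -= 14) {       /* 14 terms are dropped per 4 digits */
        uint32_t d = 0;
        i = k;
        for (;;) {
            uint16_t rem;
            d += (uint32_t)r[i] * 10000u;   /* d += r[i]*10000 */
            d = div32x16(d, (uint16_t)(2 * i - 1), &rem);
            r[i] = rem;                     /* r[i] <- d % b, with b = 2*i-1 */
            if (--i == 0)
                break;
            d *= i;                         /* carry into the next lower term */
        }
        group = carry + (unsigned)(d / 10000u);
        assert(group < 10000);              /* this sketch does not propagate a fifth digit */
        carry = (unsigned)(d % 10000u);
        for (j = 3; j >= 0; j--) {          /* four ASCII digits, most significant first */
            out[j] = (char)('0' + group % 10);
            group /= 10;
        }
        out += 4;
    }
    *out = '\0';
}
```

Calling `pi_digits(16, buf)` with a 17-byte buffer fills it with 3141592653589793. The two-step fallback is needed near the low terms, where the divisor is small and the quotient no longer fits in 16 bits.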

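On the tick conversion mentioned as still to do: as far as I can tell, the listing turns VBlank ticks into hundredths of a second (ticks*2 at 50Hz, ticks*5/3 rounded at 60Hz) and then prints the digits in reverse, with no leading zero when the time is under one second. The C sketch below models that reading; `ticks_to_centiseconds` and `format_centiseconds` are illustrative names, not routines from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* VBlank ticks -> hundredths of a second.
   50Hz: 2 centiseconds per tick. 60Hz: ticks*100/60 = ticks*5/3, rounded up
   when the remainder is 2 (the lsr.w #2 / negx.l / neg.l sequence). */
uint32_t ticks_to_centiseconds(uint32_t ticks, unsigned vblank_hz)
{
    uint32_t scaled;

    assert(vblank_hz == 50 || vblank_hz == 60);
    if (vblank_hz == 50)
        return ticks * 2;
    scaled = ticks * 5;
    return scaled / 3 + (scaled % 3 == 2);
}

/* Formats centiseconds as seconds with two decimals, e.g. 890 -> "8.90".
   Digits are produced least significant first and then reversed, as the
   listing does; a zero integer part produces no digit (45 -> ".45").
   out must hold at least 12 bytes. */
void format_centiseconds(uint32_t cs, char *out)
{
    char rev[12];
    int n = 0;

    rev[n++] = (char)('0' + cs % 10); cs /= 10;   /* hundredths */
    rev[n++] = (char)('0' + cs % 10); cs /= 10;   /* tenths */
    rev[n++] = '.';
    while (cs != 0) {                             /* whole seconds, if any */
        rev[n++] = (char)('0' + cs % 10);
        cs /= 10;
    }
    while (n > 0)
        *out++ = rev[--n];
    *out = '\0';
}
```

One thing worth checking in the assembly: the divu.w #3 on the 60Hz path is not followed by an overflow test, so that path looks valid only while the result stays below 65536 centiseconds (roughly 655 seconds).
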
a/b · 10 June 2021, 17:33 (#286)

Any particular reason why DCB.B 65536-... and not DS.B instead? As far as I can see the buffer is filled with 2000s, so why not make the executable significantly shorter?

Also, this is shorter (and faster, not that it matters much here):

Code:

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
	move.w	#$0100,a0
	move.l	#$2f3a2f2f,d2
	move.w	#1000,d3
.b1000	add.w	a0,d2
	sub.w	d3,d5
	bcc.b	.b1000
	add.w	d3,d5
	moveq	#100,d3
.b100	addq.b	#1,d2
	sub.w	d3,d5
	bcc.b	.b100
	add.w	d3,d5
	swap	d2
	moveq	#10,d3
.b10	add.w	a0,d2
	sub.w	d3,d5
	bcc.b	.b10
	add.b	d5,d2
	lea	cout(pc),a0
	...
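
The idea behind PR0000 is to turn a value in 0..9999 into four ASCII digits by repeated subtraction, with no divide at all. A plain C model of that idea is sketched below; `format4` is an illustrative name, and the sketch keeps each digit in its own variable instead of packing four counters into one register.

```c
#include <assert.h>

/* Formats value (0..9999) as four ASCII digits, most significant first,
   using only subtraction and counting. */
void format4(unsigned value, char out[4])
{
    static const unsigned place[3] = { 1000, 100, 10 };
    int i;

    assert(value <= 9999);
    for (i = 0; i < 3; i++) {
        char digit = '0';
        while (value >= place[i]) {   /* count how often the place value fits */
            value -= place[i];
            digit++;
        }
        out[i] = digit;
    }
    out[3] = (char)('0' + value);     /* what is left is the units digit */
}
```

The assembly gets the same result from the constant $2f3a2f2f: each $2f byte is '0'-1 because the counter is incremented before the subtract-and-test, and $3a is '0'+10 because the units byte receives the remainder after one subtraction of 10 too many.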

Thomas Richter · 08 June 2021, 16:24 (#283)

It goes beyond that. Various graphic cards are affected as well. Look into the CVision3D manual, for example. It requests users to run "Enforcer" (back then) to have an operational card on the 68030.

grond · 11 June 2021, 08:29 (#290)

The extra moves don't make it slower which hasn't anything to do with the processor continuing with the work while the chipmem write is still pending. The extra moves don't seem to consume time because after the first chipmem write the processor and chipmem are in sync meaning the processor can waste cycles on something else before chipmem is even ready to take the next access. While this may appear to be the same as continuing work while the chipmem is pending, it is not. Writing 3.5 or 7 MB to chipmem on the 030 will still block 50M processor cycles on the 030 while the 060 can make those chipmem writes for (almost) free. The 060 can do it because it simply leaves handling the write to a unit that does not block the execution pipeline.
