Optimizing the 68020+ 32-bit math - Page 14

BippyM · 01 June 2021, 18:18

Quote:

Originally Posted by modrobert

I'm not so sure he is coming back.

Fine by me!

Quote:

Why? Are you afraid to take responsibility for it?

Excuse me?? Do you even know what you are saying? I did take responsibility when I told you I had banned him. The reason why has absolutely nothing at all to do with you, hence why I haven't told you. It is between the EAB and Litwr. If he wants to tell you, that is fine.

Furthermore, I want to reiterate: this isn't a democracy and your opinion does not matter. Litwr has had numerous members report him; he has been warned and now he has been banned. The forum is here for the benefit of the users, and any user who doesn't follow the rules goes through the same process. We rarely ever give an instant ban (homophobic, sexual, gender, racism etc. will result in an instant, and often permanent, ban). It is that simple.

If you have an issue with how the forum is run, take it up with RCK, or the logout button is at the top of the page!

Bruce Abbott · 03 June 2021, 12:13

Quote:

Originally Posted by Don_Adan

Init code optimised a few.

[snip]

For Litwr version of PR0000

lea.l buf(pc),a0
move.l a0,d2
can be moved before the full loop, because D2 is unchanged inside the full loop.
The same applies to my version, but because I don't know which version is really fastest, I won't change this for now. Anyway, that leaves 2 instructions to remove in the future.

This is not the full code. Can you post the complete working source code?

Don_Adan · 03 June 2021, 14:21

Sorry, at present I don't have access to my Amiga.
You can download this version and replace some parts manually. Maybe later I will join all the changes.

https://github.com/litwr2/rosetta-pi...a/pi-amiga.asm

Bruce Abbott · 03 June 2021, 15:00

Quote:

Originally Posted by Don_Adan

Sorry, at present I don't have access to my Amiga.
You can download this version and replace some parts manually. Maybe later I will join all the changes.

Thanks, I will wait until you join all the changes and produce full working code.

I got litwr's code to assemble with ProAsm, which was quite a lot of work. It seems to run OK so I think I got it right. However, I have low confidence in my ability to join bits of unfamiliar code together without making a mistake. If the code is not accurate then I won't be able to do a fair comparison.

Don_Adan · 03 June 2021, 20:00

I joined all the code, but it is untested, so I don't know if it works or can even be assembled.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

div32x16 macro    ;D7=D6/D4, D6=D6%D4
     ;clr.l d7
     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
endm
 
start    lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
 ;        move.l #msg1,d2
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)
;         move.l #start+$10000-ra,d7
;         divu #7,d7
         move.l #$10000-(ra-start),D7
         divu.w #7*4,D7
         ext.l d7                  ; necessary only for Litwr version of PR0000
 ;        and.b #$fc,d7                 ;d7=maxn
         lsl.l #2,D7

.l20 
;    move.l cout(pc),d1
         move.l (A4),D1    ; cout
;         move.l #msg4,d2
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
;         move.l #msg5,d2
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21     bsr.w PR0000
;         move.l cout(pc),d1
          move.l (A4),D1 ; cout
;         move.l #msg3,d2
        moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq.l #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d3
         move.w d3,d7
         divu.w d4,d7
         swap d7
         move.w d7,d3
         swap d3
         divu.w d4,d3

         move.w d3,d7
         exg d3,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2      sub.l d3,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d3
         divu.w d4,d3
         bvs.s .longdiv

         move.w d3,d7
         clr.w d3
         swap d3
         move.w d3,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         ext.l D5   ; necessary only for litwr version of PR0000 routine
         bsr PR0000

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg.l a5,a6
         moveq.l #INTB_VERTB,d0
         lea.l VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg.l a5,a6

         moveq.l #1,d3
         move.l cout(pc),d1
         move.l #msgx,d2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l #1,d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l #1,d5      ;60 Hz
         add.l d3,d5
         divu #3,d5
         swap d5
         lsr #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         moveq.l #10,d4
         move.l d5,d6
         div32x16
         move.b d6,(a3)+
         divu d4,d7
         swap d7
         move.b d7,(a3)+
         clr d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst d7
         beq .l11

         divu d4,d7
         swap d7
         move.b d7,(a3)+
         clr d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
         move.l cout(pc),d1
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

         move.l cout(pc),d1
         move.l #msgx+1,d2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
       lea.l buf(pc),a0
       move.l a0,d2
       bsr.s .l1
       moveq #4,d3
       move.l cout(pc),d1
       jmp Write(a6)             ;call Write(stdout,buff,size)

.l1    divu #1000,d5
       bsr .l0
       clr d5
       swap d5

       divu #100,d5
       bsr .l0
       clr d5
       swap d5

       divu #10,d5
       bsr .l0
       swap d5

.l0    eori.b #'0',d5
       move.b d5,(a0)+
       rts

rasteri      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4
 
ra

getnum   jsr Input(a6)          ;get stdin
         move.l #string,d2     ;set by previous call
         move.l d0,d1
         moveq.l #5,d3     ;+ newline
         jsr Read(a6)
         subq #1,d0
         beq .err

         move.l d2,a0
         clr.l d5
.l1      clr d6
         move.b (a0)+,d6
         cmpi.b #'9',d6
         bhi .err

         subi.b #'0',d6
         bcs .err

         add d6,d5
         subq #1,d0
         beq .eos

         mulu #10,d5
         bra .l1

.err     clr d5
.eos     rts

string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v12 [Beta 3]'
      dc.b '(68020)'
;      dc.b '(68000)'
  dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10
Buffy
     dcb.b 65536-(Buffy-start)
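
As a reading aid for the division path in the listing above, here is a minimal C sketch of what the div32x16 macro appears to compute: a 32-bit by 16-bit unsigned division with a full 32-bit quotient, built from DIVU.W steps that each produce only a 16-bit quotient. The names `divu_w`, `DivResult` and the C function `div32x16` are hypothetical; `divu_w` only models the overflow behaviour of the instruction and is not cycle-accurate in any sense.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t quot; uint16_t rem; } DivResult;

/* Model of DIVU.W: 32-bit dividend, 16-bit divisor, 16-bit quotient.
   Returns 0 when the quotient does not fit in 16 bits (V flag set,
   destination left unchanged on the real CPU). */
static int divu_w(uint32_t n, uint16_t d, uint16_t *q, uint16_t *r)
{
    uint32_t full = n / d;
    if (full > 0xFFFFu)
        return 0;
    *q = (uint16_t)full;
    *r = (uint16_t)(n % d);
    return 1;
}

/* Same contract as the macro comment: quotient (D7) and remainder (D6). */
DivResult div32x16(uint32_t n, uint16_t d)
{
    DivResult out;
    uint16_t q, r, qhi, rhi;

    assert(d != 0);                     /* DIVU would trap on zero */

    if (divu_w(n, d, &q, &r)) {         /* fast path: one divide is enough */
        out.quot = q;
        out.rem = r;
        return out;
    }
    /* Slow path: divide the high word first. The dividend is below 65536,
       so this step cannot overflow. */
    divu_w(n >> 16, d, &qhi, &rhi);
    /* Then divide remainder:low word. rhi < d, so the quotient fits. */
    divu_w(((uint32_t)rhi << 16) | (n & 0xFFFFu), d, &q, &r);
    out.quot = ((uint32_t)qhi << 16) | q;
    out.rem = r;
    return out;
}
```

Calling `div32x16(1000000, 3)` takes the slow path, because the quotient 333333 does not fit in 16 bits. The inline `.longdiv` code in the main loop follows the same two-step idea but stores the remainder straight into r[i].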

Bruce Abbott · 05 June 2021, 11:50

Quote:

Originally Posted by Don_Adan

I joined all the code, but it is untested, so I don't know if it works or can even be assembled.

After fixing a few syntax issues (same ones in litwr's code) it assembled without errors. On running the sign-on message was corrupted, but after that it seemed to work OK.

Only one problem, it doesn't appear to be any faster. But I don't know which version litwr's code is - perhaps it already incorporates some of the speedups discussed here?

Was really hoping I was wrong to say that attempting to optimize the code would be a waste of time, but so far...

robinsonb5 · 05 June 2021, 12:44

Quote:

Originally Posted by Bruce Abbott

Only one problem, it doesn't appear to be any faster. But I don't know which version litwr's code is - perhaps it already incorporates some of the speedups discussed here?

Here's litwr's commit history for the Amiga version - he's certainly incorporated many if not all of the optimisations suggested in this thread:
https://github.com/litwr2/rosetta-pi...a/pi-amiga.asm

(And I have to say I'm impressed by just how many platforms he covered with this project.)

Don_Adan · 05 June 2021, 15:34

Quote:

Originally Posted by Bruce Abbott

After fixing a few syntax issues (same ones in litwr's code) it assembled without errors. On running the sign-on message was corrupted, but after that it seemed to work OK.

Only one problem, it doesn't appear to be any faster. But I don't know which version litwr's code is - perhaps it already incorporates some of the speedups discussed here?

Was really hoping I was wrong to say that attempting to optimize the code would be a waste of time, but so far...

This is the standard litwr version; only the cv handling and a few instruction sizes were optimised. I will post the source of the optimised (?) version today. Of course, I may have made some errors when joining the source.

Don_Adan · 05 June 2021, 16:28

A different version of the PR0000 routine. Maybe the fastest, maybe not.

Code:

OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

div32x16 macro    ;D7=D6/D4, D6=D6%D4
     ;clr.l d7
     moveq #0,d7
     divu.w d4,d6
     bvc.b .div32no\@

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move d7,d6
     swap d6
     divu.w d4,d6
.div32no\@
     move.w d6,d7
     clr.w d6
     swap d6
endm
 
start    lea libname(pc),a1         ;open the dos library
         move.l 4.W,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         lea cout(PC),A4
         move.l d0,(A4)            ;cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
 ;        move.l #msg1,d2
         moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
         add.l A4,D2
         moveq #msg4-msg1,d3
         jsr Write(a6)
;         move.l #start+$10000-ra,d7
;         divu #7,d7
         move.l #$10000-(ra-start),D7
         divu.w #7*4,D7
;         ext.l d7                  ; necessary only for Litwr version of PR0000
 ;        and.b #$fc,d7                 ;d7=maxn
         lsl.l #2,D7

.l20 
;    move.l cout(pc),d1
         move.l (A4),D1    ; cout
;         move.l #msg4,d2
         moveq #msg4-cout,D2
         add.l A4,D2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr.w PR0000
;         move.l cout(pc),d1
         move.l (A4),D1 ; cout
;         move.l #msg5,d2
         moveq #msg5-cout,D2
         add.l A4,D2
         moveq #msg3-msg5,d3
         jsr Write(a6)
         bsr.w getnum
         cmp.w d7,d5
         bhi.b .l20

         move.w d5,d1
         beq.b .l20

         addq.w #3,d5
         and.w #$fffc,d5
         cmp.b #10,(a0)
         bne.b .l21

         move.w d5,d6
         cmp.w d1,d5
         beq.b .l7

.l21     bsr.w PR0000
;         move.l cout(pc),d1
          move.l (A4),D1 ; cout
;         move.l #msg3,d2
        moveq #msg3-cout,D2
         add.l A4,D2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)

.l7 
         mulu.w #7,d6          ;kv = d6
         lsr.l #2,D6               ; /4
         move.l d6,d7
         lea ra(pc),a3

         exg a5,a6
         jsr Forbid(a6)
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg a5,a6
         ;move.w #$4000,$dff096    ;DMA off
 
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         subq.l #1,D7
         bne.b .fill

         move.l D7,-(SP)    ; cv
         lea 10000.W,A2

.l0      moveq #0,D5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         lsl.l #2,D4           ; *4
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.l A2,D1
         bra.b .l4

.longdiv
         swap d3
         move.w d3,d7
         divu.w d4,d7
         swap d7
         move.w d7,d3
         swap d3
         divu.w d4,d3

         move.w d3,d7
         exg d3,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2      sub.l d3,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d3
         divu.w d4,d3
         bvs.s .longdiv

         move.w d3,d7
         clr.w d3
         swap d3
         move.w d3,(a3)     ;r[i] <- d%b
.enddiv
         subq.l #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
 ;        ext.l D5   ; necessary only for litwr version of PR0000 routine
         bsr.w PR0000

         subq.l #7,d6   ;kv
         bne.b .l0
         addq.l #4,SP ;  restore stack


         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg a5,a6
         moveq #INTB_VERTB,d0
         lea VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg a5,a6

         moveq #1,d3
         move.l cout(pc),d1
         move.l #msgx,d2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l #1,d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l #1,d5      ;60 Hz
         add.l d3,d5
         divu.w #3,d5
         swap d5
         lsr.w #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         moveq.l #10,d4
         move.l d5,d6
         div32x16
         move.b d6,(a3)+
         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst.w d7
         beq .l11

         divu.w d4,d7
         swap d7
         move.b d7,(a3)+
         clr.w d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
         move.l cout(pc),d1
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

         move.l cout(pc),d1
         move.l #msgx+1,d2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
 lea $100.W,A0
 move.l #$303A3030,D2
 move.w #1000,D3
b1000
 sub.w D3,D5
 bcs.b n100
 add.w A0,D2
 bra.b b1000

n100
 add.w D3,D5
 moveq #100,D3
b100
 sub.w D3,D5
 bcs.b n10
 addq.b #1,D2
 bra.b b100

n10
 add.w D3,D5
 swap D2
 moveq #10,D3
b10
 sub.w D3,D5
 bcs.b n1
 add.w A0,D2
 bra.b b10
n1
 add.b D5,D2

 lea cout(PC),A0
 move.l (A0)+,D1 
 move.l D2,(A0)
 move.l A0,D2 ; buf
 moveq #4,D3
 jmp Write(A6) ;call Write(stdout,buff,size)
rasteri      addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0                     ;ln_Name
      dc.l  time,rasteri          ;is_Data,is_Code

 msgx dc.b 32,10

 cnop 0,4

 time dc.l 0
 cout dc.l 0
 buf ds.b 4
 
ra

getnum   jsr Input(a6)          ;get stdin
         move.l #string,d2     ;set by previous call
         move.l d0,d1
         moveq #5,d3     ;+ newline
         jsr Read(a6)
         subq.w #1,d0
         beq .err

         move.l d2,a0
         clr.l d5
.l1      clr.w d6
         move.b (a0)+,d6
         cmpi.b #'9',d6
         bhi.b .err

         subi.b #'0',d6
         bcs.b .err

         add.w d6,d5
         subq.w #1,d0
         beq.b .eos

         mulu.w #10,d5
         bra.b .l1

.err     clr d5
.eos     rts

string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v12 [Beta 3]'
      dc.b '(68020)'
;      dc.b '(68000)'
  dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10
Buffy
     dcb.b 65536-(Buffy-start)
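
For readers who do not want to trace the registers, the new PR0000 replaces the three DIVU instructions with repeated subtraction. Below is a minimal C sketch of that idea, assuming the input is in the range 0..9999 as it is for a four-digit group; the name `pr0000_sketch` is hypothetical and the sketch does not reproduce the byte-packing tricks of the assembly.

```c
#include <stdint.h>

/* Convert a value in 0..9999 to four ASCII digits without dividing:
   count how many times 1000, 100 and 10 can be subtracted.
   Values above 9999 would push the first character past '9'. */
void pr0000_sketch(unsigned value, char out[4])
{
    static const unsigned weight[3] = { 1000, 100, 10 };

    for (int i = 0; i < 3; i++) {
        char digit = '0';
        while (value >= weight[i]) {    /* SUB.W / BCS loop in the assembly */
            value -= weight[i];
            digit++;
        }
        out[i] = digit;
    }
    out[3] = (char)('0' + value);       /* what is left is the units digit */
}
```

The assembly builds all four characters in D2, which starts as $303A3030. The $3A byte is '0'+10: after the tens loop underflows, D5 holds units-10, so adding it to $3A gives the units character without a correcting ADD.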

a/b · 05 June 2021, 21:35

Only glanced at the code, this stuck out:

Code:

.l1      clr.w d6
         move.b (a0)+,d6
         cmpi.b #'9',d6
         bhi.b .err
         subi.b #'0',d6
         bcs.b .err

         add.w d6,d5
...

This is 4 bytes shorter:

Code:

.l1	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.err

	add.w d6,d5
...

EDIT: Kept thinking about the whole loop, this is 8 bytes shorter:

Code:

...
	jsr	Read(a6)
	move.l	d2,a0
	moveq	#0,d5
.loop	subq.w	#1,d0
	beq.b	.done
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	#9,d6
	bhi.b	.error
	mulu.w	#10,d5
	add.w	d6,d5
	bra.b	.loop
.error	moveq	#0,d5
.done	rts
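
The range check above works because subtracting '0' and comparing unsigned against 9 rejects characters on both sides of the digit range in a single test. A minimal C sketch of the same loop, assuming `nread` is the byte count returned by Read() including the trailing newline; `parse_count` and `nread` are hypothetical names:

```c
#include <stdint.h>

/* Parse up to four decimal digits; returns 0 on any non-digit.
   The last byte of the buffer is the newline and is not examined. */
uint16_t parse_count(const char *buf, int nread)
{
    uint16_t value = 0;

    for (int i = 0; i + 1 < nread; i++) {
        /* Characters below '0' wrap around to large values here,
           so one unsigned compare covers both ends of the range. */
        uint8_t d = (uint8_t)(buf[i] - '0');
        if (d > 9)
            return 0;                   /* .error: moveq #0,d5 */
        value = (uint16_t)(value * 10 + d);
    }
    return value;
}
```

For example, `parse_count("314\n", 4)` yields 314, while `parse_count("3a4\n", 4)` yields 0.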

Now back to stabbing myself in the eye with fMP4...

Bruce Abbott · 07 June 2021, 17:06

Since this thread was originally supposed to be about optimizing 32 bit division (and because I am sick of seeing 300 digits of pi on my screen) I decided that for initial comparisons I would measure execution times without printing the digits. This saves the hassle of having to set the CLI window up exactly the same for each run (to avoid possible variations due to scrolling time etc.). Not printing the digits only made the code run about 8% faster, so any optimization in this area will probably make little difference.

I tested 3 code bases, litwr's V1 and V4 written in 2018 and Don_Adan's V12[BETA3] from post #265, on my A1200 with WB3.0 and 50MHz Blizzard 1230-IV. Results are rounded to the nearest 0.1 second.

litwr V1: 9.7 seconds
litwr V4: 8.9 seconds
Don_Adan V12b3: 8.9 seconds

This suggests that there is little opportunity for further significant improvement in execution speed of the core algorithm.

Just for fun I also tested it under different operating conditions. Normally I run The Enforcer on my system to warn me about programs trashing low memory. This can noticeably slow down programs that do a lot of legitimate low memory access. With The Enforcer running litwr's V4 code took 9.0 seconds to execute, which is ~1% slower.

I also tried disabling CPU caches, and executing from ChipRAM. The results were a little surprising. Disabling the data cache had little effect, but disabling the instruction cache increased execution time to 12.4 seconds or ~28% slower. This shows that even with the fast 60ns RAM on the Blizzard 1230-IV, getting critical code to fit inside the instruction cache can greatly speed it up.

But what really surprised me was the effect of running from Chip RAM. I expected a massive slowdown, but it wasn't that bad - 12.7 seconds or ~30% slower, not much worse than running from FastRAM with CPU caches off. Unless The Enforcer was running, then execution time ballooned out to 49.6 seconds whether the instruction cache was on or off. That's 5.57 times slower!
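
A note on the arithmetic, assuming the 8.9 second litwr V4 run is the baseline for all of these comparisons: the "% slower" figures match lost speed (1 - base/t), not extra run time. The sketch below shows both conventions; the function names are hypothetical.

```c
/* Speed lost relative to the baseline, in percent: (1 - base/t) * 100. */
double speed_loss_percent(double base_s, double t_s)
{
    return (1.0 - base_s / t_s) * 100.0;
}

/* Extra run time relative to the baseline, in percent: (t/base - 1) * 100. */
double extra_time_percent(double base_s, double t_s)
{
    return (t_s / base_s - 1.0) * 100.0;
}

/* How many times longer the run took. */
double times_slower(double base_s, double t_s)
{
    return t_s / base_s;
}
```

With these definitions, 12.4 seconds against 8.9 is about 28% lost speed but about 39% more run time, and 49.6 seconds against 8.9 is the quoted factor of 5.57.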

robinsonb5 · 07 June 2021, 19:40

Quote:

Originally Posted by Bruce Abbott

I also tried disabling CPU caches, and executing from ChipRAM. The results were a little surprising. Disabling the data cache had little effect, but disabling the instruction cache increased execution time to 12.4 seconds or ~28% slower. This shows that even with the fast 60ns RAM on the Blizzard 1230-IV, getting critical code to fit inside the instruction cache can greatly speed it up.

Isn't data cache disallowed for Chip RAM though? (The CPU reading data that the blitter or disk DMA has written gets way more complicated if there's a data cache.)

Quote:

But what really surprised me was the effect of running from Chip RAM. I expected a massive slowdown, but it wasn't that bad - 12.7 seconds or ~30% slower, not much worse than running from FastRAM with CPU caches off.

That's interesting - can you compare an "easy" screenmode, like PAL 16 colours with, say fully overscanned DblPAL in 256 colours?

Quote:

Unless The Enforcer was running, then execution time ballooned out to 49.6 seconds whether the instruction cache was on or off. That's 5.57 times slower!

Also very interesting!

Bruce Abbott · 08 June 2021, 01:49

Quote:

Originally Posted by robinsonb5

Isn't data cache disallowed for Chip RAM though? (The CPU reading data that the blitter or disk DMA has written gets way more complicated if there's a data cache.)

I presume so, but testing in FastRAM revealed negligible difference between data cache on and off, perhaps because in this program the data spends most of its time in registers.

Even if the data was being written to ChipRAM, some accelerator cards (including the Blizzard 1230-IV?) have a 'delayed write' feature that starts a ChipRAM write and then disconnects the local bus so the CPU can continue processing, only waiting if it has to access ChipRAM again before the write has finished.

Quote:

That's interesting - can you compare an "easy" screenmode, like PAL 16 colours with, say fully overscanned DblPAL in 256 colours?

My previous tests were done on a PAL screen with 8 colors. I couldn't get full overscan to work in DblPAL, so I set it to 676x454 (max text overscan) in DblNTSC. Running litwr's V4 code the results were:-

FastRAM, 256 colors, CPU caches on: 10.1 seconds
ChipRAM, 16 colors, CPU caches on: 15.2 seconds
ChipRAM, 256 colors, CPU caches on: 44.4 seconds
ChipRAM, 16 colors, CPU caches off: 61.2 seconds
ChipRAM, 256 colors, CPU caches off: 194.3 seconds

From this we see that when running from chip - even with massive DMA contention - having CPU caches on makes a huge improvement. When running max text overscan in 256 colors it was 4.4 times faster with caches on. In 16 colors it was 4.0 times faster.

grond · 08 June 2021, 09:54

Quote:

Originally Posted by Bruce Abbott

Even if the data was being written to ChipRAM, some accelerator cards (including the Blizzard 1230-IV?) have a 'delayed write' feature that starts a ChipRAM write and then disconnects the local bus so the CPU can continue processing, only waiting if it has to access ChipRAM again before the write has finished.

I believe this possibility only exists in the 060 but I would be happy to be corrected.

Thomas Richter · 08 June 2021, 10:48

It's more complicated in reality. On the 68030, if the MMU is disabled, caching should(!) be controlled by the CIIN pin of the CPU. Board-specific logic then detects which address range an access goes to, and if it goes into the chip mem region or the custom I/O region (or the region where the board logic assumes them to be present), the pin is pulled low. In principle, board logic can even decode the function codes and thus allow caching of code, but not of data.

However, "should", because this logic does not quite work. Due to a design bug of the 68030, the processor caches written data even if the CIIN pin is low, and this is all the reason why the 68030.library is needed, namely to enable the MMU and make this reliable.

The mentioned feature of "delayed write" exists both on the 68040 and 68060, albeit in a slightly different form. Both processors have a "push buffer" into which they can migrate written data, and delay the write if the bus is slow. On the 68040, other RAM writes can "overtake" the data in the push buffer such that writes can become non-sequential, and that is why this feature is called "non-serialized" access on the 68040.

On the 68060, Motorola used a different design. Accesses on the 68060 are purely sequential, always, but the push-buffer is present. Instead, the feature is called "imprecise access". This is because if a write ends up in the push buffer, and this write is delayed and later on triggers a physical bus error when it is performed, the program counter no longer points to the causing instruction, but potentially to a later instruction. Thus, the 68060 cannot re-issue the faulting instruction anymore, and the write is "lost".

On the 68040, the contents of the push buffer are saved in the stack frame, and thus the write can be repeated, but not so on the 68060. Note that this only goes for physical bus errors, not for access errors (MMU invalid page detections), which are handled before the instruction is executed.

Thus, it is typically a (small) advantage to map the chip RAM as non-caching, but imprecise (060) or non-serialized (040). I/O regions that may potentially cause bus faults should never be mapped imprecise (060), and I/O regions where the order of writes matters (it typically does) should never be mapped non-serialized (040).

The mmulib handles this all fine for you (or the corresponding processor library).

meynaf · 08 June 2021, 11:54

Quote:

Originally Posted by Thomas Richter

However, "should", because this logic does not quite work. Due to a design bug of the 68030, the processor caches written data even if the CIIN pin is low, and this is all the reason why the 68030.library is needed, namely to enable the MMU and make this reliable.

Design bug, not really. That pin can not be changed by the underlying hardware before it has decoded the target address, and by that time the 68030 is probably already executing other instructions so it does not know about the posted write anymore - keeping track of it would probably have been too complex or costly.
To handle this - it's not usually a problem in real life code - the 68030 has DC_WA (Write Allocate) bit in CACR.

Thomas Richter · 08 June 2021, 12:15

Quote:

Originally Posted by meynaf

Design bug, not really.

Design bug, really. That CIIN was not working on writes was not listed in the first version of the 68030 UM. The problem was found by Mike Sinz, and then included as "specification change" in the manual. The first version of the manual still lists CIIN as operating on writes.

Of course, Mot didn't want to delay the cache update until the bus cycle is initiated, but that is part of the design issue, really. The cache operates "at the wrong place" and "at the wrong time". This issue was carried over from the 68020/68851 system design, but the 68020 had no data cache, so the issue was not apparent there.

Quote:

Originally Posted by meynaf

To handle this - it's not usually a problem in real life code - the 68030 has DC_WA (Write Allocate) bit in CACR.

Which cannot be used, Amiga must always work with write allocation ON. Think about why. (Hint: The cache is logically indexed and includes function codes as index).

IOWs, there is no workaround other than turning the MMU on. Write-allocation off causes *also* cache-inconsistencies, but other inconsistencies. That these defects don't pop up immediately (ie. with write allocation OFF) is also due to the small size of the cache - but the issue also exists.

meynaf · 08 June 2021, 12:39

Quote:

Originally Posted by Thomas Richter

Design bug, really. That CIIN was not working on writes was not listed in the first version of the 68030 UM. The problem was found by Mike Sinz, and then included as "specification change" in the manual. The first version of the manual still lists CIIN as operating on writes.

Of course, Mot didn't want to delay the cache update until the bus cycle is initiated, but that is part of the design issue, really. The cache operates "at the wrong place" and "at the wrong time". This issue was carried over from the 68020/68851 system design, but the 68020 had no data cache, so the issue was not apparent there.

Looks more like implementation issue - or simply something missed in the docs - than design issue.
The problem is that the bus cycle could be initiated long after next instruction is executed. In a similar way, 040/060's push buffers don't change the way data is cached, do they ?

Quote:

Originally Posted by Thomas Richter

Which cannot be used, Amiga must always work with write allocation ON. Think about why. (Hint: The cache is logically indexed and includes function codes as index).

But this is Amiga problem, not 68030 problem.
As for thinking about why: I don't have the time to try to guess, and anyway, why not let everyone benefit from your knowledge here?

Quote:

Originally Posted by Thomas Richter

IOWs, there is no workaround other than turning the MMU on. Write-allocation off causes *also* cache-inconsistencies, but other inconsistencies. That these defects don't pop up immediately (ie. with write allocation OFF) is also due to the small size of the cache - but the issue also exists.

In practice I've run my 68030 for many years and this has never been a problem.
Turning WA off didn't crash the system either.
I have yet to see a scenario where this can cause real trouble.

Thomas Richter · 08 June 2021, 12:51

Quote:

Originally Posted by meynaf

Looks more like implementation issue - or simply something missed in the docs - than design issue.
The problem is that the bus cycle could be initiated long after next instruction is executed. In a similar way, 040/060's push buffers don't change the way data is cached, do they ?

Why should they?

Quote:

Originally Posted by meynaf

But this is Amiga problem, not 68030 problem.
As for thinking about why: I don't have the time to try to guess, and anyway, why not let everyone benefit from your knowledge here?

Consider the following code with write-allocation off, with p pointing to a long-word aligned long word:

a = *p; /* read p into cache, fill cache line */
*p = 1; /* update p, in memory and in the data cache */

switch to supervisor

*p = 2; /* update p in memory, NOT in the data cache as write allocation is off */

switch to user

b = *p; /* this reads now stale data from the cache, not from memory */

The problem is that the cache is logically indexed, and the function codes also operate as part of the cache index. The write in supervisor mode therefore does not update the cache line allocated by the user code, because the function code is different. The write goes straight through to memory, since write allocation is off. Since the data cache has not been updated, the second read from p reads stale data.

With write allocation on, the problem goes away since the write allocates in the cache, and it allocates the same cache line as the read of the user code. That is in a sense a "coincidence", but one that fixes the problem.
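To make the sequence above concrete, here is a minimal sketch in C of a toy direct-mapped, write-through cache whose tag includes the function code. It is only a model under stated assumptions, not the real 68030 data cache (which has 16 lines of four long words each); the names Sys, read_long, write_long and run_sequence are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model (assumption): 16 direct-mapped lines, one long word per line. */
enum { FC_USER_DATA = 1, FC_SUPER_DATA = 5 };   /* 680x0 function codes */
enum { LINES = 16, MEM_WORDS = 64 };

typedef struct {
    bool     valid;
    uint8_t  fc;     /* function code is part of the tag */
    uint32_t tag;    /* upper address bits */
    uint32_t data;
} Line;

typedef struct {
    Line     line[LINES];
    uint32_t mem[MEM_WORDS];   /* backing memory, indexed by long word */
    bool     write_allocate;
} Sys;

void sys_init(Sys *s, bool write_allocate) {
    memset(s, 0, sizeof *s);
    s->write_allocate = write_allocate;
}

/* The line is selected by address only; the function code is not used here. */
Line *lookup(Sys *s, uint32_t addr) { return &s->line[(addr >> 2) % LINES]; }

bool hit(const Line *l, uint8_t fc, uint32_t addr) {
    return l->valid && l->fc == fc && l->tag == (addr >> 2) / LINES;
}

void fill(Line *l, uint8_t fc, uint32_t addr, uint32_t value) {
    l->valid = true;
    l->fc    = fc;
    l->tag   = (addr >> 2) / LINES;
    l->data  = value;
}

uint32_t read_long(Sys *s, uint8_t fc, uint32_t addr) {
    assert((addr & 3) == 0 && addr / 4 < MEM_WORDS);
    Line *l = lookup(s, addr);
    if (!hit(l, fc, addr))
        fill(l, fc, addr, s->mem[addr / 4]);   /* read miss: load from memory */
    return l->data;
}

void write_long(Sys *s, uint8_t fc, uint32_t addr, uint32_t value) {
    assert((addr & 3) == 0 && addr / 4 < MEM_WORDS);
    Line *l = lookup(s, addr);
    s->mem[addr / 4] = value;                  /* write-through: memory always updated */
    if (hit(l, fc, addr))
        l->data = value;                       /* write hit: update the line */
    else if (s->write_allocate)
        fill(l, fc, addr, value);              /* write miss: replace whatever is in the line */
    /* write miss with write allocation off: the cache is left untouched */
}

/* Runs the user/supervisor sequence from the post and returns b. */
uint32_t run_sequence(bool write_allocate) {
    Sys s;
    uint32_t p = 0x20;                         /* arbitrary long-word aligned address */
    sys_init(&s, write_allocate);
    (void)read_long(&s, FC_USER_DATA, p);      /* a = *p;  fills the line (user)      */
    write_long(&s, FC_USER_DATA, p, 1);        /* *p = 1;  hit, cache and memory      */
    write_long(&s, FC_SUPER_DATA, p, 2);       /* *p = 2;  supervisor, tag mismatch   */
    return read_long(&s, FC_USER_DATA, p);     /* b = *p;  back in user mode          */
}
```

In this model run_sequence(false) returns 1, the stale value, because the supervisor write misses and leaves the user's line valid. run_sequence(true) returns 2, because the supervisor write evicts the user's line and the final user read has to reload from memory.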

The Amiga side of the problem is that the Amiga doesn't have a separate user/supervisor model. If the two memory regions were distinct (as in a true Harvard architecture) the problem would not exist.

The problem is that the 68030 design is quirky all over; it is really a hot-patch that integrates the 68851 design into the 68020. This is just another example of what went wrong. Motorola fixed all of this with the 68040.

meynaf · 08 June 2021, 13:05

Quote:

Originally Posted by Thomas Richter

Why should they?

For the same reason the 68030 is supposed to do it?
Or perhaps they just don't have the relevant pins to detect the case?

Quote:

Originally Posted by Thomas Richter

Consider the following code with write-allocation off, with p pointing to a long-word aligned long word:

a = *p; /* read p into cache, fill cache line */
*p = 1; /* update p, in memory and in the data cache */

switch to supervisor

*p = 2; /* update p in memory, NOT in the data cache as write allocation is off */

switch to user

b = *p; /* this reads now stale data from the cache, not from memory */

The problem is that the cache is logically indexed, and the function codes also operate as part of the cache index. The write in supervisor mode therefore does not update the cache line allocated by the user code, because the function code is different. The write goes straight through to memory, since write allocation is off. Since the data cache has not been updated, the second read from p reads stale data.

With write allocation on, the problem goes away since the write allocates in the cache, and it allocates the same cache line as the read of the user code. That is in a sense a "coincidence", but one that fixes the problem.

The Amiga side of the problem is that the Amiga doesn't have a separate user/supervisor model. If the two memory regions were distinct (as in a true Harvard architecture) the problem would not exist.

The problem is that the 68030 design is quirky all over; it is really a hot-patch that integrates the 68851 design into the 68020. This is just another example of what went wrong. Motorola fixed all of this with the 68040.

Now that's clear. Thank you.

Bruce Abbott · 07 June 2021, 17:06

Since this thread was originally supposed to be about optimizing 32 bit division (and because I am sick of seeing 300 digits of pi on my screen) I decided that for initial comparisons I would measure execution times without printing the digits. This saves the hassle of having to set the CLI window up exactly the same for each run (to avoid possible variations due to scrolling time etc.). Not printing the digits only made the code run about 8% faster, so any optimization in this area will probably make little difference.

I tested 3 code bases, litwr's V1 and V4 written in 2018 and Don_Adan's V12[BETA3] from post #265, on my A1200 with WB3.0 and 50MHz Blizzard 1230-IV. Results are rounded to the nearest 0.1 second.

litwr V1: 9.7 seconds
litwr V4: 8.9 seconds
Don_Adan V12b3: 8.9 seconds

This suggests that there is little opportunity for further significant improvement in execution speed of the core algorithm.

Just for fun I also tested it under different operating conditions. Normally I run The Enforcer on my system to warn me about programs trashing low memory. This can noticeably slow down programs that do a lot of legitimate low memory access. With The Enforcer running litwr's V4 code took 9.0 seconds to execute, which is ~1% slower.

I also tried disabling CPU caches, and executing from ChipRAM. The results were a little surprising. Disabling the data cache had little effect, but disabling the instruction cache increased execution time to 12.4 seconds or ~28% slower. This shows that even with the fast 60ns RAM on the Blizzard 1230-IV, getting critical code to fit inside the instruction cache can greatly speed it up.

But what really surprised me was the effect of running from Chip RAM. I expected a massive slowdown, but it wasn't that bad - 12.7 seconds or ~30% slower, not much worse than running from FastRAM with CPU caches off.

Unless The Enforcer was running, then execution time ballooned out to 49.6 seconds whether the instruction cache was on or off. That's 5.57 times slower!

Thomas Richter · 08 June 2021, 10:48

It's more complicated in reality. On the 68030, if the MMU is disabled, caching should(!) be controlled by the CIIN pin of the CPU. Board-specific logic then detects which address range an access goes to, and if it goes into the chip mem region or the custom I/O region (or the region where the board assumes them to be present), the pin is pulled low. In principle, board logic can even detect the function codes and thus allow caching of code, but not of data.

However, "should", because this logic does not quite work. Due to a design bug of the 68030, the processor caches written data even if the CIIN pin is low, and this is the whole reason why the 68030.library is needed, namely to enable the MMU and make this reliable.

The mentioned feature of "delayed write" exists on both the 68040 and the 68060, albeit in slightly different forms. Both processors have a "push buffer" into which they can migrate written data, and delay the write if the bus is slow. On the 68040, other RAM writes can "overtake" the data in the push buffer such that writes can become non-sequential, and that is why this feature is called "non-serialized" access on the 68040.

On the 68060, Motorola used a different design. Accesses on the 68060 are always purely sequential, but the push buffer is still present. Instead, the feature is called "imprecise access". This is because if a write ends up in the push buffer, is delayed, and later triggers a physical bus error when it is performed, the program counter no longer points to the causing instruction, but potentially to a later instruction. Thus, the 68060 cannot re-issue the faulting instruction anymore, and the write is "lost". On the 68040, the contents of the push buffer are saved in the stack frame, and thus the write can be repeated, but not so on the 68060.

Note that this only goes for physical bus errors, not for access errors (MMU invalid page detections), which are handled up front, before the instruction executes. Thus, it is typically a (small) advantage to map the chip RAM as non-caching, but imprecise (060) or non-serialized (040). I/O regions that may potentially cause bus faults should never be mapped imprecise (060), and I/O regions where the order of writes matters (typically yes) should never be mapped non-serialized (040). The mmulib (or the corresponding processor library) handles all of this fine for you.
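The CIIN decoding described at the start of this post can be sketched as a small C function. This is a hypothetical policy, not the logic of any particular accelerator board: it assumes 2 MB of Chip RAM at 0x000000-0x1FFFFF and the custom chip registers at 0xDFF000-0xDFFFFF, and the name cache_inhibit is invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* 680x0 function codes seen on the FC0-FC2 pins */
enum { FC_USER_DATA = 1, FC_USER_PROG = 2, FC_SUPER_DATA = 5, FC_SUPER_PROG = 6 };

/* Assumed memory map: 2 MB Chip RAM at the bottom of the address space. */
bool is_chip_ram(uint32_t addr)  { return addr < 0x200000u; }

/* Assumed memory map: custom chip registers. */
bool is_custom_io(uint32_t addr) { return addr >= 0xDFF000u && addr <= 0xDFFFFFu; }

bool is_program_fc(unsigned fc)  { return fc == FC_USER_PROG || fc == FC_SUPER_PROG; }

/* Returns true when the board would assert CIIN (pull it low) for this access. */
bool cache_inhibit(uint32_t addr, unsigned fc) {
    if (is_custom_io(addr))
        return true;                 /* never cache hardware registers */
    if (is_chip_ram(addr))
        return !is_program_fc(fc);   /* allow caching of code, but not of data */
    return false;                    /* everything else is treated as cacheable */
}
```

As the post explains, on a real 68030 this alone is not sufficient, because written data can still end up in the cache while CIIN is low.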
