01 June 2021, 18:18 | #261 | |
Global Moderator
Join Date: Nov 2001
Location: Derby, UK
Age: 48
Posts: 9,355
|
Fine by me! Quote:
Excuse me?? Do you even know what you are saying? I did take responsibility when I told you I had banned him. The reason why has absolutely nothing at all to do with you, hence why I haven't told you. It is between the EAB and Litwr. If he wants to tell you that is fine. Furthermore I want to reiterate, this isn't a democracy, your opinion does not matter, litwr has had numerous members report him, he has been warned and now he has been banned. The forum is here for the benefit of the users, and any user who doesn't follow the rules has the same process. We rarely ever give an instant ban (homophobic, sexual, gender, racism etc will result in an instant, and often permanent ban). It is that simple. If you have an issue with how the forum is run, take it up with RCK or the logout button is at the top of the page! Last edited by BippyM; 01 June 2021 at 18:25. |
|
03 June 2021, 12:13 | #262 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Quote:
|
|
03 June 2021, 14:21 | #263 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
Sorry, at present I don't have access to my Amiga.
You can download this version and replace some parts manually. Maybe later I will merge all the changes. https://github.com/litwr2/rosetta-pi...a/pi-amiga.asm |
03 June 2021, 15:00 | #264 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Quote:
I got litwr's code to assemble with ProAsm, which was quite a lot of work. It seems to run OK so I think I got it right. However I have low confidence in my ability to join bits of unfamiliar code together without making a mistake. If the code is not accurate then I won't be able to do a fair comparison. |
|
03 June 2021, 20:00 | #265 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
I merged all the code, but I have not tested whether it works or even assembles.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5 ;for vblank interrupt
NT_INTERRUPT = 2 ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

div32x16 macro ;D7=D6/D4, D6=D6%D4
    ;clr.l d7
    moveq.l #0,d7
    divu d4,d6
    bvc .div32no\@
    swap d6
    move d6,d7
    divu d4,d7
    swap d7
    move d7,d6
    swap d6
    divu d4,d6
.div32no\@
    move d6,d7
    clr d6
    swap d6
    endm

start
    lea libname(pc),a1 ;open the dos library
    move.l 4.W,a5
    move.l a5,a6
    jsr OldOpenLibrary(a6)
    move.l d0,a6
    jsr Output(a6) ;get stdout
    lea cout(PC),A4
    move.l d0,(A4) ;cout
    move.l d0,d1 ;call Write(stdout,buff,size)
;   move.l #msg1,d2
    moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
    add.l A4,D2
    moveq #msg4-msg1,d3
    jsr Write(a6)

;   move.l #start+$10000-ra,d7
;   divu #7,d7
    move.l #$10000-(ra-start),D7
    divu.w #7*4,D7
    ext.l d7 ; necessary only for Litwr version of PR0000
;   and.b #$fc,d7 ;d7=maxn
    lsl.l #2,D7
.l20
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg4,d2
    moveq #msg4-cout,D2
    add.l A4,D2
    moveq #msg5-msg4,d3
    jsr Write(a6)
    move.l d7,d5
    bsr.w PR0000
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg5,d2
    moveq #msg5-cout,D2
    add.l A4,D2
    moveq #msg3-msg5,d3
    jsr Write(a6)
    bsr.w getnum
    cmp.w d7,d5
    bhi.b .l20
    move.w d5,d1
    beq.b .l20
    addq.w #3,d5
    and.w #$fffc,d5
    cmp.b #10,(a0)
    bne.b .l21
    move.w d5,d6
    cmp.w d1,d5
    beq.b .l7
.l21
    bsr.w PR0000
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg3,d2
    moveq #msg3-cout,D2
    add.l A4,D2
    moveq #msg2-msg3+1,d3
    jsr Write(a6)
.l7
    mulu.w #7,d6 ;kv = d6
    lsr.l #2,D6 ; /4
    move.l d6,d7
    lea ra(pc),a3
    exg a5,a6
    jsr Forbid(a6)
    moveq.l #INTB_VERTB,d0
    lea VBlankServer(pc),a1
    jsr AddIntServer(a6)
    exg a5,a6
    ;move.w #$4000,$dff096 ;DMA off
    move.l #2000*65537,d0
    move.l a3,a0
.fill
    move.l d0,(a0)+
    subq.l #1,D7
    bne.b .fill
    move.l D7,-(SP) ; cv
    lea 10000.W,A2
.l0
    moveq #0,D5 ;d <- 0
    move.l d6,d4 ;i <- kv, i <- i*2
    lsl.l #2,D4 ; *4
    adda.l d4,a3
    subq.l #1,d4 ;b <- 2*i-1
    move.l A2,D1
    bra.b .l4
.longdiv
    swap d3
    move.w d3,d7
    divu.w d4,d7
    swap d7
    move.w d7,d3
    swap d3
    divu.w d4,d3
    move.w d3,d7
    exg d3,d7
    clr.w d7
    swap d7
    move.w d7,(a3) ;r[i] <- d%b
    bra.b .enddiv
.l2
    sub.l d3,d5
    sub.l d7,d5
    lsr.l #1,d5
.l4
    move -(a3),d0 ; r[i]
    mulu.w d1,d0 ;r[i]*10000
    add.l d0,d5 ;d += r[i]*10000
    move.l d5,d3
    divu.w d4,d3
    bvs.s .longdiv
    move.w d3,d7
    clr.w d3
    swap d3
    move.w d3,(a3) ;r[i] <- d%b
.enddiv
    subq.l #2,d4 ;i <- i - 1
    bcc.b .l2 ;the main loop
    divu.w d1,d5 ;removed with MULU optimization
    add.w (SP),D5 ; cv
    move.l D5,(SP) ; cv
    ext.l D5 ; necessary only for litwr version of PR0000 routine
    bsr PR0000
    subq.l #7,d6 ;kv
    bne.b .l0
    addq.l #4,SP ; restore stack
    move.l time(pc),d5
    ;move.w #$c000,$dff096 ;DMA on
    exg.l a5,a6
    moveq.l #INTB_VERTB,d0
    lea.l VBlankServer(pc),a1
    jsr RemIntServer(a6)
    jsr Permit(a6)
    exg.l a5,a6
    moveq.l #1,d3
    move.l cout(pc),d1
    move.l #msgx,d2
    jsr Write(a6) ;space
    move.l d5,d3
    lsl.l d5
    cmp.b #50,VBlankFrequency(a5)
    beq .l8
    lsl.l d5 ;60 Hz
    add.l d3,d5
    divu #3,d5
    swap d5
    lsr #2,d5
    swap d5
    negx.l d5
    neg.l d5
.l8
    lea string(pc),a3
    moveq.l #10,d4
    move.l d5,d6
    div32x16
    move.b d6,(a3)+
    divu d4,d7
    swap d7
    move.b d7,(a3)+
    clr d7
    swap d7
    move.b #'.'-'0',(a3)+
.l12
    tst d7
    beq .l11
    divu d4,d7
    swap d7
    move.b d7,(a3)+
    clr d7
    swap d7
    bra .l12
.l11
    add.b #'0',-(a3)
    moveq #1,d3
    move.l cout(pc),d1
    move.l a3,d2
    jsr Write(a6)
    cmp.l #string,a3
    bne .l11
    move.l cout(pc),d1
    move.l #msgx+1,d2
    jsr Write(a6) ;newline
    move.l a6,a1
    move.l a5,a6
    jmp CloseLibrary(a6)

PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
    lea.l buf(pc),a0
    move.l a0,d2
    bsr.s .l1
    moveq #4,d3
    move.l cout(pc),d1
    jmp Write(a6) ;call Write(stdout,buff,size)
.l1
    divu #1000,d5
    bsr .l0
    clr d5
    swap d5
    divu #100,d5
    bsr .l0
    clr d5
    swap d5
    divu #10,d5
    bsr .l0
    swap d5
.l0
    eori.b #'0',d5
    move.b d5,(a0)+
    rts

rasteri
    addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
    moveq #0,d0 ; must set Z flag on exit!
    rts

VBlankServer:
    dc.l 0,0 ;ln_Succ,ln_Pred
    dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri
    dc.l 0 ;ln_Name
    dc.l time,rasteri ;is_Data,is_Code

msgx dc.b 32,10
    cnop 0,4
time dc.l 0
cout dc.l 0
buf ds.b 4
ra
getnum
    jsr Input(a6) ;get stdin
    move.l #string,d2 ;set by previous call
    move.l d0,d1
    moveq.l #5,d3 ;+ newline
    jsr Read(a6)
    subq #1,d0
    beq .err
    move.l d2,a0
    clr.l d5
.l1
    clr d6
    move.b (a0)+,d6
    cmpi.b #'9',d6
    bhi .err
    subi.b #'0',d6
    bcs .err
    add d6,d5
    subq #1,d0
    beq .eos
    mulu #10,d5
    bra .l1
.err
    clr d5
.eos
    rts

string = msg1
libname dc.b "dos.library",0
msg1 dc.b 'number pi calculator v12 [Beta 3]'
    dc.b '(68020)'
;   dc.b '(68000)'
    dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10
Buffy dcb.b 65536-(Buffy-start)
|
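For readers who do not want to trace the 68k listing, here is a minimal C sketch of the spigot loop that the comments in the listing describe (r[i], d, b = 2*i-1, the 10000 multiplier, the 2000 fill value and N = 7*D/2). It follows the well-known compact C form of this algorithm and is only an illustration: the function name pi_spigot and its interface are made up for this sketch, and none of the assembly-level tricks (the overflow path in .longdiv, the word-sized array, the stack-held carry) are reproduced.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: writes `digits` decimal digits of pi (starting with
 * the leading 3) into `out` as a NUL-terminated string.
 * `digits` must be a positive multiple of 4; `out` needs digits + 1 bytes.
 * Returns 0 on success, -1 if the work array cannot be allocated. */
int pi_spigot(unsigned digits, char *out)
{
    assert(digits > 0 && digits % 4 == 0);

    size_t n = (size_t)digits / 4 * 14;        /* N = 7*D/2 array entries   */
    uint32_t *r = calloc(n + 1, sizeof *r);    /* r[n] starts out as 0      */
    if (r == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        r[i] = 2000;                           /* same fill value as .fill  */

    uint32_t carry = 0;                        /* "cv" in the listing       */
    for (size_t k = n; k > 0; k -= 14) {       /* 14 entries per 4 digits   */
        uint64_t d = 0;                        /* d <- 0                    */
        size_t i = k;                          /* i <- kv                   */
        for (;;) {
            d += (uint64_t)r[i] * 10000;       /* d += r[i]*10000           */
            uint64_t b = 2 * (uint64_t)i - 1;  /* b <- 2*i-1                */
            r[i] = (uint32_t)(d % b);          /* r[i] <- d%b               */
            d /= b;
            if (--i == 0)                      /* i <- i - 1                */
                break;
            d *= i;
        }
        char group[16];
        snprintf(group, sizeof group, "%04u", (unsigned)(carry + d / 10000));
        memcpy(out, group, 4);                 /* emit one 4-digit group    */
        out += 4;
        carry = (uint32_t)(d % 10000);
    }
    *out = '\0';
    free(r);
    return 0;
}
```

Calling pi_spigot(800, buf) with an 801-byte buffer fills buf with 800 digits. The 64-bit accumulator is a convenience of the sketch; the assembly version works with 32-bit values and handles DIVU overflow by hand.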
05 June 2021, 11:50 | #266 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Quote:
Only one problem, it doesn't appear to be any faster. But I don't know which version litwr's code is - perhaps it already incorporates some of the speedups discussed here? Was really hoping I was wrong to say that attempting to optimize the code would be a waste of time, but so far... |
|
05 June 2021, 12:44 | #267 | |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Quote:
Here's litwr's commit history for the Amiga version - he's certainly incorporated many if not all of the optimisations suggested in this thread: https://github.com/litwr2/rosetta-pi...a/pi-amiga.asm (And I have to say I'm impressed by just how many platforms he covered with this project.) |
|
05 June 2021, 15:34 | #268 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
Quote:
|
|
05 June 2021, 16:28 | #269 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
A different version of the PR0000 routine. Maybe faster, maybe not.
Code:
OldOpenLibrary = -408
CloseLibrary = -414
Output = -60
Input = -54
Write = -48
Read = -42
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5 ;for vblank interrupt
NT_INTERRUPT = 2 ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

div32x16 macro ;D7=D6/D4, D6=D6%D4
    ;clr.l d7
    moveq #0,d7
    divu.w d4,d6
    bvc.b .div32no\@
    swap d6
    move.w d6,d7
    divu.w d4,d7
    swap d7
    move d7,d6
    swap d6
    divu.w d4,d6
.div32no\@
    move.w d6,d7
    clr.w d6
    swap d6
    endm

start
    lea libname(pc),a1 ;open the dos library
    move.l 4.W,a5
    move.l a5,a6
    jsr OldOpenLibrary(a6)
    move.l d0,a6
    jsr Output(a6) ;get stdout
    lea cout(PC),A4
    move.l d0,(A4) ;cout
    move.l d0,d1 ;call Write(stdout,buff,size)
;   move.l #msg1,d2
    moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end
    add.l A4,D2
    moveq #msg4-msg1,d3
    jsr Write(a6)

;   move.l #start+$10000-ra,d7
;   divu #7,d7
    move.l #$10000-(ra-start),D7
    divu.w #7*4,D7
;   ext.l d7 ; necessary only for Litwr version of PR0000
;   and.b #$fc,d7 ;d7=maxn
    lsl.l #2,D7
.l20
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg4,d2
    moveq #msg4-cout,D2
    add.l A4,D2
    moveq #msg5-msg4,d3
    jsr Write(a6)
    move.l d7,d5
    bsr.w PR0000
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg5,d2
    moveq #msg5-cout,D2
    add.l A4,D2
    moveq #msg3-msg5,d3
    jsr Write(a6)
    bsr.w getnum
    cmp.w d7,d5
    bhi.b .l20
    move.w d5,d1
    beq.b .l20
    addq.w #3,d5
    and.w #$fffc,d5
    cmp.b #10,(a0)
    bne.b .l21
    move.w d5,d6
    cmp.w d1,d5
    beq.b .l7
.l21
    bsr.w PR0000
;   move.l cout(pc),d1
    move.l (A4),D1 ; cout
;   move.l #msg3,d2
    moveq #msg3-cout,D2
    add.l A4,D2
    moveq #msg2-msg3+1,d3
    jsr Write(a6)
.l7
    mulu.w #7,d6 ;kv = d6
    lsr.l #2,D6 ; /4
    move.l d6,d7
    lea ra(pc),a3
    exg a5,a6
    jsr Forbid(a6)
    moveq #INTB_VERTB,d0
    lea VBlankServer(pc),a1
    jsr AddIntServer(a6)
    exg a5,a6
    ;move.w #$4000,$dff096 ;DMA off
    move.l #2000*65537,d0
    move.l a3,a0
.fill
    move.l d0,(a0)+
    subq.l #1,D7
    bne.b .fill
    move.l D7,-(SP) ; cv
    lea 10000.W,A2
.l0
    moveq #0,D5 ;d <- 0
    move.l d6,d4 ;i <- kv, i <- i*2
    lsl.l #2,D4 ; *4
    adda.l d4,a3
    subq.l #1,d4 ;b <- 2*i-1
    move.l A2,D1
    bra.b .l4
.longdiv
    swap d3
    move.w d3,d7
    divu.w d4,d7
    swap d7
    move.w d7,d3
    swap d3
    divu.w d4,d3
    move.w d3,d7
    exg d3,d7
    clr.w d7
    swap d7
    move.w d7,(a3) ;r[i] <- d%b
    bra.b .enddiv
.l2
    sub.l d3,d5
    sub.l d7,d5
    lsr.l #1,d5
.l4
    move -(a3),d0 ; r[i]
    mulu.w d1,d0 ;r[i]*10000
    add.l d0,d5 ;d += r[i]*10000
    move.l d5,d3
    divu.w d4,d3
    bvs.s .longdiv
    move.w d3,d7
    clr.w d3
    swap d3
    move.w d3,(a3) ;r[i] <- d%b
.enddiv
    subq.l #2,d4 ;i <- i - 1
    bcc.b .l2 ;the main loop
    divu.w d1,d5 ;removed with MULU optimization
    add.w (SP),D5 ; cv
    move.l D5,(SP) ; cv
;   ext.l D5 ; necessary only for litwr version of PR0000 routine
    bsr.w PR0000
    subq.l #7,d6 ;kv
    bne.b .l0
    addq.l #4,SP ; restore stack
    move.l time(pc),d5
    ;move.w #$c000,$dff096 ;DMA on
    exg a5,a6
    moveq #INTB_VERTB,d0
    lea VBlankServer(pc),a1
    jsr RemIntServer(a6)
    jsr Permit(a6)
    exg a5,a6
    moveq #1,d3
    move.l cout(pc),d1
    move.l #msgx,d2
    jsr Write(a6) ;space
    move.l d5,d3
    lsl.l #1,d5
    cmp.b #50,VBlankFrequency(a5)
    beq .l8
    lsl.l #1,d5 ;60 Hz
    add.l d3,d5
    divu.w #3,d5
    swap d5
    lsr.w #2,d5
    swap d5
    negx.l d5
    neg.l d5
.l8
    lea string(pc),a3
    moveq.l #10,d4
    move.l d5,d6
    div32x16
    move.b d6,(a3)+
    divu.w d4,d7
    swap d7
    move.b d7,(a3)+
    clr.w d7
    swap d7
    move.b #'.'-'0',(a3)+
.l12
    tst.w d7
    beq .l11
    divu.w d4,d7
    swap d7
    move.b d7,(a3)+
    clr.w d7
    swap d7
    bra .l12
.l11
    add.b #'0',-(a3)
    moveq #1,d3
    move.l cout(pc),d1
    move.l a3,d2
    jsr Write(a6)
    cmp.l #string,a3
    bne .l11
    move.l cout(pc),d1
    move.l #msgx+1,d2
    jsr Write(a6) ;newline
    move.l a6,a1
    move.l a5,a6
    jmp CloseLibrary(a6)

PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
    lea $100.W,A0
    move.l #$303A3030,D2
    move.w #1000,D3
b1000
    sub.w D3,D5
    bcs.b n100
    add.w A0,D2
    bra.b b1000
n100
    add.w D3,D5
    moveq #100,D3
b100
    sub.w D3,D5
    bcs.b n10
    addq.b #1,D2
    bra.b b100
n10
    add.w D3,D5
    swap D2
    moveq #10,D3
b10
    sub.w D3,D5
    bcs.b n1
    add.w A0,D2
    bra.b b10
n1
    add.b D5,D2
    lea cout(PC),A0
    move.l (A0)+,D1
    move.l D2,(A0)
    move.l A0,D2 ; buf
    moveq #4,D3
    jmp Write(A6) ;call Write(stdout,buff,size)

rasteri
    addq.l #1,(a1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
    moveq #0,d0 ; must set Z flag on exit!
    rts

VBlankServer:
    dc.l 0,0 ;ln_Succ,ln_Pred
    dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri
    dc.l 0 ;ln_Name
    dc.l time,rasteri ;is_Data,is_Code

msgx dc.b 32,10
    cnop 0,4
time dc.l 0
cout dc.l 0
buf ds.b 4
ra
getnum
    jsr Input(a6) ;get stdin
    move.l #string,d2 ;set by previous call
    move.l d0,d1
    moveq #5,d3 ;+ newline
    jsr Read(a6)
    subq.w #1,d0
    beq .err
    move.l d2,a0
    clr.l d5
.l1
    clr.w d6
    move.b (a0)+,d6
    cmpi.b #'9',d6
    bhi.b .err
    subi.b #'0',d6
    bcs.b .err
    add.w d6,d5
    subq.w #1,d0
    beq.b .eos
    mulu.w #10,d5
    bra.b .l1
.err
    clr d5
.eos
    rts

string = msg1
libname dc.b "dos.library",0
msg1 dc.b 'number pi calculator v12 [Beta 3]'
    dc.b '(68020)'
;   dc.b '(68000)'
    dc.b 10
msg4 dc.b 'number of digits (up to '
msg5 dc.b ')? '
msg3 dc.b ' digits will be printed'
msg2 dc.b 10
Buffy dcb.b 65536-(Buffy-start)
|
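As far as I can tell, the PR0000 variant in this listing swaps the three DIVU instructions for subtract-and-count loops, one loop per decimal place. A rough C equivalent of that idea is sketched below; the name to_4_digits is invented here, and the sketch does not model the register packing of the assembly (which builds all four ASCII bytes in one long word and folds the last correction into the $3A constant).

```c
#include <assert.h>

/* Hypothetical helper: converts v (0..9999) into four ASCII digits with
 * leading zeros, using repeated subtraction instead of division.
 * `out` is not NUL-terminated. */
void to_4_digits(unsigned v, char out[4])
{
    assert(v <= 9999);

    out[0] = out[1] = out[2] = out[3] = '0';
    while (v >= 1000) { v -= 1000; out[0]++; }  /* thousands */
    while (v >= 100)  { v -= 100;  out[1]++; }  /* hundreds  */
    while (v >= 10)   { v -= 10;   out[2]++; }  /* tens      */
    out[3] = (char)(out[3] + v);                /* ones      */
}
```

Whether this beats the division version depends on the CPU: each digit costs up to nine cheap loop iterations instead of one expensive divide, so only a measurement can settle it.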
05 June 2021, 21:35 | #270 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Only glanced at the code, this stuck out:
Code:
.l1 clr.w d6
    move.b (a0)+,d6
    cmpi.b #'9',d6
    bhi.b .err
    subi.b #'0',d6
    bcs.b .err
    add.w d6,d5
    ...
Code:
.l1 move.w #256-'0',d6
    add.b (a0)+,d6
    cmp.w #9,d6
    bhi.b .err
    add.w d6,d5
    ...
Code:
    ...
    jsr Read(a6)
    move.l d2,a0
    moveq #0,d5
.loop subq.w #1,d0
    beq.b .done
    move.w #256-'0',d6
    add.b (a0)+,d6
    cmp.w #9,d6
    bhi.b .error
    mulu.w #10,d5
    add.w d6,d5
    bra.b .loop
.error moveq #0,d5
.done rts
Last edited by a/b; 05 June 2021 at 22:08. |
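The range check in the shorter versions relies on one idea: subtract '0' in 8-bit arithmetic first, then a single unsigned compare against 9 rejects characters both below '0' and above '9'. A small C sketch of the same loop shape follows; parse_digits and its parameters are hypothetical names, and the length is assumed to include the trailing newline, as the byte count returned by Read would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parses the decimal digits in s[0..len-2]; the last
 * byte (the newline) is skipped. Returns 0 on any non-digit or empty input,
 * mirroring the .error path of the assembly. */
unsigned parse_digits(const char *s, size_t len)
{
    assert(s != NULL);

    unsigned value = 0;
    if (len == 0)
        return 0;
    while (--len != 0) {                        /* subq.w #1,d0 / beq .done */
        /* (c - '0') wrapped to 8 bits: anything outside '0'..'9' is > 9 */
        unsigned d = (unsigned char)(*s++ - '0');
        if (d > 9)
            return 0;                           /* .error                   */
        value = value * 10 + d;                 /* mulu.w #10 / add.w       */
    }
    return value;
}
```

For example, parse_digits("100\n", 4) yields 100, while an input containing a letter yields 0.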
07 June 2021, 17:06 | #271 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Since this thread was originally supposed to be about optimizing 32 bit division (and because I am sick of seeing 300 digits of pi on my screen) I decided that for initial comparisons I would measure execution times without printing the digits. This saves the hassle of having to set the CLI window up exactly the same for each run (to avoid possible variations due to scrolling time etc.). Not printing the digits only made the code run about 8% faster, so any optimization in this area will probably make little difference.
I tested 3 code bases, litwr's V1 and V4 written in 2018 and Don_Adan's V12[BETA3] from post #265, on my A1200 with WB3.0 and 50MHz Blizzard 1230-IV. Results are rounded to the nearest 0.1 second.
litwr V1: 9.7 seconds
litwr V4: 8.9 seconds
Don_Adan V12b3: 8.9 seconds
This suggests that there is little opportunity for further significant improvement in execution speed of the core algorithm.
Just for fun I also tested it under different operating conditions. Normally I run The Enforcer on my system to warn me about programs trashing low memory. This can noticeably slow down programs that do a lot of legitimate low memory access. With The Enforcer running litwr's V4 code took 9.0 seconds to execute, which is ~1% slower.
I also tried disabling CPU caches, and executing from ChipRAM. The results were a little surprising. Disabling the data cache had little effect, but disabling the instruction cache increased execution time to 12.4 seconds or ~28% slower. This shows that even with the fast 60ns RAM on the Blizzard 1230-IV, getting critical code to fit inside the instruction cache can greatly speed it up.
But what really surprised me was the effect of running from Chip RAM. I expected a massive slowdown, but it wasn't that bad - 12.7 seconds or ~30% slower, not much worse than running from FastRAM with CPU caches off. Unless The Enforcer was running, then execution time ballooned out to 49.6 seconds whether the instruction cache was on or off. That's 5.57 times slower! |
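A note on reading the percentages: the "~28%" and "~30%" figures appear to match the loss in speed, (t - t0) / t, rather than the increase in run time, (t - t0) / t0, which would be roughly 39% for 12.4 seconds against the 8.9 second baseline. The small C sketch below shows both conventions; the function names are invented for illustration and the inputs are the timings quoted in this post.

```c
#include <assert.h>

/* Increase in run time relative to the baseline, in percent. */
double time_increase_pct(double base_s, double run_s)
{
    assert(base_s > 0.0);
    return (run_s - base_s) / base_s * 100.0;
}

/* Loss of speed relative to the baseline, in percent (1 - base/run). */
double speed_loss_pct(double base_s, double run_s)
{
    assert(run_s > 0.0);
    return (run_s - base_s) / run_s * 100.0;
}

/* Plain ratio of run times, e.g. for the "times slower" figure. */
double slowdown_factor(double base_s, double run_s)
{
    assert(base_s > 0.0);
    return run_s / base_s;
}
```

With base_s = 8.9, speed_loss_pct gives about 28.2 for 12.4 seconds and about 29.9 for 12.7 seconds, and slowdown_factor gives about 5.57 for 49.6 seconds.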
07 June 2021, 19:40 | #272 | |||
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Quote:
Isn't data cache disallowed for Chip RAM though? (The CPU reading data that the blitter or disk DMA has written gets way more complicated if there's a data cache.) Quote:
Quote:
|
|||
08 June 2021, 01:49 | #273 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Quote:
Even if the data was being written to ChipRAM, some accelerator cards (including the Blizzard 1230-IV?) have a 'delayed write' feature that starts a ChipRAM write and then disconnects the local bus so the CPU can continue processing, only waiting if it has to access ChipRAM again before the write has finished. Quote:
FastRAM, 256 colors, CPU caches on: 10.1 seconds
ChipRAM, 16 colors, CPU caches on: 15.2 seconds
ChipRAM, 256 colors, CPU caches on: 44.4 seconds
ChipRAM, 16 colors, CPU caches off: 61.2 seconds
ChipRAM, 256 colors, CPU caches off: 194.3 seconds
From this we see that when running from chip - even with massive DMA contention - having CPU caches on makes a huge improvement. When running max text overscan in 256 colors it was 4.4 times faster with caches on. In 16 colors it was 4.0 times faster. |
||
08 June 2021, 09:54 | #274 | |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
|
|
08 June 2021, 10:48 | #275 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
It's more complicated in reality. On the 68030, if the MMU is disabled, caching should(!) be controlled by the CIIN pin of the CPU. Board-specific logic then detects which address range an access goes to, and if it goes into chip mem region or custom I/O region (or the region where the board memory assumes them to be present), then the pin is pulled low. In principle, board logic can even detect the function codes and thus allow caching of code, but not of data.
However, "should", because this logic does not quite work. Due to a design bug of the 68030, the processor caches written data even if the CIIN pin is low, and this is the whole reason why the 68030.library is needed, namely to enable the MMU and make this reliable.
The mentioned feature of "delayed write" exists both on the 68040 and 68060, albeit in a slightly different form. Both processors have a "push buffer" into which they can migrate written data, and delay the write if the bus is slow. On the 68040, other RAM writes can "overtake" the data in the push buffer such that writes can become non-sequential, and that is why this feature is called "non-serialized" access on the 68040.
On the 68060, Motorola used a different design. Accesses on the 68060 are purely sequential, always, but the push buffer is present. Instead, the feature is called "imprecise access". This is because if a write ends up in the push buffer, and this write is delayed and later on triggers a physical bus error when it is performed, the program counter no longer points to the causing instruction, but potentially to a later instruction. Thus, the 68060 cannot re-issue the faulting instruction anymore, and the write is "lost". On the 68040, the contents of the push buffer are saved on the stack frame, and thus can be repeated, but not so on the 68060. Note that this only goes for physical bus errors, not for access errors (MMU invalid page detections), which are handled up front, before the instruction executes.
Thus, it is typically of a (small) advantage to map the chip RAM as non-caching, but imprecise (060) or non-serialized (040). I/O regions that may potentially cause bus faults should never be mapped imprecise (060), and I/O regions where the order of writes matters (typically yes) should never be mapped non-serialized (040). The mmulib handles this all fine for you (or the corresponding processor library). |
08 June 2021, 11:54 | #276 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
To handle this - it's not usually a problem in real life code - the 68030 has DC_WA (Write Allocate) bit in CACR. |
|
08 June 2021, 12:15 | #277 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Design bug, really. That CIIN was not working on writes was not listed in the first version of the 68030 UM. The problem was found by Mike Sinz, and then included as "specification change" in the manual. The first version of the manual still lists CIIN as operating on writes.
Of course, Mot didn't want to delay the cache update until the bus cycle is initiated, but that is part of the design issue, really. The cache operates "at the wrong place" and "at the wrong time". This issue was carried over from the 68020/68851 system design, but the 68020 had no data cache, so the issue was not apparent there. Quote:
Which cannot be used, Amiga must always work with write allocation ON. Think about why. (Hint: The cache is logically indexed and includes function codes as index). IOWs, there is no workaround other than turning the MMU on. Write-allocation off causes *also* cache-inconsistencies, but other inconsistencies. That these defects don't pop up immediately (ie. with write allocation OFF) is also due to the small size of the cache - but the issue also exists. |
|
08 June 2021, 12:39 | #278 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
The problem is that the bus cycle could be initiated long after next instruction is executed. In a similar way, 040/060's push buffers don't change the way data is cached, do they ? Quote:
Thinking about why, I don't have the time to try to guess, and anyway why not let everyone benefit from your knowledge here? Quote:
Turning WA off didn't crash the system either. I have yet to see a scenario where this can cause real trouble. |
|||
08 June 2021, 12:51 | #279 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Quote:
Quote:
Consider the following code with write-allocation off, with p pointing to a long-word aligned long word:
a = *p; /* read p into cache, fill cache line */
*p = 1; /* update p, in memory and in the data cache */
switch to supervisor
*p = 2; /* update p in memory, NOT in the data cache as write allocation is off */
switch to user
b = *p; /* this reads now stale data from the cache, not from memory */
The problem is that the cache is logically indexed, and the function codes also act as part of the cache index, so the write in supervisor mode will not update the cache line allocated in user mode, as the function code is different. The write goes through to memory without touching the cache, since write allocation is off. Since the data cache has not been updated, the second read from p reads stale data.
With write allocation on, the problem goes away since the write allocates in the cache, and it allocates the same cache line as the read of the user code. That is in a sense a "coincidence", but one that fixes the problem.
The Amiga side of the problem is that the Amiga doesn't have a separate user/supervisor model. If the two memory regions were distinct (as in a true Harvard architecture) the problem would not exist. The problem is that the 68030 design is all quirky; it is really a hot-patch of integrating the 68851 design into the 68020. This is just another example of what went wrong. Mot fixed all of this with the 68040. |
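The scenario can be played through with a toy model. The C sketch below is a deliberate simplification and not a description of the real 68030 data cache: it assumes a direct-mapped, write-through cache with one long word per line whose tag includes the function code, and every name in it (struct machine, cpu_read, cpu_write, run_scenario) is invented for the illustration. Only the function code values (1 for user data, 5 for supervisor data) are the standard 68k ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { CACHE_LINES = 16, MEM_WORDS = 64 };
enum { FC_USER_DATA = 1, FC_SUPER_DATA = 5 };

struct cache_line { bool valid; uint32_t addr; int fc; uint32_t data; };

struct machine {
    struct cache_line cache[CACHE_LINES];
    uint32_t mem[MEM_WORDS];
    bool write_allocate;
};

/* A line only hits if both the address and the function code match. */
static bool hits(const struct cache_line *l, uint32_t addr, int fc)
{
    return l->valid && l->addr == addr && l->fc == fc;
}

static void fill(struct cache_line *l, uint32_t addr, int fc, uint32_t data)
{
    l->valid = true;
    l->addr = addr;
    l->fc = fc;
    l->data = data;
}

uint32_t cpu_read(struct machine *m, uint32_t addr, int fc)
{
    assert(addr < MEM_WORDS);
    struct cache_line *l = &m->cache[addr % CACHE_LINES];
    if (!hits(l, addr, fc))
        fill(l, addr, fc, m->mem[addr]);   /* miss: load from memory */
    return l->data;
}

void cpu_write(struct machine *m, uint32_t addr, int fc, uint32_t value)
{
    assert(addr < MEM_WORDS);
    struct cache_line *l = &m->cache[addr % CACHE_LINES];
    m->mem[addr] = value;                  /* write-through to memory */
    if (hits(l, addr, fc) || m->write_allocate)
        fill(l, addr, fc, value);          /* update or allocate the line */
}

/* Plays the four accesses from the post and returns what user mode reads. */
uint32_t run_scenario(bool write_allocate)
{
    struct machine m = {0};
    const uint32_t p = 3;                  /* arbitrary long-word index */

    m.write_allocate = write_allocate;
    (void)cpu_read(&m, p, FC_USER_DATA);   /* a = *p                  */
    cpu_write(&m, p, FC_USER_DATA, 1);     /* *p = 1 (user)           */
    cpu_write(&m, p, FC_SUPER_DATA, 2);    /* *p = 2 (supervisor)     */
    return cpu_read(&m, p, FC_USER_DATA);  /* b = *p (user)           */
}
```

In this model run_scenario(false) returns the stale value 1, while run_scenario(true) returns 2 because the supervisor write evicts the user-tagged line and the following user read refills from memory.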
||
08 June 2021, 13:05 | #280 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
For same reason the 68030 is supposed to do it ?
Or perhaps they just don't have the relevant pins to detect the case ? Quote:
|
|