08 June 2021, 14:24 | #281 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,957
|
Quote:
|
|
08 June 2021, 15:50 | #282 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
However, the 68030 CIIN feature/bug is over-rated and rarely causes any problems. Therefore, it makes sense to implement any performance reducing CIIN fixes only when absolutely required (e.g. BridgeBoards and the tiny number of Zorro bus I/O cards which failed to provide their own hardware or software solutions for the problem). |
|
08 June 2021, 16:24 | #283 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,214
|
It goes beyond that. Various graphics cards are affected as well. Look at the CVision3D manual, for example: it asks users to run "Enforcer" (back then) to get an operational card on the 68030.
|
10 June 2021, 11:49 | #284 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,543
|
Quote:
However, the file size of your version is significantly smaller than litwr's, and could possibly be made even smaller. You tried replacing move.l #msgxx,d2 with moveq #msg1-cout,D2 : add.l A4,D2, but this didn't work because (except in one case) the addresses are too far apart. I replaced it with lea msgxx(pc),A0 : move.l a0,d2, which is the same size as the original code but saves a reloc32 entry. I see a few other places where a few bytes could be saved, but I can't be bothered. It is only 804 bytes now (down from the original 924), which is not bad for what it does. |
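The size trade-off described above can be tallied with a small sketch. It is not from the thread: the helper names are hypothetical, the instruction sizes are the standard 68000 encodings, and the 4 bytes per reloc32 entry is an assumption that ignores the fixed overhead of the relocation block itself.

```python
# moveq carries a sign-extended 8-bit immediate.
MOVEQ_RANGE = range(-128, 128)

def fits_moveq(offset):
    """True if a label difference such as msg1-cout can be loaded with moveq."""
    return offset in MOVEQ_RANGE

def bytes_in_file(code_bytes, reloc_entries, reloc_entry_bytes=4):
    """Code size plus the relocation offsets the loader needs (assumed 4 bytes each)."""
    return code_bytes + reloc_entries * reloc_entry_bytes

# move.l #msgxx,d2 : 2-byte opcode + 4-byte absolute address, needs one reloc32 entry
ABSOLUTE = bytes_in_file(6, 1)
# moveq #msgxx-cout,d2 : add.l a4,d2 : 2 + 2 bytes, position independent
BASE_RELATIVE = bytes_in_file(4, 0)
# lea msgxx(pc),a0 : move.l a0,d2 : 4 + 2 bytes, position independent
PC_RELATIVE = bytes_in_file(6, 0)

print(ABSOLUTE, BASE_RELATIVE, PC_RELATIVE)
```

The base-relative form is the smallest, but only while every offset from cout stays within the moveq range, which is why the longest text is worth moving to the end.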
|
10 June 2021, 16:00 | #285 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,957
|
Test version with a/b's optimisation added. More changes after dinner.
Moved the D2 and D3 setup out of the main loop, so it may be slightly faster now. Code:
OldOpenLibrary = -408 CloseLibrary = -414 Output = -60 Input = -54 Write = -48 Read = -42 Forbid = -132 Permit = -138 AddIntServer = -168 RemIntServer = -174 VBlankFrequency = 530 INTB_VERTB = 5 ;for vblank interrupt NT_INTERRUPT = 2 ;node type ;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits start lea libname(pc),a1 ;open the dos library move.l 4.W,a5 move.l a5,a6 jsr OldOpenLibrary(a6) move.l d0,a6 jsr Output(a6) ;get stdout lea cout(PC),A4 move.l d0,(A4) ;cout move.l d0,d1 ;call Write(stdout,buff,size) moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end add.l A4,D2 moveq #msg4-msg1,d3 jsr Write(a6) move.l #$10000-(ra-start),D7 divu.w #7*4,D7 lsl.l #2,D7 ; d7.w=maxn .l20 ; move.l cout(pc),d1 move.l (A4),D1 ; cout ; move.l #msg4,d2 moveq #msg4-cout,D2 add.l A4,D2 moveq #msg5-msg4,d3 jsr Write(a6) move.l d7,d5 bsr.w PR0000 ; move.l cout(pc),d1 move.l (A4),D1 ; cout ; move.l #msg5,d2 moveq #msg5-cout,D2 add.l A4,D2 moveq #msg3-msg5,d3 jsr Write(a6) bsr.w getnum cmp.w d7,d5 bhi.b .l20 move.w d5,d1 beq.b .l20 addq.w #3,d5 and.w #$fffc,d5 cmp.b #10,(a0) bne.b .l21 move.w d5,d6 cmp.w d1,d5 beq.b .l7 .l21 bsr.w PR0000 move.l (A4),D1 ; cout ; move.l #msg3,d2 moveq #msg3-cout,D2 add.l A4,D2 moveq #msg2-msg3+1,d3 jsr Write(a6) .l7 mulu.w #7,d6 ;kv = d6 lsr.l #2,D6 ; /4 move.l d6,d7 lea ra(pc),a3 exg a5,a6 jsr Forbid(a6) moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr AddIntServer(a6) exg a5,a6 ;move.w #$4000,$dff096 ;DMA off move.l #2000*65537,d0 move.l a3,a0 .fill move.l d0,(a0)+ subq.l #1,D7 bne.b .fill move.l D7,-(SP) ; cv lea 10000.W,A2 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf .l0 moveq #0,D5 ;d <- 0 move.l d6,d4 ;i <- kv, i <- i*2 lsl.l #2,D4 ; *4 adda.l d4,a3 subq.l #1,d4 ;b <- 2*i-1 move.l A2,D1 bra.b .l4 .longdiv swap d0 move.w d0,d7 divu.w d4,d7 swap d7 move.w d7,d0 swap d0 divu.w d4,d0 move.w d0,d7 exg d0,d7 clr.w d7 swap d7 move.w d7,(a3) ;r[i] <- d%b bra.b .enddiv .l2 sub.l d0,d5 sub.l d7,d5 lsr.l #1,d5 .l4 
move -(a3),d0 ; r[i] mulu.w d1,d0 ;r[i]*10000 add.l d0,d5 ;d += r[i]*10000 move.l d5,d0 divu.w d4,d0 bvs.s .longdiv move.w d0,d7 clr.w d0 swap d0 move.w d0,(a3) ;r[i] <- d%b .enddiv subq.l #2,d4 ;i <- i - 1 bcc.b .l2 ;the main loop divu.w d1,d5 ;removed with MULU optimization add.w (SP),D5 ; cv move.l D5,(SP) ; cv bsr.w PR000N subq.l #7,d6 ;kv bne.b .l0 addq.l #4,SP ; restore stack move.l time(pc),d5 ;move.w #$c000,$dff096 ;DMA on exg a5,a6 moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr RemIntServer(a6) jsr Permit(a6) exg a5,a6 moveq #1,d3 move.l cout(pc),d1 move.l #msgx,d2 jsr Write(a6) ;space move.l d5,d3 lsl.l #1,d5 cmp.b #50,VBlankFrequency(a5) beq .l8 lsl.l #1,d5 ;60 Hz add.l d3,d5 divu.w #3,d5 swap d5 lsr.w #2,d5 swap d5 negx.l d5 neg.l d5 .l8 lea string(pc),a3 moveq.l #10,d4 move.l d5,d6 ;div32x16 macro ;D7=D6/D4, D6=D6%D4 ; moveq #0,d7 ; not necessary D7 highword is already cleared divu.w d4,d6 bvc.b .div32no swap d6 move.w d6,d7 divu.w d4,d7 swap d7 move d7,d6 swap d6 divu.w d4,d6 .div32no move.w d6,d7 ; clr.w d6 ;not necessary swap d6 move.b d6,(a3)+ divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 move.b #'.'-'0',(a3)+ .l12 tst.w d7 beq .l11 divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 bra .l12 .l11 add.b #'0',-(a3) moveq #1,d3 move.l cout(pc),d1 move.l a3,d2 jsr Write(a6) cmp.l #string,a3 bne .l11 move.l cout(pc),d1 move.l #msgx+1,d2 jsr Write(a6) ;newline move.l a6,a1 move.l a5,a6 jmp CloseLibrary(a6) PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf PR000N lea $100.W,A0 move.l #$303A3030,D0 move.w #1000,D1 b1000 sub.w D1,D5 bcs.b n100 add.w A0,D0 bra.b b1000 n100 add.w D1,D5 moveq #100,D1 b100 sub.w D1,D5 bcs.b n10 addq.b #1,D0 bra.b b100 n10 add.w D1,D5 swap D0 moveq #10,D1 b10 sub.w D1,D5 bcs.b n1 add.w A0,D0 bra.b b10 n1 add.b D5,D0 move.l D0,4(A4) ; buf move.l (A4),D1 ; cout jmp Write(A6) ;call Write(stdout,buff,size) rasteri addq.l #1,(a1) ;If you set your interrupt to priority 10 
or higher then a0 must point at $dff000 on exit moveq #0,d0 ; must set Z flag on exit! rts VBlankServer: dc.l 0,0 ;ln_Succ,ln_Pred dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri dc.l 0 ;ln_Name dc.l time,rasteri ;is_Data,is_Code msgx dc.b 32,10 cnop 0,4 time dc.l 0 cout dc.l 0 buf ds.b 4 ; Overwritten code/data start here. ra getnum jsr Input(a6) ;get stdin ; move.l #string,d2 ;set by previous call moveq #msg1-cout,D2 add.l A4,D2 move.l d0,d1 moveq #5,d3 ;+ newline jsr Read(a6) move.l d2,a0 moveq #0,d5 .loop subq.w #1,d0 beq.b .done move.w #256-'0',d6 add.b (a0)+,d6 cmp.w #9,d6 bhi.b .error mulu.w #10,d5 add.w d6,d5 bra.b .loop .error moveq #0,d5 .done rts string = msg1 libname dc.b "dos.library",0 msg1 dc.b 'number pi calculator v13' dc.b 10 msg4 dc.b 'number of digits (up to ' msg5 dc.b ')? ' msg3 dc.b ' digits will be printed' msg2 dc.b 10 Buffy dcb.b 65536-(Buffy-start) Last edited by Don_Adan; 10 June 2021 at 17:30. |
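For readers following the listing, the algorithm behind it can be sketched in Python. This is not code from the thread: it is an assumed model of the usual spigot that keeps the remainders r[i] in an array initialised to 2000 and emits four digits per pass, using the N = 7*D/2 sizing from the comment at the top of the source. The function name is hypothetical.

```python
def pi_spigot(digits):
    """Return pi as a digit string, four digits per outer pass."""
    digits = (digits + 3) & ~3          # round up to a multiple of 4, as the asm does
    n = digits * 7 // 2                 # N = 7*D/2 terms
    r = [2000] * (n + 1)                # remainder array, r[0] is unused
    carry = 0
    out = []
    k = n
    while k > 0:
        d = 0
        i = k
        while True:
            d += r[i] * 10000           # d += r[i]*10000
            b = 2 * i - 1               # b <- 2*i-1
            r[i] = d % b                # r[i] <- d%b
            d //= b
            i -= 1
            if i == 0:
                break
            d *= i
        out.append("%04d" % (carry + d // 10000))
        carry = d % 10000
        k -= 14                         # 14 terms are dropped per 4 digits
    return "".join(out)

print(pi_spigot(4))
```

The inner division d // b is the 32-bit by 16-bit division that the .longdiv path in the listing exists for: the quotient does not always fit in 16 bits.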
10 June 2021, 17:33 | #286 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Any particular reason why DCB.B 65536-... and not DS.B instead? As far as I can see the buffer is filled with 2000s, so why not make the executable significantly shorter?
Also, this is shorter (and faster, not that it matters much here): Code:
PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
    move.w #$0100,a0
    move.l #$2f3a2f2f,d2
    move.w #1000,d3
.b1000
    add.w a0,d2
    sub.w d3,d5
    bcc.b .b1000
    add.w d3,d5
    moveq #100,d3
.b100
    addq.b #1,d2
    sub.w d3,d5
    bcc.b .b100
    add.w d3,d5
    swap d2
    moveq #10,d3
.b10
    add.w a0,d2
    sub.w d3,d5
    bcc.b .b10
    add.b d5,d2
    lea cout(pc),a0
    ...
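To see why the constant is $2f3a2f2f, here is a Python model of the routine above. It is a sketch, not part of the post: the helper names are hypothetical and registers are modelled as plain integers. Each counter starts one below '0' ($2f) because the loop increments before it tests, and the units byte starts at $3a because d5 ends up holding units minus 10.

```python
def add_w(reg, val):
    """add.w: change only the low 16 bits of the register."""
    return (reg & 0xFFFF0000) | ((reg + val) & 0xFFFF)

def add_b(reg, val):
    """add.b / addq.b: change only the low 8 bits of the register."""
    return (reg & 0xFFFFFF00) | ((reg + val) & 0xFF)

def swap(reg):
    """swap: exchange the two 16-bit halves."""
    return ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF

def four_digits(value):
    """Return the 4 ASCII digits the routine leaves packed in d2."""
    if not 0 <= value <= 9999:
        raise ValueError("routine handles 0..9999 only")
    d2, d5 = 0x2F3A2F2F, value
    while True:                      # .b1000: thousands in bits 8..15
        d2 = add_w(d2, 0x0100)
        d5 -= 1000
        if d5 < 0:                   # borrow, i.e. bcc not taken
            break
    d5 += 1000
    while True:                      # .b100: hundreds in bits 0..7
        d2 = add_b(d2, 1)
        d5 -= 100
        if d5 < 0:
            break
    d5 += 100
    d2 = swap(d2)                    # bring $2f3a into the low word
    while True:                      # .b10: tens in bits 8..15
        d2 = add_w(d2, 0x0100)
        d5 -= 10
        if d5 < 0:
            break
    d2 = add_b(d2, d5)               # $3a + (units - 10) = '0' + units
    return d2.to_bytes(4, "big").decode("ascii")

print(four_digits(3141))
```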
10 June 2021, 22:52 | #287 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,957
|
Added more of a/b's optimisations. DCB.B was used in the GitHub version. I am cleaning up the code step by step. ds.b is used now.
Code:
OldOpenLibrary = -408 CloseLibrary = -414 Output = -60 Input = -54 Write = -48 Read = -42 Forbid = -132 Permit = -138 AddIntServer = -168 RemIntServer = -174 VBlankFrequency = 530 INTB_VERTB = 5 ;for vblank interrupt NT_INTERRUPT = 2 ;node type ;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits start lea libname(pc),a1 ;open the dos library move.l 4.W,a5 move.l a5,a6 jsr OldOpenLibrary(a6) move.l d0,a6 jsr Output(a6) ;get stdout lea cout(PC),A4 move.l d0,(A4) ;cout move.l d0,d1 ;call Write(stdout,buff,size) moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end add.l A4,D2 moveq #msg4-msg1,d3 jsr Write(a6) move.l #$10000-(ra-start),D7 divu.w #7*4,D7 lsl.l #2,D7 ; d7.w=maxn .l20 ; move.l cout(pc),d1 move.l (A4),D1 ; cout ; move.l #msg4,d2 moveq #msg4-cout,D2 add.l A4,D2 moveq #msg5-msg4,d3 jsr Write(a6) move.l d7,d5 bsr.w PR0000 ; move.l cout(pc),d1 move.l (A4),D1 ; cout ; move.l #msg5,d2 moveq #msg5-cout,D2 add.l A4,D2 moveq #msg3-msg5,d3 jsr Write(a6) bsr.w getnum cmp.w d7,d5 bhi.b .l20 move.w d5,d1 beq.b .l20 addq.w #3,d5 and.w #$fffc,d5 cmp.b #10,(a0) bne.b .l21 move.w d5,d6 cmp.w d1,d5 beq.b .l7 .l21 bsr.w PR0000 move.l (A4),D1 ; cout ; move.l #msg3,d2 moveq #msg3-cout,D2 add.l A4,D2 moveq #msg2-msg3+1,d3 jsr Write(a6) .l7 mulu.w #7,d6 ;kv = d6 lsr.l #2,D6 ; /4 move.l d6,d7 lea ra(pc),a3 exg a5,a6 jsr Forbid(a6) moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr AddIntServer(a6) exg a5,a6 ;move.w #$4000,$dff096 ;DMA off move.l #2000*65537,d0 move.l a3,a0 .fill move.l d0,(a0)+ subq.l #1,D7 bne.b .fill move.l D7,-(SP) ; cv lea 10000.W,A2 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf .l0 moveq #0,D5 ;d <- 0 move.l d6,d4 ;i <- kv, i <- i*2 lsl.l #2,D4 ; *4 adda.l d4,a3 subq.l #1,d4 ;b <- 2*i-1 move.l A2,D1 bra.b .l4 .longdiv swap d0 move.w d0,d7 divu.w d4,d7 swap d7 move.w d7,d0 swap d0 divu.w d4,d0 move.w d0,d7 exg d0,d7 clr.w d7 swap d7 move.w d7,(a3) ;r[i] <- d%b bra.b .enddiv .l2 sub.l d0,d5 sub.l d7,d5 lsr.l #1,d5 .l4 
move -(a3),d0 ; r[i] mulu.w d1,d0 ;r[i]*10000 add.l d0,d5 ;d += r[i]*10000 move.l d5,d0 divu.w d4,d0 bvs.s .longdiv move.w d0,d7 clr.w d0 swap d0 move.w d0,(a3) ;r[i] <- d%b .enddiv subq.l #2,d4 ;i <- i - 1 bcc.b .l2 ;the main loop divu.w d1,d5 ;removed with MULU optimization add.w (SP),D5 ; cv move.l D5,(SP) ; cv bsr.w PR000N subq.l #7,d6 ;kv bne.b .l0 addq.l #4,SP ; restore stack move.l time(pc),d5 ;move.w #$c000,$dff096 ;DMA on exg a5,a6 moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr RemIntServer(a6) jsr Permit(a6) exg a5,a6 moveq #1,d3 ; move.l cout(pc),d1 move.l (A4),D1 ; cout move.l #msgx,d2 jsr Write(a6) ;space move.l d5,d3 lsl.l #1,d5 cmp.b #50,VBlankFrequency(a5) beq .l8 lsl.l #1,d5 ;60 Hz add.l d3,d5 divu.w #3,d5 swap d5 lsr.w #2,d5 swap d5 negx.l d5 neg.l d5 .l8 lea string(pc),a3 moveq.l #10,d4 move.l d5,d6 ;div32x16 macro ;D7=D6/D4, D6=D6%D4 ; moveq #0,d7 ; not necessary D7 highword is already cleared divu.w d4,d6 bvc.b .div32no swap d6 move.w d6,d7 divu.w d4,d7 swap d7 move d7,d6 swap d6 divu.w d4,d6 .div32no move.w d6,d7 ; clr.w d6 ;not necessary swap d6 move.b d6,(a3)+ divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 move.b #'.'-'0',(a3)+ .l12 tst.w d7 beq .l11 divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 bra .l12 .l11 add.b #'0',-(a3) moveq #1,d3 ; move.l cout(pc),d1 move.l (A4),D1 ; cout move.l a3,d2 jsr Write(a6) cmp.l #string,a3 bne .l11 ; move.l cout(pc),d1 move.l (A4),D1 ; cout move.l #msgx+1,d2 jsr Write(a6) ;newline move.l a6,a1 move.l a5,a6 jmp CloseLibrary(a6) PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf PR000N move.w #$0100,a0 move.l #$2f3a2f2f,d0 move.w #1000,d1 .b1000 add.w a0,d0 sub.w d1,d5 bcc.b .b1000 add.w d1,d5 moveq #100,d1 .b100 addq.b #1,d0 sub.w d1,d5 bcc.b .b100 add.w d1,d5 swap d0 moveq #10,d1 .b10 add.w a0,d0 sub.w d1,d5 bcc.b .b10 add.b d5,d0 move.l D0,4(A4) ; buf move.l (A4),D1 ; cout jmp Write(A6) ;call Write(stdout,buff,size) rasteri addq.l #1,(a1) ;If 
you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit moveq #0,d0 ; must set Z flag on exit! rts VBlankServer: dc.l 0,0 ;ln_Succ,ln_Pred dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri dc.l 0 ;ln_Name dc.l time,rasteri ;is_Data,is_Code msgx dc.b 32,10 cnop 0,4 time dc.l 0 cout dc.l 0 buf ds.b 4 ; Overwritten code/data start here. ra getnum jsr Input(a6) ;get stdin ; move.l #string,d2 ;set by previous call moveq #msg1-cout,D2 add.l A4,D2 move.l d0,d1 moveq #5,d3 ;+ newline jsr Read(a6) move.l d2,a0 moveq #0,d5 .loop subq.w #1,d0 beq.b .done move.w #256-'0',d6 add.b (a0)+,d6 cmp.w #9,d6 bhi.b .error mulu.w #10,d5 add.w d6,d5 bra.b .loop .error moveq #0,d5 .done rts string = msg1 libname dc.b "dos.library",0 msg1 dc.b 'number pi calculator v13' dc.b 10 msg4 dc.b 'number of digits (up to ' msg5 dc.b ')? ' msg3 dc.b ' digits will be printed' msg2 dc.b 10 Buffy ds.b 65536-(Buffy-start) Last edited by Don_Adan; 10 June 2021 at 23:00. |
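The .longdiv path in the listing is the part that deals with DIVU.W overflow. A Python sketch of the idea follows. It is an assumed model, not the author's code, and the function names are hypothetical: when the quotient does not fit in 16 bits, the 32-bit dividend is divided in two 16-bit steps, high word first. The last function models the .l2 step, which gets d*(i-1) from the quotient and remainder without a multiply.

```python
def divu_w(dividend, divisor):
    """Model of DIVU.W: returns (quotient, remainder), or None on overflow (V set)."""
    q, r = divmod(dividend, divisor)
    if q > 0xFFFF:
        return None
    return q, r

def div32by16(dividend, divisor):
    """32-bit / 16-bit unsigned division with a 32-bit quotient."""
    fast = divu_w(dividend, divisor)
    if fast is not None:
        return fast
    hi, lo = dividend >> 16, dividend & 0xFFFF
    q_hi, r_hi = divu_w(hi, divisor)                 # cannot overflow: hi < 65536
    q_lo, r = divu_w((r_hi << 16) | lo, divisor)     # cannot overflow: r_hi < divisor
    return (q_hi << 16) | q_lo, r

def next_d(d, q, r):
    """Model of sub.l d0,d5 / sub.l d7,d5 / lsr.l #1,d5 for a divisor b = 2*i-1."""
    return (d - q - r) >> 1                          # equals q*(i-1)

q, r = div32by16(20_000_000, 27)
print(q, r, next_d(20_000_000, q, r))
```

With b = 2*i-1 and d = q*b + r, the value d - q - r is q*(2*i-2), so one shift gives q*(i-1).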
11 June 2021, 00:01 | #288 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,543
|
Quote:
Code:
    section 1,code
    opt d+
    MC68020
    output ram:profile
_main:
    move.w #50000-1,d6 ; 50000*1000 = 1 second per clk at 50MHz
; -- outer loop --
oloop:
    lea chipram,a0 ; a0 = pointer to chipram
    move.l #$12345678,d0 ; d0 = value to write
    move.w #1000-1,d5 ; repeat inner loop code 1000 times
; -- inner loop --
loop:
    move.l d0,(a0)+ ; write longword to next chipram address
    dbf d5,loop
; -- outer loop --
    dbf d6,oloop
    moveq #0,d0
    rts
    align.l
fastram:
    ds.l 10000
    section 2,DATA_C
chipram:
    ds.l 10000
Now, how long do you think this code takes to execute:-
Code:
    section 1,code
    opt d+
    MC68020
    output ram:profile
_main:
    move.w #50000-1,d6 ; 50000*1000 = 1 second per clk at 50MHz
; -- outer loop --
oloop:
    lea chipram,a0 ; a0 = pointer to chipram
    move.l #$12345678,d0 ; d0 = value to write
    move.w #1000-1,d5 ; repeat inner loop code 1000 times
; -- inner loop --
loop:
    move.l d0,(a0)+ ; write longword to next chipram address
    move.l d0,d1
    move.l d1,d2
    move.l d2,d3
    move.l d3,d4
    move.l d4,d3
    move.l d3,d2
    move.l d2,d1
    move.l d1,d0
    move.l d0,d1
    move.l d1,d2
    dbf d5,loop
; -- outer loop --
    dbf d6,oloop
    moveq #0,d0
    rts
    align.l
fastram:
    ds.l 10000
    section 2,DATA_C
chipram:
    ds.l 10000
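The comment "50000*1000 = 1 second per clk at 50MHz" is what makes these listings easy to read off with a stopwatch. A small sketch of the arithmetic follows, with a hypothetical function name and a purely illustrative elapsed time (no measurement from the thread is implied).

```python
def cycles_per_pass(elapsed_seconds, clock_hz=50_000_000, outer=50_000, inner=1_000):
    """Average CPU clocks spent per pass of the inner loop."""
    passes = outer * inner               # 50,000,000 passes in total
    return elapsed_seconds * clock_hz / passes

# Illustrative input only: a run of 28 s at 50 MHz means 28 clocks per pass.
print(cycles_per_pass(28.0))
```

At other clock speeds the one-second-per-clock shortcut no longer holds, so clock_hz has to be set to the actual CPU clock.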
|
11 June 2021, 00:54 | #289 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
|
11 June 2021, 08:29 | #290 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
The extra moves don't make it slower, but that has nothing to do with the processor continuing to work while the chipmem write is still pending. The extra moves don't seem to consume time because, after the first chipmem write, the processor and chipmem are in sync, meaning the processor can spend cycles on something else before chipmem is even ready to take the next access. While this may look the same as continuing work while the chipmem write is pending, it is not. Writing 3.5 or 7 MB to chipmem will still block 50M processor cycles on the 030, while the 060 can make those chipmem writes for (almost) free. The 060 can do it because it simply leaves handling the write to a unit that does not block the execution pipeline.
|
11 June 2021, 08:42 | #291 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,543
|
Quote:
Imagine you had a program that reads some data from FastRAM, performs several operations on it in registers, then writes it to ChipRAM. If you parse all the data first and then write it to ChipRAM as a block, the CPU will have to wait for ~20 cycles during each write. But if you interleave the ChipRAM writes with other code, you could execute up to 10 instructions 'for free' while each write is in progress, so long as the instructions and data being worked on are in the cache. You could have code that appears to be less efficient (because it needs more instructions to interleave the operations) but actually runs much faster.
The effect is much smaller when accessing only FastRAM, but in some situations the order of instructions could still make a difference, particularly on machines with relatively slow 'fast' RAM.
So what does this mean for pi-spigot? Firstly, it explains why the performance hit from running in ChipRAM is not nearly as much as you might expect. Secondly, reducing the number of instructions and/or using what appear to be 'faster' instruction sequences may not necessarily result in the fastest code. |
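A toy model of that comparison, under loudly stated assumptions: every item needs a fixed number of register-only cycles, every ChipRAM write occupies a fixed number of cycles, and register work can fully overlap a pending write. The function names and the example numbers are hypothetical, and whether a given CPU and accelerator really overlap like this is exactly what the thread is arguing about.

```python
def batched_cycles(items, work_cycles, write_cycles):
    """Process everything first, then write: nothing overlaps."""
    return items * (work_cycles + write_cycles)

def interleaved_cycles(items, work_cycles, write_cycles):
    """Work on item n+1 while the write of item n is still in progress."""
    if items == 0:
        return 0
    # The first item's work and the last item's write have nothing to overlap with.
    overlapped = (items - 1) * max(work_cycles, write_cycles)
    return work_cycles + overlapped + write_cycles

print(batched_cycles(1000, 20, 20), interleaved_cycles(1000, 20, 20))
```

In this model the interleaved version approaches the cost of the writes alone once the register work per item is no longer than one write.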
|
11 June 2021, 08:52 | #292 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
|
|
11 June 2021, 09:24 | #293 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,543
|
Quote:
Quote:
What I don't know is whether there are any accelerator cards which latch the write data on the ChipRAM side of the bus (perhaps even before the CPU slot is ready) and release the CPU side so it can continue accessing its own FastRAM at full speed. I suspect the Blizzard 1230-IV doesn't do this, but I will do more tests to confirm it. |
||
11 June 2021, 12:08 | #294 | |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
All Motorola 68k CPUs will use some cycles for processing the instruction (probably just 1 cycle for the 060 and something like 4 cycles for the 030). My point is the 060 can completely hide the extra cycles for the external bus (which are many for chipmem writes) from the instruction pipeline while the 030 cannot. |
|
11 June 2021, 12:18 | #295 | ||
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
Quote:
|
||
11 June 2021, 12:28 | #296 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
The proof is that if you attempt to access memory during these cycles, even from fastmem (actually even from the data cache!), it will stop until the write is complete (i.e. you can't hide any memory access). Quote:
It takes a 50 MHz 030 28 clocks to perform an access to chipmem. Of these, at least 22 are free for use, which is far from your "about half". With 60 ns fastmem you can hide 4 cycles out of 8. |
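The figures above work out as follows. This is only the arithmetic on the numbers given in the post, and the function names are hypothetical.

```python
def access_time_ns(clocks, cpu_mhz):
    """Duration of a bus access in nanoseconds."""
    return clocks * 1000 / cpu_mhz

def hideable_fraction(total_clocks, free_clocks):
    """Share of an access during which the CPU could run other instructions."""
    return free_clocks / total_clocks

print(access_time_ns(28, 50))                 # chipmem access on a 50 MHz 030, in ns
print(round(hideable_fraction(28, 22), 3))    # chipmem: at least 22 of 28 clocks
print(hideable_fraction(8, 4))                # 60 ns fastmem: 4 of 8 clocks
```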
||
11 June 2021, 13:25 | #297 | |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
|
|
11 June 2021, 13:46 | #298 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
You can however make things better on 030 by grouping fastmem reads together. And if you use data burst you can even insert register-only instructions between the reads and get a little speed gain. But this is bringing us quite far from the original topic of 32-bit division... |
|
11 June 2021, 20:18 | #299 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,214
|
As said before, the 040 also has a push buffer, same as the 060. It is one cache line (4 LWs) large, but only used for pushing dirty cache lines. But in addition, the 040 has a three-stage pipeline in which data written out is buffered (WB3 to WB1). They are used for cached and non-serialized write accesses. If non-serialized, a read can overtake a write. That is *not* the case for the 060 where reads and writes are always in strict order. The 060 has an imprecise mode, though.
|
12 June 2021, 05:13 | #300 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,957
|
Small size optimisations. Still to do: the end part, the routine that converts VBI ticks to time.
Code:
OldOpenLibrary = -408 CloseLibrary = -414 Output = -60 Input = -54 Write = -48 Read = -42 Forbid = -132 Permit = -138 AddIntServer = -168 RemIntServer = -174 VBlankFrequency = 530 INTB_VERTB = 5 ;for vblank interrupt NT_INTERRUPT = 2 ;node type ;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits start lea libname(pc),a1 ;open the dos library move.l 4.W,a5 move.l a5,a6 jsr OldOpenLibrary(a6) move.l d0,a6 jsr Output(a6) ;get stdout lea cout(PC),A4 move.l d0,(A4) ;cout move.l d0,d1 ;call Write(stdout,buff,size) moveq #msg1-cout,D2 ; must be checked if in moveq range, the longest text can be moved at end add.l A4,D2 moveq #msg4-msg1,d3 jsr Write(a6) ; move.l #$10000-(ra-start),d7 ; divu.w #7*4,D7 ; lsl.l #2,D7 ; d7.w=maxn move.l #((65536-(ra-start))/(7<<2))<<2,D7 ; d7=maxn .l20 move.l (A4),D1 ; cout moveq #msg4-cout,D2 add.l A4,D2 moveq #msg5-msg4,d3 jsr Write(a6) move.l d7,d5 bsr.w PR0000 move.l (A4),D1 ; cout moveq #msg5-cout,D2 add.l A4,D2 moveq #msg3-msg5,d3 jsr Write(a6) bsr.w getnum cmp.w d7,d5 bhi.b .l20 move.w d5,d1 beq.b .l20 addq.w #3,d5 and.w #$fffc,d5 cmp.b #10,(a0) bne.b .l21 move.w d5,d6 cmp.w d1,d5 beq.b .l7 .l21 bsr.w PR0000 move.l (A4),D1 ; cout moveq #msg3-cout,D2 add.l A4,D2 moveq #msg2-msg3+1,d3 jsr Write(a6) .l7 mulu.w #7,d6 ;kv = d6 lsr.l #2,D6 ; /4 move.l d6,d7 lea ra(pc),a3 exg a5,a6 jsr Forbid(a6) moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr AddIntServer(a6) exg a5,a6 ;move.w #$4000,$dff096 ;DMA off move.l #2000*65537,d0 move.l a3,a0 .fill move.l d0,(a0)+ subq.l #1,D7 bne.b .fill move.l D7,-(SP) ; cv lea 10000.W,A2 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf .l0 moveq #0,D5 ;d <- 0 move.l d6,d4 ;i <- kv, i <- i*2 lsl.l #2,D4 ; *4 adda.l d4,a3 subq.l #1,d4 ;b <- 2*i-1 move.l A2,D1 bra.b .l4 .longdiv swap d0 move.w d0,d7 divu.w d4,d7 swap d7 move.w d7,d0 swap d0 divu.w d4,d0 move.w d0,d7 exg d0,d7 clr.w d7 swap d7 move.w d7,(a3) ;r[i] <- d%b bra.b .enddiv .l2 sub.l d0,d5 sub.l d7,d5 lsr.l #1,d5 .l4 move -(a3),d0 ; r[i] mulu.w d1,d0 
;r[i]*10000 add.l d0,d5 ;d += r[i]*10000 move.l d5,d0 divu.w d4,d0 bvs.s .longdiv move.w d0,d7 clr.w d0 swap d0 move.w d0,(a3) ;r[i] <- d%b .enddiv subq.l #2,d4 ;i <- i - 1 bcc.b .l2 ;the main loop divu.w d1,d5 ;removed with MULU optimization add.w (SP),D5 ; cv move.l D5,(SP) ; cv bsr.w PR000N subq.l #7,d6 ;kv bne.b .l0 addq.l #4,SP ; restore stack move.l time(pc),d5 ;move.w #$c000,$dff096 ;DMA on exg a5,a6 moveq #INTB_VERTB,d0 lea VBlankServer(pc),a1 jsr RemIntServer(a6) jsr Permit(a6) exg a5,a6 moveq #1,d3 move.l (A4),D1 ; cout ; move.l #msgx,d2 moveq #msgx-cout,d2 add.l A4,D2 jsr Write(a6) ;space move.l d5,d3 lsl.l #1,d5 cmp.b #50,VBlankFrequency(a5) beq .l8 lsl.l #1,d5 ;60 Hz add.l d3,d5 divu.w #3,d5 swap d5 lsr.w #2,d5 swap d5 negx.l d5 neg.l d5 .l8 lea string(pc),a3 moveq.l #10,d4 move.l d5,d6 ;div32x16 macro ;D7=D6/D4, D6=D6%D4 ; moveq #0,d7 ; not necessary D7 highword is already cleared divu.w d4,d6 bvc.b .div32no swap d6 move.w d6,d7 divu.w d4,d7 swap d7 move d7,d6 swap d6 divu.w d4,d6 .div32no move.w d6,d7 ; clr.w d6 ;not necessary swap d6 move.b d6,(a3)+ divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 move.b #'.'-'0',(a3)+ .l12 tst.w d7 beq .l11 divu.w d4,d7 swap d7 move.b d7,(a3)+ clr.w d7 swap d7 bra .l12 .l11 add.b #'0',-(a3) moveq #1,d3 ; move.l cout(pc),d1 move.l (A4),D1 ; cout move.l a3,d2 jsr Write(a6) cmp.l #string,a3 bne .l11 ; move.l cout(pc),d1 move.l (A4),D1 ; cout ; move.l #msgx+1,d2 moveq #msgx+1-cout,d2 add.l A4,D2 jsr Write(a6) ;newline move.l a6,a1 move.l a5,a6 jmp CloseLibrary(a6) PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3 moveq #4,D3 moveq #buf-cout,D2 add.l A4,D2 ; buf PR000N move.w #$0100,a0 move.l #$2f3a2f2f,d0 move.w #1000,d1 .b1000 add.w a0,d0 sub.w d1,d5 bcc.b .b1000 add.w d1,d5 moveq #100,d1 .b100 addq.b #1,d0 sub.w d1,d5 bcc.b .b100 add.w d1,d5 swap d0 moveq #10,d1 .b10 add.w a0,d0 sub.w d1,d5 bcc.b .b10 add.b d5,d0 move.l D0,4(A4) ; buf move.l (A4),D1 ; cout jmp Write(A6) ;call Write(stdout,buff,size) rasteri 
addq.l #1,(a1) ;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit moveq #0,d0 ; must set Z flag on exit! rts VBlankServer: dc.l 0,0 ;ln_Succ,ln_Pred dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri dc.l 0 ;ln_Name dc.l time,rasteri ;is_Data,is_Code msgx dc.b 32,10 cnop 0,4 time dc.l 0 cout dc.l 0 buf ds.b 4 ; Overwritten code/data start here. ra string = msg1 libname dc.b "dos.library",0 msg1 dc.b 'number pi calculator v13',10 msg4 dc.b 'number of digits (up to ' msg5 dc.b ')? ' msg3 dc.b ' digits will be printed' msg2 dc.b 10,0 even getnum jsr Input(a6) ;get stdin moveq #msg1-cout,D2 add.l A4,D2 move.l d0,d1 moveq #5,d3 ;+ newline jsr Read(a6) move.l d2,a0 moveq #0,d5 .loop subq.w #1,d0 beq.b .done move.w #256-'0',d6 add.b (a0)+,d6 cmp.w #9,d6 bhi.b .error mulu.w #10,d5 add.w d6,d5 bra.b .loop .error moveq #0,d5 .done rts Buffy ds.b 65536-(Buffy-start) Last edited by Don_Adan; 12 June 2021 at 12:38. |
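For reference while that end part is being reworked, this is a Python model of what the tick conversion in the listing computes before the digits are printed. It is a sketch with a hypothetical function name: the result is in hundredths of a second, any VBlankFrequency other than 50 is treated as 60 Hz (as the cmp.b #50 test does), and the 16-bit limit of the divu.w #3 quotient is not modelled.

```python
def ticks_to_centiseconds(ticks, vblank_hz):
    """Convert vertical-blank ticks to hundredths of a second."""
    if vblank_hz == 50:
        return ticks * 2                 # lsl.l #1,d5: one tick is 2/100 s
    # 60 Hz path: (4*ticks + ticks) / 3 via lsl.l, add.l and divu.w #3
    quotient, remainder = divmod(ticks * 5, 3)
    # lsr.w #2 shifts bit 1 of the remainder into X; negx/neg then add X,
    # so the result is rounded up only when the remainder is 2.
    return quotient + (1 if remainder == 2 else 0)

print(ticks_to_centiseconds(50, 50), ticks_to_centiseconds(60, 60))
```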