06 May 2021, 07:59 | #61 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,233
|
Quote:
Actually, I do a lot of signal processing in my day job, and what I learned is: regardless of what the CPU is, avoid divisions. You typically replace them with a multiplication by a pre-shifted inverse and a right shift. That's precise enough, and a lot faster than the division algorithm. |
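A minimal sketch of the "pre-shifted inverse plus right shift" idea from the post above, in Python for readability. The names `make_reciprocal` and `div_by_reciprocal` and the 32-bit shift are assumptions made for illustration; they are not taken from the poster's code.

```python
SHIFT = 32  # assumed fixed-point precision of the pre-shifted inverse


def make_reciprocal(d: int, shift: int = SHIFT) -> int:
    """Return ceil(2**shift / d), computed once per divisor."""
    return ((1 << shift) + d - 1) // d


def div_by_reciprocal(x: int, recip: int, shift: int = SHIFT) -> int:
    """Approximate x // d with one multiplication and one right shift."""
    return (x * recip) >> shift


recip10 = make_reciprocal(10)
print(div_by_reciprocal(12345, recip10))
```

With a 32-bit shift the result matches the true quotient for 16-bit dividends and divisors; for wider inputs the shift has to grow, or a small error has to be tolerated, which is the "precise enough" trade-off mentioned in the post.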
|
06 May 2021, 09:38 | #62 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
Quote:
I mitigated that by replacing DIVU.W with NOP in the last test from previous post. EDIT: Realized now when typing this, it would be better to just comment DIVU.W out in 'div.s', so here is that test.
Code:
> timeit test_div 1 1 4000000
Running division: 1 / 1
Done with 4000000 divisions, result: 0x00000001
Elapsed: 3.00s
Last edited by modrobert; 06 May 2021 at 12:00. |
|
06 May 2021, 09:54 | #63 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Quote:
|
|
06 May 2021, 10:51 | #64 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
I checked the TG68 VHDL source code (TG68_fast.vhd), which is used as 68000 in FPGA solutions such as Minimig.
Code:
-----------------------------------------------------------------------------
-- DIVU
-----------------------------------------------------------------------------
PROCESS (clk, execOPC, opcode, OP1out, OP2out, div_reg, dummy_div_sub, div_quot, div_sign, dummy_div_over, dummy_div)
BEGIN
    set_V_Flag <= '0';
    IF rising_edge(clk) THEN
        IF clkena='1' THEN
            IF decodeOPC='1' THEN
                IF opcode(8)='1' AND reg_QB(31)='1' THEN -- Neg divisor
                    div_sign <= '1';
                    div_reg <= 0-reg_QB;
                ELSE
                    div_sign <= '0';
                    div_reg <= reg_QB;
                END IF;
            ELSIF exec_DIVU='1' THEN
                div_reg <= div_quot;
            END IF;
        END IF;
    END IF;
    dummy_div_over <= ('0'&OP1out(31 downto 16))-('0'&OP2out(15 downto 0));
    IF opcode(8)='1' AND OP2out(15) ='1' THEN
        dummy_div_sub <= (div_reg(31 downto 15))+('1'&OP2out(15 downto 0));
    ELSE
        dummy_div_sub <= (div_reg(31 downto 15))-('0'&OP2out(15 downto 0));
    END IF;
    IF (dummy_div_sub(16))='1' THEN
        div_quot(31 downto 16) <= div_reg(30 downto 15);
    ELSE
        div_quot(31 downto 16) <= dummy_div_sub(15 downto 0);
    END IF;
    div_quot(15 downto 0) <= div_reg(14 downto 0)&NOT dummy_div_sub(16);
    IF execOPC='1' AND opcode(8)='1' AND (OP2out(15) XOR div_sign)='1' THEN
        dummy_div(15 downto 0) <= 0-div_quot(15 downto 0);
    ELSE
        dummy_div(15 downto 0) <= div_quot(15 downto 0);
    END IF;
    IF div_sign='1' THEN
        dummy_div(31 downto 16) <= 0-div_quot(31 downto 16);
    ELSE
        dummy_div(31 downto 16) <= div_quot(31 downto 16);
    END IF;
    IF (opcode(8)='1' AND (OP2out(15) XOR div_sign XOR dummy_div(15))='1' AND dummy_div(15 downto 0)/=X"0000") --Overflow DIVS
    OR (opcode(8)='0' AND dummy_div_over(16)='0') THEN --Overflow DIVU
        set_V_Flag <= '1';
    END IF;
END PROCESS;
I've done some simple hardware designs in the past (both Verilog and VHDL), and modified some 8 bit state machines which somewhat resemble a CPU, but this is another level; I have never had experience designing arithmetic logic (so far).
The general idea seems to be "division bit for bit", handling the dividend, divisor and output, and using several 16 and 32 bit custom division registers (div_reg, div_quot, OP2out, etc.) that are updated from bit operations as the process continues each clock. This partially explains why the inputs rarely matter: it just chugs through each of the bits over several passes. The only exception I can find where the input is checked is when the divisor is zero (which gives a guru on the Amiga, so no speed gain there). I don't know if the 68EC020 in the A1200 uses the same DIVU design as TG68, but perhaps something similar.
Last edited by modrobert; 06 May 2021 at 14:31. |
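To make the "division bit for bit" idea concrete, here is a small Python sketch of an unsigned 32/16 shift-and-subtract divider that produces one quotient bit per step. It is only an illustration of the general technique under my own assumptions, not a translation of the TG68 process above; the name `divu_w_bitwise` is made up for this example.

```python
def divu_w_bitwise(dividend: int, divisor: int):
    """Unsigned 32/16 -> (16-bit quotient, 16-bit remainder), one bit per step.

    Returns None when the quotient would not fit in 16 bits (overflow).
    """
    if divisor == 0:
        raise ZeroDivisionError("divide by zero (a trap on the real CPU)")
    # Overflow check: the high word must be smaller than the divisor.
    if (dividend >> 16) >= divisor:
        return None
    rem = dividend >> 16       # running remainder starts as the high word
    low = dividend & 0xFFFF    # bits still to be shifted in
    quot = 0
    for _ in range(16):        # always 16 steps, whatever the input values
        rem = (rem << 1) | (low >> 15)
        low = (low << 1) & 0xFFFF
        quot <<= 1
        if rem >= divisor:     # trial subtraction succeeded
            rem -= divisor
            quot |= 1
    return quot, rem


print(divu_w_bitwise(0x7FFF8000, 0x8000))
```

The loop count is fixed, which is one way to see why such a design would take the same time for almost any pair of operands.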
06 May 2021, 20:08 | #65 | ||||||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
The 68000 shows good results, though it is behind the leaders: the VAX, IBM/370, and NS 32016.
Quote:
Quote:
Quote:
Quote:
Thank you, but that is rather a superscalar optimization. My code is too ancient for it.
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Code:
SUB   r2, r1, #10         ; keep (x-10) for later
SUB   r1, r1, r1, lsr #2
ADD   r1, r1, r1, lsr #4
ADD   r1, r1, r1, lsr #8
ADD   r1, r1, r1, lsr #16
MOV   r1, r1, lsr #3
ADD   r3, r1, r1, asl #2
SUBS  r2, r2, r3, asl #1  ; calc (x-10) - (x/10)*10
ADDMI r2, r2, #10         ; fix-up remainder
ADDPL r1, r1, #1          ; fix-up quotient
Last edited by litwr; 06 May 2021 at 20:21. |
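For readers who don't speak ARM, the routine above can be followed step by step in Python. This is a sketch that emulates the ten instructions with 32-bit wrap-around; the register names follow the listing, while the function name `div10` and the masking helper are my own assumptions.

```python
MASK32 = 0xFFFFFFFF


def div10(x: int):
    """Emulate the ARM divide-by-10 listing; returns (quotient, remainder)."""
    r1 = x & MASK32
    r2 = (r1 - 10) & MASK32             # SUB   r2, r1, #10
    r1 = (r1 - (r1 >> 2)) & MASK32      # SUB   r1, r1, r1, lsr #2   (~0.75 x)
    r1 = (r1 + (r1 >> 4)) & MASK32      # ADD   r1, r1, r1, lsr #4
    r1 = (r1 + (r1 >> 8)) & MASK32      # ADD   r1, r1, r1, lsr #8
    r1 = (r1 + (r1 >> 16)) & MASK32     # ADD   r1, r1, r1, lsr #16  (~0.8 x)
    r1 >>= 3                            # MOV   r1, r1, lsr #3       (~x/10)
    r3 = (r1 + (r1 << 2)) & MASK32      # ADD   r3, r1, r1, asl #2   (5*q)
    r2 = (r2 - (r3 << 1)) & MASK32      # SUBS  r2, r2, r3, asl #1
    if r2 & 0x80000000:                 # N flag set -> MI
        r2 = (r2 + 10) & MASK32         # ADDMI r2, r2, #10
    else:                               # N flag clear -> PL
        r1 += 1                         # ADDPL r1, r1, #1
    return r1, r2


print(div10(12345))
```

The shifts and adds build an estimate of x/10 that is either exact or one too small, and the final conditional pair corrects both quotient and remainder.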
||||||||||||||
06 May 2021, 21:25 | #66 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
And 40 bytes of code just to divide by 10. |
|
06 May 2021, 21:45 | #67 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,233
|
Quote:
Then ( x * u * 5 - x ) = (x * (2^T + 1) - x ) = x * 2^T, which is 0 mod 2^T, thus x * u = x / 5 mod 2^T. Dividing by 2 is then simple. A particular choice (though possibly not yours) is T = 18, 2^T + 1 = 262145 = 5 * 52429. Hence, to divide by 5, multiply x by 52429, and take the result mod 2^18 (easy, just masking). To divide by 10, add another right shift. Of course, that is not the only choice for T. However, that is not in general what is needed in my job because the divisors are parameters and not constants. |
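A quick way to try the T = 18 example is the Python sketch below. It uses the constants from the post; the function name `exact_div5` is hypothetical. As far as this sketch goes, the identity gives the true quotient only when x is a multiple of 5 and x/5 is below 2^T, so that restriction is an assumption of the example.

```python
T = 18
MOD = 1 << T           # 2^18 = 262144
U = (MOD + 1) // 5     # 52429, because 5 * 52429 = 2^18 + 1


def exact_div5(x: int) -> int:
    """x / 5 via multiply and mask; valid for multiples of 5 with x/5 < 2^T."""
    return (x * U) & (MOD - 1)


print(exact_div5(35), exact_div5(35) >> 1)   # 35/5, then a rough 35/10
print(exact_div5(7))                         # not a multiple of 5: not 7 // 5
```

For inputs that are not multiples of 5 the masked product is unrelated to the truncated quotient, which is why general-purpose code tends to use the scaled-reciprocal form instead.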
|
06 May 2021, 22:42 | #68 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
"Thank you but this code is outside any loops so IMHO more clear logic is better for this case."
Yes, or.w d5,d5 is outside the loops, but why did you use the Max. digits value if you wasted memory for nothing in the Amiga version? It was only an example; more unnecessary code may exist.
"It just works. However it will be good if you give me an example of better code to measure time. I have already asked saimo for this"
I would use AddIntServer and RemIntServer from exec. Or use something like this: http://eab.abime.net/showpost.php?p=552625&postcount=43
Or, better, use/adapt the time measurement routine from the c2p routine originally written by Jim Drew: http://eab.abime.net/showpost.php?p=...&postcount=235 |
07 May 2021, 00:04 | #69 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
It's been about a week since I forced myself to stop working on my new game because of serious sleep issues (in fact, I can't always think straight). But this divu thing got me intrigued, so I just couldn't help but make more tests.
I decided to check the divu instructions with broader ranges of values (always ensuring that overflow does not occur), comparing them against one another, and seeing how they behave on 68020 and 68030. The tests were made with interrupts and DMA off, and the time has been measured using the 32-bit timer from CIA A. First of all, an overview of the tests performed and of the results: Code:
 # | OPERATION        | DIVIDEND          | DIVISOR                 | ITERATIONS | TIME 68020 | TIME 68030
---+------------------+-------------------+-------------------------+------------+------------+------------
 1 | 32/16 -> 16q 16r | 2^16-1            | 1 ... 2^16-1            | 2^16-1     | 180325     | 50210
   | divu.w dx,dy     | $ffff             | 1 ... 65535             | 65535      | § 176949   | § 50210
---+------------------+-------------------+-------------------------+------------+------------+------------
 2 | 32/16 -> 16q 16r | (2^16-1) * 2^15   | 2^15 ... 2^16-1         | 2^15       | 88578      | 25106
   | divu.w dx,dy     | $7fff8000         | 32768 ... 65535         | 32768      | § 90166    | § 25106
---+------------------+-------------------+-------------------------+------------+------------+------------
 3 | 32/32 -> 32q     | 2^32-1            | 1 ... 2^20              | 2^20       | 4718597    | 1338906
   | divu.l dx,dy     | $ffffffff         | 1 ... 1048576           | 1048576    | § 4875883  | § 1368658
---+------------------+-------------------+-------------------------+------------+------------+------------
 4 | 32/32 -> 32q 32r | 2^32-1            | 1 ... 2^20              | 2^20       | 4718597    | 1338906
   | divul.l dx,dy:dz | $ffffffff         | 1 ... 1048576           | 1048576    | § 4771025  | § 1338905
---+------------------+-------------------+-------------------------+------------+------------+------------
 5 | 64/32 -> 32q 32r | 2^32-1            | 1 ... 2^16-1            | 2^16-1     | 304742     | 85541
   | divu.l dx,dy:dz  | $ffffffff         | 1 ... 65535             | 65535      | § 301465   | § 85542
---+------------------+-------------------+-------------------------+------------+------------+------------
 6 | 64/32 -> 32q 32r | (2^32-1) * 2^31   | 2^32-2^20 ... 2^32-1    | 2^20       | 4823456    | 1368660
   | divu.l dx,dy:dz  | $7fffffff80000000 | $fff00000 ... $ffffffff | 1048576    | § 4875884  | § 1368659
The times are expressed in CIA clocks.
The times are relative to the whole core loops, not just to the divu instructions.
§ = time when the code alignment was altered with an nop before the core loop.
* the 68020 is more sensitive than the 68030 to alignment;
* 32- and 64-bit divus seem to perform very similarly;
* 16-bit divisions seem faster (as one would expect);
* remainders do not seem to impact performance (as one would expect);
* input data does not seem to affect performance (as the MC68020UM implies and contrary to what the MC68030UM says).
Now, given that the number of iterations differ, to be able to compare the times, let's see how long a single iteration took on average (best times only):
Code:
 # | OPERATION        | ITERATIONS | TIME 020 | TIME 030 | AVERAGE 68020  | AVERAGE 68030
---+------------------+------------+----------+----------+----------------+----------------
 1 | divu.w dx,dy     | 65535      | 176949   | 50210    | 2.700068665599 | 0.766155489433
---+------------------+------------+----------+----------+----------------+----------------
 2 | divu.w dx,dy     | 32768      | 88578    | 25106    | 2.703186035156 | 0.766174316406
---+------------------+------------+----------+----------+----------------+----------------
 3 | divu.l dx,dy     | 1048576    | 4718597  | 1338906  | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 4 | divul.l dx,dy:dz | 1048576    | 4718597  | 1338906  | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 5 | divu.l dx,dy:dz  | 65535      | 301465   | 85541    | 4.600061036088 | 1.305271992065
---+------------------+------------+----------+----------+----------------+----------------
 6 | divu.l dx,dy:dz  | 1048576    | 4823456  | 1368659  | 4.600006103516 | 1.305254936218
* 16-bit divus are faster;
* 32-bit divus perform equally regardless of the remainder;
* 64-bit divus perform equally regardless of the remainder.
It also seems to indicate that 64-bit divus are slower than 32-bit divus - but that isn't the case, because the difference is due to the extra code in the core loops. So, before proceeding further, let's look at the core loops.
Code:
TEST #1
     move.l  #$ffff,d2       ;$0000ffff
     moveq.l #1,d7           ;i = 1
.l   move.l  d2,d0           ;$0000ffff
     divu.w  d7,d0           ;$0000ffff/i
     addq.w  #1,d7           ;++i
     bne.b   .l
TEST #2
     move.l  #$7fff8000,d2   ;$ffff*2^15
     move.w  #$8000,d7       ;i = 2^15
.l   move.l  d2,d0           ;$ffff*2^15
     divu.w  d7,d0           ;($ffff*2^15)/i
     addq.w  #1,d7           ;++i
     bne.b   .l
TEST #3
     moveq.l #-1,d2          ;$ffffffff
     move.l  #$100000,d7     ;i = 2^20
.l   move.l  d2,d0           ;$ffffffff
     divu.l  d7,d0           ;($ffffffff)/i
     subq.l  #1,d7           ;--i
     bne.b   .l
TEST #4
     moveq.l #-1,d2          ;$ffffffff
     move.l  #$100000,d7     ;i = 2^20
.l   move.l  d2,d0           ;$ffffffff
     divul.l d7,d1:d0        ;($ffffffff)/i
     subq.l  #1,d7           ;--i
     bne.b   .l
TEST #5
     moveq.l #-1,d2          ;$ffffffff
     moveq.l #1,d7           ;i = 1
.l   clr.l   d1              ;0
     move.l  d2,d0           ;$ffffffff
     divu.l  d7,d1:d0        ;($00000000ffffffff)/i
     addq.w  #1,d7           ;++i
     bne.b   .l
TEST #6
     move.l  #$7fffffff,d3   ;$7fffffff
     move.l  #$80000000,d2   ;$80000000
     move.l  #$fff00000,d7   ;i = 2^32-2^20
.l   move.l  d3,d1           ;$7fffffff
     move.l  d2,d0           ;$80000000
     divu.l  d7,d1:d0        ;((2^32-1)*2^31)/i
     addq.l  #1,d7           ;++i
     bne.b   .l
Code:
     moveq.l #-1,d2          ;$ffffffff
     move.l  #$100000,d7     ;i = 2^20
.l   move.l  d4,d5           ;dummy operation
     move.l  d2,d0           ;$ffffffff
     divu.l  d7,d0           ;($ffffffff)/i
     subq.l  #1,d7           ;--i
     bne.b   .l
Now let's compare the performance of the two CPUs:
Code:
 # | OPERATION        | TIME 020 | TIME 030 | TIME 020 / TIME 030
---+------------------+----------+----------+---------------------
 1 | divu.w dx,dy     | 176949   | 50210    | 3.524178450508
---+------------------+----------+----------+---------------------
 2 | divu.w dx,dy     | 88578    | 25106    | 3.52816059906
---+------------------+----------+----------+---------------------
 3 | divu.l dx,dy     | 4718597  | 1338906  | 3.5242182797
---+------------------+----------+----------+---------------------
 4 | divul.l dx,dy:dz | 4718597  | 1338906  | 3.5242182797
---+------------------+----------+----------+---------------------
 5 | divu.l dx,dy:dz  | 301465   | 85541    | 3.524216457605
---+------------------+----------+----------+---------------------
 6 | divu.l dx,dy:dz  | 4823456  | 1368659  | 3.524220423056
This also means that the divu implementations are the same on both CPUs. In fact, their user's manuals indicate the very same timings (for the cache-case, that is, but the other cases are almost identical too). I guess that where the MC68030UM says that the actual timing depends on the input data, it refers to divisions by 0 and overflows.
Finally, one last question: are the results reliable? Let's look at the test #1 code and at its cycles on the 68020, and let's compare them with the result obtained experimentally.
Code:
     move.b  #$41,$bfee01    ;(reload timer and start it)
     move.l  #$ffff,d2       ;5w 6c (yes, the cache-case is said to be worse than the worst-case)
     moveq.l #1,d7           ;3w 2c
.l   move.l  d2,d0           ;3w 2c
     divu.w  d7,d0           ;44w 44c
     addq.w  #1,d7           ;3w 2c
     bne.b   .l              ;9w 6c 4cl
     clr.b   $bfee01         ;6w+5w 4c+4c (stop timer)
w = worst-case
c = cache-case
l = last iteration
Code:
longword-aligned code case:
     move.l  #$ffff,d2       ;5
     moveq.l #1,d7           ;2 (because it has been prefetched with the previous instruction)
.l   move.l  d2,d0           ;3
     divu.w  d7,d0           ;44
     addq.w  #1,d7           ;3
     bne.b   .l              ;7 (because the opcode has been prefetched, so only the offset needs an additional read)
word-aligned code case:
     move.l  #$ffff,d2       ;4 (because the opcode has been prefetched with the previous instruction)
     moveq.l #1,d7           ;3
.l   move.l  d2,d0           ;2 (because it has been prefetched with the previous instruction)
     divu.w  d7,d0           ;44
     addq.w  #1,d7           ;2 (because it has been prefetched with the previous instruction)
     bne.b   .l              ;9
The last iteration takes 2 cycles less as the branch is not taken, i.e. 52 cycles.
Finally, the write to the CIA has to be evaluated taking into account that its timing has a base time (6w 4c) plus the calculate effective address time (5w 4c). Depending on the code alignment, the opcode might already be in the cache thanks to the 32-bit fetch for bne, so, in that case, the time is 4c+5w = 9 cycles; otherwise, the time is 6w+5w = 11 cycles.
Therefore, theoretically, the whole execution takes 64+54*65533+52+9 = 3538907 or 64+54*65533+52+11 = 3538909 cycles.
The CIA runs at 0.709379 MHz, i.e. 1/20th of the CPU speed, so the elapsed time in CPU cycles is 176949*20 = 3538980.
The difference between the actual time and the theoretical time is thus 3538980-3538907 = 73 or 3538980-3538909 = 71 cycles.
I guess that it can be explained as follows:
* the instructions are fetched from CHIP RAM, which is slower than what the MC68020UM assumes (its timings are relative to a 0-wait-state RAM); namely, the CHIP RAM runs at 1/4 of the CPU frequency; given that when the instructions are not in the cache theoretically 64-54+5 = 15 or 64-54+7 = 17 cycles more for RAM accesses are needed, execution takes actually 15*4 = 60 or 17*4 = 68 cycles longer;
* the access to the CIA is slow due to the slower frequency of the chip (1/20 of the CPU frequency), so, in the worst case, clr could actually take 19 cycles longer.
Even if I made some mistake, even without considering the performance penalty factors, 73 out of 3538980 cycles represent a 0.002063% error, so I'd say that the tests are reliable. That's also supported by the fact that I ran the tests multiple times, always obtaining the same results.
EDIT START
I just had to see how test #1 performs with minimized overhead, so I rewrote it like this:
Code:
     move.l  #$ffff,d2
     moveq.l #1,d7
     lea.l   $bfee01,a5
     move.b  #$41,(a5)       ;(reload timer and start it)
.l   move.l  d2,d0           ;3w 2c
     divu.w  d7,d0           ;44w 44c
     addq.w  #1,d7           ;3w 2c
     bne.b   .l              ;9w 6c 4cl
     clr.b   (a5)            ;6w+2w 4c+2c (stop timer)
w = worst-case
c = cache-case
l = last iteration
* without nop before the code: 176946;
* with nop before the code: 180323.
(Side note: this means that, in the best case, the cost of the overhead of the previous version of the code was (176949-176946)*20 = 60 cycles.)
When the code is not cached, the theoretical cycles are:
Code:
longword-aligned code case:
.l   move.l  d2,d0           ;3
     divu.w  d7,d0           ;44
     addq.w  #1,d7           ;3
     bne.b   .l              ;7 (because the opcode has been prefetched, so only the offset needs an additional read)
     clr.b   (a5)            ;4+2 = 6 (because it has been prefetched with the previous instruction)
word-aligned code case:
.l   move.l  d2,d0           ;2 (because it has been prefetched with the previous instruction)
     divu.w  d7,d0           ;44
     addq.w  #1,d7           ;2 (because it has been prefetched with the previous instruction)
     bne.b   .l              ;9
     clr.b   (a5)            ;6+2 = 8
* first iteration: 57 cycles;
* next iterations: 54 cycles;
* last iteration: 52 cycles;
* final write: 6 cycles;
* total: 57+54*65533+52+6 = 3538897 cycles
The difference between actual time and theoretical time is 176946*20-3538897 = 23 cycles. By looking at the timings above, we see that the additional cycles to fetch the instructions from RAM are 57-54 = 3. Due to the CHIP RAM slowness, those actually amount to 3*4 = 12 cycles. The remaining 23-12 = 11 cycles should be due to the access to the CIA.
To be honest, I didn't write down on paper a timing chart of the CPU activity (and, even if I tried, there is no documentation that explains how to do that exactly), so calculations might be a bit off here and there, but still the closeness of the figures and the 100% stable test results prove even better that the measured times are accurate enough for the purpose of evaluating the divu operations performance.
EDIT END
Conclusions? Don't get involved in threads that might steal an enormous amount of your time for stuff you'll never have a use for anyway.
@litwr I'll get back to you tomorrow ASAP (edit: couldn't sleep). Now I should really try to get some sleep.
Last edited by saimo; 08 May 2021 at 00:34. |
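The cycle bookkeeping in the post above is easy to re-check with a few lines of Python. This is only a sketch that redoes the arithmetic with the figures quoted in the post (cycles per iteration, 65535 iterations, 20 CPU clocks per CIA tick); the function names are made up for the example.

```python
CIA_RATIO = 20        # CPU clocks per CIA tick, as stated in the post
ITERATIONS = 65535    # loop count of test #1


def theoretical_cycles(first, middle, last, final_write, iterations=ITERATIONS):
    """First and last iterations are counted apart from the cached middle ones."""
    return first + middle * (iterations - 2) + last + final_write


def measured_cycles(cia_ticks):
    """Convert a CIA timer reading to CPU cycles."""
    return cia_ticks * CIA_RATIO


original = theoretical_cycles(64, 54, 52, 9)     # first version of the test
rewritten = theoretical_cycles(57, 54, 52, 6)    # version from the EDIT
print(measured_cycles(176949) - original)        # gap for the first version
print(measured_cycles(176946) - rewritten)       # gap for the rewritten loop
print(100.0 * (measured_cycles(176949) - original) / measured_cycles(176949))
```

The two gaps come out as the 73 and 23 cycles quoted in the post, and the last line gives the relative error in percent.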
07 May 2021, 06:09 | #70 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
I find your results interesting, and the reward for doing this is learning more; the very foundation of every future decision. Basic research vs being productive, you need both to stay sane.
|
08 May 2021, 09:36 | #71 | |||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
BTW I removed that OR; it is a tiny and insignificant improvement, but still an improvement! Thank you very much. Thank you. But you pointed out earlier that I wasted 2 bytes.
Quote:
Quote:
Quote:
BTW would you like please to run pi-amiga, pi-amiga1200, pi-amigax on your Blizzard 1230-IV @50 MHz for me (100, 1000, 3000 digits)? The archive is attached. Could you run pi-amiga, pi-amiga1200, pi-amigax (from pi-amiga-11-beta.zip) for 3000 digits? It can help to find more details about the 68020. BTW I have made several tests with my old 80386 board. I tested the following instruction sequence:
Code:
.loop: nop
...
jmp .loop
Last edited by BippyM; 01 June 2021 at 18:24. |
|||||
08 May 2021, 10:40 | #72 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
Quote:
Why you used Max Digits value, if your Amiga code is inefficiency? For Atari you have 9288 digits, for Amiga only 9252 digits. And you wasted much more than 2 bytes. Why you allocated/freeing memory for this routine? When BSS section is enough?
Section BSS,Digits
ds.b $10000-(endmark-start)
The best option is using Code_BSS, then only one memory area will be used and code can be fully PC relative. If you want to reach good number of digits, your code must be fully PC relative. |
|
08 May 2021, 12:39 | #73 | |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Code:
msg2 dc.b 10
 bss
mydata blk.b 60000 ;it is a separate section, this makes a file larger :(
Indeed, even a separate BSS section gives some gain. Maybe I will make this tiny improvement (which gives us fewer than 10 digits) sometime. Of course I remember your super tricky way of using the stack, but I would like to use more conventional coding. And, you know, the main goal of my project is maximum speed; size optimization is a secondary and unimportant goal there. Sorry, some of your remarks are still cryptic to me. Would you please clarify some of your phrases?
"Why using get VBR? Because you can learn something new about Amiga." - Sorry, I completely missed your idea here.
"Why you used Max Digits value, if your Amiga code is inefficiency?" - I still don't understand what is wrong with the Max Digits value. You know, the Amiga OS doesn't have a symbol input function, so the getnum function is not the same on the Amiga and the Atari ST. You can notice that TOS allows us to simply reuse the memory allocated for this function for the main array later. I don't know how to use the same approach under WB. IMHO it is quite pc-relative now. The first version of the pi-spigot for the Amiga was less pc-relative. It was fixed quite long ago, thanks to the EAB experts. |
|
08 May 2021, 13:31 | #74 | ||||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
And, anyway, I'm confident it runs a bit faster on all 68020+ CPUs - if I get a chance, I'll make a test later myself. Quote:
To make reliable tests on an unexpanded Amiga, you should also turn DMA off (because DMA affects the access of the CPU to the RAM), but a proper takeover and restore code isn't trivial (and still might cause issues on some machines). Anyway, exclusively for internal testing and to quickly make the tests, you could use move.w #$4000,$dff096 before the test loop and move.w #$c000,$dff096 after it. Please note that this is extremely brute code, so, it's best to run the tests on a minimal environment, i.e. after booting the machine without startup-sequence. Regarding the timer code, for maximum precision, hardware (CIA timers) should be used directly - so, again, a proper startup and restore code is needed. In absence of that, it's best to use the OS functions. I see that Don Adan already provided you with a couple of pointers. Quote:
For convenience, only because this thread became quite complicated, here's the code again: Code:
     move.l  #$ffff,d3
     ...
     bra.b   .l4
.longdiv
     divul.l d4,d7:d6
     move.w  d7,(a3)
     subq.l  #2,d4
     bcs.b   .enddiv
.l2  sub.l   d6,d5
     sub.l   d7,d5
.l4  move.w  -(a3),d0
     lsr.l   #1,d5
     mulu.w  d1,d0
     add.l   d0,d5
     move.l  d5,d6
     divu.w  d4,d6
     bvs.b   .longdiv
     move.w  d6,d7
     swap.w  d6
     and.l   d3,d7
     move.w  d6,(a3)
     and.l   d3,d6
     subq.l  #2,d4
     bcc.b   .l2
.enddiv
Quote:
Quote:
Quote:
EDIT
Results on 68020 (3000 digits):
* pi-amiga: 37.32
* pi-amiga1200 and pi-amigax: 36.94
Results on 68030: all .00, regardless of the number of digits.
Last edited by saimo; 09 May 2021 at 11:20. |
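The control flow of the loop in the post above (try the short divu.w first, and only take the divul.l path when the overflow flag is set) can be sketched in Python as follows. This models just the fast-path/fallback decision, under the assumption that the divisor fits in 16 bits; the function names are hypothetical and the rest of the spigot loop is left out.

```python
def try_divu_w(dividend: int, divisor: int):
    """Model of divu.w: 32/16 -> 16q 16r, or None when V would be set."""
    quotient, remainder = divmod(dividend & 0xFFFFFFFF, divisor)
    if quotient > 0xFFFF:
        return None                      # overflow: operands left untouched
    return quotient, remainder


def divul_l(dividend: int, divisor: int):
    """Model of divul.l: 32/32 -> 32q 32r."""
    return divmod(dividend & 0xFFFFFFFF, divisor)


def divide_with_fallback(dividend: int, divisor: int):
    """Return (quotient, remainder, path taken)."""
    fast = try_divu_w(dividend, divisor)
    if fast is not None:
        return fast[0], fast[1], "divu.w"
    quotient, remainder = divul_l(dividend, divisor)    # bvs.b .longdiv
    return quotient, remainder, "divul.l"


print(divide_with_fallback(100, 7))
print(divide_with_fallback(0x30000, 2))
```

The point of the arrangement is that the longer instruction is only paid for on the iterations where the 16-bit quotient does not fit.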
||||||
08 May 2021, 13:44 | #75 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Well, the problem is that all that work didn't really tell us much: we already knew that the timings are basically the same for both the 68020 and 68030 (as per the official manuals) and that the input value doesn't make a difference on 68020 (as per its manual). It was therefore likely that also on the 68030 the speed didn't depend on the input value, even if the manual says "Indicates Maximum Time (Acutal time is data dependent)" (typo not mine), so basically the tests simply confirmed just that. Quite little... but it's been fun!
|
08 May 2021, 15:40 | #76 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
Quote:
The stack version was Ross's idea; you can use the normal version with a Code_BSS section. As for the VBR, you must learn/remember that the VBR is not always at address $0 on 68010+ CPUs. In that case your code will not work or may crash. And yes, your code can be optimised/shortened for much more than only 10 digits. You used:
maxn dc.w 0
move.l d0,maxn
You know that this is buggy? Of course, if you want, you can overwrite the "dos.library" name; that was originally meynaf's idea for the shortest code. Anyway, the easiest and shortest fix is replacing maxn with the D7 register.
move.l d0,d7
move.l d7,d5
cmp.w d7,d5
Is this used for something?
msg6 dc.b 'no fast memory',10
For me, that is only wasted memory. |
|
08 May 2021, 16:11 | #77 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
Quote:
Seriously though, I think optimizing is so much fun. Besides being a lost art, it's such a challenge on classic hardware and useful knowledge in all types of programming. Know the hardware, know the software, now make it fast.
EDIT: litwr,
Code:
> pi-amiga
number pi calculator v11 (beta)(68000)
number of digits (up to 9248)? 3000
314159...
35.30
Code:
> pi-amiga1200
number pi calculator v11 (beta)(68020)
number of digits (up to 9248)? 3000
314159...
34.70
Code:
> pi-amigax
number pi calculator v11 (Beta, SuperScalar)(68020)
number of digits (up to 9244)? 3000
314159...
34.68
Last edited by modrobert; 08 May 2021 at 16:46. |
|
08 May 2021, 19:56 | #78 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
BTW. Once you make a one-hunk (code and bss) version, you can easily beat the Atari Max Digits value.
BTW2. For good, but unfair code you can use all 64 KB for Max Digits. |
09 May 2021, 13:58 | #79 | |||||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
It seems your 68030 system relocates the interrupt vector table. It is the first time I have met such a thing. Thanks to Don_Adan, I have just made new code which uses AddIntServer/RemIntServer instead of working directly with the interrupt vector. Could you please rerun the new code (it is attached) on your 68030 hardware for me?
Quote:
Now I just use a BSS section, but all the gain from it and your other optimizations was eaten by the new AddIntServer/RemIntServer code. It is also sad that VASM doesn't allow us to use
Code:
ds.b 65536-endmark+start
Quote:
Quote:
Quote:
Thank you I already fixed it a bit earlier. Quote:
Quote:
Please be less cryptic. Would you please provide us with details?
Last edited by BippyM; 01 June 2021 at 18:24. |
|||||||||||||
09 May 2021, 16:31 | #80 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
I don't know what is necessary (an option or naming) for Vasm to create a Code_BSS section automatically.
But perhaps you can find some info here: http://eab.abime.net/showthread.php?t=97310
I always did this manually. Assemble your source as a one-section program with
ds.b $10000-(endmark-start)
at the end of the source. Then manually cut/remove the empty bytes created by the ds.b $10000-(endmark-start) part, and manually edit one longword in the Amiga exe header. You can find which one by comparing the normal code version and the code_bss version of meynaf's pi routine from the old (pi?) thread. For your routine specifically, you must replace the second $00004000 value in the Amiga exe file with the (endmark-start+3)/4 value. All sizes of pure sections in an Amiga exe are stored as size/4 (longwords). So a 16-byte section is stored as $00000004, 1024 bytes as $00000100, and 65536 bytes as $00004000. Of course, target sections like Code_C(hip), Code_F(ast), Data_C(hip) etc. set some bits in the stored longword too, but this is not your case. |
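The size/4 rule from the post above can be checked with a tiny Python helper. This is just a sketch of the rounding formula (endmark-start+3)/4 with the three examples given in the post; the function names are invented for the illustration, and the memory-type bits mentioned at the end of the post are deliberately left out.

```python
def hunk_size_longwords(size_bytes: int) -> int:
    """Section size as stored in the header: bytes rounded up to longwords."""
    return (size_bytes + 3) // 4


def as_header_value(size_bytes: int) -> str:
    """Format the stored longword the way the post writes it."""
    return "$%08x" % hunk_size_longwords(size_bytes)


for size in (16, 1024, 65536):
    print(size, as_header_value(size))
```

The +3 makes the integer division round up, so a code part whose length is not a multiple of 4 still gets a whole number of longwords.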