17 November 2018, 08:57 | #801 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
17 November 2018, 10:49 | #802 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
By now I was interested in what was going on, so I wrote a small test program (using the CIA timer, OS disabled, and both forced-chip and forced-fast code) and ran it in WinUAE's cycle-exact mode. It did indeed show a difference on shifts even with 0 bitplanes, but it was only about 0.9%.
I'll expand the program to produce readable output and retry on actual hardware, but I'm honestly not expecting the results to change much - if WinUAE were off by that much, a bunch of A500 software would fail to run reliably. This is rather off-topic though; perhaps a new thread would be better? Last edited by roondar; 17 November 2018 at 11:04.
17 November 2018, 12:00 | #803 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Remember that you can disable all DMA but not the memory refresh cycles for chip/bogo RAM, so shift instructions whose cycle counts are not a multiple of four can be delayed. EDIT: Slowdown estimate for a PAL Amiga in chip RAM: 100/227 * (4/2) = 0.88%, so 0.9% seems a valid result. Last edited by ross; 17 November 2018 at 12:11. Reason: Added estimation
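The estimate above can be reproduced in a few lines of Python (the variable names are mine; the 227 slots per line and 4 refresh slots are taken from the formula as given):

```python
# Slowdown estimate for CPU code running from chip RAM on a PAL Amiga:
# 4 refresh DMA slots out of 227 per scanline, with the 4/2 term reflecting
# that only every other slot can collide with a 68000 bus access.
slots_per_line = 227       # PAL DMA slots per scanline
refresh_slots = 4          # refresh slots per scanline
estimate = 100.0 / slots_per_line * (refresh_slots / 2)
print(f"{estimate:.2f}%")  # 0.88% - close to the measured ~0.9%
```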
17 November 2018, 12:59 | #804 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
Wouldn't fast RAM also need refresh cycles? Maybe UAE simply doesn't emulate those, as there probably isn't any software relying on specific fast RAM timing (which could also vary a lot between implementations - take e.g. the slow PCMCIA SRAM...).
17 November 2018, 13:17 | #805 | ||
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Perhaps in a good implementation (with cells fast enough) delays can be negligible. Quote:
17 November 2018, 13:29 | #806 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Right, I've now run my test program on an actual 68000-powered Amiga with Fast RAM and added a screenshot with the results.
Results are identical to the emulated Amiga: 0 bitplanes slows down shifts by about 0.9%. For those interested, I've attached the executable so you can run your own test. Note that it does contain a 'minor' bug: after running, it's possible the keyboard stops working. This is probably due to me mishandling the restoration of the CIA registers. That said, the timer is accurate, so the important bit does work.
17 November 2018, 13:31 | #807 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Roondar: Very interesting result! What kind of memory expansion do you use? |
17 November 2018, 13:52 | #808 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
I'm actually still looking for an affordable way to add sideslot Fast RAM (without a faster CPU) to the A500 for other testing purposes. Though seeing that these results match WinUAE, I guess it's less of a necessity now - I should be able to use the A600 instead. For reference, my program runs 30000 shifts (asl.w #2,d0) and times the result using the CIA. These should take exactly 10 cycles each, but turn out to take 0.9% more when run from chip RAM. The result is given in CIA cycles, where one CIA cycle is 10 CPU cycles as the CIA runs at 1/10th of the CPU frequency.
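A quick sanity check on the arithmetic (the figures are from the post above; the script itself is mine, not the actual test program):

```python
# 30000 asl.w #2,d0 instructions at 10 CPU cycles each, timed by a CIA
# timer that ticks once every 10 CPU cycles (the E clock on a stock 68000 Amiga).
shifts = 30000
cycles_per_shift = 10          # asl.w #2,d0 on a 68000
cpu_cycles_per_cia_tick = 10

expected_ticks = shifts * cycles_per_shift // cpu_cycles_per_cia_tick
print(expected_ticks)          # 30000 CIA ticks expected for an uncontended run
```

Any measured count above this baseline is contention (refresh, bitplane DMA, etc.) expressed directly in CIA ticks.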
17 November 2018, 16:55 | #809 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Hmm.. maybe the 20% increase was with 4 planes active. I did see a difference between running the code from fast RAM and from chip. I'll have a dig about my 1k drive and see if I can find it later this week.
I was using raster timing.
17 November 2018, 19:52 | #810 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Still, it's nice seeing confirmation that most of them are taking 6 cycles. That means there's no 33% speed penalty for a long run of shift instructions. Have you tested any other sequences that complete in a number of cycles not divisible by four? With the blitter running, it's conceivable that some actions might be faster overall using instructions with lots of idle cycles. A series of shifts and adds might take fewer CPU cycles than an equivalent MUL, but the MUL is going to leave lots of gaps for other DMA.
17 November 2018, 19:56 | #811 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Well, in for a penny, in for a pound.
I altered my program to display an empty 320x256 bitmap in either 4, 5 or 6 bitplanes (not selectable, I just manually change the number of active bitplanes in the copper list and reassemble). Here are the results:

Code:
4 bitplanes (chip) = 33172
4 bitplanes (fast) = 30001
-> about 11% slower for shifts & other instructions taking 10 cycles

5 bitplanes (chip) = 38473
5 bitplanes (fast) = 30001
-> about 28% slower for shifts & other instructions taking 10 cycles

6 bitplanes (chip) = 39136
6 bitplanes (fast) = 30002
-> about 30% slower for shifts & other instructions taking 10 cycles
Note that 5 & 6 bitplanes really are that close in repeated testing. All numbers fluctuate a bit when the test is repeated once bitplane DMA is introduced, but the difference is small - no more than about 300-500 CIA cycles.

In realistic code, quite a few instructions have cycle counts divisible by 4 and thus won't get slowed down at all, and many of the instructions whose counts aren't divisible by 4 take longer than 10 cycles and are therefore affected much less by bitplane contention (especially on 5 and 6 bitplane screens).

As an interesting side note, code is not always affected by bitplane fetches as much as might be expected. A naive calculation might conclude that 5 bitplanes slow the CPU down by 25% and 6 bitplanes by 50%. This turns out to be untrue, both because the 68000 can keep running internally during idle memory cycles and because bitplane DMA only occurs during the part of the frame where the raster is actually drawing.

I've also attached the executables for these three tests so you can test for yourself. Note that the test only starts after pressing the left mouse button once the display is cleared, and that only 4 bitplane pointers are set in the copper list, so you may get a strange display in the 5 and 6 bitplane tests - this is pure laziness on my part; the garbage on screen does not actually impact the results, so I left it as is. In all other ways the tests work the same as the previous one.

----

Quote:
Running the blitter while doing MULUs etc. is indeed a viable tactic: as you say, a MULU takes a long time, but the blitter gets to run during its idle cycles, which makes up for most of that cost. However, running significant amounts of code while blitting can be tricky without using Copper blitting or interrupt-based blitting. Last edited by roondar; 17 November 2018 at 20:04.
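The percentages quoted in the Code block above follow directly from the chip vs. fast CIA counts; a quick cross-check (the script is mine, only the counts are from the post):

```python
# Measured CIA counts from the post above: (chip RAM, fast RAM) per bitplane count.
results = {4: (33172, 30001), 5: (38473, 30001), 6: (39136, 30002)}

for planes, (chip, fast) in results.items():
    slowdown = (chip - fast) / fast * 100  # extra time for 10-cycle instructions
    print(f"{planes} bitplanes: {slowdown:.1f}% slower")
# 4 bitplanes: 10.6% slower
# 5 bitplanes: 28.2% slower
# 6 bitplanes: 30.4% slower
```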
17 November 2018, 20:01 | #812 | |||||||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Where is your logical proof for this? Mine is easy, I can repeat it: you are too emotional about x86 - this means a strong passion. BTW I feel quite comfortable knowing that the 68000 can sometimes be more than 100% faster than the 8088 at the same frequency or, thanks to you, that 68000 code density is often better than the 8086's.
Quote:
It is easy to shorten my code by several bytes. The discussion in this thread has clearly shown that segmentation gives us some advantages, e.g., the headerless format. Indeed, its 8086 implementation also has disadvantages. Anyway, we have a strong fact: 168 < 236. It would be more interesting to compare larger programs where an algorithm needs to use a lot of memory, but it would be too difficult for me to participate properly in such a contest. A shorter OS call is a well-known advantage of x86: DOS and Linux-x86 use INT for those calls. Quote:
Quote:
Code:
.l3: mov ax,cx               2  2  2  1  3
     shl ax,1 / add ax,ax    2  2  2  1  4
     cmp ax,di               3  2  2  1  5
     jl .l5                 17 11 10  4
Quote:
Quote:
Noticeable use was found only for its cheaper version, the 68LC040, which does not have a built-in coprocessor. However, the first versions of this chip had a serious hardware defect which did not even allow software emulation of the coprocessor! Motorola always had problems with mathematical coprocessors. Motorola never released such a coprocessor for the 68000/68010, while Intel had been shipping its very successful 8087 since 1980. But to get a significant performance boost, code for the 68882 needs to be compiled differently than for the 68881. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
It is interesting that the 68020/030 can execute up to three instructions simultaneously while the Pentium could only execute two, yet the Pentium is much faster per MHz. Why?! Quote:
Quote:
Quote:
Quote:
EDIT. My tests http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html show that an A1200 with fast RAM is only 9% faster than without fast RAM... Last edited by litwr; 17 November 2018 at 20:54.
17 November 2018, 21:02 | #813 | ||||||||||||||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
The point about 'fast enough' is interesting and mostly true from 1992 onwards, but I do have to point out I have never claimed that the 68040 was faster than the 486DX/2, nor have I claimed the 68040 outsold it. All I have claimed (correctly) is that it wasn't the failure you've made it out to be and was used in a fair number of systems. Quote:
The 68882 vs 68881 stuff is interesting, but still misleading as Intel FPUs had similar issues: floating point code optimised for 80387 performs quite a bit better than code optimised for 8087/80287 when running on the 80387 and up. Moreover, while the 68882 does perform better when software is recompiled for it, it still manages to be competitive with the 80387 even when software isn't recompiled. Likewise, claiming Motorola had buggy FPU's as a point where Intel allegedly does better is hilarious given the Pentium FDIV bug. Quote:
Your reply is especially interesting as the 486, while fast for an Intel chip, was not in fact all that fast compared to 'dedicated high performance' CPUs on release. Not only were direct competitors such as the 68040 actually faster, but a number of other CPUs were as well. So much for those 'talented people' - they couldn't even beat the guys at Motorola, who you claim had no such talent. Quote:
No fantasy there, just facts. Quote:
Quote:
To add another example, even the pi-spigot benchmark you have linked shows that the difference is smaller than 10x. In it, the 8MHz Acorn 440/1 is roughly 3.7x the speed of an A500 and the 12MHz Acorn 3020 is roughly 5.3x the speed of an A500. Quote:
All I really showed with my example was that the A1200 does indeed perform better than you claimed it does and I stand by that. Case in point, Super Stardust was running quite well on an A1200 and won't run at all on the cheap 386's you pointed out. Quote:
More importantly, it was pretty much impossible to use a PC in any real sense without a hard drive by 1992. The A1200, while benefiting enormously from an added hard drive is actually still useful without one. Lastly, note that a 'cheap' SVGA card in 1992 cost between $100 and $200 and required a monitor which was more expensive than a standard VGA monitor due to the higher resolutions supported. Add in a sound card for parity with the A1200 (which also cost around $100) and that's $300 gone before the mainboard/fdd/ram/case is even added. In short, I humbly ask if you have some facts (such as adverts etc) to back up your opinion about $300 PC's? Quote:
Quote:
Quote:
They just made it compatible with the earlier model and gave it a similar name. Not that this is a bad thing, but they did basically add a new ISA to the 8086 and call it the 80286/386. They did it again later (Pentium MMX or Pentium 2 IIRC) when they stopped running x86 code natively altogether. And again when they transitioned to x64. Quote:
Quote:
Quote:
It also shows quite clearly why you can't just use a tiny bit of code to accurately measure performance etc. - it just doesn't work well, especially once cache comes into play. Last edited by roondar; 18 November 2018 at 03:26. Reason: Looked up 386 pricing & clarified a few things
17 November 2018, 21:04 | #814 | |||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
The logical proof for this is the amount of energy you spend in false arguments in vain attempts to prove your point about code density (and on ONE example, btw).
Another proof is easy to find in your articles - they are indeed VERY emotional. Quote:
And for me being too emotional about x86, well, you haven't asked me what i think about various other architectures... Quote:
In fact x86 is so bad that even Intel themselves tried to get rid of it ! And ARM situation isn't better. There are very different instruction sets (original arm, thumb, thumb-2, arm-64) so it's obvious it's inadequate in some situations. Any code can be shortened by removing features... Quote:
Quote:
Oh, and i haven't counted the memory allocation, by the way, so it's even harder. That's probably another 20 bytes gain. Quote:
But I will use your logic. You said before that it's impossible to beat gcc. Do you still hold this claim ? If so, you can perfectly make programs of any size. (The claim is wrong, obviously, but it's you who are supposed to believe in it, so use your "knowledge".) And don't tell me writing some C code is gonna be too difficult. You said C was beautiful. (And again i didn't agree, but you have to behave as if it were true, or at least admit it wasn't.) Quote:
The Atari ST also has short OS calls, so it's NOT an advantage of x86. Same for MacOS 68k. INT isn't shorter than TRAP or Line-A. When will you stop writing spurious things? Quote:
First, while this will work if you ask for very little memory, at some point there is a risk there won't be enough of it - and then, crash'n'burn. Dirty code. Second, you are once again removing OS code and attributing the gain to x86.
17 November 2018, 22:45 | #815 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Phil, could you post the source of your Pi version? I want to see how it looks. Perhaps by using TaggedOpenLibrary for dos.library and Code_BSS (no need to alloc/free memory), your code will be shorter than the PC .COM version.
18 November 2018, 02:51 | #816 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,547
Quote:
Sticking to standard AmigaOS 3.0, I managed to trim off a few bytes using FindName on Exec's LibList (so no need to close the dos library or cache the dos base). I also used 'code_bss' to avoid having to allocate memory (if we must have a header then why not make full use of it?). This got the file size down to 208 bytes (172 bytes 'code', 36 bytes 'header').
18 November 2018, 08:20 | #817 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Quote:
Code:
	move.l #truc,d7
	lea buffer+truc*2(pc),a4	; a4 = end of buffer

; initial fill
; move.l a0,a4				; keep the address in a4
	move.w d7,d2
.fill	move.w #2000,-(a4)
	subq.w #1,d2
	bgt.s .fill
For TaggedOpenLibrary, I wrote some info before, somewhere in this thread. And I'm not sure whether ExecBase isn't also passed in a register at startup, but maybe that's only for bootblock code?
18 November 2018, 08:50 | #818 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
18 November 2018, 10:31 | #819 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Maybe it can work. Saved as Code_BSS.
Code:
; 68020 size-optimised spigot

; header
nbch	equ 1000		; number of digits
truc	equ (nbch/2)*7		; constant used almost everywhere
	mc68020

; init
	move.l 4.w,a6		; exec base
	moveq #4,d0		; dos library
	jsr -$32A(a6)		; TaggedOpenLibrary
	move.l d0,a6
	move.l #truc,d7
	lea buffer+truc*2(pc),a4 ; a4 = end of buffer

; initial fill
; move.l a0,a4			; keep the address in a4
	move.w d7,d2
.fill	move.w #2000,-(a4)
	subq.w #1,d2
	bgt.s .fill

; header message
; exg a5,a6			; so that a6=dos
	lea msg0(pc),a0
	bsr.s aff

; main loop, requires a4=buf and d7=truc - note: a3 is free
	move.l #10000,d3
	moveq #0,d1
.loop1	clr.l d5
	move.l a4,a1
	move.l d7,d0		; i
	move.l d7,d4
	add.l d4,d4
	subq.l #1,d4		; i*2-1
.loop2	mulu.l d0,d5
	move.w (a1),d6		; r[i]
	mulu.w d3,d6		; r[i]*10000
	add.l d6,d5		; d += r[i]*10000
	divul.l d4,d6:d5
	move.w d6,(a1)+		; d%b -> r[i]
	subq.l #2,d4
	subq.l #1,d0
	bgt.s .loop2

; print digits
	divul.l d3,d4:d5	; d/10000
	add.l d1,d5		; +c
	bsr.s affd5

; next
	move.l d4,d1		; c = d % 10000
	lea 28(a4),a4		; 14 fewer iterations next time
	sub.w #14,d7		; .w is enough
	bgt.s .loop1

; end
	lea lf(pc),a0		; done, send a newline
	bra.s aff

; print the number in d5
affd5	lea decbuf-1(pc),a0	; points at the 00
	moveq #3,d0		; digit count
	moveq #10,d1		; div.l shortcut
.loop	divul.l d1,d2:d5
	addi.b #"0",d2
	move.b d2,-(a0)
	dbf d0,.loop
	; a0 now holds "nnnn",0 - print it directly via aff below

; normal CLI print
aff	move.l a0,d1
	jmp -$3b4(a6)

msg0	dc.b "pi calculator v6"
lf	dc.b 10,0

buffer:	dx.b truc*2
	dx.b 6
decbuf
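For anyone who wants to cross-check the algorithm rather than the byte count: the assembly above is the classic Rabinowitz-Wagon spigot in base 10000. A rough Python equivalent (structured after the well-known tiny C version; the function name and layout are mine, not a line-by-line translation of the 68020 code):

```python
def pi_digits(n):
    """Return the first n decimal digits of pi (n a multiple of 4)."""
    a = 10000                   # base, as in the assembly (d3)
    c = (n // 4) * 14           # working length; matches truc = (nbch/2)*7
    f = [a // 5] * c + [0]      # remainder array, initialised to 2000
    e = 0                       # carry between 4-digit groups
    out = []
    while c > 0:
        d = 0
        g = 2 * c
        b = c
        while True:             # inner loop: one pass over the remainders
            d += f[b] * a
            g -= 1
            f[b] = d % g
            d //= g
            g -= 1
            b -= 1
            if b == 0:
                break
            d *= b
        out.append("%04d" % (e + d // a))
        e = d % a
        c -= 14                 # 14 fewer terms per 4 digits produced
    return "".join(out)

print(pi_digits(100))  # 3141592653589793... (first 100 digits)
```

Calling it with n=1000 corresponds to the assembly's nbch equ 1000.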
18 November 2018, 10:53 | #820 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323