17 November 2018, 08:57 | #801 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
17 November 2018, 10:49 | #802 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
By now I was interested in what was going on, so I wrote a small test program (using the CIA timer, OS disabled, and both forced-chip and forced-fast code) and ran it in WinUAE's cycle-exact mode. It did indeed show a difference on shifts even with 0 bitplanes, but it was only about 0.9%.
I'll expand the program to produce readable output and retry on actual hardware, but I'm honestly not expecting the results to change much - if WinUAE were off by that much, a bunch of A500 software would fail to run reliably. This is rather off-topic though; perhaps a new thread would be better? Last edited by roondar; 17 November 2018 at 11:04.
17 November 2018, 12:00 | #803 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Remember that you can disable all DMA but not the memory refresh cycles for chip/bogo RAM, so shift instructions whose cycle counts are not a multiple of four can be delayed. EDIT: Slowdown estimate for a PAL Amiga in chip RAM: 100/227 * (4/2) = 0.88%, so 0.9% seems a valid result. Last edited by ross; 17 November 2018 at 12:11. Reason: Added estimation
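The estimate above can be reproduced in a few lines of Python (the variable names are mine; the 227 slots per line and 4 refresh slots are taken from the formula as given):

```python
# Slowdown estimate for CPU code running from chip RAM on a PAL Amiga:
# 4 refresh DMA slots out of 227 per scanline, with the 4/2 term reflecting
# that only every other slot can collide with a 68000 bus access.
slots_per_line = 227       # PAL DMA slots per scanline
refresh_slots = 4          # refresh slots per scanline
estimate = 100.0 / slots_per_line * (refresh_slots / 2)
print(f"{estimate:.2f}%")  # 0.88% - close to the measured ~0.9%
```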
17 November 2018, 12:59 | #804 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
Wouldn't fast RAM also need refresh cycles? Maybe UAE simply doesn't emulate those, as there probably isn't any software relying on specific fast RAM timing (which could also vary a lot between implementations - take e.g. the slow PCMCIA SRAM...).
17 November 2018, 13:17 | #805 | ||
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Perhaps in a good implementation (with cells fast enough) delays can be negligible. Quote:
17 November 2018, 13:29 | #806 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Right, I've now run my test program on an actual 68000-powered Amiga with Fast RAM and added a screenshot with the results.
Results are identical to the emulated Amiga: 0 bitplanes slows down shifts by about 0.9%. For those interested, I've attached the executable so you can run your own test. Note that it does contain a 'minor' bug: after running, it's possible the keyboard stops working. This is probably due to me mishandling the restoration of the CIA registers. That said, the timer is accurate, so the important bit does work.
17 November 2018, 13:31 | #807 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Roondar: Very interesting result! What kind of memory expansion do you use? |
17 November 2018, 13:52 | #808 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
I'm actually still looking for an affordable way to add sideslot Fast RAM (without a faster CPU) to the A500 for other testing purposes. Though seeing that these results match WinUAE, I guess it's less of a necessity now - I should be able to use the A600 instead. For reference, my program runs 30000 shifts (asl.w #2,d0) and times the result using the CIA. These should take exactly 10 cycles each, but turn out to take 0.9% more when run from chip RAM. The result is given in CIA cycles, where one CIA cycle is 10 CPU cycles as the CIA runs at 1/10th of the CPU frequency.
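A quick sanity check on the arithmetic (the figures are from the post above; the script itself is mine, not the actual test program):

```python
# 30000 asl.w #2,d0 instructions at 10 CPU cycles each, timed by a CIA
# timer that ticks once every 10 CPU cycles (the E clock on a stock 68000 Amiga).
shifts = 30000
cycles_per_shift = 10          # asl.w #2,d0 on a 68000
cpu_cycles_per_cia_tick = 10

expected_ticks = shifts * cycles_per_shift // cpu_cycles_per_cia_tick
print(expected_ticks)          # 30000 CIA ticks expected for an uncontended run
```

Any measured count above this baseline is contention (refresh, bitplane DMA, etc.) expressed directly in CIA ticks.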
17 November 2018, 16:55 | #809 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Hmm.. maybe the 20% increase was with 4 planes active. I did see a difference between running the code from fast RAM and from chip. I'll have a dig about my 1k drive and see if I can find it later this week.
I was using raster timing.
17 November 2018, 19:52 | #810 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Still, it's nice seeing confirmation that most of them are taking 6 cycles. That means there's no 33% speed penalty for a long run of shift instructions. Have you tested any other sequences that complete in a number of cycles not divisible by four? With the blitter running, it's conceivable that some actions might be faster overall using instructions with lots of idle cycles. A series of shifts and adds might take fewer CPU cycles than an equivalent MUL, but the MUL is going to leave lots of gaps for other DMA.
17 November 2018, 19:56 | #811 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Well, in for a penny, in for a pound.
I altered my program to display an empty 320x256 bitmap in either 4, 5 or 6 bitplanes (not selectable, I just manually change the number of active bitplanes in the copper list and reassemble). Here are the results:

Code:
4 bitplanes (chip) = 33172
4 bitplanes (fast) = 30001
-> about 11% slower for shifts & other instructions taking 10 cycles

5 bitplanes (chip) = 38473
5 bitplanes (fast) = 30001
-> about 28% slower for shifts & other instructions taking 10 cycles

6 bitplanes (chip) = 39136
6 bitplanes (fast) = 30002
-> about 30% slower for shifts & other instructions taking 10 cycles
Note that 5 & 6 bitplanes really are that close in repeated testing. All numbers fluctuate a bit when the test is repeated once bitplane DMA is introduced, but the difference is small - no more than about 300-500 CIA cycles.

In realistic code, quite a few instructions have cycle counts divisible by 4 and thus won't get slowed down at all, and many of the instructions whose counts aren't divisible by 4 take longer than 10 cycles and are therefore affected much less by bitplane contention (especially on 5 and 6 bitplane screens).

As an interesting side note, code is not always affected by bitplane fetches as much as might be expected. A naive calculation might conclude that 5 bitplanes slow the CPU down by 25% and 6 bitplanes by 50%. This turns out to be untrue, both because the 68000 can keep running internally during idle memory cycles and because bitplane DMA only occurs during the part of the frame where the raster is actually drawing.

I've also attached the executables for these three tests so you can test for yourself. Note that the test only starts after pressing the left mouse button once the display is cleared, and that only 4 bitplane pointers are set in the copper list, so you may get a strange display in the 5 and 6 bitplane tests - this is pure laziness on my part; the garbage on screen does not actually impact the results, so I left it as is. In all other ways the tests work the same as the previous one.

----

Quote:
Running the blitter while doing MULUs etc. is indeed a viable tactic: as you say, a MULU takes a long time, but the blitter gets to run during its idle cycles, which makes up for most of that cost. However, running significant amounts of code while blitting can be tricky without using Copper blitting or interrupt-based blitting. Last edited by roondar; 17 November 2018 at 20:04.
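The percentages quoted in the Code block above follow directly from the chip vs. fast CIA counts; a quick cross-check (the script is mine, only the counts are from the post):

```python
# Measured CIA counts from the post above: (chip RAM, fast RAM) per bitplane count.
results = {4: (33172, 30001), 5: (38473, 30001), 6: (39136, 30002)}

for planes, (chip, fast) in results.items():
    slowdown = (chip - fast) / fast * 100  # extra time for 10-cycle instructions
    print(f"{planes} bitplanes: {slowdown:.1f}% slower")
# 4 bitplanes: 10.6% slower
# 5 bitplanes: 28.2% slower
# 6 bitplanes: 30.4% slower
```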
17 November 2018, 20:01 | #812 | |||||||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Where is your logical proof for this? Mine is easy, I can repeat it: you are too emotional about x86 - this means a strong passion. BTW I feel quite comfortable knowing that the 68000 can sometimes be more than 100% faster than the 8088 at the same frequency or, thanks to you, that 68000 code density is often better than the 8086's.
Quote:
It is easy to shorten my code by several bytes. The discussion in this thread has clearly shown that segmentation gives us some advantages, e.g., the headerless format. Indeed, its 8086 implementation also has disadvantages. Anyway, we have a strong fact: 168 < 236. It would be more interesting to compare larger programs where an algorithm needs to use a lot of memory, but it would be too difficult for me to participate properly in such a contest. A shorter OS call is a well-known advantage of x86: DOS and Linux-x86 use INT for those calls. Quote:
Quote:
Code:
.l3: mov ax,cx               2  2  2  1  3
     shl ax,1 / add ax,ax    2  2  2  1  4
     cmp ax,di               3  2  2  1  5
     jl .l5                 17 11 10  4
Quote:
Quote:
Noticeable use was found only for its cheaper version, the 68LC040, which does not have a built-in coprocessor. However, the first versions of this chip had a serious hardware defect which did not even allow software emulation of the coprocessor! Motorola always had problems with mathematical coprocessors. Motorola never released such a coprocessor for the 68000/68010, while Intel had been shipping its very successful 8087 since 1980. But to get a significant performance boost, code for the 68882 needs to be compiled differently than for the 68881. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
It is interesting that the 68020/030 can execute up to three instructions simultaneously while the Pentium could only execute two, yet the Pentium is much faster per MHz. Why?! Quote:
Quote:
Quote:
Quote:
EDIT. My tests http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html show that an A1200 with fast RAM is only 9% faster than without fast RAM... Last edited by litwr; 17 November 2018 at 20:54.
17 November 2018, 21:02 | #813 | ||||||||||||||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
The point about 'fast enough' is interesting and mostly true from 1992 onwards, but I do have to point out I have never claimed that the 68040 was faster than the 486DX/2, nor have I claimed the 68040 outsold it. All I have claimed (correctly) is that it wasn't the failure you've made it out to be and was used in a fair number of systems. Quote:
The 68882 vs 68881 stuff is interesting, but still misleading as Intel FPUs had similar issues: floating point code optimised for 80387 performs quite a bit better than code optimised for 8087/80287 when running on the 80387 and up. Moreover, while the 68882 does perform better when software is recompiled for it, it still manages to be competitive with the 80387 even when software isn't recompiled. Likewise, claiming Motorola had buggy FPU's as a point where Intel allegedly does better is hilarious given the Pentium FDIV bug. Quote:
Your reply is especially interesting as the 486, while fast for an Intel chip, was not in fact all that fast compared to 'dedicated high performance' CPUs on release. Not only were direct competitors such as the 68040 actually faster, but a number of other CPUs were as well. So much for those 'talented people' - they couldn't even beat the guys at Motorola, who you claim had no such talent. Quote:
No fantasy there, just facts. Quote:
Quote:
To add another example, even the pi-spigot benchmark you have linked shows that the difference is smaller than 10x. In it, the 8MHz Acorn 440/1 is roughly 3.7x the speed of an A500 and the 12MHz Acorn 3020 is roughly 5.3x the speed of an A500. Quote:
All I really showed with my example was that the A1200 does indeed perform better than you claimed it does and I stand by that. Case in point, Super Stardust was running quite well on an A1200 and won't run at all on the cheap 386's you pointed out. Quote:
More importantly, it was pretty much impossible to use a PC in any real sense without a hard drive by 1992. The A1200, while benefiting enormously from an added hard drive is actually still useful without one. Lastly, note that a 'cheap' SVGA card in 1992 cost between $100 and $200 and required a monitor which was more expensive than a standard VGA monitor due to the higher resolutions supported. Add in a sound card for parity with the A1200 (which also cost around $100) and that's $300 gone before the mainboard/fdd/ram/case is even added. In short, I humbly ask if you have some facts (such as adverts etc) to back up your opinion about $300 PC's? Quote:
Quote:
Quote:
They just made it compatible with the earlier model and gave it a similar name. Not that this is a bad thing, but they did basically add a new ISA to the 8086 and call it the 80286/386. They did it again later (Pentium MMX or Pentium 2 IIRC) when they stopped running x86 code natively altogether. And again when they transitioned to x64. Quote:
Quote:
Quote:
It also shows quite clearly why you can't just use a tiny bit of code to accurately measure performance etc. - it just doesn't work well, especially once cache comes into play. Last edited by roondar; 18 November 2018 at 03:26. Reason: Looked up 386 pricing & clarified a few things
17 November 2018, 21:04 | #814 | |||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
The logical proof for this is the amount of energy you spend in false arguments in vain attempts to prove your point about code density (and on ONE example, btw).
Another proof is easy to find in your articles - they are indeed VERY emotional. Quote:
And for me being too emotional about x86, well, you haven't asked me what i think about various other architectures... Quote:
In fact x86 is so bad that even Intel themselves tried to get rid of it ! And ARM situation isn't better. There are very different instruction sets (original arm, thumb, thumb-2, arm-64) so it's obvious it's inadequate in some situations. Any code can be shortened by removing features... Quote:
Quote:
Oh, and i haven't counted the memory allocation, by the way, so it's even harder. That's probably another 20 bytes gain. Quote:
But I will use your logic. You said before that it's impossible to beat gcc. Do you still hold this claim ? If so, you can perfectly make programs of any size. (The claim is wrong, obviously, but it's you who are supposed to believe in it, so use your "knowledge".) And don't tell me writing some C code is gonna be too difficult. You said C was beautiful. (And again i didn't agree, but you have to behave as if it were true, or at least admit it wasn't.) Quote:
The Atari ST also has short OS calls, so it's NOT an advantage of x86. Same for MacOS 68k. INT isn't shorter than TRAP or Line-A. When will you stop writing spurious things? Quote:
First, while this will work if you ask for very little memory, at some point there is a risk there won't be enough of it - and then, crash'n'burn. Dirty code. Second, you are once again removing OS code and attributing the gain to x86.
17 November 2018, 22:45 | #815 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Phil, could you post the source of your Pi version? I want to see how it looks. Perhaps by using TaggedOpenLibrary for dos.library and Code_BSS (no need to alloc/free memory), your code will be shorter than the PC .COM version.
18 November 2018, 02:51 | #816 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,547
Quote:
Sticking to standard AmigaOS 3.0, I managed to trim off a few bytes using FindName on Exec's LibList (so no need to close the dos library or cache the dos base). I also used 'code_bss' to avoid having to allocate memory (if we must have a header then why not make full use of it?). This got the file size down to 208 bytes (172 bytes 'code', 36 bytes 'header').
18 November 2018, 08:20 | #817 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Quote:
Code:
	move.l #truc,d7
	lea buffer+truc*2(pc),a4	; a4 = end of buffer

; initial fill
; move.l a0,a4				; keep the address in a4
	move.w d7,d2
.fill	move.w #2000,-(a4)
	subq.w #1,d2
	bgt.s .fill
For TaggedOpenLibrary, I wrote some info before, somewhere in this thread. And I'm not sure whether ExecBase isn't also passed in a register at startup, but maybe that's only for bootblock code?
18 November 2018, 08:50 | #818 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
18 November 2018, 10:31 | #819 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Maybe it can work. Saved as Code_BSS.
Code:
; 68020 size-optimised spigot

; header
nbch	equ 1000		; number of digits
truc	equ (nbch/2)*7		; constant used almost everywhere
	mc68020

; init
	move.l 4.w,a6		; exec base
	moveq #4,d0		; dos library
	jsr -$32A(a6)		; TaggedOpenLibrary
	move.l d0,a6
	move.l #truc,d7
	lea buffer+truc*2(pc),a4 ; a4 = end of buffer

; initial fill
; move.l a0,a4			; keep the address in a4
	move.w d7,d2
.fill	move.w #2000,-(a4)
	subq.w #1,d2
	bgt.s .fill

; header message
; exg a5,a6			; so that a6=dos
	lea msg0(pc),a0
	bsr.s aff

; main loop, requires a4=buf and d7=truc - note: a3 is free
	move.l #10000,d3
	moveq #0,d1
.loop1	clr.l d5
	move.l a4,a1
	move.l d7,d0		; i
	move.l d7,d4
	add.l d4,d4
	subq.l #1,d4		; i*2-1
.loop2	mulu.l d0,d5
	move.w (a1),d6		; r[i]
	mulu.w d3,d6		; r[i]*10000
	add.l d6,d5		; d += r[i]*10000
	divul.l d4,d6:d5
	move.w d6,(a1)+		; d%b -> r[i]
	subq.l #2,d4
	subq.l #1,d0
	bgt.s .loop2

; print digits
	divul.l d3,d4:d5	; d/10000
	add.l d1,d5		; +c
	bsr.s affd5

; next
	move.l d4,d1		; c = d % 10000
	lea 28(a4),a4		; 14 fewer iterations next time
	sub.w #14,d7		; .w is enough
	bgt.s .loop1

; end
	lea lf(pc),a0		; done, send a newline
	bra.s aff

; print the number in d5
affd5	lea decbuf-1(pc),a0	; points at the 00
	moveq #3,d0		; digit count
	moveq #10,d1		; div.l shortcut
.loop	divul.l d1,d2:d5
	addi.b #"0",d2
	move.b d2,-(a0)
	dbf d0,.loop
	; a0 now holds "nnnn",0 - print it directly via aff below

; normal CLI print
aff	move.l a0,d1
	jmp -$3b4(a6)

msg0	dc.b "pi calculator v6"
lf	dc.b 10,0

buffer:	dx.b truc*2
	dx.b 6
decbuf
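For anyone who wants to cross-check the algorithm rather than the byte count: the assembly above is the classic Rabinowitz-Wagon spigot in base 10000. A rough Python equivalent (structured after the well-known tiny C version; the function name and layout are mine, not a line-by-line translation of the 68020 code):

```python
def pi_digits(n):
    """Return the first n decimal digits of pi (n a multiple of 4)."""
    a = 10000                   # base, as in the assembly (d3)
    c = (n // 4) * 14           # working length; matches truc = (nbch/2)*7
    f = [a // 5] * c + [0]      # remainder array, initialised to 2000
    e = 0                       # carry between 4-digit groups
    out = []
    while c > 0:
        d = 0
        g = 2 * c
        b = c
        while True:             # inner loop: one pass over the remainders
            d += f[b] * a
            g -= 1
            f[b] = d % g
            d //= g
            g -= 1
            b -= 1
            if b == 0:
                break
            d *= b
        out.append("%04d" % (e + d // a))
        e = d % a
        c -= 14                 # 14 fewer terms per 4 digits produced
    return "".join(out)

print(pi_digits(100))  # 3141592653589793... (first 100 digits)
```

Calling it with n=1000 corresponds to the assembly's nbch equ 1000.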
18 November 2018, 10:53 | #820 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323