12 November 2018, 02:56 | #761 |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
I'm surprised no one mentioned a potentially difficult problem using x86 segments and C: pointer aliasing issues.
It's entirely possible for two pointers to point to precisely the same memory location and still be unequal. This can make certain compiler optimizations next to impossible even if the programmer knows that two pointers can't possibly point to the same memory location. This goes for OoO processing at the processor level as well. Dependencies are created that limit the degree to which operations can be reordered or elided. |
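A minimal sketch of the aliasing problem described above, assuming 8086 real-mode addressing, where the physical address is segment * 16 + offset, wrapped to 20 bits. The function names are made up for illustration and do not come from any compiler.

```python
def physical_address(segment, offset):
    """Real-mode 8086 translation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def far_pointers_equal(p, q):
    """Naive comparison of (segment, offset) pairs, without normalising them first."""
    return p == q


a = (0x1234, 0x0010)
b = (0x1235, 0x0000)

same_memory = physical_address(*a) == physical_address(*b)
same_pointer = far_pointers_equal(a, b)
print(same_memory, same_pointer)
```

Both pairs resolve to the same byte, yet they compare unequal unless the comparison normalises them first, which costs extra instructions on every pointer comparison.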
12 November 2018, 09:54 | #762 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
Page mode works because it is faster to read data from the same row than to open a new one (RAS/CAS latency), so accessing multiple consecutive words can be faster than accessing words in random order. The Archimedes' memory controller reads four words in one go when possible, giving a 2-1-1-1 timing (four random words would be 2-2-2-2), thus almost doubling throughput in the ideal case. Could this have been used in the Amiga? IMHO only with difficulty. The chipset is not designed for consecutive memory accesses and fetches from a lot of different memory locations: e.g. bitplanes come from six locations, the blitter has four channels, audio likewise; I think there are 25 DMA channels in total. To take full advantage of page mode, every channel would need to read four words in one go before another channel takes over, which means additional buffers and logic on-chip. A naive calculation gives 25 channels * 64 bits * 6 transistors/bit = 9600 additional transistors just for buffers, without logic. Probably half that if you used page mode only for blitter and video. The Archimedes, on the other hand, has no blitter, and video data comes from just one address (chunky mode), so it is much easier to implement. Also, the 68000 is probably less suited to page mode than the ARM2. The ARM is a load/store architecture, so typically you'd read from memory into registers, execute some code and store registers back to memory. This generates rather sequential memory accesses, especially as the ARM has conditional instructions and so does not need to branch in a lot of cases. The 68k, on the other hand, has a lot of powerful instructions and addressing modes that work directly on memory, which generates more random access patterns (instruction from one location, data access from another). It also has no conditional instructions, and at 7 MHz it cannot saturate the bus anyway. Of course you could write your code to maximize memory throughput, but compilers, for example, would have needed special adaptation.
As the Archimedes was the only ARM architecture for some time, I guess compilers were optimized for this memory type there. So to sum it up, it could probably have been done at least for bitplane and blitter access (AGA does it for the former) with quite some effort, but without speeding up CPU operation. |
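The numbers in the post above can be checked with a short sketch. It assumes the simplest possible model: a burst of four words costs 2-1-1-1 cycles, a random access always costs 2, and the buffer estimate is exactly the naive product from the post. The function names are illustrative only.

```python
def burst_cycles(words, first=2, following=1, burst_len=4):
    """Cycles to read `words` consecutive words in page-mode bursts (2-1-1-1 pattern)."""
    full_bursts, remainder = divmod(words, burst_len)
    per_burst = first + following * (burst_len - 1)
    tail = first + following * (remainder - 1) if remainder else 0
    return full_bursts * per_burst + tail


def random_cycles(words, per_access=2):
    """Cycles to read `words` words when every access opens a new row."""
    return words * per_access


page = burst_cycles(4)    # 2 + 1 + 1 + 1
rand = random_cycles(4)   # 2 + 2 + 2 + 2
speedup = rand / page

# Naive buffer estimate: 25 channels * 64 bits * 6 transistors per bit
buffer_transistors = 25 * 64 * 6
print(page, rand, speedup, buffer_transistors)
```

With four-word bursts the model gives 5 cycles against 8, a factor of 1.6; the "almost doubling" only appears as bursts get longer.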
|
13 November 2018, 10:10 | #763 | |
Registered User
Join Date: Dec 2017
Location: france
Posts: 186
|
Quote:
EDIT: ok it's 150 of access time, but 260 of cycle time. If this can help : https://retrocomputing.stackexchange...mory-bandwidth Last edited by touko; 13 November 2018 at 10:55. |
|
13 November 2018, 19:06 | #764 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
@meynaf I can get 171 bytes for pi-spigot on the 80386. Your 68020 code takes 236 bytes. However, I added this phrase to my article about the 68k: "Additionally, as shown by eab.abime.net experts, the code density of 68k is often better than that of x86."
I have made some corrections to my cycle counts for the main loop of the line drawing algorithm: ARM - 14, 80486 - 22, 80386 - 57, 80286 - 59, 8088/8086 - 98, 68000 - 63. Interestingly, the 80386 has almost the same count as the 80286. IMHO the 68k could have been successful, but the 68020 added too many difficult-to-maintain instructions and this was a big mistake. Sorry, I still can't calculate this number for the 68020 - it is very difficult. I can estimate 32 but IMHO it must be a larger number. My cycle sheet is below. Code:
                                  86  286  386  486
.loop: call putpixel              19    7    9    3
.m3:   cmp bp,7777                 4    3    2    1
       jne .l3                    16   11   10    3
.m4:   cmp bx,7777
       je .l4
.l3:   mov ax,cx                   2    2    2    1
       shl ax,1 / add ax,ax        2    2    2    1
       cmp ax,di                   3    2    2    1
       jl .l5                     17   11   10    4
       add cx,di
.m1:   add bp,8
.l5:   cmp cx,si                   3    2    2    1
       jg .loop                   17   11   10    4
       add cx,si
.m2:   add bx,8
       jmp .loop                  15    8    8    3
.l4:

                          68000                 68020
.loop
       sub.l d4,d6        4                     3?
       bgt.s .xp          13 (10 or 8+4+4=16)   6?
       add.l a0,d1
       add.l a2,d6
.xp
       sub.l d5,d7        4                     3?
       bgt.s .yp          13                    6?
       add.l a1,d2
       add.l a2,d7
.yp
       bsr.s setpixel     18                    8?
       dbf d0,.loop       10                    6?

Last edited by litwr; 13 November 2018 at 21:48. |
13 November 2018, 19:10 | #765 | |||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Quote:
Quote:
|
|||
13 November 2018, 19:13 | #766 | ||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Sorry, I have to clarify my point. For example, for me the 68000 is a bit clumsy. This is because my first x86 experience was with the 80286, which is 3 years newer and has a lot of very fast instructions. Indeed, I wouldn't use this word if I were comparing the 8088 and the 68000. I am always ready to offer you one or even two shots of the best vodka. I feel indebted to you for your numerous interesting and helpful remarks. Quote:
Was there any computer based on the 68060? As far as I know it was only available as an upgrade. Quote:
Quote:
Quote:
Quote:
I don't insist that the Archimedes is always 10 times faster, but IMHO for properly optimised programs it can be faster by about this factor. I repeat: my counts with the standard line drawing algorithm show that the ARM can be 50% faster than the 80486 at the same frequency. The Amiga 500 has only slow RAM, and Archimedes RAM access is 2 or more times faster. So theoretically the 68000 should only be 4-5 times slower, but with the Amiga it is about 10. Thanks for the interesting Amiga benchmark citation. I am a bit surprised that for some tests the A1200 is only 40% faster than the A500. IMHO the 68020 should be much faster. So it sounds like a common problem of Motorola products - their theoretical specifications are much better than their practical performance. Quote:
Any example? |
||||||||
13 November 2018, 19:31 | #767 | ||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Quote:
Quote:
Position-independent : code that can be run unchanged regardless of where it is located in memory. Relocatable : code that provides the relevant information (= relocation tables) so it can be moved anywhere in memory by altering a few parts of it. What a strange claim ! Quote:
Besides, an MMU does not exactly come for free. As for things that are missing or very poor in C++? Aside from the already mentioned swap, proper dynamic string/array support is a good example. |
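The distinction drawn in this post between position-independent and relocatable code can be sketched with a toy memory model. Everything here is hypothetical: the two "opcodes", the image layout and the one-entry relocation table are invented for illustration and do not mirror the Amiga hunk format or any real loader.

```python
def load_relocatable(image, reloc_offsets, base):
    """Copy the image and add the load address to every slot named in the relocation table."""
    loaded = list(image)
    for offset in reloc_offsets:
        loaded[offset] += base
    return loaded


def place(image, base, size=16):
    """Put an image into an otherwise zero-filled memory at address `base`."""
    memory = [0] * size
    memory[base:base + len(image)] = image
    return memory


def run(memory, base):
    """Execute the single load instruction found at `base` and return the loaded value."""
    op, operand = memory[base], memory[base + 1]
    if op == "load_abs":      # operand is an absolute address
        return memory[operand]
    if op == "load_pcrel":    # operand is a displacement from the operand slot
        return memory[base + 1 + operand]
    raise ValueError(op)


# Relocatable: slot 1 holds the absolute address of the data word (slot 3),
# assembled as if loaded at address 0; the table says slot 1 needs patching.
relocatable_image = ["load_abs", 3, "halt", 42]
reloc_table = [1]

# Position-independent: slot 1 holds a PC-relative displacement, nothing to patch.
pic_image = ["load_pcrel", 2, "halt", 42]

pic_at_8 = run(place(pic_image, 8), 8)
patched_at_8 = run(place(load_relocatable(relocatable_image, reloc_table, 8), 8), 8)
unpatched_at_8 = run(place(relocatable_image, 8), 8)
print(pic_at_8, patched_at_8, unpatched_at_8)
```

The position-independent image runs unchanged at any address, the relocatable one runs only after the loader has applied the table, and the unpatched copy reads the wrong slot.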
||||
13 November 2018, 21:23 | #768 | ||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Sorry, but I really don't see much need for position-independent code. Indeed, position-independent code is exactly what I meant. PC-relative addressing is what makes such code possible. But it is only useful for hardware testers and some system programmers. Quote:
IMHO the headers <string>, <vector> and <deque> give proper dynamic string/array support. |
||||
13 November 2018, 21:48 | #769 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Being typical does not make it more time critical. But it's interesting to see you qualify it as "typical" now. So it's typical when you get good speed measurements for your beloved x86, but it's a particular case when it comes to code density ? Damned, if that's not biased reasoning then what is it. Quote:
But as x86 does not have proper PC-relative modes, it's normal you don't see the interest Quote:
|
|||
13 November 2018, 21:53 | #770 | |||||||||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
And remember, you claimed no more than a few hundred 68040 based machines were made worldwide. This is clearly nonsense. You also forget there were a whole bunch of Apple Macs based on the 68040. These were obviously mass produced and were part of the market for a number of years. Quote:
Quote:
The low end fared considerably worse though: the A3000 was competing with Ataris and Amigas around the GBP250-300 mark and itself cost anywhere from GBP499 to GBP649 (I can't seem to find a price for it in 1991; the 649 is from 1989 and the 499 is for its replacement, the A3010, in 1992) Quote:
You are also really, really, really misrepresenting the results of that blog post - the compiler did indeed originally produce code that was 100 bytes long. However, after changing the compiler flags to produce size optimised rather than non optimised code and code that was specific to the actual CPU used rather than general ARM code, it dropped down to only 16 bytes!! Quote:
Then he takes the 12 bytes example and manages to improve that to 10 bytes. In other words, he managed to hand optimise the code by all of 6 bytes if we include the changed line of C code as an example of 'hand optimised assembly' (which it isn't!) and if we exclude that he only managed to optimise the final result of the compiler by 2 bytes. All other optimisation was done purely by the compiler. Seriously, your claim of 100 to 10 bytes is so wrong/misleading I'm starting to wonder if you even read the article. That article you linked is actually conclusive proof that compilers for the ARM are really quite good. Quote:
Quote:
It's really quite complicated to compare performance. In the last post, I even gave a counter example: Archimedes 3D games are not 10x the speed of the Amiga version, even though they suit the Archimedes very well. Another example: the A500 can accelerate line drawing using the Blitter. The speed increase varies with length, but for most lines it's apparently at least as fast as a high speed 68030. Quote:
2) It is common knowledge the 68020 in the A1200 was held back by the slow on board RAM. If you add any amount of trapdoor RAM to the A1200, it becomes about 4x the speed of the A500 - without needing to upgrade the CPU. As an example, the GVP1208 (which does nothing but add 8MB of RAM to the A1200) just about doubles the speed of the base A1200. So your conclusion about the problem being the Motorola chip is wrong. Even the idea that the design was at fault would be wrong - the A1200 was designed as a low-end computer, not as a high end one. This may not be what most Amiga fans (including me) wanted to see, but at least it was fairly easy and cheap to double the performance of the A1200. Quote:
Or if you prefer a simpler approach, this can also be done using Blitz Basic and AMOS PRO (though the last one is a bit of a stretch as it isn't as system friendly and thus doesn't do windows per se). Edit: it occurs to me the above might not be quite clear enough on what I mean, so here goes: All the above forms of BASIC can play back IFF animations (though it is easier using AMOS Pro/Blitz) and at least two of them can do so in an entirely OS friendly way. All these forms of BASIC can also create and display animations in different ways (such as using sprites, BOBs, etc). The part about the window might be more interesting - I'm 100% sure Blitz and Amiga Basic can display IFF animations on a standard Amiga screen and as such, the animation and listing can be shown at the same time. Getting the listing to display on the same screen (or in a window on the WB screen) as an IFF animation may be trickier as WB1.3/WB2.0 were not really designed for that, but it should probably still be doable. Displaying BOBs/sprites in a window should be easier to do. Last edited by roondar; 14 November 2018 at 14:38. Reason: rephrased the A1200 vs A500 benchmark bit / clarified the basic anim player bit |
|||||||||
14 November 2018, 08:46 | #771 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
By the way, if we take that into account with my 236 bytes version...
There are 36 bytes of hunk overhead, actually 38 because hunks are longword aligned and there are two null padding bytes at the end. That brings us to 198. Count 12 for the dos.library string: we're at 186. Count 12 for actually opening the library: we're at 174. Count 8 for closing said library: we end up at 166. The 80386 is beaten. Now could we please stop this senseless game? 68k is better than x86 in code density and you can't do anything about that. Even if this were not true in your particular example, it would still be just a single meaningless one. Try to find other code. |
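The byte accounting above is easy to re-run. This sketch only repeats the subtraction from the post; the dictionary keys are descriptive labels, and the 171-byte figure is the 80386 size quoted earlier in the thread.

```python
total_68020 = 236  # size of the complete 68020 executable, in bytes

overhead = {
    "hunk structure incl. 2 padding bytes": 38,
    "dos.library name string": 12,
    "opening the library": 12,
    "closing the library": 8,
}

payload = total_68020 - sum(overhead.values())
x86_size = 171  # 80386 version as quoted in the thread

print(payload, payload < x86_size)
```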
14 November 2018, 10:04 | #772 | |
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
|
Quote:
Reading this thread is like watching old couples argue, it's 40 pages long and nobody has learned anything. I get tired from just seeing it pop up in Today's Posts. |
|
14 November 2018, 11:45 | #773 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
|
Quote:
Quote:
|
||
14 November 2018, 15:19 | #774 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
Code:
                          68000
.loop
       sub.l d4,d6        8
       bgt.s .xp          10 if taken / 8 if not
       add.l a0,d1        8
       add.l a2,d6        8
.xp
       sub.l d5,d7        8
       bgt.s .yp          10 if taken / 8 if not
       add.l a1,d2        8
       add.l a2,d7        8
.yp
       bsr.s setpixel     18
       dbf d0,.loop       10

For that matter, the 68k code looks kind of odd - why are you adding address registers to data registers? I might be wrong here, but I think you mean to do the opposite. On a side note: if putpixel takes x&y coordinates, the longword add/sub commands can be optimised into word add/sub commands. That said, I've not actually looked much at the line drawing stuff you discussed, as I find it to be far too small an algorithm to be useful for comparing stuff accurately. So it might be correct after all.

The 68020 is much harder to 'cycle count' for because it has a cache, which means execution times differ depending on whether the code is inside or outside the cache (stuff in cache is much faster). Moreover, code running from the cache can continue to run during the memory accesses of prior instructions, so it's possible for some opcodes to take '0 cycles' by being run during a memory access. The Motorola manual has an example like this: Code:
; This example assumes code is running from cache
4 cycles    move.l d4,(a1)+
0 cycles    add.l d4,d6

; This example assumes code is running from memory
4 cycles    move.l d4,(a1)+
3 cycles    add.l d4,d6

The 486 actually has similar problems: it also has cache memory and will run code inside the cache considerably faster than code that isn't in the cache. I can't say for certain that the 486 also runs opcodes while waiting on memory access or uses internal concurrency, but it probably does have these abilities and thus is likewise fairly complicated to count cycles for. The 386 tended to run without cache as far as I can find. |
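To see what the 68000 figures in the table above add up to, here is a small sketch that sums one loop iteration for the fastest and slowest paths. It uses only the cycle counts listed in this post, assumes dbf is looping (10 cycles), and leaves out the body of setpixel and its return, which are not given here.

```python
SUB_L = 8           # sub.l Dn,Dn
ADD_L = 8           # add.l An,Dn
BGT_TAKEN = 10      # bgt.s, branch taken
BGT_NOT_TAKEN = 8   # bgt.s, branch not taken
BSR_S = 18          # bsr.s setpixel (the call only)
DBF_LOOPING = 10    # dbf while the counter has not expired


def half_step(branch_taken):
    """Cycles for one sub.l / bgt.s / add.l / add.l group."""
    if branch_taken:
        return SUB_L + BGT_TAKEN
    return SUB_L + BGT_NOT_TAKEN + 2 * ADD_L


def loop_iteration(x_taken, y_taken):
    """Cycles for one pass through .loop, excluding the setpixel routine itself."""
    return half_step(x_taken) + half_step(y_taken) + BSR_S + DBF_LOOPING


fastest = loop_iteration(True, True)
slowest = loop_iteration(False, False)
print(fastest, slowest)
```

Even this tiny loop varies by almost 50% between its two extremes, which is one reason a single per-loop number is hard to compare across CPUs.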
|
14 November 2018, 15:25 | #775 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Sorry, but sub.l and add.l instructions need 6c, not 8c, on the 68000.
|
14 November 2018, 16:13 | #776 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
Code:
Standard Instruction Execution Times

instruction   Size        op<ea>,An ^    op<ea>,Dn     op Dn,<M>
ADD           byte,word   8(1/0) +       4(1/0) +      8(1/1) +
              long        6(1/0) +**     6(1/0) +**    12(1/2) +

notes:
+   Add effective address calculation time
^   Word or long only
*   Indicates maximum value
**  The base time of six clock periods is increased to eight if the effective address mode is register direct or immediate (effective address time should also be added)

The same figures can be found elsewhere as well. Strangely, the 'current' edition of the 68000 family programmers manual does not include cycle counts per instruction.
Last edited by roondar; 14 November 2018 at 16:21. |
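The ** footnote is what separates the 6-cycle and 8-cycle readings, so a small sketch of the rule may help. It covers only the long-word `op <ea>,Dn` column of the table above. The mode names are invented labels, and the 8-cycle effective address time used for `(An)` is an assumption from the usual 68000 timing tables, not a figure quoted in this thread.

```python
def add_long_to_dn_cycles(source_mode, ea_time=0):
    """ADD.L <ea>,Dn on the 68000: base 6 clocks, or 8 for register direct or immediate sources."""
    base = 8 if source_mode in ("register_direct", "immediate") else 6
    return base + ea_time


# add.l a0,d1: register direct source, no extra effective address time
register_case = add_long_to_dn_cycles("register_direct")

# add.l (a0),d1: memory source; ea_time=8 is an assumed value, check the manual
memory_case = add_long_to_dn_cycles("address_indirect", ea_time=8)

print(register_case, memory_case)
```

So the base figure of 6 is right for memory operands, while the register-to-register adds in the line drawing loop fall under the footnote and cost 8.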
|
14 November 2018, 16:28 | #777 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Do your 8086 timings take into account memory accesses during prefetch?
Every opcode fetch is going to cost cycles. Can you please add best case/worst case columns for 8086. |
14 November 2018, 17:46 | #778 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Quote:
|
|
14 November 2018, 18:09 | #779 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
I'm adding address registers to data registers, because i'm out of data registers and therefore some data has to go in address registers. Using word operations is a bad idea because it's only valid for 68000 (others don't care), adds some constraint on the other function, and - i repeat - this code isn't speed critical. |
|
14 November 2018, 20:16 | #780 |
Registered User
Join Date: Dec 2016
Location: Finland
Posts: 168
|
It's not easy to compare instruction cycle counts between various architectures. If you look in the MC68020 manual, there are 3 cases given for each instruction: best, cache and worst. Typically you would be interested in the cache case and try to optimize your code towards the best case, first by trying out various memory alignments for the start of the code and then by reordering instructions for better performance. Branches also have different cycle counts depending on whether the branch is taken or not. Variations between best and worst cases can be large, e.g. 2-3x cycles. Also, in some cases an instruction can take 0 cycles to execute (it runs in parallel with some other instruction) and in others a few cycles, so the ratio is infinite.
You also can't judge the performance of a CPU by just looking at cycle counts for each instruction and comparing them against other CPUs. There are other factors, like cache performance and the number of registers available, which are also important for performance. The x86 cycle counts often look impressive on paper, but the x86s lacked the number of CPU registers that the 680x0s have. Also, the 386/486 caches weren't as good as the 68030/68040's caches (the 386 had some sort of external cache). |