English Amiga Board



 
 
Old 12 September 2022, 21:22   #961
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by nonarkitten View Post
Wow. Umm, #include <stdint.h>

I mean, did you stop learning C sometime before 1999? Because that's been in the C99 standard since then.
Retrocomputing
Quote:
...is the use of older computer hardware and software in modern times. Retrocomputing is usually classed as a hobby and recreation rather than a practical application of technology; enthusiasts often collect rare and valuable hardware and software for sentimental reasons. However, some do make use of it.
Am I 'daft' for wanting to stay in the retrocomputing spirit by continuing to use the C compiler I paid a lot of money for back in the 90's?

Quote:
Also, gcc has had overflow detection for a very long time. At least since GCC6, but I'm not sure exactly.
It's a compiler-specific extension, apparently introduced in GCC 5 (unconfirmed):

6.56 Built-in Functions to Perform Arithmetic with Overflow Checking
Quote:
The following built-in functions allow performing simple arithmetic operations together with checking whether the operations overflowed.

Built-in Function: bool __builtin_add_overflow (type1 a, type2 b, type3 *res)
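For reference, a minimal use of that built-in looks like this (GCC/clang only, not SAS/C; `checked_add` is just an illustrative wrapper name):

```c
#include <stdbool.h>

/* GCC/clang-specific built-in: returns true when the signed addition
   overflowed, otherwise stores the sum in *res (assumes 32-bit int).
   checked_add is an illustrative wrapper name, not a standard API. */
static bool checked_add(int a, int b, int *res)
{
    return __builtin_add_overflow(a, b, res);
}
```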
GCC 6.8 has recently been ported to the Amiga, but
Quote:
Originally Posted by Olaf Barthel View Post
For the cc1 command this means that some 11 MBytes of code are processed in a loop which reads chunks of 1024 bytes each and copies these, about 11,200 times...

The real kicker is what happens with the relocation information which is applied to the code hunk. For cc1 more than 89,840 individual relocations need to be performed and the format in which these instructions are stored in the load file is not geared towards efficiency.
It's a behemoth that I bet would be very slow on my A1200 if it worked at all. Might be better with a Vampire.

Due to that and other issues with GCC I'm not touching it. Instead I port the code to SASC. I fully admit to not being an expert on GCC or C in general, so this often takes a while (I'm getting better at it...). This is worth doing because I have fun doing it, and it's in the retrocomputing spirit. Who knows, perhaps one day I will port some of my own C code to the Amiga, and of course release the sources for anyone else to use.

Quote:
"Serious" lol. Yeeeeah, it was totally Intel that made Motorola do that. LOL.
No, I wasn't being serious.

But there was a rivalry between Intel and Motorola with each trying to outdo the other - for a while. At some point (040?) Motorola probably realized they didn't have the chip-making skills to keep up, so they started removing features instead. Intel just kept making chips with more and more transistors, and putting bigger and bigger heatsinks with huge fans on them. This strategy worked. Cutting the CPU down to work with fewer transistors didn't.

Imagine if Motorola had followed the same path as Intel. Assuming they were able to upgrade their foundries similarly, we might now have a successor to the 68030 similar to the 68080 but at GHz speeds. Apple might not have gone PPC and then back to Intel, making Macs more popular and 68k still relevant. The Amiga would ride on those coattails instead of heading down the dead end of PPC.

Quote:
Having 64-bit multiply has NOTHING TO DO WITH DISK LIMITS.
So the error in HDToolbox has nothing to do with integer overflow?

Quote:
64-bit multiply takes massive amounts of silicon.
And yet Motorola had no problem squeezing it into the 68020 in 1984.
Bruce Abbott is offline  
Old 12 September 2022, 21:37   #962
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by meynaf View Post
If one requires training in C because mistakes are so easy to make otherwise, then C isn't exactly as programmer-friendly as it's supposed to be...
There's your mistake. C is actually designed to separate elite programmers from the daft among us.

Just go to Stack Overflow and you will see what I mean. It's full of condescending elites who berate any beginner brave enough to ask a question. Can be entertaining when they fight amongst themselves though...
Bruce Abbott is offline  
Old 13 September 2022, 03:05   #963
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Bruce Abbott View Post
<snip>
How to check overflow of an integer in C.

Option 1. This is by far the most common one: compare the result to your operands. If the result of an addition is less than either of the operands, then an overflow has occurred. Most decent compilers will compile this into a simple flag check.
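For the unsigned case, Option 1 can be sketched in portable C (the helper name is mine, not from the post):

```c
#include <stdint.h>
#include <stdbool.h>

/* Option 1 for unsigned 32-bit addition: unsigned arithmetic wraps
   modulo 2^32, so the sum overflowed exactly when it came out smaller
   than one of the operands. Helper name is illustrative. */
static bool add_overflows_u32(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;
    return *sum < a;   /* true iff a carry out of bit 31 occurred */
}
```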

Option 2. Use multi-precision arithmetic. So if you want a 32x32 => 64-bit MULU on hardware that doesn't have it, break it into four 16x16 => 32-bit MULU's and add them together. My first real programming job was writing a 64-bit library for the 68332. It's not hard.
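Option 2, sketched for the 32x32 => 64-bit case in plain C (the helper name and the two-32-bit-halves interface are my own illustration):

```c
#include <stdint.h>

/* 32x32 => 64-bit unsigned multiply built from four 16x16 => 32-bit
   multiplies, as the post describes. The result comes back as two
   32-bit halves. Helper name is illustrative. */
static void mulu64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* covers bits  0..31 */
    uint32_t lh = al * bh;   /* covers bits 16..47 */
    uint32_t hl = ah * bl;   /* covers bits 16..47 */
    uint32_t hh = ah * bh;   /* covers bits 32..63 */

    /* sum the two middle partial products, tracking the carry out */
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;   /* carry into bit 48 */

    uint32_t low = ll + (mid << 16);
    uint32_t low_carry = (low < ll) ? 1u : 0u;

    *lo = low;
    *hi = hh + (mid >> 16) + (mid_carry << 16) + low_carry;
}
```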

Option 3. Define a small assembly routine (or even inline) which does this and then you can call it from C. Those built-ins from GCC could be reimplemented for SAS/C in an afternoon.
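For Option 3 on a compiler without the built-ins, the signed check itself can even stay in portable C, no assembly required for the common case (helper name is mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Plain-C stand-in for __builtin_add_overflow on int32_t: overflow
   happened iff both operands have the same sign and the wrapped result
   has the opposite one. The unsigned addition avoids signed-overflow
   undefined behaviour. Helper name is illustrative. */
static bool sadd_overflows(int32_t a, int32_t b, int32_t *res)
{
    uint32_t r = (uint32_t)a + (uint32_t)b;
    *res = (int32_t)r;
    return ((((uint32_t)a ^ r) & ((uint32_t)b ^ r)) >> 31) != 0;
}
```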
nonarkitten is offline  
Old 13 September 2022, 05:47   #964
hammer
Registered User

 
Join Date: Aug 2020
Location: Australia
Posts: 663
Quote:
Originally Posted by meynaf View Post
Well, Windows 3.1 apps won't run on Windows 11 either, at least not without some kind of emulation...
The end user can run 16-bit Windows apps on Windows 11 via the open-source WineVDM/OTVDM.

---

The leaked NTVDMx64 is based on Microsoft's original NTVDM, so dropping 16-bit Windows support from 64-bit Windows was Microsoft's own decision.
More info from https://github.com/leecher1337/ntvdmx64

The HAXM version doesn't emulate the CPU; it uses HAXM's VT-x hardware acceleration (so the CPU needs to support VT-x).
hammer is offline  
Old 13 September 2022, 06:23   #965
hammer
Registered User

 
Join Date: Aug 2020
Location: Australia
Posts: 663
Quote:
Originally Posted by nonarkitten View Post
For example, "AMMX MIPS". WTF are those? So the V4 is 13 times faster than the 68060 is at AMMX? Really? I didn't even know the 68060 had AMMX, must be an undocumented feature.

Oh, you mean he's comparing some "arbitrary algorithm" in AMMX to the same routine in pure 68K assembly and it's about 13 times faster. Talk about cherry-picked nonsense.

On RiVA, a hyper-specific use-case where AMMX really makes a difference, there was a 100-150% speed up. Is that huge? Sure, I think so. Is it THIRTEEN TIMES FASTER? No. Hell no. And you'd have to be a moron to even claim that.
Using CoffinOS R58's MPEG video test files, RiVA 68K and FroggerNG (from my A1200 rev1D1/TF1260/AOS3.2 install) run pretty well on my PiStorm/RPi 3a/Emu68/A500 rev6A.


RiVA AMMX (AC68EC080) requires a separate build from RIVA 68K while PiStorm/RPi 3a/Emu68 (like Transmeta Code Morphing Software method, cite ref 1) path improves the legacy RIVA 68K's playback performance.


Depending on the CPU, CoffinOS R58's startup swaps between RIVA AMMX and RIVA 68K executables.


Transmeta Code Morphing Software (CMS) includes JIT with instruction reorder on VLIW micro-architectured CPU. The compiled code is stored in a "translation cache". CMS does not try retranslating the region in which the interrupt occurs. CMS has techniques for handling self-modifying code.


Reference
1. https://www.cs.cornell.edu/courses/c...log/transmeta/
hammer is offline  
Old 13 September 2022, 08:29   #966
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by meynaf View Post
This is totally ridiculous.
68K can do direct memory-to-memory moves, it can do arithmetic to/from memory directly too, it can handle three data types in computations and has a complete set of addressing modes.
ARM can load bytes, words and longs too. ARM does ternary arithmetic though, so it needs far fewer pointless moves shuttling around data in registers.

Quote:
Originally Posted by meynaf View Post
So much programming flexibility in comparison to any RISC cpu. ARM is incredibly limited and has a terrible syntax. Not to mention 68k also has better code density.
I like ARM's syntax. It's quite readable. Not sure what issue you have.

Better code density? Than Thumb2? No way.

Quote:
Originally Posted by meynaf View Post
Why would I do this, it's totally useless instruction in real life code.
Which part, loading a signed int? Because this isn't real?
Code:
int8_t flags;
Or maybe it's the 32-byte offset? Because this isn't real? You couldn't have an array of these you might want to iterate over?
Code:
struct { int8_t flags; /* ... */ }; // size is 32 bytes
Or maybe it's the conditional? Because those never happen either, right?
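Putting the three together, this is the kind of loop that produces that load-signed-byte-with-offset-and-conditional pattern (the 32-byte record layout here is a hypothetical illustration):

```c
#include <stdint.h>

/* One 32-byte record per element; stepping through them by the signed
   flags byte is the ldrsb-with-offset pattern under discussion.
   Layout and names are a hypothetical illustration. */
struct rec { int8_t flags; uint8_t pad[31]; };   /* sizeof == 32 */

static int count_negative(const struct rec *r, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (r[i].flags < 0)    /* signed byte load + conditional */
            count++;
    return count;
}
```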

Quote:
Originally Posted by meynaf View Post
Very funny argument, considering predicates have been dropped in aarch64 (along with automatic barrel shifter).
These were dropped because ARM64 still uses 32-bit opcodes and the number of registers needed more bits.

Quote:
Originally Posted by meynaf View Post
Not five instructions, just four (and if Günni listened to me it would be only three).
The original 68000 has no byte-to-long sign-extend, but sure. On the 68020 and higher, this would "only" be four.

Quote:
Originally Posted by meynaf View Post
As said, your one RISC instruction isn't very useful for real life workloads. Now consider simple and much more useful add.l #data,mem on 68k (32-bit constant added on a variable with a linear 32-bit address). Doing that on PPC should require something like 6 instructions. How many on ARM already ?
Three, ldr, add, then str. This is a little contrived, you usually don't march through your data in passes doing one thing at a time.

On the 68K, RAM was "fast enough" not to care, so optimization tricks, like using tables, were common. At 1GHz, none of those tricks work anymore because a cache miss can cost dozens of clock cycles. So it's better to compute on the fly and keep things in registers. The 68K style of assembly language requires that the CPU be slower than RAM, and even by the 68060 that was breaking down.

Quote:
Originally Posted by meynaf View Post
Why would we want to do this ? We HAVE a stack, so why not just use it ?
Because you don't have infinite memory bandwidth?

Quote:
Originally Posted by meynaf View Post
Oh sorry, maybe ARM does not support a proper stack ? Oh wait, i forgot : you wasted a GPR for the program counter so you don't want to reserve another one. I understand.
Yes, ARM supports a "proper stack," don't be stupid. And using the PC as a regular register has unlimited potential for abuse that's so cool. Things you could never do on the 68K.

Quote:
Originally Posted by meynaf View Post
But we can serve an interrupt without touching the registers.
But not the stack. By the time you're in the interrupt, you're already dozens of cycles behind the ARM.

Quote:
Originally Posted by meynaf View Post
The 68k can support direct memory operations.
Oh, wait. ARM can't do ADDQ.W #1,mem ? Poor mite, your so-great cpu can't implement a simple interrupt counter without touching registers.
That's silly. Why would you worry about touching registers? Your ADDQ is still performing the LOAD, ADD and STORE operations; you're just not aware of what's going on in the microcode.

Quote:
Originally Posted by meynaf View Post
And it can't move a memory cell directly to another ?
Again, once processors top out around 200MHz, direct memory for everything becomes a serious limitation since RAM cannot keep up anymore.

Quote:
Originally Posted by meynaf View Post
Also we CAN branch to a subroutine without touching the stack, it's just LEA+JMP. Not that this operation would be a common one, of course.
Good point, but then you don't get any sort of prediction on the "return."

Quote:
Originally Posted by meynaf View Post
Try move ccr to a data register. Or use Scc instruction to keep the condition. Or better, do your computation on an address register. Or don't do it at all, it's not a common operation either.
It's pretty common in emulation.

On ARM you just omit the 's' flag on the opcode and then the ALU operation doesn't affect flags. The nice thing is, this works for all ALU operations like MUL and DIV, and not just the couple of cherry-picked ones that some engineer in 1976 thought would be useful. Saving and restoring are two cycles too many for me.

Quote:
Originally Posted by meynaf View Post
But again, there is nothing wrong in having a stack. Or maybe you're allergic to stacks ? Don't have a look at java bytecode or webassembly then !
LOL. These are intermediate representations and both will always get JITed into machine code. Many of those stack operations get eliminated.

Quote:
Originally Posted by meynaf View Post
Not everything is useful, but we have Bitfields and good luck with your ARM to do the same with fewer instructions !
ARM has bitfields too.

ARM has better code density than 68K.

Your contrived example is far worse than mine. I see this kind of pattern all the time in compiled code on ARM and use it in PJIT. It's great. I love conditional everything. I love that every load can also be a sign or zero extend. I love that I can take huge steps when indexing. It's great for structs. But you're an ASM coder, you don't think in "structs."

But a single RMW for a RAM variable? Unless that's ALL you're going to do with that variable, it would be a lot more efficient to have separate LOAD/ADD/STORE steps. Not that I've ever had to have an interrupt just to count one number. That's what timers are for.

Your "everything in memory" model doesn't work with modern hardware, where CPUs are several dozen times faster than even the fastest RAM. Caching helps, but expecting it to save your bacon is poor programming design. And even on the 68000, loading stuff into registers to do a lot of work is still going to be faster than munching through RAM all the time. Every RMW is going to eat cycles, and ADDQ.L to a register is always going to be faster than ADDQ.L to RAM.
nonarkitten is offline  
Old 13 September 2022, 08:37   #967
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by hammer View Post
RiVA AMMX (AC68EC080) requires a separate build from RIVA 68K while PiStorm/RPi 3a/Emu68 (like Transmeta Code Morphing Software method, cite ref 1) path improves the legacy RIVA 68K's playback performance.
Emu68 does not use Transmeta's method of JIT.
- Emu68 is not a tracing JIT; cache is just flushed if it runs out
- Emu68 does not perform any interpreter passes first
- Emu68 does very little to optimize code
- Emu68 has no 'rollback' nor any need to
- Emu68 handles self-modifying code by checksumming the whole compiled block

Bernie talks about this in the Apollo forums -- too much optimization is usually worse for a JIT than just being a dumb translator. Emu68 doesn't do all that, so its best-case performance actually matches that of the ARM core itself. Transmeta was trying to be too smart here and didn't get that simpler is better.

Last edited by nonarkitten; 13 September 2022 at 08:45.
nonarkitten is offline  
Old 13 September 2022, 10:26   #968
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thomas Richter View Post
Here comes the man with programming experience...
Yep.



Quote:
Originally Posted by Bruce Abbott View Post
There's your mistake. C is actually designed to separate elite programmers from the daft among us.

Just go to Stack Overflow and you will see what I mean. It's full of condescending elites who berate any beginner brave enough to ask a question. Can be entertaining when they fight amongst themselves though...
Indeed, i missed this



Quote:
Originally Posted by nonarkitten View Post
ARM can load bytes, words and longs too.
Recent ARM, yes. Original ARM couldn't do words.
And while it can load/store them, it cannot do direct computations on that data size, and it cannot perform arithmetic on memory operands on the fly. Instead of single, simple RMW instructions, you have load + op + store. In comparison, even a 6502 can increment a memory cell in just one instruction.


Quote:
Originally Posted by nonarkitten View Post
ARM does ternary arithmetic though, so it needs far fewer pointless moves shuttling around data in registers.
The funny part is, it does not make code shorter (single 32-bit vs 2x 16-bit) and it's easy to handle by instruction fusing (IIRC 68080 does that).


Quote:
Originally Posted by nonarkitten View Post
I like ARM's syntax. It's quite readable. Not sure what issue you have.
Frankly, you find ldrsbne r0, [r1, #32]! readable ? Then what would it have to look like to be unreadable ??


Quote:
Originally Posted by nonarkitten View Post
Better code density? Than Thumb2? No way.
Oh yes, it has. And the bigger the program becomes, the bigger the difference is. Ready for a small code contest ?
You seem to forget that 16-bit Thumb opcodes can only access 8 registers while 68k 16-bit opcodes can access them all. And operations they can perform are of course limited, while on 68k they are not.


Quote:
Originally Posted by nonarkitten View Post
Which part, loading a signed int? Because this isn't real?
Code:
int8_t flags;
It is real, but loading an int8 on 68k does not require sign-extending it; we can use the byte directly in computations. Of course you were forced to extend it on the ARM.


Quote:
Originally Posted by nonarkitten View Post
Or maybe it's the 32-byte offset? Because this isn't real? You couldn't have an array of these you might want to iterate over?
Code:
struct { int8_t flags; .... }; // size is 32 bytes
You could do this. Or if you're smart enough, group bytes together in same array so you don't have data alignment issues.


Quote:
Originally Posted by nonarkitten View Post
Or maybe it's the conditional? Because those never happen either, right?
May I assume that if the condition is false, the "32" won't get added/subtracted to the register because the instruction isn't actually executed ?

No, really, it's not the byte, it's not the 32, it's not the conditional. Not individually. It's all of them together that makes an unlikely situation.


Quote:
Originally Posted by nonarkitten View Post
These were dropped because ARM64 still uses 32-bit opcodes and the number of registers needed more bits.
That might be a reason, but the main one is that predicates are a pain for OoO and at high frequencies the barrel shifter is no longer 'free'.


Quote:
Originally Posted by nonarkitten View Post
The original 68000 has no byte-to-long sign-extend, but sure. On the 68020 and higher, this would "only" be four.
On the 68020 and higher, it can be only three if we use the bitfields to perform the sign extension.
Anyway, as i said, we don't need to always sign extend. Rarely, actually. We're not on some puny RISC cpu which can only perform 32-bit computations and is therefore forced to extend everything.
Besides, most bytes are actually unsigned.


Quote:
Originally Posted by nonarkitten View Post
Three, ldr, add, then str. This is a little contrived, you usually don't march through your data in passes doing one thing at a time.
But alas, this is not enough. It's not 3, it's 4 so a little more contrived.
You forgot that there are two 32-bit values to load with ldr (the address and the data).


Quote:
Originally Posted by nonarkitten View Post
On the 68K, RAM was "fast enough" to not care, so optimization tricks like using tables, were a common thing. At 1GHz, none of those tricks work anymore because a cache miss can cost dozens of clock cycles. So it's better to compute on the fly and keep things in registers. 68K style of assembly language requires that the CPU be slower than RAM and even by the 68060, that was breaking down.
But the 68k does not need to go more often in memory than the ARM - actually quite less, and not only because we (usually) have one more register to use.
We can use tricks to reduce the register pressure, like using the high part of a register to hold different data - something that requires the ability to perform byte or word operations without touching the rest of the register. Consider, for example, two flags and a 16-bit loop counter all in same register.


Quote:
Originally Posted by nonarkitten View Post
Because you don't have infinite memory bandwidth?
We don't have infinite number of registers either and top of stack is nice candidate for L1 cache.


Quote:
Originally Posted by nonarkitten View Post
Yes ARM supports a "proper stack," don't be stupid.
Oh yeah ? So how many instructions to match the 68k's simple JSR through the stack ?


Quote:
Originally Posted by nonarkitten View Post
And using PC in a regular register has unlimited potential for abuse that's so cool.
Yeah, it's super cool, especially when you have to do a branch predictor.
Of course you also have one register less for use in regular programs.


Quote:
Originally Posted by nonarkitten View Post
Things you could never do on 68K.
Hopefully !


Quote:
Originally Posted by nonarkitten View Post
But not the stack. By the time you're in the interrupt, you're already dozens of cycles behind the ARM.
Well, you're right for small interrupts.
But if the code becomes complex enough, then ARM too will be forced to use the stack, destroying the benefit.


Quote:
Originally Posted by nonarkitten View Post
That's silly. Why would you worry about touching registers? You're ADDQ is still performing the LOAD, ADD and STORE operations, you're just not aware of what's going on with the microcode.
Microcode internal registers won't trash programmer visible registers. That's the point.
Besides, with a fully pipelined cpu there's no such microcode at all. Your LOAD, ADD and STORE are all done at different stages of the pipeline.


Quote:
Originally Posted by nonarkitten View Post
Again, once processors top around 200MHz, direct memory for everything becomes a serious limitation since RAM cannot keep up anymore.
First, caches exist to handle that.
Second, it's not about direct memory for everything. Memory is still used and when it needs to be, you're happy not having to waste registers for its access.


Quote:
Originally Posted by nonarkitten View Post
Good point, but then you don't get any sort of prediction on the "return."
Ever heard of a return address stack ? Modern or even semi-modern cpus do have prediction on the return.


Quote:
Originally Posted by nonarkitten View Post
It's pretty common in emulation.

On ARM you just omit the 's' flag on the opcode and then all ALU operations don't affect flags. The nice thing is, this works for all ALU operations like MUL and DIV and not just the couple cherry picked ones that some engineer in 1976 though would be useful. Saving and restoring are two cycles too many for me.
In emulation you normally don't use the host's flags register to hold flags permanently ; as soon as you have to perform a test of any sort, your emulated flags get trashed...
Now consider this situation :
Code:
.loop
 addx.l -(a0),-(a1)
 dbf d0,.loop
Now you need the carry from the addx for next iteration but if you don't have a proper loop instruction then you sub #1 and loop with that... oh wait, your sub needs the Z flag and will trash the carry if you allow it to change flags...
This happens in emulation as well: not all instructions that touch the flags change them all (actually, on 68k most leave the X bit alone). Consider simple btst, which only touches Z. ARM will not leave you the choice : all flags are altered, or none are.
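The flag-preservation problem described above is easy to see in a C sketch of emulating ADDX: the emulated X bit has to live in a variable precisely because the host's own condition codes get clobbered by the surrounding loop bookkeeping (helper name is mine):

```c
#include <stdint.h>

/* Emulating 68k ADDX.L: add with the saved X (extend) flag, then
   record the carry-out as the next X. The flag lives in a plain
   variable because the host's condition codes don't survive the
   loop's own compare/branch. Helper name is illustrative. */
static uint32_t addx32(uint32_t a, uint32_t b, unsigned *x_flag)
{
    uint64_t sum = (uint64_t)a + b + *x_flag;
    *x_flag = (unsigned)(sum >> 32);    /* carry out of bit 31 */
    return (uint32_t)sum;
}
```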


Quote:
Originally Posted by nonarkitten View Post
LOL. These are intermediate representations and both will always get JITed into machine code. Many of those stack operations get eliminated.
Yet some stack operations remain and you'd better use a cpu that has a proper stack.
Also JIT isn't really clever, it has fewer options than a regular compiler.


Quote:
Originally Posted by nonarkitten View Post
ARM has bitfields too.
Then show me the equivalent of bfins d2,(a0){d0:d1}. Or bfset (a0){d0:1}.


Quote:
Originally Posted by nonarkitten View Post
ARM has better code density than 68K.
Certainly not. You can always cherry-pick a single example but it won't prove anything. It's only with real examples of say 20-40 instructions that we can start to see something.


Quote:
Originally Posted by nonarkitten View Post
Your contrived example is far worse than mine. I see this kind of pattern all the time in compiled code on ARM and use it in PJIT. It's great. I love conditional everything.
We can have conditional execution too, with simple macros and a cpu implementation that fuses the condition and the instruction.


Quote:
Originally Posted by nonarkitten View Post
I love that every load can also be a sign or zero extend.
I see the value of this, but tell that to Gunnar who said mvz/mvs were useless and refused to add them...


Quote:
Originally Posted by nonarkitten View Post
I love that I can take huge steps when indexing. It's great for structs. But you're an ASM coder, you don't think in "structs."
Huge steps when indexing ? You're not indexing.
Indexing is something like move.b (a0,d0.w),(a1,d1.w) or add.w (a0,d0.w),d1 and ARM can NOT do this.


Quote:
Originally Posted by nonarkitten View Post
But a single RMW for a RAM variable? Unless that's ALL you're going to do with that variable, it would be a lot more efficient to have separate LOAD/ADD/STORE steps. Not that I've ever had to have an interrupt just to count one number. That's what timers are for.

Your "everything in memory" model doesn't work with modern hardware, where CPUs are several dozen times faster than even the fastest RAM. Caching helps, but expecting it to save your bacon is poor programming design. And even on the 68000, loading stuff into registers to do a lot of work is still going to be faster than munching through RAM all the time.
Have you noticed that we don't have enough registers to hold everything a program needs and that we sometimes need to use memory ?
So when it happens, be happy to have operations that don't require using a register because in many situations you don't have any that's free and you have to save and restore one !


Quote:
Originally Posted by nonarkitten View Post
Every RMW is going to eat cycles, and ADDQ.L to a register is always going to be faster than ADDQ.L to RAM.
Invalid point. ADDQ.L to a register assumes we have a register available, and an RMW eats fewer cycles than a separate load+op+store.
meynaf is offline  
Old 13 September 2022, 11:00   #969
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
The sign-extension on loading indeed is not necessary in the 080 because the ext-instruction is only another word-sized instruction that will be fused with the move. That means the combination of the move and the ext is treated exactly like a sign-extending ldr instruction, it will be scheduled to the ALU as a single instruction and it will execute in a single cycle and even in parallel with another instruction scheduled to the other pipeline.

With regard to ARM, I liked coding ARM in assembly language and I didn't have much difficulty with it. It looked to me much like a more orthogonal 68k CPU. There is no distinction between A and D registers and there are no exceptions like EOR and some other stuff. I also like 3-operand code, even though CPUs nowadays don't gain much (if anything) from it, because 2-operand instructions and the extra move instructions usually can be fused into a single operation on a 3-operand ALU. It thus can be argued that 3-operand code means worse code density without producing faster code.

I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them. I never worked much on memory even on 68k, in speed-critical code you usually burst-load data, process it and then store it. For non-speed-critical code it just doesn't matter much whether you get worse code density. We now have plenty of RAM and storage.

I agree that predication is a concept on ARM that at first sight appears to be a great feature but at second sight isn't that attractive any more. It's four bits wasted in each instruction and it gets used rarely. For blocks of instructions a branch is usually better, even on ARM. So yes, code density suffers on ARM as much as on most RISCs, but the predictable size of instructions is a much more important advantage for building a high-speed processor than code density is. I think the most popular compromise for recent CPU architectures is to mix 16-bit and 32-bit instructions, which still keeps the instruction decoders simple (important for highly superscalar CPUs) but gives much improved overall code density.

Last edited by grond; 13 September 2022 at 11:14.
grond is offline  
Old 13 September 2022, 12:14   #970
dreadnought
Registered User
 
Join Date: Dec 2019
Location: Ur, Atlantis
Posts: 1,899
Whoa! For those who follow the "how many PgDwns to scroll a post" rankings, Meynaf has just come up with a world-beating 7-presser! That is truly impressive and leaves the previous contenders (TR: 4x, BA: 3x, ='_'=: 4x, though that was separate posts) waaaay behind.

It will take some doing, but can anyone beat this?
dreadnought is online now  
Old 13 September 2022, 12:21   #971
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by dreadnought View Post
Whoa! For those who follow the "how many PgDwns to scroll a post" rankings, Meynaf has just came up with a world-beating 7-presser! That is truly impressive and leaves the previous contenders (TR: 4x, BA: 3x, ='_'=: 4x though that was separate posts) waaaay behind.

It will take some doing, but can anyone beat this?
This is what you get when you counter every argument instead of picking what suits you and ignoring the rest.
meynaf is offline  
Old 13 September 2022, 12:33   #972
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by grond View Post
I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them.
Indeed, and it is not even a property of the CPU in the first place. One could design one's own assembler... A very nice assembler language I remember is that of the AD Blackfin processors (an embedded processor for signal-processing applications), where a "move" simply reads


Code:
A1 = A0

...and nobody stops you from creating a similar assembler for 68K or ARM such that


Code:
add.l 4(a0),d0

becomes something like


Code:
d0 += [a0+4]
Thomas Richter is offline  
Old 13 September 2022, 12:35   #973
Promilus
Registered User
 
Join Date: Sep 2013
Location: Poland
Posts: 807
It's funny to see how (again) the discussion went from Apollo products (or design, or features) to asm vs C and 68k vs ARM. What I'd like to add on that particular topic: it doesn't matter if C is memory hungry. It doesn't even matter if it wastes clock cycles. If you can bring new software to the Amiga, it is good. And IIRC VanillaConquer is written in C, and so is DevilutionX... So everyone can argue which approach is best in their opinion, but the one thing which remains is "put your money where your mouth is". ASM might be fun, fast and efficient, but it is irrelevant when there's hardly any NEW software written in it, even in the last corner where it is still widely used.
Promilus is offline  
Old 13 September 2022, 12:43   #974
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
The sign-extension on loading indeed is not necessary in the 080 because the ext-instruction is only another word-sized instruction that will be fused with the move. That means the combination of the move and the ext is treated exactly like a sign-extending ldr instruction, it will be scheduled to the ALU as a single instruction and it will execute in a single cycle and even in parallel with another instruction scheduled to the other pipeline.
I agree with that, but there are exceptions for which the extension (as unsigned) doesn't work this way.
In all cases, not having it still damages code density and programming flexibility, which are my main concerns.


Quote:
Originally Posted by grond View Post
With regard to ARM, I liked coding ARM in assembly language and I didn't have much difficulties with it. It looked to me much like a more orthogonal 68k CPU. There is no distinction between A and D registers and there are no exceptions like EOR and some other stuff.
On Thumb2 there is a distinction between r0-r7 and r8-r15, so the situation isn't exactly that good. You have weak registers whose usage hurts code density.
Having A and D registers is a nice feature, as they don't behave the same and both behaviours have their use. Consider add.w d0,d1 vs adda.w d0,a1: one touches the CCR and leaves the high part alone, the other leaves the CCR alone and provides automatic sign extension.
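For readers less fluent in 68k, a rough (untested) sketch of the two behaviours being contrasted here:

```
move.l  #$12345678,d1
add.w   d0,d1           ; only bits 15-0 of d1 change; CCR is updated

movea.l #$12345678,a1
adda.w  d0,a1           ; d0.w is sign-extended to 32 bits and added
                        ; to all of a1; CCR is left untouched
```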
Funny that you mention the EOR exception: 68k indeed can't do EOR from memory, but ARM can't either! You can't possibly see that as an advantage.
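For reference, the 68k encoding only provides the register-to-memory direction for EOR, so the other direction needs an extra load. A sketch of the asymmetry (illustrative, not from any particular program):

```
eor.l   d0,(a0)         ; encodable: EOR Dn,<ea>
; eor.l (a0),d0         ; NOT encodable on 68k
move.l  (a0),d1         ; workaround: load the operand first...
eor.l   d1,d0           ; ...then EOR register-to-register
```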


Quote:
Originally Posted by grond View Post
I also like 3-operand code even though CPUs nowadays don't gain much (if anything) from 3-operand code because 2-operand instructions and the extra move-instructions usually can be fused into a single operation on a 3-operand ALU. It thus can be argued that 3-operand code means worse code density without producing faster code.
This is what I said: it's not helpful to have it.


Quote:
Originally Posted by grond View Post
I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them. I never worked much on memory even on 68k, in speed-critical code you usually burst-load data, process it and then store it. For non-speed-critical code it just doesn't matter much whether you get worse code density. We now have plenty of RAM and storage.
Non-speed-critical code contains most of the bulk, and most of the bugs too, so even if code density does not really matter there, programming flexibility does.


Quote:
Originally Posted by grond View Post
I agree that predication is a concept on ARM that at first sight appears a great feature but at second sight isn't that attractive any more. It's four bits wasted in each instruction and gets used rarely. For blocks of instructions a branch is usually better even on ARM. So yes, code density suffers on ARM as much as on most RISCs but the predictable size of instructions is a much more important advantage for creating a high-speed processor than code density is. I think the most popular compromise for recent CPU architectures is to mix 16bit and 32bit instructions which still helps keep the instruction decoders simple (important for highly superscalar CPUs) but gives overall much improved code density.
Well, today the instruction set of a CPU doesn't matter nearly as much as the implementation does, if at all.
meynaf is offline  
Old 13 September 2022, 13:18   #975
Leon Besson
Banned
 
Leon Besson's Avatar
 
Join Date: Feb 2022
Location: Anywhere and everywhere I have a contract
Posts: 822
Well, here is the deal with V2 licensing, Bromigos! Just had this sent to me by email.

License for V2 cards:
90% of all V2 cards are licensed and can be updated without any trouble. If your card is already licensed, the following information is not relevant for you.
This is required only for previously NOT licensed V2 accelerator cards.

Licensing your V2 card will allow you to benefit from the new core updates, to use all the new games and to take advantage of all the great new features.
Unlicensed cards will only have a black and white screen after core update 2.16 and higher.

==> Did you buy your card from a reseller? Then please contact them.
==> If you want to purchase your Apollo 68080 core license in the Apollo computer shop:
Buy your V2 license
SPECIAL DISCOUNT PRICE of 50 € UNTIL THE END OF SEPTEMBER
(Normal price 100 €)

How does the licensing work:
You pay the license fee in the shop.
You read the serial number of your V2 and email us the number
==> type in the CLI: VControl SN
We will send you a personalized license sticker.
We will update the serial number in the core update asap.
From then on you can use all core updates without any problems.
Leon Besson is offline  
Old 13 September 2022, 13:28   #976
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
I agree with that, but there are exceptions for which the extension (as unsigned) doesn't work this way.
In all cases, not having it still damages code density and programming flexibility, which are my main concerns.
Neither code density nor programming flexibility are any of my concerns which is why I come to a different conclusion about processor architectures than you do. I want a CPU to be able to do as much useful work in a given time as possible.

I have never run out of memory for code. Yes, I tried to squeeze loops into the 020/030's tiny instruction cache for optimal execution but it's not like I see this as a common usecase for which an ISA should be defined. With cache sizes common in this century I don't see much problems in having lower code density (at least not if code density isn't lower by factors) if the instruction decoder gets so much simpler by having reliable instruction boundaries.


Quote:
On Thumb2 there is a distinction between r0-r7 and r8-r15, so the situation isn't exactly that good. You have weak registers whose usage damages code density.
Well, that is if you choose to code Thumb2. Actually having these two sets of instructions clearly is a nice feature. But yes, one addresses a shortcoming of the other by introducing other shortcomings. You can't have all at the same time, not even in 68k.


Quote:
Having A and D registers is a nice feature as they don't behave the same and both ways have their use.
I don't think the fixed separation is a nice feature. I like that on ARM registers can be used both for address calculation and data handling.



Quote:
Consider add.w d0,d1 vs adda.w d0,a1 - one touches ccr and leaves high part alone, the other leaves ccr alone and provides automatic extension.

Yes, but you can't choose which one does, it's implied by the registers you use. ARM can do the extension and you can specify for each instruction whether it should modify the flags or not.


Quote:
Funny that you mention the EOR exception, 68k indeed can't do EOR from mem, but ARM also can't ! You can't possibly see that as an advantage.
I quoted EOR as an example because in ARM all instructions work the same while on 68k you have to remember some irrational exceptions.


Quote:
Non-speed critical code contains most of the bulk and most of the bugs also, so even if code density does not really matter, programming flexibility does.
Yes, and I guess most people use high-level languages for the non-speed-critical code for this exact reason.


Quote:
Well, today the instruction set of CPUs doesn't matter as much as it did for the implementation, if at all.
Oh, the instruction set still matters for instruction decode. On x86 you never know where instruction boundaries are which makes it difficult to decode instructions in the instruction stream ahead of where you are and needs extra invisible bits in the 1st level instruction cache.
grond is offline  
Old 13 September 2022, 14:10   #977
lmimmfn
Registered User
 
Join Date: May 2018
Location: Ireland
Posts: 672
Quote:
Originally Posted by Leon Besson View Post
Well here is the deal with V2 licensing Bromigos! Just had this sent to me on Email.

license V2 cards: [...]
Tbh, that's quite fair, €50 (for the end user) being better than the €100; not sure how resellers will handle it, though.
lmimmfn is offline  
Old 13 September 2022, 14:19   #978
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
Neither code density nor programming flexibility are any of my concerns which is why I come to a different conclusion about processor architectures than you do.
Seems you're perfectly aligned with Gunnar on that aspect.
But as he refused to provide some kind of nice programming tools, coders aren't exactly beating down his door.


Quote:
Originally Posted by grond View Post
I want a CPU to be able to do as much useful work in a given time as possible.
If this is your only concern, I'm afraid there are better choices for you than 68k.


Quote:
Originally Posted by grond View Post
I have never run out of memory for code. Yes, I tried to squeeze loops into the 020/030's tiny instruction cache for optimal execution but it's not like I see this as a common usecase for which an ISA should be defined. With cache sizes common in this century I don't see much problems in having lower code density (at least not if code density isn't lower by factors) if the instruction decoder gets so much simpler by having reliable instruction boundaries.
The problem of the instruction decoder is that it can only receive a fixed amount of data per clock, and that's not very much. The more instructions that fit in this data, the more instructions you can decode at once.
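As a concrete (simplified) illustration: assume a hypothetical fetch window of 16 bytes per clock. The densest 68k encodings pack many instructions into that window, while long forms pack far fewer:

```
moveq   #1,d0           ; 2 bytes: eight such fit in a 16-byte window
addq.l  #1,d1           ; 2 bytes
move.l  d0,(a0)+        ; 2 bytes
move.l  #$12345678,d2   ; 6 bytes: long immediates eat the window fast
```

With fixed 32-bit instructions, exactly four fit per window, no more and no less; that wastes density but makes finding the instruction boundaries trivial, which is grond's point below.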


Quote:
Originally Posted by grond View Post
Well, that is if you choose to code Thumb2. Actually having these two sets of instructions clearly is a nice feature. But yes, one addresses a shortcoming of the other by introducing other shortcomings. You can't have all at the same time, not even in 68k.
On 68k you really have all of it at the same time: 16 registers you can use and short code.


Quote:
Originally Posted by grond View Post
I don't think the fixed separation is a nice feature. I like that on ARM registers can be used both for address calculation and data handling.
Do you need to multiply or divide pointers? Sorry, data and pointers are two different things. Even C compilers know this.


Quote:
Originally Posted by grond View Post
Yes, but you can't choose which one does, it's implied by the registers you use. ARM can do the extension and you can specify for each instruction whether it should modify the flags or not.
No, ARM can't do automatic extension with an ADD instruction... remember, it does not support non-longword operations.
Anyway, specifying whether an instruction modifies the flags or not costs 1 extra bit per instruction. I really prefer the D/A split, which comes at no encoding cost.


Quote:
Originally Posted by grond View Post
I quoted EOR as an example because in ARM all instructions work the same while on 68k you have to remember some irrational exceptions.
Exceptions aren't exactly irrational. EOR was considered a rare operation, and once you know that rare operations have less flexibility than common ones, there is no problem anymore.


Quote:
Originally Posted by grond View Post
Yes, and I guess most people use high-level languages for the non-speed critical code for this exact reason
Right, but this leaves these people not trained enough in asm to be able to see the strengths and weaknesses of the different CPU families.
So if they say "cpu xyz has better asm than 68k", they are speaking about something they do not know. Write whole programs of significant size in asm first; then we'll talk.


Quote:
Originally Posted by grond View Post
Oh, the instruction set still matters for instruction decode. On x86 you never know where instruction boundaries are which makes it difficult to decode instructions in the instruction stream ahead of where you are and needs extra invisible bits in the 1st level instruction cache.
It seems x86 still manages pretty well, doesn't it?
meynaf is offline  
Old 13 September 2022, 15:33   #979
malko
Ex nihilo nihil
 
malko's Avatar
 
Join Date: Oct 2017
Location: CH
Posts: 4,856
Quote:
Originally Posted by grond View Post
[...] Yes, and I guess most people use high-level languages for the non-speed critical code for this exact reason [...]
And I guess that more and more people use HLLs for "every kind of code" because the hardware, with its current speed, makes them think that nothing is speed-critical.
In this thread alone, I have lost count of how many times the "with today's speed, bla bla bla, it doesn't matter" argument was given...
malko is offline  
Old 13 September 2022, 15:45   #980
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
If this is your only concern, i'm afraid there are better choices for you than 68k.
Yes, ARM, for example.

I was referring to how I assess CPU architectures.


Quote:
The problem of the instruction decoder is that it can only receive a fixed amount of data per clock, and that's not very much. The more instructions that fit in this data, the more instructions you can decode at once.
Well, with 32bit instructions you can easily load eight instructions in parallel and decode all of them because you know exactly

a) there are eight instructions
b) where each instruction is located

When you try to decode 32 bytes of x86 instruction data, you don't know any of this. 68k is somewhere in the middle.


Quote:
Do you need to multiply or divide pointers ? Sorry, data and pointers are two different things. Even C compilers know this.
I don't need to do that, but I more often need a ninth data register than an eighth address register. Being limited to a certain set of operations for each type of register is the limitation.


Quote:
No, ARM can't do automatic extension with an ADD instruction... remember, it does not support non-longword operations.
Where is the problem? You extend all data upon loading it. Then you process it and store either byte, word or long.
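The load-extend-store pattern described here looks roughly like this on ARM (assuming ARMv4 or later, which added the signed halfword loads; a sketch only):

```
ldrsh   r0, [r1]        ; load a 16-bit value, sign-extended to 32 bits
add     r0, r0, r2      ; all arithmetic happens at 32-bit width
strh    r0, [r1]        ; store only the low 16 bits back
```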


Quote:
Anyway, specifying if you modify the flags or not has a cost of 1 extra bit per instruction. I really prefer the D/A split which comes at no encoding cost.
Well, that's your result of weighing the pros and cons, I come to a different conclusion.


Quote:
It seems x86 still manages pretty well, doesn't it ?
Sure, but it needs more engineering hours to make it manage that well when compared to other architectures. Or do you think 68k wouldn't be faster than x86 if the same effort was put into the architecture?

Last edited by grond; 13 September 2022 at 16:40.
grond is offline  
 

