English Amiga Board



 
 
Old 12 September 2022, 21:22   #961
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by nonarkitten View Post
Wow. Umm, #include <stdint.h>

I mean, did you stop learning C sometime before 1999? Because that's been in the C99 standard since then.
Retrocomputing
Quote:
...is the use of older computer hardware and software in modern times. Retrocomputing is usually classed as a hobby and recreation rather than a practical application of technology; enthusiasts often collect rare and valuable hardware and software for sentimental reasons. However, some do make use of it.
Am I 'daft' for wanting to stay in the retrocomputing spirit by continuing to use the C compiler I paid a lot of money for back in the 90's?

Quote:
Also, gcc has had overflow detection for a very long time. At least since GCC6, but I'm not sure exactly.
It's a compiler-specific extension, apparently introduced in GCC 5 (unconfirmed):

6.56 Built-in Functions to Perform Arithmetic with Overflow Checking
Quote:
The following built-in functions allow performing simple arithmetic operations together with checking whether the operations overflowed.

Built-in Function: bool __builtin_add_overflow (type1 a, type2 b, type3 *res)
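For reference, a minimal use of that built-in looks like this (GCC/clang only, not SAS/C; `checked_add` is just an illustrative wrapper name):

```c
#include <stdbool.h>

/* GCC/clang-specific built-in: returns true when the signed addition
   overflowed, otherwise stores the sum in *res (assumes 32-bit int).
   checked_add is an illustrative wrapper name, not a standard API. */
static bool checked_add(int a, int b, int *res)
{
    return __builtin_add_overflow(a, b, res);
}
```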
GCC 6.8 has recently been ported to the Amiga, but
Quote:
Originally Posted by Olaf Barthel View Post
For the cc1 command this means that some 11 MBytes of code are processed in a loop which reads chunks of 1024 bytes each and copies these, about 11,200 times...

The real kicker is what happens with the relocation information which is applied to the code hunk. For cc1 more than 89,840 individual relocations need to be performed and the format in which these instructions are stored in the load file is not geared towards efficiency.
It's a behemoth that I bet would be very slow on my A1200 if it worked at all. Might be better with a Vampire.

Due to that and other issues with GCC I'm not touching it. Instead I port the code to SASC. I fully admit to not being an expert on GCC or C in general, so this often takes a while (I'm getting better at it...). This is worth doing because I have fun doing it, and it's in the retrocomputing spirit. Who knows, perhaps one day I will port some of my own C code to the Amiga, and of course release the sources for anyone else to use.

Quote:
"Serious" lol. Yeeeeah, it was totally Intel that made Motorola do that. LOL.
No, I wasn't being serious.

But there was a rivalry between Intel and Motorola with each trying to outdo the other - for a while. At some point (040?) Motorola probably realized they didn't have the chip-making skills to keep up, so they started removing features instead. Intel just kept making chips with more and more transistors, and putting bigger and bigger heatsinks with huge fans on them. This strategy worked. Cutting the CPU down to work with fewer transistors didn't.

Imagine if Motorola had followed the same path as Intel. Assuming they were able to upgrade their foundries similarly, we might now have a successor to the 68030 similar to the 68080 but at GHz speeds. Apple might not have gone PPC and then back to Intel, making Macs more popular and 68k still relevant. The Amiga would ride on those coattails instead of heading down the dead end of PPC.

Quote:
Having 64-bit multiply has NOTHING TO DO WITH DISK LIMITS.
So the error in HDToolbox has nothing to do with integer overflow?

Quote:
64-bit multiply takes massive amounts of silicon.
And yet Motorola had no problem squeezing it into the 68020 in 1984.
Bruce Abbott is offline  
Old 12 September 2022, 21:37   #962
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by meynaf View Post
If one requires training in C because mistakes are so easy to make otherwise, then C isn't exactly as programmer-friendly as it's supposed to be...
There's your mistake. C is actually designed to separate elite programmers from the daft among us.

Just go to Stack Overflow and you will see what I mean. It's full of condescending elites who berate any beginner brave enough to ask a question. Can be entertaining when they fight amongst themselves though...
Bruce Abbott is offline  
Old 13 September 2022, 03:05   #963
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Bruce Abbott View Post
<snip>
How to check overflow of an integer in C.

Option 1. This is by far the most common one: compare the result to your operands. If the result of an addition is less than either of the operands, then an overflow has occurred. Most decent compilers will compile this into a simple flag check.
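For the unsigned case, Option 1 can be sketched in portable C (the helper name is mine, not from the post):

```c
#include <stdint.h>
#include <stdbool.h>

/* Option 1 for unsigned 32-bit addition: unsigned arithmetic wraps
   modulo 2^32, so the sum overflowed exactly when it came out smaller
   than one of the operands. Helper name is illustrative. */
static bool add_overflows_u32(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;
    return *sum < a;   /* true iff a carry out of bit 31 occurred */
}
```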

Option 2. Use multi-precision arithmetic. So if you want a 32x32 => 64-bit MULU on hardware that doesn't have it, break it into four 16x16 => 32-bit MULU's and add them together. My first real programming job was writing a 64-bit library for the 68332. It's not hard.
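Option 2, sketched for the 32x32 => 64-bit case in plain C (the helper name and the two-32-bit-halves interface are my own illustration):

```c
#include <stdint.h>

/* 32x32 => 64-bit unsigned multiply built from four 16x16 => 32-bit
   multiplies, as the post describes. The result comes back as two
   32-bit halves. Helper name is illustrative. */
static void mulu64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* covers bits  0..31 */
    uint32_t lh = al * bh;   /* covers bits 16..47 */
    uint32_t hl = ah * bl;   /* covers bits 16..47 */
    uint32_t hh = ah * bh;   /* covers bits 32..63 */

    /* sum the two middle partial products, tracking the carry out */
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;   /* carry into bit 48 */

    uint32_t low = ll + (mid << 16);
    uint32_t low_carry = (low < ll) ? 1u : 0u;

    *lo = low;
    *hi = hh + (mid >> 16) + (mid_carry << 16) + low_carry;
}
```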

Option 3. Define a small assembly routine (or even inline) which does this and then you can call it from C. Those built-ins from GCC could be reimplemented for SAS/C in an afternoon.
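For Option 3 on a compiler without the built-ins, the signed check itself can even stay in portable C, no assembly required for the common case (helper name is mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Plain-C stand-in for __builtin_add_overflow on int32_t: overflow
   happened iff both operands have the same sign and the wrapped result
   has the opposite one. The unsigned addition avoids signed-overflow
   undefined behaviour. Helper name is illustrative. */
static bool sadd_overflows(int32_t a, int32_t b, int32_t *res)
{
    uint32_t r = (uint32_t)a + (uint32_t)b;
    *res = (int32_t)r;
    return ((((uint32_t)a ^ r) & ((uint32_t)b ^ r)) >> 31) != 0;
}
```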
nonarkitten is offline  
Old 13 September 2022, 05:47   #964
hammer
Registered User

 
Join Date: Aug 2020
Location: Australia
Posts: 663
Quote:
Originally Posted by meynaf View Post
Well, Windows 3.1 apps won't run on Windows 11 either, at least not without some kind of emulation...
The end user can run 16-bit Windows apps on Windows 11 via the open-source WineVDM/OTVDM.

---

The leaked NTVDMx64 is based on Microsoft's original NTVDM, so dropping 16-bit Windows support from 64-bit Windows was Microsoft's own decision.
More info from https://github.com/leecher1337/ntvdmx64

The HAXM version doesn't emulate the CPU; it uses HAXM's VT-x hardware acceleration (so the CPU needs to support VT-x).
hammer is offline  
Old 13 September 2022, 06:23   #965
hammer
Registered User

 
Join Date: Aug 2020
Location: Australia
Posts: 663
Quote:
Originally Posted by nonarkitten View Post
For example, "AMMX MIPS". WTF are those? So the V4 is 13 times faster than the 68060 is at AMMX? Really? I didn't even know the 68060 had AMMX, must be an undocumented feature.

Oh, you mean he's comparing some "arbitrary algorithm" in AMMX to the same routine in pure 68K assembly and it's about 13 times faster. Talk about cherry-picked nonsense.

On RiVA, a hyper-specific use-case where AMMX really makes a difference, there was a 100-150% speed up. Is that huge? Sure, I think so. Is it THIRTEEN TIMES FASTER? No. Hell no. And you'd have to be a moron to even claim that.
Using CoffinOS R58's MPEG video test files, RiVA 68K and FroggerNG (from my A1200 rev1D1/TF1260/AOS3.2 install) run pretty well on my PiStorm/RPi 3a/Emu68/A500 rev6A.


RiVA AMMX (AC68EC080) requires a separate build from RIVA 68K while PiStorm/RPi 3a/Emu68 (like Transmeta Code Morphing Software method, cite ref 1) path improves the legacy RIVA 68K's playback performance.


Depending on the CPU, CoffinOS R58's startup swaps between RIVA AMMX and RIVA 68K executables.


Transmeta Code Morphing Software (CMS) includes JIT with instruction reorder on VLIW micro-architectured CPU. The compiled code is stored in a "translation cache". CMS does not try retranslating the region in which the interrupt occurs. CMS has techniques for handling self-modifying code.


Reference
1. https://www.cs.cornell.edu/courses/c...log/transmeta/
hammer is offline  
Old 13 September 2022, 08:29   #966
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by meynaf View Post
This is totally ridiculous.
68K can do direct memory-to-memory moves, it can do arithmetic to/from memory directly too, it can handle three data types in computations and has a complete set of addressing modes.
ARM can load bytes, words and longs too. ARM does ternary arithmetic though, so it needs far fewer pointless moves shuttling around data in registers.

Quote:
Originally Posted by meynaf View Post
So much programming flexibility in comparison to any RISC cpu. ARM is incredibly limited and has a terrible syntax. Not to mention 68k also has better code density.
I like ARM's syntax. It's quite readable. Not sure what issue you have.

Better code density? Than Thumb2? No way.

Quote:
Originally Posted by meynaf View Post
Why would I do this, it's totally useless instruction in real life code.
Which part, loading a signed int? Because this isn't real?
Code:
int8_t flags;
Or maybe it's the 32-byte offset? Because this isn't real? You couldn't have an array of these you might want to iterate over?
Code:
struct { int8_t flags; /* ... */ }; // size is 32 bytes
Or maybe it's the conditional? Because those never happen either, right?
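Putting the three together, this is the kind of loop that produces that load-signed-byte-with-offset-and-conditional pattern (the 32-byte record layout here is a hypothetical illustration):

```c
#include <stdint.h>

/* One 32-byte record per element; stepping through them by the signed
   flags byte is the ldrsb-with-offset pattern under discussion.
   Layout and names are a hypothetical illustration. */
struct rec { int8_t flags; uint8_t pad[31]; };   /* sizeof == 32 */

static int count_negative(const struct rec *r, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (r[i].flags < 0)    /* signed byte load + conditional */
            count++;
    return count;
}
```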

Quote:
Originally Posted by meynaf View Post
Very funny argument, considering predicates have been dropped in aarch64 (along with automatic barrel shifter).
These were dropped because ARM64 still uses 32-bit opcodes and the number of registers needed more bits.

Quote:
Originally Posted by meynaf View Post
Not five instructions, just four (and if Günni listened to me it would be only three).
The original 68000 has no byte-to-long sign-extend, but sure. On the 68020 and higher, this would "only" be four.

Quote:
Originally Posted by meynaf View Post
As said, your one RISC instruction isn't very useful for real life workloads. Now consider simple and much more useful add.l #data,mem on 68k (32-bit constant added on a variable with a linear 32-bit address). Doing that on PPC should require something like 6 instructions. How many on ARM already ?
Three, ldr, add, then str. This is a little contrived, you usually don't march through your data in passes doing one thing at a time.

On the 68K, RAM was "fast enough" not to care, so optimization tricks, like using tables, were common. At 1GHz, none of those tricks work anymore because a cache miss can cost dozens of clock cycles. So it's better to compute on the fly and keep things in registers. The 68K style of assembly language requires that the CPU be slower than RAM, and even by the 68060 that was breaking down.

Quote:
Originally Posted by meynaf View Post
Why would we want to do this ? We HAVE a stack, so why not just use it ?
Because you don't have infinite memory bandwidth?

Quote:
Originally Posted by meynaf View Post
Oh sorry, maybe ARM does not support a proper stack ? Oh wait, i forgot : you wasted a GPR for the program counter so you don't want to reserve another one. I understand.
Yes, ARM supports a "proper stack," don't be stupid. And using the PC as a regular register has unlimited potential for abuse that's so cool. Things you could never do on the 68K.

Quote:
Originally Posted by meynaf View Post
But we can serve an interrupt without touching the registers.
But not the stack. By the time you're in the interrupt, you're already dozens of cycles behind the ARM.

Quote:
Originally Posted by meynaf View Post
The 68k can support direct memory operations.
Oh, wait. ARM can't do ADDQ.W #1,mem ? Poor mite, your so-great cpu can't implement a simple interrupt counter without touching registers.
That's silly. Why would you worry about touching registers? Your ADDQ is still performing the LOAD, ADD and STORE operations; you're just not aware of what's going on in the microcode.

Quote:
Originally Posted by meynaf View Post
And it can't move a memory cell directly to another ?
Again, once processors top out around 200MHz, direct memory for everything becomes a serious limitation since RAM cannot keep up anymore.

Quote:
Originally Posted by meynaf View Post
Also we CAN branch to a subroutine without touching the stack, it's just LEA+JMP. Not that this operation would be a common one, of course.
Good point, but then you don't get any sort of prediction on the "return."

Quote:
Originally Posted by meynaf View Post
Try move ccr to a data register. Or use Scc instruction to keep the condition. Or better, do your computation on an address register. Or don't do it at all, it's not a common operation either.
It's pretty common in emulation.

On ARM you just omit the 's' flag on the opcode and then the ALU operation doesn't affect flags. The nice thing is, this works for all ALU operations like MUL and DIV, and not just the couple of cherry-picked ones that some engineer in 1976 thought would be useful. Saving and restoring are two cycles too many for me.

Quote:
Originally Posted by meynaf View Post
But again, there is nothing wrong in having a stack. Or maybe you're allergic to stacks ? Don't have a look at java bytecode or webassembly then !
LOL. These are intermediate representations and both will always get JITed into machine code. Many of those stack operations get eliminated.

Quote:
Originally Posted by meynaf View Post
Not everything is useful, but we have Bitfields and good luck with your ARM to do the same with fewer instructions !
ARM has bitfields too.

ARM has better code density than 68K.

Your contrived example is far worse than mine. I see this kind of pattern all the time in compiled code on ARM and use it in PJIT. It's great. I love conditional everything. I love that every load can also be a sign or zero extend. I love that I can take huge steps when indexing. It's great for structs. But you're an ASM coder, you don't think in "structs."

But a single RMW for a RAM variable? Unless that's ALL you're going to do with that variable, it would be a lot more efficient to have separate LOAD/ADD/STORE steps. Not that I've ever had to have an interrupt just to count one number. That's what timers are for.

Your "everything in memory" model doesn't work with modern hardware, where CPUs are several dozen times faster than even the fastest RAM. Caching helps, but expecting it to save your bacon is poor programming design. And even on the 68000, loading stuff into registers to do a lot of work is still going to be faster than munching through RAM all the time. Every RMW is going to eat cycles, and ADDQ.L to a register is always going to be faster than ADDQ.L to RAM.
nonarkitten is offline  
Old 13 September 2022, 08:37   #967
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by hammer View Post
RiVA AMMX (AC68EC080) requires a separate build from RIVA 68K while PiStorm/RPi 3a/Emu68 (like Transmeta Code Morphing Software method, cite ref 1) path improves the legacy RIVA 68K's playback performance.
Emu68 does not use Transmeta's method of JIT.
- Emu68 is not a tracing JIT; cache is just flushed if it runs out
- Emu68 does not perform any interpreter passes first
- Emu68 does very little to optimize code
- Emu68 has no 'rollback' nor any need to
- Emu68 handles self-modifying code by checksumming the whole compiled block

Bernie talks about this in the Apollo forums -- too much optimization is usually worse for a JIT than just being a dumb translator. Emu68 doesn't do all that, so its best-case performance actually matches that of the ARM core itself. Transmeta was trying to be too smart here and didn't get that simpler is better.

Last edited by nonarkitten; 13 September 2022 at 08:45.
nonarkitten is offline  
Old 13 September 2022, 10:26   #968
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thomas Richter View Post
Here comes the man with programming experience...
Yep.



Quote:
Originally Posted by Bruce Abbott View Post
There's your mistake. C is actually designed to separate elite programmers from the daft among us.

Just go to Stack Overflow and you will see what I mean. It's full of condescending elites who berate any beginner brave enough to ask a question. Can be entertaining when they fight amongst themselves though...
Indeed, i missed this



Quote:
Originally Posted by nonarkitten View Post
ARM can load bytes, words and longs too.
Recent ARM, yes. Original ARM couldn't do words.
And while it can load/store them, it cannot do direct computations on that data size, and it cannot perform arithmetic on memory operands on the fly. Instead of single, simple RMW instructions, you have load + op + store. In comparison, even a 6502 can increment a memory cell in just one instruction.


Quote:
Originally Posted by nonarkitten View Post
ARM does ternary arithmetic though, so it needs far fewer pointless moves shuttling around data in registers.
The funny part is, it does not make code shorter (single 32-bit vs 2x 16-bit) and it's easy to handle by instruction fusing (IIRC 68080 does that).


Quote:
Originally Posted by nonarkitten View Post
I like ARM's syntax. It's quite readable. Not sure what issue you have.
Frankly, you find ldrsbne r0, [r1, #32]! readable ? Then what would it have to look like to be unreadable ??


Quote:
Originally Posted by nonarkitten View Post
Better code density? Than Thumb2? No way.
Oh yes, it has. And the bigger the program becomes, the bigger the difference is. Ready for a small code contest ?
You seem to forget that 16-bit Thumb opcodes can only access 8 registers while 68k 16-bit opcodes can access them all. And operations they can perform are of course limited, while on 68k they are not.


Quote:
Originally Posted by nonarkitten View Post
Which part, loading a signed int? Because this isn't real?
Code:
int8_t flags;
It is real, but loading an int8 on 68k does not require sign-extending it; we can use the byte directly in computations. Of course you were forced to extend it on the ARM.


Quote:
Originally Posted by nonarkitten View Post
Or maybe it's the 32-byte offset? Because this isn't real? You couldn't have an array of these you might want to iterate over?
Code:
struct { int8_t flags; .... }; // size is 32 bytes
You could do this. Or if you're smart enough, group bytes together in same array so you don't have data alignment issues.


Quote:
Originally Posted by nonarkitten View Post
Or maybe it's the conditional? Because those never happen either, right?
May I assume that if the condition is false, the "32" won't get added/subtracted to the register because the instruction isn't actually executed ?

No, really, it's not the byte, it's not the 32, it's not the conditional. Not individually. It's all of them together that makes an unlikely situation.


Quote:
Originally Posted by nonarkitten View Post
These were dropped because ARM64 still uses 32-bit opcodes and the number of registers needed more bits.
That might be a reason, but the main one is that predicates are a pain for OoO and at high frequencies the barrel shifter is no longer 'free'.


Quote:
Originally Posted by nonarkitten View Post
The original 68000 has no byte-to-long sign-extend, but sure. On the 68020 and higher, this would "only" be four.
On the 68020 and higher, it can be only three if we use the bitfields to perform the sign extension.
Anyway, as i said, we don't need to always sign extend. Rarely, actually. We're not on some puny RISC cpu which can only perform 32-bit computations and is therefore forced to extend everything.
Besides, most bytes are actually unsigned.


Quote:
Originally Posted by nonarkitten View Post
Three, ldr, add, then str. This is a little contrived, you usually don't march through your data in passes doing one thing at a time.
But alas, this is not enough. It's not 3, it's 4 so a little more contrived.
You forgot that there are two 32-bit values to load with ldr (the address and the data).


Quote:
Originally Posted by nonarkitten View Post
On the 68K, RAM was "fast enough" to not care, so optimization tricks like using tables, were a common thing. At 1GHz, none of those tricks work anymore because a cache miss can cost dozens of clock cycles. So it's better to compute on the fly and keep things in registers. 68K style of assembly language requires that the CPU be slower than RAM and even by the 68060, that was breaking down.
But the 68k does not need to go more often in memory than the ARM - actually quite less, and not only because we (usually) have one more register to use.
We can use tricks to reduce the register pressure, like using the high part of a register to hold different data - something that requires the ability to perform byte or word operations without touching the rest of the register. Consider, for example, two flags and a 16-bit loop counter all in same register.


Quote:
Originally Posted by nonarkitten View Post
Because you don't have infinite memory bandwidth?
We don't have infinite number of registers either and top of stack is nice candidate for L1 cache.


Quote:
Originally Posted by nonarkitten View Post
Yes ARM supports a "proper stack," don't be stupid.
Oh yeah ? So how many instructions to match the 68k's simple JSR through the stack ?


Quote:
Originally Posted by nonarkitten View Post
And using PC in a regular register has unlimited potential for abuse that's so cool.
Yeah, it's super cool, especially when you have to do a branch predictor.
Of course you also have one register less for use in regular programs.


Quote:
Originally Posted by nonarkitten View Post
Things you could never do on 68K.
Hopefully !


Quote:
Originally Posted by nonarkitten View Post
But not the stack. By the time you're in the interrupt, you're already dozens of cycles behind the ARM.
Well, you're right for small interrupts.
But if the code becomes complex enough, then ARM too will be forced to use the stack, destroying the benefit.


Quote:
Originally Posted by nonarkitten View Post
That's silly. Why would you worry about touching registers? You're ADDQ is still performing the LOAD, ADD and STORE operations, you're just not aware of what's going on with the microcode.
Microcode internal registers won't trash programmer visible registers. That's the point.
Besides, with a fully pipelined cpu there's no such microcode at all. Your LOAD, ADD and STORE are all done at different stages of the pipeline.


Quote:
Originally Posted by nonarkitten View Post
Again, once processors top around 200MHz, direct memory for everything becomes a serious limitation since RAM cannot keep up anymore.
First, caches exist to handle that.
Second, it's not about direct memory for everything. Memory is still used and when it needs to be, you're happy not having to waste registers for its access.


Quote:
Originally Posted by nonarkitten View Post
Good point, but then you don't get any sort of prediction on the "return."
Ever heard of a return address stack ? Modern or even semi-modern cpus do have prediction on the return.


Quote:
Originally Posted by nonarkitten View Post
It's pretty common in emulation.

On ARM you just omit the 's' flag on the opcode and then all ALU operations don't affect flags. The nice thing is, this works for all ALU operations like MUL and DIV and not just the couple cherry picked ones that some engineer in 1976 though would be useful. Saving and restoring are two cycles too many for me.
In emulation you normally don't use the host's flags register to hold flags permanently ; as soon as you have to perform a test of any sort, your emulated flags get trashed...
Now consider this situation :
Code:
.loop
 addx.l -(a0),-(a1)
 dbf d0,.loop
Now you need the carry from the addx for next iteration but if you don't have a proper loop instruction then you sub #1 and loop with that... oh wait, your sub needs the Z flag and will trash the carry if you allow it to change flags...
This happens in emulation as well: not all instructions that touch the flags change them all (actually, on 68k most leave the X bit alone). Consider simple btst, which only touches Z. ARM will not leave you the choice : all flags are altered, or none are.
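The flag-preservation problem described above is easy to see in a C sketch of emulating ADDX: the emulated X bit has to live in a variable precisely because the host's own condition codes get clobbered by the surrounding loop bookkeeping (helper name is mine):

```c
#include <stdint.h>

/* Emulating 68k ADDX.L: add with the saved X (extend) flag, then
   record the carry-out as the next X. The flag lives in a plain
   variable because the host's condition codes don't survive the
   loop's own compare/branch. Helper name is illustrative. */
static uint32_t addx32(uint32_t a, uint32_t b, unsigned *x_flag)
{
    uint64_t sum = (uint64_t)a + b + *x_flag;
    *x_flag = (unsigned)(sum >> 32);    /* carry out of bit 31 */
    return (uint32_t)sum;
}
```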


Quote:
Originally Posted by nonarkitten View Post
LOL. These are intermediate representations and both will always get JITed into machine code. Many of those stack operations get eliminated.
Yet some stack operations remain and you'd better use a cpu that has a proper stack.
Also JIT isn't really clever, it has fewer options than a regular compiler.


Quote:
Originally Posted by nonarkitten View Post
ARM has bitfields too.
Then show me the equivalent of bfins d2,(a0){d0:d1}. Or bfset (a0){d0:1}.


Quote:
Originally Posted by nonarkitten View Post
ARM has better code density than 68K.
Certainly not. You can always cherry-pick a single example but it won't prove anything. It's only with real examples of say 20-40 instructions that we can start to see something.


Quote:
Originally Posted by nonarkitten View Post
Your contrived example is far worse than mine. I see this kind of pattern all the time in compiled code on ARM and use it in PJIT. It's great. I love conditional everything.
We can have conditional execution too, with simple macros and a cpu implementation that fuses the condition and the instruction.


Quote:
Originally Posted by nonarkitten View Post
I love that every load can also be a sign or zero extend.
I see the value of this, but tell that to Gunnar who said mvz/mvs were useless and refused to add them...


Quote:
Originally Posted by nonarkitten View Post
I love that I can take huge steps when indexing. It's great for structs. But you're an ASM coder, you don't think in "structs."
Huge steps when indexing ? You're not indexing.
Indexing is something like move.b (a0,d0.w),(a1,d1.w) or add.w (a0,d0.w),d1 and ARM can NOT do this.


Quote:
Originally Posted by nonarkitten View Post
But a single RMW for a RAM variable? Unless that's ALL you're going to do with that variable, it would be a lot more efficient to have separate LOAD/ADD/STORE steps. Not that I've ever had to have an interrupt just to count one number. That's what timers are for.

Your "everything in memory" model doesn't work with modern hardware, where CPUs are several dozen times faster than even the fastest RAM. Caching helps, but expecting it to save your bacon is poor programming design. And even on the 68000, loading stuff into registers to do a lot of work is still going to be faster than munching through RAM all the time.
Have you noticed that we don't have enough registers to hold everything a program needs and that we sometimes need to use memory ?
So when it happens, be happy to have operations that don't require using a register because in many situations you don't have any that's free and you have to save and restore one !


Quote:
Originally Posted by nonarkitten View Post
Every RMW is going to eat cycles, and ADDQ.L to a register is always going to be faster than ADDQ.L to RAM.
Invalid point. ADDQ.L to a register assumes we have a register available, and an RMW eats fewer cycles than a separate load+op+store.
meynaf is offline  
Old 13 September 2022, 11:00   #969
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
The sign-extension on loading indeed is not necessary in the 080 because the ext-instruction is only another word-sized instruction that will be fused with the move. That means the combination of the move and the ext is treated exactly like a sign-extending ldr instruction, it will be scheduled to the ALU as a single instruction and it will execute in a single cycle and even in parallel with another instruction scheduled to the other pipeline.

With regard to ARM, I liked coding ARM in assembly language and I didn't have much difficulty with it. It looked to me much like a more orthogonal 68k CPU. There is no distinction between A and D registers and there are no exceptions like EOR and some other stuff. I also like 3-operand code, even though CPUs nowadays don't gain much (if anything) from it, because 2-operand instructions and the extra move instructions usually can be fused into a single operation on a 3-operand ALU. It thus can be argued that 3-operand code means worse code density without producing faster code.

I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them. I never worked much on memory even on 68k, in speed-critical code you usually burst-load data, process it and then store it. For non-speed-critical code it just doesn't matter much whether you get worse code density. We now have plenty of RAM and storage.

I agree that predication is a concept on ARM that at first sight appears to be a great feature but at second sight isn't that attractive any more. It's four bits wasted in each instruction and it gets used rarely. For blocks of instructions a branch is usually better, even on ARM. So yes, code density suffers on ARM as much as on most RISCs, but the predictable size of instructions is a much more important advantage for building a high-speed processor than code density is. I think the most popular compromise for recent CPU architectures is to mix 16-bit and 32-bit instructions, which still keeps the instruction decoders simple (important for highly superscalar CPUs) but gives much improved overall code density.

Last edited by grond; 13 September 2022 at 11:14.
grond is offline  
Old 13 September 2022, 12:14   #970
dreadnought
Registered User
 
Join Date: Dec 2019
Location: Ur, Atlantis
Posts: 1,899
Whoa! For those who follow the "how many PgDwns to scroll a post" rankings, Meynaf has just come up with a world-beating 7-presser! That is truly impressive and leaves the previous contenders (TR: 4x, BA: 3x, ='_'=: 4x, though that was separate posts) waaaay behind.

It will take some doing, but can anyone beat this?
dreadnought is online now  
Old 13 September 2022, 12:21   #971
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by dreadnought View Post
Whoa! For those who follow the "how many PgDwns to scroll a post" rankings, Meynaf has just came up with a world-beating 7-presser! That is truly impressive and leaves the previous contenders (TR: 4x, BA: 3x, ='_'=: 4x though that was separate posts) waaaay behind.

It will take some doing, but can anyone beat this?
This is what you get when you counter every argument instead of picking what suits you and ignoring the rest.
meynaf is offline  
Old 13 September 2022, 12:33   #972
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by grond View Post
I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them.
Indeed, and it is not even a property of the CPU in the first place. One could design one's own assembler... A very nice assembler language I remember is that of the AD Blackfin processors (an embedded processor for signal-processing applications), where a "move" simply reads


Code:
A1 = A0

...and nobody stops you from creating a similar assembler for 68K or ARM such that


Code:
add.l 4(a0),d0

becomes something like


Code:
d0 += [a0+4]
Thomas Richter is offline  
Old 13 September 2022, 12:35   #973
Promilus
Registered User
 
Join Date: Sep 2013
Location: Poland
Posts: 807
It's funny to see how (again) the discussion went from Apollo products (or design, or features) to asm vs C and 68k vs ARM. What I'd like to add on that particular topic: it doesn't matter if C is memory hungry. It doesn't even matter if it wastes clock cycles. If you can bring new software to the Amiga, it is good. And IIRC VanillaConquer is written in C, and so is DevilutionX... So everyone can argue which approach is best in their opinion, but the one thing which remains is "put your money where your mouth is". ASM might be fun, fast and efficient, but it is irrelevant when there's hardly any NEW software written in it, even in the last corner where it is still widely used.
Promilus is offline  
Old 13 September 2022, 12:43   #974
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
The sign-extension on loading indeed is not necessary in the 080 because the ext-instruction is only another word-sized instruction that will be fused with the move. That means the combination of the move and the ext is treated exactly like a sign-extending ldr instruction, it will be scheduled to the ALU as a single instruction and it will execute in a single cycle and even in parallel with another instruction scheduled to the other pipeline.
I agree with that, but there are exceptions for which the extension (as unsigned) doesn't work this way.
In all cases, not having it still damages code density and programming flexibility, which are my main concerns.


Quote:
Originally Posted by grond View Post
With regard to ARM, I liked coding ARM in assembly language and I didn't have much difficulties with it. It looked to me much like a more orthogonal 68k CPU. There is no distinction between A and D registers and there are no exceptions like EOR and some other stuff.
On Thumb2 there is a distinction between r0-r7 and r8-r15, so the situation isn't exactly that good. You have weak registers whose usage hurts code density.
Having A and D registers is a nice feature, as they don't behave the same and both behaviours have their use. Consider add.w d0,d1 vs adda.w d0,a1: one touches the CCR and leaves the high part alone, the other leaves the CCR alone and provides automatic sign extension.
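For readers less fluent in 68k, a rough (untested) sketch of the two behaviours being contrasted here:

```
move.l  #$12345678,d1
add.w   d0,d1           ; only bits 15-0 of d1 change; CCR is updated

movea.l #$12345678,a1
adda.w  d0,a1           ; d0.w is sign-extended to 32 bits and added
                        ; to all of a1; CCR is left untouched
```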
Funny that you mention the EOR exception: 68k indeed can't do EOR from memory, but ARM can't either! You can't possibly see that as an advantage.
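For reference, the 68k encoding only provides the register-to-memory direction for EOR, so the other direction needs an extra load. A sketch of the asymmetry (illustrative, not from any particular program):

```
eor.l   d0,(a0)         ; encodable: EOR Dn,<ea>
; eor.l (a0),d0         ; NOT encodable on 68k
move.l  (a0),d1         ; workaround: load the operand first...
eor.l   d1,d0           ; ...then EOR register-to-register
```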


Quote:
Originally Posted by grond View Post
I also like 3-operand code even though CPUs nowadays don't gain much (if anything) from 3-operand code because 2-operand instructions and the extra move-instructions usually can be fused into a single operation on a 3-operand ALU. It thus can be argued that 3-operand code means worse code density without producing faster code.
This is what I said: it's not helpful to have it.


Quote:
Originally Posted by grond View Post
I never found the ARM mnemonics hard to remember or decipher. It's simply a matter of getting used to them. I never worked much on memory even on 68k, in speed-critical code you usually burst-load data, process it and then store it. For non-speed-critical code it just doesn't matter much whether you get worse code density. We now have plenty of RAM and storage.
Non-speed-critical code contains most of the bulk, and most of the bugs too, so even if code density does not really matter there, programming flexibility does.


Quote:
Originally Posted by grond View Post
I agree that predication is a concept on ARM that at first sight appears a great feature but at second sight isn't that attractive any more. It's four bits wasted in each instruction and gets used rarely. For blocks of instructions a branch is usually better even on ARM. So yes, code density suffers on ARM as much as on most RISCs but the predictable size of instructions is a much more important advantage for creating a high-speed processor than code density is. I think the most popular compromise for recent CPU architectures is to mix 16bit and 32bit instructions which still helps keep the instruction decoders simple (important for highly superscalar CPUs) but gives overall much improved code density.
Well, today the instruction set of a CPU doesn't matter nearly as much as the implementation does, if at all.
meynaf is offline  
Old 13 September 2022, 13:18   #975
Leon Besson
Banned
 
Leon Besson's Avatar
 
Join Date: Feb 2022
Location: Anywhere and everywhere I have a contract
Posts: 822
Well, here is the deal with V2 licensing, Bromigos! Just had this sent to me by email.

License for V2 cards:
90% of all V2 cards are licensed and can be updated without any trouble. If your card is already licensed, the following information is not relevant for you.
This is required only for previously NOT licensed V2 accelerator cards.

Licensing your V2 card will allow you to benefit from the new core updates, to use all the new games and to take advantage of all the great new features.
Unlicensed cards will only have a black and white screen after core update 2.16 and higher.

==> Did you buy your card from a reseller? Then please contact them.
==> If you want to purchase your Apollo 68080 core license in the Apollo computer shop:
Buy your V2 license
SPECIAL DISCOUNT PRICE of 50 € UNTIL THE END OF SEPTEMBER
(Normal price 100 €)

How does the licensing work:
You pay the license fee in the shop.
You read the serial number of your V2 and email us the number
==> type in the CLI: VControl SN
We will send you a personalized license sticker.
We will update the serial number in the core update asap.
From then on you can use all core updates without any problems.
Leon Besson is offline  
Old 13 September 2022, 13:28   #976
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
I agree with that, but there are exceptions for which the extension (as unsigned) doesn't work this way.
In all cases, not having it still damages code density and programming flexibility, which are my main concerns.
Neither code density nor programming flexibility are any of my concerns which is why I come to a different conclusion about processor architectures than you do. I want a CPU to be able to do as much useful work in a given time as possible.

I have never run out of memory for code. Yes, I tried to squeeze loops into the 020/030's tiny instruction cache for optimal execution but it's not like I see this as a common usecase for which an ISA should be defined. With cache sizes common in this century I don't see much problems in having lower code density (at least not if code density isn't lower by factors) if the instruction decoder gets so much simpler by having reliable instruction boundaries.


Quote:
On Thumb2 there is a distinction between r0-r7 and r8-r15, so the situation isn't exactly that good. You have weak registers whose usage damages code density.
Well, that is if you choose to code Thumb2. Actually having these two sets of instructions clearly is a nice feature. But yes, one addresses a shortcoming of the other by introducing other shortcomings. You can't have all at the same time, not even in 68k.


Quote:
Having A and D registers is a nice feature as they don't behave the same and both ways have their use.
I don't think the fixed separation is a nice feature. I like that on ARM registers can be used both for address calculation and data handling.



Quote:
Consider add.w d0,d1 vs adda.w d0,a1 - one touches ccr and leaves high part alone, the other leaves ccr alone and provides automatic extension.

Yes, but you can't choose which one does, it's implied by the registers you use. ARM can do the extension and you can specify for each instruction whether it should modify the flags or not.


Quote:
Funny that you mention the EOR exception, 68k indeed can't do EOR from mem, but ARM also can't ! You can't possibly see that as an advantage.
I quoted EOR as an example because in ARM all instructions work the same while on 68k you have to remember some irrational exceptions.


Quote:
Non-speed critical code contains most of the bulk and most of the bugs also, so even if code density does not really matter, programming flexibility does.
Yes, and I guess most people use high-level languages for the non-speed-critical code for this exact reason.


Quote:
Well, today the instruction set of CPUs doesn't matter as much as it did for the implementation, if at all.
Oh, the instruction set still matters for instruction decode. On x86 you never know where instruction boundaries are which makes it difficult to decode instructions in the instruction stream ahead of where you are and needs extra invisible bits in the 1st level instruction cache.
grond is offline  
Old 13 September 2022, 14:10   #977
lmimmfn
Registered User
 
Join Date: May 2018
Location: Ireland
Posts: 672
Quote:
Originally Posted by Leon Besson View Post
Well here is the deal with V2 licensing Bromigos! Just had this sent to me on Email.

license V2 cards: [...]
Tbh, that's quite fair, €50 (for the end user) being better than the €100; not sure how resellers will handle it, though.
lmimmfn is offline  
Old 13 September 2022, 14:19   #978
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
Neither code density nor programming flexibility are any of my concerns which is why I come to a different conclusion about processor architectures than you do.
Seems you're perfectly aligned with Gunnar on that aspect.
But as he refused to provide some kind of nice programming tools, coders aren't exactly beating down his door.


Quote:
Originally Posted by grond View Post
I want a CPU to be able to do as much useful work in a given time as possible.
If this is your only concern, I'm afraid there are better choices for you than 68k.


Quote:
Originally Posted by grond View Post
I have never run out of memory for code. Yes, I tried to squeeze loops into the 020/030's tiny instruction cache for optimal execution but it's not like I see this as a common usecase for which an ISA should be defined. With cache sizes common in this century I don't see much problems in having lower code density (at least not if code density isn't lower by factors) if the instruction decoder gets so much simpler by having reliable instruction boundaries.
The problem of the instruction decoder is that it can only receive a fixed amount of data per clock, and that's not very much. The more instructions that fit in this data, the more instructions you can decode at once.
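As a concrete (simplified) illustration: assume a hypothetical fetch window of 16 bytes per clock. The densest 68k encodings pack many instructions into that window, while long forms pack far fewer:

```
moveq   #1,d0           ; 2 bytes: eight such fit in a 16-byte window
addq.l  #1,d1           ; 2 bytes
move.l  d0,(a0)+        ; 2 bytes
move.l  #$12345678,d2   ; 6 bytes: long immediates eat the window fast
```

With fixed 32-bit instructions, exactly four fit per window, no more and no less; that wastes density but makes finding the instruction boundaries trivial, which is grond's point below.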


Quote:
Originally Posted by grond View Post
Well, that is if you choose to code Thumb2. Actually having these two sets of instructions clearly is a nice feature. But yes, one addresses a shortcoming of the other by introducing other shortcomings. You can't have all at the same time, not even in 68k.
On 68k you really have all of it at the same time: 16 registers you can use and short code.


Quote:
Originally Posted by grond View Post
I don't think the fixed separation is a nice feature. I like that on ARM registers can be used both for address calculation and data handling.
Do you need to multiply or divide pointers? Sorry, data and pointers are two different things. Even C compilers know this.


Quote:
Originally Posted by grond View Post
Yes, but you can't choose which one does, it's implied by the registers you use. ARM can do the extension and you can specify for each instruction whether it should modify the flags or not.
No, ARM can't do automatic extension with an ADD instruction... remember, it does not support non-longword operations.
Anyway, specifying whether an instruction modifies the flags or not costs 1 extra bit per instruction. I really prefer the D/A split, which comes at no encoding cost.


Quote:
Originally Posted by grond View Post
I quoted EOR as an example because in ARM all instructions work the same while on 68k you have to remember some irrational exceptions.
Exceptions aren't exactly irrational. EOR was considered a rare operation, and once you know that rare operations have less flexibility than common ones, there is no problem anymore.


Quote:
Originally Posted by grond View Post
Yes, and I guess most people use high-level languages for the non-speed critical code for this exact reason
Right, but this leaves these people not trained enough in asm to be able to see the strengths and weaknesses of the different CPU families.
So if they say "cpu xyz has better asm than 68k", they are speaking about something they do not know. Write whole programs of significant size in asm first; then we'll talk.


Quote:
Originally Posted by grond View Post
Oh, the instruction set still matters for instruction decode. On x86 you never know where instruction boundaries are which makes it difficult to decode instructions in the instruction stream ahead of where you are and needs extra invisible bits in the 1st level instruction cache.
It seems x86 still manages pretty well, doesn't it?
meynaf is offline  
Old 13 September 2022, 15:33   #979
malko
Ex nihilo nihil
 
malko's Avatar
 
Join Date: Oct 2017
Location: CH
Posts: 4,856
Quote:
Originally Posted by grond View Post
[...] Yes, and I guess most people use high-level languages for the non-speed critical code for this exact reason [...]
And I guess that more and more people use HLLs for "every kind of code" because the hardware, with its current speed, makes them think that nothing is speed-critical.
In this thread alone, I have lost count of how many times the "with today's speed, bla bla bla, it doesn't matter" argument was given...
malko is offline  
Old 13 September 2022, 15:45   #980
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
If this is your only concern, i'm afraid there are better choices for you than 68k.
Yes, ARM, for example.

I was referring to how I assess CPU architectures.


Quote:
The problem of the instruction decoder is that it can only receive a fixed amount of data per clock, and that's not very much. The more instructions that fit in this data, the more instructions you can decode at once.
Well, with 32bit instructions you can easily load eight instructions in parallel and decode all of them because you know exactly

a) there are eight instructions
b) where each instruction is located

When you try to decode 32 bytes of x86 instruction data, you don't know any of this. 68k is somewhere in the middle.


Quote:
Do you need to multiply or divide pointers ? Sorry, data and pointers are two different things. Even C compilers know this.
I don't need to do that, but I more often need a ninth data register than an eighth address register. Being limited to a certain set of operations for each type of register is the limitation.


Quote:
No, ARM can't do automatic extension with an ADD instruction... remember, it does not support non-longword operations.
Where is the problem? You extend all data upon loading it. Then you process it and store either byte, word or long.
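The load-extend-store pattern described here looks roughly like this on ARM (assuming ARMv4 or later, which added the signed halfword loads; a sketch only):

```
ldrsh   r0, [r1]        ; load a 16-bit value, sign-extended to 32 bits
add     r0, r0, r2      ; all arithmetic happens at 32-bit width
strh    r0, [r1]        ; store only the low 16 bits back
```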


Quote:
Anyway, specifying if you modify the flags or not has a cost of 1 extra bit per instruction. I really prefer the D/A split which comes at no encoding cost.
Well, that's your result of weighing the pros and cons, I come to a different conclusion.


Quote:
It seems x86 still manages pretty well, doesn't it ?
Sure, but it needs more engineering hours to make it manage that well when compared to other architectures. Or do you think 68k wouldn't be faster than x86 if the same effort was put into the architecture?

Last edited by grond; 13 September 2022 at 16:40.
grond is offline  
 

