Any reason to use assembly on the Amiga rather than C? - Page 22

Don_Adan · 09 July 2023, 21:27

Thanks. I dont coding from long time, but after quick check, I dont know why You cleared BSS section matrix, because this is always 0, if this is one shoot program. And clearing working buff via move byte command is not very effective.

alkis · 09 July 2023, 21:49

Quote:

Originally Posted by paraj

@alkis: Not to stoke the fires, and just out of interest, could you also post the comparable C code (just the important loop/function will do). Could be interesting to see what the compiler did better (for learning, not to hate on your code).

Sure, I'll just put the whole thing so it can be compilable.

Code:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char ptype;

int isqrt(int x) {
  int q = 1, r = 0;
  while (q <= x) {
    q <<= 2;
  }
  while (q > 1) {
    int t;
    q >>= 2;
    t = x - r - q;
    r >>= 1;
    if (t >= 0) {
      x = t;
      r += q;
    }
  }
  return r;
}

// #define COMPUTE_SUM

int computePrimes(ptype *table, int n) {
  int ret = 1; // This is our counter. Starts at 1 to account for number 2
  int i;       //  which is prime, even being even...how odd :P
  int sq = isqrt(n + n) >> 1;
  int step;
  int j;
#ifdef COMPUTE_SUM
  unsigned long long sum64 = 2;
  unsigned int sumh = 0, suml = 2, t;
#endif

  for (i = 0; i <= sq; ++i) {
    if (table[i] == 0) {
      // This is a prime
      ++ret;
      step = 2 * i + 3;
#ifdef COMPUTE_SUM
      sum64 += step;
      t = suml;
      suml += step;
      if (suml < t)
        ++sumh;
#endif
      j = i + step;   // we leave the first occurance as zero
      while (j < n) { // in case we want to print them later.
        table[j] = 1;
        j += step;
      }
    }
  }
  --n; // fixes a bug when N+1 is prime and gets counted
  for (; i < n; ++i)
    if (table[i] == 0) {
#ifdef COMPUTE_SUM
      sum64 += 2 * i + 3;
      t = suml;
      suml += 2 * i + 3;
      if (suml < t)
        ++sumh;
#endif
      ++ret;
    }
#ifdef COMPUTE_SUM
  printf("%16llx\n%08x%08x\n", sum64, sumh, suml);
#endif
  return ret;
}

int main(int argc, char **argv) {
  int num, howMuchMem, n, loops, i;
  ptype *table;

  if (argc <= 2) {
    fprintf(stderr,
            "Usage: %s limit [number of times to run] -- will compute primes "
            "up to num\n",
            argv[0]);
    return 5;
  }
  if (argc == 3) {
    loops = atol(argv[2]);
  } else {
    loops = 1;
  }
  num = atol(argv[1]);
  if (num & 1) {
    num += 1;
  }
  num >>= 1;
  howMuchMem = num * sizeof(ptype);
  table = (ptype *)malloc(howMuchMem);
  if (!table) {
    fprintf(stderr, "Out of memory\n");
    return 20;
  }

  for (i = 0; i < loops; ++i) {
    // printf("Zeroing memory...\n");
    memset(table, 0, howMuchMem);
    // printf("Computing primes...\n");
    n = computePrimes(table, num);
  }
  printf("Counted %d primes up to %d. (Run %d times)\n", n, num * 2, loops);
  free(table);
  return 0;
}

compiles with:
m68k-amigaos-gcc -O3 -mcrt=nix13 -o prime prime.c

alkis · 09 July 2023, 22:14

Quote:

Originally Posted by Don_Adan

Thanks. I dont coding from long time, but after quick check, I dont know why You cleared BSS section matrix, because this is always 0, if this is one shoot program. And clearing working buff via move byte command is not very effective.

New clear. Movem.l 32 bytes at a time, unrolled once to go 64bytes per loop.

Code:

          lea.l        guard,a0
          move.w       #7813-1,d5
          moveq        #0,d0
          moveq        #0,d1
          moveq        #0,d2
          moveq        #0,d3
          moveq        #0,d4
          move.l       d0,a1
          move.l       d0,a2
          move.l       d0,a3

.nclr     MOVEM.L      D0-D4/A1-A3,-(A0)                            ; 8 x 4 = 32 bytes
          MOVEM.L      D0-D4/A1-A3,-(A0)                            ; 8 x 4 again for 64 bytes	
          dbra         d5,.nclr

vamos -v sieve
23:09:56.731 main: INFO: done. exit code=0
23:09:56.731 main: INFO: total cycles: 53939790
23:09:56.732 main: INFO: vamos is exiting
Number of primes 78498

Down from 68736530. Gcc still comes up on top with 49823902

Muzza · 10 July 2023, 00:24

Quote:

Originally Posted by Hypex

Thanks for testing. I'm not sure what jeq and jlt are. That's not a common mnemonic.

Pseudo op-codes.
They are replaced with the shortest kind of branch possible, but this is not always known at compile time.

Quote:

Originally Posted by Hypex

Except it's branching (or jumping) to a routine, which then jumps to another, comes back and returns. My code branches direct to assumed routines.

I did it like this to represent the ; 0 ; + ; - comments you added to your original, which I assume were there to place-hold some kind of work. Without them the compiler would optimize away the entire routine because it doesn't actually do anything. Taking that into account, the ASM it produces is identical to yours.

Change the functions pos/neg/zero to some small inlined calculation that gets re-used in multiple places, and you'd have a nice example of the benefits of an optimizing compiler.

Don_Adan · 10 July 2023, 01:12

Quote:

Originally Posted by alkis

New clear. Movem.l 32 bytes at a time, unrolled once to go 64bytes per loop.

Code:

          lea.l        guard,a0
          move.w       #7813-1,d5
          moveq        #0,d0
          moveq        #0,d1
          moveq        #0,d2
          moveq        #0,d3
          moveq        #0,d4
          move.l       d0,a1
          move.l       d0,a2
          move.l       d0,a3

.nclr     MOVEM.L      D0-D4/A1-A3,-(A0)                            ; 8 x 4 = 32 bytes
          MOVEM.L      D0-D4/A1-A3,-(A0)                            ; 8 x 4 again for 64 bytes	
          dbra         d5,.nclr

vamos -v sieve
23:09:56.731 main: INFO: done. exit code=0
23:09:56.731 main: INFO: total cycles: 53939790
23:09:56.732 main: INFO: vamos is exiting
Number of primes 78498

Down from 68736530. Gcc still comes up on top with 49823902

If I understand correctly Your routine, then You used 0 byte for prime and 1 for no prime. If You replace this for 0 as no prime and 1 as prime, then in final loop You can perhaps replace tst.b (a0)+ with tst.w (a0)+. And final loop will be 2 times faster.

Also I dont remember if add.l d6,d1 is fastest than addq.l #1,d1. For me addq.l version is much more readable.

alkis · 10 July 2023, 03:28

Quote:

Originally Posted by Don_Adan

If I understand correctly Your routine, then You used 0 byte for prime and 1 for no prime. If You replace this for 0 as no prime and 1 as prime, then in final loop You can perhaps replace tst.b (a0)+ with tst.w (a0)+. And final loop will be 2 times faster.

Also I dont remember if add.l d6,d1 is fastest than addq.l #1,d1. For me addq.l version is much more readable.

I am missing something on your first suggestion. Table element is byte. How can I check 2 values with one check when using words?

For the add.l vs addq, I think they are both 8 cycles according to https://wiki.neogeodev.org/index.php...ctions_timings

kamelito · 10 July 2023, 08:57

It reminds me of this video and guess what Assembly was not the winner nor C.
[ Show youtube player ]

meynaf · 10 July 2023, 09:03

This doesn't look too well :

Code:

.while	cmp.l	d7,d3
	bge.b	.endw
	move.b	d6,(a1,d3.l)
	add.l	d2,d3
	bra.b	.while
.endw

Perhaps you could try something like :

Code:

 lea (a1,d7.l),a3       ; put that out of for loop
 lea (a1,d3.l),a2
 bra.b .do              ; remove that if first test always returns to .while
.while
 move.b d6,(a2)
 add.l d2,a2
.do
 cmp.l a3,a2
 blo.b .while

Bruce Abbott · 10 July 2023, 09:56

Quote:

Originally Posted by alkis

For example, I coded a program that counts all primes up to 1 million. I did it in asm and then I did it in C (I am using beboo's gcc 6.5). C wins. I am sure if a godlike assembly programmer (like most other guys in this forum are) would write the asm version asm would come on top. The other day I thought, "Huh, what would sas c give?". I compiled with sas and my asm was better. So ranking is:
1) gcc 6.5
2) average (or below) asm programer
3) sas c

For reference here are the timings...

How did you count the cycles?

alkis · 10 July 2023, 10:03

Quote:

Originally Posted by Bruce Abbott

How did you count the cycles?

I run the executable through vamos on linux. With -v it reports the cycles needed for the execution.

alkis · 10 July 2023, 10:14

Quote:

Originally Posted by meynaf

This doesn't look too well :

Code:

.while	cmp.l	d7,d3
	bge.b	.endw
	move.b	d6,(a1,d3.l)
	add.l	d2,d3
	bra.b	.while
.endw

Perhaps you could try something like :

Code:

 lea (a1,d7.l),a3       ; put that out of for loop
 lea (a1,d3.l),a2
 bra.b .do              ; remove that if first test always returns to .while
.while
 move.b d6,(a2)
 add.l d2,a2
.do
 cmp.l a3,a2
 blo.b .while

vamos -v sieve
11:12:02.255 main: INFO: done. exit code=0
11:12:02.255 main: INFO: total cycles: 42054466
11:12:02.256 main: INFO: vamos is exiting
Number of primes 78498

Asm on top now.

Thorham · 10 July 2023, 11:17

Quote:

Originally Posted by kamelito

It reminds me of this video and guess what Assembly was not the winner nor C.

He's wrong. Assembly language is the fastest. Assembly language is translated to machine code directly, and machine code is all a CPU knows. So how is assembly language not the fastest?

meynaf · 10 July 2023, 11:46

Quote:

Originally Posted by Thorham

So how is assembly language not the fastest?

Easy : it wasn't in the test.
Frankly this test looks completely bogus. Java in top 5, serious ?

Karlos · 10 July 2023, 12:39

Quote:

Originally Posted by Thorham

He's wrong. Assembly language is the fastest. Assembly language is translated to machine code directly, and machine code is all a CPU knows. So how is assembly language not the fastest?

It is possible for JIT to beat ahead of time compiled or assembled code for the very simple reason that lots of "variables" at compile time and branches based on them end up being invariant at runtime and a good JIT only optimises the paths actually taken. If you have never heard of it i suggest you read up on HP's Project Dynamo in which a JIT emulation of their own PA-RISC CPU ran a number of benchmarks faster than the host CPU did, despite the fact it was the same chip! Think of it as a 68K emulator running in a 68K outperforming the actual CPU for the sane task.

The reason for this apparent over-unity performance wasn't bad measurement or a dysfunction if reality. It was just that the compiled benchmark code was further optimisable at runtime.

Java, as annoying as it is, does have a very mature JIT on x64, so i would not be surprised for it to run some high level code faster than some native binary.

Thorham · 10 July 2023, 13:05

Quote:

Originally Posted by meynaf

Java in top 5, serious ?

Depends on the JIT compiler.

Quote:

Originally Posted by Karlos

It is possible for JIT to beat ahead of time compiled or assembled code

It still ends up being native machine code (which can be directly translated to ASM).

Quote:

Originally Posted by Karlos

Think of it as a 68K emulator running in a 68K outperforming the actual CPU for the sane task.

Yeah, I wanna see that.

The reason why ASM is fastest is because machine code is all the CPU knows. A CPU can't run anything else. That you can use various techniques to improve performance doesn't change this.

meynaf · 10 July 2023, 13:16

Quote:

Originally Posted by Thorham

Depends on the JIT compiler.

It sure does, but java run-time is far from being lightweight.
Every java app i've tried so far wasn't impressive by its speed...
So perhaps a good JIT can offset things, but putting that... thing in top 5 of two hundred languages ? Nah.

While there are some optimizations JIT can do from information only available at run-time, it has to be much faster than static optimization and therefore misses many opportunities.
In addition, we all know benchmarks have a strong tendency to do useless things that can be completely removed.

But anyone who believes JIT compiler can beat native speed is welcome to help me improve my VM.

Thorham · 10 July 2023, 13:26

Quote:

Originally Posted by meynaf

Every java app i've tried so far wasn't impressive by its speed...

Depends on the code quality. I have a DDS decoder in C# that's based on someone else's code. I managed to make the code three times as fast. I also found another DDS decoder in C# that was about ten times slower than mine!

Quote:

Originally Posted by meynaf

But anyone who believes JIT compiler can beat native speed is welcome to help me improve my VM.

You can't beat native speed. It's impossible.

kamelito · 10 July 2023, 13:28

Quote:

Originally Posted by Thorham

He's wrong. Assembly language is the fastest. Assembly language is translated to machine code directly, and machine code is all a CPU knows. So how is assembly language not the fastest?

You know that it not the machine code produced but the code being C or assembly no the end result at the machine level that the person who did the assembly code was not able to write.

meynaf · 10 July 2023, 13:41

Quote:

Originally Posted by Thorham

Depends on the code quality. I have a DDS decoder in C# that's based on someone else's code. I managed to make the code three times as fast. I also found another DDS decoder in C# that was about ten times slower than mine!

Well, it's C# vs C#. Still not impressed, it's so easy to kill performance in HLLs.

Quote:

Originally Posted by Thorham

You can't beat native speed. It's impossible.

I know. I was just sarcarstic.

meynaf · 10 July 2023, 13:45

Quote:

Originally Posted by kamelito

You know that it not the machine code produced but the code being C or assembly no the end result at the machine level that the person who did the assembly code was not able to write.

That nobody was able to write decent x86 code isn't surprising.
The fact x86 (and many others) asm is just horrible has led to the (false) belief compilers do a better job. But it's easy to be the best when nobody seriously challenges you.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Chat GPT Amiga Assembly	rcman	Coders. Asm / Hardware	3	26 March 2023 20:24
An Amiga coder was banned without a reason - is it ok?	litwr	project.EAB	1	18 June 2021 20:38
Beginning Amiga Assembly Tutorial(s)	Curbie	Coders. Asm / Hardware	15	29 May 2020 00:21
Beginning Amiga Assembly Programming	Hewitson	Coders. Tutorials	32	09 October 2012 18:25
Amiga Assembly sources - Please help!	W4r3DeV1L	Amiga scene	21	16 July 2008 08:13

09 July 2023, 21:27	#421
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,975	Thanks. I dont coding from long time, but after quick check, I dont know why You cleared BSS section matrix, because this is always 0, if this is one shoot program. And clearing working buff via move byte command is not very effective.

10 July 2023, 08:57	#427
kamelito Zone Friend Join Date: May 2006 Location: France Posts: 1,801	It reminds me of this video and guess what Assembly was not the winner nor C. [ Show youtube player ]

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)