Sorting benchmark

Samurai_Crow · 15 November 2016, 15:41

Could someone please compile this code on an AmigaOne with a G3 or Efika and post the results on this thread?

Code:

/*-----------------------------------------------------------------------*/
/*
 CPU  Stress test 
 (Gunnar von Boehn) Feel free to do with the code what you want.

 The test is small but stresses the following CPU features:
 - DataCache
 - Conditional Code execution / Branch prediction
 - Loop acceleration  
 - Memory Hazard Detection

 = These are all important CPU features that are stressed.
 The test was taken from the internal test cases of the APOLLO CPU project.


 Compile with O2 
 E.g: gcc -o sort -O2 sortbench.c
 
 */
# include <stdio.h>
# include <stdlib.h>
# include <float.h>
# include <sys/time.h>
# define HLINE "-------------------------------------------------------------\n"



int count[32]={ 524799,2098175,4720127,8390655,
          13109759,18877439,25693695,33558527,
          42471935,52433919,63444479,75503615,
          88611327,102767615,117972479,134225919,
          151527935,169878527,189277695,209725439,
          231221759,253766655,277360127,302002175,
          327692799,354431999,382219775,411056127,
          440941055,471874559,503856639,536887295};


 
double mysecond() {
        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

/* 
 68K Code
  loop2:
       move.l  D2,D1        ;                                           1  -  
       move.l  (a0)+,D2      ; 2nd value in register D2             1  1                      
       cmp.l   D1,D2        ; Compare values                            2  1 
       bge     noswap       ; Branch if greater than or equal to        3  -         
 doswap:
       exg     D1,D2        ;                                       3  1
 noswap:
       move.l  D1,-8(a0)    ; Store 1st ordered values                  4  1            
       subq.l  #1,D6
       bge     d6,loop2     ; inner loop                   4  1
*/

void sort(int * data, int size){
    int D6,D7,D1,D2,D3; int * A0;   
    D7 = size-2;
loop1:
    A0 = data;
    D6 = D7;
    D2 = *A0++;
loop2:
    D1 = D2;
    D2 = *A0++;
    if( D1>D2 ){
      D3=D1; D1=D2; D2=D3;
    }
    *(A0-2) = D1;
    D6--; if( D6 > -1) goto loop2;
    *(A0-1) = D2;
    D7--; if( D7 > -1) goto loop1;
}

void bench(int size1){
    int i; double time1, time2;
    int * data;
    int * A0;
    int size=size1*1024;
    int loops, loopsmax;

    data =malloc(size*4); // Allocate the array: 4 KB (1 K elements) up to 128 KB (32 K elements)
    time1= mysecond(); 

    loopsmax=count[31]/count[size1-1];
    for(loops=loopsmax; loops>0 ; loops--){

      A0 = data;
      for (i=size; i>0 ; i--){
        *A0++=i;
      }
      sort(data, size);
    }
    time2= mysecond(); 

    printf("%2i K Element :  %6.2f MB/sec\n",size/1024, loopsmax*count[size1-1]/1024*8/(time2-time1)/1024 );
    free(data);
}
int main(void) {
    int i;

    printf(HLINE);
    printf("SORTBENCH 1.1 (Gunnar von Boehn)\n");
    printf("Its a CPU benchmark that stresses CPU, DCache and branch prediction.\n");
    printf(HLINE);

    for (i=1; i<=32; i=i*2){
       bench(i);
    }
    return 0;
}

We want to gauge against the current experimental Gold2 Core on the Vampire.

DDNI · 15 November 2016, 18:48

No efika or G3 here.
Tested on my AmigaOne X1000

-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 2084.88 MB/sec
2 K Element : 2096.05 MB/sec
4 K Element : 2105.75 MB/sec
8 K Element : 2134.63 MB/sec
16 K Element : 2112.34 MB/sec
32 K Element : 1712.92 MB/sec

matthey · 15 November 2016, 20:24

Quote:

Originally Posted by DDNI

No efika or G3 here.
Tested on my AmigaOne X1000

32 K Element : 1712.92 MB/sec

Someone is going to need a bigger graph. More interesting would be MB/s/MHz though. It shows which processors are weak per clock (like most ARM and ColdFire) and gives a much better comparison of processor design. Compiler performance is a big part of this test, as with DMIPS. One bug or missed optimization can make a huge difference, as we saw in another thread with vbcc-compiled DMIPS code.

emufan · 15 November 2016, 20:38

compiler version makes a difference - (here on my peecee amd phenom II X4 955 3.2 ghz)

i686-w64-mingw32-gcc-5.4.0.exe -> 32 K Element : 6076.98 MB/sec
i686-pc-mingw32-gcc-4.7.3.exe -> 32 K Element : 4870.25 MB/sec

sort-gcc-4.7.3.exe: PE32 executable (console) Intel 80386 (stripped to external PDB), for MS Windows
sort-gcc-5.4.0.exe: PE32 executable (console) Intel 80386 (stripped to external PDB), for MS Windows

attached my binaries - compiled using cygwin/mingw32 [ -o sort -O2 sortbench.c ]

#1) can anyone attach an amiga os binary? the linker of my crosscompiler does not work :/

nogginthenog · 16 November 2016, 22:44

4 year old Intel I7 3770 3.4GHz running Windows:
Linux VM: 32 K Element : 14682.84 MB/sec
Cygwin: 32 K Element : 14657.00 MB/sec
Windows 10 bash: 32 K Element : 14563.97 MB/sec
(pretty consistent)

A4000 68060 CyberStorm MkII gcc 2.95.3:
32 K Element : 34.87 MB/sec

Compiled with m68k-amigaos-gcc -noixemul -m68020-60 -o sort_060 -O2 sortbench.c -lm

Ouch! Amiga executable attached.

emufan · 16 November 2016, 23:17

Quote:

Originally Posted by nogginthenog

4 year old Intel I7 3770 3.4GHz running Windows:
Linux VM: 32 K Element : 14682.84 MB/sec
Cygwin: 32 K Element : 14657.00 MB/sec
Windows 10 bash: 32 K Element : 14563.97 MB/sec
(pretty consistent)

pretty awesome - while my amd burns down 120 wattage

Quote:

Ouch! Amiga executable attached.

thanks - but this is just a log of sortbench, not the amiga binary

#1) but using ur compiler syntax, I was able to make one on my own, thanks, the "-noixemul -lm" did the trick

#2) had to change %6.2f into %6f in the printf, otherwise i had "%6.2f MB/sec" in results list.
I only have WinUAE here, so those results make no sense; but I get "915 MB/sec" in my fast-as-possible 030/882 setup

#3) attached 68000 and 68020-060 AmigaOS binaries.

matthey · 17 November 2016, 00:23

Quote:

Originally Posted by nogginthenog

4 year old Intel I7 3770 3.4GHz running Windows:
Linux VM: 32 K Element : 14682.84 MB/sec
Cygwin: 32 K Element : 14657.00 MB/sec
Windows 10 bash: 32 K Element : 14563.97 MB/sec
(pretty consistent)

Modern x86_64 processors are strong at this test. They did well in test results posted at the following link also.

http://www.apollo-core.com/sortbench...age=benchmarks

Quote:

Originally Posted by nogginthenog

A4000 68060 CyberStorm MkII gcc 2.95.3:
32 K Element : 34.87 MB/sec

Compiled with m68k-amigaos-gcc -noixemul -m68020-60 -o sort_060 -O2 sortbench.c -lm

Ouch!

The 68060 results are actually impressive. 32k elements is too big to fit in the DCache and is a poor size to choose when comparing the performance of older processors. A smaller number of elements were chosen for comparison in the results I linked above. The performance in the cache should scale linearly and different processors can be compared by looking at MB/s/MHz. My 68060@75MHz CSMKIII scored ~120 MB/s with outdated compilers and I achieved ~140 MB/s with assembler optimizations when the DCache could hold all the data.

ARM Cortex A4 0.30 MB/s/MHz
ColdFire v3 MCF5329 0.44 MB/s/MHz
Raspberry Pi ARM 1176JZF-S 0.652 MB/s/MHz
ARM Feroceon 88FR131 0.69 MB/s/MHz
IBM Power 6 0.69 MB/s/MHz
Intel Atom 0.84 MB/s/MHz
IBM Power 7 1.16 MB/s/MHz
AmigaOne X1000 PA6T-1682M 1.19 MB/s/MHz
PPC G4 7447 1.26 MB/s/MHz
68060 1.60 MB/s/MHz (1.87 MB/s/MHz with assembler optimizations)
Intel Core 2 Duo 2.61 MB/s/MHz
Intel i7 3770 4.32 MB/s/MHz (or more if smaller element size helps)

A modern clocked 68060 with modern die size and modern cache sizes would likely outperform everything here except modern x86_64 processors in this single core benchmark. It should even be possible for performance to improve some with a modern die shrink (shorter timings for instructions and addressing modes). The Apollo core claims even better performance than the 68060. The claim at the link I posted would give 3.70 MB/s/MHz which would be faster than an Intel Core 2 Duo for this test and would be even more impressive. This is why I suggested MB/s/MHz and using a smaller element size for comparison but nobody ever listened to me.

Samurai_Crow · 17 November 2016, 07:18

Hi matthey,

Quote:

A modern clocked 68060 with modern die size and modern cache sizes would likely outperform everything here except modern x86_64 processors in this single core benchmark. It should even be possible for performance to improve some with a modern die shrink (shorter timings for instructions and addressing modes). The Apollo core claims even better performance than the 68060. The claim at the link I posted would give 3.70 MB/s/MHz which would be faster than an Intel Core 2 Duo for this test and would be even more impressive.

It shouldn't be too surprising. I think Gunnar is figuring out many of the same cache optimizations that are used by Intel. One additional advantage that the '060 and '080 share is that the code density is slightly higher than most of the others.

The AMD64 instruction set brought registers that were only used for segment pointers that hung lifeless in 32-bit enhanced code back into circulation as general-purpose registers bringing the total up to 16. Of course the '080 has an additional bank of 8 bringing the total for the '080 up to 24, as you have probably heard.

TuKo · 17 November 2016, 09:03

Quote:

ARM Cortex A4 0.30 MB/s/MHz
ColdFire v3 MCF5329 0.44 MB/s/MHz
Raspberry Pi ARM 1176JZF-S 0.652 MB/s/MHz
ARM Feroceon 88FR131 0.69 MB/s/MHz
IBM Power 6 0.69 MB/s/MHz
Intel Atom 0.84 MB/s/MHz
IBM Power 7 1.16 MB/s/MHz
AmigaOne X1000 PA6T-1682M 1.19 MB/s/MHz
PPC G4 7447 1.26 MB/s/MHz
68060 1.60 MB/s/MHz (1.87 MB/s/MHz with assembler optimizations)
Intel Core 2 Duo 2.61 MB/s/MHz
Intel i7 3770 4.32 MB/s/MHz (or more if smaller element size helps)

Thanks for these results, they are very interesting ! Can you please give us model details for Core 2 Duo ?

Samurai_Crow · 17 November 2016, 09:19

Here are the results for my Core2 Duo-based Mac Mini running 64-bit Debian Linux at 1.83 GHz:
-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 6646.58 MB/sec
2 K Element : 6817.39 MB/sec
4 K Element : 6903.55 MB/sec
8 K Element : 6927.06 MB/sec
16 K Element : 6876.59 MB/sec
32 K Element : 6868.39 MB/sec

Results are from the same executable compiled with GCC 4.9.2 .

-edit-
The results using -mtune=core2 are shown below. Slightly better than before.

samuraicrow@SamsMacMini:~/Downloads$ ./sort2
-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 6697.62 MB/sec
2 K Element : 6809.94 MB/sec
4 K Element : 6917.67 MB/sec
8 K Element : 6947.43 MB/sec
16 K Element : 6934.64 MB/sec
32 K Element : 6934.44 MB/sec

@matthey

Attached is a new chart for you with a newer Gold2core candidate measured against the efficiency of your Core2Duo.

matthey · 17 November 2016, 12:16

Quote:

Originally Posted by Samurai_Crow

It shouldn't be too surprising. I think Gunnar is figuring out many of the same cache optimizations that are used by Intel. One additional advantage that the '060 and '080 share is that the code density is slightly higher than most of the others.

I'm not sure code density makes much of a difference for this benchmark. The code is relatively small and should fit in the ICache of all but the oldest processors. The Data takes the same space in the DCache of all processors. Furthermore, if Gunnar cared about code density then maybe he wouldn't have abandoned research and enhancements which improve it.

Quote:

Originally Posted by Samurai_Crow

The AMD64 instruction set brought registers that were only used for segment pointers that hung lifeless in 32-bit enhanced code back into circulation as general-purpose registers bringing the total up to 16. Of course the '080 has an additional bank of 8 bringing the total for the '080 up to 24, as you have probably heard.

The 68080, as Gunnar calls it, isn't using any of those extra registers for this benchmark. They are unlikely to ever be used by any compiler. This benchmark is testing only the decades old API and stack based ABI of a more modern 68k CPU design. The AMD64/x86_64 enhancements of 64-bit registers and an improved ABI passing function arguments in registers are being utilized.

Quote:

Originally Posted by TuKo

Thanks for these results, they are very interesting ! Can you please give us model details for Core 2 Duo ?

Results will vary significantly due to many factors. My point was to produce a rough idea of the peak in-cache performance of different processors, which is roughly comparable. My numbers and info were based on the web site I linked to (and numbers given in this thread) and I do not know how reliable they are, other than the 68060 numbers from my Amiga. There probably is a significant difference between early Core 2 Duos with small caches and later die shrink versions with larger caches. Results can vary significantly by API/ABI used as well. Samurai's results are significantly higher, for example. They would be ~3.80 MB/s/MHz. The Core 2 Duo is a strong and efficient (for x86_64 architecture) processor.

Samurai_Crow · 17 November 2016, 12:22

Ninja'd by matthey's post. Check the edit of my previous post.

-edit-
Whoops. I see now that it was from the website that you got the results. I wonder what compiler was used to generate such poor results of the Core2Duo that should have been faster than my Mac Mini by spec.

-edit2-
Found the problem. All the results on the website were from Sortbench 1.0 while the source in this thread was Sortbench 1.1.

jPV · 17 November 2016, 16:11

Quote:

Originally Posted by Samurai_Crow

Could someone please compile this code on an AmigaOne with a G3 or Efika and post the results on this thread?

Is Pegasos 1 with G3/600MHz and MorphOS close enough?

As already said, it makes some difference which compiler you use. Here are results with gcc 2 and gcc 5:

Pegasos1 G3/600MHz MorphOS gcc2:

-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 736.16 MB/sec
2 K Element : 736.15 MB/sec
4 K Element : 734.83 MB/sec
8 K Element : 734.24 MB/sec
16 K Element : 670.56 MB/sec
32 K Element : 649.91 MB/sec

Pegasos1 G3/600MHz MorphOS gcc5:

-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 1102.51 MB/sec
2 K Element : 1103.06 MB/sec
4 K Element : 1101.12 MB/sec
8 K Element : 1099.38 MB/sec
16 K Element : 984.12 MB/sec
32 K Element : 948.57 MB/sec

And then one test with my "Amiga laptop"...

PowerBook G4/1667MHz MorphOS gcc5:

-------------------------------------------------------------
SORTBENCH 1.1 (Gunnar von Boehn)
Its a CPU benchmark that stresses CPU, DCache and branch prediction.
-------------------------------------------------------------
1 K Element : 3082.05 MB/sec
2 K Element : 3082.15 MB/sec
4 K Element : 3059.91 MB/sec
8 K Element : 3069.99 MB/sec
16 K Element : 2877.81 MB/sec
32 K Element : 2807.30 MB/sec

nogginthenog · 17 November 2016, 20:04

Quote:

Originally Posted by emufan

thanks - but this is just a log of sortbench, not the amiga binary

Doh!

Quote:

#1) but using ur compiler syntax, I was able to make one on my own, thanks, the "-noixemul -lm" did the trick

#2) had to change %6.2f into %6f in the printf, otherwise i had "%6.2f MB/sec" in results list.

I got similar results until I added -lm. -m68020-60 made little difference.
It would be interesting to see if GCC v6 makes a difference. This guy claims to have 6.20 working. I compiled v6 for 68k ages ago but without Amiga patches.

Samurai_Crow · 17 November 2016, 21:52

Neat link about GCC 6.2 building. Thanks!

nogginthenog · 27 November 2016, 18:28

Quote:

Originally Posted by Samurai_Crow

Neat link about GCC 6.2 building. Thanks!

Did you try to build it? It doesn't look like Stefan has published his patches.

Samurai_Crow · 27 November 2016, 22:06

I didn't try to build it. It doesn't seem like his patches are public or completely tested.

jack-3d · 23 April 2017, 03:20

Can someone please make and share a WarpOS executable?
