GCC 6.2 toolchain for AmigaOS 3 - Page 66

BSzili · 14 October 2021, 08:51

I attached a small(ish) test case for the inline assembly issue. This is part of the cache system from Build. I basically took a bunch of external assembly functions and turned them into GCC inlines. While the original ones work fine, these inlines don't. Maybe I made a mistake when I converted the functions to inlines, but I can't figure out what.

I double checked the input/output/clobber lists, and they looked OK.

I included generic C versions of the offending functions, if you define NOASM they will be used. You can build the example program with:

m68k-amigaos-gcc -Wall -noixemul -m68040  -O2 -fomit-frame-pointer -fno-strict-aliasing -o cachetest cachetest.c

paraj · 14 October 2021, 18:06

EDIT: You forgot d1 in the clobberlist

You have an error somewhere in copybufbyte (maybe because you left d1 out of the clobber list). Replacing its body with just

Code:

        "1: subq.l #1, d0\n\t"
        "   bmi.s  2f\n\t"
        "   move.b (a0)+,(a1)+\n\t"
        "   bra.s  1b\n\t"
        "2:\n\t"

seems to work.
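For comparison, here is a minimal sketch of how that simplified loop might look as a complete GCC inline. It assumes operand constraints instead of hard-coded d0/a0/a1, so the compiler chooses the registers and no register has to be listed by hand as clobbered. The name `copy_bytes` and the portable fallback branch are illustrative only and are not taken from the attached cachetest.c.

```c
/* Hypothetical helper, not from cachetest.c: copies count bytes from src to dst. */
static inline void copy_bytes(const void *src, void *dst, long count)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
#if defined(__mc68000__)
    /* "+a" and "+d" mark the pointers and the counter as read AND modified,
       so the compiler knows their old values are gone after the asm.
       "memory" tells it that the destination buffer was written. */
    __asm__ __volatile__(
        "1: subq.l #1,%2\n\t"
        "   bmi.s  2f\n\t"
        "   move.b (%0)+,(%1)+\n\t"
        "   bra.s  1b\n\t"
        "2:"
        : "+a"(s), "+a"(d), "+d"(count)
        :
        : "cc", "memory");
#else
    /* Portable fallback so the sketch also builds on non-m68k hosts. */
    while (--count >= 0)
        *d++ = *s++;
#endif
}
```

With this shape there is no separate clobber entry to forget: every register the asm touches is an operand, and only the condition codes and memory are declared as side effects.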

Debugging note: copybuf doesn't seem to be called at all, so there's no reason to include it, and you could have checked whether either of the functions worked on their own to reduce the test case (sometimes they interact, but it doesn't seem to be the case in this example).

BTW, is that complicated function even worth it performance-wise compared to calling memcpy/CopyMemQuick or whatever?

BSzili · 14 October 2021, 18:14

As I mentioned all of these functions work perfectly as external assembly, they only break when I turn them into inline assembly functions. It doesn't matter if the complicated copybufbyte function is worth it or not, I only included it in the example because it triggers the bug very easily. In the actual program (Blood) even the simplest inline assembly functions break the code.

paraj · 14 October 2021, 18:41

Quote:

Originally Posted by BSzili

As I mentioned all of these functions work perfectly as external assembly, they only break when I turn them into inline assembly functions. It doesn't matter if the complicated copybufbyte function is worth it or not, I only included it in the example because it triggers the bug very easily. In the actual program (Blood) even the simplest inline assembly functions break the code.

You just missed my edit, but your clobber list is bugged. Inline assembly is extremely difficult to get right and the errors are very unforgiving (as you've noticed). It'll often seem to work fine until a specific set of circumstances arises (sometimes much later in the development process) and you'll have a very hard time tracking down the bugs.

My recommendation would be to avoid inline assembly and stick with externally defined asm functions and only use it for the hopefully very few cases where it's essential for performance. Your mulscale32 function could be an example, but you'd want to express it in a way that doesn't force specific registers to be used, which would be more difficult to express properly but would allow it to interact better with the C optimizer.
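As a rough illustration of what "not forcing specific registers" could look like, here is a sketch that assumes mulscale32 returns the upper 32 bits of the signed 64-bit product. The macro `USE_MULS_L_64` is a hypothetical switch invented for this example (define it only for CPUs that execute the 64-bit `muls.l` natively), and the C fallback relies on GCC's arithmetic right shift of negative values.

```c
#include <stdint.h>

/* Sketch only: USE_MULS_L_64 is a made-up switch, not a toolchain macro. */
static inline int32_t mulscale32(int32_t a, int32_t b)
{
#if defined(__mc68000__) && defined(USE_MULS_L_64)
    int32_t hi;
    /* muls.l <ea>,Dh:Dl multiplies Dl by <ea>; the high half of the 64-bit
       result lands in Dh (%0) and the low half replaces Dl (%1). */
    __asm__("muls.l %2,%0:%1"
            : "=&d"(hi), "+d"(a)
            : "d"(b)
            : "cc");
    return hi;
#else
    /* Plain C version: widen, multiply, keep the upper 32 bits. */
    return (int32_t)(((int64_t)a * b) >> 32);
#endif
}
```

Because all three values are operands with a plain "d" constraint, the register allocator can place them in whatever data registers are free at each call site, and the asm has no `__volatile__` so identical calls can still be merged or hoisted by the optimizer.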

Really, avoid inline asm even if you think you know what you're doing. Speaking from experience

BSzili · 14 October 2021, 19:01

Thanks, I indeed missed the edit! I'll add d1 to the clobber list and try again. The reason I'm trying to use inline assembly is because the code is riddled with those 64-bit math functions, and there must be a few cycles I could save since they are called very often.
If I can get this working then I'll try to use %0, %1, etc. instead of explicit registers.

edit: With d1 in the clobber list it works perfectly!

Thanks again, I'll go back to every function to verify that nothing is missing from the clobber lists!

bebbo · 15 October 2021, 17:17

Quote:

Originally Posted by BSzili

I attached a small(ish) test case for the inline assembly issue. This is part of the cache system from Build. I basically took a bunch of external assembly functions and turned them into GCC inlines. While the original ones work fine, these inlines don't. Maybe I made a mistake when I converted the functions to inlines, but I can't figure out what.

I double checked the input/output/clobber lists, and they looked OK.

I included generic C versions of the offending functions, if you define NOASM they will be used. You can build the example program with:

m68k-amigaos-gcc -Wall -noixemul -m68040  -O2 -fomit-frame-pointer -fno-strict-aliasing -o cachetest cachetest.c

Sorry, but I don't see the reason for using such inline assembly with gcc.
The NOASM functions aren't worse. With loop unrolling they should be even faster.

https://franke.ms/cex/z/aTjM31

BSzili · 16 October 2021, 08:42

Of course. It was just an example to figure out what I messed up (the clobber lists). For the mulscale, etc. functions I need the assembly, as the 64-bit multiplication is replaced with the FPU version on the 68060. The rest will be gradually replaced with C versions, as GCC6 generates pretty good code.

bebbo · 16 October 2021, 10:11

Quote:

Originally Posted by BSzili

Of course. It was just an example to figure out what I messed up (the clobber lists). For the mulscale, etc. functions I need the assembly, as the 64-bit multiplication is replaced with the FPU version on the 68060. The rest will be gradually replaced with C versions, as GCC6 generates pretty good code.

Oh, I see. with -m68060 the code is... insane^^ that should be fixed in gcc...

bebbo · 16 October 2021, 18:17

Quote:

Originally Posted by bebbo

Oh, I see. with -m68060 the code is... insane^^ that should be fixed in gcc...

oh - the 68060 code is OK! Why?

Quote:

The unimplemented integer instructions include 64-bit divide and multiply, ...

which means: you may use the 64-bit forms of mulu.l/muls.l (32x32->64) on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.

I would not bet on which is faster in the end.
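To make the cost concrete, here is a sketch of the arithmetic a software replacement has to perform when the 32x32->64 instruction is not available in hardware. It builds the 64-bit product from 16-bit halves, so every individual multiply fits in 32 bits. This is an illustration of the technique only; it is not the code used by the 68060 emulation package or by libgcc, and the function names are made up for the example.

```c
#include <stdint.h>

/* Unsigned 32x32->64 product using only multiplies whose result fits 32 bits. */
static uint64_t umul32x32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;          /* contributes to bits  0..31 */
    uint32_t lh = al * bh;          /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;          /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;          /* contributes to bits 32..63 */

    /* Sum of the three pieces that overlap bits 16..31; at most 3 * 0xFFFF. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}

/* Upper 32 bits of the signed product, derived from the unsigned one. */
static int32_t mulhi_signed(int32_t a, int32_t b)
{
    uint32_t hi = (uint32_t)(umul32x32((uint32_t)a, (uint32_t)b) >> 32);
    if (a < 0) hi -= (uint32_t)b;   /* undo the 2^32 bias of a negative a */
    if (b < 0) hi -= (uint32_t)a;   /* undo the 2^32 bias of a negative b */
    return (int32_t)hi;
}
```

Four multiplies plus a handful of shifts and adds is the baseline any emulation or library routine has to beat, which is why the exception overhead on top of it matters for code that calls these functions very often.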

paraj · 16 October 2021, 19:39

Quote:

Originally Posted by bebbo

oh - the 68060 code is OK! Why?

which means: you may use the 64-bit forms of mulu.l/muls.l (32x32->64) on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.

I would not bet on which is faster in the end.

Might be missing something here, but it seems like the code generated for the C version of mulscale32 is slower than it needs to be for 060.

It calls ___muldi3 (which is understandable since '060 doesn't have 32x32->64bit multiply), but then at least with my version (m68k-amigaos-gcc (GCC) 6.5.0b 210726154642, built from amiga-gcc commit 15656337dad68ed40f54d600ed2b19e64bfd9ea2) ___muldi3 looks like this:

Code:

__muldi3 (DWtype u, DWtype v)
{
    4af8:       4e55 0000       link.w a5,#0
    4afc:       48e7 3c00       movem.l d2-d5,-(sp)
    4b00:       242d 0008       move.l 8(a5),d2
    4b04:       262d 000c       move.l 12(a5),d3
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b08:       2003            move.l d3,d0
{
    4b0a:       282d 0010       move.l 16(a5),d4
    4b0e:       2a2d 0014       move.l 20(a5),d5
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b12:       4c05 0401       mulu.l d5,d1,d0
    4b16:       2041            movea.l d1,a0
    4b18:       2240            movea.l d0,a1
    4b1a:       2008            move.l a0,d0
    4b1c:       2209            move.l a1,d1

  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b1e:       4c04 3800       muls.l d4,d3
               + (UWtype) uu.s.high * (UWtype) vv.s.low);
    4b22:       4c05 2800       muls.l d5,d2
    4b26:       d483            add.l d3,d2
  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b28:       2002            move.l d2,d0
    4b2a:       d088            add.l a0,d0

  return w.ll;
}
    4b2c:       4cdf 003c       movem.l (sp)+,d2-d5
    4b30:       4e5d            unlk a5
    4b32:       4e75            rts

I.e. it uses an emulated 32x32->64bit multiply (mulu.l d5,d1,d0) and does a bunch of extra work?

BSzili · 16 October 2021, 20:17

Quote:

Originally Posted by bebbo

oh - the 68060 code is OK! Why?

which means: you may use the 64-bit forms of mulu.l/muls.l (32x32->64) on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.

I would not bet on which is faster in the end.

Sorry, I was a bit vague. I have custom FPU-based replacements for the missing instructions that are faster than the ones that come with the compiler. It's not a one-size-fits-all solution, as I have to set the FPU to round toward minus infinity, but for these games it's OK as they only use the FPU lightly and don't depend on the default rounding for the game logic.

bebbo · 16 October 2021, 20:50

Quote:

Originally Posted by paraj

Might be missing something here, but it seems like the code generated for the C version of mulscale32 is slower than it needs to be for 060.

It calls ___muldi3 (which is understandable since '060 doesn't have 32x32->64bit multiply), but then at least with my version (m68k-amigaos-gcc (GCC) 6.5.0b 210726154642, built from amiga-gcc commit 15656337dad68ed40f54d600ed2b19e64bfd9ea2) ___muldi3 looks like this:

Code:

__muldi3 (DWtype u, DWtype v)
{
    4af8:       4e55 0000       link.w a5,#0
    4afc:       48e7 3c00       movem.l d2-d5,-(sp)
    4b00:       242d 0008       move.l 8(a5),d2
    4b04:       262d 000c       move.l 12(a5),d3
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b08:       2003            move.l d3,d0
{
    4b0a:       282d 0010       move.l 16(a5),d4
    4b0e:       2a2d 0014       move.l 20(a5),d5
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b12:       4c05 0401       mulu.l d5,d1,d0
    4b16:       2041            movea.l d1,a0
    4b18:       2240            movea.l d0,a1
    4b1a:       2008            move.l a0,d0
    4b1c:       2209            move.l a1,d1

  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b1e:       4c04 3800       muls.l d4,d3
               + (UWtype) uu.s.high * (UWtype) vv.s.low);
    4b22:       4c05 2800       muls.l d5,d2
    4b26:       d483            add.l d3,d2
  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b28:       2002            move.l d2,d0
    4b2a:       d088            add.l a0,d0

  return w.ll;
}
    4b2c:       4cdf 003c       movem.l (sp)+,d2-d5
    4b30:       4e5d            unlk a5
    4b32:       4e75            rts

I.e. it uses an emulated 32x32->64bit multiply (mulu.l d5,d1,d0) and does a bunch of extra work?

hehe - fun - that lib wasn't built for 68060.
=> you need libs built for the 68060...

bebbo · 17 October 2021, 15:47

Quote:

Originally Posted by BSzili

Sorry, I was a bit vague. I have custom FPU-based replacements for the missing instructions that are faster than the ones that come with the compiler. It's not a one-size-fits-all solution, as I have to set the FPU to round toward minus infinity, but for these games it's OK as they only use the FPU lightly and don't depend on the default rounding for the game logic.

something like

Code:

int mulscale32(int u, int v) {
return ((double)u) * v / ((double)(1<<16) * (1<<16));
}

?

aros-sg · 17 October 2021, 19:54

There are some gcc inline fixedmul/fixeddiv functions in the DoomAttack source. They date from the GCC 2.95 era, but if they still work with newer gcc versions they could serve as inspiration. They look like this:

Code:

extern __inline fixed_t FixedMul(fixed_t eins,fixed_t zwei)
{
	
#ifndef version060

	__asm __volatile
	("muls.l %1,%1:%0 \n\t"
	 "move %1,%0 \n\t"
	 "swap %0 "
					 
	  : "=d" (eins), "=d" (zwei)
	  : "0" (eins), "1" (zwei)
	);

	return eins;

#else
	__asm __volatile
	("fmove.l	%0,fp0 \n\t"
	 "fmul.l	%2,fp0 \n\t"
	 "fmul.x	fp7,fp0 \n\t"

/*	 "fintrz.x	fp0,fp0 \n\t"*/
	 "fmove.l	fp0,%0"
					 
	  : "=d" (eins)
	  : "0" (eins), "d" (zwei)
	  : "fp0"
	);

	return eins;

#endif

}

fp6/fp7 are constants initialized to 65536 and 1/65536 at program start.
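For readers checking such inlines against a reference, here is a plain C sketch of what the non-060 path appears to compute, assuming `fixed_t` is a 32-bit 16.16 fixed-point value. The `muls.l` leaves the 64-bit product in the `%1:%0` pair, and the word-sized `move` followed by `swap` assembles bits 16..47 of that product, which is the same as shifting the full product right by 16. The name `FixedMulC` is made up for this example and does not come from the DoomAttack source.

```c
#include <stdint.h>

typedef int32_t fixed_t;    /* assumed: 16.16 fixed point */

/* Reference version: full 64-bit product, then drop the 16 fraction bits.
   Relies on arithmetic right shift of negative values, as GCC provides. */
static inline fixed_t FixedMulC(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> 16);
}
```

A reference like this can double as the NOASM-style fallback and as a test oracle: run both versions over a range of inputs and compare the results before trusting the inline.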

BSzili · 18 October 2021, 07:30

Quote:

Originally Posted by bebbo

something like

Code:

int mulscale32(int u, int v) {
return ((double)u) * v / ((double)(1<<16) * (1<<16));
}

?

Something like that, but first it checks if it can get away with the 32-bit multiplication without overflow:
https://github.com/BSzili/jfbuild/bl...ragmas.h#L3635

bebbo · 26 October 2021, 14:13

There is a tutorial from Wei-ju Wu: Setting up gcc for Amiga cross development


BSzili · 28 October 2021, 18:30

I have a question about the mathieeedoubtrans.library dependency in libnix. After looking at this issue I tried to compile my executable with

-noixemul -m68881 -mhard-float

, but pow(), for example, still pulls in libm020/libm881/libm.a, which uses IEEEDPPow. Is it possible to avoid this?
I could make a PR to add inline asm replacements to libnix when __HAVE_68881__ is defined. For example, clib2's pow implementation:
https://github.com/adtools/clib2/blo...ath_pow.c#L122
It's available under BSD license, and these shouldn't take up much more space than the mathieeedoubtrans.library calls. Would you be interested in such a patch for libnix?

bebbo · 28 October 2021, 20:37

Quote:

Originally Posted by BSzili

I have a question about the mathieeedoubtrans.library dependency in libnix. After looking at this issue I tried to compile my executable with

-noixemul -m68881 -mhard-float

, but pow(), for example, still pulls in libm020/libm881/libm.a, which uses IEEEDPPow. Is it possible to avoid this?
I could make a PR to add inline asm replacements to libnix when __HAVE_68881__ is defined. For example, clib2's pow implementation:
https://github.com/adtools/clib2/blo...ath_pow.c#L122
It's available under BSD license, and these shouldn't take up much more space than the mathieeedoubtrans.library calls. Would you be interested in such a patch for libnix?

yes, it is possible: use -ffast-math

https://franke.ms/cex/z/EEnfbE

BSzili · 28 October 2021, 20:50

Thanks, I guess I can use this pow replacement to avoid mathieeedoubtrans.library

bebbo · 29 October 2021, 08:10

Quote:

Originally Posted by BSzili

Thanks, I guess I can use this pow replacement to avoid mathieeedoubtrans.library

I have no idea if it's smart to hook into the mathieee-stuff or not...
... it's an Amiga thing to provide these.

It would also be possible to provide these functions as builtins, which would result in a direct call into the Amiga libraries without using any stub...

... well only thoughts
