English Amiga Board


Old 14 October 2021, 08:51   #1301
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
I attached a small(ish) test case for the inline assembly issue. This is part of the cache system from Build. I basically took a bunch of external assembly functions and turned them into GCC inlines. While the original ones work fine, these inlines don't. Maybe I made a mistake when I converted the functions to inlines, but I can't figure out what. I double-checked the input/output/clobber lists, and they looked OK.

I included generic C versions of the offending functions; if you define NOASM, they will be used. You can build the example program with:
m68k-amigaos-gcc -Wall -noixemul -m68040 -O2 -fomit-frame-pointer -fno-strict-aliasing -o cachetest cachetest.c
Attached Files
File Type: c cachetest.c (6.6 KB, 70 views)
Old 14 October 2021, 18:06   #1302
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
EDIT: You forgot d1 in the clobber list.

You have an error somewhere in copybufbyte (maybe because d1 is missing from the clobber list). Replacing its body with just
Code:
        "1: subq.l #1, d0\n\t"
        "   bmi.s  2f\n\t"
        "   move.b (a0)+,(a1)+\n\t"
        "   bra.s  1b\n\t"
        "2:\n\t"
seems to work.

Debugging note: copybuf doesn't seem to be called at all, so there's no reason to include it, and you could have checked whether either of the functions worked on their own to reduce the test case (sometimes they interact, but it doesn't seem to be the case in this example).

BTW, is that complicated function even worth it time-wise compared to calling memcpy/CopyMemQuick or whatever?

Last edited by paraj; 14 October 2021 at 18:15.
Old 14 October 2021, 18:14   #1303
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
As I mentioned, all of these functions work perfectly as external assembly; they only break when I turn them into inline assembly functions. It doesn't matter whether the complicated copybufbyte function is worth it or not, I only included it in the example because it triggers the bug very easily. In the actual program (Blood) even the simplest inline assembly functions break the code.
Old 14 October 2021, 18:41   #1304
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by BSzili View Post
As I mentioned, all of these functions work perfectly as external assembly; they only break when I turn them into inline assembly functions. It doesn't matter whether the complicated copybufbyte function is worth it or not, I only included it in the example because it triggers the bug very easily. In the actual program (Blood) even the simplest inline assembly functions break the code.


You just missed my edit, but your clobber list is bugged. Inline assembly is extremely difficult to get right and the errors are very unforgiving (as you've noticed). It'll often seem to work fine until a specific set of circumstances arises (sometimes much later in the development process), and then you'll have a very hard time tracking down the bugs.


My recommendation would be to avoid inline assembly and stick with externally defined asm functions and only use it for the hopefully very few cases where it's essential for performance. Your mulscale32 function could be an example, but you'd want to express it in a way that doesn't force specific registers to be used, which would be more difficult to express properly but would allow it to interact better with the C optimizer.
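
To illustrate "doesn't force specific registers", here is a minimal sketch of a mulscale32 that lets GCC pick the data registers through operand constraints. It is not the code from the attached test case. USE_M68K_MULS64 is a hypothetical macro you would define yourself, and only for 68020-68040 builds, since the 64-bit form of muls.l is not implemented in hardware on the 68060. On any other target the block falls back to plain C.

```c
#include <stdint.h>

/* Sketch only. USE_M68K_MULS64 is a made-up switch for this example:
 * define it yourself when building for a 68020-68040. */
static inline int32_t mulscale32(int32_t a, int32_t b)
{
#if defined(__m68k__) && defined(USE_M68K_MULS64)
    int32_t hi;
    /* muls.l <ea>,Dh:Dl multiplies Dl by <ea> and leaves the signed
     * 64-bit product in Dh:Dl. No register is named explicitly, so the
     * only clobber left to declare is the condition codes. */
    __asm__("muls.l %2,%0:%1"
            : "=d"(hi), "+d"(a)  /* %0 = high half, %1 = low half (in/out) */
            : "d"(b)             /* %2 = multiplier */
            : "cc");
    return hi;
#else
    /* Portable fallback: high 32 bits of the signed 64-bit product.
     * Relies on GCC shifting negative values arithmetically. */
    return (int32_t)(((int64_t)a * b) >> 32);
#endif
}
```

Because the asm statement has outputs and no side effects, it is left non-volatile here, so the optimizer is free to schedule it or drop it when the result is unused.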


Really, avoid inline asm even if you think you know what you're doing. Speaking from experience.

Last edited by paraj; 14 October 2021 at 18:57.
Old 14 October 2021, 19:01   #1305
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
Thanks, I indeed missed the edit! I'll add d1 to the clobber list and try again. The reason I'm trying to use inline assembly is that the code is riddled with those 64-bit math functions, and there must be a few cycles I could save since they are called very often.
If I can get this working then I'll try to use %0, %1, etc. instead of explicit registers.

edit: With d1 in the clobber list it works perfectly! Thanks again, I'll go back over every function to verify that nothing is missing from the clobber lists!
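
For reference, a minimal sketch of what the %0, %1 style could look like for a byte copy loop like the one paraj posted. This is an illustration under assumptions, not the copybufbyte from the attachment: the function name copy_bytes and the extra scratch register are made up for the example. The scratch value is declared as an early-clobber output ("=&d") so the compiler chooses the register, which means there is no hard-coded d1 that could be forgotten in the clobber list.

```c
/* Sketch: byte copy with compiler-chosen registers. The C fallback is used
 * on non-m68k hosts or when NOASM is defined. */
static void copy_bytes(void *dst, const void *src, long n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(__m68k__) && !defined(NOASM)
    unsigned char tmp; /* scratch; the compiler picks a data register */
    __asm__ __volatile__(
        "1: subq.l #1,%2\n\t"       /* n--, leave the loop once it goes negative */
        "   bmi.s  2f\n\t"
        "   move.b (%1)+,%3\n\t"    /* load one byte into the scratch register */
        "   move.b %3,(%0)+\n\t"    /* store it to the destination */
        "   bra.s  1b\n\t"
        "2:"
        : "+a"(d), "+a"(s), "+d"(n), "=&d"(tmp) /* all are modified by the asm */
        :
        : "cc", "memory");          /* flags change, memory is written */
#else
    while (n-- > 0)
        *d++ = *s++;
#endif
}
```

The pointers and the counter use "+" constraints because the loop changes them; listing them as plain inputs would be the same class of bug as a missing clobber.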

Last edited by BSzili; 14 October 2021 at 19:07.
Old 15 October 2021, 17:17   #1306
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by BSzili View Post
I attached a small(ish) test case for the inline assembly issue. This is part of the cache system from Build. I basically took a bunch of external assembly functions and turned them into GCC inlines. While the original ones work fine, these inlines don't. Maybe I made a mistake when I converted the functions to inlines, but I can't figure out what. I double-checked the input/output/clobber lists, and they looked OK.

I included generic C versions of the offending functions; if you define NOASM, they will be used. You can build the example program with:
m68k-amigaos-gcc -Wall -noixemul -m68040 -O2 -fomit-frame-pointer -fno-strict-aliasing -o cachetest cachetest.c

Sorry, but I don't see the reason for using such inline assembly with gcc.
The NOASM functions aren't worse. With loop unrolling they should be even faster.


https://franke.ms/cex/z/aTjM31
Old 16 October 2021, 08:42   #1307
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
Of course. It was just an example to figure out what I had messed up (the clobber lists). For the mulscale, etc. functions I need the assembly, as the 64-bit multiplication is replaced with the FPU version on the 68060. The rest will be gradually replaced with C versions, as GCC 6 generates pretty good code.
Old 16 October 2021, 10:11   #1308
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by BSzili View Post
Of course. It was just an example to figure out what I had messed up (the clobber lists). For the mulscale, etc. functions I need the assembly, as the 64-bit multiplication is replaced with the FPU version on the 68060. The rest will be gradually replaced with C versions, as GCC 6 generates pretty good code.

Oh, I see. With -m68060 the code is... insane^^ That should be fixed in gcc...
Old 16 October 2021, 18:17   #1309
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by bebbo View Post
Oh, I see. With -m68060 the code is... insane^^ That should be fixed in gcc...

oh - the 68060 code is OK! Why?


Quote:
The unimplemented integer instructions include 64-bit divide and multiply, ...

which means: you may use the 64-bit forms of mulu.l/muls.l on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.


I would not bet on which is faster in the end.
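
To make concrete what any software replacement for the missing 64-bit multiply has to compute, here is a rough sketch in plain C that builds the high 32 bits of a 32x32 product from 16-bit halves, using only 32-bit multiplies (which the 68060 still has in hardware). The names umulhi32 and smulhi32 are invented for the example; this is not the code of the actual 68060 emulation routines or of libgcc.

```c
#include <stdint.h>

/* High 32 bits of an unsigned 32x32 multiply, assembled from four
 * 16x16->32 partial products, so no 64-bit multiply is needed. */
static uint32_t umulhi32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint32_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
    /* Sum of everything that lands in bits 16..31 of the full product;
     * its upper half is the carry into the high word. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    return hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* Signed variant: take the unsigned result and correct it for negative
 * operands (two's complement arithmetic modulo 2^32). */
static int32_t smulhi32(int32_t a, int32_t b)
{
    uint32_t hi = umulhi32((uint32_t)a, (uint32_t)b);
    if (a < 0) hi -= (uint32_t)b;
    if (b < 0) hi -= (uint32_t)a;
    return (int32_t)hi;
}
```

smulhi32(a, b) gives the same value as a mulscale32 defined as the 64-bit product shifted right by 32. Whether something like this beats the exception-based emulation or an FPU route would have to be measured.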
Old 16 October 2021, 19:39   #1310
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by bebbo View Post
oh - the 68060 code is OK! Why?

which means: you may use the 64-bit forms of mulu.l/muls.l on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.

I would not bet on which is faster in the end.

Might be missing something here, but it seems like the code generated for the C version of mulscale32 is slower than it needs to be for 060.


It calls ___muldi3 (which is understandable since '060 doesn't have 32x32->64bit multiply), but then at least with my version (m68k-amigaos-gcc (GCC) 6.5.0b 210726154642, built from amiga-gcc commit 15656337dad68ed40f54d600ed2b19e64bfd9ea2) ___muldi3 looks like this:


Code:
__muldi3 (DWtype u, DWtype v)
{
    4af8:       4e55 0000       link.w a5,#0
    4afc:       48e7 3c00       movem.l d2-d5,-(sp)
    4b00:       242d 0008       move.l 8(a5),d2
    4b04:       262d 000c       move.l 12(a5),d3
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b08:       2003            move.l d3,d0
{
    4b0a:       282d 0010       move.l 16(a5),d4
    4b0e:       2a2d 0014       move.l 20(a5),d5
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b12:       4c05 0401       mulu.l d5,d1,d0
    4b16:       2041            movea.l d1,a0
    4b18:       2240            movea.l d0,a1
    4b1a:       2008            move.l a0,d0
    4b1c:       2209            move.l a1,d1

  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b1e:       4c04 3800       muls.l d4,d3
               + (UWtype) uu.s.high * (UWtype) vv.s.low);
    4b22:       4c05 2800       muls.l d5,d2
    4b26:       d483            add.l d3,d2
  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b28:       2002            move.l d2,d0
    4b2a:       d088            add.l a0,d0

  return w.ll;
}
    4b2c:       4cdf 003c       movem.l (sp)+,d2-d5
    4b30:       4e5d            unlk a5
    4b32:       4e75            rts


I.e. it uses an emulated 32x32->64bit multiply (mulu.l d5,d1,d0) and does a bunch of extra work?
Old 16 October 2021, 20:17   #1311
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
Quote:
Originally Posted by bebbo View Post
oh - the 68060 code is OK! Why?

which means: you may use the 64-bit forms of mulu.l/muls.l on an MC68060, but this will raise an exception and an emulation of the instruction gets invoked instead.

I would not bet on which is faster in the end.
Sorry, I was a bit vague. I have custom FPU-based replacements for the missing instructions that are faster than the ones that come with the compiler. It's not a one-size-fits-all solution, as I have to set the FPU to round toward minus infinity, but for these games it's OK as they only use the FPU lightly and don't depend on the default rounding for the game logic.
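
A small sketch of why the rounding mode matters, in portable C rather than FPU assembly. The integer definition (64-bit product shifted right by 32) rounds toward minus infinity, while a plain float-to-int conversion in C truncates toward zero, so the two disagree for negative results. The function names are invented for the example, and the explicit "step down" stands in for setting the FPU rounding mode. Note the assumption: a C double has a 53-bit mantissa, so this sketch is only exact while the product fits in 53 bits; the 68k FPU's extended format has a 64-bit mantissa and does not have that limit.

```c
#include <stdint.h>

/* Reference: integer definition, rounds toward minus infinity
 * (GCC shifts negative values arithmetically). */
static int32_t mulscale32_int(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Floating-point version with C's default conversion, which truncates
 * toward zero. */
static int32_t mulscale32_trunc(int32_t a, int32_t b)
{
    return (int32_t)((double)a * (double)b / 4294967296.0);
}

/* Floating-point version that rounds toward minus infinity by hand. */
static int32_t mulscale32_floor(int32_t a, int32_t b)
{
    double q = (double)a * (double)b / 4294967296.0;
    int32_t t = (int32_t)q;   /* truncates toward zero */
    if ((double)t > q)        /* negative and not an integer: step down */
        t--;
    return t;
}
```

For example, with a = -1 and b = 1 the integer version and mulscale32_floor both give -1, while mulscale32_trunc gives 0.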
Old 16 October 2021, 20:50   #1312
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by paraj View Post
Might be missing something here, but it seems like the code generated for the C version of mulscale32 is slower than it needs to be for 060.


It calls ___muldi3 (which is understandable since '060 doesn't have 32x32->64bit multiply), but then at least with my version (m68k-amigaos-gcc (GCC) 6.5.0b 210726154642, built from amiga-gcc commit 15656337dad68ed40f54d600ed2b19e64bfd9ea2) ___muldi3 looks like this:


Code:
__muldi3 (DWtype u, DWtype v)
{
    4af8:       4e55 0000       link.w a5,#0
    4afc:       48e7 3c00       movem.l d2-d5,-(sp)
    4b00:       242d 0008       move.l 8(a5),d2
    4b04:       262d 000c       move.l 12(a5),d3
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b08:       2003            move.l d3,d0
{
    4b0a:       282d 0010       move.l 16(a5),d4
    4b0e:       2a2d 0014       move.l 20(a5),d5
  DWunion w = {.ll = __umulsidi3 (uu.s.low, vv.s.low)};
    4b12:       4c05 0401       mulu.l d5,d1,d0
    4b16:       2041            movea.l d1,a0
    4b18:       2240            movea.l d0,a1
    4b1a:       2008            move.l a0,d0
    4b1c:       2209            move.l a1,d1

  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b1e:       4c04 3800       muls.l d4,d3
               + (UWtype) uu.s.high * (UWtype) vv.s.low);
    4b22:       4c05 2800       muls.l d5,d2
    4b26:       d483            add.l d3,d2
  w.s.high += ((UWtype) uu.s.low * (UWtype) vv.s.high
    4b28:       2002            move.l d2,d0
    4b2a:       d088            add.l a0,d0

  return w.ll;
}
    4b2c:       4cdf 003c       movem.l (sp)+,d2-d5
    4b30:       4e5d            unlk a5
    4b32:       4e75            rts
I.e. it uses an emulated 32x32->64bit multiply (mulu.l d5,d1,d0) and does a bunch of extra work?

hehe - fun - that lib wasn't built for the 68060.
=> you need libs built for the 68060...
Old 17 October 2021, 15:47   #1313
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by BSzili View Post
Sorry, I was a bit vague. I have custom FPU-based replacements for the missing instructions that are faster than the ones that come with the compiler. It's not a one-size-fits-all solution, as I have to set the FPU to round toward minus infinity, but for these games it's OK as they only use the FPU lightly and don't depend on the default rounding for the game logic.

something like
Code:
int mulscale32(int u, int v) {
    return ((double)u) * v / ((double)(1<<16) * (1<<16));
}

?

Old 17 October 2021, 19:54   #1314
aros-sg
Registered User
 
Join Date: Nov 2015
Location: Italy
Posts: 191
There are some gcc inline fixedmul/fixeddiv functions in the DoomAttack source. If they still work with newer gcc versions (they date from the 2.95 era), maybe they can be used as inspiration. They look like this:

Code:
extern __inline fixed_t FixedMul(fixed_t eins,fixed_t zwei)
{
	
#ifndef version060

	__asm __volatile
	("muls.l %1,%1:%0 \n\t"
	 "move %1,%0 \n\t"
	 "swap %0 "
					 
	  : "=d" (eins), "=d" (zwei)
	  : "0" (eins), "1" (zwei)
	);

	return eins;

#else
	__asm __volatile
	("fmove.l	%0,fp0 \n\t"
	 "fmul.l	%2,fp0 \n\t"
	 "fmul.x	fp7,fp0 \n\t"

/*	 "fintrz.x	fp0,fp0 \n\t"*/
	 "fmove.l	fp0,%0"
					 
	  : "=d" (eins)
	  : "0" (eins), "d" (zwei)
	  : "fp0"
	);

	return eins;

#endif

}
fp6/fp7 are constants initialized to 65536 and 1/65536 at prog start.
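
If someone wants to check such inlines against plain C, a reference along these lines might help. It is a sketch that assumes fixed_t is a 32-bit 16.16 fixed-point type (the typedef is not shown in the snippet above), and the function names are made up. The second function mirrors what the muls.l / move / swap sequence does with the two halves of the 64-bit product.

```c
#include <stdint.h>

typedef int32_t fixed_t; /* assumed: 16.16 fixed point */

/* Reference: 64-bit product shifted down by the 16 fraction bits. */
static fixed_t FixedMul_ref(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> 16);
}

/* Same result built from the halves of the product: the low word of the
 * high half becomes the upper 16 bits, the high word of the low half
 * becomes the lower 16 bits. */
static fixed_t FixedMul_halves(fixed_t a, fixed_t b)
{
    int64_t p = (int64_t)a * b;
    uint32_t lo = (uint32_t)p;
    uint32_t hi = (uint32_t)((uint64_t)p >> 32);
    return (fixed_t)((hi << 16) | (lo >> 16));
}
```

Both return 0x30000 (3.0) for 0x18000 (1.5) times 0x20000 (2.0). The FPU variant can differ from this in the last bit for negative products, depending on the rounding mode in effect.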
Old 18 October 2021, 07:30   #1315
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
Quote:
Originally Posted by bebbo View Post
something like
Code:
int mulscale32(int u, int v) {
    return ((double)u) * v / ((double)(1<<16) * (1<<16));
}

?
Something like that, but first it checks if it can get away with the 32-bit multiplication without overflow:
https://github.com/BSzili/jfbuild/bl...ragmas.h#L3635
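
The general idea of such a fast path can be sketched in plain C as below. This is not the linked code: the function name, the generic shift parameter and the 16-bit range test are assumptions made for the illustration. If both operands fit in 16 signed bits, the product is guaranteed to fit in 32 bits and the cheap 32-bit multiply is enough.

```c
#include <stdint.h>

/* Hypothetical helper: (a * b) >> shift, with 1 <= shift <= 31 assumed. */
static int32_t mulscale(int32_t a, int32_t b, int shift)
{
    /* Fast path: |a|, |b| <= 32768 means |a * b| <= 2^30, so no overflow. */
    if (a >= -32768 && a <= 32767 && b >= -32768 && b <= 32767)
        return (a * b) >> shift;
    /* Slow path: full 64-bit product. */
    return (int32_t)(((int64_t)a * b) >> shift);
}
```

A real implementation may use a different (and tighter) overflow test; the point is only that the expensive path is skipped when the operands are small.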
Old 26 October 2021, 14:13   #1316
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
There is a tutorial from Wei-ju Wu: Setting up gcc for Amiga cross development

Old 28 October 2021, 18:30   #1317
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
I have a question about the mathieeedoubtrans.library dependency in libnix. After looking at this issue I tried to compile my executable with
-noixemul -m68881 -mhard-float
but pow(), for example, still pulls in libm020/libm881/libm.a, which uses IEEEDPPow. Is it possible to avoid this?
I could make a PR to add inline asm replacements to libnix when __HAVE_68881__ is defined. For example, clib2's pow implementation:
https://github.com/adtools/clib2/blo...ath_pow.c#L122
It's available under the BSD license, and these shouldn't take up much more space than the mathieeedoubtrans.library calls. Would you be interested in such a patch for libnix?
Old 28 October 2021, 20:37   #1318
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by BSzili View Post
I have a question about the mathieeedoubtrans.library dependency in libnix. After looking at this issue I tried to compile my executable with
-noixemul -m68881 -mhard-float
but pow(), for example, still pulls in libm020/libm881/libm.a, which uses IEEEDPPow. Is it possible to avoid this?
I could make a PR to add inline asm replacements to libnix when __HAVE_68881__ is defined. For example, clib2's pow implementation:
https://github.com/adtools/clib2/blo...ath_pow.c#L122
It's available under the BSD license, and these shouldn't take up much more space than the mathieeedoubtrans.library calls. Would you be interested in such a patch for libnix?

yes, it is possible: use -ffast-math

https://franke.ms/cex/z/EEnfbE

Last edited by bebbo; 28 October 2021 at 20:45.
Old 28 October 2021, 20:50   #1319
BSzili
old chunk of coal
 
 
Join Date: Nov 2011
Location: Hungary
Posts: 1,289
Thanks, I guess I can use this pow replacement to avoid mathieeedoubtrans.library
Old 29 October 2021, 08:10   #1320
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
Quote:
Originally Posted by BSzili View Post
Thanks, I guess I can use this pow replacement to avoid mathieeedoubtrans.library

I have no idea if it's smart to hook into the mathieee-stuff or not...
... it's an Amiga thing to provide these.


It would also be possible to provide these functions as builtins, which would result in a direct call into the Amiga libraries without using any stub...


... well only thoughts
 

