16 May 2023, 21:46 | #1 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Fancy a tool that speeds up AMOS executables?
Towards the end of the development of Ring around the World, which is written mostly in AMOS Professional, I wrote a tool that speeds up the game executable produced by the AMOS Professional Compiler by applying various optimizations to conditionals, branches, peeking/poking, arrays, etc.
The tool is general purpose: it loads an AMOS executable, patches it and saves the resulting executable. Please note that it isn't magical: since its optimizations are at machine language level only, it won't help (much), for example, when the load is mostly elsewhere (e.g. blitting), the frame rate is locked, and so on. The tool itself is written in AMOS, but every now and then I get the itch to rewrite it in assembly and release it - and every time I say to myself that it isn't worth the effort. This time around I thought I'd ask your opinion about it and run some practical tests. So, I'd like to ask: a) do you know of any games/demos/apps that (supposedly) would benefit from such a tool? - I'd like to try the tool and see if it's actually useful in real-world cases (other than RatW, that is); b) would you be interested in the tool? - I'd like to know if there is a potential audience. Thanks in advance for your feedback. |
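To make the "load, patch, save" idea concrete, here is a minimal sketch of a byte-pattern peephole patcher. It is not the author's tool and makes several assumptions: the helper names (`parse_pattern`, `find_pattern`, `patch`) are hypothetical, matches are only searched at even offsets (68000 instructions are word-aligned), and a shorter replacement is padded with NOP opcodes ($4E71). The two byte sequences are the compiled and optimized Amreg() read routines quoted later in this thread; the `e792` displacement is taken as posted and may well differ in other executables.

```python
NOP = bytes.fromhex("4e71")  # 68000 NOP opcode, used as padding


def parse_pattern(text):
    """Turn '6100 **** 3610' into a list of byte values; '**' becomes None (wildcard)."""
    digits = text.replace(" ", "")
    if len(digits) % 2:
        raise ValueError("pattern must contain whole bytes")
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return [None if pair == "**" else int(pair, 16) for pair in pairs]


def find_pattern(data, pattern, start=0):
    """Return the first even offset >= start where the pattern matches, or -1."""
    last = len(data) - len(pattern)
    for pos in range(start, last + 1, 2):
        if all(p is None or data[pos + k] == p for k, p in enumerate(pattern)):
            return pos
    return -1


def patch(data, pattern, replacement):
    """Overwrite every match in place and return (patched bytes, number of matches)."""
    spare = len(pattern) - len(replacement)
    if spare < 0 or spare % 2:
        raise ValueError("replacement must fit and leave an even number of spare bytes")
    filler = replacement + NOP * (spare // 2)  # keep the executable the same size
    patched = bytearray(data)
    hits = 0
    pos = find_pattern(patched, pattern)
    while pos != -1:
        patched[pos:pos + len(filler)] = filler
        hits += 1
        pos = find_pattern(patched, pattern, pos + len(pattern))
    return bytes(patched), hits


# Byte sequences as quoted in the thread for the Amreg() read routine.
AMREG_COMPILED = parse_pattern("223c 8000 0000 6100 **** 3610 48c3 4e75")
AMREG_OPTIMIZED = bytes.fromhex("d683 41ed e792 3630 3800 48c3 4e75")
```

A real tool would also have to respect the hunk structure and relocation entries of the executable, which this sketch ignores entirely.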
17 May 2023, 04:00 | #2 |
Total Chaos forever!
Join Date: Aug 2007
Location: Waterville, MN, USA
Age: 49
Posts: 2,213
|
At one point I thought that making a library of a compiler backend of ECX or its EEC fork would be a way to improve other compilers. Now that GCC 13.1 is being ported to 68k I've shifted toward thinking that LibGCCJIT used as a static compiler optimizer and backend library would do even better.
At the end of the day, I'd like to be able to combine forces between AmigaE, AmosPro and Blitz to come up with something useful between them. Maybe W2C2 could have its ANSI C backend replaced with GCC's as well for a bytecode experience. Regarding peephole optimization, just being able to use VAsm as the code generator would go a long way in that regard. Without the original devs, who has the time? |
22 May 2023, 15:41 | #3 |
Banned
Join Date: May 2006
Location: n/a
Posts: 278
|
The only app I can think of to use the speed-up tool on would be DMC (Disk Magazine Creator). I don't know if it would make it faster or not.
Note: it's compressed with Crunchmania, so it would need unpacking first. |
22 May 2023, 17:50 | #4 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,355
|
Note that there are similar tools. If you want to have a look, there is "Hunk" on Aminet, which contains a couple of tiny script files (called "Hoppers") that apply such peephole optimizations for popular compilers. The Hoppers are scripted, so everyone can write their own. However, the overall benefit is really quite minor.
|
22 May 2023, 19:00 | #5 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Quote:
For a GUI-oriented program, I'd say it's probably impossible to spot differences in speed (unless there are non-interactive calculation-heavy parts). *I couldn't bother searching for the full version. Last edited by saimo; 10 June 2023 at 13:31. |
|
22 May 2023, 19:07 | #6 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Quote:
|
|
08 June 2023, 19:26 | #7 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Just for the record...
Yesterday, while updating Ring around the World, I decided to measure the speedup brought by the tool to the most CPU-intensive routine of the game, i.e. the routine that calculates the shortest path from one location to another. In the case of the longest path allowed by the map, the speedup amounted to about 13% (not 8.7%, as first posted) on a stock Amiga 500: not exactly a major improvement, but still not too bad. For completeness, this is the routine: Code:
Fill WMA To MA_ROWSIZE*MA_HEIGHT+WMA,$80008000
MCRV_TILEX=TILEX : MCRV_TILEY=TILEY : Call MCRA_GETTILEINDEX
If OB_FLAGS(Param) and 1
   Doke WMA+TMA-MA_ADDRESS,0
Else
   Dec D
End If
MT_CHECK_TILE:
If TMA=TTMA Then Goto MT_WALK_PATH
TWMA=WMA+TMA-MA_ADDRESS
Inc D
If Deek(TWMA-MA_ROWSIZE) and $8000
   If OB_FLAGS(Deek(TMA-MA_ROWSIZE)) and 1
      Loke WQAW,TMA-MA_ROWSIZE : Add WQAW,4
      Doke TWMA-MA_ROWSIZE,D
   End If
End If
If Deek(TWMA-2) and $8000
   If OB_FLAGS(Deek(TMA-2)) and 1
      Loke WQAW,TMA-2 : Add WQAW,4
      Doke TWMA-2,D
   End If
End If
If Deek(TWMA+2) and $8000
   If OB_FLAGS(Deek(TMA+2)) and 1
      Loke WQAW,TMA+2 : Add WQAW,4
      Doke TWMA+2,D
   End If
End If
If Deek(TWMA+MA_ROWSIZE) and $8000
   If OB_FLAGS(Deek(TMA+MA_ROWSIZE)) and 1
      Loke WQAW,TMA+MA_ROWSIZE : Add WQAW,4
      Doke TWMA+MA_ROWSIZE,D
   End If
End If
MT_CHECK_NEXT_TILE:
If WQAR-WQAW
   TMA=Leek(WQAR) : Add WQAR,4
   Goto MT_CHECK_TILE
End If
If DTF=0
   Dec DTF
   WQAR=WQAS
   WQAW=WQAR
   TMA=DTMA-MA_ROWSIZE-2
   TWMA=WMA+TMA-MA_ADDRESS
   If Deek(TWMA) and $8000
      If OB_FLAGS(Deek(TMA)) and 1
         Doke TWMA,0
         Loke WQAW,TMA : Add WQAW,4
      End If
   End If
   Add TMA,4
   Add TWMA,4
   If Deek(TWMA) and $8000
      If OB_FLAGS(Deek(TMA)) and 1
         Doke TWMA,0
         Loke WQAW,TMA : Add WQAW,4
      End If
   End If
   Add TMA,MA_ROWSIZE*2
   Add TWMA,MA_ROWSIZE*2
   If Deek(TWMA) and $8000
      If OB_FLAGS(Deek(TMA)) and 1
         Doke TWMA,0
         Loke WQAW,TMA : Add WQAW,4
      End If
   End If
   Add TMA,-4
   Add TWMA,-4
   If Deek(TWMA) and $8000
      If OB_FLAGS(Deek(TMA)) and 1
         Doke TWMA,0
         Loke WQAW,TMA : Add WQAW,4
      End If
   End If
   D=0
   Goto MT_CHECK_NEXT_TILE
End If
Dec RC
Goto MT_LEAVE
MT_WALK_PATH:
EDIT: added video link. Last edited by saimo; 09 June 2023 at 13:27. |
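For readers who don't speak AMOS: the routine above appears to be a queue-based flood fill over the tile map (a working map pre-filled with $8000 as the "unvisited" marker, plus read/write queue pointers), which is the classic breadth-first approach to shortest paths on a grid. The following is a generic sketch of that technique, not a translation of the game's code; the function name, the grid representation and the coordinates are all assumptions made for illustration.

```python
from collections import deque

UNVISITED = -1
# Same neighbour order as the AMOS routine: up, left, right, down.
NEIGHBOURS = ((-1, 0), (0, -1), (0, 1), (1, 0))


def shortest_path(walkable, start, goal):
    """Breadth-first search on a grid of booleans; returns a list of (row, col) or None."""
    rows, cols = len(walkable), len(walkable[0])
    dist = [[UNVISITED] * cols for _ in range(rows)]
    dist[start[0]][start[1]] = 0
    queue = deque([start])

    # Expand the wavefront until the goal is dequeued or the queue runs dry.
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if walkable[nr][nc] and dist[nr][nc] == UNVISITED:
                    dist[nr][nc] = dist[r][c] + 1
                    queue.append((nr, nc))

    if dist[goal[0]][goal[1]] == UNVISITED:
        return None

    # Walk back from the goal, always stepping to a neighbour one step closer.
    path = [goal]
    r, c = goal
    while (r, c) != start:
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] == dist[r][c] - 1:
                r, c = nr, nc
                break
        path.append((r, c))
    path.reverse()
    return path
```

The inner loop is dominated by memory reads, comparisons and branches, which is consistent with the kind of code the tool is said to speed up.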
08 June 2023, 21:05 | #8 |
Aghnar
Join Date: Jan 2019
Location: France
Posts: 156
|
Hi Saimo,
Very interesting. I have a lot of code in AMOS Pro, so we can apply your tool to a lot of code snippets and see the differences. I sometimes publish things here: https://github.com/alain-treesong/amiga_coding_in_amos and I have a few demos here: https://demozoo.org/groups/111822/ So we can test a lot of cases, I think. Tell me if you are interested. See u |
09 June 2023, 00:35 | #9 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Quote:
EDIT2: elaborating more on the post made in a hurry, late at night, while I should have been in bed... I downloaded the latest demo and it turned out that no optimizations were possible - that happens with compressed executables; since I really should have gone to bed, I gave up; also, I downloaded one of the sources and, from a quick glance at it, it was clear that the tool would not help much as the core loop is dominated by FP maths (which the tool doesn't touch) and polygon rendering (which the tool doesn't touch). Anyway, later I'll give all the sources a spin and report back. Last edited by saimo; 09 June 2023 at 13:21. |
|
09 June 2023, 00:36 | #10 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Doh, I totally forgot to post the video that shows the test
Doh2: the 8.7% figure is bogus! The optimized code takes 87% (see where the figure came from?) of the unoptimized code time, so the speedup is 13%. This is what happens when doing things in a hurry and while falling asleep... Last edited by saimo; 09 June 2023 at 13:29. |
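A small arithmetic aside on the two figures: if the optimized code takes 87% of the original time, the run time shrinks by 13%, while the amount of work done per unit of time grows by a little more (about 15%). The sketch below just spells out both conversions; the function names are made up for illustration.

```python
def time_saved_fraction(time_ratio):
    """Fraction of run time saved when new_time / old_time == time_ratio."""
    return 1.0 - time_ratio


def throughput_gain_fraction(time_ratio):
    """Extra work done per unit of time for the same ratio."""
    return 1.0 / time_ratio - 1.0


ratio = 0.87  # optimized time / unoptimized time, as reported in the post
print(f"time saved:      {time_saved_fraction(ratio):.1%}")
print(f"throughput gain: {throughput_gain_fraction(ratio):.1%}")
```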
09 June 2023, 15:06 | #11 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
@alain.treesong
OK, I have now had a look at your example sources. I chose SimpleCube.amos as the test case because the rendering part (which APEO ignores) is minimal and it uses some arrays (which get accelerated by APEO) in the inner loop.

CODE
I modified the code as follows:
Code:
Set Buffer 12
Rem Simple wire 3d cube
Rem
Rem Aghnar / Agima may 2022
Degree
NDP=8
CX=160
CY=128
CZ=256*5
Dim X(NDP),Y(NDP),Z(NDP)
Dim XE(NDP),YE(NDP)
Dim C(1024),S(1024)
'
Global X(),Y(),Z(),XE(),YE(),C(),S(),NDP,AX,AY,AZ,CX,CY,CZ
Global FI
'
Screen Open 0,320,256,2,Lowres
Screen Display 0,128,40,320,256
Paper 0 : Hide On : Flash Off : Curs Off : Cls
Palette $0,$666
' Simple horizontal line using copper to enhance the scene
Set Rainbow 0,0,16,"","",""
Rain(0,0)=$CCC
Rain(0,1)=$444
Rain(0,2)=$111
Rainbow 0,0,270,16
Ink 1 : Pen 1
Double Buffer : Autoback 0
' Trigo table and definition of the 8 points for the cube
For I=0 To 1023
   C(I)=Qcos(I,256) : S(I)=Qsin(I,256)
Next I
'
For I=1 To NDP
   Read X(I),Y(I),Z(I)
Next
'
Data -1,-1,-1
Data 1,-1,-1
Data 1,1,-1
Data -1,1,-1
Data -1,-1,1
Data 1,-1,1
Data 1,1,1
Data -1,1,1
'
AX=0
AY=0
AZ=0
' Timing start
TS=Timer
' Main loop : rotation on the 3 axis
Repeat
   Add AX,-1,0 To 1023
   Add AY,1,0 To 1023
   Add AZ,-1,0 To 1023
   RENDER_CUBE
   Screen Swap
   FI=Timer
Until AX=0
'Benchmark report
ET=Timer-TS+1
Print "elapsed time:";ET;" frames"
Print "speed: ";(1024*50.0)/ET;" fps"
Screen Swap
Wait Key
Procedure RENDER_CUBE
   CAX=C(AX)
   SAX=S(AX)
   CAY=C(AY)
   SIY=S(AY)
   CAZ=C(AZ)
   SAZ=S(AZ)
   ' Rotation and projection of the 8 points.
   ' A lot of optimizations are possible here :
   ' - inlining, no array (xe(1) replaced by xe1)
   ' - the fact that the initial x,y z values are 1 or -1
   ULC=10
   While ULC
      I=NDP
      While I
         ' rotation %X
         X=X(I)*256
         Y=Y(I)*CAX+Z(I)*SAX
         Z=-Y(I)*SAX+Z(I)*CAX
         '
         ' rotation %Y
         X2=X*CAY+Z*SIY
         Z=-X*SIY+Z*CAY
         '
         ' rotation %Z
         X2=X2/256
         X=X2*CAZ+Y*SAZ
         Y=-X2*SAZ+Y*CAZ
         '
         ' Projection
         D=CZ+Z/256
         XE(I)=CX+X/D
         YE(I)=CY+Y/D
         Dec I
      Wend
      Dec ULC
   Wend
   Repeat : Until Timer-FI
   ' Draw all 12 lines
   Blitter Clear 0,0
   Turbo Draw XE(2),YE(2) To XE(6),YE(6),1,1
   Turbo Draw XE(6),YE(6) To XE(5),YE(5),1,1
   Turbo Draw XE(5),YE(5) To XE(1),YE(1),1,1
   Turbo Draw XE(1),YE(1) To XE(2),YE(2),1,1
   Turbo Draw XE(5),YE(5) To XE(8),YE(8),1,1
   Turbo Draw XE(8),YE(8) To XE(7),YE(7),1,1
   Turbo Draw XE(7),YE(7) To XE(6),YE(6),1,1
   Turbo Draw XE(1),YE(1) To XE(4),YE(4),1,1
   Turbo Draw XE(4),YE(4) To XE(3),YE(3),1,1
   Turbo Draw XE(3),YE(3) To XE(2),YE(2),1,1
   Turbo Draw XE(3),YE(3) To XE(7),YE(7),1,1
   Turbo Draw XE(8),YE(8) To XE(4),YE(4),1,1
End Proc

I replaced
Code:
Screen Swap
Wait Vbl
with
Code:
Repeat : Until Timer-FI
<rendering code>
Screen Swap
FI=Timer
This makes it possible to use all the available CPU cycles and thus get the best performance possible on underpowered machines (Wait Vbl, instead, just wastes time doing nothing).
Then, given that the code already ran at 50 fps even on a stock A500, I forced the rotation and projection calculations to artificially repeat 10 times (While ULC... Wend loop).
Then, I replaced the inner For...Next with While...Wend because the former compiles terribly (it should always be replaced by While...Wend or Repeat...Until).
Finally, I added some code to measure the performance.
Attached is the bootable .adf with the test executables.

COMPILING AND OPTIMIZING
I compiled the code and then I created an optimized executable.
The result of the optimization was:
[screenshot]
That means that APEO:
* optimized the global routine that handles the accesses to arrays;
* optimized the Colour() and Colour routines (for some reason, the Compiler seems to always include them in the executables, even when, like in this case, they are not used);
* optimized two divisions by a power of 2 (the /256 in X2=X2/256 and D=CZ+Z/256; this optimization is not beneficial on 68000, though).
The While...Wend and Repeat...Until loops I added were not optimized because I wrote them in such a way that the Compiler already produces its best output.
Bootable .adf with the executables attached here.

BENCHMARKING
I ran the executables using a stock A500 configuration in WinUAE 5.0.0 and got these results:
* normal version: 4253 frames -> 12.038 fps;
* optimized version: 4089 frames -> 12.521 fps.
The optimized version took 164 frames = 3.28 seconds less. The gain is minimal (less than 4%), but, after all, there wasn't much to optimize in the first place. Anyway, a minimal gain is better than no gain |
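The benchmark figures can be re-derived from the two frame counts alone. The sketch below assumes, as the test program itself does, that Timer ticks 50 times per second and that the main loop runs 1024 iterations; the constant and function names are only illustrative.

```python
TICKS_PER_SECOND = 50   # PAL frame rate, as used by the test program
ITERATIONS = 1024       # main loop iterations (AX wraps 0..1023)


def fps(elapsed_ticks):
    """Rendered frames per second for a run that took elapsed_ticks timer ticks."""
    return ITERATIONS * TICKS_PER_SECOND / elapsed_ticks


normal_ticks, optimized_ticks = 4253, 4089   # figures from the post
saved_ticks = normal_ticks - optimized_ticks

print(f"normal:     {fps(normal_ticks):.3f} fps")
print(f"optimized:  {fps(optimized_ticks):.3f} fps")
print(f"time saved: {saved_ticks} ticks = {saved_ticks / TICKS_PER_SECOND:.2f} s "
      f"({saved_ticks / normal_ticks:.2%} of the original run)")
```

Measured as run time saved, the gain comes out just under 4%; expressed as extra frames per second, the same numbers give slightly over 4%.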
09 June 2023, 22:02 | #12 |
Aghnar
Join Date: Jan 2019
Location: France
Posts: 156
|
@saimo
Great. I'm not worried about the low optimization rate here because it is a single test. Some questions:
1. Arrays are very slow in AMOS, as you said, so I generally replace them with simple variables: XE(1) becomes, for example, xe1, xe(0) becomes xe0, etc. I sometimes use external parsers written in Java to do that, but it remains tedious and the generated code is verbose. I see arrays in the screenshot, but can your tool handle this case (replacing arrays when possible, or optimizing array access to be as efficient as using single variables)?
2. The Pro compiler (2.0) already claims that it optimizes mul/div by a power of two by replacing them with a logical shift. That is why I generally use 2^n when possible. Isn't that the case in the code produced by the AMOS Pro Compiler 2.0?
3. Thanks for the tips about While/Wend etc. In fact, the idea is generally to produce scenes at 50 fps, so Wait Vbl is enough. For slower scenes, though, it is indeed interesting.
4. Will you publish your great tool (if you haven't already)?
Very nice to talk about AMOS code in 2023 :-)
Edit 1: I suppose from the screenshot that you optimize Amreg(). This is a great idea because it is used a lot in games with sprites and bobs, and it is slow. Another thing that is slow is the (quite powerful) Rain command. Using big rains or multiple rains is very slow. I suppose this is because the computed copperlist is then slow to build. It would be cool, if not too difficult, to optimize that.
Edit 2: The 64k intro (yes! and the pixelated world) are compressed using Shrinkler, but the other ones are not compressed.
Last edited by alain.treesong; 09 June 2023 at 22:12. |
09 June 2023, 22:29 | #13 |
Phone Homer
Join Date: Jun 2006
Location: 5150
Posts: 5,850
|
Can you try this? I'm curious
https://eab.abime.net/showthread.php...11#post1122111 Also some people claim Amos The Creator Compiler produces faster executables - any thoughts on this? |
09 June 2023, 23:01 | #14 | |||||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
@alain.treesong
Quote:
For critical single-index arrays, it's best to use Areg(), Dreg() and Amreg() (if possible). Quote:
Code:
The assignment A(...)=... gets compiled as follows:

   <fetched/calculated value to assign gets put in d3>
   move.l   d3,-(a3)
   <fetched/calculated index of first dimension gets put in d3>
   move.l   d3,-(a3)
   ...
   <fetched/calculated index of last dimension gets put in d3>
   move.l   d3,-(a3)
   lea.l    *(a6),a0
   jsr      *(a4)
   move.l   (a3)+,(a0)

where:
* a0 ends up pointing to the address of the array descriptor;
* the code at *(a4) is the routine that performs the safety and type checks, and puts in a0 the address where the value is to be stored.

The array descriptor is (offset: content):
   0:   number of dimensions
   1:   log2(item size)
   2-3: maximum index of first dimension
   4-5: number of items in previous dimensions (for first dimension: 1)
   ...: ...
   ...: maximum index of last dimension
   ...: number of items in previous dimensions

The routine is:

   move.l   (a0),d0    ;2010      get array descriptor address
   beq.w    #$0024     ;6700 0024 if array undefined...
   movea.l  d0,a0      ;2040      get array descriptor address
   move.b   (a0)+,d3   ;1618      get number of dimensions
   move.b   (a0)+,d4   ;1818      get log2 of item size
   moveq.l  #0,d0      ;7000      clear high word
   moveq.l  #0,d2      ;7400      reset number of items to skip from array beginning
.l move.w   (a0)+,d0   ;3018      get maximum index
   move.l   (a3)+,d1   ;221b      get desired index
   cmp.l    d0,d1      ;b280      check index against maximum possible
   bhi.w    *          ;6200 **** if index too big...
   mulu.w   (a0)+,d1   ;c2d8      calculate number of items relative to previous dimensions
   add.l    d1,d2      ;d481      update number of items to skip
   subq.b   #1,d3      ;5303      check next dimension
   bne.b    .l         ;66ee      if dimensions not over...
   lsl.l    d4,d2      ;e9aa      calculate offset of item as index<<log2(item size)
   adda.l   d2,a0      ;d1c2      calculate address of item
   rts                 ;4e75

This code replaces the routine with:

   movea.l  (a0),a0    ;2050      get array descriptor address
   move.b   (a0)+,d3   ;1618      get number of dimensions
   move.b   (a0)+,d4   ;1818      get log2 of item size
   moveq.l  #0,d0      ;7000      clear high word
   moveq.l  #0,d2      ;7400      reset number of items to skip from array beginning
.l addq.w   #2,a0      ;5448      skip maximum index
   move.l   (a3)+,d1   ;221b      get desired index
   mulu.w   (a0)+,d1   ;c2d8      calculate number of items relative to previous dimensions
   add.l    d1,d2      ;d481      update number of items to skip
   subq.b   #1,d3      ;5303      check next dimension
   bne.b    .l         ;66f4      if dimensions not over...
   lsl.l    d4,d2      ;e9aa      calculate offset of item as index<<log2(item size)
   adda.l   d2,a0      ;d1c2      calculate address of item
   rts                 ;4e75
Quote:
Code:
   moveq.l  #<count>,d0
   asr/lsl.l d0,d3
Quote:
But thanks for your interest! Quote:
Code:
COMPILED
   move.l   #$80000000,d1   ;223c 8000 0000 set flag
   bsr.w    *               ;6100 ****      call address calculation routine
   move.w   (a0),d3         ;3610           read item value
   ext.l    d3              ;48c3           sign-extend value
   rts                      ;4e75

OPTIMIZED
   add.l    d3,d3           ;d683           calculate item offset
   lea.l    -$186e(a5),a0   ;41ed e792      calculate address of Amreg()
   move.w   (a0,d3.l),d3    ;3630 3800      read item value
   ext.l    d3              ;48c3           sign-extend value
   rts                      ;4e75
Quote:
Quote:
Last edited by saimo; 09 June 2023 at 23:50. |
|||||||
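The array routine in the post above boils down to a sum of index-times-stride products shifted by log2 of the item size, and the replacement code gains its speed by dropping the undefined-array test and the per-dimension range check. Here is a rough model of that trade-off, under stated assumptions: the tuple-based descriptor layout and the names `make_descriptor` and `element_offset` are invented for illustration, and the default 4-byte item size is a guess rather than something stated in the thread.

```python
def make_descriptor(max_indices, log2_item_size=2):
    """Build (log2 item size, [(max index, items in previous dimensions), ...])."""
    dims = []
    items_before = 1
    for max_index in max_indices:
        dims.append((max_index, items_before))
        items_before *= max_index + 1   # indices run from 0 to max_index inclusive
    return log2_item_size, dims


def element_offset(descriptor, indices, check_bounds=True):
    """Byte offset of an item; optionally skip the range check like the patched routine."""
    log2_item_size, dims = descriptor
    items_to_skip = 0
    for (max_index, items_before), index in zip(dims, indices):
        if check_bounds and not 0 <= index <= max_index:
            raise IndexError(f"index {index} outside 0..{max_index}")
        items_to_skip += index * items_before
    return items_to_skip << log2_item_size


cube = make_descriptor([3, 4])   # hypothetical two-dimensional array, 4 x 5 items
print(element_offset(cube, [2, 1]))
print(element_offset(cube, [2, 1], check_bounds=False))
```

The unchecked variant returns the same offsets for valid indices, but an out-of-range index silently yields an address outside the intended row instead of an error, so it only makes sense for code that is already known to stay within bounds.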
09 June 2023, 23:05 | #15 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
@Retro1234
Quote:
Quote:
|
||
09 June 2023, 23:10 | #16 |
Phone Homer
Join Date: Jun 2006
Location: 5150
Posts: 5,850
|
Yeah it was probably compressed I'll see if I can find the source, thanks
|
09 June 2023, 23:18 | #17 |
Phone Homer
Join Date: Jun 2006
Location: 5150
Posts: 5,850
|
I started work on a program to convert Amos to Blitz but I never finished it. Blitz in general is "faster" at blitting Bobs.
|
09 June 2023, 23:49 | #18 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
@alain.treesong
Quote:
I don't think the optimizations make any difference: the demos don't seem to do anything CPU-intensive and are probably frame-locked. Notes: * I noticed the programs can be broken with CTRL-C: if you need speed, use Comp Test Off; * I had to remove the original executable from the NewImpact ADF as there wasn't enough space. |
||
09 June 2023, 23:56 | #19 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,667
|
Could AMOS be transpiled to C? I appreciate that one can't just magically do this without a runtime library to provide equivalent functionality for all the graphics and audio features that the language provides out of the box, but in principle, is there anything about the language, some fundamental impedance mismatch, that prevents automated conversion to C?
|
10 June 2023, 10:08 | #20 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 862
|
Quote:
Last edited by saimo; 10 June 2023 at 12:44. Reason: Fixed typo. |
|