68000 code optimisations - Page 9

ross · 27 April 2019, 01:35

Quote:

Originally Posted by PeterJ

and it uses a lot of lsr #8 and lsl #8, so the above is just perfect

For lsr #8 the trick is similar:

Code:

    ; d0.w = xx00
    moveq   #0,d1
    ....

    move.w  d0,-(sp)
    move.b  (sp)+,d1
    ....

    ; d1.w = 00xx

In this case you can use the stack (so no spare A register or memory is needed), but you need a spare D register and you must never touch its upper bits.
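A worked illustration of the trick above, as a sketch only: the value $1234 is just a sample input, and the cycle counts in the comments are the usual plain-68000 figures, worth re-checking against a timing table.

```asm
    moveq   #0,d1           ; d1.l = $00000000, done once outside the hot code

    ; d0.w = $1234 (sample value)
    move.w  d0,-(sp)        ; push the word; its high byte $12 now sits at (sp)  (8 cycles)
    move.b  (sp)+,d1        ; read that byte; a byte access through a7 still
                            ; steps sp by 2, so the stack is balanced again      (8 cycles)
    ; d1.l = $00000012, the same result as lsr.w #8 (22 cycles on a 68000)
```

The 68000 is big-endian, so the byte at the lowest address of the pushed word is the high byte, which is why the low byte of d0 never matters.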

NorthWay · 27 April 2019, 05:51

[lsl #8]
My personal preference is to start the program with "clr.l -(sp)" and match it before the end with "move.l (sp)+,d0", and then use pairs of
move.b dX,(sp)
move.w (sp),dX
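A sketch of how that pairing might look inside one routine. The label, the sample value and the throwaway register d1 are hypothetical, and the cycle counts are the usual plain-68000 figures.

```asm
ShiftDemo:                  ; hypothetical routine name
    clr.l   -(sp)           ; reserve a zeroed scratch long on the stack

    ; d0.w = $0012 (sample value)
    move.b  d0,(sp)         ; low byte $12 lands in the high byte of the
                            ; scratch word, which is now $1200           (8 cycles)
    move.w  (sp),d0         ; d0.w = $1200, same as lsl.w #8 (22 cycles) (8 cycles)

    move.l  (sp)+,d1        ; release the scratch long (d1 is a throwaway here)
    rts
```

The low byte of the scratch word has to stay zero, so only byte writes to (sp) are allowed while the scratch long is in use.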

PeterJ · 27 April 2019, 09:15

Quote:

Originally Posted by ross

For lsr #8 the trick is similar:

Code:

    ; d0.w = xx00
    moveq   #0,d1
    ....

    move.w  d0,-(sp)
    move.b  (sp)+,d1
    ....

    ; d1.w = 00xx

In this case you can use the stack (so no spare A register or memory is needed), but you need a spare D register and you must never touch its upper bits.

I just tried with $ff56 and the result was $ff.

Isn't it only with movem.w that the upper word is cleared or set depending on bit 15?

ross · 27 April 2019, 10:21

Quote:

Originally Posted by NorthWay

[lsl #8]
My personal preference is to start the program with "clr.l -(sp)" and match it before the end with "move.l (sp)+,d0", and then use pairs of
move.b dX,(sp)
move.w (sp),dX

You just have to be careful not to use it in a nested routine, so your sentence should read "to start the subroutine with".

Quote:

Originally Posted by PeterJ

I just tried with $ff56 and the result was $ff.

As it should (I simply wrote d0=$xx00 because the low bits are lost anyway, so they can be anything).
But judging from your next sentence, didn't you mean the asr instruction?

Quote:

Isn't it only with movem.w that the upper word is cleared or set depending on bit 15?

Regardless, movem deals with words (or longs) and never with bytes.

hooverphonique · 29 April 2019, 12:36

Quote:

Originally Posted by PeterJ

and it uses a lot of lsr #8 and lsl #8, so the above is just perfect

If they are used in hot code, maybe the solution is to refactor the need for these shifts away completely.

NorthWay · 14 August 2019, 20:51

Quote:

Originally Posted by NorthWay

once upon a time there was a thing called the GNU(gcc?) super-optimizer

I found this reference to it: https://courses.cs.washington.edu/co...s/massalin.pdf

Photon · 17 August 2019, 00:08

Quote:

Originally Posted by NorthWay

I found this reference to it: https://courses.cs.washington.edu/co...s/massalin.pdf

A factor from my experience is that optimized C runs 10x slower than optimized Assembly, and that C++ cannot be fully reduced to C for an application. (This is up to about 68020/80386, where Pascal actually had a lower factor than C in some cases. Since C became popular this may have changed, but I have not measured it.)

Later, on-chip caches affected performance more than the number and time length of instructions, and this allowed utility applications to not be bogged down and reduce this factor.

But even after this hardware acceleration (circa 1990), applications such as games and demos would never use C (or C++) in time-critical sections for another half a decade, as we know.

It's true to this day that any high-level language (or one posing as such!) will always be beaten by a great margin by "simply" writing the program in Assembly. (The advantage of truly portable languages is of course the portability and less code to write, if you're not using macros.)

All this to make clear that there is no language level higher than Assembly that will ever generate as efficient (or small) code as writing it in Assembly.

It's self-evident. But just to give factors for the performance loss paid. The compiler doesn't know what you're trying to do, so it can't deliver the perfect translation.

sparhawk · 04 February 2020, 15:19

Maybe there is a faster way to clear the upper word of a register?

Replace this (16 cycles):

Code:

    and.l   #$ffff,d0

With this (12 cycles):

Code:

    moveq   #0,d1
    move.w  d0,d1
    move.l  d1,d0

I try to avoid (if possible) the second move by arranging the registers appropriately, in which case the count would go down to 8 cycles.

Also 12 cycles, but only one register needed:

Code:

    swap    d0
    clr.w   d0
    swap    d0

ross · 04 February 2020, 16:51

Quote:

Originally Posted by sparhawk

Maybe there is a faster way to clear the upper word of a register?

If you know for sure that it's a positive value, you can use ext.l dx (4 cycles).

But usually I zero the upper part of a register outside the main loop and then move data only into its lower part.
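A minimal sketch of that habit, assuming a hypothetical setup where a0 points to a table of unsigned words, d0.w holds the count minus one and d2.l is a running sum (none of these names come from the thread):

```asm
SumWords:                   ; hypothetical routine name
    moveq   #0,d1           ; clear all 32 bits of d1 once, outside the loop
.sum:
    move.w  (a0)+,d1        ; only the low word changes, so d1.l stays zero-extended
    add.l   d1,d2           ; d1 can be used as a long without and.l or swap tricks
    dbf     d0,.sum
    rts
```

This only holds as long as nothing inside the loop writes to d1 with a long-sized or sign-extending instruction.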

sparhawk · 04 February 2020, 16:53

Quote:

Originally Posted by ross

If you know for sure that it's a positive value, you can use ext.l dx (4 cycles).

Yes, that would be the obvious solution.

But it depends, so in the general case, I can't know that.
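To illustrate why the sign matters here, a short sketch with two sample values (chosen only for illustration):

```asm
    move.w  #$1234,d0
    ext.l   d0              ; bit 15 clear: d0.l = $00001234, upper word cleared
    move.w  #$8001,d0
    ext.l   d0              ; bit 15 set:   d0.l = $ffff8001, upper word NOT cleared
```

ext.l copies bit 15 into the whole upper word, so it only doubles as a "clear upper word" when the word is known to be in the range $0000-$7fff.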

I usually do a lot of prototyping in Easy68k and see if I can find faster solutions as it tells me the cycle count, which is IMO a great feature for that.

Antiriad_UK · 21 February 2020, 14:03

One I saw in another thread, which made me scratch my head for a while, is using add dx,dx to simultaneously test and clear a "flag". For example, you have a loop where you set a flag from 0 to 1 if something occurred. Then at the end of the loop you check the flag to see if you need to loop again, and reset the flag (I did this in a sorting routine).

Instead of:

Code:

.loop:
	moveq	#0,d0			;reset flag
	...
	;If something occurred, flag it
	moveq	#1,d0
	...
	;Do we need to loop again?
	tst.w	d0
	bne.s	.loop

Do this:

Code:

	moveq	#0,d0			;reset flag once
.loop:
	...
	;If something occurred, flag it
	moveq	#-128,d0		;set flag = $80 ($ffffff80)
	...
	;Do we need to loop again? Also reset flag
	add.b	d0,d0
	bcs.s	.loop

You can do the same thing with subq and bmi, but I liked the use of carry.

hooverphonique · 21 February 2020, 14:50

Quote:

Originally Posted by Antiriad_UK

Code:

    moveq    #0,d0            ;reset flag once
.loop:
    ...
    ;If something occurred, flag it
    moveq    #-128,d0        ;set flag = $80 ($ffffff80)
    ...
    ;Do we need to loop again? Also reset flag
    add.b    d0,d0
    bcs.s    .loop

I stared at this for a while without seeing how it would reset the flag (carry), but I suppose you meant "reset" in the sense of returning d0 to zero?

Pixelfill · 21 February 2020, 15:00

Quote:

Originally Posted by hooverphonique

I stared at this for a while without seeing how it would reset the flag (carry), but I suppose you meant "reset" in the sense of returning d0 to zero?

I may be wide of the mark here, but I'm guessing that adding the bytes -128 and -128 results in -256, which overflows beyond a byte (carry) and also leaves d0.b set to 00?
The key part, I believe, is not the add.b but the fact that d0 contains $xxxxxx80 beforehand from the moveq.

forgive me if I'm wrong as I'm just starting out.

Mike

Antiriad_UK · 21 February 2020, 15:05

Quote:

Originally Posted by hooverphonique

I stared at this for a while without seeing how it would reset the flag (carry), but I suppose you meant "reset" in the sense of returning d0 to zero?

Yes, it resets d0.b to 0, so you can save a whole 4 cycles per loop.

hooverphonique · 21 February 2020, 15:42

Quote:

Originally Posted by Pixelfill

I may be wide of the mark here, but I'm guessing that adding the bytes -128 and -128 results in -256, which overflows beyond a byte (carry) and also leaves d0.b set to 00?
The key part, I believe, is not the add.b but the fact that d0 contains $xxxxxx80 beforehand from the moveq.

Yes, you're right, and that's also where I arrived, hence the last part of my previous comment.
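The arithmetic discussed above can be traced in both cases. This is only a demo sketch with hypothetical labels, not code from the original sorting routine:

```asm
FlagDemo:                   ; hypothetical label
    moveq   #0,d0           ; d0.b = $00: flag clear
    add.b   d0,d0           ; $00 + $00 = $00, carry clear
    bcs.s   .wasSet         ; not taken

    moveq   #-128,d0        ; d0.l = $ffffff80: flag set
    add.b   d0,d0           ; $80 + $80 = $100: byte result $00, carry set
    bcs.s   .wasSet         ; taken; d0.l is now $ffffff00, so d0.b is 0 again
    rts                     ; never reached in this demo
.wasSet:
    rts
```

Only the low byte returns to zero; the upper 24 bits keep the $ffffff left by moveq, so the flag should only ever be read as d0.b.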

VladR · 02 March 2020, 16:05

Quote:

Originally Posted by Photon

A factor from my experience is that optimized C runs 10x slower than optimized Assembly, and that C++ cannot be fully reduced to C for an application. (This is up to about 68020/80386, where Pascal actually had a lower factor than C in some cases. Since C became popular this may have changed, but I have not measured it.)

I hear you. I myself have spent a long time on the Jaguar, being dumbfounded after each build (there is a linker option to view the resulting ASM) by the pages of ASM code (per 1-2 lines of C code) generated by the C compiler.

Quote:

Originally Posted by Photon

It's true to this day that any high-level language (or one posing as such!) will always be beaten by a great margin by "simply" writing the program in Assembly. (The advantage of truly portable languages is of course the portability and less code to write, if you're not using macros.)

Unfortunately, that is not true anymore.

I burnt through a great heap of money during the last two years on designing Higgs (current estimate is between $150,000 and $200,000, rising daily as I keep working on it alongside the game, so keep this in mind before you ask for a free download: it's literally like asking me to donate an average American house). Higgs is slightly lower-level than C, but I designed it to be identical in speed to hand-written ASM.

Current features are:
- full access to all registers and ASM instructions
- choice of WorkingRegister to use by Higgs if the feature requires it
- global/local variables/constants
- byte/word/long access via .bwl (default is long, so no need to specify .l)
- arrays
- structures
- typecasting (word/long)
- conditions
- loops (continue + break)
- blocks {} allowing to pollute the name-space only within the current block
- debug printing
- function declarations with parameters (your choice of registers or global or local variables)
- function call with or without parameters
- local functions (invisible to outside world) like in Pascal
- push/pop stack syntax
- basic math operations (signed var1 = var2 * var3), (var3 += var1)

All of the features above are possible to implement (Higgs is written in C#) with the exact same instruction footprint as if you wrote it manually in Asm.

Some common C features like switch or do-while are high on my to-do list. I somehow managed to write the game without them, to my surprise, so they simply didn't get implemented yet.
On-Demand Inlining (i.e. only when you want it, though you can still force it to always inline) is in my Top 5. On the 6502 target I have an Unroll Loop; this still needs to be implemented for the 68000 target.

You still have to think in terms of byte/word/long access and still have to prefer registers to variables (but don't have to if you don't feel like it). You are solely responsible for contents of registers, but if you want - you have an option to code using just variables.

Primarily, this targets .68000.
Most of the features are also implemented for a RISC backend (the Jaguar's GPU and DSP processors). I also have .6502 and .6502C targets (though those are currently the simplest).

Once networking gets enabled in the core for Vampire (and I can start deploying builds to my V4), I will make a .68080 target, eventually with AMMX support.

Quick example:

Code:


 ; Arrays of structures are supposed to be accessed sequentially
 ; each time you simply advance the pointer via Next () which is a simple add.l #StructSizeOf,ptrStruct


  array SLaserShot LaserShots [MaxLaserShots]   ; Player's lasers
  SLaserShot.UseRegister (a2) ; Use this register for access



 Animate_LaserShots:
 {
     ; Animate (localZ + camY) Already Active LaserShots
    { ; Player's LS
         register d7:lpMain
          ; Keep d1 as WorldSpeed, since SLaserShot_UpdateZ requires d1 as input
         register d1:WorldSpeed  ; PlayerSpeed + LS_Speed
         register d2:CurrentPlayerSpeed
         CurrentPlayerSpeed = PlayerSpeed >> #3

         SLaserShot.InitRegister (LaserShots)
         loop (lpMain = #MaxLaserShots)
         {
             WorldSpeed = CurrentPlayerSpeed + SLaserShot.Speed
           ; print2H (SLaserShot.camY,SLaserShot.camZ,#110,#50)
             if.l (SLaserShot.IsActive == #1)
             {
                 if.l (SLaserShot.FrameDeactivate <= Frame:d0)
                 { ; Disable LS if it travelled too far
                     SLaserShot.IsActive = #0
                 }
                 else
                 { ; LS can still remain active
                      SLaserShot_UpdateZ () ; Update Z
                      SLaserShot_UpdateY () ; Update Y (after Z, so it is sync'ed)
                 }
             }
             SLaserShot.Next ()
       }
    }
 rts
 }

Quote:

Originally Posted by Photon

All this to make clear that there is no language level higher than Assembly that will ever generate as efficient (or small) code as writing it in Assembly.

It's self-evident. But just to give factors for the performance loss paid. The compiler doesn't know what you're trying to do, so it can't deliver the perfect translation.

Not true for my Higgs.
Granted, it's lower level than C as it's not supposed to be completely safe and idiot-proof, like C is.

But it's infinitely easier to add/remove Higgs code compared to ASM. The mental effort required for pure ASM (nested irregular conditions, etc.) makes it hard to simply discard the code you wrote. In Higgs, I don't even think about that: I simply delete the code and rewrite it from scratch. Let the compiler insert all the jump labels and figure out the proper comparison/Bcc instruction based on the parameters.

Quote:

Originally Posted by Photon

It's self-evident. But just to give factors for the performance loss paid. The compiler doesn't know what you're trying to do, so it can't deliver the perfect translation.

Real-world example of my Higgs.
On Atari Jaguar, 98% of code was written for 68000 and only 4 KB in RISC (3D transform and rasterizer loop).

So, quite literally, everything else is 68000. That's:
- input,
- Z-sorting,
- culling World track mesh,
- double-buffering,
- creating doublebuffered polygon list for RISC GPU,
- strafing physics,
- collision detection,
- camera,
- full 8-state AI,
- spawning enemies,
- procedural random generation of enemy RPG parameters,
- HUD,
- managing Jaguar's ObjectProcessor list (and related IRQ),
- damage equations.
And about two dozen things I didn't think of right this moment.

On the Jag, about 90% of that was rewritten into Higgs (100% on Amiga): I started with 100% ASM and gradually rewrote more parts as I kept adding Higgs features. Yet benchmarks showed that it took only 10% of frame time on the 13.3 MHz 68000.

Meaning, I could still run the full game logic ten times per frame and keep 60 fps. So, even if the Motorola were 10x slower at just about 1.3 MHz, it should still fit within a frame time. Now that's funny.

phx · 03 March 2020, 15:11

Quote:

Originally Posted by VladR

during the last two years on designing Higgs, which is slightly lower-level than C, but I designed it to be identical in speed to hand-written ASM.

Higgs looks like a really interesting high-level assembler language, which might be useful to speed up development.

But to claim that it reaches identical speed to hand-optimized assembler cannot be true, so I have to defend Photon's statement here. You always have to make compromises when translating a high-level (even the lowest high-level) language into machine code. Give me a program generated by Higgs and I (and many other coders here) will always be able to show you sequences which allow optimization.

VladR · 03 March 2020, 20:17

Quote:

Originally Posted by phx

Higgs looks like a really interesting high-level assembler language, which might be useful to speed up development.

But to claim that it reaches identical speed to hand-optimized assembler cannot be true, so I have to defend Photon's statement here. You always have to make compromises when translating a high-level (even the lowest high-level) language into machine code. Give me a program generated by Higgs and I (and many other coders here) will always be able to show you sequences which allow optimization.

I really like the term HighLevel Assembler - after all, the baseline is the vasm source code file, where the Higgs Parser merely inserts new lines.

That's how it started - first with macros, then macro modifications at compile-time, and eventually parsing expressions and simple commands (loops, conditions, blocks, etc.).

Yeah, I probably wouldn't use the term "hand-optimized". Rather, I use "hand-written", meaning the same efficiency as if I wrote it by hand in ASM (though it is certainly possible to write a slightly faster version, if you are willing to bastardize the code to the point it's unreadable later).

It's always possible, in ASM, to rearrange and rewrite certain combinations of instructions to save some cycles (as this thread has demonstrated probably dozens of times).

But that creates unmaintainable code (long-term). You save 4 cycles by abusing some fluke register dependency, and when you need to change the code, boom. You burn half a day debugging what is going on.

I'm sure we all did the same thing:
- you write version 1 - it works, it is nicely documented or even self-documented
- you spot something, make version 2 and it saves some cycles
- you do the same and have version 3
- 3 months later you make some change elsewhere that breaks some of the dependencies brought by optimizations (because you now use higher 16 bits or whatever else it is).

Now, it is possible to implement a final Optimizer pass that would go over the code, examine the register status and replace certain combinations of ops with different, faster ones (like those mentioned in this thread).
That would indeed be useful for the 68000, but since I now focus on Vampire and 68040-68060, it's not really critical for me.

That raises the question: is there already some kind of optimizer like this for the 68000? Something that would analyse the code and find combos of ops that are safe to replace with faster ones?

StingRay · 03 March 2020, 20:38

Quote:

Originally Posted by VladR

(though, it is certainly possible to write a slightly faster version, if you are willing to bastardize the code to the point it's unreadable later).

Optimising doesn't necessarily equal unreadable!

Quote:

Originally Posted by VladR

But, that creates unmaintainable code (long-term).

And neither does it mean the code is unmaintainable.

mr.spiv · 03 March 2020, 20:55

Quote:

Originally Posted by VladR

But that creates unmaintainable code (long-term). You save 4 cycles by abusing some fluke register dependency, and when you need to change the code, boom. You burn half a day debugging what is going on.

Somehow I found myself here

