Fastest way to multiply by 200?

mcgeezer · 27 July 2021, 22:17

Hi all,

I'm not too familiar with exact cycle timings, but I'm looking for a fast way to multiply a number by 200. The result will always fit within 64 KB, i.e. in a 16-bit word.

At the moment I have:

Code:

	lea	.mulu200(pc),a3
	add.w	d2,d2
	move.w	(a3,d2.w),d2
.
.
.

.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr

I'm assuming this will be faster than a mulu #200,d2 which I recall takes something like 70 cycles?

Any faster ways you guys can think of?

Thanks,
Graeme

a/b · 27 July 2021, 22:46

Assuming 68000... mulu.w #200,dx is: 200=c8=11001000, 38+4+6=48 cycles.
If you break it down into shifts and adds it should be around 40 cycles.
Your table approach is 12+4+14=30 cycles worst case (meaning you execute all 3 instructions each time). Can be made 28 if you can 64kb align the table (move.l dx,ax + move.w (ax),dx is 12 cycles vs. 14 cycles).
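As a rough cross-check of the 48-cycle figure: on a plain 68000, MULU takes 38 + 2n cycles, where n is the number of 1 bits in the 16-bit source operand, plus 4 cycles to fetch the immediate word. The sketch below is a Python model of that arithmetic, not 68000 code. The function name `mulu_imm_cycles` is made up for illustration, and bus contention is assumed to be absent.

```python
def mulu_imm_cycles(multiplier: int) -> int:
    """Cycle count for `mulu.w #multiplier,dn` on a plain 68000.

    Base cost is 38 + 2n cycles (n = number of 1 bits in the 16-bit
    source operand) plus 4 cycles for the immediate extension word.
    Assumption: no DMA contention and no wait states.
    """
    if not 0 <= multiplier <= 0xFFFF:
        raise ValueError("multiplier must fit in 16 bits")
    ones = bin(multiplier).count("1")
    return 38 + 2 * ones + 4


print(mulu_imm_cycles(200))     # 200 = %11001000, three 1 bits
print(mulu_imm_cycles(0xFFFF))  # worst case, sixteen 1 bits
```

The "something like 70 cycles" recalled in the first post matches the worst case of the base cost (38 + 2*16 = 70) before the immediate fetch is added.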

Don_Adan · 27 July 2021, 22:58

The table version will be the fastest on 68000. I would use just something like this:

Code:

 add.w d2,d2
 move.w .mulu200(PC,D2.W),D2

mcgeezer · 27 July 2021, 23:12

Quote:

Originally Posted by Don_Adan

The table version will be the fastest on 68000. I would use just something like this:

Code:

 add.w d2,d2
 move.w .mulu200(PC,D2.W),D2

Thanks - indeed this is 68000

That code won't assemble as I'm getting an out of range displacement.

Don_Adan · 28 July 2021, 00:39

The best approach is to place this routine at the end of your code. I always placed the sample mixing routine at the end of my players. With the PC-relative indexed mode, the table must start within about 126 bytes of the instruction. Alternatively, you can use a register as the table base. I don't do that, because it wastes a register in a critical routine. In my mixing routine I actually used PC-relative addressing for two tables, not just one.

bebbo · 28 July 2021, 07:46

Quote:

Originally Posted by mcgeezer

Thanks - indeed this is 68000

That code won't assemble as I'm getting an out of range displacement.

The displacement must fit into one signed byte, so the code needs to be very close to the table.
Cycles used: 4 + 14 = 18

The lea approach works within a +/-32 KB range.
Cycles used: 8 + 4 + 14 = 26

If you end up putting that code into a subroutine...
... shift/add is faster (40 cycles), since bsr/rts eats up the advantage.
The shift/add code can also be inlined everywhere. OK, you need a scratch register...

... so if that is not available or size matters: use the mul ^^
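On the displacement limits mentioned above: `d8(PC,Xn)` carries a signed 8-bit displacement and `d16(PC)` a signed 16-bit one, both measured from the address of the extension word (instruction address + 2). Below is a minimal Python sketch of that range check. It is a model only; the function names and the example addresses are hypothetical.

```python
def pc_displacement(instr_addr: int, target_addr: int) -> int:
    """Displacement the assembler must encode for a PC-relative operand.

    The 68000 uses the address of the extension word as the base,
    which is the instruction address + 2.
    """
    return target_addr - (instr_addr + 2)


def fits_pc_indexed(instr_addr: int, target_addr: int) -> bool:
    """d8(PC,Xn): signed 8-bit displacement, as in move.w tab(pc,d2.w),d2."""
    return -128 <= pc_displacement(instr_addr, target_addr) <= 127


def fits_pc_disp16(instr_addr: int, target_addr: int) -> bool:
    """d16(PC): signed 16-bit displacement, as in lea tab(pc),a3."""
    return -32768 <= pc_displacement(instr_addr, target_addr) <= 32767


# Hypothetical layout: instruction at $1000, table 92 or 300 bytes later.
print(fits_pc_indexed(0x1000, 0x1000 + 92))   # close enough for d8(PC,Xn)
print(fits_pc_indexed(0x1000, 0x1000 + 300))  # too far for d8(PC,Xn)
print(fits_pc_disp16(0x1000, 0x1000 + 300))   # still fine for lea d16(PC)
```

Only the table's start address has to be in range. The index register then reaches the rest of the table.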

meynaf · 28 July 2021, 08:41

For reference, shift and add version :

Code:

 move.w d0,d1
 add.w d0,d0
 add.w d0,d1
 lsl.w #2,d0
 lsl.w #6,d1
 add.w d1,d0

bebbo · 28 July 2021, 10:06

Quote:

Originally Posted by meynaf

For reference, shift and add version :

Code:

 move.w d0,d1
 add.w d0,d0
 add.w d0,d1
 lsl.w #2,d0
 lsl.w #6,d1
 add.w d1,d0

that's 44 cycles

faster is

Code:

lsl.w #3,d2 ; *8     6+6
move.w d2,d3           4
lsl.w #3,d3 ; *64    6+6
add.w d3,d2    ; *72    4
add.w d3,d2    ; *136   4
add.w d3,d2    ; *200   4

40 cycles
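The 44 and 40 cycle totals can be tallied from the per-instruction costs: 4 cycles for a register-to-register `move.w` or `add.w`, and 6 + 2n cycles for `lsl.w #n,dn`. Below is a small Python tally of both sequences. It is a model, not 68000 code, and assumes no wait states; the list names are just labels for the two posts.

```python
def lsl_w(count: int) -> int:
    """Cycles for lsl.w #count,dn on a plain 68000: 6 + 2 per bit shifted."""
    return 6 + 2 * count


MOVE_W = 4  # move.w dn,dm
ADD_W = 4   # add.w dn,dm

# move.w / add.w / add.w / lsl.w #2 / lsl.w #6 / add.w
meynaf_cycles = [MOVE_W, ADD_W, ADD_W, lsl_w(2), lsl_w(6), ADD_W]

# lsl.w #3 / move.w / lsl.w #3 / add.w / add.w / add.w
bebbo_cycles = [lsl_w(3), MOVE_W, lsl_w(3), ADD_W, ADD_W, ADD_W]

print(sum(meynaf_cycles), sum(bebbo_cycles))
```

The saving comes from replacing the long `lsl.w #6` (18 cycles) with two short shifts that reuse the intermediate *8 value.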

Don_Adan · 28 July 2021, 11:17

In some code (I haven't seen yours) it is possible, without any speed penalty, to change the input value in D2 from 0,1,2,3,4,5... to 0,2,4,6,8,10 etc. Then the add.w d2,d2 can be removed, and only the move.w from the table is needed.
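To illustrate the pre-doubled index: the table holds 16-bit words, so a y value stored as 0, 2, 4, ... is already the byte offset of its entry. The sketch below is a Python model of the lookup, not 68000 code. The names `table` and `lookup` are made up; the 256-entry size comes from the `rept 256` in the first post.

```python
import struct

ROWS = 256  # matches the rept 256 table in the first post

# Big-endian 16-bit words, the same layout as dc.w REPTN*200 on the 68000.
table = b"".join(struct.pack(">H", row * 200) for row in range(ROWS))


def lookup(byte_offset: int) -> int:
    """Model of move.w .mulu200(pc,d2.w),d2 with d2 already holding y*2."""
    return struct.unpack_from(">H", table, byte_offset)[0]


y_doubled = 2 * 37         # the caller keeps y as 0, 2, 4, ...
print(lookup(y_doubled))   # same value as 37 * 200
print(len(table))          # table size in bytes
```

The largest entry is 255 * 200 = 51000, which still fits in an unsigned 16-bit word.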

pink^abyss · 28 July 2021, 17:47

If you run the code while the blitter is active in parallel or a lot of bitplane DMA is going on then it may be fastest to simply do the mul (on 68k without fastmem).

Gorf · 30 July 2021, 02:41

Quote:

Originally Posted by bebbo

that's 44 cycles

faster is

Code:

lsl.w #3,d2 ; *8     6+6
move.w d2,d3           4
lsl.w #3,d3 ; *64    6+6
add.w d3,d2    ; *72    4
add.w d3,d2    ; *136   4
add.w d3,d2    ; *200   4

40 cycles

or:

Code:

move.w d2,d3           4
lsl.w #3,d3 ; *8     6+6
add.w d3,d2 ; *9       4
add.w d3,d3 ; *16      4
add.w d3,d2 ; *25      4
lsl.w #3,d2 ; *200   6+6

same
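All three shift/add sequences posted so far can be checked against a straight multiply. Below is a Python model of each one, with 16-bit wraparound standing in for the `.w` operations. It is not 68000 code; the function names simply refer to the posters.

```python
MASK = 0xFFFF  # registers are treated as 16-bit words (.w operations)


def meynaf(x: int) -> int:
    d0 = x & MASK
    d1 = d0                   # move.w d0,d1
    d0 = (d0 + d0) & MASK     # add.w d0,d0   -> 2x
    d1 = (d1 + d0) & MASK     # add.w d0,d1   -> 3x
    d0 = (d0 << 2) & MASK     # lsl.w #2,d0   -> 8x
    d1 = (d1 << 6) & MASK     # lsl.w #6,d1   -> 192x
    return (d0 + d1) & MASK   # add.w d1,d0   -> 200x


def bebbo(x: int) -> int:
    d2 = (x << 3) & MASK      # lsl.w #3,d2   -> 8x
    d3 = d2                   # move.w d2,d3
    d3 = (d3 << 3) & MASK     # lsl.w #3,d3   -> 64x
    for _ in range(3):        # three add.w d3,d2 -> 72x, 136x, 200x
        d2 = (d2 + d3) & MASK
    return d2


def gorf(x: int) -> int:
    d2 = x & MASK
    d3 = d2                   # move.w d2,d3
    d3 = (d3 << 3) & MASK     # lsl.w #3,d3   -> 8x
    d2 = (d2 + d3) & MASK     # add.w d3,d2   -> 9x
    d3 = (d3 + d3) & MASK     # add.w d3,d3   -> 16x
    d2 = (d2 + d3) & MASK     # add.w d3,d2   -> 25x
    return (d2 << 3) & MASK   # lsl.w #3,d2   -> 200x


# Every row of a 256-line screen gives the same result as x * 200.
for x in range(256):
    assert meynaf(x) == bebbo(x) == gorf(x) == x * 200
print("all three sequences agree for 0..255")
```

meynaf's and bebbo's versions both build 200 as 8 + 192, while Gorf's builds 25 = 9 + 16 first and shifts last.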

mcgeezer · 30 July 2021, 11:18

OK thanks for the replies guys...

I did in the end settle on this - it's basically a blitter routine which reconstructs bob parts (for sprites). The idea is to save memory and cycles, so for example instead of blitting a 64x80 bob (player character) which may contain a lot of blank space I'm only blitting those that have pixels in them.

Screen size is 320x256 in 5 bitplanes, hence the 200-byte line width (320 / 8 = 40 bytes per plane, times 5 interleaved planes). Bobs are interleaved.
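The 200 comes from 320 pixels / 8 = 40 bytes per plane line, times 5 interleaved planes. Below is a minimal Python sketch of the resulting byte offset, mirroring the `and.w #$fff0` / `lsr.w #3` steps of the routine. It is a model only, and the constant and function names are made up for illustration.

```python
SCREEN_WIDTH_PX = 320
BITPLANES = 5
BYTES_PER_PLANE_LINE = SCREEN_WIDTH_PX // 8      # 40 bytes
LINE_STRIDE = BYTES_PER_PLANE_LINE * BITPLANES   # 200 bytes, interleaved


def plot_offset(x: int, y: int) -> int:
    """Byte offset of the word containing pixel (x, y) in plane 0."""
    word_aligned_x = (x & 0xFFF0) >> 3   # and.w #$fff0,d1 / lsr.w #3,d1
    return y * LINE_STRIDE + word_aligned_x


print(LINE_STRIDE)
print(plot_offset(35, 10))    # row 10, third word of the line
print(plot_offset(319, 255))  # largest offset on a 320x256 screen
```

The largest offset stays below 65536, which is why a 16-bit table entry is enough.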

I haven't optimised the code segments yet - normally I have a6 pointing into a data segment but will do that soon to speed it up.

Code:

; d0 = Frame Number
; d1 = xpos
; d2 = ypos
; a0 = Sprite Sheet struct
; a1 = Screen struct
; a2 = Restore pointer a3
agdPlotFrame:
	movem.l	d0-d2/a0-a1,-(a7)
	
; Get the frame structure.
	lea	SPRITES_FRAMES,a0
	add.w	d0,d0
	add.w	d0,d0
	move.l	(a0,d0),a0		
	
	move.l	hScreenPointers(a1),a1
	
; This would be the point for entry to the routine.	
	add.w	2(a0),d1		; add x offset for this frame
	add.w	4(a0),d2		; add y offset for this frame
	
	lea	.mulu200(pc),a3
	add.w	d2,d2
	move.w	(a3,d2.w),d2

	move.w	d1,d4			; Make a copy of the xpos to d4	
	and.w	#$fff0,d1		; Get the Xposition nearest word position
	lsr.w	#3,d1			; d1 now has nearest word
	add.l	d1,d2			; d2=Byte position

	ror.w	#4,d4			; Barrel Shift amount for BLTCON0 Source Mask
	clr.b	d4
	move.w	d4,d5			
	or.w	#$fca,d4		; We want Source A,B & C with D = $F, and Cookie Cut $CA = $FCA
	
	move.l	d2,d3			; save plot offset
	move.w	d4,d6			; save bltcon0 value
	move.w	d5,d7			; save bltcon1 value
	
	move.l	#16,a3

.part:
	move.l	d3,d2			; restore offset
	move.w	d6,d4			; restore bltcon0
	move.w	d7,d5			; restore bltcon1
	
	add.l	a3,a0			; advance 16 bytes
	
	move.w	(a0)+,d0		; bltsize
	move.w	(a0),d1			; mod
	swap	d1
	move.w	(a0)+,d1	

	WAIT_FOR_BLITTER
	
	move.l	#$ffff0000,BLTAFWM(a5)

	move.l	d1,BLTAMOD(a5)
	move.l	d1,BLTCMOD(a5)

	move.l	(a0)+,BLTBPTH(a5)	; bob
	move.l	(a0)+,BLTAPTH(a5)	; mask	
	add.w	(a0)+,d2		; plot offset y+x word

	add.w	(a0),d4			; barrel adjust
	add.w	(a0)+,d5		; barrel adjust
	bcc.s	.ovf
	addq.w	#2,d2			; overflow to next word
	
.ovf:	move.w	d0,(a2)+		; Save Blit size
	move.w	d1,(a2)+		; Save Modulo
	move.w	d2,(a2)+		; Save offset

	add.l	a1,d2

	move.w	d4,BLTCON0(a5)
	move.w	d5,BLTCON1(a5)
	move.l	d2,BLTCPTH(a5)
	move.l	d2,BLTDPTH(a5)
	move.w	d0,BLTSIZE(a5)
	
	tst.w	(a0)			; Terminate?
	bpl.s	.part
		
.exit:	movem.l	(a7)+,d0-d2/a0-a1
	rts
	
.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr
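The `ror.w #4` / `clr.b` / `or.w #$fca` steps in the routine above move the low four bits of x into the shift field (bits 15-12) of the blitter control words. Below is a Python model of just that part. It is not 68000 code, the helper names are hypothetical, and the per-part barrel adjustment done inside the loop is not modelled.

```python
def ror16(value: int, count: int) -> int:
    """Rotate a 16-bit value right, as ror.w does."""
    value &= 0xFFFF
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def blitter_control(x: int) -> tuple[int, int]:
    """Return (bltcon0, bltcon1) as the routine computes them before .part."""
    d4 = ror16(x, 4)        # ror.w #4,d4   -> x & 15 lands in bits 15-12
    d4 &= 0xFF00            # clr.b d4      -> drop the low byte
    bltcon1 = d4            # move.w d4,d5  -> B shift only
    bltcon0 = d4 | 0x0FCA   # or.w #$fca,d4 -> use A,B,C,D + minterm $CA
    return bltcon0, bltcon1


con0, con1 = blitter_control(5)
print(hex(con0), hex(con1))
```

For x = 5 this gives $5FCA and $5000: a shift of 5 pixels in both words, with all four channels enabled and the cookie-cut minterm in BLTCON0.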

Structure of a frame is like this...

Code:

SPRITES_FRAMES:	dc.l	.gripper_fall_left_frame1
		dc.l	.gripper_fall_left_frame2
		dc.l	.gripper_fall_left_frame3
		dc.l	.gripper_fall_left_frame4
		dc.l	.gripper_fall_left_frame5		
		dc.l	.gripper_fall_left_frame5	
		dc.l	-1	



.gripper_fall_left_frame1:	
; Part 1
		dc.w	0		; 0 sprite lock start
		dc.w	0		; 2 x src offset
		dc.w	0		; 4 y src offset
		dc.w	1		; 6  DER x spr size (words)
		dc.w	40		; 8  DER y spr size
		dc.w	0		; 10 DER x dst offset
		dc.w	0		; 12 DER y dst offset
		dc.w	0		; 14

		ds.b	16            ; Compiled blitter values here

; Part 2	
		dc.w	60		; sprite lock start
		dc.w	$DEAD		; x src offset
		dc.w	$BEEF		; y src offset
		dc.w	1		; x spr size (words)
		dc.w	19		; y spr size 
		dc.w	7		; x dst offset
		dc.w	40		; y dst offset (40 pixels down)
		dc.w	0

		ds.b	16            ; Compiled blitter values here
		dc.l	-1
		
		
.gripper_fall_left_frame2:

bebbo · 30 July 2021, 11:41

Quote:

Originally Posted by mcgeezer

OK thanks for the replies guys...
...

Screen size is 320x256 in 5 bitplanes... hence the 200 bytes width. Bobs are interleaved.
...

Referring to Don_Adan's comment:

You are counting y like 0, 1, 2.
Can't you count y as 0, 200, 400, ... ?

Then there would be no need for the * 200.

mcgeezer · 30 July 2021, 11:57

Quote:

Originally Posted by bebbo

Referring to Don_Adan's comment:

You are counting y like 0, 1, 2.
Can't you count y as 0, 200, 400, ... ?

Then there would be no need for the * 200.

Yes potentially I can do that and it's a nice idea, things might become a little tricky though when I start doing collisions. I'll keep the idea on ice for later.

NorthWay · 30 July 2021, 20:26

Quote:

Originally Posted by mcgeezer

Yes potentially I can do that and it's a nice idea, things might become a little tricky though when I start doing collisions. I'll keep the idea on ice for later.

Now you make it sound like you should do both. Unless that has more overhead than the 40 cycles?
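One way to read "do both" is to keep the pixel row (for collisions) and the pre-multiplied row offset (for plotting) side by side, and to update them together. Below is a minimal Python sketch of that bookkeeping. It is a model, not 68000 code, and the class and method names are hypothetical.

```python
from dataclasses import dataclass

LINE_STRIDE = 200  # bytes per screen line: 40 bytes x 5 interleaved planes


@dataclass
class BobPosition:
    y: int = 0          # pixel row, used for collision checks
    y_offset: int = 0   # y * 200, used directly for plotting

    def down(self) -> None:
        """Move one line down: one add for each representation."""
        self.y += 1
        self.y_offset += LINE_STRIDE

    def up(self) -> None:
        """Move one line up."""
        self.y -= 1
        self.y_offset -= LINE_STRIDE


pos = BobPosition()
for _ in range(3):
    pos.down()
pos.up()
print(pos.y, pos.y_offset)
```

On the 68000, each move then costs one extra add instead of a table lookup or multiply at plot time. Whether that pays off depends on how often y changes compared with how often the bob is plotted.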

Don_Adan · 30 July 2021, 22:06

Or you can try this table version. It may be within PC-relative range, but I don't know the size of the WAIT_FOR_BLITTER macro.

Code:

; d0 = Frame Number
; d1 = xpos
; d2 = ypos
; a0 = Sprite Sheet struct
; a1 = Screen struct
; a2 = Restore pointer a3
agdPlotFrame:
	movem.l	d0-d2/a0-a1,-(a7)
	
; Get the frame structure.
	lea	SPRITES_FRAMES,a0
	add.w	d0,d0
	add.w	d0,d0
	move.l	(a0,d0),a0		
	
	move.l	hScreenPointers(a1),a1
	
; This would be the point for entry to the routine.	
	add.w	2(a0),d1		; add x offset for this frame
	add.w	4(a0),d2		; add y offset for this frame
	
;	lea	.mulu200(pc),a3
;	add.w	d2,d2
;	move.w	(a3,d2.w),d2

	lea	16.w,a3			; a3 = 16, replaces move.l #16,a3

	move.w	d1,d4			; Make a copy of the xpos to d4	
	and.w	#$fff0,d1		; Get the Xposition nearest word position
	lsr.w	#3,d1			; d1 now has nearest word
;	add.l	d1,d2			; d2=Byte position  why add longword not word?

	ror.w	#4,d4			; Barrel Shift amount for BLTCON0 Source Mask
	clr.b	d4
	move.w	d4,d5			
	or.w	#$fca,d4		; We want Source A,B & C with D = $F, and Cookie Cut $CA = $FCA
	
;	move.l	d2,d3			; save plot offset
	move.w	d4,d6			; save bltcon0 value
	move.w	d5,d7			; save bltcon1 value
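	add.w	d2,d2			; y*2 = byte index into the word table
	move.w	.mulu200(pc,d2.w),d3	; d3 = y*200 (assumes upper word of d3 is already clear)
	add.w	d1,d3			; d3 = plot offset, y*200 + x byte offset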

;	move.l	#16,a3

.part:
	move.l	d3,d2			; restore offset  ; 88 bytes
	move.w	d6,d4			; restore bltcon0 ; 86 bytes 
	move.w	d7,d5			; restore bltcon1 ; 84 bytes
	
	add.l	a3,a0			; advance 16 bytes ; 82 bytes
	
	move.w	(a0)+,d0		; bltsize ; 80 bytes
	move.w	(a0),d1			; mod ; 78 bytes
	swap	d1    ; 76 bytes
	move.w	(a0)+,d1	; 74 bytes

	WAIT_FOR_BLITTER ; unknown size
	
	move.l	#$ffff0000,BLTAFWM(a5) ; 72 bytes

	move.l	d1,BLTAMOD(a5)   ; 64 bytes
	move.l	d1,BLTCMOD(a5) ; 60 bytes

	move.l	(a0)+,BLTBPTH(a5)	; bob ; 56 bytes
	move.l	(a0)+,BLTAPTH(a5)	; mask	; 52 bytes
	add.w	(a0)+,d2		; plot offset y+x word ; 48 bytes

	add.w	(a0),d4			; barrel adjust ; 46 bytes
	add.w	(a0)+,d5		; barrel adjust ; 44 bytes
	bcc.s	.ovf   ; 42 bytes
	addq.w	#2,d2			; overflow to next word ; 40 bytes
	
.ovf:	move.w	d0,(a2)+		; Save Blit size ; 38 bytes
	move.w	d1,(a2)+		; Save Modulo ; 36 bytes
	move.w	d2,(a2)+		; Save offset ; 34 bytes

	add.l	a1,d2 ; 32 bytes

	move.w	d4,BLTCON0(a5) ; 30 bytes
	move.w	d5,BLTCON1(a5) ; 26 bytes 
	move.l	d2,BLTCPTH(a5)  ; 22 bytes 
	move.l	d2,BLTDPTH(a5) ; 18 bytes
	move.w	d0,BLTSIZE(a5) ; 14 bytes 
	
	tst.w	(a0)			; Terminate? ; 10 bytes
	bpl.s	.part ; 8 bytes
		
.exit:	movem.l	(a7)+,d0-d2/a0-a1 ; 6 bytes
	rts ; 2 bytes
	
.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr
