English Amiga Board


Old 27 July 2021, 23:17   #1
mcgeezer
Registered User

 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,506
Fastest way to multiply by 200?

Hi all,

I'm not too familiar with exact cycle timings, but I'm looking for a fast way to multiply a number by 200. The result always fits in 16 bits (under 64 KB).

At the moment I have:

Code:
	lea	.mulu200(pc),a3
	add.w	d2,d2
	move.w	(a3,d2.w),d2
.
.
.

.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr
I'm assuming this will be faster than a mulu #200,d2, which I recall takes something like 70 cycles?
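As a quick sanity check of the table idea, here is a minimal Python model of the lookup. It only mirrors the arithmetic, not the timing, and the names `ROW_BYTES`, `TABLE` and `mul200_lookup` are made up for this sketch.

```python
# Model of the table built by "rept 256 / dc.w REPTN*200".
ROW_BYTES = 200
TABLE = [n * ROW_BYTES for n in range(256)]  # 256 word entries


def mul200_lookup(y):
    """Mimic add.w d2,d2 / move.w (a3,d2.w),d2 for 0 <= y <= 255."""
    byte_offset = y + y              # add.w d2,d2: entries are 2 bytes apart
    return TABLE[byte_offset // 2]   # move.w (a3,d2.w),d2 reads one word


# Every entry fits in an unsigned 16-bit word, so dc.w is large enough.
largest = max(TABLE)
```

The largest entry is 255*200 = 51000, which is above 32767, so the values have to be treated as unsigned words.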

Any faster ways you guys can think of?

Thanks,
Graeme
Old 27 July 2021, 23:46   #2
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 536
Assuming 68000: mulu.w #200,dx takes 38+4+6 = 48 cycles (200 = $C8 = %11001000, so three set bits at 2 cycles each, plus 4 for the immediate operand).
If you break it down into shifts and adds, it should be around 40 cycles.
Your table approach is 12+4+14 = 30 cycles worst case (meaning you execute all 3 instructions each time). It can be made 28 if you can 64 KB-align the table (move.l dx,ax + move.w (ax),dx is 12 cycles vs. 14 cycles).
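The MULU figure above follows the usual 68000 rule of 38 cycles plus 2 per set bit in the 16-bit source operand, plus the effective-address time. A small Python sketch of that rule, with `mulu_cycles` as a hypothetical helper name:

```python
def mulu_cycles(multiplier, ea_cycles=0):
    """Estimate 68000 MULU.W time: 38 + 2 per set bit + operand fetch."""
    ones = bin(multiplier & 0xFFFF).count("1")
    return 38 + 2 * ones + ea_cycles


# mulu.w #200,dx: three set bits, 4 extra cycles for the immediate word.
cycles_for_200 = mulu_cycles(200, ea_cycles=4)

# Worst case with a register source: all 16 bits set.
worst_case = mulu_cycles(0xFFFF)
```

The worst case comes out at 70 cycles, which is probably where the remembered "something like 70" comes from. These numbers assume no bus contention.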
Old 27 July 2021, 23:58   #3
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,424
The table version will be the fastest on the 68000. I would use just something like this:
Code:
 add.w d2,d2
 move.w .mulu200(PC,D2.W),D2
Old 28 July 2021, 00:12   #4
mcgeezer
Registered User

 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,506
Quote:
Originally Posted by Don_Adan View Post
The table version will be the fastest on the 68000. I would use just something like this:
Code:
 add.w d2,d2
 move.w .mulu200(PC,D2.W),D2
Thanks - indeed this is 68000

That code won't assemble as I'm getting an out of range displacement.
Old 28 July 2021, 01:39   #5
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,424
It is best to place this routine at the end of your code. I always placed the sample mixing routine at the end of my players. A PC-relative table must start within a 126-byte range. Or you can use a register as the base. I don't use a register as the table base because that wastes a register in a critical routine. In my mixing routine I actually used PC-relative addressing for two tables, not just one.
Old 28 July 2021, 08:46   #6
bebbo
botcher

 
Join Date: Jun 2016
Location: Hamburg/Germany
Posts: 565
Quote:
Originally Posted by mcgeezer View Post
Thanks - indeed this is 68000

That code won't assemble as I'm getting an out of range displacement.
The displacement must fit into 1 byte, so the code needs to be very close to the table.
Cycles used: 4 + 14 = 18

The lea approach will work within a ±32 KB range.
Cycles used: 8 + 4 + 14 = 26
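To make the two ranges concrete, here is a rough Python sketch that picks an access form from the distance between the instruction and the table. It assumes the PC-relative displacement is measured from the extension word (instruction address + 2). The function names and the returned strings are illustrative only.

```python
def fits_signed(value, bits):
    """True if value fits in a two's-complement field of the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def pick_table_access(instr_addr, table_addr):
    """Choose how to reach the table from an instruction at instr_addr."""
    # Assumption: the displacement is relative to the extension word.
    disp = table_addr - (instr_addr + 2)
    if fits_signed(disp, 8):
        # add.w (4) + move.w d8(pc,dn.w) (14) = 18 cycles
        return "move.w table(pc,d2.w),d2"
    if fits_signed(disp, 16):
        # lea d16(pc) (8) + add.w (4) + move.w (a3,dn.w) (14) = 26 cycles
        return "lea table(pc),a3 / move.w (a3,d2.w),d2"
    return "absolute address or register base needed"
```

For example, a table 127 bytes past the extension word still fits the short form, while one byte further needs the lea form.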


If you end up putting that code into a subroutine...
... shift/add is faster (40 cycles), since bsr/rts eats up the advantage.
And the shift/add code can be inlined everywhere. OK, you need a scratch register...

... so if that is not available or size matters: use the mul ^^

Last edited by bebbo; 28 July 2021 at 11:05.
Old 28 July 2021, 09:41   #7
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,421
For reference, the shift-and-add version:
Code:
 move.w d0,d1
 add.w d0,d0
 add.w d0,d1
 lsl.w #2,d0
 lsl.w #6,d1
 add.w d1,d0
Old 28 July 2021, 11:06   #8
bebbo
botcher

 
Join Date: Jun 2016
Location: Hamburg/Germany
Posts: 565
Quote:
Originally Posted by meynaf View Post
For reference, the shift-and-add version:
Code:
 move.w d0,d1
 add.w d0,d0
 add.w d0,d1
 lsl.w #2,d0
 lsl.w #6,d1
 add.w d1,d0

That's 44 cycles.


Faster is:
Code:
lsl.w #3,d2 ; *8     6+6
move.w d2,d3           4
lsl.w #3,d3 ; *64    6+6
add.w d3,d2    ; *72    4
add.w d3,d2    ; *136   4
add.w d3,d2    ; *200   4
40 cycles
Old 28 July 2021, 12:17   #9
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,424
For some code (I can't see yours) it is possible, without a speed penalty, to change the input D2 values from 0,1,2,3,4,5... to 0,2,4,6,8,10 etc. Then add.w d2,d2 can be removed and only the move.w from the table is needed.
Old 28 July 2021, 18:47   #10
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 321
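If you run the code while the blitter is active in parallel, or while a lot of bitplane DMA is going on, it may be fastest to simply do the mul (on a 68k without fast mem), since the multiply spends most of its time inside the CPU and needs fewer memory accesses than the table lookup.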
Old 30 July 2021, 03:41   #11
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 1,453
Quote:
Originally Posted by bebbo View Post
That's 44 cycles.


Faster is:
Code:
lsl.w #3,d2 ; *8     6+6
move.w d2,d3           4
lsl.w #3,d3 ; *64    6+6
add.w d3,d2    ; *72    4
add.w d3,d2    ; *136   4
add.w d3,d2    ; *200   4
40 cycles
or:
Code:
move.w d2,d3           4
lsl.w #3,d3 ; *8     6+6
add.w d3,d2 ; *9       4
add.w d3,d3 ; *16      4
add.w d3,d2 ; *25      4
lsl.w #3,d2 ; *200   6+6
Same: 40 cycles.
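The three shift/add sequences posted in this thread can be checked with a short Python model that applies 16-bit wraparound after each step. The cycle totals use 4 cycles for a register move.w or add.w and 6+2n for lsl.w #n, which matches the per-line counts given in the posts. The function names are simply taken from the posters, for illustration.

```python
MASK = 0xFFFF  # all operations are .w, so results wrap at 16 bits


def lsl_w(value, count):
    return (value << count) & MASK


def add_w(src, dst):
    return (src + dst) & MASK


def lsl_cycles(count):
    return 6 + 2 * count  # lsl.w #n,dn on a 68000


def meynaf(x):
    d0 = x
    d1 = d0                 # move.w d0,d1
    d0 = add_w(d0, d0)      # *2
    d1 = add_w(d0, d1)      # *3
    d0 = lsl_w(d0, 2)       # *8
    d1 = lsl_w(d1, 6)       # *192
    return add_w(d1, d0)    # *200


def bebbo(x):
    d2 = lsl_w(x, 3)        # *8
    d3 = lsl_w(d2, 3)       # move.w d2,d3 then lsl.w #3: *64
    for _ in range(3):      # three add.w d3,d2: *72, *136, *200
        d2 = add_w(d3, d2)
    return d2


def gorf(x):
    d2 = x
    d3 = lsl_w(x, 3)        # move.w d2,d3 then lsl.w #3: *8
    d2 = add_w(d3, d2)      # *9
    d3 = add_w(d3, d3)      # *16
    d2 = add_w(d3, d2)      # *25
    return lsl_w(d2, 3)     # *200


CYCLES = {
    "meynaf": 4 + 4 + 4 + lsl_cycles(2) + lsl_cycles(6) + 4,
    "bebbo": lsl_cycles(3) + 4 + lsl_cycles(3) + 4 + 4 + 4,
    "gorf": 4 + lsl_cycles(3) + 4 + 4 + 4 + lsl_cycles(3),
}
```

All three agree with (200*x) modulo 65536 for every 16-bit input, and the totals come out as 44, 40 and 40 cycles.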
Old 30 July 2021, 12:18   #12
mcgeezer
Registered User

 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,506
OK thanks for the replies guys...

In the end I settled on this. It's basically a blitter routine which reconstructs bobs from parts (for sprites). The idea is to save memory and cycles: for example, instead of blitting a whole 64x80 bob (player character), which may contain a lot of blank space, I only blit the parts that have pixels in them.

Screen size is 320x256 in 5 bitplanes, hence the 200-byte line width (40 bytes x 5 planes). Bobs are interleaved.
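For readers following the arithmetic, here is a Python sketch of where the 200 comes from and of the offset and BLTCON values that the routine in this post derives from xpos and ypos. It mirrors the and.w/lsr.w and ror.w/clr.b steps only. `plot_params` and the constant names are invented for the example.

```python
WIDTH_PIXELS = 320
BITPLANES = 5
BYTES_PER_PLANE_ROW = WIDTH_PIXELS // 8        # 40 bytes per plane line
ROW_BYTES = BYTES_PER_PLANE_ROW * BITPLANES    # 200 bytes per interleaved row


def ror_w(value, count):
    """16-bit rotate right, like ror.w #count,dn."""
    value &= 0xFFFF
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def plot_params(x, y):
    """Return (byte offset, BLTCON0, BLTCON1) for a bob at pixel (x, y)."""
    # and.w #$fff0,d1 / lsr.w #3,d1: byte offset of the word containing x
    offset = y * ROW_BYTES + ((x & 0xFFF0) >> 3)
    # ror.w #4,d4 / clr.b d4: pixel shift (x & 15) ends up in bits 15-12
    bltcon1 = ror_w(x, 4) & 0xFF00
    # or.w #$fca,d4: all four channels enabled, cookie-cut minterm $CA
    bltcon0 = bltcon1 | 0x0FCA
    return offset, bltcon0, bltcon1
```

For x = 19, y = 2 this gives offset 402 (2*200 + 2), a shift of 3, BLTCON0 = $3FCA and BLTCON1 = $3000.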

I haven't optimised the code segments yet. Normally I have a6 pointing into a data segment, and I will do that soon to speed it up.

Code:
; d0 = Frame Number
; d1 = xpos
; d2 = ypos
; a0 = Sprite Sheet struct
; a1 = Screen struct
; a2 = Restore pointer a3
agdPlotFrame:
	movem.l	d0-d2/a0-a1,-(a7)
	
; Get the frame structure.
	lea	SPRITES_FRAMES,a0
	add.w	d0,d0
	add.w	d0,d0
	move.l	(a0,d0),a0		
	
	move.l	hScreenPointers(a1),a1
	
; This would be the point for entry to the routine.	
	add.w	2(a0),d1		; add x offset for this frame
	add.w	4(a0),d2		; add y offset for this frame
	
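	lea	.mulu200(pc),a3
	add.w	d2,d2			; ypos -> byte offset into the word table
	move.w	(a3,d2.w),d2		; d2.w = ypos*200 (upper word of d2 assumed 0 for the add.l below)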

	move.w	d1,d4			; Make a copy of the xpos to d4	
	and.w	#$fff0,d1		; Get the Xposition nearest word position
	lsr.w	#3,d1			; d1 now has nearest word
	add.l	d1,d2			; d2=Byte position

	ror.w	#4,d4			; Barrel Shift amount for BLTCON0 Source Mask
	clr.b	d4
	move.w	d4,d5			
	or.w	#$fca,d4		; We want Source A,B & C with D = $F, and Cookie Cut $CA = $FCA
	
	move.l	d2,d3			; save plot offset
	move.w	d4,d6			; save bltcon0 value
	move.w	d5,d7			; save bltcon1 value
	
	move.l	#16,a3

.part:
	move.l	d3,d2			; restore offset
	move.w	d6,d4			; restore bltcon0
	move.w	d7,d5			; restore bltcon1
	
	add.l	a3,a0			; advance 16 bytes
	
	move.w	(a0)+,d0		; bltsize
	move.w	(a0),d1			; mod
	swap	d1
	move.w	(a0)+,d1	

	WAIT_FOR_BLITTER
	
	move.l	#$ffff0000,BLTAFWM(a5)

	move.l	d1,BLTAMOD(a5)
	move.l	d1,BLTCMOD(a5)

	move.l	(a0)+,BLTBPTH(a5)	; bob
	move.l	(a0)+,BLTAPTH(a5)	; mask	
	add.w	(a0)+,d2		; plot offset y+x word

	add.w	(a0),d4			; barrel adjust
	add.w	(a0)+,d5		; barrel adjust
	bcc.s	.ovf
	addq.w	#2,d2			; overflow to next word
	
.ovf:	move.w	d0,(a2)+		; Save Blit size
	move.w	d1,(a2)+		; Save Modulo
	move.w	d2,(a2)+		; Save offset

	add.l	a1,d2

	move.w	d4,BLTCON0(a5)
	move.w	d5,BLTCON1(a5)
	move.l	d2,BLTCPTH(a5)
	move.l	d2,BLTDPTH(a5)
	move.w	d0,BLTSIZE(a5)
	
	tst.w	(a0)			; Terminate?
	bpl.s	.part
		
.exit:	movem.l	(a7)+,d0-d2/a0-a1
	rts
	
.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr
Structure of a frame is like this...

Code:
SPRITES_FRAMES:	dc.l	.gripper_fall_left_frame1
		dc.l	.gripper_fall_left_frame2
		dc.l	.gripper_fall_left_frame3
		dc.l	.gripper_fall_left_frame4
		dc.l	.gripper_fall_left_frame5		
		dc.l	.gripper_fall_left_frame5	
		dc.l	-1	



.gripper_fall_left_frame1:	
; Part 1
		dc.w	0		; 0 sprite lock start
		dc.w	0		; 2 x src offset
		dc.w	0		; 4 y src offset
		dc.w	1		; 6  DER x spr size (words)
		dc.w	40		; 8  DER y spr size
		dc.w	0		; 10 DER x dst offset
		dc.w	0		; 12 DER y dst offset
		dc.w	0		; 14

		ds.b	16            ; Compiled blitter values here

; Part 2	
		dc.w	60		; sprite lock start
		dc.w	$DEAD		; x src offset
		dc.w	$BEEF		; y src offset
		dc.w	1		; x spr size (words)
		dc.w	19		; y spr size 
		dc.w	7		; x dst offset
		dc.w	40		; y dst offset (40 pixels down)
		dc.w	0

		ds.b	16            ; Compiled blitter values here
		dc.l	-1
		
		
.gripper_fall_left_frame2:
Old 30 July 2021, 12:41   #13
bebbo
botcher

 
Join Date: Jun 2016
Location: Hamburg/Germany
Posts: 565
Quote:
Originally Posted by mcgeezer View Post
OK thanks for the replies guys...
...

Screen size is 320x256 in 5 bitplanes, hence the 200-byte line width (40 bytes x 5 planes). Bobs are interleaved.
...

Referring to Don_Adan's comment:


You are counting y like 0, 1, 2.
Can't you count y as 0, 200, 400, ...?

Then there would be no need for the * 200.
Old 30 July 2021, 12:57   #14
mcgeezer
Registered User

 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,506
Quote:
Originally Posted by bebbo View Post
Referring to Don_Adan's comment:

You are counting y like 0, 1, 2.
Can't you count y as 0, 200, 400, ...?

Then there would be no need for the * 200.
Yes potentially I can do that and it's a nice idea, things might become a little tricky though when I start doing collisions. I'll keep the idea on ice for later.

Old 30 July 2021, 21:26   #15
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 721
Quote:
Originally Posted by mcgeezer View Post
Yes potentially I can do that and it's a nice idea, things might become a little tricky though when I start doing collisions. I'll keep the idea on ice for later.

Now you make it sound like you should do both. Unless that has more overhead than the 40 cycles?
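One way to read "do both" is to keep the row number for collision checks and the pre-multiplied byte offset for plotting, and update them together. A minimal Python sketch under that assumption; `BobPosition` and its fields are hypothetical names, not anything from the posted routine.

```python
ROW_BYTES = 200


class BobPosition:
    """Track a bob's row and its screen byte offset side by side."""

    def __init__(self, row):
        self.row = row
        self.offset = row * ROW_BYTES  # one multiply at setup time only

    def step(self, rows=1):
        """Move by whole rows using additions instead of a multiply."""
        direction = 1 if rows >= 0 else -1
        for _ in range(abs(rows)):
            self.row += direction
            self.offset += direction * ROW_BYTES
```

Collision code would read `row`, while the plot routine would read `offset` directly and skip the *200 step. Whether the extra bookkeeping beats a 26 to 40 cycle multiply depends on how often the position changes.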
Old 30 July 2021, 23:06   #16
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,424
Or you can try this table version. Perhaps it will be within PC-relative range, but I don't know the size of the wait-for-blitter routine.
Code:
; d0 = Frame Number
; d1 = xpos
; d2 = ypos
; a0 = Sprite Sheet struct
; a1 = Screen struct
; a2 = Restore pointer a3
agdPlotFrame:
	movem.l	d0-d2/a0-a1,-(a7)
	
; Get the frame structure.
	lea	SPRITES_FRAMES,a0
	add.w	d0,d0
	add.w	d0,d0
	move.l	(a0,d0),a0		
	
	move.l	hScreenPointers(a1),a1
	
; This would be the point for entry to the routine.	
	add.w	2(a0),d1		; add x offset for this frame
	add.w	4(a0),d2		; add y offset for this frame
	
;	lea	.mulu200(pc),a3
;	add.w	d2,d2
;	move.w	(a3,d2.w),d2

	lea	16.w,a3			; a3 = 16 (replaces move.l #16,a3)

	move.w	d1,d4			; Make a copy of the xpos to d4	
	and.w	#$fff0,d1		; Get the Xposition nearest word position
	lsr.w	#3,d1			; d1 now has nearest word
;	add.l	d1,d2			; d2=Byte position  why add longword not word?

	ror.w	#4,d4			; Barrel Shift amount for BLTCON0 Source Mask
	clr.b	d4
	move.w	d4,d5			
	or.w	#$fca,d4		; We want Source A,B & C with D = $F, and Cookie Cut $CA = $FCA
	
;	move.l	d2,d3			; save plot offset
	move.w	d4,d6			; save bltcon0 value
	move.w	d5,d7			; save bltcon1 value
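	add.w	d2,d2			; ypos -> byte offset into the word table
	move.w	.mulu200(pc,d2.w),d3	; d3.w = ypos*200 (upper word of d3 assumed 0)
	add.w	d1,d3			; d3 = plot offset (y*200 + x byte offset)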

;	move.l	#16,a3

.part:
	move.l	d3,d2			; restore offset  ; 88 bytes
	move.w	d6,d4			; restore bltcon0 ; 86 bytes 
	move.w	d7,d5			; restore bltcon1 ; 84 bytes
	
	add.l	a3,a0			; advance 16 bytes ; 82 bytes
	
	move.w	(a0)+,d0		; bltsize ; 80 bytes
	move.w	(a0),d1			; mod ; 78 bytes
	swap	d1    ; 76 bytes
	move.w	(a0)+,d1	; 74 bytes

	WAIT_FOR_BLITTER ; unknown size
	
	move.l	#$ffff0000,BLTAFWM(a5) ; 72 bytes

	move.l	d1,BLTAMOD(a5)   ; 64 bytes
	move.l	d1,BLTCMOD(a5) ; 60 bytes

	move.l	(a0)+,BLTBPTH(a5)	; bob ; 56 bytes
	move.l	(a0)+,BLTAPTH(a5)	; mask	; 52 bytes
	add.w	(a0)+,d2		; plot offset y+x word ; 48 bytes

	add.w	(a0),d4			; barrel adjust ; 46 bytes
	add.w	(a0)+,d5		; barrel adjust ; 44 bytes
	bcc.s	.ovf   ; 42 bytes
	addq.w	#2,d2			; overflow to next word ; 40 bytes
	
.ovf:	move.w	d0,(a2)+		; Save Blit size ; 38 bytes
	move.w	d1,(a2)+		; Save Modulo ; 36 bytes
	move.w	d2,(a2)+		; Save offset ; 34 bytes

	add.l	a1,d2 ; 32 bytes

	move.w	d4,BLTCON0(a5) ; 30 bytes
	move.w	d5,BLTCON1(a5) ; 26 bytes 
	move.l	d2,BLTCPTH(a5)  ; 22 bytes 
	move.l	d2,BLTDPTH(a5) ; 18 bytes
	move.w	d0,BLTSIZE(a5) ; 14 bytes 
	
	tst.w	(a0)			; Terminate? ; 10 bytes
	bpl.s	.part ; 8 bytes
		
.exit:	movem.l	(a7)+,d0-d2/a0-a1 ; 6 bytes
	rts ; 2 bytes
	
.mulu200:	
	rept	256
	dc.w	REPTN*200
	endr
 

