Optimizing the 68020+ 32-bit math - Page 23

Don_Adan · 01 July 2021, 12:44

Maybe changing length of text for msg1,msg2, msg4 to same value 23 or 24 bytes can win some D3? But perhaps it needs some changes in getnum error handling routine, if can works.

a/b · 01 July 2021, 13:01

Quote:

Originally Posted by Don_Adan

Maybe changing length of text for msg1,msg2, msg4 to same value 23 or 24 bytes can win some D3? But perhaps it needs some changes in getnum error handling routine, if can works.

Yeah, I've been looking at those numbers too many times (doing occasional checks of raw code for those kind of situations), it's a possibility, but it seemed silly coming up with superfluous chars to pad the length as well as problematic so I never considered actually doing it.
I'm not calling the shots here, everyone is free to do whatever they want and make their owns versions. I'm just having fun with m68k coding and kind of doing my own thing, and trying to align my stuff with the rest as much as I can and find appropriate (e.g. I didn't merge pure BCPL version that ross wrote; let say I don't want to steal all the fun so I'll let him merge my additions and adjust his code :P).

alkis · 01 July 2021, 13:21

There is no real consistency between platforms. Some implementations have high level language (basic) to print those strings and numbers + asm and the exe size reported is the sum of asm bin and basic source code.

Anyways, although bad design to do staff to print a constant, I can see that it makes comparisons hard since we can not go fix a million other implementations

So, rolled back, msg1 and msg2 unified. Bug fixed on msg1. Day transition code integrated, not activated.

490 bytes (510 bytes)

Code:

;***************************************************************

; N = 7*D, D = digits, e.g. N = 700 for 100 digits

; user settings
PRINT_DIGITS		= 1
HACKS			= 1	; use undocumented OS stuff?
LONG_TIMER              = 0     ; check for day transition?
        
; exec
TDNestCnt		= 295
LibList			= 378
LN_NAME			= 10	; list node name

; dos
Input			= -54
Output			= -60
Read			= -42
Write			= -48
DateStamp		= -192
TICKS_PER_SECOND	= 50	; dos timer frequency

;***************************************************************

start
; Instead of opening/closing dos library we find it in exec's list.
	move.l	4.w,a5			; exec library
	IFEQ	HACKS
	lea	LibList(a5),a6
.lib_loop
	move.l	(a6),a6			; next library in the list
	move.l	LN_NAME(a6),a0
	move.l	#'.sod',d1
.lib_name
	cmp.b	(a0)+,d1
	bne.b	.lib_loop
	lsr.l	#8,d1
	bne.b	.lib_name
	ELSE
	lea	-$148(a2),a6		; dos library from bcpl vector
	ENDIF	; HACKS

	jsr	Output(a6)
	move.l	d0,a3			; a3 = stdout
	lea	workspace(pc),a4

	bsr.w	getnum			; returns N in d6 (k = N)

	addq.b	#1,TDNestCnt(a5)	; FORBID macro, a2/a5 are now free

	bsr.b	.gettime
	move.l	d7,-(a7)

;*** TIMED PART START ******************************************

	move.l	#2000<<16+2000,d0
	move.l	a4,a0
	move.l	d6,d7
.fill	move.l	d0,(a0)+
	subq.w	#4,d7
	bne.b	.fill

	move.w	#10000,d5

; outer+inner loop:
;	d3 upper word must initially be and remain 0
; 	d7 must initially be 0 (c = 0)
; d0=*, d1=d, d2=b, d3=tmp, d4=10, d5=10000, d6=k, d7=c
; a0=*, a1=*, a4=r[] (a2=--, a3=stdout, a5=--, a6=dos)

.outer_loop
	moveq	#0,d1			; d = 0
	add.l	d6,a4			; &r[i]
	move.w	d6,d2
	subq.w	#1,d2			; b = k-1
	bra.b	.inner_entry

.gettime	; returns ticks in d7
	lea	-12(sp),sp		; -ds_SIZEOF
	move.l	sp,d1
	jsr	DateStamp(a6)
	movem.l	(sp)+,d0/d1/d7		; d0=days, d1=minutes, d7=ticks
	mulu.w	#TICKS_PER_SECOND*60,d1
	add.l	d1,d7
	rts

.longdiv
	swap	d0
	move.w	d0,d3
	divu.w	d2,d3
	swap	d3
	move.w	d3,d0
	swap	d0
	divu.w	d2,d0

	move.w	d0,d3
	clr.w	d0
	swap	d0
	move.w	d0,(a4)			; r[i] = d%b
	exg	d0,d3

	subq.w	#2,d2			; b -= 2
	bcs.b	.inner_done

.inner_loop
	sub.l	d0,d1			; d = (d-d/b-d%b)/2
	sub.l	d3,d1			; same as d *= i
	lsr.l	#1,d1
.inner_entry
	move.w	-(a4),d0		; r[i]
	mulu.w	d5,d0
	add.l	d0,d1			; d += r[i]*10000
	move.l	d1,d0
	divu.w	d2,d0			; d/b
	bvs.b	.longdiv

	move.w	d0,d3			; d/b
	clr.w	d0
	swap	d0			; d%b
	move.w	d0,(a4)			; r[i] = d%b

	subq.w	#2,d2			; b -= 2
	bcc.b	.inner_loop

.inner_done
	divu.w	d5,d1			; d/10000
	add.w	d7,d1			; d = c+d/10000 (to be printed out)
	move.l	d1,d7
	swap	d7			; c = d%10000
	IFNE	PRINT_DIGITS
	bsr.b	PR0000
	ENDIF

	sub.w	#7*4,d6			; k -= 7*4
	bne.b	.outer_loop

;*** TIMED PART END ********************************************

	bsr.b	.gettime
	sub.l	(a7)+,d7		; end_time-start_time
        IFNE	LONG_TIMER
        ; I'll shoot if you ask for DST adjustment or anything similar.
	bpl.b	.same_day
	add.l	#TICKS_PER_SECOND*60*60*24,d7
.same_day
	ENDIF

	add.l	d7,d7			; dos ticks (1/50) to 1/100
	addq.l	#1,d7			; round up
	divu.w	#100,d7			; 1/100ths upper, seconds lower

	move.l	a4,d2			; print buffer
	move.b	#' ',(a4)+

	moveq	#0,d3			; skip leading zeroes
	bsr.b	SPrintTime		; d5 is already set to 10000

	move.b	#'.',(a4)+

	moveq	#'0',d3			; print leading zeroes
	swap	d7
	moveq	#10,d5
	bsr.b	SPrintTime

	move.b	d4,(a4)+		; newline

	move.l	a4,d3
	sub.l	d2,d3			; string size
	bra.b	callwrite

	; END OF PROGRAM (exec will re-enable multitasking)

;***************************************************************

SPrintTime	; d7=value, a4=buffer
	move.w	d7,d6
.Next	ext.l	d6
	divu.w	d5,d6
	cmp.b	d3,d6
	beq.b	.LeadZero
	moveq	#'0',d3
	add.b	d3,d6
	move.b	d6,(a4)+
.LeadZero
	swap	d6
	divu.w	d4,d5
	bne.b	.Next
foo	rts

PR0000		; d1=value
	move.l	#'0000'-$01010001,d0
	move.w	-(a4),d3
.Loop	addq.b	#1,d0			; top 3 digits in a loop
	add.w	d3,d1
	bpl.b	.Loop
	sub.w	d3,d1
	rol.l	#8,d0
	move.w	-(a4),d3		; last value is string size (4)
	bmi.b	.Loop
	add.b	d1,d0			; 4th digit
 
	move.l	d0,-(a4)

	moveq	#pbuffer-workspace,d2
	sub.l	d2,a4
writetext
	add.l	a4,d2			; offset to buffer address
callwrite
	move.l	a3,d1			; stdout
	jmp	Write(a6) 		; call Write(stdout,buffer,size)

;***************************************************************

; Data must be in this order all up to msg1.

	CNOP	0,4
pbuffer	DCB.B	4,0
dec2str	DC.W	dec2str-pbuffer,-10,-100,-1000

;*** OVERWRITTEN CODE/DATA STARTS HERE *************************
workspace

; Reorder as needed and don't lose bytes due to EVENs.
msg1	DC.B	"number pi calculator v16",10	; odd length
msg1end
msg2	DC.B	"number of digits (up to "	; even length
msg2end        
msg3	DC.B	")? "				; odd length
msg3end
	EVEN
printnum				; shortcut within 128 bytes
	bra.b	PR0000

msg4	DC.B	" digits will be printed",10	; even length
msg4end
	EVEN

;***************************************************************

getnum
	moveq	#10,d4			; global const
	move.w	#((65536-(workspace-start))/7)&(~3),d7 ; maxD (multiple of 4)
	lea	writetext(pc),a2	; a2 only used in this subroutine

.error
        moveq	#msg1-workspace,d2
	moveq	#msg2end-msg1,d3
	jsr	(a2)

;.error	moveq	#msg2-workspace,d2
;	moveq	#msg2end-msg2,d3
;	jsr	(a2)
	move.w	d7,d1
	bsr.b	printnum
	moveq	#msg3-workspace,d2
	moveq	#msg3end-msg3,d3
	jsr	(a2)

	jsr	Input(a6)		; get stdin
	move.l	d0,d1
        lea.l   start(pc),a0            ; self-destruction-mode on
	move.l	a0,d2			; read overwrites msg1
	moveq	#4+1,d3			; up to 4 digits + newline
	jsr	Read(a6)		; returns length in d0

	move.l	d2,a0
	moveq	#0,d2
.nextch	subq.w	#1,d0
	beq.b	.parsed
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	d4,d6			; digit 0-9?
	bhs.b	.error
	mulu.w	d4,d2
	add.w	d6,d2			; D = D*10+digit
	bra.b	.nextch
.parsed
	cmp.w	d7,d2			; D > maxD?
	bhi.b	.error
	move.w	d2,d1			; D = 0?
	beq.b	.error

	addq.w	#3,d1
	and.w	#~3,d1			; adjust D to a multiple of 4
	moveq	#7,d6
	mulu.w	d1,d6			; k = N = 7*D
	cmp.b	(a0),d4			; last char is newline (1-4 digits)?
	bne.b	.adjusted
	cmp.w	d2,d1
	beq.s	.not_adjusted
.adjusted
	bsr.b	printnum		; either 5 digits or adjusted D
	moveq	#msg4-workspace,d2
	moveq	#msg4end-msg4,d3
	jmp	(a2)
.not_adjusted
	rts

;***************************************************************

bss	DS.B	65536-(*-start)		; 64kb allowed for code+data

;***************************************************************

; asm-one&co
	PRINTV	bss-start+36		; 36 = hunk overhead
	PRINTV	((65536-(workspace-start))/7)&(~3)

;***************************************************************

alkis · 01 July 2021, 13:24

btw, the amiga way (IMO) would be to grab argument from command line (a0). This design is like Input "What's your name";A$ we all did in basic on 8 bits

Don_Adan · 01 July 2021, 15:06

You shortened this program much, then maybe something like this will be works?

Code:

;***************************************************************

; N = 7*D, D = digits, e.g. N = 700 for 100 digits

; user settings
PRINT_DIGITS		= 1
HACKS			= 0	; use undocumented OS stuff?

; exec
TDNestCnt		= 295
LibList			= 378
LN_NAME			= 10	; list node name

; dos
Input			= -54
Output			= -60
Read			= -42
Write			= -48
DateStamp		= -192
TICKS_PER_SECOND	= 50	; dos timer frequency

;***************************************************************

start
; Instead of opening/closing dos library we find it in exec's list.
	move.l	4.w,a5			; exec library
	IFEQ	HACKS
	lea	LibList(a5),a6
.lib_loop
	move.l	(a6),a6			; next library in the list
	move.l	LN_NAME(a6),a0
	move.l	#'.sod',d1
.lib_name
	cmp.b	(a0)+,d1
	bne.b	.lib_loop
	lsr.l	#8,d1
	bne.b	.lib_name
	ELSE
	lea	-$148(a2),a6		; dos library from bcpl vector
	ENDIF	; HACKS

	jsr	Output(a6)
	move.l	d0,a3			; a3 = stdout
	lea	workspace(pc),a4

	bsr.w	getnum			; returns N in d6 (k = N)

	addq.b	#1,TDNestCnt(a5)	; FORBID macro, a2/a5 are now free

	bsr.b	.gettime
	move.l	d7,-(a7)

;*** TIMED PART START ******************************************

	move.l	#2000<<16+2000,d0
	move.l	a4,a0
	move.l	d6,d7
.fill	move.l	d0,(a0)+
	subq.w	#4,d7
	bne.b	.fill

	move.w	#10000,d5

; outer+inner loop:
;	d3 upper word must initially be and remain 0
; 	d7 must initially be 0 (c = 0)
; d0=*, d1=d, d2=b, d3=tmp, d4=10, d5=10000, d6=k, d7=c
; a0=*, a1=*, a4=r[] (a2=--, a3=stdout, a5=--, a6=dos)

.outer_loop
	moveq	#0,d1			; d = 0
	add.l	d6,a4			; &r[i]
	move.w	d6,d2
	subq.w	#1,d2			; b = k-1
	bra.b	.inner_entry

.gettime	; returns ticks in d7
	lea	-12(sp),sp		; -ds_SIZEOF
	move.l	sp,d1
	jsr	DateStamp(a6)
	movem.l	(sp)+,d0/d1/d7		; d0=days, d1=minutes, d7=ticks
	mulu.w	#TICKS_PER_SECOND*60,d1
	add.l	d1,d7
	rts

.longdiv
	swap	d0
	move.w	d0,d3
	divu.w	d2,d3
	swap	d3
	move.w	d3,d0
	swap	d0
	divu.w	d2,d0

	move.w	d0,d3
	clr.w	d0
	swap	d0
	move.w	d0,(a4)			; r[i] = d%b
	exg	d0,d3

	subq.w	#2,d2			; b -= 2
	bcs.b	.inner_done

.inner_loop
	sub.l	d0,d1			; d = (d-d/b-d%b)/2
	sub.l	d3,d1			; same as d *= i
	lsr.l	#1,d1
.inner_entry
	move.w	-(a4),d0		; r[i]
	mulu.w	d5,d0
	add.l	d0,d1			; d += r[i]*10000
	move.l	d1,d0
	divu.w	d2,d0			; d/b
	bvs.b	.longdiv

	move.w	d0,d3			; d/b
	clr.w	d0
	swap	d0			; d%b
	move.w	d0,(a4)			; r[i] = d%b

	subq.w	#2,d2			; b -= 2
	bcc.b	.inner_loop

.inner_done
	divu.w	d5,d1			; d/10000
	add.w	d7,d1			; d = c+d/10000 (to be printed out)
	move.l	d1,d7
	swap	d7			; c = d%10000
	IFNE	PRINT_DIGITS
	bsr.b	PR0000
	ENDIF

	sub.w	#7*4,d6			; k -= 7*4
	bne.b	.outer_loop

;*** TIMED PART END ********************************************

	bsr.b	.gettime
	sub.l	(a7)+,d7		; end_time-start_time

	add.l	d7,d7			; dos ticks (1/50) to 1/100
	addq.l	#1,d7			; round up
	divu.w	#100,d7			; 1/100ths upper, seconds lower

	move.l	a4,d2			; print buffer
	move.b	#' ',(a4)+

	moveq	#0,d3			; skip leading zeroes
	bsr.b	SPrintTime		; d5 is already set to 10000

	move.b	#'.',(a4)+

	moveq	#'0',d3			; print leading zeroes
	swap	d7
	moveq	#10,d5
	bsr.b	SPrintTime

	move.b	d4,(a4)+		; newline

	move.l	a4,d3
	sub.l	d2,d3			; string size
	bra.b	callwrite

	; END OF PROGRAM (exec will re-enable multitasking)

;***************************************************************

SPrintTime	; d7=value, a4=buffer
	move.w	d7,d6
.Next	ext.l	d6
	divu.w	d5,d6
	cmp.b	d3,d6
	beq.b	.LeadZero
	moveq	#'0',d3
	add.b	d3,d6
	move.b	d6,(a4)+
.LeadZero
	swap	d6
	divu.w	d4,d5
	bne.b	.Next
	rts

PR0000		; d1=value
	move.l	#'0000'-$01010001,d0
	move.w	-(a4),d3
.Loop	addq.b	#1,d0			; top 3 digits in a loop
	add.w	d3,d1
	bpl.b	.Loop
	sub.w	d3,d1
	rol.l	#8,d0
	move.w	-(a4),d3		; last value is string size (4)
	bmi.b	.Loop
	add.b	d1,d0			; 4th digit
 
	move.l	d0,-(a4)

	moveq	#pbuffer-workspace,d2
	sub.l	d2,a4
writetext
	add.l	a4,d2			; offset to buffer address
callwrite
	move.l	a3,d1			; stdout
	jmp	Write(a6) 		; call Write(stdout,buffer,size)

;***************************************************************

; Data must be in this order all up to msg1.

	CNOP	0,4
pbuffer	DCB.B	4,0
dec2str	DC.W	dec2str-pbuffer,-10,-100,-1000

;*** OVERWRITTEN CODE/DATA STARTS HERE *************************
workspace

; Reorder as needed and don't lose bytes due to EVENs.
msg1	DC.B	"number pi calculator v16",10	; odd length
msg1end
msg2	DC.B	"number of digits (up to "	; even length
msg2end
msg3	DC.B	")? "				; odd length
msg3end
	EVEN
printnum				; shortcut within 128 bytes
	bra.b	PR0000

msg4	DC.B	" digits will be printed",10	; even length
msg4end
	EVEN

;***************************************************************

getnum
	moveq	#10,d4			; global const    move this if too large distance
	move.w	#((65536-(workspace-start))/7)&(~3),d7 ; maxD (multiple of 4) maybe this too
;	lea	writetext(pc),a2	; a2 only used in this subroutine

	moveq	#msg1-workspace,d2
	moveq	#msg1end-msg1,d3
;	jsr	(a2)
       bsr.b writtext
.error	moveq	#msg2-workspace,d2
	moveq	#msg2end-msg2,d3
;	jsr	(a2)
        bsr.b writetext
	move.w	d7,d1
	bsr.b	printnum
	moveq	#msg3-workspace,d2
	moveq	#msg3end-msg3,d3
;	jsr	(a2)
        bsr.b writetext
	jsr	Input(a6)		; get stdin
	move.l	d0,d1
	move.l	a4,d2			; read overwrites msg1
	moveq	#4+1,d3			; up to 4 digits + newline
	jsr	Read(a6)		; returns length in d0

	move.l	d2,a0
	moveq	#0,d2
.nextch	subq.w	#1,d0
	beq.b	.parsed
	move.w	#256-'0',d6
	add.b	(a0)+,d6
	cmp.w	d4,d6			; digit 0-9?
	bhs.b	.error
	mulu.w	d4,d2
	add.w	d6,d2			; D = D*10+digit
	bra.b	.nextch
.parsed
	cmp.w	d7,d2			; D > maxD?
	bhi.b	.error
	move.w	d2,d1			; D = 0?
	beq.b	.error

	addq.w	#3,d1
	and.w	#~3,d1			; adjust D to a multiple of 4
	moveq	#7,d6
	mulu.w	d1,d6			; k = N = 7*D
	cmp.b	(a0),d4			; last char is newline (1-4 digits)?
	bne.b	.adjusted
	cmp.w	d2,d1
	beq.b	.not_adjusted
.adjusted
	bsr.b	printnum		; either 5 digits or adjusted D
	moveq	#msg4-workspace,d2
	moveq	#msg4end-msg4,d3
;	jmp	(a2)
        bra.w writetext
.not_adjusted
	rts

;***************************************************************

bss	DS.B	65536-(*-start)		; 64kb allowed for code+data

;***************************************************************

; asm-one&co
;	PRINTV	bss-start+36		; 36 = hunk overhead
;	PRINTV	((65536-(workspace-start))/7)&(~3)

;***************************************************************

a/b · 01 July 2021, 15:45

It works for me with 2 bytes to spare (or 6 if I move d7 out). Getting rid of timer.L put the 3rd call back within range.

ross · 01 July 2021, 17:11

Quote:

Originally Posted by a/b

(e.g. I didn't merge pure BCPL version that ross wrote; let say I don't want to steal all the fun so I'll let him merge my additions and adjust his code :P).

I didn't have time today (damn work..), maybe I'll look at it when there is a new 'stable' version, I see there have been a few new changes

[strings of equal length would be useful for the BCPL version because in any case I have to save the whole environment and I can carry the values between one call to another]

a/b · 01 July 2021, 17:17

And with another 2, I can go -4 bytes in exe size:

Code:

	move.l	#2000<<16+2000,d0
;bye	move.l	a4,a0
	move.l	d6,d7
;.fill	move.l	d0,(a0)+
.fill	move.l	d0,(a4)+
	subq.w	#4,d7
	bne.b	.fill
...
.outer_loop
	moveq	#0,d1			; d = 0
;	add.l	d6,a4			; &r[i]
...
	sub.w	#7*4,d6			; k -= 7*4
	add.l	d6,a4			; &r[k/2]
	bne.b	.outer_loop

Don_Adan · 01 July 2021, 18:31

Because this is size vs speed optimization, then you can perhaps use this:

Code:

	move.w	#2000,d5
	move.l	d6,d7
.fill	move.w	d5,(a4)+
	subq.w	#2,d7
	bne.b	.fill
        move.w	#10000,d5

And if you preserve d3=5
then later

mulu.w D3,D5

Also using A2 and A5 registers is perhaps possible to free d6 or d7 register. And later Ross dot com idea, can be used. But this is a few complicated and need some changes.

a/b · 01 July 2021, 18:51

I was already thinking about the fill loop, switching to words. Didn't feel like doing it because it would be slower. Dunno, I guess I'll do a few runs and see how much slower it is.
An hour or so ago I had another idea that would need another word reg to initialize with a constant to make the code shorter (right now it breaks even with 3 regs): movem.w (a4),d4/d5/d7 (5 words) vs. 2x move # and 1x moveq (also 5 words), was considering fill words but left it open for now, still didn't like switching to words.

edit: Almost forgot. I wrote about d6/d7 this morning while replying to Ross, can't do I think: d7 needs swap, d6 needed for ccr setting.
Speed is fine, we should got for it I guess. But that movem.w (a4) opt is a total pain in the butt. Can't see it working atm, lots of reg switching to preserve everything long enough, if even possible ;(.

Don_Adan · 01 July 2021, 19:36

Quote:

Originally Posted by a/b

I was already thinking about the fill loop, switching to words. Didn't feel like doing it because it would be slower. Dunno, I guess I'll do a few runs and see how much slower it is.
An hour or so ago I had another idea that would need another word reg to initialize with a constant to make the code shorter (right now it breaks even with 3 regs): movem.w (a4)+,d4/d5/d7 (5 words) vs. 2x move # and 1x moveq (also 5 words), was considering fill words but left it open for now, still didn't like switching to words.

edit: Almost forgot. I wrote about d6/d7 this morning while replying to Ross, can't do I think: d7 needs swap, d6 needed for ccr setting.

I know, d6 is not necessary to ccr setting in my idea. Then can be free.
Anyway you must check first if something like this is possible, because we must compensate 2 bytes lose too.

Code:

Edit, perhaps A5 can be trashed too, then only one more register is necessary to be trashed.

      lea.28.W,A2       ; somewhere in init part
      .....
.gettime	; returns ticks in d7
;	lea	-12(sp),sp		; -ds_SIZEOF
        sub.l A2,SP
	move.l	sp,d1
	jsr	DateStamp(a6)
;	movem.l	(sp)+,d0/d1/d7		; d0=days, d1=minutes, d7=ticks
      movem.l (SP)+,    ; here we must restore 7 registers, then 4 registers will be trashed, perhaps a0,a1 and 2 more registers, you better know this code
	mulu.w	#TICKS_PER_SECOND*60,d1
	add.l	d1,d7

a/b · 01 July 2021, 20:27

I'm using 10, 10000, maxD and 2000 with movem now. Retrying with 4*7 now.

That's 504 bytes or 484 haxed bytes.

Code:

;***************************************************************

; N = 7*D, D = digits, e.g. N = 700 for 100 digits

; user settings
PRINT_DIGITS		= 1
LONG_TIMER		= 0	; check for day transition?
HACKS			= 0	; use undocumented OS stuff?

; exec
TDNestCnt		= 295
LibList			= 378
LN_NAME			= 10	; list node name

; dos
Input			= -54
Output			= -60
Read			= -42
Write			= -48
DateStamp		= -192
TICKS_PER_SECOND	= 50	; dos timer frequency

;***************************************************************

start
; Instead of opening/closing dos library we find it in exec's list.
	move.l	4.w,a5			; exec library
	IFEQ	HACKS
	lea	LibList(a5),a6
.lib_loop
	move.l	(a6),a6			; next library in the list
	move.l	LN_NAME(a6),a0
	move.l	#'.sod',d0
.lib_name
	cmp.b	(a0)+,d0
	bne.b	.lib_loop
	lsr.l	#8,d0
	bne.b	.lib_name
	ELSE
	lea	-$148(a2),a6		; dos library from bcpl vector
	ENDIF	; HACKS

	jsr	Output(a6)
	move.l	d0,a3			; a3 = stdout
	lea	workspace(pc),a4
	movem.w	(a4),d4/d5/d7/a2	; 10, 10000, maxD, 2000

	bsr.w	getnum			; returns N in d6 (k = N)

	addq.b	#1,TDNestCnt(a5)	; FORBID macro, a5 is free now

	bsr.b	.gettime
	move.l	d7,-(a7)

;*** TIMED PART START ******************************************

	move.l	d6,d7
.fill	move.w	a2,(a4)+		; a2 is free after this loop
	subq.w	#2,d7
	bne.b	.fill

; outer+inner loop:
;	d3 upper word must initially be and remain 0
; 	d7 must initially be 0 (c = 0)
; d0=*, d1=d, d2=b, d3=tmp, d4=10, d5=10000, d6=k, d7=c
; a0=*, a1=*, a4=r[] (a2=--, a3=stdout, a5=--, a6=dos)

.outer_loop
	moveq	#0,d1			; d = 0
	move.w	d6,d2
	subq.w	#1,d2			; b = k-1
	bra.b	.inner_entry

.gettime	; returns ticks in d7
	lea	-12(sp),sp		; -ds_SIZEOF
	move.l	sp,d1
	jsr	DateStamp(a6)
	movem.l	(sp)+,d0/d1/d7		; d0=days, d1=minutes, d7=ticks
	mulu.w	#TICKS_PER_SECOND*60,d1
	add.l	d1,d7
	rts

.longdiv
	swap	d0
	move.w	d0,d3
	divu.w	d2,d3
	swap	d3
	move.w	d3,d0
	swap	d0
	divu.w	d2,d0

	move.w	d0,d3
	clr.w	d0
	swap	d0
	move.w	d0,(a4)			; r[i] = d%b
	exg	d0,d3

	subq.w	#2,d2			; b -= 2
	bcs.b	.inner_done

.inner_loop
	sub.l	d0,d1			; d = (d-d/b-d%b)/2
	sub.l	d3,d1			; same as d *= i
	lsr.l	#1,d1
.inner_entry
	move.w	-(a4),d0		; r[i]
	mulu.w	d5,d0
	add.l	d0,d1			; d += r[i]*10000
	move.l	d1,d0
	divu.w	d2,d0			; d/b
	bvs.b	.longdiv

	move.w	d0,d3			; d/b
	clr.w	d0
	swap	d0			; d%b
	move.w	d0,(a4)			; r[i] = d%b

	subq.w	#2,d2			; b -= 2
	bcc.b	.inner_loop

.inner_done
	divu.w	d5,d1			; d/10000
	add.w	d7,d1			; d = c+d/10000 (to be printed out)
	move.l	d1,d7
	swap	d7			; c = d%10000
	IFNE	PRINT_DIGITS
	bsr.b	PR0000
	ENDIF

	sub.w	#7*4,d6			; k -= 7*4
	add.l	d6,a4			; &r[k/2]
	bne.b	.outer_loop

;*** TIMED PART END ********************************************

	bsr.b	.gettime
	sub.l	(a7)+,d7		; end_time-start_time
	IFNE	LONG_TIMER
; I'll shoot if you ask for DST adjustment or anything similar.
	bpl.b	.same_day
	add.l	#TICKS_PER_SECOND*60*60*24,d7
.same_day
	ENDIF
	add.l	d7,d7			; dos ticks (1/50) to 1/100
	addq.l	#1,d7			; round up
	divu.w	#100,d7			; 1/100ths upper, seconds lower

	move.l	a4,d2			; print buffer
	move.b	#' ',(a4)+

	moveq	#0,d1			; skip leading zeroes
	bsr.b	SPrintTime		; d5 is already set to 10000

	move.b	#'.',(a4)+

	moveq	#'0',d1			; print leading zeroes
	swap	d7
	moveq	#10,d5
	bsr.b	SPrintTime

	move.b	d4,(a4)+		; newline

	move.l	a4,d3
	sub.l	d2,d3			; string size
	bra.b	callwrite

	; END OF PROGRAM (exec will re-enable multitasking)

;***************************************************************

SPrintTime	; d7=value, a4=buffer
	move.w	d7,d0
.Next	ext.l	d0
	divu.w	d5,d0
	cmp.b	d1,d0
	beq.b	.LeadZero
	moveq	#'0',d1
	add.b	d1,d0
	move.b	d0,(a4)+
.LeadZero
	swap	d0
	divu.w	d4,d5
	bne.b	.Next
	rts

PR0000		; d1=value
	move.l	#'0000'-$01010001,d0
	move.w	-(a4),d3
.Loop	addq.b	#1,d0			; top 3 digits in a loop
	add.w	d3,d1
	bpl.b	.Loop
	sub.w	d3,d1
	rol.l	#8,d0
	move.w	-(a4),d3		; last value is string size (4)
	bmi.b	.Loop
	add.b	d1,d0			; 4th digit
 
	move.l	d0,-(a4)

	moveq	#pbuffer-workspace,d2
	sub.l	d2,a4
writetext
	add.l	a4,d2			; offset to buffer address
callwrite
	move.l	a3,d1			; stdout
	jmp	Write(a6) 		; call Write(stdout,buffer,size)

;***************************************************************

; Data must be in this order all up to msg1.

	CNOP	0,4
pbuffer	DCB.B	4,0
dec2str	DC.W	dec2str-pbuffer,-10,-100,-1000

;*** OVERWRITTEN CODE/DATA STARTS HERE *************************
workspace
	DC.W	10,10000
	DC.W	((65536-(workspace-start))/7)&(~3) ; maxD (multiple of 4)
	DC.W	2000

; Reorder as needed and don't lose bytes due to EVENs.
msg1	DC.B	"number pi calculator v17",10	; odd length
msg1end
msg2	DC.B	"number of digits (up to "	; even length
msg2end
msg3	DC.B	")? "				; odd length
msg3end
	EVEN
printnum				; shortcut within 128 bytes
	bra.b	PR0000

msg4	DC.B	" digits will be printed",10	; even length
msg4end
	EVEN

;***************************************************************

getnum
	moveq	#msg1-workspace,d2
	moveq	#msg1end-msg1,d3
	bsr.b	writetext

.error	moveq	#msg2-workspace,d2
	moveq	#msg2end-msg2,d3
	bsr.b	writetext
	move.w	d7,d1			; maxD
	bsr.b	printnum
	moveq	#msg3-workspace,d2
	moveq	#msg3end-msg3,d3
	bsr.b	writetext

	jsr	Input(a6)		; get stdin
	move.l	d0,d1
	move.l	a4,d2			; read overwrites msg1
	moveq	#4+1,d3			; up to 4 digits + newline
	jsr	Read(a6)		; returns length in d0

	move.l	d2,a0
	moveq	#0,d2
.nextch	subq.w	#1,d0
	beq.b	.parsed
	move.w	#256-'0',d3
	add.b	(a0)+,d3
	cmp.w	d4,d3			; digit 0-9?
	bhs.b	.error
	mulu.w	d4,d2
	add.w	d3,d2			; D = D*10+digit
	bra.b	.nextch
.parsed
	cmp.w	d7,d2			; D > maxD?
	bhi.b	.error
	move.w	d2,d1			; D = 0?
	beq.b	.error

	addq.w	#3,d1
	and.w	#~3,d1			; adjust D to a multiple of 4
	moveq	#7,d6
	mulu.w	d1,d6			; k = N = 7*D
	cmp.b	(a0),d4			; last char is newline (1-4 digits)?
	bne.b	.adjusted
	cmp.w	d2,d1
	beq.b	.not_adjusted
.adjusted
	bsr.b	printnum		; either 5 digits or adjusted D
	moveq	#msg4-workspace,d2
	moveq	#msg4end-msg4,d3
	bra.w	writetext
.not_adjusted
	rts

;***************************************************************

bss	DS.B	65536-(*-start)		; 64kb allowed for code+data

;***************************************************************

; asm-one&co
;	PRINTV	bss-start+36		; 36 = hunk overhead
;	PRINTV	((65536-(workspace-start))/7)&(~3)

;***************************************************************

a/b · 01 July 2021, 21:52

OK, we could get 2 more bytes (=502) with those 4x trash regs in gettime *if* we:
- replace 2000 with 4*7 in movem for a2
- change stack decrement from 12 to a2 (+2)
- change d6 decrement to a2 (+2)
- change the fill loop to move.w #2000,(a4)+ (-2 because zero free regs for movem, and extra -2 bytes to load would be back to 504) ;\
But now there is a range problem with the 3rd writetext call at the end, it's 130 bytes away because everything before CNOP moved 2 bytes up and until we get 2 more bytes to get it back to 128 it's 504 or 504.

Don_Adan · 01 July 2021, 23:04

For free D6 register you can try to use something next.
Because you used movem.w then maybe free 2 bytes in gettime routine is not necessary.
Also moveq #10,D4 dont must be handled via movem.w.
Anyway after your latest movem.w changes my idea to free d6 is perhaps more complicated than previously.

Code:

;	move.l	d6,d7
        move.l a5,d7
.fill	move.w	a2,(a4)+		; a2 is free after this loop
	subq.w	#2,d7
	bne.b	.fill
        lea 28.W,A2
.outer_loop
	moveq	#0,d1			; d = 0
;	move.w	d6,d2
	subq.w	#1,d2
        bra.b .inner_entry

;	sub.w	#7*4,d6			; k -= 7*4
;	add.l	d6,a4			; &r[k/2]

       sub.w a2,a5
       move.w a5,d2 ; ccr
       add.l a5,a4
	bne.b	.outer_loop

a/b · 01 July 2021, 23:42

Quote:

Originally Posted by Don_Adan

Code:

;	move.w	d6,d2
	subq.w	#1,d2
...
; too late to set d2 here
       move.w a5,d2 ; ccr

d2 has to be initialized as b=k-1 (k=d6 in this case) before the inner loop, it's used as a divisor in there.
If ax is used instead of d6, then moving ax to a dummy dx, d1 for example, to set ccr is ok, it's the same code size (#28 vs. extra move) but if it leads to size gain elsewhere it's a good deal. E.g. a5 could be used as a counter instead, because it's needed after movem (for forbid) and can't be preloaded with a constant.

Yeah, d4 in movem doesn't reduce code size, but it's 1 less instruction and a larger workspace (potentially larger max digits). And d4 is needed before any other movem reg is not needed any more, so it doesn't harm if it's there.

a/b · 02 July 2021, 06:45

Here is the 500 bytes version, v2. Figured I could copy the registers "for free" with movem in gettime. It's +2 and then -2, same size, but fill loop is now faster and I'm happy again.
edit: Cosmetics.

Code:

;***************************************************************

; N = 7*D, D = digits, e.g. N = 700 for 100 digits

; user settings
PRINT_DIGITS		= 1
LONG_TIMER		= 0	; check for day transition?
HACKS			= 0	; use undocumented OS stuff?

; exec
TDNestCnt		= 295
LibList			= 378
LN_NAME			= 10	; list node name

; dos
Input			= -54
Output			= -60
Read			= -42
Write			= -48
DateStamp		= -192
TICKS_PER_SECOND	= 50	; dos timer frequency

;***************************************************************

start
	move.l	4.w,a5			; exec library
	IFEQ	HACKS
	lea	LibList(a5),a6		; find dos in library list
.lib_loop
	move.l	(a6),a6			; next library
	move.l	LN_NAME(a6),a0
	move.l	#'.sod',d0
.lib_name
	cmp.b	(a0)+,d0
	bne.b	.lib_loop
	lsr.l	#8,d0
	bne.b	.lib_name
	ELSE
	lea	-$148(a2),a6		; dos library from bcpl vector
	ENDIF	; HACKS

	jsr	Output(a6)
	move.l	d0,a3			; a3 = stdout
	lea	workspace(pc),a4
	moveq	#10,d4			; global const
	movem.w	(a4),d5/d6/d7/a2	; 10000, maxD, 2000, 4*7

	bsr.w	getnum			; returns N in d6 (k = N = 7*D)

	addq.b	#1,TDNestCnt(a5)	; FORBID macro, a5 is free now

	bsr.b	.gettime		; reg copy: a0 = d7, d7 = d6
	move.l	d1,-(a7)		; start time

;*** TIMED PART START ******************************************

.fill	move.w	a0,(a4)+
	subq.w	#2,d7
	bne.b	.fill

; outer+inner loop:
;	d3 upper word must initially be and remain 0
; 	d7 must initially be 0 (c = 0)
; d0=*, d1=d, d2=b, d3=tmp, d4=10, d5=10000, d6=k, d7=c
; a0=*, a1=*, a2=4*7, a4=r[] (a3=stdout, a5=--, a6=dos)

.outer_loop
	moveq	#0,d1			; d = 0
	move.w	d6,d2
	subq.w	#1,d2			; b = k-1
	bra.b	.inner_entry

.gettime	; returns ticks in d1, and copies: d7->a0, d6->d7
	movem.l	d0/d1/d2/d6/d7,-(sp)
	move.l	sp,d1
	jsr	DateStamp(a6)
	movem.l	(sp)+,d0/d1/d2/d7/a0	; d0=days, d1=minutes, d2=ticks
	mulu.w	#TICKS_PER_SECOND*60,d1	; minutes to ticks
	add.l	d2,d1
	rts

.longdiv	; d0/d2, 32/16 -> 32q/16r
	swap	d0
	move.w	d0,d3
	divu.w	d2,d3
	swap	d3
	move.w	d3,d0
	swap	d0
	divu.w	d2,d0

	move.w	d0,d3
	clr.w	d0
	swap	d0
	move.w	d0,(a4)			; r[i] = d%b
	exg	d0,d3

	subq.w	#2,d2			; b -= 2
	bcs.b	.inner_done

.inner_loop
	sub.l	d0,d1			; d = (d-d/b-d%b)/2
	sub.l	d3,d1			;  (same as d *= i)
	lsr.l	#1,d1
.inner_entry
	move.w	-(a4),d0		; r[i]
	mulu.w	d5,d0
	add.l	d0,d1			; d += r[i]*10000
	move.l	d1,d0
	divu.w	d2,d0			; d/b
	bvs.b	.longdiv

	move.w	d0,d3			; d/b
	clr.w	d0
	swap	d0			; d%b
	move.w	d0,(a4)			; r[i] = d%b

	subq.w	#2,d2			; b -= 2
	bcc.b	.inner_loop

.inner_done
	divu.w	d5,d1			; d/10000
	add.w	d7,d1			; d = c+d/10000 (to be printed out)
	move.l	d1,d7
	swap	d7			; c = d%10000
	IFNE	PRINT_DIGITS
	bsr.b	PR0000
	ENDIF

	sub.w	a2,d6			; k -= 7*4
	add.l	d6,a4			; &r[k/2]
	bne.b	.outer_loop		; k = 0?

;*** TIMED PART END ********************************************

	bsr.b	.gettime
	sub.l	(a7)+,d1		; end-start time
	IFNE	LONG_TIMER
; I'll shoot if you ask for DST adjustment or anything similar.
	bpl.b	.same_day
	add.l	#TICKS_PER_SECOND*60*60*24,d1
.same_day
	ENDIF
	add.l	d1,d1			; dos ticks (1/50) to 1/100
	divu.w	#100,d1			; 1/100ths upper, seconds lower

	move.l	a4,d2			; print buffer
	move.b	#' ',(a4)+

	moveq	#0,d3			; skip leading zeroes
	bsr.b	SPrintTime		; d5 is already set to 10000

	move.b	#'.',(a4)+

	moveq	#'0',d3			; print leading zeroes
	swap	d1
	moveq	#10,d5
	bsr.b	SPrintTime

	move.b	d4,(a4)+		; newline

	move.l	a4,d3
	sub.l	d2,d3			; string size
	bra.b	callwrite

	; END OF PROGRAM (exec will re-enable multitasking)

;***************************************************************

SPrintTime	; d1=value, a4=buffer
	move.w	d1,d0
.Next	ext.l	d0
	divu.w	d5,d0
	cmp.b	d3,d0
	beq.b	.LeadZero
	moveq	#'0',d3
	add.b	d3,d0
	move.b	d0,(a4)+
.LeadZero
	swap	d0
	divu.w	d4,d5
	bne.b	.Next
	rts

PR0000		; d1=value
	move.l	#'0000'-$01010001,d0
	move.w	-(a4),d3
.Loop	addq.b	#1,d0			; top 3 digits in a loop
	add.w	d3,d1
	bpl.b	.Loop
	sub.w	d3,d1
	rol.l	#8,d0
	move.w	-(a4),d3		; last value is string size (4)
	bmi.b	.Loop
	add.b	d1,d0			; 4th digit
 
	move.l	d0,-(a4)

	moveq	#pbuffer-workspace,d2
	sub.l	d2,a4
writetext
	add.l	a4,d2			; offset to buffer address
callwrite
	move.l	a3,d1			; stdout
	jmp	Write(a6) 		; call Write(stdout,buffer,size)

;***************************************************************

; Data must be in this order all up to msg1.

	CNOP	0,4
pbuffer	DCB.B	4,0			; keep it lword aligned preferably
dec2str	DC.W	dec2str-pbuffer,-10,-100,-1000

;*** OVERWRITTEN CODE/DATA STARTS HERE *************************
workspace
	DC.W	10000
	DC.W	((65536-(workspace-start))/7)&(~3) ; maxD (multiple of 4)
	DC.W	2000,4*7

; Reorder as needed and don't lose bytes due to EVENs.
msg1	DC.B	"number pi calculator v17",10	; odd length
msg1end
msg2	DC.B	"number of digits (up to "	; even length
msg2end
msg3	DC.B	")? "				; odd length
msg3end
	EVEN
printnum				; shortcut within 128 bytes
	bra.b	PR0000

msg4	DC.B	" digits will be printed",10	; even length
msg4end
	EVEN

;***************************************************************

getnum
	moveq	#msg1-workspace,d2
	moveq	#msg1end-msg1,d3
	bsr.b	writetext

.error	moveq	#msg2-workspace,d2
	moveq	#msg2end-msg2,d3
	bsr.b	writetext
	move.w	d6,d1			; maxD
	bsr.b	printnum
	moveq	#msg3-workspace,d2
	moveq	#msg3end-msg3,d3
	bsr.b	writetext

	jsr	Input(a6)		; get stdin
	move.l	d0,d1
	move.l	a4,d2			; read buffer
	moveq	#4+1,d3			; up to 4 digits + newline
	jsr	Read(a6)		; returns size in d0

	move.l	d2,a0
	moveq	#0,d2
.nextch	subq.w	#1,d0
	beq.b	.parsed
	move.w	#256-'0',d3
	add.b	(a0)+,d3
	cmp.w	d4,d3			; digit 0-9?
	bhs.b	.error
	mulu.w	d4,d2
	add.w	d3,d2			; D = D*10+digit
	bra.b	.nextch
.parsed
	cmp.w	d6,d2			; D > maxD?
	bhi.b	.error
	move.w	d2,d1			; D = 0?
	beq.b	.error

	addq.w	#3,d1
	and.w	#~3,d1			; adjust D to a multiple of 4
	moveq	#7,d6
	mulu.w	d1,d6			; k = N = 7*D
	cmp.b	(a0),d4			; last char is newline (1-4 digits)?
	bne.b	.adjusted
	cmp.w	d2,d1
	beq.b	.not_adjusted
.adjusted
	bsr.b	printnum		; either 5 digits or adjusted D
	moveq	#msg4-workspace,d2
	moveq	#msg4end-msg4,d3
	bra.w	writetext
.not_adjusted
	rts

;***************************************************************

bss	DS.B	65536-(*-start)		; 64kb allowed for code+data

;***************************************************************

; asm-one&co
;	PRINTV	bss-start+36		; 36 = hunk overhead
;	PRINTV	((65536-(workspace-start))/7)&(~3) ; max digits
;	PRINTV	(pbuffer-start)&3	; pbuffer alignment

;***************************************************************

Don_Adan · 02 July 2021, 19:32

Looks good. I dont see more possible optimisations. Now you can go on hot holidays or if you want you can try to optimise my old attempt, optimisation utility.library or other short Amiga ROM modules.

a/b · 02 July 2021, 21:36

Maybe something could still be done with the printing... It looks "great" (mvq/mvq/bsr) but it's repeated 4 times, 26 bytes of the same.
The problem is that it's all borked up, with number printing and a partial loop. So even with a simple trick like add.l d3,d2 right after Write() to set up the next string we are actually going backwards, there's isn't enough cases to reduce code size (can't jmp, need jsr+rts).
I'll think about it, if some kind of a byte array + wrapper could be squeezed in to replace mvq/mvq/bsr...

Don_Adan · 03 July 2021, 00:12

Quote:

Originally Posted by a/b

Maybe something could still be done with the printing... It looks "great" (mvq/mvq/bsr) but it's repeated 4 times, 26 bytes of the same.
The problem is that it's all borked up, with number printing and a partial loop. So even with a simple trick like add.l d3,d2 right after Write() to set up the next string we are actually going backwards, there's isn't enough cases to reduce code size (can't jmp, need jsr+rts).
I'll think about it, if some kind of a byte array + wrapper could be squeezed in to replace mvq/mvq/bsr...

I only mean about repeated text "number " copy or add one more write. but only minimal gain here, if any.

Don_Adan · 03 July 2021, 12:35

You can try this version, if works then is 3 bytes shortest (in real 2 or 4 bytes).

Code:

;***************************************************************

; N = 7*D, D = digits, e.g. N = 700 for 100 digits

; user settings
PRINT_DIGITS		= 1
LONG_TIMER		= 0	; check for day transition?
HACKS			= 0	; use undocumented OS stuff?

; exec
TDNestCnt		= 295
LibList			= 378
LN_NAME			= 10	; list node name

; dos
Input			= -54
Output			= -60
Read			= -42
Write			= -48
DateStamp		= -192
TICKS_PER_SECOND	= 50	; dos timer frequency

;***************************************************************

start
	move.l	4.w,a5			; exec library
	IFEQ	HACKS
	lea	LibList(a5),a6		; find dos in library list
.lib_loop
	move.l	(a6),a6			; next library
	move.l	LN_NAME(a6),a0
	move.l	#'.sod',d0
.lib_name
	cmp.b	(a0)+,d0
	bne.b	.lib_loop
	lsr.l	#8,d0
	bne.b	.lib_name
	ELSE
	lea	-$148(a2),a6		; dos library from bcpl vector
	ENDIF	; HACKS

	jsr	Output(a6)
	move.l	d0,a3			; a3 = stdout
	lea	workspace(pc),a4
	moveq	#10,d4			; global const
	movem.w	(a4),d5/d6/d7/a2	; 10000, maxD, 2000, 4*7

	bsr.w	getnum			; returns N in d6 (k = N = 7*D)

	addq.b	#1,TDNestCnt(a5)	; FORBID macro, a5 is free now

	bsr.b	.gettime		; reg copy: a0 = d7, d7 = d6
	move.l	d1,-(a7)		; start time

;*** TIMED PART START ******************************************

.fill	move.w	a0,(a4)+
	subq.w	#2,d7
	bne.b	.fill

; outer+inner loop:
;	d3 upper word must initially be and remain 0
; 	d7 must initially be 0 (c = 0)
; d0=*, d1=d, d2=b, d3=tmp, d4=10, d5=10000, d6=k, d7=c
; a0=*, a1=*, a2=4*7, a4=r[] (a3=stdout, a5=--, a6=dos)

.outer_loop
	moveq	#0,d1			; d = 0
	move.w	d6,d2
	subq.w	#1,d2			; b = k-1
	bra.b	.inner_entry

.gettime	; returns ticks in d1, and copies: d7->a0, d6->d7
	movem.l	d0/d1/d2/d6/d7,-(sp)
	move.l	sp,d1
	jsr	DateStamp(a6)
	movem.l	(sp)+,d0/d1/d2/d7/a0	; d0=days, d1=minutes, d2=ticks
	mulu.w	#TICKS_PER_SECOND*60,d1	; minutes to ticks
	add.l	d2,d1
	rts

.longdiv	; d0/d2, 32/16 -> 32q/16r
	swap	d0
	move.w	d0,d3
	divu.w	d2,d3
	swap	d3
	move.w	d3,d0
	swap	d0
	divu.w	d2,d0

	move.w	d0,d3
	clr.w	d0
	swap	d0
	move.w	d0,(a4)			; r[i] = d%b
	exg	d0,d3

	subq.w	#2,d2			; b -= 2
	bcs.b	.inner_done

.inner_loop
	sub.l	d0,d1			; d = (d-d/b-d%b)/2
	sub.l	d3,d1			;  (same as d *= i)
	lsr.l	#1,d1
.inner_entry
	move.w	-(a4),d0		; r[i]
	mulu.w	d5,d0
	add.l	d0,d1			; d += r[i]*10000
	move.l	d1,d0
	divu.w	d2,d0			; d/b
	bvs.b	.longdiv

	move.w	d0,d3			; d/b
	clr.w	d0
	swap	d0			; d%b
	move.w	d0,(a4)			; r[i] = d%b

	subq.w	#2,d2			; b -= 2
	bcc.b	.inner_loop

.inner_done
	divu.w	d5,d1			; d/10000
	add.w	d7,d1			; d = c+d/10000 (to be printed out)
	move.l	d1,d7
	swap	d7			; c = d%10000
	IFNE	PRINT_DIGITS
	bsr.b	PR0000
	ENDIF

	sub.w	a2,d6			; k -= 7*4
	add.l	d6,a4			; &r[k/2]
	bne.b	.outer_loop		; k = 0?

;*** TIMED PART END ********************************************

	bsr.b	.gettime
	sub.l	(a7)+,d1		; end-start time
	IFNE	LONG_TIMER
; I'll shoot if you ask for DST adjustment or anything similar.
	bpl.b	.same_day
	add.l	#TICKS_PER_SECOND*60*60*24,d1
.same_day
	ENDIF
	add.l	d1,d1			; dos ticks (1/50) to 1/100
	divu.w	#100,d1			; 1/100ths upper, seconds lower

	move.l	a4,d2			; print buffer
	move.b	#' ',(a4)+

	moveq	#0,d3			; skip leading zeroes
	bsr.b	SPrintTime		; d5 is already set to 10000

	move.b	#'.',(a4)+

	moveq	#'0',d3			; print leading zeroes
	swap	d1
	moveq	#10,d5
	bsr.b	SPrintTime

	move.b	d4,(a4)+		; newline

	move.l	a4,d3
	sub.l	d2,d3			; string size
	bra.b	callwrite

	; END OF PROGRAM (exec will re-enable multitasking)

;***************************************************************

SPrintTime	; d1=value, a4=buffer
	move.w	d1,d0
.Next	ext.l	d0
	divu.w	d5,d0
	cmp.b	d3,d0
	beq.b	.LeadZero
	moveq	#'0',d3
	add.b	d3,d0
	move.b	d0,(a4)+
.LeadZero
	swap	d0
	divu.w	d4,d5
	bne.b	.Next
	rts

PR0000		; d1=value
	move.l	#'0000'-$01010001,d0
	move.w	-(a4),d3
.Loop	addq.b	#1,d0			; top 3 digits in a loop
	add.w	d3,d1
	bpl.b	.Loop
	sub.w	d3,d1
	rol.l	#8,d0
	move.w	-(a4),d3		; last value is string size (4)
	bmi.b	.Loop
	add.b	d1,d0			; 4th digit
 
	move.l	d0,-(a4)

	moveq	#pbuffer-workspace,d2
	sub.l	d2,a4
writetext
	add.l	a4,d2			; offset to buffer address
callwrite
	move.l	a3,d1			; stdout
	jmp	Write(a6) 		; call Write(stdout,buffer,size)

;***************************************************************

; Data must be in this order all up to msg1.

	CNOP	0,4
pbuffer	DCB.B	4,0			; keep it lword aligned preferably
dec2str	DC.W	dec2str-pbuffer,-10,-100,-1000

;*** OVERWRITTEN CODE/DATA STARTS HERE *************************
workspace
	DC.W	10000
	DC.W	((65536-(workspace-start))/7)&(~3) ; maxD (multiple of 4)
	DC.W	2000,4*7

; Reorder as needed and don't lose bytes due to EVENs.
msg1	DC.B	"number pi calculator v17",10	; odd length
msg1end
;msg2	DC.B	"number of digits (up to "	; even length
msg2	DC.B	"of digits (up to "	;
msg2end
msg3	DC.B	")? "				; odd length
msg3end
	EVEN
printnum				; shortcut within 128 bytes
	bra.b	PR0000

msg4	DC.B	" digits will be printed",10	; even length
msg4end
	EVEN

;***************************************************************

getnum
	moveq	#msg1-workspace,d2
	moveq	#msg1end-msg1,d3
	bsr.b	writetext
        moveq   #7,d3                   ; "number " size, d2 already set
        bsr.b   callwrite
.error	moveq	#msg2-workspace,d2
	moveq	#msg2end-msg2,d3
	bsr.b	writetext
	move.w	d6,d1			; maxD
	bsr.b	printnum
	moveq	#msg3-workspace,d2
	moveq	#msg3end-msg3,d3
	bsr.b	writetext

	jsr	Input(a6)		; get stdin
	move.l	d0,d1
	move.l	a4,d2			; read buffer
	moveq	#4+1,d3			; up to 4 digits + newline
	jsr	Read(a6)		; returns size in d0

	move.l	d2,a0
	moveq	#0,d2
.nextch	subq.w	#1,d0
	beq.b	.parsed
	move.w	#256-'0',d3
	add.b	(a0)+,d3
	cmp.w	d4,d3			; digit 0-9?
	bhs.b	.error
	mulu.w	d4,d2
	add.w	d3,d2			; D = D*10+digit
	bra.b	.nextch
.parsed
	cmp.w	d6,d2			; D > maxD?
	bhi.b	.error
	move.w	d2,d1			; D = 0?
	beq.b	.error

	addq.w	#3,d1
	and.w	#~3,d1			; adjust D to a multiple of 4
	moveq	#7,d6
	mulu.w	d1,d6			; k = N = 7*D
	cmp.b	(a0),d4			; last char is newline (1-4 digits)?
	bne.b	.adjusted
	cmp.w	d2,d1
	beq.b	.not_adjusted
.adjusted
	bsr.b	printnum		; either 5 digits or adjusted D
	moveq	#msg4-workspace,d2
	moveq	#msg4end-msg4,d3
	bra.w	writetext
.not_adjusted
	rts

;***************************************************************

bss	DS.B	65536-(*-start)		; 64kb allowed for code+data

;***************************************************************

; asm-one&co
;	PRINTV	bss-start+36		; 36 = hunk overhead
;	PRINTV	((65536-(workspace-start))/7)&(~3) ; max digits
;	PRINTV	(pbuffer-start)&3	; pbuffer alignment

;***************************************************************

01 July 2021, 18:51	#450
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	I was already thinking about the fill loop, switching to words. Didn't feel like doing it because it would be slower. Dunno, I guess I'll do a few runs and see how much slower it is. An hour or so ago I had another idea that would need another word reg to initialize with a constant to make the code shorter (right now it breaks even with 3 regs): movem.w (a4),d4/d5/d7 (5 words) vs. 2x move # and 1x moveq (also 5 words), was considering fill words but left it open for now, still didn't like switching to words. edit: Almost forgot. I wrote about d6/d7 this morning while replying to Ross, can't do I think: d7 needs swap, d6 needed for ccr setting. Speed is fine, we should got for it I guess. But that movem.w (a4) opt is a total pain in the butt. Can't see it working atm, lots of reg switching to preserve everything long enough, if even possible ;(. Last edited by a/b; 01 July 2021 at 19:36.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
68020 Bit Field Instructions	mcgeezer	Coders. Asm / Hardware	9	27 October 2023 23:21
68060 64-bit integer math	BSzili	Coders. Asm / Hardware	7	25 January 2021 21:18
Discovery: Math	Audio Snow	request.Old Rare Games	30	20 August 2018 12:17
Math apps	mtb	support.Apps	1	08 September 2002 18:59

01 July 2021, 12:44	#441
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,959	Maybe changing length of text for msg1,msg2, msg4 to same value 23 or 24 bytes can win some D3? But perhaps it needs some changes in getnum error handling routine, if can works.

01 July 2021, 13:24	#444
alkis Registered User Join Date: Dec 2010 Location: Athens/Greece Age: 53 Posts: 719	btw, the amiga way (IMO) would be to grab argument from command line (a0). This design is like Input "What's your name";A$ we all did in basic on 8 bits

01 July 2021, 15:45	#446
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	It works for me with 2 bytes to spare (or 6 if I move d7 out). Getting rid of timer.L put the 3rd call back within range.

01 July 2021, 18:31	#449
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,959	Because this is size vs speed optimization, then you can perhaps use this: Code: move.w #2000,d5 move.l d6,d7 .fill move.w d5,(a4)+ subq.w #2,d7 bne.b .fill move.w #10000,d5 And if you preserve d3=5 then later mulu.w D3,D5 Also using A2 and A5 registers is perhaps possible to free d6 or d7 register. And later Ross dot com idea, can be used. But this is a few complicated and need some changes.

01 July 2021, 21:52	#453
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	OK, we could get 2 more bytes (=502) with those 4x trash regs in gettime if we: - replace 2000 with 4*7 in movem for a2 - change stack decrement from 12 to a2 (+2) - change d6 decrement to a2 (+2) - change the fill loop to move.w #2000,(a4)+ (-2 because zero free regs for movem, and extra -2 bytes to load would be back to 504) ;\ But now there is a range problem with the 3rd writetext call at the end, it's 130 bytes away because everything before CNOP moved 2 bytes up and until we get 2 more bytes to get it back to 128 it's 504 or 504.

01 July 2021, 23:04	#454
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,959	For free D6 register you can try to use something next. Because you used movem.w then maybe free 2 bytes in gettime routine is not necessary. Also moveq #10,D4 dont must be handled via movem.w. Anyway after your latest movem.w changes my idea to free d6 is perhaps more complicated than previously. Code: ; move.l d6,d7 move.l a5,d7 .fill move.w a2,(a4)+ ; a2 is free after this loop subq.w #2,d7 bne.b .fill lea 28.W,A2 .outer_loop moveq #0,d1 ; d = 0 ; move.w d6,d2 subq.w #1,d2 bra.b .inner_entry ; sub.w #74,d6 ; k -= 74 ; add.l d6,a4 ; &r[k/2] sub.w a2,a5 move.w a5,d2 ; ccr add.l a5,a4 bne.b .outer_loop

02 July 2021, 19:32	#457
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,959	Looks good. I dont see more possible optimisations. Now you can go on hot holidays or if you want you can try to optimise my old attempt, optimisation utility.library or other short Amiga ROM modules.

02 July 2021, 21:36	#458
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Maybe something could still be done with the printing... It looks "great" (mvq/mvq/bsr) but it's repeated 4 times, 26 bytes of the same. The problem is that it's all borked up, with number printing and a partial loop. So even with a simple trick like add.l d3,d2 right after Write() to set up the next string we are actually going backwards, there's isn't enough cases to reduce code size (can't jmp, need jsr+rts). I'll think about it, if some kind of a byte array + wrapper could be squeezed in to replace mvq/mvq/bsr...

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)