BCD Arithmetic - howto^ - Page 3

modrobert · 23 August 2014, 12:43

Quote:

Originally Posted by alkis

You can always use the OS.

Code:

* converts d0 long number to decimal ascii at (a0)
decimalconvert
    movem.l d0/a0-a3,-(sp)
    move.l d0,savelongvalue
    lea.l savelongvalue(pc),a1
    move.l a0,a3
    lea.l formatString(pc),a0
    lea.l stuffChar(pc),a2
    CALLEXEC RawDoFmt
    movem.l (sp)+,d0/a0-a3
    rts

stuffChar:
    move.b  d0,(a3)+
    rts

savelongvalue dc.l 0
formatString  dc.b '%10ld',0
    EVEN

Thanks a lot! Will single step through the process and see how the OS does it.

Didn't know there was a system call for it.
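
A call to the quoted routine could look like the sketch below. ShowScore and numbuf are names made up for this example, not part of alkis's code. The buffer size follows from the format string: '%10ld' pads to at least ten characters, a negative ten-digit value needs one more for the sign, and RawDoFmt also hands the terminating 0 byte to stuffChar.

```asm
* Sketch only: ShowScore and numbuf are hypothetical names.
ShowScore
	move.l	#123456,d0
	lea	numbuf(pc),a0
	bsr	decimalconvert	; numbuf now holds "    123456",0
	rts

numbuf	ds.b	12		; sign + 10 digits + terminating 0
	EVEN
```

Because the field is right-justified and padded with spaces, the digits always end at the same offset, which is handy for a fixed-position score display.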

Leffmann · 23 August 2014, 13:41

Personally I would skip the BCD and just do long-form division on 32-bit integers. You can extract up to 4 digits with each run of 2 divisions, so it's reasonably fast. Finding the digits by means of subtracting 10^n is even faster, and both methods are trivial to extend for integers of any length.

modrobert · 23 August 2014, 14:11

Quote:

Originally Posted by Leffmann

Personally I would skip the BCD and just do long-form division on 32-bit integers. You can extract up to 4 digits with each run of 2 divisions, so it's reasonably fast. Finding the digits by means of subtracting 10^n is even faster, and both methods are trivial to extend for integers of any length.

Please treat me like an idiot, because it's the truth (at least in this case).

Could you explain that with some sample code?

Whenever I search it's usually crappy little endian x86 code showing up, or Atmel.

Leffmann · 23 August 2014, 14:40

For example, like this:

Code:

Print	move.l	sp, a0
	sub	#12, sp
	sf	-(a0)

.loop	clr.l	d1
	swap	d0
	move.w	d0, d1
	divu.w	#10, d1
	move.w	d1, d0
	swap	d0
	move.w	d0, d1
	divu.w	#10, d1
	move.w	d1, d0
	swap	d1
	add.b	#'0', d1
	move.b	d1, -(a0)

	tst.l	d0
	bne	.loop

	; A0 now points to ASCII string

	add	#12, sp
	rts

We're working in base 2^16, but other than that it's really no different from the long division we did with pen and paper back at school. But I don't want to stir up bad memories.
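
The other method mentioned earlier, finding each digit by repeatedly subtracting 10^n, might look like the sketch below. It is not Leffmann's code: ToDecimal, the .pow10 table and the register usage are assumptions made for this example, and leading zeros are kept to keep it short.

```asm
* Sketch only: ToDecimal and .pow10 are hypothetical names.
* In:  d0.l = unsigned value, a0 = buffer with room for 11 bytes
* Out: ten ASCII digits (leading zeros kept) plus a 0 byte at (a0)
ToDecimal
	movem.l	d0-d2/a0-a1,-(sp)
	lea	.pow10(pc),a1
.digit	move.l	(a1)+,d1	; next power of ten, 0 ends the table
	beq.s	.done
	moveq	#'0'-1,d2
.sub	addq.b	#1,d2		; count how often d1 fits into d0
	sub.l	d1,d0
	bcc.s	.sub		; no borrow yet, keep subtracting
	add.l	d1,d0		; undo the subtraction that went too far
	move.b	d2,(a0)+	; store the ASCII digit
	bra.s	.digit
.done	clr.b	(a0)		; terminate the string
	movem.l	(sp)+,d0-d2/a0-a1
	rts

.pow10	dc.l	1000000000,100000000,10000000,1000000
	dc.l	100000,10000,1000,100,10,1,0
```

There is no division at all here; each digit costs at most ten passes through the inner loop, and the table can be shortened when the value is known to be small.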

modrobert · 23 August 2014, 14:45

Thanks! It was when/how to do the 'swap' I needed to see. Will test soon, still fiddling with alkis's code.

Asman · 15 September 2014, 22:02

Quote:

Originally Posted by phx

This is the code I'm currently using:

Code:

8       subq.l  #8,sp
4       move.l  sp,a0
12      move.b  Score(a4),d0
4       moveq   #15,d1
4       and.b   d0,d1
8       move.b  d1,(a0)+
14      lsr.b   #4,d0
8       move.b  d0,(a0)+
12      move.b  Score+1(a4),d0
4       moveq   #15,d1
4       and.b   d0,d1
8       move.b  d1,(a0)+
14      lsr.b   #4,d0
8       move.b  d0,(a0)
28      add.l   #'0000',(sp)
---
140

My second attempt at a 4-digit BCD score routine.

Code:

    lea score(pc),a0 ;8c
    move.b  (a0),d3 ;8c
    moveq   #$f,d1  ;4c
    and.b   d1,d3   ;4c
    move.w  (a0)+,d0    ;8c, now a0 points to digits
    move.b  d0,d2   ;4c
    and.w   d1,d2   ;4c
    lsr.w   #4,d0   ;14c
    and.w   d0,d1   ;4c
    move.b  d3,d0   ;4c

    ;if ascii then uncomment (take extra 20c)
    ;
    ;move.w  #$3030,d3   ;8c
    ;add.w   d3,d0       ;4c
    ;add.b   d3,d1       ;4c
    ;add.b   d3,d2       ;4c
    
    move.w  d0,(a0)+    ;8c
    move.b  d1,(a0)+    ;8c
    move.b  d2,(a0)+    ;8c
                               ; = 86c  (or 106c with ascii version)

score:  dc.w    $1234 ; score in bcd format
digits: dc.l    0,0

It's hard for me to beat DonAdan's table-based approach, and very hard to beat Codetapper's 8-byte BCD version, but I will try.

mc6809e · 15 September 2014, 23:07

If the blitter is expected to run, or if bitplane DMA uses more than 4 planes in a chip/slow RAM system, it might be interesting to compare algorithms based on how often they touch memory rather than by how many cycles they take.

Leffmann's code has some time-consuming DIVUs, but the code leaves plenty of DMA cycles free.

Somewhat related:

MULS and DIVS instructions can be paired, with a MULS immediately followed by a DIVS, to create a nice memory-access-free window of up to 238 cycles, provided that all data is already in Dx registers.

This is possible because the MULS instruction first prefetches the DIVS instruction before beginning internal execution while the DIVS instruction does its prefetch cycle at the end following its internal execution.

Asman · 16 September 2014, 09:57

Quote:

Originally Posted by mc6809e

If the blitter is expected to run, or if bitplane DMA uses more than 4 planes in a chip/slow RAM system, it might be interesting to compare algorithms based on how often they touch memory rather than by how many cycles they take.

I don't get it. So does that mean a routine with fewer reads/writes to chip/slow RAM will be faster? How can I check this?

For example, this routine has one read and one write and takes 114c:

Code:

    lea score(pc),a0
    move.w  (a0)+,d0
    move.w  #$f0f0,d1
    and.w   d0,d1
    eor.w   d1,d0
    move.b  d1,d2       ; note: bits 8-11 of d2 must already be 0 here
    rol.w   #4,d2
    ror.w   #4,d1
    move.b  d0,d2
    ror.w   #8,d0
    move.b  d0,d1
    swap    d1
    move.w  d2,d1
;for ascii uncomment (extra 16c )
;   add.l #$30303030,d1 
    move.l  d1,(a0)+
;=114c (with ascii take 130c)

score:  dc.w    $1234
digits: dc.l    0,0

Mrs Beanbag · 16 September 2014, 11:38

Of course we can divide by 10,000 with a divu.w for a maximum long input of 655,359,999 (65535 * 10000 + 9999, the largest value whose quotient still fits in 16 bits), then convert each half of the result separately (i.e. a recursive approach).

Also, one could divide by 1,000,000 by shifting right by 4 and then dividing by 62500 (since 1,000,000 = 16 * 62500). I'll leave correcting the remainder as an exercise for the student.
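
One way to do that remainder correction, as a hedged sketch: if n = 16*m + l (l being the four bits shifted out) and m = 62500*q + r, then n = 1,000,000*q + 16*r + l, so the true remainder is 16*r + l. DivMillion and the use of d1/d2 are choices made for this example, not something from the post.

```asm
* Sketch only: DivMillion is a hypothetical name, d2 is used as scratch.
* In:  d0.l = unsigned value n
* Out: d0.l = n / 1000000, d1.l = n mod 1000000
DivMillion
	moveq	#15,d1
	and.l	d0,d1		; d1 = the low 4 bits the shift throws away
	lsr.l	#4,d0		; d0 = n / 16, at most $0FFFFFFF
	divu.w	#62500,d0	; d0 = remainder:quotient, quotient <= 4294
	moveq	#0,d2
	move.w	d0,d2		; d2 = quotient
	clr.w	d0
	swap	d0		; d0 = remainder r of the divu, 0..62499
	lsl.l	#4,d0		; 16*r ...
	add.l	d0,d1		; ... plus the shifted-out bits = n mod 1000000
	move.l	d2,d0
	rts
```

For n = 1234567 this gives m = 77160, q = 1, r = 14660 and l = 7, so the remainder comes out as 16*14660 + 7 = 234567. The quotient can never overflow the divu.w, because even $FFFFFFFF shifted right by 4 divided by 62500 is only 4294.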

mc6809e · 16 September 2014, 19:42

Quote:

Originally Posted by Asman

I don't get it. So does that mean a routine with fewer reads/writes to chip/slow RAM will be faster? How can I check this?

For example, this routine has one read and one write and takes 114c:

You have to include prefetch cycles for each instruction if your code is in chip RAM. The and.w d0,d1 instruction in your code, for example, has one prefetch cycle, so even though the work is done inside registers, the instruction still needs to prefetch the next instruction from memory before it can finish.

Now consider the rol.w #4,d2 instruction. It runs one prefetch cycle at the beginning of execution, followed by a number of internal operations that rotate the data. For rol.w #4,d2, the number of internal cycles is 10.

The rol instruction is unusual in that it has cycles that don't require memory access. Most of the time, though, you can assume an instruction spends all its time accessing memory for things like instruction prefetch and operand reads and writes. And if the display uses four or fewer bitplanes, bitplane DMA can usually overlap with CPU memory accesses, since the CPU begins a memory access cycle by placing an address on the address bus and does not transfer data during the first two cycles of that access. Bitplane DMA can occur during those first two cycles.

Things change as you add more bitplanes or use the blitter. Instructions that need to access memory are more often made to wait if bitplane DMA or blitter DMA blocks the CPU in the two cycles after the address is placed on the bus.

This is where instructions that have internal operations can make a difference. If internal operation overlaps with DMA, there is less slowdown.

The page http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt gives data bus usage for the 68000 CPU.

It can be a little confusing. Make sure to ignore whitespace and pipes when trying to comprehend bus usages.

For example, EORI.L #$55555555, d0 runs like this:

npnpnpnn

Each 'n' represents two cycles of internal processing. Most of the time this is when the CPU puts an address on the bus before a memory access. The 'p' means prefetch. The last two 'n's in the instruction represent four CPU cycles of internal processing. System DMA can occur during any of the 'n's without slowing the CPU down.

koobo · 22 November 2021, 06:38

I just recently realized (woke up in the middle of the night) that I can get rid of a DIVU from inside a loop by using BCD. I have a two digit line counter that gets displayed, so I converted this:

Code:

	lea	.pos(pc),a0
	move	d6,d0
	divu	#10,d0
	or.b	#'0',d0
	move.b	d0,(a0)
	swap	d0
	or.b	#'0',d0
	move.b	d0,1(a0)

to this:

Code:

	lea	.pos(pc),a0		
	move.w	d6,d0	* $00XY
	lsl.w	#4,d0	* $0XY0
	lsr.b	#4,d0	* $0X0Y
	or.w	#$3030,d0
	move	d0,(a0)

Neat

A lookup table would be another alternative but this shall suffice.
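
How the counter itself stays in BCD is not shown in the post. One possibility, assuming the count lives in the low byte of d6 as packed BCD and goes up by one per line, is ABCD. This is only a sketch under that assumption, and d0 is used as scratch:

```asm
* Sketch only: assumes d6.b holds the line count as packed BCD ($00-$99)
	moveq	#1,d0		; moveq leaves X alone, so clear it next
	andi	#$ef,ccr	; X = 0, because abcd adds X as well
	abcd	d0,d6		; d6.b = d6.b + 1 in BCD: $09 -> $10, $99 -> $00
```

ABCD only works on bytes, so the upper byte of d6.w stays 0 and the shift-based conversion above still sees $00XY.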
