Linefeed counter code

Yulquen74 · 11 June 2024, 21:14

Below snip is part of something I pieced together, and it appears to work.
Can it be made faster? Target is 68000.

move.l _FileSz,d0
clr.l d1
move.l _BufferPt,a0

LineCountLoop:
cmp.b #LF,0(a0,d0)
beq NewLineFound

LineCountLoop2:
dbra d0,LineCountLoop
bra LineCountStop

NewLineFound:
addq.l #1,d1
bra LineCountLoop2

LineCountStop:
move.l d1,_LineCount

a/b · 11 June 2024, 21:55

Something like this:

Code:

	move.l	_BufferPt,a0
	move.l	_FileSz,d0
	moveq	#LF,d2
; char after the last one (last to be checked) must not be a LF
; buffer size should be (at least) 1 char larger than filesize
	clr.b	(a0,d0.l)
	addq.l	#1,d0
	moveq	#-1,d1
found:	subq.l	#1,d0
	addq.l	#1,d1
loop:	cmp.b	(a0)+,d2
	dbeq	d0,loop
	beq.b	found
	sub.l	#1<<16,d0	; support for files >64kb
	bcc.b	loop
	move.l	d1,_LineCount

jimshaw · 12 June 2024, 03:07

Instead of branching, you could try
seq d3
add.l d3,d1
after the cmp.

Your total will end up 255 times too big, so you would need to divide that out.

Don_Adan · 12 June 2024, 19:49

Nice trick, anyway divide can overflow on 68000.
Maybe something like this can works and be fast enough?

Code:

 moveq #0,d3
 ......

 seq d3
 neg.b d3
 add.l d3,d1

Yulquen74 · 12 June 2024, 21:21

Nice suggestions everyone!

I tested the seq trick with a 2MB file with 108000 lines, and at 7MHz loading and execution 18 secs, without the divide by 255 at the end.

Fast processing of the larger files is a priority, so I can perhaps unroll the loop combined with code to do proper 32/32 divide to get it even faster.

Thomas Richter · 12 June 2024, 21:28

For all practical matters, the I/O will be dominating the actual processing of the file content...

a/b · 12 June 2024, 22:20

seq&co will be done per char, so unless you expect a very high % of NLs, you want as few instructions per char as possible. E.g. 8+10 cycles for cmp/dbeq for every non-NL char beats seq&co easily.
Using your example, 2mb chars and 108k lines (~5% of NLs):
- cmp/seq/add/dbf ~= 30*2mb ~= 63mil cycles
- cmp/dbeq = 18*(2mb-108k)+48*108k ~= 41mil cycles
50% faster and don't even have to bother unrolling it (it would still be faster).

a/b · 13 June 2024, 02:10

Since I have nothing better to do while waiting on an NBA game, here is another take. A little faster but longer (unrolled). The same buffer requirement as above (1 extra byte at the end).

Code:

	move.l	_BufferPt,a0
	move.l	a0,a1
	add.l	_FileSz,a1
	moveq	#LF,d0
	move.b	d0,(a1)
	moveq	#-1,d1
loop	REPT	15
	cmp.b	(a0)+,d0
	beq.b	found
	ENDR
	cmp.b	(a0)+,d0
	bne.b	loop
found	addq.l	#1,d1
	cmp.l	a1,a0
	bls.b	loop
	move.l	d1,_LineCount

It's roughly 35-36mil cycles.

Yulquen74 · 13 June 2024, 08:02

Great, will test it later!

Don_Adan · 13 June 2024, 09:18

Nice code.
Anyway, when I was young, sometimes I needed "in place" searching/patching without extra allocated 1 byte.
I dont want to copy source file and alloc new buffer too.
I used something like this:

Code:

	move.l	_BufferPt,a0
	move.l	a0,a1
	add.l	_FileSz,a1
	moveq	#LF,d0
   move.b (a1),d2          ; backup byte
	move.b	d0,(a1)
	moveq	#-1,d1
loop	REPT	15
	cmp.b	(a0)+,d0
	beq.b	found
	ENDR
	cmp.b	(a0)+,d0
	bne.b	loop
found	addq.l	#1,d1
	cmp.l	a1,a0
	bls.b	loop
  move.b d2,(a1)            ; restore byte
	move.l	d1,_LineCount

roondar · 13 June 2024, 10:39

Quote:

Originally Posted by Yulquen74

I tested the seq trick with a 2MB file with 108000 lines, and at 7MHz loading and execution 18 secs, without the divide by 255 at the end.

Quote:

Originally Posted by a/b

- cmp/seq/add/dbf ~= 30*2mb ~= 63mil cycles- cmp/dbeq = 18*(2mb-108k)+48*108k ~= 41mil cycles

Quote:

Originally Posted by a/b

It's roughly 35-36mil cycles.

Just to give a different perspective than millions of cycles, I converted the above to seconds and a percentage of the total 18 seconds runtime.

Assuming cmp/seq/add/dbf is "the seq trick", that code will spend ~8,9 seconds counting newlines out of 18 seconds (or ~41% of the total time spent).
The simpler cmp/dbeq would take ~5,8 seconds counting newlines out of 15 seconds (or ~38% of the total time spent).
The unrolled code would take ~5,1 seconds counting newlines out of 14 seconds (or ~36% of the totel time spent)

Of course, these are theoretical numbers and they assume the system can spend all it's available cycles on counting newlines with the CPU for the period of time it's doing the counting. This is of course not entirely accurate, as the OS will both have some overhead and some form of disk caching that will likely impact the actual results.

nocash · 13 June 2024, 15:20

For not allocating an extra byte, you could temporarily change the last byte (if it's writeable, and if no other code is reading from it).

But fastest would be doing the counting on the fly during loading, though results depend on whether it's loaded with/without OS, from which media, with/without RAM and disk cache.

And there are cases where you don't really need to know the exact number of lines, such like displaying a vertical scrollbar for a large file: simply use the byte-address as scrollbar index.

Don_Adan · 13 June 2024, 15:43

Counting on the fly during loading will be much slower.
For me not worth.
Maybe only, if not enough continues memory available for loading big file and it will be second (optional) routine.
But this is dependent for what this routine is necessary.

Yulquen74 · 13 June 2024, 18:39

So, at worst case (7MHz) a/b's new code uses about 7.3 secs to complete using the 2.1MB example file (1000 startup-sequences stitched together).
About 1 sec is spent loading the program + file to RAM (A590).

At regular speed in this A500 (61MHz/1ws in this case) the operation takes about 1.6 secs. Just under 1s is spent on HD activity.

So this is a great improvement, and I am pretty happy with the code as it is now.

12 June 2024, 19:49	#4
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 56 Posts: 2,088	Nice trick, anyway divide can overflow on 68000. Maybe something like this can works and be fast enough? Code: moveq #0,d3 ...... seq d3 neg.b d3 add.l d3,d1

13 June 2024, 02:10	#8
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,087	Since I have nothing better to do while waiting on an NBA game, here is another take. A little faster but longer (unrolled). The same buffer requirement as above (1 extra byte at the end). Code: move.l _BufferPt,a0 move.l a0,a1 add.l _FileSz,a1 moveq #LF,d0 move.b d0,(a1) moveq #-1,d1 loop REPT 15 cmp.b (a0)+,d0 beq.b found ENDR cmp.b (a0)+,d0 bne.b loop found addq.l #1,d1 cmp.l a1,a0 bls.b loop move.l d1,_LineCount It's roughly 35-36mil cycles.

13 June 2024, 09:18	#10
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 56 Posts: 2,088	Nice code. Anyway, when I was young, sometimes I needed "in place" searching/patching without extra allocated 1 byte. I dont want to copy source file and alloc new buffer too. I used something like this: Code: move.l _BufferPt,a0 move.l a0,a1 add.l _FileSz,a1 moveq #LF,d0 move.b (a1),d2 ; backup byte move.b d0,(a1) moveq #-1,d1 loop REPT 15 cmp.b (a0)+,d0 beq.b found ENDR cmp.b (a0)+,d0 bne.b loop found addq.l #1,d1 cmp.l a1,a0 bls.b loop move.b d2,(a1) ; restore byte move.l d1,_LineCount

13 June 2024, 15:20	#12
nocash Registered User Join Date: Feb 2016 Location: Homeless Posts: 67	For not allocating an extra byte, you could temporarily change the last byte (if it's writeable, and if no other code is reading from it). But fastest would be doing the counting on the fly during loading, though results depend on whether it's loaded with/without OS, from which media, with/without RAM and disk cache. And there are cases where you don't really need to know the exact number of lines, such like displaying a vertical scrollbar for a large file: simply use the byte-address as scrollbar index. Last edited by nocash; 13 June 2024 at 15:26.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
feature request: frame counter	a/b	support.WinUAE	3	24 May 2023 21:15
Frame counter in Blitz	LateBit	Coders. Blitz Basic	3	27 December 2022 21:40
dbra to handle long counter	Geijer	Coders. Asm / Hardware	11	27 November 2020 10:50
Address pointers with Program Counter	Lonewolf10	Coders. Asm / Hardware	8	27 October 2015 11:40
Program Counter with Offset - why?	Jherek Carnelia	Coders. General	26	21 March 2011 10:49

11 June 2024, 21:14	#1
Yulquen74 Registered User Join Date: May 2013 Location: Kleppe / Norway Posts: 273	Linefeed counter code Below snip is part of something I pieced together, and it appears to work. Can it be made faster? Target is 68000. move.l _FileSz,d0 clr.l d1 move.l _BufferPt,a0 LineCountLoop: cmp.b #LF,0(a0,d0) beq NewLineFound LineCountLoop2: dbra d0,LineCountLoop bra LineCountStop NewLineFound: addq.l #1,d1 bra LineCountLoop2 LineCountStop: move.l d1,_LineCount

12 June 2024, 03:07	#3
jimshaw Registered User Join Date: Apr 2021 Location: Sydney Posts: 8	Instead of branching, you could try seq d3 add.l d3,d1 after the cmp. Your total will end up 255 times too big, so you would need to divide that out.

12 June 2024, 21:21	#5
Yulquen74 Registered User Join Date: May 2013 Location: Kleppe / Norway Posts: 273	Nice suggestions everyone! I tested the seq trick with a 2MB file with 108000 lines, and at 7MHz loading and execution 18 secs, without the divide by 255 at the end. Fast processing of the larger files is a priority, so I can perhaps unroll the loop combined with code to do proper 32/32 divide to get it even faster.

12 June 2024, 21:28	#6
Thomas Richter Registered User Join Date: Jan 2019 Location: Germany Posts: 3,403	For all practical matters, the I/O will be dominating the actual processing of the file content...

12 June 2024, 22:20	#7
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,087	seq&co will be done per char, so unless you expect a very high % of NLs, you want as few instructions per char as possible. E.g. 8+10 cycles for cmp/dbeq for every non-NL char beats seq&co easily. Using your example, 2mb chars and 108k lines (~5% of NLs): - cmp/seq/add/dbf ~= 302mb ~= 63mil cycles - cmp/dbeq = 18(2mb-108k)+48*108k ~= 41mil cycles 50% faster and don't even have to bother unrolling it (it would still be faster).

13 June 2024, 08:02	#9
Yulquen74 Registered User Join Date: May 2013 Location: Kleppe / Norway Posts: 273	Great, will test it later!

13 June 2024, 15:43	#13
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 56 Posts: 2,088	Counting on the fly during loading will be much slower. For me not worth. Maybe only, if not enough continues memory available for loading big file and it will be second (optional) routine. But this is dependent for what this routine is necessary.

13 June 2024, 18:39	#14
Yulquen74 Registered User Join Date: May 2013 Location: Kleppe / Norway Posts: 273	So, at worst case (7MHz) a/b's new code uses about 7.3 secs to complete using the 2.1MB example file (1000 startup-sequences stitched together). About 1 sec is spent loading the program + file to RAM (A590). At regular speed in this A500 (61MHz/1ws in this case) the operation takes about 1.6 secs. Just under 1s is spent on HD activity. So this is a great improvement, and I am pretty happy with the code as it is now.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)