Here is a simple HLA utility I wrote to count the lines in a file. I ran it against a file with about 250,000 lines and it counted in a little under a minute. As a comparison, I used the WC command from I believe the MKS toolkit and it returned the line count within a few seconds.

Why is my routine so slow?


Thanks all.



// LineCount-
//
// This program will count the number of lines
// in the file specified on the command line

// Calling the program:
// LineCount inFile


program LineCount;

#include( "stdlib.hhf" );

#macro inc64( dest ):noOvrFlw;
inc( (type dword dest) );
jnz noOvrFlw;
inc( (type dword dest[4]) );
noOvrFlw:
#endmacro


static
f: file;
inFile: String;
uns64Result: dword;
i: int32;

begin LineCount;

arg.c();
mov (eax, i);
if( i <> 2 ) then
stdout.put( "Usage: LineCount inFile", stdio.lf);
exit LineCount;
endif;

mov ( arg.v(1), inFile );
mov( 0, (type dword uns64Result[0]));
mov( 0, (type dword uns64Result[4]));

f.create();
f.open( inFile, fileio.r );

while( !f.eof() ) do
f.a_gets();
inc64(uns64Result);
endwhile;

f.close();
stdout.put( (type uns64 uns64Result[0]), stdio.lf );

end LineCount;:confused:
Posted on 2003-06-05 22:31:57 by C. Wardell
#macro inc64( dest );
add( 1, (type dword dest) );
adc( 0, (type dword dest[4]) );
#endmacro


Beyond that the slow down is in reading the file only a line at a time! Another way would be to read the file in large chunks and count the lines in a chunk of bytes. Which I'm sure the a_gets(); function is already doing to get a line, or reading the file a byte at a time! :)
Posted on 2003-06-05 22:49:10 by bitRAKE
I needed a 64 Bit # some of the counts I do are in excess of 5 Billion. This is the Inc macro I am using as sugested by Randy. It works, but no where near as fast as I would have hoped.

:mad:
Posted on 2003-06-06 00:41:25 by C. Wardell
I have suggested a replacement macro above without branching. I have never used HLA before - my suggestions are based on knowledge of ASM only and assumptions of HLA syntax based on examples seen (flame away or ignore me :)). Really, I know how to increment a 64-bit number. ;)

...or you could use the FPU stack, or keep the value in registers if they are preserved.

The bottleneck is the algorithm - a_gets(); already counts(only knows how to count to one ;)) the lines to stop at them. If you had the code for a_gets(); then you could take out the stopping part and add inc64.
Posted on 2003-06-06 00:48:04 by bitRAKE
bitRake,

The only problem with the new macro is that additions affect the carry flag, whereas increment/decrement leaves the carry as is. Wardell note this may or may not affect any other calculations you are doing... So, this is not necessarily a replacement for your inc64 macro.
{But in the given circumstances, it is the MOST OPTIMAL inc64...}


Wardell,

General comments...

1. For the task you are doing, you will scarcely need 64 bits: You should scarcecly need 32 bits (4 billion) to count the number of bytes in a file, much less the nuumber of lines in a file. You have reached 250,000 which is not quite a 20 bit number. I would say, stick with a 32 bit nuumber. Don't put in 64 bit numbers just because you have a nifty inc64 macro.

2. Of course this is just a test program for something bigger, but, hey, optimizations are in order: you are not doing anything with the registers. Why not use one as a counter?

while( !f.eof() ) do
f.a_gets();
inc ebx;
endwhile;

3. Is there a better way to read a line? eg...
fileio.readLn( );
I'm not sure how to port it to the format you are using, ,probably
f.readLn();

4. The code for a_gets() is in hla\libsrc\fileio\fgets.hla
Posted on 2003-06-06 03:01:31 by V Coder
Unfortunately, We do need 64 Bit #'s. We have well over 4.5 Billion rows in our database. What this program is designed to do is to take the Max Sequence ID from the database and genereate ID's for the next incoming flat file that is to be loaded into the data warehouse. (Actually our next ID is around 7 Billion )


Does your suggestion using the EBX register support #'s over 4 Billion?
Posted on 2003-06-06 10:00:13 by C. Wardell
Let me clarify my last post.

The line count program works hand in hand with my AddId program. So I guess they both suffer from the same bottle neck. When dealing with such huge files and many of them, optimization can save days of processing.

Your suggestions are greatly appreciated..

Thank you,
Charlie



// AddId-
//
// This program will read each line in the
// file specified on the command line and
// append an auto incrementing ID to the end
// of each record. The id will start at the
// number passed in on the command line.

// Calling the program:
// AddId inFile startingID FieldSep outfile
// inFile: the name of the file to be read
// startingID: is the ID to assign to the first record
// fieldSep: the ID should first be prefixed by this character

// AddId myFile.dat 1000 ~ > myFile.id


program AddId;

#include( "stdlib.hhf" );

#macro inc64( dest ):noOvrFlw;
inc( (type dword dest) );
jnz noOvrFlw;
inc( (type dword dest[4]) );
noOvrFlw:
#endmacro


static
f: file;
f2: file;
inFile: String;
fieldSep: String;
currentRec: String;
digitStr: String;
uns64Result: dword;
i: int32;

begin AddId;

arg.c();
mov (eax, i);
if( i <> 4 ) then
stdout.put( "Usage: AddId inFile StartingID fieldSep > outFile" );
exit AddId;
endif;


mov ( arg.v(1), inFile );
mov ( arg.v(3), fieldSep );

arg.v(2);
mov( eax, digitStr ); // digitStr is type <<string>>.
conv.strTou64( digitStr, 0 ); // Result comes back in EDX:EAX
mov( eax, (type dword uns64Result[0]));
mov( edx, (type dword uns64Result[4]));
strfree( digitStr );

f.create();
f.open( inFile, fileio.r );

while( !f.eof() ) do
f.a_gets();
mov( eax, currentRec );
inc64(uns64Result);
stdout.put( currentRec, "|", (type uns64 uns64Result[0]), stdio.cr, stdio.lf );
strfree( currentRec );
endwhile;

f.close();

end AddId;
Posted on 2003-06-06 10:06:26 by C. Wardell
I know this is the hla group but it seems hla over complicates things?

by why not...

Spasm Syntax


; Line Count test
; Look maw! no libraries

[push | push #1 | #+1]
[pop | pop #1 | #+1]
[call | push #L>2 | call #1]
[TextToCount: B$ "1
" #3000 0]

Main:
mov esi TextToCount
mov ecx 0
L1: inc esi | mov ax W$esi
cmp ax 0000 | je L3>
cmp ax 0A0D | jne L1<
inc ecx | jmp L1<
L3: inc ecx

call 'KERNEL32.ExitProcess' 0
ret



or Masm32 syntax


; #################################################

.486
.model flat, stdcall
option casemap :none ; case sensitive

; #################################################

include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
include \masm32\include\gdi32.inc
include \masm32\include\masm32.inc
include \masm32\include\debug.inc

includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\gdi32.lib
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\debug.lib

main PROTO

.data
TextToCount db "Test Line",13,10
db "Line Two",0

; #################################################

.code

start:

call main

invoke ExitProcess,0

; #################################################
main proc

mov esi, OFFSET TextToCount
mov ecx, 0

@@:
inc esi
mov ax, word ptr [esi]
cmp ax, 0000
je @F
cmp ax, 0A0Dh
jne @B
inc ecx
jmp @B

@@:
inc ecx

PrintDec ecx
ret

main endp

; #################################################
end start


You are really counting the number of CRLF aren't you?

I apologize for this post, I am no master asm guy.
I am a former C programmer and am lost as to why hla simplifies things.

?Need more than 4 million, change it and use 64bit.
or read your file in chunks. What happens in the futur when it becomes
100 milion?

Once again sorry, my goal isnt to start a war with randy. :eek:

RobotBob
Posted on 2003-06-06 15:36:20 by RobotBob

Here is a simple HLA utility I wrote to count the lines in a file. I ran it against a file with about 250,000 lines and it counted in a little under a minute. As a comparison, I used the WC command from I believe the MKS toolkit and it returned the line count within a few seconds.

Why is my routine so slow?


Thanks all.


All the time is begin spent in stdin.a_gets.

Two comments:

1. stdin.a_gets allocates storage for the new line on the heap by calling stralloc, which calls malloc, which calls Win32's global allocation routines. Memory allocation is a big time hog in most OSes. It doesn't help that you never free the storage, so this winds up causing paging to occur (most likely, because you're reading so many lines, and that is *slow*).

2. Here's what stdin.a_gets is doing:
a. Windows may or may not buffer up file data in its own internal buffs
b. The HLA StdLib reads a chuck of the data to an internal buffer
c. stdin.a_gets allocates storage and copies the data to that buffer
and then reallocates the buffer to the size of the string read.
d. You process the string.

This requires several copies of the data, plus the expense of the malloc calls. This is all expensive!

Under normal circumstances, the right solution would be to map the file into memory using memory mapped I/O and then simply scan through the data in memory searching for line feeds. However, because your file can be so large, this is not a possibility. Therefore, the next best solution is to read the file a large block at a time into a memory buffer and scan through that buffer. I'd recommend reading the file in at least 64K blocks.

For the absolute best performance, don't bother using the HLA Standard Library routines - they add an extra layer between your application and the OS. While the extra time is probably negligible, if all you're doing is text I/O, using the Win32 API functions is just as easy as using the HLA Standard Library (other than you've got to figure out what functions to call). You'll want to look up w.CreateFile, w.Read, and w.CloseHandle for the task at handle.

Also, don't use the scans instruction to search for those line feeds if speed is really important. A short loop that tests each character individually (or better yet, a loop that tests a dword at a time) runs faster.
A modification of the HLA Standard Library str.zlen routine might be a good approach (note that if you subtract $0a0a0a0a from a dword, searching for line feed becomes the easier case of searching for zero in a string).
Cheers,
Randy Hyde
Posted on 2003-06-06 15:57:34 by rhyde
Here are the results:

Using my the AppID in HLA took 1hr 28 Minutes on a 1.4 Gig file -

What shocked me is that the same app written in MS VC++ took 8 Minutes.. :o
Posted on 2003-06-06 16:35:26 by C. Wardell
Why not multi thread it and each thread read a block of the file.
at the end add up all of your invidual line counts?

8 minutes seems excessive to me, also. One hour would not be acceptable,obviously. Granted it is 1.4 Gig file.

But simple word cmp on moderate chunks will fly, we are measure this in
milliseconds. The simple word compare flys through my text copy of 'war and peace', it doesn't even register one second on my machine.

As randy mentioned you must keep the os from paging, if you allow it.
all of your 'speed up' work goes to trash.

Once again I could be wrong, not a guru (never hope to be, gurus must constantly prove that). I 've only spent a god awful amount of time parsing things.

Good luck Wardell

RobotBob
Posted on 2003-06-06 19:31:05 by RobotBob
Use MMX to scan for end of line, and use multiple threads so something is done while loading the next block from disk - like RobotBob suggests. The line count can be done in the time to read the file from disk. :)

Read here:
http://www.asmcommunity.net/board/index.php?topic=3525&highlight=long+string+length
Posted on 2003-06-06 19:46:32 by bitRAKE
Thank you all for your input, by the way, did I mention I was a newbie to ASM and don't know what the hell I am doing?
Posted on 2003-06-06 20:09:08 by C. Wardell
I find most of the people on this board first assume
a person is not a beginner, to prevent from insulting
them by implying that.
-robotbob
Posted on 2003-06-06 20:21:49 by RobotBob
I think randall.hyde means something like the below (sorry for asm code instead of hla).


.data?
hFile dd ?
dd1 dd ?
highdd dd ?
buffer db 1024 dup (?)
.code
...
invoke CreateFile,OFFSET FileName, GENERIC_READ, FILE_SHARE_READ,0,OPEN_ALWAYS,FILE_ATTRIBUTE_NORMAL,0
mov hFile,eax
invoke GetFileSize, eax, offset highdd
xchg esi,eax ;esi = low dword
mov edi, highdd
L1:
sub esi,1024
jns @F
dec edi
@@:
invoke ReadFile,hFile, offset buffer,1024,offset dd1,0
lea edx,buffer
mov eax,1024
L2:
xor ecx,ecx
cmp byte ptr[edx+eax], 13h
sete cl
cmp byte ptr[edx+1+eax], 13h
sete cl
add counter, ecx
sub eax,2
jnz L2
test edi,edi
jnz L1
test esi,esi
jnz L1

Something like that but I think there is some bug in it.
Posted on 2003-06-06 23:17:44 by roticv
Wow,

This small project has really turned into a treasure trove of technique. The reason why I chose the HLA was to learn some of the ins and outs of asm before jumping into asm directly.

To tell you the truth, I am at a loss with the suggestions that have been presented. Holy smokes it looks like greek to me. Now I have been programming for about 16 years and my hat is off to all of you. I did some asm programming on the 6502 and a86 back in the late 80's but it seemed much simpler then. What a rude awakening.

I guess what I need to do is to go through each line of code to see what the sugestions are actually accomplishing. Any links to a fast refresher course?

Thanks again,
Charlie
Posted on 2003-06-07 00:00:19 by C. Wardell
Well all it has is just 2 loops.
The first loop is to read 1024 chunk at a time.
The second loop is to scan the for CRLF (actually scanning for just LF or CR would be sufficient). I made it 2 scan 2 bytes at a time since CRLF takes up 2 space it the possibility of LF or CR to appear twice in the 2bytes should not be possible. (Well I used xor ecx,ecx and sete cl to reduce branching.)

Seriously I do not find asm as hard as some people claims it is.
http://www.madwizard.org/view.php?page=tutorials.contents&PHPSESSID=9f2dc59c35d62979bcfbc18c04a17a52
(Thomas's tutorial to asm)

I'd recommend reading the file in at least 64K blocks

And since randall.hyde suggest using 64k blocks, just change the 1024 to 65536 and to use HeapAlloc/VirtualAlloc to create a buffer for the data and scan through the buffer for CRLF.
Posted on 2003-06-07 01:21:14 by roticv
Here's a quick and dirty solution.
Not really faster than what a *good* C implementation will produce, but won't be an order of magnitude slower:



program fastLC;
#include( "stdlib.hhf" )
#include( "w.hhf" )

const
bufSize_c := 64*1024;

static
bytesRead :dword;
filename :string;
handle :dword;
count :uns64;

var
buffer :char[ bufSize_c ];

begin fastLC;

if( arg.c() <> 2 ) then

stdout.put( "usage: fastlc <<filename>>" );
exit fastLC;

endif;

mov( arg.v( 1 ), filename );
w.CreateFile
(
filename,
w.GENERIC_READ,
0,
NULL,
w.OPEN_EXISTING,
0,
null
);
if( eax = w.INVALID_HANDLE_VALUE ) THEN

stdout.put( "Could not open file '", filename, "'" nl );
exit fastLC;

endif;
mov( eax, handle );
forever

lea( edi, buffer );
w.ReadFile( handle, [edi], bufSize_c, bytesRead, NULL );
breakif( bytesRead = 0 );
xor( edi, edi );
while( edi < bytesRead ) do

if( buffer[edi] = stdio.lf ) then

inc( (type dword count) );
if( @z ) then

inc( (type dword count[4]) );

endif;

endif;
inc( edi );

endwhile;

endfor;
w.CloseHandle( handle );
stdout.put( "Number of lines in '", filename, "' is ", count, nl );

end fastLC;


Now, to really speed things up, you've got to work on the loop that scans the string for line feeds. As I mentioned earlier, you really want to process this four bytes at a time. Or better yet, use the MMX registers or SSE register and do it eight bytes at a time. Speeding this code up is where you'll beat the pants off a C implementation. Everything else is system or OS dependent and whether you use a HLL or assembly won't make much difference.
Cheers,
Randy Hyde
Posted on 2003-06-07 01:34:10 by rhyde
Oh, I forgot to mention,
A quick test with an arbitrary 1,000,000+ line source file took about 14.8 seconds to process (the arbitrary file was a C file produced by compiling HLA's bison code into C, and then I duplicated that output about 8-12 times). Most of the lines were fairly long.
Cheers,
Randy Hyde
Posted on 2003-06-07 01:36:26 by rhyde

Oh, I forgot to mention,
A quick test with an arbitrary 1,000,000+ line source file took about 14.8 seconds to process (the arbitrary file was a C file produced by compiling HLA's bison code into C, and then I duplicated that output about 8-12 times). Most of the lines were fairly long.
Cheers,
Randy Hyde

I Also forgot to mention that my version of WC produced comparable timings.
This is on a 300 MHz PII machine, btw.
Now WC does more work, so this isn't very impressive, but then I've not tuned the assembly code at all, either.

You might want to take a look at Terje Mathisen's WC program from many years ago. This was an 8086 (!) program that really kicked butt at scanning through text files looking for words, lines, etc. Don't know that his algorithm (using a 64K lookup table) would still fly today, but it had some interesting ideas for speeding up the code).

There is no question in my mind that with some very straight-forward changes to this code, you can easily double the speed. Going wild, you should be able to quadruple it, at least.
Cheers,
Randy Hyde
Posted on 2003-06-07 01:41:24 by rhyde