I've read a few posts and different algos about line counting, and it got me courious about my solution. Mainly I'm curious if I'm doing it the 'wrong' way.

Basically, I use CreateFile/GetFileSize/GlobalAlloc/GlobalLock/ReadFile. Then I loop through every char byte by byte checking for 10 (line feed). It seems like it would be horribly unoptimized doing it byte by byte but it's pretty fast on my work machine (p4 2.4 ghz/256mb ram). I'm not exactly a hardware guru by any means, so I don't know if that's a good enough test platform to test average speed or whatnot.

I'm grabbing the filename from the commandline and the program just displays a messagebox with the number of lines and then exits. Windows.inc (28 thousand odd lines) pops up the messagebox a split second after hitting enter at a dos prompt (lc "c:\masm32\include\windows.inc". Now I don't think I've ever had a single source file that comes anywhere close to 28000 lines, so am I safe with the byte by byte loop?

mov esi,pMemory
xor ebx,ebx
cmp ebx,fSize
jge _eof
cmp byte ptr[esi + ebx],13
jz _crlf
inc ebx
jmp _start
inc ebx
inc lCount
jmp _start
;here I just clean up and close the file handle, free the memory, etc

Is that a bad solution? It seems that it would be, but all of my tests seem really good.

Posted on 2003-07-15 14:20:02 by Will
Byte by byte is pretty much the only way I can think of doing this.
Unless you find a magic number that'll test 4 bytes at a time for the character.

A good test platform is what you'll expect to run the code on under normal conditions. If you're standard platform is a 386, then your P4 will be misleading as a benchmark system, but if you're targeting current and future processors Intel has a good share of the market!

You should consider using a register such as edx to store the "lCount" variable, it may help performance a bit.

Here is my version. I take some pain in setting up, as it allows me to remove the comparison to "fSize" inside the loop (done via the increment). On a P4, you may want to replace the inc and dec instructions with add XXX, 1 and sub XXX, 1 respectively.

Also try to ensure that the loop is aligned, this will buy you a fair bit of speed!

mov eax, fSize
mov ecx, pMemory
xor edx, edx
lea ecx, [eax + ecx]
neg eax
dec eax

inc eax
jz @F
cmp byte ptr [eax + ecx], 13
jne @B
inc edx
jmp @B

mov lCount, edx

Posted on 2003-07-15 17:29:50 by Mirno
Thanks for the info Mirno. Please forgive my ignorance, but what do you mean by 'ensure that the loop is aligned'?
Posted on 2003-07-15 18:05:51 by Will
dunno! if this is faster.

push ebx
push esi
mov esi, [esp+12]
mov ecx, [esp+16]
dec ecx
xor ebx, ebx
inc ebx
xor eax, eax
xor edx, edx
cmp BYTE PTR [esi+ecx], 13d
cmove edx, ebx
add eax, edx
dec ecx
jnz __bytec
pop esi
pop ebx
retn 8
you can use MMX-SSE with a combination of pcmpeqb-pmovmskb-shr-adc, propagate 13(0Dh) using movd-punpcklbw(3x). Just an idea. :)
Posted on 2003-07-15 18:24:25 by arkane
Been there, done that, bought the T-shirt:

Posted on 2003-07-15 18:40:39 by bitRAKE

I've read a few posts and different algos about line counting, and it got me courious about my solution. Mainly I'm curious if I'm doing it the 'wrong' way.

That's great. As I said, I've read different posts on this subject. I was curious if there was anything wrong with my simpler algo, since it seems to work pretty fast for me.
Posted on 2003-07-15 18:50:47 by Will
There is nothing 'wrong' with the code - it does the job. Other code would have other features. Use that code that has the features you desire. Often for me the primary feature of the code I use is that it is mine. :)
Posted on 2003-07-15 18:58:15 by bitRAKE
I think there is nothing wrong.

I'm sure you've already done this: test the algo using odd and even string length.

here's more info:

Windows = CRLF or 0D0A or \r\n
Unix = LF
Mac OS = CR
Posted on 2003-07-15 19:04:16 by arkane

"Byte by byte is pretty much the only way I can think of doing this.
Unless you find a magic number that'll test 4 bytes at a time for the character."

If the working buffer is for example 2000 bytes long
put zero at 2001 byte (your buffer should be larger at least 10 bytes!!)
and try my faster InSrtCR proc:

InStrCR proto lpBuffer:DWORD

;Usage: push offset Buffer
; call InStrCR
;On exit: eax->num of CRs

Align 16 ; Align 16 before the proc
InStrCR proc lpBuffer:DWORD
mov edx, [esp+4] ; edx->lpBuffer
db 3Eh
mov ecx, -4
push ebx ; save ebx
push edi ; save edi
mov edi, [edx] ; edi->lpBuffer
xor ebx, ebx ; ebx->counter
lea eax, [edi+0FEFEFEFFh]
xor edi, 0D0D0D0Dh ; 0Dh =13->CR
add edx, 4
add edi, 0FEFEFEFFh
or eax, edi ; testing 13 and 0 simultaneously
mov edi, [edx]
test eax, 80808080h
je BytesScan ; 4 clocks per 4 bytes
cmp byte ptr [edx+ecx],0Dh
je IncCounter
cmp byte ptr [edx+ecx],0 ; is it the end of buffer->2001 byte and beyond!!!
je ToEnd
add ecx, 1 ; ecx-> -4 to 0
jne SrchNextByte
add ecx, -4 ; ecx = -4
jno BytesScan
Align 16
add ebx, 1 ; inc counter
add ecx, 1 ; ecx-> -4 to 0
jne SrchNextByte
add ecx, -4 ; ecx = -4
jno BytesScan
Align 16
pop edi ; restore edi
mov eax, ebx ; ebx=eax->counter
pop ebx ; restore ebx
ret 4
InStrCR endp
OPTION PROLOGUE:PROLOGUEDEF ; turn back on the defaults

Posted on 2003-07-15 19:49:09 by lingo12
Hi all!
IMHO, to minimize time of counting need previously write MOV AL,13 and use CMP al,[ ] inside the loop.
Posted on 2003-07-16 07:55:19 by MikDay
MikDay, that would not work. The goal is to count multiple lines.
Posted on 2003-07-16 08:24:13 by bitRAKE
Thanks guys! You've all given me some new ideas that I'm playing with.

What exactly did you mean though Mirno, with 'ensure that the loop is aligned'?
Posted on 2003-07-16 17:32:43 by Will
Hi bitRAKE. In my previous post I want to say something like this:

mov esi,PMemory
xor eax,eax
mov ecx,fSize
jecxz L1 ;FileLength may be=0
dec ecx
jz EofCount
mov bl,13 ;outside the loop

cmp ,bl
sete dl
add eax,edx
dec ecx
jnz CountLoop

mov lCount,eax

By the way, Will. Why your sample contains:
cmp ebx,fSize
jge _eof ;?signed jump

FileSize is unsigned value. If the size of tested file is more or equal 2^31,
you will get wrong result.
Posted on 2003-07-18 07:00:56 by MikDay
I test for file length of 0 prior to entering that loop. The jge is kind of a long story. I've gotten in the habit of doing that on the off chance that something else fails and the file pointer increments past the actual file size. It's more of a paranoid thing to catch any other error in my code while I'm testing a new algo so I don't have to do an end task on a frozen program. Is a jge any different then a je in terms of speed or general performance? I'm not too knowledgeable in that respect. As for the 2^31 filesize being greater then 0ffffffffh or whatever.....this is only meant for my use for counting lines in source code files and if I ever have source files that large I'll have more to worry about (like all my hair being pulled out, needing a padded room, etc) then a gpf. Besides, this algo is really only 'good' for small source files anyways. :)

I still don't understand the 'ensure loop is aligned' comment though.
Posted on 2003-07-18 11:12:16 by Will
Will, alignment is the placing of code on particular boundaries. As you know powers of 2 mean a lot in computing, and the case of placing your code is no exception.

The cache will hold chunks of data (a cache line) of a particular size (the size is processor architecture dependant). Typically this will be something like 32 bytes. Each of these chunks are a copy of memory that is fast to access (2 clocks vs 100s of memory on a P4), and so making the most of this space is important. Caches for implementation reasons will only fetch lines that are aligned to a boundary in memory. This means that the bottom N bits are zero, and in the case of a 16 byte cache line, N will usually be 4 (2^N = size of cache line). This makes things easy to find:

If you want data at address X in our 16-byte cache line machine, then:

found = -1;
for (i = 0; i < cache_size; i++)
if (cacheline[i].addr == (X & 0xFFFFFFF0))
found = i;

if (found < 0)
return CacheMiss(); // We need to go fetch the data!
return cacheline[i].data >> (X & 0xF);

It means that the hardware should never have 2 cached copies of the same data (updating one, and not the other could be a disaster if the wrong one got copied back with stale data), and that the comparators in the hardware can be smaller that the full address width (this can help timing, searching a big cache can push the limits of the clock, this is why the P3 Celerons were more overclockable than the P3-propper. The smaller cache meant less things to search, so more tolerance on the clock edges).

This method though does have the downside of making the cache less effective if the programmer hasn't put data on boundaries. If the data isn't packed tight, then part of the cache line is going to waste, hence more lines are taken up (this in it self isn't bad if the data set is small, but if you have a bigger amount of data, then you could run out of cache). Second problem is that of the data being misaligned, this means that half of the data is on one cache line, the other half is in another. So a worst case scenario would be to place data such that you need two cache lines per item.

Code suffers from the exact same problem, if a loop is placed so that it starts on a cache boundary, it is going to take less space inside the cache, and so less likely to be flushed out. If it is in the cache then there is less strain on memory bandwidth, so other data accesses should potentially be quicker too.

If you use the "ALIGN" directive, then MASM will insert NOPs for you to get the code placed on the appropriate boundary. If you benchmark your code with and without an "ALIGN 16" command before the loop, you should see some difference in speed (asuming you weren't lucky, and it already was)! Also use this before the PROC declaration to make that start on a suitable boundary.

When building your code use the following command line to generate a list file:
\masm32\bin\ml /c /Flcode.txt /Sa /Sn mycode.asm

This will create a HUGE file (it lists all of the includes and all that junk) called code.txt.Down the bottom of this file there will be the code and the hex that it represents. If you look at where things are placed, then you can (if you are really concerned about speed) select instructions accordingly (e.g. changing "cmp <reg>, 0" for "or <reg>, <reg>", and "dec <reg>", "sub <reg>, 1") for placement reasons. But only if you REALLY want to crush every last clock out of your code!

Posted on 2003-07-18 15:01:00 by Mirno