Maybe he's playing games with the cache?


I'm concerned with aligning data too. I haven't had to "ask what is the alignment?" yet but then again I'm only aligning strings on the heap for now.

I haven't had any time to play with the following yet as I'm up to my ass in alligators at both work & home :sigh: I pulled the old write code during a boring meeting trick.

There are probobly good reasons for not doing the following but if I don't post & get feedback I'll never learn.

lea edx, variable
mov eax, 1 ; Init return value to "ones"
or edx, 32 ; Make sure we exit loop

shr edx, 1 ; Test lo order bit
jc @F ; Exit if set
shl eax, 1 ; Double return vlaue
jmp @B ; Try next lo order bit

I tried to work on pairing & no stalls (haven't gotten to the chapter on jumps yet)... but let's see what you & others think.
Posted on 2001-08-02 09:52:22 by rafe
sorry no offense intended.

actually I'm putting strings on the heap (pages) which are variable length (tacked onto a structure really-- symbol hash table). so aligning has to be done every time. i'm using a method similar to the one you mentioned to get things aligned.

As i said, i haven't had to ask the question... I just align & save the pointer like in Agner's docs... but i'm not in hutch's shoes ... given that & my newbie ways i'm not prepared to do to much judging at this stage.

PS: if the code i posted is bad then let me know...
Posted on 2001-08-02 10:40:43 by rafe
just a little sidenote.... make sure the stack is not misaligned (i.e. by using byte locals. i'm not sure whether masm allows that, tasm surely does) as otherwise you'll get pretty funny results when using api calls on win2k.
Posted on 2001-08-02 10:43:34 by Tola
no. you did a lot of hand waving in pseudo-c code & spoke in generalities. i read that part just fine.

i still don't see the asm code that's light-years better.
Posted on 2001-08-02 10:52:01 by rafe
well... i just checked it. tasm does indeed align the stack when using byte locals. it does not, however, align it correctly when you're doing something like this:

OneByte db ?
MyStruc ENDS

LOCAL MyVar: MyStruc

which will assemble to

enter 0001, 00

don't know about masm, though.
Posted on 2001-08-02 11:20:23 by Tola
Better version of finding the alignment is:

lea edx, MyVar
mov eax, 1
or edx, 32

bsf ecx, edx
dec cl
shl eax, cl

No branching!


; ------- Why I just edited this -------
After actually getting off my fat lazy computer programming arse and trying, I discovered that you DO need the "dec cl" line!
Posted on 2001-08-02 11:47:58 by Mirno
In agners documentation, the amount of time taken is:
11 + (2*n) cycles

And as we have specified the "or edx, 32", the value of "n" can be at most 5. Hence the operation will take at most 21 cycles.

This is only true for the PPlain and P-MMX as BSF/BSR take only 2 micro-ops (reg & reg) on the PPro/PII/PIII. I personally believe that the majority of users will have P6 architecture machines.

Even if you are optimising for the PPlain, the extra itterations of the loop (2 shifts, 1 conditional jump, 1 unconditional jump) will cost you a similar amount.

Posted on 2001-08-02 12:07:16 by Mirno

Something to compare & contrast with when I get home.

I definitely like the whole no-branch thing plus a "new" instruction has been kicked into play. & yes, I agree that most users will likely have a p6 or more.

Exactly what I was hoping for. Thanks.
Posted on 2001-08-02 13:01:28 by rafe
I was posting an alternative to rafe's code....

As for not reading, I did state the speed of BSF on P6 arch (2 micro ops).

The advantage of the code I gave was the lack of jumps, this will give you a fair speed increase, which offsets the loss on the PPlain P-MMX anyway.

I agree you could use a LUT, I prefere not to use LUTs where possible (just personal taste).

I would like to see a divide and conquer method, but only if it can be done without jumps. Miss-predicted jumps will cost a lot more than one expensive instruction!


P.S. Your tone may be construed as offensive, and I would prefere that this did not turn into some personal vendetta.
Posted on 2001-08-03 05:02:20 by Mirno
I think it would be better to discuss this further by email, rather than continuing by cluttering up the message board.

P.S. The link you posted (in that last message at least gave a 403 Forbidden error).

I will mail you with a reply shortly...

Posted on 2001-08-03 07:20:16 by Mirno
Are you sure a LUT is faster ? Generally under Windows you get a high penalty when the lut is not in the cache, something like 2000 clock cycles and even more. I think it happens when there is a page fault, and you go through the Windows code that will load the page from the disk.
Posted on 2001-08-03 11:03:47 by Dr. Manhattan
If you only run the algo once, then there is a big penalty associated with the LUT method.
This is a one time only overhead, and when averaged out over 4096 repeated tests it comes in at the same speed at the BSF version (on a PIII).

The BSF algo is stated above, the LUT algo I used is as follows:

MyLUT db 32, 1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1
db 16, 1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1

mov edx, lpVar
xor eax, eax
and edx, 31

mov al, BYTE PTR [offset MyLUT + edx]

I think the two algos produce the same output, but I haven't tested....

Posted on 2001-08-03 11:54:03 by Mirno

A couple of things, the code I posted is test code, it should not be included in an app, it was aimed at testing alignment.


If you are dealing with a number of small strings, try using OLE string memory as it is optimally 16 BYTE aligned, access is always fast and as it is preallocate memory, its faster there as well.


Posted on 2001-08-03 20:48:47 by hutch--