X-Calibre,
Maybe he's playing games with the cache?
Hutch--,
I'm concerned with aligning data too. I haven't had to "ask what is the alignment?" yet but then again I'm only aligning strings on the heap for now.
I haven't had any time to play with the following yet as I'm up to my ass in alligators at both work & home :sigh: I pulled the old write code during a boring meeting trick.
There are probobly good reasons for not doing the following but if I don't post & get feedback I'll never learn.
I tried to work on pairing & no stalls (haven't gotten to the chapter on jumps yet)... but let's see what you & others think.
Maybe he's playing games with the cache?
Hutch--,
I'm concerned with aligning data too. I haven't had to "ask what is the alignment?" yet but then again I'm only aligning strings on the heap for now.
I haven't had any time to play with the following yet as I'm up to my ass in alligators at both work & home :sigh: I pulled the old write code during a boring meeting trick.
There are probobly good reasons for not doing the following but if I don't post & get feedback I'll never learn.
lea edx, variable
mov eax, 1 ; Init return value to "ones"
or edx, 32 ; Make sure we exit loop
@@:
shr edx, 1 ; Test lo order bit
jc @F ; Exit if set
shl eax, 1 ; Double return vlaue
jmp @B ; Try next lo order bit
@@:
I tried to work on pairing & no stalls (haven't gotten to the chapter on jumps yet)... but let's see what you & others think.
sorry no offense intended.
actually I'm putting strings on the heap (pages) which are variable length (tacked onto a structure really-- symbol hash table). so aligning has to be done every time. i'm using a method similar to the one you mentioned to get things aligned.
As i said, i haven't had to ask the question... I just align & save the pointer like in Agner's docs... but i'm not in hutch's shoes ... given that & my newbie ways i'm not prepared to do to much judging at this stage.
PS: if the code i posted is bad then let me know...
actually I'm putting strings on the heap (pages) which are variable length (tacked onto a structure really-- symbol hash table). so aligning has to be done every time. i'm using a method similar to the one you mentioned to get things aligned.
As i said, i haven't had to ask the question... I just align & save the pointer like in Agner's docs... but i'm not in hutch's shoes ... given that & my newbie ways i'm not prepared to do to much judging at this stage.
PS: if the code i posted is bad then let me know...
just a little sidenote.... make sure the stack is not misaligned (i.e. by using byte locals. i'm not sure whether masm allows that, tasm surely does) as otherwise you'll get pretty funny results when using api calls on win2k.
no. you did a lot of hand waving in pseudo-c code & spoke in generalities. i read that part just fine.
i still don't see the asm code that's light-years better.
i still don't see the asm code that's light-years better.
@x-calibre:
well... i just checked it. tasm does indeed align the stack when using byte locals. it does not, however, align it correctly when you're doing something like this:
which will assemble to
don't know about masm, though.
well... i just checked it. tasm does indeed align the stack when using byte locals. it does not, however, align it correctly when you're doing something like this:
MyStruc STRUCT
OneByte db ?
MyStruc ENDS
MyProc PROC
LOCAL MyVar: MyStruc
ret
MyProc ENDP
which will assemble to
enter 0001, 00
leave
ret
don't know about masm, though.
Better version of finding the alignment is:
No branching!
Mirno
; ------- Why I just edited this -------
After actually getting off my fat lazy computer programming arse and trying, I discovered that you DO need the "dec cl" line!
lea edx, MyVar
mov eax, 1
or edx, 32
bsf ecx, edx
dec cl
shl eax, cl
No branching!
Mirno
; ------- Why I just edited this -------
After actually getting off my fat lazy computer programming arse and trying, I discovered that you DO need the "dec cl" line!
In agners documentation, the amount of time taken is:
And as we have specified the "or edx, 32", the value of "n" can be at most 5. Hence the operation will take at most 21 cycles.
This is only true for the PPlain and P-MMX as BSF/BSR take only 2 micro-ops (reg & reg) on the PPro/PII/PIII. I personally believe that the majority of users will have P6 architecture machines.
Even if you are optimising for the PPlain, the extra itterations of the loop (2 shifts, 1 conditional jump, 1 unconditional jump) will cost you a similar amount.
Mirno
11 + (2*n) cycles
And as we have specified the "or edx, 32", the value of "n" can be at most 5. Hence the operation will take at most 21 cycles.
This is only true for the PPlain and P-MMX as BSF/BSR take only 2 micro-ops (reg & reg) on the PPro/PII/PIII. I personally believe that the majority of users will have P6 architecture machines.
Even if you are optimising for the PPlain, the extra itterations of the loop (2 shifts, 1 conditional jump, 1 unconditional jump) will cost you a similar amount.
Mirno
Mirno,
Something to compare & contrast with when I get home.
I definitely like the whole no-branch thing plus a "new" instruction has been kicked into play. & yes, I agree that most users will likely have a p6 or more.
Exactly what I was hoping for. Thanks.
Something to compare & contrast with when I get home.
I definitely like the whole no-branch thing plus a "new" instruction has been kicked into play. & yes, I agree that most users will likely have a p6 or more.
Exactly what I was hoping for. Thanks.
I was posting an alternative to rafe's code....
As for not reading, I did state the speed of BSF on P6 arch (2 micro ops).
The advantage of the code I gave was the lack of jumps, this will give you a fair speed increase, which offsets the loss on the PPlain P-MMX anyway.
I agree you could use a LUT, I prefere not to use LUTs where possible (just personal taste).
I would like to see a divide and conquer method, but only if it can be done without jumps. Miss-predicted jumps will cost a lot more than one expensive instruction!
Mirno
P.S. Your tone may be construed as offensive, and I would prefere that this did not turn into some personal vendetta.
As for not reading, I did state the speed of BSF on P6 arch (2 micro ops).
The advantage of the code I gave was the lack of jumps, this will give you a fair speed increase, which offsets the loss on the PPlain P-MMX anyway.
I agree you could use a LUT, I prefere not to use LUTs where possible (just personal taste).
I would like to see a divide and conquer method, but only if it can be done without jumps. Miss-predicted jumps will cost a lot more than one expensive instruction!
Mirno
P.S. Your tone may be construed as offensive, and I would prefere that this did not turn into some personal vendetta.
I think it would be better to discuss this further by email, rather than continuing by cluttering up the message board.
P.S. The link you posted (in that last message at least gave a 403 Forbidden error).
I will mail you with a reply shortly...
Mirno
P.S. The link you posted (in that last message at least gave a 403 Forbidden error).
I will mail you with a reply shortly...
Mirno
Are you sure a LUT is faster ? Generally under Windows you get a high penalty when the lut is not in the cache, something like 2000 clock cycles and even more. I think it happens when there is a page fault, and you go through the Windows code that will load the page from the disk.
If you only run the algo once, then there is a big penalty associated with the LUT method.
This is a one time only overhead, and when averaged out over 4096 repeated tests it comes in at the same speed at the BSF version (on a PIII).
The BSF algo is stated above, the LUT algo I used is as follows:
I think the two algos produce the same output, but I haven't tested....
Mirno
This is a one time only overhead, and when averaged out over 4096 repeated tests it comes in at the same speed at the BSF version (on a PIII).
The BSF algo is stated above, the LUT algo I used is as follows:
.data
MyLUT db 32, 1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1
db 16, 1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1
.code
mov edx, lpVar
xor eax, eax
and edx, 31
mov al, BYTE PTR [offset MyLUT + edx]
ret
I think the two algos produce the same output, but I haven't tested....
Mirno
Folks,
A couple of things, the code I posted is test code, it should not be included in an app, it was aimed at testing alignment.
Rafe,
If you are dealing with a number of small strings, try using OLE string memory as it is optimally 16 BYTE aligned, access is always fast and as it is preallocate memory, its faster there as well.
Regards,
hutch@pbq.com.au
A couple of things, the code I posted is test code, it should not be included in an app, it was aimed at testing alignment.
Rafe,
If you are dealing with a number of small strings, try using OLE string memory as it is optimally 16 BYTE aligned, access is always fast and as it is preallocate memory, its faster there as well.
Regards,
hutch@pbq.com.au