dwWidthPixels = 1 << (int)floor(( log(dwWidthPixels)/log(2.0f) ) + 0.5f);

How do I perform those logarithm functions with FPU code, and what the heck is FLOOR?
Posted on 2005-07-13 03:52:27 by Homer
Don't know how to write the algo for FPU, but floor function is used to calculate largest integer which is less or equal to the given value.
Posted on 2005-07-13 04:30:52 by arafel
http://www.google.com/search?q=floor+function+site:msdn.microsoft.com :) - fourth hit.
Posted on 2005-07-13 04:39:46 by f0dder
dwWidthPixels = 1 << (int)floor(( log(dwWidthPixels)/log(2.0f) ) + 0.5f);
first executed part is
in other words, get log2 (dwWidth).
"floor" rounds down its argument ( 4.1, 4.6 and 4.9 become 4)
but we have a +0.5f, which actually turns the whole stuff to round to the nearest integer.
(4.1 -> 4;  4.6 -> 5; 4.9 -> 5)

Then we do a shl of 1 by that floored number.
Thus this line of code returns the nearest 2^n number to dwWidthPixels
500-> 512
560-> 512
Posted on 2005-07-13 07:49:16 by Ultrano
One maybe unoptimized asm translation is:

mov eax,dwWidthPixels
bsr ecx,eax
mov edx,eax
shl edx,1
shr edx,cl
and edx,1
add ecx,edx
mov eax,1
shl eax,cl
mov dwWidthPixels,eax

In any case, it's 200 times faster than the C code :)
Posted on 2005-07-13 08:07:51 by Ultrano

In any case, it's 200 times faster than the C code :)

The C code is a pretty na?ve implemtation anyway - involving FPU and everything, yuck :)

Btw, why are you finding the nearest 2-power? If it's for surface allocation, there might be other considerations like the pitch required by the hardware...
Posted on 2005-07-13 08:12:25 by f0dder
Or this little tricky bugger:

mov eax,dwWidthPixels
bsr ecx,eax
shl eax,1
shr eax,cl
and al,1
inc al
shl eax,cl
mov dwWidthPixels,eax
Posted on 2005-07-13 08:13:09 by Ultrano
OK, according to the Comments, the algo is supposed to return a power of two which is below or equal to the input value.

That being the case, why round UP via 0.5f , then DOWN via Floor?

mov eax,dwInputValue
bsr ecx,eax ;ecx <-- highest SET bit from eax
mov eax,1
shl eax,cl ;eax = 1 << ecx

Thanks guys :)

f0dder - it's for an opengl TextureLoader class (yes, another one, hopefully my last !!)
Having ascertained the maximum texture size allowable on the local hardware, if the desired size is too large I clamp to that value, or if smaller , clamp to the next lowest power of 2...

Fixed the tag problems - roticv
Posted on 2005-07-13 08:45:09 by Homer
Lousy C coder? I can sure write better C codes than that.
Posted on 2005-07-13 09:31:25 by roticv
:shock: ...it's the worst C code I've ever seen...  hey, maybe that was the goal?  :lol:
Posted on 2005-07-13 09:57:13 by ti_mo_n
Well, the code is for games...
Some good gamedeveloper (initially was an asm coder) said that most good gamedevs are rarely good programmers - and that he's frustrated how the heck their code actually works in the end :)

Homer, I also smelled something fishy about that algo, but shouldn't it get the larger width (513->1024), instead of losing texture data?
Posted on 2005-07-13 11:06:10 by Ultrano

Yes, you are right, it SHOULD.
That must be what the increment was for...
Posted on 2005-07-13 12:35:45 by Homer
The BSR opcode (even on a p4) is pretty slow.
You could make your own binary searchish version to shave a few clock cycles.

Since BSR has a 16bit version (which is faster) I use it instead of the 32bit version.

align 16
BSRx86: ;slow
        mov eax,
        BSR eax,eax
ret 4

align 16
BSRopt: ;faster
        mov eax,
        test eax,0FFFF0000h
        jz .LowHalf
        shr eax,16
        BSR ax,ax
        add ax,15
        ret 4
        BSR ax,ax
        ret 4

It's not optimized as much as it could be, but I just wanted to show that a little TWEAK like that improves performance on a p4 3.2ghz by about 10-15%.
Posted on 2005-07-13 14:09:33 by r22
AthlonXP 2000+
bsr 32-bit - 8 to 9 cycles
bsr 16-bit - 8 to 9 cycles

Where source has any 11-bit value (for bitmap widths 1..1024). Are 9 cycles too much?!
Exactly how "slow" is the bsr on P4??

Using jumps can only slow down execution in this case.
Posted on 2005-07-13 14:55:12 by Ultrano
You are right the impact is negligible.

WITH a test number who's 23rd bit is set the function RUNs 10-15% faster :D
BUT with random numbers meh

I initially wanted to implement the old fpu speed up for bsr using SSE.

;convert 64bit int to real, get exponent, manipulate
MOV eax,
MOV , eax
MOV ,0
SHR eax,20
SUB eax,3ffh

The above code runs about 4x SLOWER than BSR on a p4.

My attempt at an SSE solution

MOV eax,
CVTSI2SD xmm0, eax
PEXTRW eax,xmm0,3  ; get exponenet
SHR ax,4                    ; finish getting exponent
ADD ax,0fc01h            ; add the magic number

After testing, the above code runs well
BUT it fails if bit 32 is set (because of stupid CVTSI2SD) there's NO unsigned int cvt.

Oh well
the BSR opcode has beaten me :\
Posted on 2005-07-13 17:52:25 by r22
Well I'd just like to point out that as far as this algo pertains to the purpose I described that I intended to use it for, I would NEVER see bit 31 set. It would be quite OK to mask it out and use your SSE code.

Remember, I'm interested in powers of 2 which describe texture sizes as supported in hardware (by your graphic card).


Since I don't execute the code enough to warrant intensive optimizing, I won't bother using the SSE in the TextureLoader, but I MIGHT use it elsewhere at some stage, so thanks :)
Posted on 2005-07-13 18:36:46 by Homer

AthlonXP 2000+
bsr 32-bit - 8 to 9 cycles
bsr 16-bit - 8 to 9 cycles

Ultrano: what are you using to give you the clock cycle count of the code (Vtune?) ?
Posted on 2005-07-18 01:09:51 by grv575
I test the code in several different ways, the basics being RDTSC and GetTickCount.
The GetTickCount is for rough approximation, then I replace it with RDTSC. At some time I procomment the measured code, to get how much cicles the raw loop takes. I make the measured code into a macro:

testit macro
bsr ecx,eax

This macro is placed in a loop. Calling rdtsc is done before and after the loop.
First, I measure the time 1 such block takes, then 8, then 32 and finally - 64 or even 128.
Also, I measure what the dependency stalls are, and whether we could execute more opcodes in "parallel" to "testit".
Posted on 2005-07-18 07:29:42 by Ultrano