Is there a fast sqrt arithm of a 32bit integer?
thanks in advance
thanks in advance
sqrt32: ; returns: eax=Square root
mov ebx,32768 ; initial guess
mov edi,ebx
mov ecx,eax
.top:
mov eax,ecx
xor edx,edx
div ebx
add eax,ebx
shr eax,1
cmp eax,ebx
je .finished
cmp eax,edi
mov edi,ebx
mov ebx,eax
jne .top
cmp eax,edi
jl .finished
mov eax,edi
.finished:
ret
MATRIX
You may also want to look at the following procedure which is part of a fixed point math library available at the very bottom of the following site page (mixlib09.zip):
http://www.movsd.com/source.htm
Another similar procedure is also available to extract the square root of 64-bit integers (albeit slower).
Raymond
http://www.movsd.com/source.htm
sqrt32 proc public uInt:DWORD
; -------------------------------------------------------------
; Returns the square root of a 32-bit unsigned integer
; The result is rounded to the closest integer.
;
; eax, ebx, ecx, edx used
;
; EXAMPLE: invoke sqrt32, 32_bit_value
;
; Returns result in eax
;
; -------------------------------------------------------------
mov edx,uInt
xor eax,eax
mov ebx,40000000h
mov ecx,16
sqr10:
add eax,ebx
cmp eax,edx
ja sqr40
sqr20:
sub edx,eax
add eax,ebx
jnz sqr50
sqr40:
sub eax,ebx
sqr50:
shr eax,1
shr ebx,2
dec ecx
jnz sqr10
cmp eax,edx
jae @F
inc eax
@@:
ret
sqrt32 endp
Another similar procedure is also available to extract the square root of 64-bit integers (albeit slower).
Raymond
Hi,
Or alternate fpu command way;
Best Regards,
Or alternate fpu command way;
.data
numb dd 0
quad dd 0
.code
fldz
push 32768
pop [numb]
fild dword ptr [numb]
fsqrt
fistp dword ptr [quad]
Best Regards,
Greetings,
This is an alternative approach based on an article which appeared
in Dr. Dobbs some time ago:
This is an alternative approach based on an article which appeared
in Dr. Dobbs some time ago:
sqroot PROC USES edi N:DWORD
mov edi, N
mov eax, 1
cmp edi, 2
jle short @F
mov ecx, 4
mov edx, 5
@1:
cmp edi, ecx
jle short @F
add ecx, edx ;edx = 2*N-1 ecx = N^2 eax = N-1
add edx, 2
inc eax
jmp short @1
@@:
jl short @2
inc eax
@2:
ret
sqroot ENDP
Greetings,
This is an alternative approach based on an article which appeared
in Dr. Dobbs some time ago:
This is an alternative approach based on an article which appeared
in Dr. Dobbs some time ago:
sqroot PROC USES edi N:DWORD
mov edi, N
mov eax, 1
cmp edi, 2
jle short @F
mov ecx, 4
mov edx, 5
@1:
cmp edi, ecx
jle short @F
add ecx, edx ;edx = 2*N-1 ecx = N^2 eax = N-1
add edx, 2
inc eax
jmp short @1
@@:
jl short @2
inc eax
@2:
ret
sqroot ENDP
Hy Poimander
nice, you say it is size optimized?
not the fastest way but we hope its working :)
have you compared the speeds yet?
what do you think what will be the difference in speed trying for example 8 divisions, or making 30000 additions?
MATRIX
Greetings >Matrix<,
The code isn't optimal and is suitable for relatively "small" integer inputs.
However it can be used as the basis for a more optimal version.
The code isn't optimal and is suitable for relatively "small" integer inputs.
However it can be used as the basis for a more optimal version.
On average, the fastest way of extracting the square root of a random 32-bit integer will always be to use the FPU's fsqrt instruction.
With the CPU, the algo I posted previously can be improved slightly, and the speed improved primarily by unrolling the loop. Doing 32 shifts and additions should be significantly faster than doing 8 slow divisions.
Raymond
With the CPU, the algo I posted previously can be improved slightly, and the speed improved primarily by unrolling the loop. Doing 32 shifts and additions should be significantly faster than doing 8 slow divisions.
Raymond
Well, my previous code submission is as slow as molasses even after
souping it up alittle. I'm not exactly an optimization expert. As an
alternative I developed a mod of Raymond's sqrt32 routine. The resulting
code varies from about 0% to 50% faster over the entire 32bit input range
depending on the input.
souping it up alittle. I'm not exactly an optimization expert. As an
alternative I developed a mod of Raymond's sqrt32 routine. The resulting
code varies from about 0% to 50% faster over the entire 32bit input range
depending on the input.
;------------------------------------------------
; Modified version of Raymond's nearest integer square root
; routine 'sqrt32'.
;
; Basic idea: Enhance computation by using the highest bit set in
; the operand to compute a suitable upper bound.
;
; Destroys ecx and edx
; Result returned in eax
;------------------------------------------------
sqrt32b proc public USES ebx uInt:DWORD
xor eax, eax
mov edx, uInt
bsr ecx, edx
and ecx, 0FFFFFFFEh
mov ebx, 1
shl ebx, cl
shr ecx, 1
inc ecx
@@:
add eax, ebx
cmp eax, edx
ja short @@1
sub edx, eax
add eax, ebx
jnz short @@2
@@1:
sub eax, ebx
@@2:
shr eax, 1
shr ebx, 2
dec ecx
jnz short @B
cmp eax, edx
jae short @F
inc eax
@@:
ret
sqrt32b endp
I remember seeing some cool square root algos on a Polish site, so I went on a hunt...
http://faqs.org.ru/progr/graph/ddesign4.htm
...found some stuff, but not the Polish site. IIRC, it had fast psuedo-root code as well -- when percision is not needed.
http://faqs.org.ru/progr/graph/ddesign4.htm
...found some stuff, but not the Polish site. IIRC, it had fast psuedo-root code as well -- when percision is not needed.
To make our initial guess at the sqrt, we could use the BSR (bit scan reverse) opcode, which is a sort of crude base-2 logarithm. Say the input is in EAX.
bsr ecx,eax ;If eax<>0, this puts a value of 0 to 31 in ecx, namely
;the index of the most significant set bit in eax.
jnz eax_was_zero
shr ecx,1 ;For the square root, divide the "logarithm" in half.
jz eax_was_one
xor ebx,ebx
stc
rcl ebx,cl ;Now ebx is our initial guess.
.586
.model flat, stdcall
option casemap:none
;================================================
.nolist
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
include \masm32\macros\macros.asm
;================================================
.listall
.code
;================================================
Sqrt32 proc uses ebx ecx edx iValue
mov ecx, 8000h
mov ebx, ecx
@@:
mov eax, ebx
mul ebx
.if eax > iValue
mov eax, ecx
not eax
and ebx, eax
.endif
shr ecx, 1
or ebx, ecx
.if ecx
jmp @B
.endif
mov eax, ebx
ret
Sqrt32 endp
;================================================
WinMain proc
local zTmp[16]:byte
mov eax, 213423905
invoke Sqrt32, eax
invoke wsprintf, addr zTmp, SADD("%i"), eax
invoke MessageBox, 0, addr zTmp, SADD("Title"), MB_OK
ret
WinMain endp
;================================================
start:
invoke WinMain
invoke ExitProcess, 0
end start
;================================================
If you don't mind some inaccuracy, Andre LaMothe in one of his huge game programming books used this short routine to extract a square root of a 32bit floating point number. It can easily be converted from and to an integer with the fild and fst instructions which might round off some of the error. According to this the square root of 144 is 12.5 instead of 12 so use with caution.
.data ?
float_variable real4 ?
.code
mov eax, float_variable
sub eax, 3f800000h
sar eax,1
add eax, 3f800000h
mov float_variable, eax
Cool snippet - wish I thought of it myself. :) It works because floating point numbers are biased and the top bit is missing.
So it converts:
1.A x 2^B
to
x 2^INT(B/2)
144 = 10010000y
144 = 1.001y x 2^7
...and the routine generates:
1.1001y x 2^3
Where does the extra bit come from? Well, if the exponent is odd then the routine adds 0.5 to the mantissa.
So it converts:
1.A x 2^B
to
x 2^INT(B/2)
144 = 10010000y
144 = 1.001y x 2^7
...and the routine generates:
1.1001y x 2^3
Where does the extra bit come from? Well, if the exponent is odd then the routine adds 0.5 to the mantissa.
I just whipped it up...
I don't know much about IC design, but I think this algorithm could have a space saving over the implementation defined in the following document:
http://venus.elfak.ni.ac.yu/projects/IMPEG/DigitalSystem.pdf
Sqrt PROC uint32:DWORD
xor ecx, ecx
xor eax, eax
bsr edx, uint32
and edx, -2 ; even
bts ecx, edx
s0: add eax, ecx
sub uint32, eax
jnc s1
add uint32, eax
sub eax, ecx
jmp s2
s1: add eax, ecx
s2: shr eax, 1
shr ecx, 2
jne s0
ret
Sqrt ENDP
This in the itterative method that one might use by hand with pencil and paper; but in binary and optimized. The algorithm is undefined at zero.
I don't know much about IC design, but I think this algorithm could have a space saving over the implementation defined in the following document:
http://venus.elfak.ni.ac.yu/projects/IMPEG/DigitalSystem.pdf