Is there a fast sqrt arithm of a 32bit integer?

thanks in advance

thanks in advance

```
```

sqrt32: ; returns: eax=Square root

mov ebx,32768 ; initial guess

mov edi,ebx

mov ecx,eax

.top:

mov eax,ecx

xor edx,edx

div ebx

add eax,ebx

shr eax,1

cmp eax,ebx

je .finished

cmp eax,edi

mov edi,ebx

mov ebx,eax

jne .top

cmp eax,edi

jl .finished

mov eax,edi

.finished:

ret

MATRIX

You may also want to look at the following procedure which is part of a fixed point math library available at the very bottom of the following site page (mixlib09.zip):

http://www.movsd.com/source.htm

Another similar procedure is also available to extract the square root of 64-bit integers (albeit slower).

Raymond

http://www.movsd.com/source.htm

```
sqrt32 proc public uInt:DWORD
```

; -------------------------------------------------------------

; Returns the square root of a 32-bit unsigned integer

; The result is rounded to the closest integer.

;

; eax, ebx, ecx, edx used

;

; EXAMPLE: invoke sqrt32, 32_bit_value

;

; Returns result in eax

;

; -------------------------------------------------------------

mov edx,uInt

xor eax,eax

mov ebx,40000000h

mov ecx,16

sqr10:

add eax,ebx

cmp eax,edx

ja sqr40

sqr20:

sub edx,eax

add eax,ebx

jnz sqr50

sqr40:

sub eax,ebx

sqr50:

shr eax,1

shr ebx,2

dec ecx

jnz sqr10

cmp eax,edx

jae @F

inc eax

@@:

ret

sqrt32 endp

Another similar procedure is also available to extract the square root of 64-bit integers (albeit slower).

Raymond

Hi,

Or alternate fpu command way;

Best Regards,

Or alternate fpu command way;

```
```

.data

numb dd 0

quad dd 0

.code

fldz

push 32768

pop [numb]

fild dword ptr [numb]

fsqrt

fistp dword ptr [quad]

Best Regards,

Greetings,

This is an alternative approach based on an article which appeared

in Dr. Dobbs some time ago:

This is an alternative approach based on an article which appeared

in Dr. Dobbs some time ago:

```
```

sqroot PROC USES edi N:DWORD

mov edi, N

mov eax, 1

cmp edi, 2

jle short @F

mov ecx, 4

mov edx, 5

@1:

cmp edi, ecx

jle short @F

add ecx, edx ;edx = 2*N-1 ecx = N^2 eax = N-1

add edx, 2

inc eax

jmp short @1

@@:

jl short @2

inc eax

@2:

ret

sqroot ENDP

Greetings,

This is an alternative approach based on an article which appeared

in Dr. Dobbs some time ago:

This is an alternative approach based on an article which appeared

in Dr. Dobbs some time ago:

```
```

sqroot PROC USES edi N:DWORD

mov edi, N

mov eax, 1

cmp edi, 2

jle short @F

mov ecx, 4

mov edx, 5

@1:

cmp edi, ecx

jle short @F

add ecx, edx ;edx = 2*N-1 ecx = N^2 eax = N-1

add edx, 2

inc eax

jmp short @1

@@:

jl short @2

inc eax

@2:

ret

sqroot ENDP

Hy Poimander

nice, you say it is size optimized?

not the fastest way but we hope its working :)

have you compared the speeds yet?

what do you think what will be the difference in speed trying for example 8 divisions, or making 30000 additions?

MATRIX

Greetings >Matrix<,

The code isn't optimal and is suitable for relatively "small" integer inputs.

However it can be used as the basis for a more optimal version.

The code isn't optimal and is suitable for relatively "small" integer inputs.

However it can be used as the basis for a more optimal version.

On average, the fastest way of extracting the square root of a random 32-bit integer will always be to use the FPU's

With the CPU, the algo I posted previously can be improved slightly, and the speed improved primarily by unrolling the loop. Doing 32 shifts and additions should be significantly faster than doing 8 slow divisions.

Raymond

**fsqrt**instruction.With the CPU, the algo I posted previously can be improved slightly, and the speed improved primarily by unrolling the loop. Doing 32 shifts and additions should be significantly faster than doing 8 slow divisions.

Raymond

Well, my previous code submission is as slow as molasses even after

souping it up alittle. I'm not exactly an optimization expert. As an

alternative I developed a mod of Raymond's sqrt32 routine. The resulting

code varies from about 0% to 50% faster over the entire 32bit input range

depending on the input.

souping it up alittle. I'm not exactly an optimization expert. As an

alternative I developed a mod of Raymond's sqrt32 routine. The resulting

code varies from about 0% to 50% faster over the entire 32bit input range

depending on the input.

```
```

;------------------------------------------------

; Modified version of Raymond's nearest integer square root

; routine 'sqrt32'.

;

; Basic idea: Enhance computation by using the highest bit set in

; the operand to compute a suitable upper bound.

;

; Destroys ecx and edx

; Result returned in eax

;------------------------------------------------

sqrt32b proc public USES ebx uInt:DWORD

xor eax, eax

mov edx, uInt

bsr ecx, edx

and ecx, 0FFFFFFFEh

mov ebx, 1

shl ebx, cl

shr ecx, 1

inc ecx

@@:

add eax, ebx

cmp eax, edx

ja short @@1

sub edx, eax

add eax, ebx

jnz short @@2

@@1:

sub eax, ebx

@@2:

shr eax, 1

shr ebx, 2

dec ecx

jnz short @B

cmp eax, edx

jae short @F

inc eax

@@:

ret

sqrt32b endp

I remember seeing some cool square root algos on a Polish site, so I went on a hunt...

http://faqs.org.ru/progr/graph/ddesign4.htm

...found some stuff, but not the Polish site. IIRC, it had fast psuedo-root code as well -- when percision is not needed.

http://faqs.org.ru/progr/graph/ddesign4.htm

...found some stuff, but not the Polish site. IIRC, it had fast psuedo-root code as well -- when percision is not needed.

To make our initial guess at the sqrt, we could use the BSR (bit scan reverse) opcode, which is a sort of crude base-2 logarithm. Say the input is in EAX.

```
```

bsr ecx,eax ;If eax<>0, this puts a value of 0 to 31 in ecx, namely

;the index of the most significant set bit in eax.

jnz eax_was_zero

shr ecx,1 ;For the square root, divide the "logarithm" in half.

jz eax_was_one

xor ebx,ebx

stc

rcl ebx,cl ;Now ebx is our initial guess.

```
.586
```

.model flat, stdcall

option casemap:none

;================================================

.nolist

include \masm32\include\windows.inc

include \masm32\include\kernel32.inc

include \masm32\include\user32.inc

includelib \masm32\lib\kernel32.lib

includelib \masm32\lib\user32.lib

include \masm32\macros\macros.asm

;================================================

.listall

.code

;================================================

Sqrt32 proc uses ebx ecx edx iValue

mov ecx, 8000h

mov ebx, ecx

@@:

mov eax, ebx

mul ebx

.if eax > iValue

mov eax, ecx

not eax

and ebx, eax

.endif

shr ecx, 1

or ebx, ecx

.if ecx

jmp @B

.endif

mov eax, ebx

ret

Sqrt32 endp

;================================================

WinMain proc

local zTmp[16]:byte

mov eax, 213423905

invoke Sqrt32, eax

invoke wsprintf, addr zTmp, SADD("%i"), eax

invoke MessageBox, 0, addr zTmp, SADD("Title"), MB_OK

ret

WinMain endp

;================================================

start:

invoke WinMain

invoke ExitProcess, 0

end start

;================================================

If you don't mind some inaccuracy, Andre LaMothe in one of his huge game programming books used this short routine to extract a square root of a 32bit floating point number. It can easily be converted from and to an integer with the fild and fst instructions which might round off some of the error. According to this the square root of 144 is 12.5 instead of 12 so use with caution.

```
```

.data ?

float_variable real4 ?

.code

mov eax, float_variable

sub eax, 3f800000h

sar eax,1

add eax, 3f800000h

mov float_variable, eax

Cool snippet - wish I thought of it myself. :) It works because floating point numbers are biased and the top bit is missing.

So it converts:

1.A x 2^B

to

x 2^INT(B/2)

144 = 10010000y

144 = 1.001y x 2^7

...and the routine generates:

1.1001y x 2^3

Where does the extra bit come from? Well, if the exponent is odd then the routine adds 0.5 to the mantissa.

So it converts:

1.A x 2^B

to

x 2^INT(B/2)

144 = 10010000y

144 = 1.001y x 2^7

...and the routine generates:

1.1001y x 2^3

Where does the extra bit come from? Well, if the exponent is odd then the routine adds 0.5 to the mantissa.

I just whipped it up...

I don't know much about IC design, but I think this algorithm could have a space saving over the implementation defined in the following document:

http://venus.elfak.ni.ac.yu/projects/IMPEG/DigitalSystem.pdf

```
Sqrt PROC uint32:DWORD
```

xor ecx, ecx

xor eax, eax

bsr edx, uint32

and edx, -2 ; even

bts ecx, edx

s0: add eax, ecx

sub uint32, eax

jnc s1

add uint32, eax

sub eax, ecx

jmp s2

s1: add eax, ecx

s2: shr eax, 1

shr ecx, 2

jne s0

ret

Sqrt ENDP

This in the itterative method that one might use by hand with pencil and paper; but in binary and optimized. The algorithm is undefined at zero.
I don't know much about IC design, but I think this algorithm could have a space saving over the implementation defined in the following document:

http://venus.elfak.ni.ac.yu/projects/IMPEG/DigitalSystem.pdf