My question is simple, how is it possible to test whether a number is perfect square? I just need an algorithm which is fast, so I am wondering if anyone knows what is the fastest method. I have seen some examples for example gliptic which is like

Also can someone explain what Hel is trying to say in this post? http://www.asmcommunity.net/board/index.php?topic=3276&perpage=15&pagenumber=2

I do not really understand him.

Also regarding square root algorithm, what's the Wolfram's literation? The portion in mathsworld on it is under-construction. http://mathworld.wolfram.com/WolframsIteration.html Also what is the best/fast method for squareroot? The data I am going to do the square root on is pretty huge, well over 128bit.

```
```

xor eax,eax

xor ecx,ecx

inc ecx

_loop:

inc ecx

inc ecx

inc eax

sub edx,ecx

ja _loop

jz _perfectsquare

Also can someone explain what Hel is trying to say in this post? http://www.asmcommunity.net/board/index.php?topic=3276&perpage=15&pagenumber=2

I do not really understand him.

Also regarding square root algorithm, what's the Wolfram's literation? The portion in mathsworld on it is under-construction. http://mathworld.wolfram.com/WolframsIteration.html Also what is the best/fast method for squareroot? The data I am going to do the square root on is pretty huge, well over 128bit.

roticv,

What is the above code supposed to do? Have you REALLY looked at it? Below is a simplified equivalent version.

Seems to me like a clunky way to determine if the value in EDX is odd or even.

Anyway there are may algos and implementations, just like prime number finders. As in decimal format, there is nothing in binary that is quick and simple for square roots. If you get involved in very large numbers, you are going to have to become familiar with multiword arithmetic. Hel is just explaining the classical method of finding square roots. That involves trying and "guessing" what is the largest square contained within the number, subtracting that number, determining the fractional increment to add to the previous determination, subtracting that and continuing on. You have a perfect square when you need add no increment to the last guess. Ratch

```
```

xor eax,eax

xor ecx,ecx

inc ecx

_loop:

inc ecx

inc ecx

inc eax

sub edx,ecx ; WHAT is EDX here? WHEN did it get loaded?

ja _loop

jz _perfectsquare

What is the above code supposed to do? Have you REALLY looked at it? Below is a simplified equivalent version.

```
```

dec edx

_loop:

sub edx,2

ja _loop

jz _perfectsquare ; It should say jz _Odd

Seems to me like a clunky way to determine if the value in EDX is odd or even.

Anyway there are may algos and implementations, just like prime number finders. As in decimal format, there is nothing in binary that is quick and simple for square roots. If you get involved in very large numbers, you are going to have to become familiar with multiword arithmetic. Hel is just explaining the classical method of finding square roots. That involves trying and "guessing" what is the largest square contained within the number, subtracting that number, determining the fractional increment to add to the previous determination, subtracting that and continuing on. You have a perfect square when you need add no increment to the last guess. Ratch

Seems like the lack of comments to the code makes you not understand the code. The code is based on odd-number theorem, ie perfect square is the sum of odd numbers, eg 4 = 1+3 , 9 = 1+3+5, 16 = 1+3+5+7. edx contains the number to test whether it is perfect square or not.

Also your converted version is wrong.

Also your converted version is wrong.

Ratch: If you can't understand such a simple and elementary code sequence, then I suggest you find something else to do.

roticv: You can calculate a square root in linear time. Here's a way to do it:

This is almost directly translated from some 6502 assembly code I wrote some time ago. You can probably optimize it quite a bit.

roticv: You can calculate a square root in linear time. Here's a way to do it:

```
; At the beginning, EAX contains the number for which we want to calculate the square root
```

push NumberOfBits/2

pop ecx

xor esi,esi

xor edx,edx

theloop:

add edx,edx

add eax,eax

adc esi,esi

add eax,eax

cmc

mov edi,esi

pushfd

sbb esi,edx

jb sq0

inc edx

popfd

cmc

jmp short sq1

sq0:

mov esi,edi

popfd

sq1:

adc esi,esi

loop theloop

ret

; At the end, EDX contains the result and ESI contains the remainder.

This is almost directly translated from some 6502 assembly code I wrote some time ago. You can probably optimize it quite a bit.

roticv,

You are correct. My alternate code is not correct. I forgot to take into account the cummulative addition of ECX. Now that you explained the theory behind the code, I understand what it is trying to do. I don't think it is correct, however. See my post to Sephiroth3. Another implementation of square root can be found here http://www.df.lth.se/~john_e/gems/gem0034.html . It is rather lengthy, but it claims to be fast. RWA

You are correct. My alternate code is not correct. I forgot to take into account the cummulative addition of ECX. Now that you explained the theory behind the code, I understand what it is trying to do. I don't think it is correct, however. See my post to Sephiroth3. Another implementation of square root can be found here http://www.df.lth.se/~john_e/gems/gem0034.html . It is rather lengthy, but it claims to be fast. RWA

Sephiroth3,

You are right. I cannot understand how that simple code sequence can work.

Specifically, I cannot understand why EAX is being manipulated. Also, when I assume EDX is 4, it does not jump out to _perfectsquare. Since you understand it, could you enlighted me? Perhaps ECX should be incremented after SUB EDX,ECX? Ratch

You are right. I cannot understand how that simple code sequence can work.

```
xor eax,eax
```

xor ecx,ecx

inc ecx

_loop:

inc ecx

inc ecx

inc eax

sub edx,ecx

ja _loop

jz _perfectsquare

Specifically, I cannot understand why EAX is being manipulated. Also, when I assume EDX is 4, it does not jump out to _perfectsquare. Since you understand it, could you enlighted me? Perhaps ECX should be incremented after SUB EDX,ECX? Ratch

Ah... damn... you're right. The very first INC should be a DEC. But that's still far from testing whether a number is odd or even :P

Sephiroth3,

Yes, DECing ECX first will work too. As will doing the ECX incrementation after the subtraction of EDX. As I explained to roticv, I overlooked the cumulative effect of adding ECX by 2, so I thought it was just subtracting 2 from every loop iteration. Ratch

Yes, DECing ECX first will work too. As will doing the ECX incrementation after the subtraction of EDX. As I explained to roticv, I overlooked the cumulative effect of adding ECX by 2, so I thought it was just subtracting 2 from every loop iteration. Ratch

The algo modified as follows would work (but still very slow).

```
xor eax,eax
```

xor ecx,ecx

inc ecx

_loop:

add eax,ecx

inc ecx

inc ecx

cmp edx,eax

ja _loop

jz _perfectsquare

Raymondwell as you all have said before that add is faster then inc

xor eax,eax

xor ecx,ecx

inc ecx

_loop:

add eax,ecx

add ecx,1h

add ecx,1h

cmp edx,eax

ja _loop

jz _perfectsquare

what i dont understand is why add ecx,2 wont work unless there has something to do with the inc and how it changes the EFLAGS

xor eax,eax

xor ecx,ecx

inc ecx

_loop:

add eax,ecx

add ecx,1h

add ecx,1h

cmp edx,eax

ja _loop

jz _perfectsquare

what i dont understand is why add ecx,2 wont work unless there has something to do with the inc and how it changes the EFLAGS

Nay, I think that method is pretty slow, since the number which I want to test whether is prefect number would be about 256bit in size. So it would differ

Also can anyone care to explain Hel's method of square root? Hmm perhaps bitrake, since he seemed to understand it.

Devilclaw, I just used the original code by glistic. Probably he optimised it for size.

Ratch, I will look into the table lookup method, but I was wondering if anyone know of Wolfram's literation. Sounds like it is a fast method for square root, though I do not have the specifications of it (ie the algorithm).

Also can anyone care to explain Hel's method of square root? Hmm perhaps bitrake, since he seemed to understand it.

Devilclaw, I just used the original code by glistic. Probably he optimised it for size.

Ratch, I will look into the table lookup method, but I was wondering if anyone know of Wolfram's literation. Sounds like it is a fast method for square root, though I do not have the specifications of it (ie the algorithm).

Unfortunately, the diagram was very important in understanding the algorithm posted above - I'll have to look over the text and create some code, but until that time I think this is another way of stating the method: http://www4.wittenberg.edu/academics/mathcomp/bjsdir/ENIACSquareRoot.htm (down at the bottom). Convert the algorithm to binary and it should be linear with regaurd to the bit length of the source number. GMP uses another method, so maybe there is something I'm missing? x86 has powerful bit string operations whereas other processors usually don't - that could explain the choice of algorithm used in GMP.

Here is a way to speed up the loop for small numbers:

Here is a way to speed up the loop for small numbers:

```
; test if EAX is a perfect square
```

xor edx, edx

bsr ecx, eax ; get top bit index

btr eax, ecx ; clear top bit

shr ecx, 1 ; half of magnitude

bts edx, ecx ; set in result

lea ecx, [edx*2 + 1] ; next odd number

test eax, eax

je Square

_s1: sub eax, ecx

lea ecx, [ecx+2]

ja _s1

je Square

; not a perfect square

Square:

Hi,

here are some thoughts to improve the performance for (very)

large numbers to be tested ..

First, in hexadecimal a perfect sqare ends only with the nibble 0, 1, 4, 9

therefore if the last nibble is not one of these its not a perfect square.

Second, since there is a repetitive pattern of 0, 1, 4, 9, 0, 9, 4, 1 for the last nibble

you may choose a different increment in your loop reducing the overall

count of loops by 75 percent.

example: if the number to be tested is 123456789h the loop may only test

the values 3, 5, 11, 13, 19, 21 etc. in decimal until reaching the number

and for 11111114h the loop should use the values 2, 6, 10, 14, 18, 22 etc

in decimal.

It's up to you to proof me wrong and implement it in asm :)

BTW, if the numbers to be tested are not too large

a byte or even bit lookup table may be a good choice.

Bye Miracle

here are some thoughts to improve the performance for (very)

large numbers to be tested ..

First, in hexadecimal a perfect sqare ends only with the nibble 0, 1, 4, 9

therefore if the last nibble is not one of these its not a perfect square.

Second, since there is a repetitive pattern of 0, 1, 4, 9, 0, 9, 4, 1 for the last nibble

you may choose a different increment in your loop reducing the overall

count of loops by 75 percent.

example: if the number to be tested is 123456789h the loop may only test

the values 3, 5, 11, 13, 19, 21 etc. in decimal until reaching the number

and for 11111114h the loop should use the values 2, 6, 10, 14, 18, 22 etc

in decimal.

It's up to you to proof me wrong and implement it in asm :)

BTW, if the numbers to be tested are not too large

a byte or even bit lookup table may be a good choice.

Bye Miracle

Hey miracle,

Thanks for the hint. I am sure that would speed up my program. So you are saying that, the last 4bits will always be 0, 1, 4 or 9?

Thanks for the hint. I am sure that would speed up my program. So you are saying that, the last 4bits will always be 0, 1, 4 or 9?

Hmmm ..

I have simply done a test from 0 .. 20 and I'm pretty sure but

there is guaranty.

What about a lookup-table if the values are not that large and

assuming that there are some Megabytes of RAM to be wasted ;-)

Bye Miracle

I have simply done a test from 0 .. 20 and I'm pretty sure but

there is guaranty.

What about a lookup-table if the values are not that large and

assuming that there are some Megabytes of RAM to be wasted ;-)

Bye Miracle

It's up to you to proof me wrong and implement it in asm :)

0149 0941 0149 0941 square low nibble

Why prove you wrong? ;)

Generally speaking, taking into account the lower M bits will reduce the tests by (M-1/M) percent. ;)

Mulitplication can be looked at as just a shift and add algorithm: 110 x 101 means to shift 110, so the lower bit is in the position of the one bits in 101; and add. It can be easily seen that only the lower bits of both numbers effect the low bit of the result, and only the lower nibble of both numbers effects the lower nibble of the result, and only the lower byte of both numbers...

Any reasons why this is so? I am really puzzled.

Yeap it reduces the numbers to test from n to n/4 (75% as you claimed, agreed) :alright:

Yeap it reduces the numbers to test from n to n/4 (75% as you claimed, agreed) :alright:

Any reasons why this is so? I am really puzzled.

```
```

11001001

x 11001001

----------------

11001001

11001001

11001001

11001001

----------------

1001110111010001

Look at this carefully. The least significant bit being "1" tells you that the least significant bit of the result __must__be one. This is a direct result of the binary mulitplication operation.

X = 2n + m

X^2 = 4n^2 + 4nm + m^2 = 4 n + m

(m) is a bit -- equals one or zero.

(n) is some number.

Do you see how the least significant bit of X^2 is only dependant on (m)?

I could have just said: ODD*ODD = ODD and EVEN*EVEN = EVEN

...but this same method can be used to show how the first four bits effect the result...

X = 16n + m

X^2 = 32n[8n + m] + m^2

(m) is four bits (nibble)

X^2 = * n * 100000y + m^2

X^2 = *n SHL 5 + m^2

(n) is shifted out of the way! (m) is the only thing effecting the first five bits.

Oh, look I rambled on about nothing because we are wanting to go the other way from X^2 to X. :o

As with binary division, the classic method of figuring square roots is simpler in binary than in decimal.

For a 32-bit unsigned number, it would take at most 16 iterations to find the (truncated) square root.

For a 32-bit unsigned number, it would take at most 16 iterations to find the (truncated) square root.

```
Let's take decimal 26 or 1Ah.
```

Split the "dividend" into bit pairs:

01 10 10

In decimal, we would need to figure what the largest square is

for the first pair. In binary the first significant pair always yields

a 1. (Four doesn't fit in two bits.) That is the first of three root

bits (one for each pair).

We square it and then subtract it.

/ 01 10 10 = 1 x x

01

--

00 10

The iterative steps include doubling the partial root, and adding

a digit to the end. This new number is multiplied by the added

digit to create the number we want to subtract. In decimal we

may need to test more than one digit to determine the largest

one that still leaves a positive remainder. The added digit is the

next root digit. In binary, doubling is adding a 0, and after that,

we can only add 0 or 1. Because we multiply only by 0 or 1, we

only subtract 0 or the new built-up number.

In the following text, we show the construction of the test

number as (partial root, 0 for doubling, 1 for appended digit).

The next test is to see if 101 (1,0,1) can be subtracted from the

partial dividend. Nope, so the next root bit is 0.

/ 01 10 10 = 1 0 x

01

--

00 10

00 00

-- --

10 01

The next test is to see if 1001 (10,0,1) can be subtracted. Yes,

so the next root bit is 1.

/ 01 10 10 = 1 0 1

01

--

00 10

00 00

-----

10 10

10 01

-----

01

It's fairly easy to see from the last step that if we had started

with the perfect square 25, the remainder would be zero.

```
; EBX is number to find root of
```

bsr ecx, ebx

and ecx, -2 ; bit pairs

ror ebx, cl

; setup registers

; first bit (can't do zero)

mov edx, 1

; most significant bit pair containing a set bit

mov eax, ebx

and eax, 11y

sub eax, edx

shr ecx, 1

imul ecx, ecx, -25

lea ecx, [JumpTabs + ecx]

jmp ecx

i=0

WHILE i LT 15

shld eax, ebx, 2

rol ebx, 2

lea ecx, [edx*4 + 1]

cmp eax, ecx

cmc

sbb esi, esi

adc edx, edx

and ecx, esi

sub eax, ecx

i=i+1

ENDM

JumpTabs:

; EDX = is root

je NotPerfectSquare

Of course, this can be made faster - this is just a quick code up of **tenkey**'s good explaination.

http://www.azillionmonkeys.com/qed/sqroot.html - another source