My question is simple, how is it possible to test whether a number is perfect square? I just need an algorithm which is fast, so I am wondering if anyone knows what is the fastest method. I have seen some examples for example gliptic which is like
``````
xor eax,eax
xor ecx,ecx
inc ecx
_loop:
inc ecx
inc ecx
inc eax
sub edx,ecx
ja _loop
jz _perfectsquare
``````

Also can someone explain what Hel is trying to say in this post? http://www.asmcommunity.net/board/index.php?topic=3276&perpage=15&pagenumber=2

I do not really understand him.

Also regarding square root algorithm, what's the Wolfram's literation? The portion in mathsworld on it is under-construction. http://mathworld.wolfram.com/WolframsIteration.html Also what is the best/fast method for squareroot? The data I am going to do the square root on is pretty huge, well over 128bit.
Posted on 2003-10-21 07:39:51 by roticv
roticv,

``````
xor eax,eax
xor ecx,ecx
inc ecx
_loop:
inc ecx
inc ecx
inc eax
sub edx,ecx              ; WHAT is EDX here?  WHEN did it get loaded?
ja _loop
jz _perfectsquare
``````

What is the above code supposed to do? Have you REALLY looked at it? Below is a simplified equivalent version.
``````
dec edx
_loop:
sub edx,2
ja _loop
jz _perfectsquare    ; It should say jz _Odd
``````

Seems to me like a clunky way to determine if the value in EDX is odd or even.

Anyway there are may algos and implementations, just like prime number finders. As in decimal format, there is nothing in binary that is quick and simple for square roots. If you get involved in very large numbers, you are going to have to become familiar with multiword arithmetic. Hel is just explaining the classical method of finding square roots. That involves trying and "guessing" what is the largest square contained within the number, subtracting that number, determining the fractional increment to add to the previous determination, subtracting that and continuing on. You have a perfect square when you need add no increment to the last guess. Ratch
Posted on 2003-10-21 11:42:37 by Ratch
Seems like the lack of comments to the code makes you not understand the code. The code is based on odd-number theorem, ie perfect square is the sum of odd numbers, eg 4 = 1+3 , 9 = 1+3+5, 16 = 1+3+5+7. edx contains the number to test whether it is perfect square or not.

Also your converted version is wrong.
Posted on 2003-10-21 11:54:34 by roticv
Ratch: If you can't understand such a simple and elementary code sequence, then I suggest you find something else to do.

roticv: You can calculate a square root in linear time. Here's a way to do it:
``````; At the beginning, EAX contains the number for which we want to calculate the square root
push NumberOfBits/2
pop ecx
xor esi,esi
xor edx,edx
theloop:
cmc
mov edi,esi
pushfd
sbb esi,edx
jb sq0
inc edx
popfd
cmc
jmp short sq1
sq0:
mov esi,edi
popfd
sq1:
loop theloop
ret
; At the end, EDX contains the result and ESI contains the remainder.``````

This is almost directly translated from some 6502 assembly code I wrote some time ago. You can probably optimize it quite a bit.
Posted on 2003-10-21 12:23:03 by Sephiroth3
roticv,
You are correct. My alternate code is not correct. I forgot to take into account the cummulative addition of ECX. Now that you explained the theory behind the code, I understand what it is trying to do. I don't think it is correct, however. See my post to Sephiroth3. Another implementation of square root can be found here http://www.df.lth.se/~john_e/gems/gem0034.html . It is rather lengthy, but it claims to be fast. RWA
Posted on 2003-10-21 13:28:34 by Ratch
Sephiroth3,
You are right. I cannot understand how that simple code sequence can work.
``````xor eax,eax
xor ecx,ecx
inc ecx
_loop:
inc ecx
inc ecx
inc eax
sub edx,ecx
ja _loop
jz _perfectsquare
``````

Specifically, I cannot understand why EAX is being manipulated. Also, when I assume EDX is 4, it does not jump out to _perfectsquare. Since you understand it, could you enlighted me? Perhaps ECX should be incremented after SUB EDX,ECX? Ratch
Posted on 2003-10-21 13:38:53 by Ratch
Ah... damn... you're right. The very first INC should be a DEC. But that's still far from testing whether a number is odd or even :P
Posted on 2003-10-21 14:04:11 by Sephiroth3
Sephiroth3,
Yes, DECing ECX first will work too. As will doing the ECX incrementation after the subtraction of EDX. As I explained to roticv, I overlooked the cumulative effect of adding ECX by 2, so I thought it was just subtracting 2 from every loop iteration. Ratch
Posted on 2003-10-21 15:43:42 by Ratch
The algo modified as follows would work (but still very slow).
``````xor eax,eax
xor ecx,ecx
inc ecx
_loop:
inc ecx
inc ecx
cmp edx,eax
ja _loop
jz _perfectsquare``````
Raymond
Posted on 2003-10-21 22:10:17 by Raymond
well as you all have said before that add is faster then inc

xor eax,eax
xor ecx,ecx
inc ecx
_loop:
cmp edx,eax
ja _loop
jz _perfectsquare

what i dont understand is why add ecx,2 wont work unless there has something to do with the inc and how it changes the EFLAGS
Posted on 2003-10-21 22:26:16 by devilsclaw
Nay, I think that method is pretty slow, since the number which I want to test whether is prefect number would be about 256bit in size. So it would differ

Also can anyone care to explain Hel's method of square root? Hmm perhaps bitrake, since he seemed to understand it.

Devilclaw, I just used the original code by glistic. Probably he optimised it for size.

Ratch, I will look into the table lookup method, but I was wondering if anyone know of Wolfram's literation. Sounds like it is a fast method for square root, though I do not have the specifications of it (ie the algorithm).
Posted on 2003-10-21 23:06:17 by roticv
Unfortunately, the diagram was very important in understanding the algorithm posted above - I'll have to look over the text and create some code, but until that time I think this is another way of stating the method: http://www4.wittenberg.edu/academics/mathcomp/bjsdir/ENIACSquareRoot.htm (down at the bottom). Convert the algorithm to binary and it should be linear with regaurd to the bit length of the source number. GMP uses another method, so maybe there is something I'm missing? x86 has powerful bit string operations whereas other processors usually don't - that could explain the choice of algorithm used in GMP.

Here is a way to speed up the loop for small numbers:
``````; test if EAX is a perfect square
xor	edx, edx
bsr	ecx, eax		; get top bit index
btr	eax, ecx		; clear top bit
shr	ecx, 1			; half of magnitude
bts	edx, ecx		; set in result
lea	ecx, [edx*2 + 1]	; next odd number

test	eax, eax
je	Square

_s1:	sub	eax, ecx
lea	ecx, [ecx+2]
ja	_s1
je	Square

; not a perfect square

Square:``````
Posted on 2003-10-22 08:12:19 by bitRAKE
Hi,

here are some thoughts to improve the performance for (very)
large numbers to be tested ..

First, in hexadecimal a perfect sqare ends only with the nibble 0, 1, 4, 9
therefore if the last nibble is not one of these its not a perfect square.

Second, since there is a repetitive pattern of 0, 1, 4, 9, 0, 9, 4, 1 for the last nibble
you may choose a different increment in your loop reducing the overall
count of loops by 75 percent.

example: if the number to be tested is 123456789h the loop may only test
the values 3, 5, 11, 13, 19, 21 etc. in decimal until reaching the number
and for 11111114h the loop should use the values 2, 6, 10, 14, 18, 22 etc
in decimal.

It's up to you to proof me wrong and implement it in asm :)

BTW, if the numbers to be tested are not too large
a byte or even bit lookup table may be a good choice.

Bye Miracle
Posted on 2003-10-23 07:25:09 by miracle
Hey miracle,

Thanks for the hint. I am sure that would speed up my program. So you are saying that, the last 4bits will always be 0, 1, 4 or 9?
Posted on 2003-10-23 07:42:32 by roticv
Hmmm ..

I have simply done a test from 0 .. 20 and I'm pretty sure but
there is guaranty.

What about a lookup-table if the values are not that large and
assuming that there are some Megabytes of RAM to be wasted ;-)

Bye Miracle
Posted on 2003-10-23 08:01:01 by miracle

It's up to you to proof me wrong and implement it in asm :)
0123 4567 89AB CDEF low nibble
0149 0941 0149 0941 square low nibble

Why prove you wrong? ;)

Generally speaking, taking into account the lower M bits will reduce the tests by (M-1/M) percent. ;)

Mulitplication can be looked at as just a shift and add algorithm: 110 x 101 means to shift 110, so the lower bit is in the position of the one bits in 101; and add. It can be easily seen that only the lower bits of both numbers effect the low bit of the result, and only the lower nibble of both numbers effects the lower nibble of the result, and only the lower byte of both numbers...
Posted on 2003-10-23 08:02:23 by bitRAKE
Any reasons why this is so? I am really puzzled.

Yeap it reduces the numbers to test from n to n/4 (75% as you claimed, agreed) :alright:
Posted on 2003-10-23 08:52:32 by roticv

Any reasons why this is so? I am really puzzled.
``````
11001001
x  11001001
----------------
11001001
11001001
11001001
11001001
----------------
1001110111010001``````
Look at this carefully. The least significant bit being "1" tells you that the least significant bit of the result must be one. This is a direct result of the binary mulitplication operation.

X = 2n + m

X^2 = 4n^2 + 4nm + m^2 = 4 n + m

(m) is a bit -- equals one or zero.
(n) is some number.

Do you see how the least significant bit of X^2 is only dependant on (m)?

I could have just said: ODD*ODD = ODD and EVEN*EVEN = EVEN
...but this same method can be used to show how the first four bits effect the result...

X = 16n + m

X^2 = 32n[8n + m] + m^2

(m) is four bits (nibble)

X^2 = * n * 100000y + m^2

X^2 = *n SHL 5 + m^2

(n) is shifted out of the way! (m) is the only thing effecting the first five bits.

Oh, look I rambled on about nothing because we are wanting to go the other way from X^2 to X. :o
Posted on 2003-10-23 22:02:55 by bitRAKE
As with binary division, the classic method of figuring square roots is simpler in binary than in decimal.

For a 32-bit unsigned number, it would take at most 16 iterations to find the (truncated) square root.
``````Let's take decimal 26 or 1Ah.

Split the "dividend" into bit pairs:

01 10 10

In decimal, we would need to figure what the largest square is
for the first pair. In binary the first significant pair always yields
a 1.  (Four doesn't fit in two bits.) That is the first of three root
bits (one for each pair).

We square it and then subtract it.

/ 01 10 10 = 1 x x
01
--
00 10

The iterative steps include doubling the partial root, and adding
a digit to the end. This new number is multiplied by the added
digit to create the number we want to subtract. In decimal we
may need to test more than one digit to determine the largest
one that still leaves a positive remainder. The added digit is the
next root digit. In binary, doubling is adding a 0, and after that,
we can only add 0 or 1. Because we multiply only by 0 or 1, we
only subtract 0 or the new built-up number.

In the following text, we show the construction of the test
number as (partial root, 0 for doubling, 1 for appended digit).

The next test is to see if 101 (1,0,1) can be subtracted from the
partial dividend. Nope, so the next root bit is 0.

/ 01 10 10 = 1 0 x
01
--
00 10
00 00
-- --
10 01

The next test is to see if 1001 (10,0,1) can be subtracted. Yes,
so the next root bit is 1.

/ 01 10 10 = 1 0 1
01
--
00 10
00 00
-----
10 10
10 01
-----
01

It's fairly easy to see from the last step that if we had started
with the perfect square 25, the remainder would be zero.``````
Posted on 2003-10-24 18:00:32 by tenkey
``````; EBX is number to find root of

bsr ecx, ebx
and ecx, -2	; bit pairs
ror ebx, cl

; setup registers
; first bit (can't do zero)
mov edx, 1
; most significant bit pair containing a set bit
mov eax, ebx
and eax, 11y

sub eax, edx

shr ecx, 1
imul ecx, ecx, -25
lea ecx, [JumpTabs + ecx]
jmp ecx

i=0
WHILE i LT 15
shld eax, ebx, 2
rol ebx, 2

lea ecx, [edx*4 + 1]
cmp eax, ecx
cmc
sbb esi, esi