Thanks bitrake, I will chew on your code (though it looks pretty hard to understand). Thank you, tenkey for explaining square root in binary to me.:alright:

**Norbert Juffa**'s code might be a better read:

```
sqr_iter MACRO N:REQ
```

lea edx, [eax + 1 SHL N]

lea ebx, [eax + 2 SHL N]

shl edx, N

mov esi, ecx

sub ecx, edx

cmovnc eax, ebx

cmovc ecx, esi

ENDM

; ECX is X^2

xor eax, eax ; 2*root

; iteration 15

mov ebx, 2 SHL 15

mov esi, ecx

sub ecx, 1 SHL 30

cmovnc eax, ebx

cmovc ecx, esi

sqr_iter(14)

sqr_iter(13)

sqr_iter(12)

sqr_iter(11)

sqr_iter(10)

sqr_iter(9)

sqr_iter(8)

sqr_iter(7)

sqr_iter(6)

sqr_iter(5)

sqr_iter(4)

sqr_iter(3)

sqr_iter(2)

sqr_iter(1)

; iteration 0

lea edx, [eax + 1]

lea ebx, [eax + 2]

sub ecx, edx

cmovnc eax, ebx

shr eax, 1

; EAX = X

Hi

I developed my own large integer math library and use follow method

1.) if Value <= 0 then return (Value = 0)

2.) lookup Value and $FF in a table with 256 entries, 0.171875 Candidates pass this test

3.) compute the Crosssum over Value to 2^96 -1 in S, this computation is very fast, only additions

4.) compute T = S mod (193 * 97 * 17 * 13 * 7 * 5), only <= 3 divisions 32 bit, a 3*32Bit / 32 Bit div

5.) lookup T mod 193, T mod 97, T mod 17, T mod 13, T mod 7, T mod 5, if any are'nt perfect square remainders return False

6.) compute T = S mod (9 * 673 * 257 * 241)

7.) lookup T mod 9, T mod 673, T mod 257, T mod 241, if any are'nt perfect square remainders return False

8.) now only 0.0002392 pass the test upto here, eg. 1 of 4181 candidates, all test are ordered by they probabilty of perfect squares. Upto here we need to any arbitrary Integer only 16 times a 32 Bit Division.

9.) We have now to expand the tests. Instead of 32 Bit divisions of arbitrary Integers to calc remainders to any 32 bit number, we choose a formula such that Result * (2^32)^C == Q * D - A, where D is odd. This way we could test an arbitrary Integer if it divisible to a small and odd 32 bit Number. We can't calculate in this way the exact remainder. This operations need about Ln2(Value) / 32 * (1 mul + 2 add), Value is our arbitrary integer. Thus we can test to all primes upto 257 in chunks. The final probability that a numer pass all these tests are 0.000000000000000008613904520439 = 1 of 116091372690195275 Candidates. Thats very low. Remaining Numbers must be checked with a Square Root algorithm, but not with Remainder ! Instead of calculating the remainder we scale up our Number * 2^32, do the Square Root, test if Result mod 2^16 <> 0, and scale down Result / 2^16. The test Result mod 2^16 <> 0 get us the knownledge if it is NOT a perfect square with a probality of 1/2^16. If mod 2^16 = 0 we have to squaring Result and check with Value. If equal then we know definitely it's a perfect square. Thus we testing in Perfect Square Test if a Number is NOT a perfect Square until to the last test by Squaring again the Sqrt and comparing with Input.

The Sqrt Algorithm is thus called only 1/116091372690195275 and the final Squaring 1/(116091372690195275 * 2^16).

Testing 1 Million random 1024 bit numbers take on my P4 130 milliseconds, eg. need about 190 clockcycles per number. On average a 1 million bits number need 0.047 clockcycles per bit.

Naive algorithms to calculate the Sqrt of any arbitrary Integer are inpracticable eg. inperformant. Most used are Newton Iterations. In my library i use the so called "Recursive Karatsuba Square Root Algorithm" of (INRIA, "Karatsuba Square Root" by Paul Zimmermann Research Report No 3805, file: RR-3805.ps). It's a adaption of the fast "Karatsuba Divsion Algorithm" of (Christoph Burnikel & Joachim Ziegler, MPI-I-98-1-022 October 1998, Research Report Max Planck Institut f?r Informatik).

Best Regards, Hagen

I developed my own large integer math library and use follow method

1.) if Value <= 0 then return (Value = 0)

2.) lookup Value and $FF in a table with 256 entries, 0.171875 Candidates pass this test

3.) compute the Crosssum over Value to 2^96 -1 in S, this computation is very fast, only additions

4.) compute T = S mod (193 * 97 * 17 * 13 * 7 * 5), only <= 3 divisions 32 bit, a 3*32Bit / 32 Bit div

5.) lookup T mod 193, T mod 97, T mod 17, T mod 13, T mod 7, T mod 5, if any are'nt perfect square remainders return False

6.) compute T = S mod (9 * 673 * 257 * 241)

7.) lookup T mod 9, T mod 673, T mod 257, T mod 241, if any are'nt perfect square remainders return False

8.) now only 0.0002392 pass the test upto here, eg. 1 of 4181 candidates, all test are ordered by they probabilty of perfect squares. Upto here we need to any arbitrary Integer only 16 times a 32 Bit Division.

9.) We have now to expand the tests. Instead of 32 Bit divisions of arbitrary Integers to calc remainders to any 32 bit number, we choose a formula such that Result * (2^32)^C == Q * D - A, where D is odd. This way we could test an arbitrary Integer if it divisible to a small and odd 32 bit Number. We can't calculate in this way the exact remainder. This operations need about Ln2(Value) / 32 * (1 mul + 2 add), Value is our arbitrary integer. Thus we can test to all primes upto 257 in chunks. The final probability that a numer pass all these tests are 0.000000000000000008613904520439 = 1 of 116091372690195275 Candidates. Thats very low. Remaining Numbers must be checked with a Square Root algorithm, but not with Remainder ! Instead of calculating the remainder we scale up our Number * 2^32, do the Square Root, test if Result mod 2^16 <> 0, and scale down Result / 2^16. The test Result mod 2^16 <> 0 get us the knownledge if it is NOT a perfect square with a probality of 1/2^16. If mod 2^16 = 0 we have to squaring Result and check with Value. If equal then we know definitely it's a perfect square. Thus we testing in Perfect Square Test if a Number is NOT a perfect Square until to the last test by Squaring again the Sqrt and comparing with Input.

The Sqrt Algorithm is thus called only 1/116091372690195275 and the final Squaring 1/(116091372690195275 * 2^16).

Testing 1 Million random 1024 bit numbers take on my P4 130 milliseconds, eg. need about 190 clockcycles per number. On average a 1 million bits number need 0.047 clockcycles per bit.

Naive algorithms to calculate the Sqrt of any arbitrary Integer are inpracticable eg. inperformant. Most used are Newton Iterations. In my library i use the so called "Recursive Karatsuba Square Root Algorithm" of (INRIA, "Karatsuba Square Root" by Paul Zimmermann Research Report No 3805, file: RR-3805.ps). It's a adaption of the fast "Karatsuba Divsion Algorithm" of (Christoph Burnikel & Joachim Ziegler, MPI-I-98-1-022 October 1998, Research Report Max Planck Institut f?r Informatik).

Best Regards, Hagen

**Hagen**, your detailed replies and experience in these matters is greatly appreciated.

@bitRake, many thanks :)

But i think many times that nobody can understand my bad english !

Above informations are based on deep self expiriences with such implementations and the goal to implement the fastest possible algorithms, I'm a algorithm junky, such as You :-)

Some small different Perfect Square Test's can be found in the libraries GMP, Miracl, HFloat and so on...

But i have'nt seen any Algorithm as mine with these low probabilities. The needed Lookuptables requieres about 1024 bytes storage. That's to big for some needs.

Realy more complex are Perfect Power Tests, i suggest then to search for ("Detecting perfect powers in essentially linear time" Daniel J. Bernstein, file: powers.ps)

Regards, Hagen

But i think many times that nobody can understand my bad english !

Above informations are based on deep self expiriences with such implementations and the goal to implement the fastest possible algorithms, I'm a algorithm junky, such as You :-)

Some small different Perfect Square Test's can be found in the libraries GMP, Miracl, HFloat and so on...

But i have'nt seen any Algorithm as mine with these low probabilities. The needed Lookuptables requieres about 1024 bytes storage. That's to big for some needs.

Realy more complex are Perfect Power Tests, i suggest then to search for ("Detecting perfect powers in essentially linear time" Daniel J. Bernstein, file: powers.ps)

Regards, Hagen