A large unsigned integer square root routine,
as coded it works on 1024 bit unsigned integers.
To use it with other sizes requires no changes to the code,
only change the content or size of the variables.

The content of those that hold sizes of the main variables:
jbd : size of bdd
jbs : size of bsp
jbr : size of brm, brt

The size of the main variables:
bdd : number
bsp : number,spread out two bits at a time
brt : root
brm : remainder

It doesn't use fpu,mmx,3dnow,sse,sse2, or the div instruction.
Uses a single mul which occurs after the routine
to verify if a root is from a square
root 40 from the square 1600 
root 40 rem 3 from 1603

The code calculates 1,000,000 square roots using sequential 1024 bit unsigned integers,
which takes 0:02:37 CPU time on a 1.2 Ghz AMD Athlon.
The only output is a messagebox at the end displaying the number of iterations.
Posted on 2005-12-01 19:06:40 by dsouza123
Does your 0:02:37 timing include conversions to/from binary/ASCII or only the computation of the square roots from a binary variable?

If only from a binary variable, would you have any estimate of the time required for converting an ASCII input equivalent to a 1024-bit binary number (i.e. approx. 300 decimal digits) before extracting its square root, and the estimate of time to convert the answer back to ASCII (i.e. approx. 150 decimal digits)?

Posted on 2005-12-01 20:39:02 by Raymond
No decimal ascii to binary and binary to decimal ascii conversion code.
Hex string to binary and binary to hex string will be done.

The 0:02:37 includes the time to initialize the number with random bytes once
also setting the high nibble to F and the low dword to 0.  
The low dword does double duty as a counter.

It also includes code after the square root calc to quickly check if the root is from a square,
though I haven't determined if there are any values that it would miss and give a false positive.
This quick verify code is included at the end of each of the 1 million evaluations.

The rational of the quick verify is, if the remainder, brm, is zero and the low dword of the root, brt, is squared
and then the low dword of brt*brt == low dword of the number, bdd, then the root is from a square.

In the original delphi code is a routine to convert an even length hex string to binary,
also a routine to get the root back to a hex string,
but I haven't done the assembly code for them yet.

 j := length(s) div 2;
 for i := j downto 1 do begin
   bdd := fromhex(s + s);

 s := '';
 for i := j downto 0 do  
   s := s + tohex(ba);

Each conversion is a byte at a time.

I don't have conversion routines from decimal ascii to binary or binary to decimal ascii
tested with the delphi or assembly code.
Posted on 2005-12-01 21:36:50 by dsouza123
Your algo may be relatively fast and accurate (although I haven't checked the accuracy myself), but what would you consider as its usefulness if it can't accept a decimal string input nor provide the answer as a decimal string? Very few humans can grasp the meaning of a long hex string.

Posted on 2005-12-02 20:42:15 by Raymond
I have a simple bin2ascii rotine that may help a little with those big numbers,
it isn't optimized at all, takes 528k cycles to convert a 1024 bit number to ascii and 113k cycles for a 512 bit number,
i need 1 sec to convert a 65536 bit number to ascii,
it writes the result in dst buffer and reads from src buffer (this one have a big number in little endian format), cnt is the length of src buffer (in dwords), the dst buffer must be 2.5 times larger than src and, for a zero based string, have an extra byte at the end.

bin2ascii proc uses edi esi ebx dst:DWORD, src:DWORD, cnt:DWORD
   mov ecx, cnt
   lea ecx,
   shl ecx, 1
   mov edi, dst
       invoke bigdiv, 10, src, cnt
       add al, 30h
       mov , al
   dec ecx
   jnz @B
bin2ascii endp

bigdiv proc uses ebx esi edi, dvr:DWORD, dst:DWORD, cnt:DWORD
   mov esi, cnt
   mov edi, dst
   mov ebx, dvr
   xor edx, edx
       mov eax,
       div ebx
       mov , eax
   dec esi
   jnz @B
   mov eax, edx
bigdiv endp
Posted on 2005-12-02 20:48:18 by Eduardo Schardong
Thanks for the replies and code.

I found a bug in the first attachment, after checking if brt < brm it didn't
save the result so the following code  brm - (brm + 1), brt + 2 never was

Time with the fixed code increased to 3:35, did a few optimizations able
to do the first 30 iterations working on a single dword size for brt, brm.
On the "Is it an exact square root" test, ie brt*brt == bdd, delayed the more expensive
tests, only does them if it gets by the first quickest test. Time dropped to 3:08.

Added routines to display last value of bdd, brt, brm in ascii hex at the end.
Posted on 2005-12-07 15:41:53 by dsouza123
Added routines:
Convert from decimal (base 10) ascii str to binary.
Convert from hex (base 16) ascii str to binary.
Write hex string to logfile.
Posted on 2005-12-15 00:01:59 by dsouza123