```
; Example: compute 7^12821765729162363903 MOD 1973 = 911

__K__ QWORD 0B1F00FFF800003FFh

lea esi, __K__        ; pointer to exponent
mov ebx, 64 - 1       ; bits in __K__ - 1
mov edi, 7            ; base
mov ecx, 1973         ; modulus
call MOD_POW

;----------------

MOD_POW:
    mov edx, 1
    jmp _1

_0: mul eax           ; A^2
    div ecx           ; MOD N
_1: bt [esi], ebx     ; B(n-1), B(n-2), ... , 2, 1, 0
    mov eax, edi
    jnc _2            ; *UNPREDICTABLE CONDITIONAL JUMP*
    mul edx           ; A*M
    div ecx           ; MOD N
_2: dec ebx           ; next lower bit of power
    mov eax, edx      ; bound to MOD N
    jns _0            ; until all bits of power have been tested
    ret
```

...fresh out of a hot steaming oven.
Don't you just love the smell...

Cook up some code! :grin:
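The routine above is left-to-right binary exponentiation: square the running result once per exponent bit, and multiply in the base whenever the tested bit is set. A minimal Python sketch of the same flow (function and variable names are mine, not from the post):

```python
def mod_pow(base, exponent, modulus, bits=64):
    """Left-to-right binary exponentiation, mirroring MOD_POW:
    test exponent bits from bit (bits-1) down to bit 0."""
    result = 1                                  # mov edx, 1
    for b in range(bits - 1, -1, -1):
        result = (result * result) % modulus    # mul eax / div ecx (no-op while result is 1)
        if (exponent >> b) & 1:                 # bt [esi], ebx
            result = (result * base) % modulus  # mul edx / div ecx
    return result

K = 0xB1F00FFF800003FF  # 12821765729162363903
assert mod_pow(7, K, 1973) == pow(7, K, 1973)   # agrees with Python's built-in
```

The ASM skips the first squaring via `jmp _1`; the sketch squares unconditionally, which is equivalent because squaring the initial 1 changes nothing.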

From the book:

*Foundations of Algorithms Using C++ Pseudocode*, Third Edition

by Richard Neapolitan and Kumarss Naimipour

The C++ in this book is crazy - he doesn't use any high-level types (vector, matrix, etc.); instead he mixes the mathematical expressions with the C++ - damn crazy to understand. I just wanted to read the code and translate it into ASM, but the code was too scary by itself. He couldn't have put much thought into it, imho. Luckily, I was able to glean enough from the book text to reconstruct the algorithm - which he did explain well.

I've been thinking of ways to speed up the algorithm - lazily at first, then maybe for more exotic transformations. I also have not browsed every corner of the web for a solution, or looked through the libraries on my harddrive for someone else's code - the goal here is to use my own creative ideas. For me it helps to have a clean slate, so to speak - a fresh mind.

Well, the very first thing to hit my mind was not such a simple thing: I was thinking about creating 256 optimized routines and then branching on a byte instead of a bit - because that branch is going to be very costly on newer machines and I really want to reduce that negative. Oh, and those divides - I don't like them either.
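Branching on a byte instead of a bit is essentially fixed-window (2^8-ary) exponentiation: square eight times, then fold in one table-driven multiply per exponent byte, so the data-dependent per-bit branch disappears. A hedged Python sketch of that idea (a table lookup stands in for the 256 optimized routines; names are mine):

```python
def mod_pow_window8(base, exponent, modulus, bits=64):
    """2^8-ary (fixed 8-bit window) exponentiation: one table lookup and
    one multiply per exponent byte instead of one branch per bit.
    The table holds base^0 .. base^255 mod modulus."""
    table = [1] * 256
    for i in range(1, 256):
        table[i] = (table[i - 1] * base) % modulus
    result = 1
    for i in range((bits + 7) // 8 - 1, -1, -1):     # bytes, MSB first
        for _ in range(8):                           # eight squarings per byte
            result = (result * result) % modulus
        result = (result * table[(exponent >> (8 * i)) & 0xFF]) % modulus
    return result

assert mod_pow_window8(7, 0xB1F00FFF800003FF, 1973) == pow(7, 0xB1F00FFF800003FF, 1973)
```

The trade-off is the 256-entry precomputation, which only pays off when the exponent is long enough to amortize it.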

...I'm rambling, but bandwidth is getting cheaper as you read this. :grin:

Hello bitRAKE

What about extending The Svin's divider to 64-bit (u know, that trick that allows to compute q = x / y as a MUL ; after that u would just make a x - (q * 1973) to get the remainder :) No, useless : I think his divider gives u the remainder too.

And as I am lazy, please code it for my Opteron ;)


Hi bitRAKE,


"modulus power" is a slightly poor choice of name; "modular exponentiation" is better :)

The `bt [esi], ebx` should be avoided. Instead, left-shift the exponent until the MSB goes into the carry; then the BT instruction can be replaced by a simple ADD ESI,ESI and JNC/JC. EBX should contain the bit count of the exponent.

But contrary to your code, it then works only for 32-bit exponents.

One more optimization can be done: we can work with the Montgomery trick. Instead of 1 MUL + 1 DIV we then use 2 MUL + 1 IMUL. We need the values U = -M^-1 mod 2^32, M = modulus, Base' = Base * 2^32 mod M.

The reduction is then:

```
mul eax           ; Base' = Base'^2
mov edi, eax      ; Montgomery REDC
mov esi, edx
imul eax, U
mul M
add eax, edi      ; eax = 0
adc edx, esi
jnc @@1
sub edx, M
@@1:
```

Reducing a value two times converts it from the Montgomery domain back to the normal domain.

If we want a fast primality check, then your exponentiation is the basic algorithm. The Montgomery precomputation is then done only once, for more than one exponentiation. This trick also opens the way to using SSE2.

And there exist ways to reduce big exponents down to a size smaller than the modulus.

If you are interested, I have some IsPrime() code for 2^32 bounds.

Hagen

**Hagen**, Thanks for clearing up the naming - I want to confuse as few people as possible. :)

The BT instruction is okay on Athlons - in fact the register version is direct path and only one cycle. Branching on a byte would remove this instruction and the branch altogether.

The multiply/divide reduction is very good. I was trying to think of something like this, but didn't make the connection just yet.

__Handbook of Applied Cryptography__

by Alfred J. Menezes, Paul C. Van Oorschot, Scott A. Vanstone

http://www.cacr.math.uwaterloo.ca/hac/index.html (PDF sample chapters to download)

p.600, Section 14.3.2, explains the Montgomery reduction and gives examples.

I would like to see your IsPrime() code, but it might be better suited to its own thread or attached to an existing thread on primes. Primes seem to get people motivated because of their uses, but my goals are more general - I mentioned them merely as where I first came across the algorithm. Many people want to learn about algorithm development, and I think reduction of mathematical operations on very large numbers is an area many people can relate to easily. Also, there seem to be some quite experienced people in this area that frequent the board. ;)

**valy**, It is not always clear where to go from a naïve implementation of an algorithm. There are two distinct skills to develop: on one hand there is knowledge of the best algorithm, and on the other there is the implementation. The same techniques we apply within the algorithm can be used to build the algorithm - high-level programming does not expose this added complexity. I mean you can try to order things to give the compiler a better chance at optimization, but it does not have the same effectiveness as coding in ASM.

Posted on 2003-11-04 20:13:42 by bitRAKE

I think it would be useful to apply Fermat's little theorem at the outset in order to reduce the size of the exponent. For our purposes 7^12821765729162363903 ≡ 7^1923 (mod 1973), according to the theorem. This reduces the complexity significantly and works well with exponents that are larger than the prime modulus.
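This is easy to sanity-check with Python's built-in pow; reducing the exponent mod p-1 is valid here because 1973 is prime and gcd(7, 1973) = 1:

```python
p = 1973                       # prime modulus from the example
K = 12821765729162363903       # the 64-bit exponent
k = K % (p - 1)                # exponent reduced mod p-1 = 1972
# Fermat's little theorem: 7^(p-1) ≡ 1 (mod p), so both results agree
assert pow(7, K, p) == pow(7, k, p)
```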

In general, does anyone think it would be a performance improvement to store the numbers as an array of primes? For example, 72 would be stored as 2,2,2,3,3 - or rather (2,3),(3,2) might be better. One could also store negative exponents rather than performing the divide: 72/13 = (2,3),(3,2),(13,-1). Add and subtract operations would take longer, but I'd think it would simplify many other algorithms.
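The representation above can be modeled as a map from prime to signed exponent; multiplication and division then become exponent addition and subtraction. A hypothetical sketch (all helper names are mine):

```python
def factorize(n):
    """Trial division into a {prime: exponent} map, e.g. 72 -> {2: 3, 3: 2}."""
    f, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            f[d] = f.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

def fmul(a, b):
    """Multiply factored numbers: add exponents (zero entries dropped)."""
    r = dict(a)
    for p, e in b.items():
        r[p] = r.get(p, 0) + e
        if r[p] == 0:
            del r[p]
    return r

def fdiv(a, b):
    """Divide: subtract exponents; negatives encode the denominator."""
    return fmul(a, {p: -e for p, e in b.items()})

def fvalue(f):
    """Expand back to (numerator, denominator) - the slow direction."""
    num = den = 1
    for p, e in f.items():
        if e >= 0:
            num *= p ** e
        else:
            den *= p ** -e
    return num, den

assert factorize(72) == {2: 3, 3: 2}
assert fdiv(factorize(72), factorize(13)) == {2: 3, 3: 2, 13: -1}
```

As the post notes, the catch is addition and subtraction, which force expanding back out of the representation.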

Isn't the (2,3)(3,2) thing called factorisation? (sorry for compactness i'm on a F-keyboard, scrambled everything)

@bitRAKE, that depends on what you want to compute. Using the prime factorization leads, for example in the computation of big factorials/binomials etc., to very fast algorithms.

Using it for plain modular exponentiation isn't helpful. For arbitrary integers it is better to use a prime-remainder representation of each number. Addition and subtraction are then linear, and modular reduction works only on the remainder arrays. With some special hardware we get the fastest performance. Take a look at Daniel Bernstein; he describes this technique. Other sources are Arnold Schönhage with the modular Fermat fast Fourier transform, where the numbers are remainder representations to moduli of the form 2^(r*m) + 1. All these techniques are based on the Chinese Remainder Theorem.

Best Regards, Hagen
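The residue-number idea can be sketched quickly: keep each number as its vector of remainders modulo pairwise-coprime moduli, operate componentwise (no carries between slots), and reconstruct with the CRT only at the end. The moduli below are arbitrary illustration values:

```python
from math import prod

MODULI = (3, 5, 7, 11, 13, 17)          # pairwise coprime; product = 255255

def to_residues(x):
    return tuple(x % m for m in MODULI)

def rmul(a, b):
    """Componentwise multiply - each slot is independent, so this parallelizes."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_residues(r):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    x = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        x += ri * Mi * pow(Mi, -1, mi)  # pow(..., -1, mi) is the modular inverse
    return x % M

a, b = 123, 456                          # 123 * 456 = 56088 < 255255
assert from_residues(rmul(to_residues(a), to_residues(b))) == a * b
```

Results are exact as long as they stay below the product of the moduli, which is why real systems pick many large coprime moduli.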


I found this link to a very entertaining and readable survey paper on fast modular exponentiation algorithms: http://citeseer.ist.psu.edu/gordon97survey.html


This is how long Mathematica takes for my test numbers:

```
In[1]:= Timing[PowerMod[2^9689 - 1, 2^9941 - 1, 2^11213 - 1];]
Out[1]= {7.656 Second, Null}
```

If **Hagen** or anyone else could post timings of their algorithms to give me a reference: what is the fastest you can compute (2^9689 - 1) ^ (2^9941 - 1) MOD (2^11213 - 1)?

Thank you, **Poimander**, that was exactly what I was looking for - should be some more code coming soon.

With my lib 1.2 seconds, but that's a little bit tricky to explain :)

Your numbers are all of the special form 2^k-1, so we can use special algorithms that are exorbitantly faster! If I use a generic modular exponentiation, my lib needs about 8 seconds. Thus Mathematica, being an interpreter, must be taking care of your special numbers - otherwise I couldn't explain why Mathematica runs so fast.

For the 1.2-second variant I used a B^E mod (2^k-1) algorithm, but it should be possible to go faster still with another algorithm, because B = 2^k-1 and E = 2^k-1.

So, try generating random numbers with B = 9689 bits, E = 9941 bits, and M = 11213 bits and even. Then run a test with Mathematica. If Mathematica then runs faster than 8 seconds, my library must be really slow :-)

Best regards, Hagen


Hagen: how fast is your library compared to LibGMP?

I used LibNTL to do some lattice reductions, and had to convert LibGMP to Intel assembly format.

If you have a fast library, I'm interested in adding it to my current code...


**Hagen**, Mathematica is actually faster with random numbers!?

```
In[1]:= R1 = Random[Integer, {2^9689, 2^9690}];
        R2 = Random[Integer, {2^9941, 2^9942}];
        R3 = 2 * Random[Integer, {2^11212, 2^11213}];
        Timing[PowerMod[R1, R2, R3];]
Out[1]= {5.625 Second, Null}
```

My code doesn't work yet. So, anything else is faster at this point, but I'd like to beat the best thing out there. :)

@MCoder, my Lib is faster than LibGMP. That depends on a few different things.


1. My lib uses equally many lines of assembler-optimized code, especially for SSE2.

2. My lib uses assembler-optimized algorithms, for example Montgomery, where GMP doesn't use such algorithms.

3. Especially with very large arbitrary integers I have used better algorithms than GMP. As an example, both I and GMP use a Schönhage Fermat fast Fourier transform combined with the fast Karatsuba division of Burnikel & Ziegler. But GMP uses a one-to-one translation of the Burnikel & Ziegler paper. That's strange, because their algorithm then uses the fast trick they describe only half the time (you could call it a bug :-). My own version of this fast division always uses the fast trick and should thus be about 2 times faster.

4. Try Miracl instead of GMP. Miracl is faster than GMP, especially for modular computation.

5. GMP is a useful base of very low-level optimized source. If we want a fast math library, we can use GMP as a base and implement the important and complex algorithms on top of it. But exactly this step is the hardest one, harder than programming a simple and fast addition or multiplication in assembler. As an example, a clever modular multiplication on an elliptic curve uses some mathematical tricks. Understanding these math tricks and getting them running is the harder work.

But all these matters aren't the big difference. My lib is in Pascal and uses a much better interface. That means I implemented garbage collection and a copy-on-write-on-demand feature, and there is no need for the user to take care of allocations, deallocations, or the complex and fast algorithms. My goal was a ready-to-use math library so you can concentrate on the mathematical problem.

Following is a small piece of code to create and test RSA key generation, encryption and decryption.

The variables of type IInteger are the arbitrary-precision numbers. No manual deallocation is needed, and all numbers are allocated on demand. If we assign such a number to another, we don't copy the full context; instead it is reference based. Operations that modify such a multi-referenced number perform a "copy on write demand" - they detect the modification and copy the actual context to a new, singly-used context. Exactly this operation gives us an easily controlled and fully transparent base for optimizing any operation to "move memory along the way": instead of copying memory on an assignment, we copy the memory context while doing the specific operation, such as subtraction, addition, and so on.

The only feature missing in Pascal is operator overloading as in C++. By the way, I think that operator overloading, as it is, is only useful with such a math library :-)

```
procedure TestRSA(KeySize: Integer);
var
  P,Q,E,D,N,Dp,Dq,U: IInteger; // RSA key param
  M,C,X,Y: IInteger;           // RSA en/decryption
  S: Integer;
begin
  repeat
    S := KeySize shr 1;
    NRnd(P, S);
    NBit(P, S -2, True);
    NMakePrime(P, [1, 2]);
    S := KeySize - NSize(P);
    repeat
      NRnd(Q, S);
      NBit(Q, S -2, True);
      NMakePrime(Q, [1, 2]);
    until NCmp(P, Q) <> 0;
    if NCmp(P, Q) < 0 then NSwp(P, Q);
    NMul(N, P, Q);
  until NSize(N) = KeySize;
  NDec(P);
  NDec(Q);
  NLCM(U, P, Q);    // U = LCM(P -1, Q -1)
  repeat
    repeat
      NRnd(E, NLog2(NSize(N)) * 4);
      NOdd(E, True);
    until NGCD1(E, P) and NGCD1(E, Q);
  until NInvMod(D, E, U) and (NSize(D) >= NSize(E)) and NGCD1(D, N);
  NMod(Dp, D, P);   // Dp = D mod (P -1)
  NMod(Dq, D, Q);   // Dq = D mod (Q -1)
  NInc(P);
  NInc(Q);
  NInvMod(U, P, Q); // U = P^-1 mod Q

  // Encryption
  NSet(M, 'RSA Public Key scheme with DEC', 256);
  NPowMod(C, M, E, N);

  // Decryption with CRT
  NPowMod(X, C, Dp, P);
  NPowMod(Y, C, Dq, Q);
  NSub(Y, X);
  NMulMod(Y, U, Q);
  NMul(Y, P);
  NAdd(Y, X);
end;
```
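The decryption tail of TestRSA is the standard CRT (Garner) recombination: X = C^Dp mod P, Y = C^Dq mod Q, then Y = ((Y - X) * U mod Q) * P + X with U = P^-1 mod Q. A toy-sized Python check of the same steps (tiny primes for illustration only, nothing like a real key size):

```python
P, Q = 61, 53                         # toy primes
N = P * Q
E = 17                                # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))     # private exponent (the Pascal uses LCM; either works)
Dp, Dq = D % (P - 1), D % (Q - 1)     # Dp = D mod (P-1), Dq = D mod (Q-1)
U = pow(P, -1, Q)                     # NInvMod(U, P, Q): U = P^-1 mod Q

M = 42                                # message, must be < N
C = pow(M, E, N)                      # NPowMod(C, M, E, N): encryption

# Decryption with CRT, mirroring the tail of TestRSA:
X = pow(C, Dp, P)                     # NPowMod(X, C, Dp, P)
Y = pow(C, Dq, Q)                     # NPowMod(Y, C, Dq, Q)
Y = ((Y - X) * U) % Q                 # NSub(Y, X); NMulMod(Y, U, Q)
M2 = Y * P + X                        # NMul(Y, P); NAdd(Y, X)
assert M2 == M                        # plaintext recovered
```

The point of the CRT split is speed: the two half-size exponentiations cost roughly a quarter of one full-size C^D mod N each.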

@BitRAKE: 5 seconds, that's really strange to me!? On a P4 50 GHz machine? My timings are on a P4 1.5GHz. Which version of Mathematica was used?

Best Regards, Hagen


I am working on a big integer lib - so far all my algorithms are faster than GMP, but they are specially optimized for XP processors (I'll get an Opteron in a couple of months, **valy**, and start working on optimizing for it). AMD processors have a faster multiply and faster MMX/FPU than the P4. I haven't done anything with SSE2 yet.

For example, I was able to reduce the single-limb multiply down to 3 cycles per limb for cached data - the fastest possible on XP processors. AMD says MUL has a latency of 6 cycles. Not only is the processor doing two MULs at once, but all other instructions are executed in the latency shadow. How is this possible when one multiply requires the result of the other!? This is a 10% improvement over GMP (and it is smaller in code size :)).
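Independent of the cycle counting, what mpn_mul_1 computes is simple: multiply a little-endian array of 32-bit limbs by one limb in place, threading a carry limb through. A plain Python model of those semantics (not of the unrolled ASM):

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def mpn_mul_1(limbs, multiplier, carry=0):
    """limbs (little-endian) *= multiplier, plus an incoming carry;
    returns the carry-out limb (the ASM returns it in EAX)."""
    for i, limb in enumerate(limbs):
        prod = limb * multiplier + carry     # 32x32 -> 64-bit mul plus carry
        limbs[i] = prod & LIMB_MASK          # store the low 32 bits
        carry = prod >> LIMB_BITS            # high 32 bits feed the next limb
    return carry

# Check against plain integer arithmetic:
a = [0xFFFFFFFF, 0x12345678, 0x9ABCDEF0]
before = sum(l << (32 * i) for i, l in enumerate(a))
out = mpn_mul_1(a, 0xDEADBEEF)
after = sum(l << (32 * i) for i, l in enumerate(a)) + (out << (32 * len(a)))
assert after == before * 0xDEADBEEF
```

The carry dependency between iterations is exactly the chain the 3-cycle ASM has to hide inside the MUL latency shadow.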

Wow !

Both of your libs should improve factoring projects like ECMNet significantly (http://www.loria.fr/~zimmerma/records/ecmnet.html).

In my project (named SumBKZ), I needed: mpn_sub, mpn_add, mpn_mul, mpn_lshift, mpn_rshift and a multiprecision multiplication.

If your libs are freely available, I'll be glad to include them in this project !


Hm, Mathematica should then need ~7 seconds on a 1.5GHz P4 - that's still better than my code. I can't believe this. Could you try smaller numbers please, for example in the range of 4096 bits? My lib needs 600 ms for this.

AMD processors have a faster multiply and faster MMX/FPU than the P4. I haven't done anything with SSE2 yet.

You can assume that SSE2 code is at least 2 times faster than plain i486 code.

3 cycles per digit/limb is very good. My lib needs 6 cycles. Can I see your optimizations? Do you use native i386 code?

@MCoder: sorry, my lib isn't freely available :(

Improving ECMNet does not depend on best-optimized assembler code. Any slightly better algorithm is far faster than a very good assembler optimization of the currently used algorithms. So optimizing the mathematical algorithms used, or introducing distributed computing, would improve things far more. That's my 3 cents, of course :)

Best regards, Hagen


3 cycles per digit/limb is very good. My lib needs 6 cycles. Can I see your optimizations? Do you use native i386 code?

```
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

ALIGN _CODE_ALIGNMENT_

; loop range from 2 thru 62
; 32: 3.0700 cycles per limb (DWORD)
mpn_mul_1__k7_04_UNROLL = 32

mpn_mul_1__k7_04 PROC USES ebx esi edi, bInt:PTR BIGINT, limb:DWORD, carry:DWORD
pBIGINT EQU <[esp+4*(4+1)]>
_limb   EQU <[esp+4*(4+2)]>
_carry  EQU <[esp+4*(4+3)]>
    push ebp
    push ebx
    push esi
    push edi
    mov esi, pBIGINT
IF 32 EQ mpn_mul_1__k7_04_UNROLL
    mov ecx, mpn_mul_1__k7_04_UNROLL
    mov edi, [esi].BIGINT.limbs - SIZEOF BIGINT
    mov edx, edi
    shr edi, 5            ; divide by 32
    and edx, 1Fh          ; remainder
ELSE
    mov ecx, mpn_mul_1__k7_04_UNROLL
    mov eax, [esi].BIGINT.limbs - SIZEOF BIGINT
    cdq
    div ecx
    mov edi, eax
ENDIF
    imul ecx, edx, -14    ; mpn_mul_1__k7_04_LIMB_CODE
    mov eax, [esi]        ; [-4*mpn_mul_1__k7_03_UNROLL]
    mov ebp, _limb        ; multiplier
    mov ebx, _carry       ; carry

    ; adjust ESI to point at the correct block of data to allow a greater
    ; unroll range while still having only a byte offset:
    ; each block can be a maximum of 128 bytes
    ;
    ; ESI points at BIGINT[0], but needs to point at
    ; BIGINT[80h - (BIGINT.limbs MOD UNROLL) + UNROLL]
    ;
    ; ESI = ESI + 4*(UNROLL - (BIGINT.limbs MOD UNROLL))
    sub edx, 31
    shl edx, 2
    add esi, edx

    ; jump to complete the partial roll first
    lea edx, [ecx][_2]
    xor ecx, ecx
    jmp edx

ALIGN 16
_0:
i = 80h - 4*mpn_mul_1__k7_04_UNROLL
WHILE i LT 80h
    mul ebp
    add ebx, eax
    IF i EQ 0
    BYTE 8Bh, 46h, 00h    ; mov eax, [esi+0], forced disp8 encoding for fixed length
    mov [esi][i][-4], ebx
    ELSEIF i EQ 4
    mov eax, [esi][i]
    BYTE 89h, 5Eh, 00h    ; mov [esi+0], ebx, forced disp8 encoding for fixed length
    ELSEIF i GT -80h
    mov eax, [esi][i]
    mov [esi][i][-4], ebx
    ELSE                  ; only byte offsets
    .err mpn_mul_1__k7_04_UNROLL too big!
    ENDIF
    mov ebx, ecx
    adc ebx, edx
i = i + 4
ENDM
_2: dec edi
    lea esi, [esi + 4*mpn_mul_1__k7_04_UNROLL]
    jns _0

    mov eax, ebx          ; return carry
    pop edi
    pop esi
    pop ebx
    pop ebp
    retn 4*3
mpn_mul_1__k7_04 ENDP
```

I should try a speed test with 16 bytes of code per limb and see if it executes at the same speed, so I can simplify the setup code.

...on a side note, I think I might have found a faster MMX string length! But it's only better for larger-than-average string lengths (256+ maybe, and in the cache). Long strings whose length we don't know are rarely in the cache. :)

```
R1 = 2^4095 - 5;
R2 = 2^4095 - 3;
R3 = 2^4095 - 1;
Timing[PowerMod[R1, R2, R3];]
```

{0.36 Second, Null}

Hi bitRAKE,


That's not what I wanted - all the numbers are of special form. Now, Mathematica is an interpreter and possibly knows of this special form, and thus can use a highly specialized version of PowerMod.

I need a comparison with really randomly chosen numbers in the range of 4096 bits.

Regards, Hagen


Sorry for my haste.

0.437 Second on average, for random numbers of 4096 bits.
