``````; Example:
;
__K__ QWORD 0B1F00FFF800003FFh

;	compute: 7^12821765729162363903 MOD 1973 = 911
;
lea	esi, __K__	; exponent
mov	ebx, 64 - 1	; bits in __K__ - 1
mov	edi, 7		; mod
mov	ecx, 1973	; modulus base
call	MOD_POW

;----------------

MOD_POW:
mov	edx, 1
jmp	_1

_0:	mul	eax		; A^2
div	ecx		; MOD N
_1:	bt	[esi], ebx	; B(n-1), B(n-2), ... , 2, 1, 0
mov	eax, edi
jnc	_2		; *UNPREDICTABLE CONDITIONAL JUMP*
mul	edx		; A*M
div	ecx		; MOD N
_2:	dec	ebx		; next lower bit of power
mov	eax, edx	; bound to MOD N
jns	_0		; until all bits of power have been tested

ret``````
...fresh out of a hot steaming oven.
Don't you just love the smell...

Cook up some code! :grin:

From the book:
Foundations of Algorithms Using C++ Pseudocode, Third Edition
by Richard Neapolitan and Kumarss Naimipour

The C++ in this book is crazy - he doesn't use any high-level types (vector, matrix, etc.) - instead he mixes the mathematical expressions with the C++ -- damn crazy to understand. I just wanted to read the code and translate into ASM, but the code was too scary by itself. He couldn't have put much thought into this, imho. Luckily, I was able to gleam enough from the book text to reconstruct the algorithm - which he did explain well.

I've been thinking of ways to speed up the algorithm - lazy at first, then maybe for more exotic transformations. I also have not browsed every corner of the web for a solution, or looked through the libraries on my harddrive for someone else's code -- the goal here is to use my own creative ideas - for me it helps to have a clean slate so to speak, a fresh mind.

Well, the very first thing to hit my mind was not such a simple thing: I was thinking about creating 256 optimized routines and then branching on a byte instead of a bit - because that branch is going to be very costly on newer machines and I really want to reduce that negitive. Oh, and those divides - I don't like them either.

...I'm rambling, but bandwidth is getting cheaper as you read this. :grin:
Posted on 2003-11-03 22:18:53 by bitRAKE
Hello Bitrake

What about extending The Svin's divider to 64-bit (u know, that trick that allows to compute q = x / y as a MUL ; after that u would just make a x - (q * 1973) to get the remainder :) No, useless : I think his divider gives u the remainder too.
And as I am lazy, please code it for my Opteron ;)
Posted on 2003-11-04 05:21:05 by valy
Hi bitRAKE,

"modulus power" is slightly bad choosen, better "modular exponentation" :)
The bt , ebx should be avoided. Instead left shift the exponent until the MSB goes into carry. Then the BT instruction can be replaced by simple ADD ESI,ESI and JNC/JC. EBX should contains the bitcount of the exponent.
But contrary to your code, it works then only to 32 bit exponents.

One optimization more can be done. We can work with Montgomery trick. Instead 1 MUL + 1 DIV we use then 2 MUL + 1 IMUL. We need the values U = M^-1 mod 2^32, M = Modulus, Base' = Base * 2^32 mod M.
Reduceing are then:

``````

mul  eax      // Base' = Base'^2

mov  edi,eax  // montgomery REDC
mov  esi,edx
imul eax,U
mul  M
add  eax,edi  // eax = 0
jnc  @@1
sub  edx,M
@@1:

``````

Reducing two times a value we convert from montgomery domain back to normal domain.
If we want a fast primality check then your exponentation is the basis algorithm. The montgomery percomputation are than only one times done, for more as one exponentation. This trick get us the way to use SSE2.

And it exists ways to reduce big exponents down to size smaller the modulus.

If you are interessted I have some IsPrime() code for 2^32 bounds.

Hagen
Posted on 2003-11-04 07:45:21 by Hagen
Hagen, Thanks for clearing up the naming - I want to confuse the least number of people as possible. :)

The BT instruction is okay on Athlons - in fact the register version is direct path and only one cycle. Branching on a byte would remove this instruction and the branch altogether.

The multiply/divide reduction is very good. I was trying to think of something like this, but didn't make the connection just yet.

Handbook of Applied Cryptography
by Alfred J. Menezes, Paul C. Van Oorschot, Scott A. Vanstone
p.600, Section 14.3.2, explains the Montgomery reduction and gives examples.

I would like to see your IsPrime() code, but it might be better suited to it's own thread or attached to an existing thread on primes. Primes seem to get people motivated because of their uses, but my goals are more general - I mentioned it merely as where I first came across the algorithm. Many people want to learn about algorithm development and I think reduction of mathematical operations on very large numbers is an area where many people can relate easily. Also, there seems to be some quite experienced people in this area that frequent the board. ;)

valy, It is not always clear where to go from a na?ve implementation of an algorithm. There are two distinct skills to develop: on one hand we have the knowledge of the most optimal algorithm, and on the other hand there is the implementation. The same techniques we apply within the algorithm can be used to build the algorithm - high-level programming does not require this added complexity. I mean you can try to order things to give the compiler a better chance at optimization, but it does not have the same effectiveness of coding in ASM.
Posted on 2003-11-04 20:13:42 by bitRAKE
I think it would be useful to apply Fermat's Theorem at the outset in order to reduce the size of the exponent. For our purposes 7^12821765729162363903 = 7^1923 mod 1973, according to the theorem. This reduces the complexity significantly and works well with exponents that are larger than the prime modulus.
Posted on 2003-11-05 19:27:01 by Poimander
In general does anyone think it would be a performance inprovement to store the numbers as an array of primes. For example, 72 would be stored as 2,2,2,3,3 or rather (2,3),(3,2) might be better. Could also store negitive exponents rather than performing the divide. 72/13 = (2,3),(3,2),(13,-1). Add and subtract operations would take longer, but I'd think it would simplify many other algorithms.
Posted on 2003-11-09 11:43:34 by bitRAKE
Isn't the (2,3)(3,2) thing called factorisation? (sorry for compactness i'm on a F-keyboard, scrambled everything)
Posted on 2003-11-10 10:02:32 by inFinie
@bitRAKE, that depends on what you want to compute. Using the prime factorization leads as example in computation of big factorial/binomials etc. to very fast algorithms.
Using it for plan modular exponentation are'nt helpfully. Then it is better for arbitrary integers to use a prime remainder representation of each number. Now addition, subtractions are linear and modular reduction works only on the remainder arrays. With some special hardware we get the fastest performances. Take a look for David Bernstein, he descripe this technic. Some other sources are Arnold Sch?nhage with the Modular Fermat Fast Fourier Transformation where the numbers are remainder representations to modulis of the form 2^(r*m) +1. All these technics are based on the Chinese Remainder Theorem.

Best Regards, Hagen
Posted on 2003-11-11 05:39:08 by Hagen
I found this link to a very entertaining and readable
survey paper on fast modular exponentiation algorithms: http://citeseer.ist.psu.edu/gordon97survey.html
Posted on 2003-11-15 16:41:01 by Poimander
This is how long Mathematica takes for my test numbers:
``````In:= Timing[PowerMod[2^9689 - 1, 2^9941 - 1, 2^11213 - 1];]
Out= {7.656 Second, Null}``````
If Hagen or anyone else could post timings of their algortihms to give me a reference. What is the fastest you can compute (2^9689 - 1) ^ (2^9941 - 1) MOD (2^11213 - 1)? Thank you, Poimander, that was exactly what I was looking for - should be some more code coming soon.
Posted on 2003-11-16 18:51:52 by bitRAKE
With my lib 1.2 seconds, but thats a little bit tricky to explain :)
Your numbers are all of special form of 2^k-1, thus we can use special algorithms there are special exorbitant faster ! If I use a generic modular Exponentation then my lib need about 8 seconds. Thus Mathematika as a interpreter MUST be taken care of your special numbers. Otherwise I could'nt explain why Mathematica should run such as fast.

I used for the 1.2 second variant a B^E mod (2^k-1) algorithm, but it should be realy faster to use another algorithm, because B = 2^k-1 and E = 2^k-1.

Thus, try to generate Random Numbers B = 9689 bits, E = 9941 bits and M = 11213 bits and even. Then run a test with Mathematica. If than Mathematica runs faster as 8 seconds, my Library must be realy slow :-)

Best regards, Hagen
Posted on 2003-11-17 04:49:13 by Hagen
Hagen: how fast your library is, compared to LibGMP ?

I used LibNTL to do some lattice reductions, and had to convert LibGMP in Intel assembly format.
If you have a fast library, I'm interested in adding it to my current code...
Posted on 2003-11-17 05:59:14 by MCoder
Hagen, Mathematica is actually faster with random numbers!?
``````In:= R1 = Random[Integer, {2^9689, 2^9690}];
R2 = Random[Integer, {2^9941, 2^9942}];
R3 = 2 * Random[Integer, {2^11212, 2^11213}];
Timing[PowerMod[R1, R2, R3];]

Out= {5.625 Second, Null}``````
My code doesn't work yet. So, anything else is faster at this point, but I'd like to beat the best thing out there. :)
Posted on 2003-11-17 08:46:32 by bitRAKE
@MCoder, my Lib is faster as LibGMP. That depends on some different things.
1. my Lib use equal many lines of assembler optimized code, especialy for SSE2
2. my Lib use assembler optimized Algorithms for as example Montgomery, where GMP don't use such Algorithms.
3. especialy with very large arbitrary Integers I have used better Algorithms as in GMP. As Example, i and GMP use a Sch?nhage Fermat Fast Fourier Transformation combined with a Fast Karatsuba Division of Burnikel & Ziegler. But GMP use a one to one translation of Burnikel & Ziegler Papers. Thats strange because they Algorithm use only half the Time they described advantages. (you can call it as a bug :-) My own Version of this fast Division use always they fast trick and should be thus about 2 times faster.
4. try Miracl instead of GMP. Miracl is faster as GMP especial with modular computation.
5. GMP is a usefull base to get very low level optimized source. If we want a fast math library we can use GMP as base to implement the important and complex algorithm. But exactly this step is the hardest one, harder as to program a simple and fast addition,multiplication into assembler. As example a clever modular multiplication on Elliptic Curve use some mathematical tricks. Understanding these math tricks and get it running are the harder work.

But all these matters are'nt the big differences. My Lib is in Pascal and use a quit better Interface for usage. Means I implemented Garbarge Collections and Copy on Write demand Feature. And there exists no need for the user to take care of allocation, deallocations or complex and fast algorithms. My goal was it to get a ready to use math library to concentrate your self on the mathematical problem.

Follow a small piece of code to create and test RSA keygeneration, encryption and decryption.
The variables with type IInteger are the arbitrary numbers. There is no manual deallocations needed and all numbers are allocated on need. If we assign such number to another, we don't copy the full context. Instead it's reference based. Operations that modify such a multireferenced Number made a "Copy on Write demand", means detects modifications and copy the actual context to a new single used context. Especialy THIS operation get us a easy to controled and fully transparant base to optimze any operation to use "move memory by the way" in these operation. Instead to copy memory on any assignment we copy the memory context inclusive doing the specificaly operation, such as subtraction,addition and so on.

The ONLY feature are missed in PASCAL are Operator Overloading such as in C++. By the way I mean that "Operator Overloading" as it is, is only usefull with such a math library :-)

``````
procedure TestRSA(KeySize: Integer);
var
P,Q,E,D,N,Dp,Dq,U: IInteger; // RSA key param
M,C,X,Y: IInteger;  // RSA en/decryption
S: Integer;
begin
repeat
S := KeySize shr 1;
NRnd(P, S);
NBit(P, S -2, True);
NMakePrime(P, [1, 2]);
S := KeySize - NSize(P);
repeat
NRnd(Q, S);
NBit(Q, S -2, True);
NMakePrime(Q, [1, 2]);
until NCmp(P, Q) <> 0;
if NCmp(P, Q) < 0 then NSwp(P, Q);
NMul(N, P, Q);
until NSize(N) = KeySize;
NDec(P);
NDec(Q);
NLCM(U, P, Q);       // U = LCM(P -1, Q -1)
repeat
repeat
NRnd(E, NLog2(NSize(N)) * 4);
NOdd(E, True);
until NGCD1(E, P) and NGCD1(E, Q);
until NInvMod(D, E, U) and (NSize(D) >= NSize(E)) and NGCD1(D, N);
NMod(Dp, D, P);    // Dp = D mod (P -1)
NMod(Dq, D, Q);    // Dq = Q mod (Q -1)
NInc(P);
NInc(Q);
NInvMod(U, P, Q);  // U = P^-1 mod Q

// Encryption
NSet(M, 'RSA Public Key scheme with DEC', 256);
NPowMod(C, M, E, N);

// Decryption with CRT
NPowMod(X, C, Dp, P);
NPowMod(Y, C, Dq, Q);
NSub(Y, X);
NMulMod(Y, U, Q);
NMul(Y, P);
end;
``````

@BitRAKE: 5 seconds thats realy strange for me !? on a P4 50 Ghz Machine ? My timings are on P4 1.5GHz. Wich Version of Mathematica are used ?

Best Regards, Hagen
Posted on 2003-11-18 03:19:55 by Hagen

@BitRAKE: 5 seconds thats realy strange for me !? on a P4 50 Ghz Machine ? My timings are on P4 1.5GHz. Wich Version of Mathematica are used ?
It is version 5, and I'm testing on a 2Ghz XP Barton.

I am working on a big integer lib - so far all my algorithms are faster than GMP, but they are specially optimized for XP processors (I'll get an Opteron in a couple months valy and start working on optimizing for it). AMD processors have a faster multiply and faster MMX/FPU than P4. I haven't done anything with SSE2, yet.

For example, I was able to reduce the single limb multiply down to 3 cycles per limb for cached data - fastest possible on XP processors. AMD says MUL has a latency of 6 cycles. Not only is the processor doing two MUL's at once, but all other instructions are executed in the latency shadow. How is this possible when one multiple requires the result of the other!? This is a 10% improvement over GMP (and it is smaller in code size :)).
Posted on 2003-11-18 05:39:37 by bitRAKE
Wow !

Your both libs should improve factoring projects like ECMNet significantly (http://www.loria.fr/~zimmerma/records/ecmnet.html).

In my project (named SumBKZ), I needed: mpn_sub, mpn_add, mpn_mul, mpn_lshift, mpn_rshift and a multiprecision multiplication.

If your libs are freely available, I'll be glad to include them in this project !
Posted on 2003-11-18 07:07:07 by MCoder
Hm, Mathematica should than need ~7 seconds on a 1.5GHz P4, thats always better as my code. I can't believe this. Could you try smaller numbers please, as example in range 4096 bit ?? My lib need 600 ms for this.

AMD processors have a faster multiply and faster MMX/FPU than P4. I haven't done anything with SSE2, yet.

You can assume that SSE2 Code is minimal 2 times faster as plain i486 code.

3 cyles per digit/limb is very good. My lib need 6 cycles. Can I see your optimizations ?? You use native i386 code ??

@MCoder: sorry my lib is'nt freely available :(
Improving ECMNet depends not on best optimized assemblercode. Any small better algorithm thats used is far faster as very good assembler optimization of the currently used algorithms. Thus optimizing the used mathematical algorithms or introducing of distibuted computing would improve far more. Thats my 3 cent meaning, of course :)

Best regards, Hagen
Posted on 2003-11-19 08:54:08 by Hagen

3 cyles per digit/limb is very good. My lib need 6 cycles. Can I see your optimizations ?? You use native i386 code ??
I will be forwarding my code for inclusion in GMP, but can post the algorithm here as well. I have not yet worked to find a useful algorithm in MMX for multi-percision multiply.
``````	OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

ALIGN _CODE_ALIGNMENT_

; loop range from 2 thru 62
;32: 3.0700 cycles per limb (DWORD)
mpn_mul_1__k7_04_UNROLL = 32

mpn_mul_1__k7_04 PROC USES ebx esi edi, bInt:PTR BIGINT, limb:DWORD, carry:DWORD
pBIGINT	EQU <[esp+4*(4+1)]>
_limb	EQU <[esp+4*(4+2)]>
_carry	EQU <[esp+4*(4+3)]>

push	ebp
push	ebx
push	esi
push	edi

mov	esi, pBIGINT
IF 32 EQ mpn_mul_1__k7_04_UNROLL
mov	ecx, mpn_mul_1__k7_04_UNROLL
mov	edi, [esi].BIGINT.limbs - SIZEOF BIGINT
mov	edx, edi
shr	edi, 5		; divide by 32
and	edx, 1Fh	; remainder
ELSE
mov	ecx, mpn_mul_1__k7_04_UNROLL
mov	eax, [esi].BIGINT.limbs - SIZEOF BIGINT
cdq
div	ecx
mov	edi, eax
ENDIF

imul	ecx, edx, -14		; mpn_mul_1__k7_04_LIMB_CODE

mov	eax, [esi]		; [-4*mpn_mul_1__k7_03_UNROLL]
mov	ebp, _limb		; multiplier
mov	ebx, _carry		; carry

; adjust ESI to point at correct block of data to allow greater
; unroll range while still having only a bytes offset:
;	each block can be a maximum of 128 bytes
;
; ESI points at BIGINT, but needs to point at BIGINT[80h - (BIGINT.limbs MOD UNROLL) + UNROLL]
;
; ESI = ESI + 4*(UNROLL - (BIGINT.limbs MOD UNROLL))
sub	edx, 31
shl	edx, 2

; jump to complete partial roll first
lea	edx, [ecx][_2]
xor	ecx, ecx
jmp	edx

ALIGN 16

_0:
i=80h - 4*mpn_mul_1__k7_04_UNROLL
WHILE i LT 80h
mul	ebp
IF i EQ 0
BYTE 8Bh, 46h, 00h
mov	[esi][i][-4], ebx
ELSEIF i EQ 4
mov	eax, [esi][i]
BYTE 89h, 5Eh, 00h
ELSEIF i GT -80h
mov	eax, [esi][i]
mov	[esi][i][-4], ebx
ELSE ; only byte offsets
.err mpn_mul_1__k7_04_UNROLL too big!
ENDIF
mov	ebx, ecx
i=i+4
ENDM

_2:	dec	edi
lea	esi, [esi + 4*mpn_mul_1__k7_04_UNROLL]
jns	_0

mov	eax, ebx ; return carry

pop	edi
pop	esi
pop	ebx
pop	ebp
retn	4*3
mpn_mul_1__k7_04 ENDP``````
I should try a speed test with 16 bytes of code per limb and see if it executes at the same speed, so I can simplify the setup code.

...on a side note, I think I might have found a faster MMX string length! But it's only better for a larger than average string length (256+ maybe, and in the cache). Long string that we don't know the length of are rarely in the cache. :)

``````R1 = 2^4095 - 5;
R2 = 2^4095 - 3;
R3 = 2^4095 - 1;
Timing[PowerMod[R1, R2, R3];]``````
{0.36 Second, Null}
Posted on 2003-11-22 01:42:06 by bitRAKE
Hi bitRake

R1 = 2^4095 - 5;
R2 = 2^4095 - 3;
R3 = 2^4095 - 1;
Timing;

Thats not what I wanted, all numbers are of special form. Now, Mathematica is an interpreter and possible know of this special form, and thus can use a highly specialized version of PowerMod.
I need a comparsion with real random choosen numbers in range of 4096 bits.

Regards, Hagen
Posted on 2003-12-11 10:55:25 by Hagen
Sorry for my haste.

0.437 Second on average, for random numbers of 4096 bits.
Posted on 2003-12-16 20:28:09 by bitRAKE