I've been working on implementing the Diffie-Hellman algorithm, but I'm really stumped on performing the multiprecision modular exponentiation... I'm not even certain how to store/retrieve the numbers. I've stumbled upon the Montgomery method, but I'm not even certain how to perform the Montgomery reduction, let alone implement it... I am reading up on it, but I was wondering if anyone had experience with large numbers in asm. I would appreciate example code, advice, ideas...

Hmm, as a first step, get yourself the "Handbook of Applied Cryptography", ISBN 0-8493-8523-7.

In this book you'll find everything about Montgomery reduction, binary exponentiation, large integers, etc.

The nice thing about this book is that it gives you the mathematical basics along with very clear algorithmic descriptions that map closely to actual code.

For your modular exponentiation you have to consider:

- how to store large integers, e.g. on 32-bit Intel CPUs use linear arrays of 32-bit words, from least significant to most significant.

- how to manage the signs, e.g. complement representation, or absolute value with a separate sign flag (I prefer the latter)

- you need implementations of multiplication (e.g. schoolbook, Comba, Karatsuba, FFT)

- you need addition/subtraction

- you need generic division with remainder, which you even need for Montgomery's preprocessing

- you need modular inversion, best based on an extended GCD algorithm (the Lehmer/Sorenson variant is very fast)

- you need binary exponentiation based on normal division with remainder, squaring, and multiplication; the best variant is a sliding-window method

- now you can implement Montgomery's modular reduction

- after this you can extend the normal exponentiation above to use Montgomery
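The binary exponentiation step in the list above can be sketched in C for single-word operands (my own sketch, not code from HAC; a multiprecision version replaces the 64-bit arithmetic with the digit-array operations discussed here):

```c
#include <stdint.h>

/* Right-to-left binary exponentiation: computes base^exp mod n.
   Single-word sketch; a bignum version replaces the 64-bit
   multiplications with multiprecision multiply + divide-with-remainder. */
static uint32_t modexp(uint32_t base, uint32_t exp, uint32_t n)
{
    uint64_t result = 1;
    uint64_t b = base % n;
    while (exp != 0) {
        if (exp & 1)                 /* current exponent bit set: multiply */
            result = (result * b) % n;
        b = (b * b) % n;             /* square for the next bit */
        exp >>= 1;
    }
    return (uint32_t)result;
}
```

A sliding-window method generalizes this by consuming several exponent bits per multiplication, using a small table of precomputed odd powers of the base.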

Please note that all the Montgomery machinery needs division/inversion/multiplication/addition in the normal domain. Only after implementing these functions can you implement Montgomery, since its pre- and post-processing relies on these normal operations.
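To make the reduction step concrete, here is a one-word sketch of Montgomery reduction in C with R = 2^32 and a single-word odd modulus (names and code are my own, not from HAC; a real implementation applies the same idea digit by digit over the multiprecision arrays):

```c
#include <stdint.h>

/* n' = -n^{-1} mod 2^32, computed by Newton iteration (n must be odd).
   Each iteration doubles the number of correct low bits. */
static uint32_t mont_nprime(uint32_t n)
{
    uint32_t inv = n;               /* correct to 3 bits for odd n */
    for (int i = 0; i < 5; i++)
        inv *= 2 - n * inv;         /* Newton step mod 2^32 */
    return (uint32_t)(0u - inv);    /* negate: n * n' == -1 mod 2^32 */
}

/* REDC(T) = T * R^(-1) mod n, valid for T < n * R.
   m is chosen so that T + m*n is divisible by R = 2^32,
   so the "division" by R is just a word shift. */
static uint32_t mont_redc(uint64_t T, uint32_t n, uint32_t nprime)
{
    uint32_t m = (uint32_t)T * nprime;
    uint64_t t = (T + (uint64_t)m * n) >> 32;
    return (uint32_t)(t >= n ? t - n : t);
}
```

Note that REDC(a*R) = a mod n; this is why operands are first converted into the Montgomery domain (multiplied by R mod n) using an ordinary division with remainder, exactly the preprocessing mentioned above.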

Hagen

Excellent post, Hagen. I was looking into the Handbook of Applied Cryptography... I found almost everything I need in it, except for the storage of large integers and the basic operations on them. I only glanced at chapter 14 though, so I may have missed a part about large integers. Is that in the book?

Yes, if I remember correctly there is some abstract material on it in the book. For a deeper explanation you should take a look at D. Knuth's "The Art of Computer Programming", especially Chapter 4. Many, many other people consider Knuth's book THE book for such things; my personal opinion is quite different. I don't like hypothetical machines with proprietary machine code like Knuth's. HAC was many times more helpful for me (ok, that's just my personal three cents).

Now about your problem:

You store your numbers linearly in a dynamic array of 32-bit words.

For example, LargeInteger = array of Cardinal. Then you need a Boolean to hold the sign of this LargeInteger, and the count of used elements in the array. My native programming language is Pascal, so I'll show it in Pascal:

type
  LargeInteger = record
    Sign: Boolean;
    Count: Integer;
    Size: Integer;
    Digits: array of Cardinal; // Cardinal = unsigned 32-bit integer
  end;

Sign holds the sign of the number in Digits.

Count holds the number of digits in use in the Digits member (each digit is one base-2^32 word).

Size holds the allocated capacity of Digits, so you can grow Digits in chunks and avoid frequent reallocations: only when Count + X > Size do you have to reallocate Digits.

Digits itself stores the absolute (non-complemented) magnitude in little-endian digit order :)

1 -> Sign = False, Count = 1, Digits[0] = 1

-1 -> Sign = True, Count = 1, Digits[0] = 1

-(2^64 - 1) -> Sign = True, Count = 2, Digits[0] = $FFFFFFFF, Digits[1] = $FFFFFFFF

2^32 -> Sign = False, Count = 2, Digits[0] = 0, Digits[1] = 1.

If an operation expands the result, we grow Digits and Count accordingly. The highest bits are stored in the highest digits.
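In C, Hagen's record and the sign-magnitude encoding above might look like this (a sketch under my own names; the conversion helper is mine, purely for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* C counterpart of the Pascal record: sign-magnitude representation,
   base-2^32 digits in little-endian digit order. */
typedef struct {
    bool      sign;    /* true = negative                */
    int       count;   /* number of digits in use        */
    int       size;    /* allocated capacity of digits   */
    uint32_t *digits;  /* digits[0] is least significant */
} LargeInteger;

/* Initialize from a signed 64-bit value (illustration only). */
static void li_from_i64(LargeInteger *x, int64_t v)
{
    /* -(v + 1) + 1 avoids overflow for the most negative value */
    uint64_t mag = v < 0 ? (uint64_t)(-(v + 1)) + 1 : (uint64_t)v;
    x->sign   = v < 0;
    x->size   = 2;
    x->digits = malloc(x->size * sizeof(uint32_t));
    x->digits[0] = (uint32_t)mag;
    x->digits[1] = (uint32_t)(mag >> 32);
    x->count = x->digits[1] ? 2 : 1;  /* trim the leading zero digit */
}
```

For example, -1 is stored as Sign = true, Count = 1, Digits[0] = 1, exactly as in the table above.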

For many mathematical operations we need to evaluate the signs of the operands quickly in order to choose the right internal operation. For example, an addition of two values has to consider both signs to decide whether to add or subtract the internal digits. That's why I chose a Boolean as the sign representation. Some other libraries use an integer sign, where -1 = negative, +1 = positive, and 0 = zero, but then these operations need a multiplication to evaluate the combined state of the two operands.

This representation is ideal for Intel, i.e. little-endian notation. In my own library I use garbage collection, reference counting, and copy-on-write for these structures. That way all operations are fully transparent for the user, and the underlying memory management stays hidden. This is important for a useful math library, because designing or translating a highly complex mathematical algorithm into software is hard enough without additional worries about memory management, assignments, lifetimes, etc.

Any low-level operation such as addition, subtraction, shifts, and the bit operations and/xor/or is absolutely easy in assembler. All of these can be done by chaining simple opcodes: the Intel opcodes ADD/SUB/NEG/MUL/DIV can all be chained (with ADC and SBB propagating the carry). These operations are also fairly easy to move to MMX or even SSE2 for a 2-8x speedup.
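The carry chaining of digit additions can be sketched in C (my own sketch; in assembler the explicit carry variable disappears, because ADC consumes the carry flag directly):

```c
#include <stdint.h>

/* Adds the n-digit magnitudes a and b into r (r may alias a).
   Returns the final carry, i.e. the (n+1)-th digit of the sum.
   This loop is the C equivalent of one ADD followed by n-1 ADCs. */
static uint32_t mp_add(uint32_t *r, const uint32_t *a,
                       const uint32_t *b, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i]  = (uint32_t)s;   /* low 32 bits become the digit  */
        carry = s >> 32;       /* carry propagates to next digit */
    }
    return (uint32_t)carry;
}
```

A nonzero return value is exactly the case where the result needs one more digit than the operands, i.e. where Digits has to grow.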

One disadvantage of base 2^32 compared to other number bases shows up with FFTs (fast Fourier transforms) and modular/residue representations of numbers. Those higher-level algorithms need translations to floating point and back. With base 2^32 this is troublesome, whereas a base such as 10^9 is easier to handle: a base-10^9 digit stores only the values 0 up to 10^9 - 1 in one 32-bit word, instead of 0 up to 2^32 - 1.
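Converting between the two bases boils down to repeated multiprecision division by 10^9; a minimal C sketch (my own names, illustration only):

```c
#include <stdint.h>

/* Converts an n-digit base-2^32 number (little-endian digits, modified
   in place) into base-10^9 digits, least significant first. Returns
   the number of base-10^9 digits produced. */
static int to_base1e9(uint32_t *x, int n, uint32_t *out)
{
    int m = 0;
    while (n > 0) {
        uint64_t rem = 0;
        for (int i = n - 1; i >= 0; i--) {   /* divide x by 10^9 */
            uint64_t cur = (rem << 32) | x[i];
            x[i] = (uint32_t)(cur / 1000000000u);
            rem  = cur % 1000000000u;
        }
        out[m++] = (uint32_t)rem;            /* next base-10^9 digit */
        while (n > 0 && x[n - 1] == 0)       /* drop leading zeros   */
            n--;
    }
    if (m == 0) { out[0] = 0; m = 1; }       /* the number zero      */
    return m;
}
```

For example, 2^32 (Digits = [0, 1]) becomes the base-10^9 digits [294967296, 4], i.e. 4294967296.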

All these considerations are covered in HAC, but nothing more, because the details depend too much on the requirements and the target machine. Knuth goes a little deeper here, but not far enough.

The complement representation is only of limited help. If you only ever have fixed-size numbers, say 256 bits, it can be slightly faster, but then it is too specialized and can't be reused for other math problems.

Regards, Hagen

If you are familiar with C/C++, then you have a really wide range of available libraries.

Take a look at GMP, Miracl, Bignum, LiDIA, HugeInt, ApFloat.

There are also some Java libraries, but they are far from performant.

GMP and Miracl are the fastest ones I have tested (except for my own assembler/Pascal library :))

All of the C/C++ libraries are too "low-level" and inconvenient for my own purposes. For GMP, for example, you should write some container classes to handle allocations, reference counting, garbage collection and so on. C++ is very helpful here because of its useful operator overloading (I never really needed that feature otherwise :))

For trials and prototyping of mathematical algorithms, I suggest Maple.

Ah, and one tip from me. We are in an ASM group here, but don't invest too much time in assembler as a first step. Use a high-level language first and try to optimize at the algorithmic level first. With handmade assembler you get on average at most a 20% performance boost; with cleverly designed high-level algorithms you can gain far more than 1000%.

I have been writing my own library for two years now, as a hobby :) About 85% of it is handwritten assembler.

Hagen

Hagen: You've been of great help to me. I went over what you said, but I'm a little tired at the moment. I'll certainly take it into consideration. In the meantime I think Donald and HAC have something to say to me ;)