```
; multiply N-bit number by M-bit number and add to larger number
```

;

; ESI = N-bit number

; EDI = M-bit number

; EBP = (>N*M)-bit number

;

; Result = [EBP] + [ESI] * [EDI]

mov ecx, N - 1

_3: bt [esi], ecx

jnc _7

mov edx, M - 1

_4: bt [edi], edx

jnc _6

lea eax, [edx][ecx]

_5: btc [ebp], eax

lea eax, [eax+1]

jc _5

_6: dec edx

jns _4

_7: dec ecx

jns _3

It isn't fast, but it is understandable and doesn't use temporary storage.
ESI = (2^n0 + 2^n1 +...+ 2^nN)

EDI = (2^m0 + 2^m1 +...+ 2^mM)

EBP = (2^s0 + 2^s1 +...+ 2^sS)

The bit size of S > N*M.

**Result**:

EBP <= EBP + (2^(n0+m0) + 2^(n0+m1) +...+ 2^(nN*mM))

You should check out Karatsuba Multiplication. It is used for multiplying large numbers together.

http://mathworld.wolfram.com/KaratsubaMultiplication.html

The guy on the following webpage did a comparison of grade school multiplication and Karatsuba. He has source code

http://www-2.cs.cmu.edu/~cburch/251/karat/

http://mathworld.wolfram.com/KaratsubaMultiplication.html

The guy on the following webpage did a comparison of grade school multiplication and Karatsuba. He has source code

http://www-2.cs.cmu.edu/~cburch/251/karat/

I only posted the algorithm because I thought it was "cute". The bit string instructions are really under utilized on x86, and there is so much one can do with a bit field, or array. For special types of numbers the above algorithm can be adapted to perform faster than Karatsuba. I'm experimenting with different number representations to discover the advantages / disadvantages. Also, I have this affinity towards base two math.

I only posted the algorithm because I thought it was "cute". The bit string instructions are really under utilized on x86, and there is so much one can do with a bit field, or array. For special types of numbers the above algorithm can be adapted to perform faster than Karatsuba. I'm experimenting with different number representations to discover the advantages / disadvantages. Also, I have this affinity towards base two math.

"for special numbers"? which numbers are those that it is faster than Karatsuba?

"for special numbers"? which numbers are those that it is faster than Karatsuba?

Mark, could you check the e-mail address you registered on the board with? I got a very angry letter demanding "removal from this list"

the e-mails are currently sent tohttp://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=f76248e5df03ece6ae5ee11136c91f26 which apparantly is an address hold by a potty-mouth called "Melissa"

I've removed your "thread follow-up settings" from the DB right now, sorry.

Please verify this.

Thx.

the e-mails are currently sent tohttp://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=f76248e5df03ece6ae5ee11136c91f26 which apparantly is an address hold by a potty-mouth called "Melissa"

I've removed your "thread follow-up settings" from the DB right now, sorry.

Please verify this.

Thx.

Just to mention that some years ago, I used 2 other multiplication algorithms on 6502 processors.

The first one uses a left-to-right approach, and the other one a right-to-left. If you are interested, I can post the sources here (around 10 lines for both routines).

The first one uses a left-to-right approach, and the other one a right-to-left. If you are interested, I can post the sources here (around 10 lines for both routines).

Mark, could you check the e-mail address you registered on the board with? I got a very angry letter demanding "removal from this list"

the e-mails are currently sent tohttp://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=f76248e5df03ece6ae5ee11136c91f26 which apparantly is an address hold by a potty-mouth called "Melissa"

I've removed your "thread follow-up settings" from the DB right now, sorry.

Please verify this.

Thx.

I switched ISPs. I doubt they gave out my old email address to someone. More than likely she works for the cable company. I updated my profile with my new email address.

**MCoder**, if would be interesting to hear some of your experience with multiplication algorithms, or to see the code.

Bitrake, did you try Karatsuba at all? Just curious. I am working on a Karatsuba multiplication. I have mine running about 10 times faster than the code that guy posted on his web site ( the link I sent above) with 1024 digits. I am still tweaking and making changes.

**Mark**, I haven't tried to code Karatsuba in x86, but I've used other implementations. It would be nice if you could post the cycle counts you are achieving. Are you using SSE2?

**Mark**, I haven't tried to code Karatsuba in x86, but I've used other implementations. It would be nice if you could post the cycle counts you are achieving. Are you using SSE2?

I am not even that far along yet to changing to MMX or SSE or SSE2. I usually do that kind of stuff at the end. I basically used his code as a framework, and added my own to it. He prints times in milliseconds. I am cutting and pasting the commandline and what the EXE printed out. 1024.txt is a 1024 base 10 number that is passed to karat.exe. The first 2 timings are both from his code. The first timing is his karatsuba. The second timing is his grade school karatsuba. The 3rd timing is my new karatsuba code. He has his code set up to time for 1 second. I changed it to 8 seconds, to get more accurate timings. The system I am running it on is a 1.2 GHz P3. In his timings he divides by the total number of trials. The total number of trials varies because he times for 8 seconds. I am not even close to being done with tweaking, so I expect to speed it up by quite a bit more.

C:SYSTEM~3KARATS~1>karat 0<1024.txt

2.091479 ms (3826 trials)

3.812292 ms (2099 trials)

0.076667 ms (104360 trials)

0.076667 * 1000 = 76.667 microseconds * 1200 ( 1.2 GHz processor ) = 92,000.4 cycles.

EDIT: Keep in mind that I am dealing with bytes not bits. In your code above you are dealing with bits. So you'd have to do 1024*8 to get an accurate comparison. Also I am having a bit of a brain fart. It's a bit over 27 times faster than his original C code and not around 10 times faster as I stated in the previous post.

92,000 cycles to perform 104360 multiplications of 1024 base10 digit numbers? That doesn't seem correct - I'm missing something.

92,000 cycles to perform 104360 multiplications of 1024 base10 digit numbers? That doesn't seem correct - I'm missing something.

It's not 104,360 multiplications. It is considerably less. I converted the entered base 10 values to base 1 billion. That's how I got the big speed up ;) You do 3 multiplies per 4 digits of 1 billion based digit numbers. I added code to have it print out the total number of digits after converting to base 1 billion, and it's 102 ;)

EDIT: I am still making changes to his C code to speed it up. I have not written any code in assembler yet. Generally I do 3 levels of optimizations when optimizing C code.

1) Algorithmic ( what I am doing now)

2) Generic C optimzations

3) Assembler optimizations

EDIT: Keep in mind that I am dealing with bytes not bits. In your code above you are dealing with bits. So you'd have to do 1024*8 to get an accurate comparison.

N*M=P

{N} = (a,b,c,...,N)

{M} = (a',b',c',...,M)

{P} = ((a+a'), (a+b'), (a+c'), ..., (N+M))

1. copy array {N} into {P}, lengthof {M} times.

2. add array {M} into {P}, lengthof {N} times.

3. radix sort {P} and combine/promote terms.

Again this improves speedwise as the

*number of bits set*decrease and

__isn't based on the size of N, M, or P__. Note that (1) and (2) can be combined and coded to run at the speed of memory bandwidth; (3) is an O(nm) operation but should be faster in practice. This type of algorithm really has different uses from Karatsuba.

That is somewhat true (you use base 10^9, I use base 2), but if the base2 exponents were stored in an array - for all operands, including destination; then the algorithm can be changed significantly!

N*M=P

{N} = (a,b,c,...,N)

{M} = (a',b',c',...,M)

{P} = ((a+a'), (a+b'), (a+c'), ..., (N+M))

1. copy array {N} into {P}, lengthof {M} times.

2. add array {M} into {P}, lengthof {N} times.

3. radix sort {P} and combine/promote terms.

Again this improves speedwise as the

*number of bits*decrease and isn't based on the size of N, M, or P. Note that (1) and (2) can be combined and coded to run at the speed of memory bandwidth; (3) is an O(nm) operation but should be faster in practice. This type of algorithm really has different uses from Karatsuba.

I am having problems getting your idea to work with some sample data. I picked 1212 in decimal and 34345 in decimal. Can you show me what you are talking about using those examples?

N = 1212 = 10010111100 (base 2)

M = 34345 = 1000011000101001 (base 2)

{N} = 2,3,4,5,7,10

{M} = 0,3,5,9,10,15

{P} = 2,3,4,9,11,13,16,17,19,20,21,22,25

P = 10011110110010101000011100 (base 2)

Note: the arrays are the base two exponents - similar to the algorithm at the top.

M = 34345 = 1000011000101001 (base 2)

{N} = 2,3,4,5,7,10

{M} = 0,3,5,9,10,15

{P} = 2,3,4,9,11,13,16,17,19,20,21,22,25

P = 10011110110010101000011100 (base 2)

Note: the arrays are the base two exponents - similar to the algorithm at the top.

N = 1212 = 10010111100 (base 2)

M = 34345 = 1000011000101001 (base 2)

{N} = 2,3,4,5,7,10

{M} = 0,3,5,9,10,15

{P} = 2,3,4,9,11,13,16,17,19,20,21,22,25

P = 10011110110010101000011100 (base 2)

Note: the arrays are the base two exponents - similar to the algorithm at the top.

Ah now I gotcha!!! A picture is worth a thousand words ;) Just a couple off the top of my head alternatives to the sort in step 3. I am not sure how good/bad they are in comparison to doing a full sort. I haven't thought about it that much. Just tossing them out on the table.

1) Each row is guaranteed to be sorted. Do a binary search on each row for the current power of 2 you are collating.

2) Have pointers into each row that start at the beginning of the row. Check to see if the first value in the first row is less than your current value. If so increment to the next one. repeat until you get a value that is equal or greater. If equal then you have a match. if greater than stop the pointer, and leave it pointing there. Go to the next row and repeat. This will work because the rows are sorted.

you can turn both of the above options to work vertically instead of horizontally based on the size of M versus N.

In the grid of (3) the number below or to the right is always greater. I was hoping there was some way to take advantage of this to reduce the number of comparisons needed. We know we are starting at the upper left and ending at the lower right, but how to determine the minimum path? There also should be some way to quickly exclude whole rows for sequential exponents as 7=8-1; 15=16-1; etc.

In the grid of (3) the number below or to the right is always greater. I was hoping there was some way to take advantage of this to reduce the number of comparisons needed. We know we are starting at the upper left and ending at the lower right, but how to determine the minimum path? There also should be some way to quickly exclude whole rows for sequential exponents as 7=8-1; 15=16-1; etc.

Yep that's what i was doing in those 2 suggestions was taking advantage that the one to the right is always greater, which means it's sorted on each row.

I think the second option is faster. The second option would more look like this. C pseudo-code

```
```

unsigned char *row_ptrs[6];

unsigned char *cur_row;

; ... in here make each array element of row_ptrs point to the start of a row in that 2x2 grid you showed me in x1.bmp.

int power_we_are_looking_for = 3;

for ( t = 0; t < MAX_ROW; t++)

{

while (1)

{

cur_ptr = (unsigned char *)(&row_ptrs[t]);

if (*cur_ptr < power_we_are_looking_for)

{

cur_ptr++;

continue;

}

if (*cur_ptr == power_we_are_looking_for)

{

collate(cur_ptr); //perform actions on the power we just found in the 2x2 array

break;

}

if (*cur_ptr > power_we_are_looking_for)

{

//went too far, so stop at the current row for the next power. Since we search the powers in order, this means we start

// FURTHER down the row next time!!!

break;

}

}

(unsigned char *)(&row_ptrs[t]) = cur_ptr; //update the rows each loop to reflect where we've already checked.

// Since the powers are in a row, subsequent calls to this routine will start checking in the middle of the row instead of at the start.

}