; multiply N-bit number by M-bit number and add to larger number

;
; ESI = N-bit number
; EDI = M-bit number
; EBP = (>N*M)-bit number
;
; Result = [EBP] + [ESI] * [EDI]

mov ecx, N - 1
_3: bt [esi], ecx
jnc _7
mov edx, M - 1
_4: bt [edi], edx
jnc _6
lea eax, [edx][ecx]
_5: btc [ebp], eax
lea eax, [eax+1]
jc _5
_6: dec edx
jns _4
_7: dec ecx
jns _3
It isn't fast, but it is understandable and doesn't use temporary storage.

ESI = (2^n0 + 2^n1 +...+ 2^nN)
EDI = (2^m0 + 2^m1 +...+ 2^mM)
EBP = (2^s0 + 2^s1 +...+ 2^sS)

The bit size of S > N*M.

Result:

EBP <= EBP + (2^(n0+m0) + 2^(n0+m1) +...+ 2^(nN*mM))
Posted on 2004-08-09 14:13:39 by bitRAKE
You should check out Karatsuba Multiplication. It is used for multiplying large numbers together.

http://mathworld.wolfram.com/KaratsubaMultiplication.html


The guy on the following webpage did a comparison of grade school multiplication and Karatsuba. He has source code

http://www-2.cs.cmu.edu/~cburch/251/karat/
Posted on 2004-08-10 16:11:15 by mark_larson
I only posted the algorithm because I thought it was "cute". The bit string instructions are really under utilized on x86, and there is so much one can do with a bit field, or array. For special types of numbers the above algorithm can be adapted to perform faster than Karatsuba. I'm experimenting with different number representations to discover the advantages / disadvantages. Also, I have this affinity towards base two math.
Posted on 2004-08-10 20:00:07 by bitRAKE

I only posted the algorithm because I thought it was "cute". The bit string instructions are really under utilized on x86, and there is so much one can do with a bit field, or array. For special types of numbers the above algorithm can be adapted to perform faster than Karatsuba. I'm experimenting with different number representations to discover the advantages / disadvantages. Also, I have this affinity towards base two math.


"for special numbers"? which numbers are those that it is faster than Karatsuba?
Posted on 2004-08-11 10:57:51 by mark_larson

"for special numbers"? which numbers are those that it is faster than Karatsuba?
Sparse bit denisity - if it takes less space to store the bit indexes than the complete binary representation. As usual, a better model for the problem will out perform algorithmically reduced general solutions.
Posted on 2004-08-11 13:18:00 by bitRAKE
Mark, could you check the e-mail address you registered on the board with? I got a very angry letter demanding "removal from this list"

the e-mails are currently sent tohttp://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=f76248e5df03ece6ae5ee11136c91f26 which apparantly is an address hold by a potty-mouth called "Melissa"

I've removed your "thread follow-up settings" from the DB right now, sorry.

Please verify this.

Thx.
Posted on 2004-08-11 22:39:30 by Hiroshimator
Just to mention that some years ago, I used 2 other multiplication algorithms on 6502 processors.
The first one uses a left-to-right approach, and the other one a right-to-left. If you are interested, I can post the sources here (around 10 lines for both routines).
Posted on 2004-08-12 06:18:16 by MCoder

Mark, could you check the e-mail address you registered on the board with? I got a very angry letter demanding "removal from this list"

the e-mails are currently sent tohttp://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=f76248e5df03ece6ae5ee11136c91f26 which apparantly is an address hold by a potty-mouth called "Melissa"

I've removed your "thread follow-up settings" from the DB right now, sorry.

Please verify this.

Thx.


I switched ISPs. I doubt they gave out my old email address to someone. More than likely she works for the cable company. I updated my profile with my new email address.
Posted on 2004-08-12 10:41:20 by mark_larson
MCoder, if would be interesting to hear some of your experience with multiplication algorithms, or to see the code.
Posted on 2004-08-12 10:45:58 by bitRAKE
Bitrake, did you try Karatsuba at all? Just curious. I am working on a Karatsuba multiplication. I have mine running about 10 times faster than the code that guy posted on his web site ( the link I sent above) with 1024 digits. I am still tweaking and making changes.
Posted on 2004-08-13 14:12:12 by mark_larson
Mark, I haven't tried to code Karatsuba in x86, but I've used other implementations. It would be nice if you could post the cycle counts you are achieving. Are you using SSE2?
Posted on 2004-08-13 15:22:51 by bitRAKE

Mark, I haven't tried to code Karatsuba in x86, but I've used other implementations. It would be nice if you could post the cycle counts you are achieving. Are you using SSE2?


I am not even that far along yet to changing to MMX or SSE or SSE2. I usually do that kind of stuff at the end. I basically used his code as a framework, and added my own to it. He prints times in milliseconds. I am cutting and pasting the commandline and what the EXE printed out. 1024.txt is a 1024 base 10 number that is passed to karat.exe. The first 2 timings are both from his code. The first timing is his karatsuba. The second timing is his grade school karatsuba. The 3rd timing is my new karatsuba code. He has his code set up to time for 1 second. I changed it to 8 seconds, to get more accurate timings. The system I am running it on is a 1.2 GHz P3. In his timings he divides by the total number of trials. The total number of trials varies because he times for 8 seconds. I am not even close to being done with tweaking, so I expect to speed it up by quite a bit more.


C:SYSTEM~3KARATS~1>karat 0<1024.txt
2.091479 ms (3826 trials)
3.812292 ms (2099 trials)
0.076667 ms (104360 trials)

0.076667 * 1000 = 76.667 microseconds * 1200 ( 1.2 GHz processor ) = 92,000.4 cycles.


EDIT: Keep in mind that I am dealing with bytes not bits. In your code above you are dealing with bits. So you'd have to do 1024*8 to get an accurate comparison. Also I am having a bit of a brain fart. It's a bit over 27 times faster than his original C code and not around 10 times faster as I stated in the previous post.
Posted on 2004-08-13 15:41:02 by mark_larson
92,000 cycles to perform 104360 multiplications of 1024 base10 digit numbers? That doesn't seem correct - I'm missing something.
Posted on 2004-08-13 16:01:44 by bitRAKE

92,000 cycles to perform 104360 multiplications of 1024 base10 digit numbers? That doesn't seem correct - I'm missing something.


It's not 104,360 multiplications. It is considerably less. I converted the entered base 10 values to base 1 billion. That's how I got the big speed up ;) You do 3 multiplies per 4 digits of 1 billion based digit numbers. I added code to have it print out the total number of digits after converting to base 1 billion, and it's 102 ;)


EDIT: I am still making changes to his C code to speed it up. I have not written any code in assembler yet. Generally I do 3 levels of optimizations when optimizing C code.
1) Algorithmic ( what I am doing now)
2) Generic C optimzations
3) Assembler optimizations
Posted on 2004-08-13 16:04:53 by mark_larson

EDIT: Keep in mind that I am dealing with bytes not bits. In your code above you are dealing with bits. So you'd have to do 1024*8 to get an accurate comparison.
That is somewhat true (you use base 10^9, I use base 2), but if the base2 exponents were stored in an array - for all operands, including destination; then the algorithm can be changed significantly!

N*M=P

{N} = (a,b,c,...,N)
{M} = (a',b',c',...,M)

{P} = ((a+a'), (a+b'), (a+c'), ..., (N+M))

1. copy array {N} into {P}, lengthof {M} times.
2. add array {M} into {P}, lengthof {N} times.
3. radix sort {P} and combine/promote terms.

Again this improves speedwise as the number of bits set decrease and isn't based on the size of N, M, or P. Note that (1) and (2) can be combined and coded to run at the speed of memory bandwidth; (3) is an O(nm) operation but should be faster in practice. This type of algorithm really has different uses from Karatsuba.
Posted on 2004-08-13 16:41:05 by bitRAKE

That is somewhat true (you use base 10^9, I use base 2), but if the base2 exponents were stored in an array - for all operands, including destination; then the algorithm can be changed significantly!

N*M=P

{N} = (a,b,c,...,N)
{M} = (a',b',c',...,M)

{P} = ((a+a'), (a+b'), (a+c'), ..., (N+M))

1. copy array {N} into {P}, lengthof {M} times.
2. add array {M} into {P}, lengthof {N} times.
3. radix sort {P} and combine/promote terms.

Again this improves speedwise as the number of bits decrease and isn't based on the size of N, M, or P. Note that (1) and (2) can be combined and coded to run at the speed of memory bandwidth; (3) is an O(nm) operation but should be faster in practice. This type of algorithm really has different uses from Karatsuba.


I am having problems getting your idea to work with some sample data. I picked 1212 in decimal and 34345 in decimal. Can you show me what you are talking about using those examples?
Posted on 2004-08-13 17:40:17 by mark_larson
N = 1212 = 10010111100 (base 2)
M = 34345 = 1000011000101001 (base 2)

{N} = 2,3,4,5,7,10
{M} = 0,3,5,9,10,15

{P} = 2,3,4,9,11,13,16,17,19,20,21,22,25

P = 10011110110010101000011100 (base 2)

Note: the arrays are the base two exponents - similar to the algorithm at the top.
Posted on 2004-08-13 18:05:55 by bitRAKE

N = 1212 = 10010111100 (base 2)
M = 34345 = 1000011000101001 (base 2)

{N} = 2,3,4,5,7,10
{M} = 0,3,5,9,10,15

{P} = 2,3,4,9,11,13,16,17,19,20,21,22,25

P = 10011110110010101000011100 (base 2)

Note: the arrays are the base two exponents - similar to the algorithm at the top.



Ah now I gotcha!!! A picture is worth a thousand words ;) Just a couple off the top of my head alternatives to the sort in step 3. I am not sure how good/bad they are in comparison to doing a full sort. I haven't thought about it that much. Just tossing them out on the table.

1) Each row is guaranteed to be sorted. Do a binary search on each row for the current power of 2 you are collating.
2) Have pointers into each row that start at the beginning of the row. Check to see if the first value in the first row is less than your current value. If so increment to the next one. repeat until you get a value that is equal or greater. If equal then you have a match. if greater than stop the pointer, and leave it pointing there. Go to the next row and repeat. This will work because the rows are sorted.

you can turn both of the above options to work vertically instead of horizontally based on the size of M versus N.
Posted on 2004-08-13 19:09:52 by mark_larson
In the grid of (3) the number below or to the right is always greater. I was hoping there was some way to take advantage of this to reduce the number of comparisons needed. We know we are starting at the upper left and ending at the lower right, but how to determine the minimum path? There also should be some way to quickly exclude whole rows for sequential exponents as 7=8-1; 15=16-1; etc.
Posted on 2004-08-13 20:41:28 by bitRAKE

In the grid of (3) the number below or to the right is always greater. I was hoping there was some way to take advantage of this to reduce the number of comparisons needed. We know we are starting at the upper left and ending at the lower right, but how to determine the minimum path? There also should be some way to quickly exclude whole rows for sequential exponents as 7=8-1; 15=16-1; etc.


Yep that's what i was doing in those 2 suggestions was taking advantage that the one to the right is always greater, which means it's sorted on each row.

I think the second option is faster. The second option would more look like this. C pseudo-code




unsigned char *row_ptrs[6];
unsigned char *cur_row;


; ... in here make each array element of row_ptrs point to the start of a row in that 2x2 grid you showed me in x1.bmp.

int power_we_are_looking_for = 3;
for ( t = 0; t < MAX_ROW; t++)
{
while (1)
{
cur_ptr = (unsigned char *)(&row_ptrs[t]);
if (*cur_ptr < power_we_are_looking_for)
{
cur_ptr++;
continue;
}

if (*cur_ptr == power_we_are_looking_for)
{
collate(cur_ptr); //perform actions on the power we just found in the 2x2 array
break;
}
if (*cur_ptr > power_we_are_looking_for)
{
//went too far, so stop at the current row for the next power. Since we search the powers in order, this means we start
// FURTHER down the row next time!!!
break;
}


}
(unsigned char *)(&row_ptrs[t]) = cur_ptr; //update the rows each loop to reflect where we've already checked.
// Since the powers are in a row, subsequent calls to this routine will start checking in the middle of the row instead of at the start.
}
Posted on 2004-08-13 21:10:43 by mark_larson