hi

Mark Larson wrote in his 'Assembly Optimization Tips':

Multiply to divide.

If you have a full 32-bit number and you need to divide, you can simply do a multiply and take the top 32-bit half as the result. This is faster because multiplication is faster than division. ( thanks to pdixon for the tip).

i don?t understand how it should work. counld someone show me some peace of code ? in C or ASM. but C would be better for me.

i?m just beginning to play with asm und with english. you will notice ;o)

with greetings from germany

Nachtschatten

Mark Larson wrote in his 'Assembly Optimization Tips':

Multiply to divide.

If you have a full 32-bit number and you need to divide, you can simply do a multiply and take the top 32-bit half as the result. This is faster because multiplication is faster than division. ( thanks to pdixon for the tip).

i don?t understand how it should work. counld someone show me some peace of code ? in C or ASM. but C would be better for me.

i?m just beginning to play with asm und with english. you will notice ;o)

with greetings from germany

Nachtschatten

Try playing with this tool by The Svin.

hi roticv

Try playing with this tool by The Svin.

i know that tool.is very good. but i?m asm beginner and therfore it?s very hard for me to read and anderstand the code.

back to marks divide tip

i understand his sentence as follow:

i have a variable named 'x'. its a full 32 bit number ie. 3123456789

now, if i want to divide x by 10, i have to " multiply and take the top 32-bit half as the result"

ok, some pseudocode:

int32 x = 3123456789;

int32 div = 10;

int64 res = 0;

res = (int64)x * 10;

res shr 32;

now res should contain the result of "x / div".

i think, i missunderstand the meaning of his sentence.

do you know now my problem ?

thinks for helping

Nachtschatten

Try playing with this tool by The Svin.

i know that tool.is very good. but i?m asm beginner and therfore it?s very hard for me to read and anderstand the code.

back to marks divide tip

i understand his sentence as follow:

i have a variable named 'x'. its a full 32 bit number ie. 3123456789

now, if i want to divide x by 10, i have to " multiply and take the top 32-bit half as the result"

ok, some pseudocode:

int32 x = 3123456789;

int32 div = 10;

int64 res = 0;

res = (int64)x * 10;

res shr 32;

now res should contain the result of "x / div".

i think, i missunderstand the meaning of his sentence.

do you know now my problem ?

thinks for helping

Nachtschatten

He did not say that. He said

"If you have a full 32-bit number and you need to divide, you can simply do a multiply and take the top 32-bit half as the result. This is faster because multiplication is faster than division. ( thanks to pdixon for the tip)."

For instance if you want to divide a number by 10, it would be faster to multiply by 0.1 than divide by 10, assuming that you are using fpu.

Using the tool by The Svin, you can easily determine the constant that you should multiply by to get the number you want to find.

"If you have a full 32-bit number and you need to divide, you can simply do a multiply and take the top 32-bit half as the result. This is faster because multiplication is faster than division. ( thanks to pdixon for the tip)."

For instance if you want to divide a number by 10, it would be faster to multiply by 0.1 than divide by 10, assuming that you are using fpu.

Using the tool by The Svin, you can easily determine the constant that you should multiply by to get the number you want to find.

wow, this tool rocks! now i don't have to calculate it 'manually' :D thanx roticv, thanx The Svin :D

You are welcome. This tool used to be found on this board, until the hacking incidents happened.

there are a lot of instructions excluding mul to be done, is it all really faster than one div ??

If you ask me, yes. a mul + a shift should be faster than a divison. You can try it out ;)

If you ask me, yes. a mul + a shift should be faster than a divison. You can try it out ;)

Is there any algo for this, because I'm not good at reading someone else asm code.Is there any point of implementing something similar in c++ app ??

I really didn't see how Mark's tip from pdixon would work as stated. It was also confusing to me. I had found a very nice tutorial about??reciprocal multiplication written by Douglas Jones that clarified a lot for me. It also discusses problems with rounding that need to be considered at times. Just wanted to pass the link along.

-Phil

-Phil

I think you've taken the suggestion out of context. It doesn't apply to all cases, just to the specific case being discussed.

If I remember correctly, this came from an attempt to optimise a random number routine to give a random integer between 0 and SomeLimit.

e.g. rnd(10) should return an integer from 0 to 9

There were various algoritms discussed for producing random numbers of various precision typically 32 bits.

The next step in converting the the 32 bit random number into an integer from 0 to 9 was invariably to divide the 32 bit value by 10 and keep the remainder. This is what the DIV instruction does, it divides the 2 numbers giving the quotient in EAX (which is ignored) and the remainder in EDX which is the required result, an integer from 0 to 9.

The suggestion was to replace the DIV by 10 with a MUL by 10 which is much faster. You'll find that the high word of the result which appears in EDX is also a random integer from 0 to 9.

This only works if the initial random bits fill the register (or at least are uniformaly distributed within the high bits of that register).

pdixon

If I remember correctly, this came from an attempt to optimise a random number routine to give a random integer between 0 and SomeLimit.

e.g. rnd(10) should return an integer from 0 to 9

There were various algoritms discussed for producing random numbers of various precision typically 32 bits.

The next step in converting the the 32 bit random number into an integer from 0 to 9 was invariably to divide the 32 bit value by 10 and keep the remainder. This is what the DIV instruction does, it divides the 2 numbers giving the quotient in EAX (which is ignored) and the remainder in EDX which is the required result, an integer from 0 to 9.

The suggestion was to replace the DIV by 10 with a MUL by 10 which is much faster. You'll find that the high word of the result which appears in EDX is also a random integer from 0 to 9.

This only works if the initial random bits fill the register (or at least are uniformaly distributed within the high bits of that register).

pdixon

For power of 2 constants you can just AND Variable, Constant -1? ; 1111 & 7 = R111

The FPU has an opcode for remainder but it's slower than hell.

The magic number thing is great. But it seems to get shaky as the numbers get very large 4bil-ish.

Because of the FPU's precision I wonder if it'll be able to use the same calculation to get a magic number for a 64bit integer divide.

Processors need a fast MODULUS opcode!

The FPU has an opcode for remainder but it's slower than hell.

The magic number thing is great. But it seems to get shaky as the numbers get very large 4bil-ish.

Because of the FPU's precision I wonder if it'll be able to use the same calculation to get a magic number for a 64bit integer divide.

Processors need a fast MODULUS opcode!

Is there any point of implementing something similar in c++ app ??

Some C/C++ compilers do this automatically when optimization is turned on (e.g. Visual C++ and GCC).

If you want to divide by the odd number, and you know the number you want to divide is divisible without remainder, you can do this by multiplying by "magic" number and take the LOW half of the result, without need to do any shifts etc.

This approach is based on the theory of 2-adic numbers, if you need some elementary information on 2-adic number, you can find it in the Programming Tutorial I've once started.

For example, the 2-adic reciprocal of 3 is 110101010101... (infinite chain), if we multiply any integer N divisible by 3 by such infinite chain, we will get N/3. On computer we are able to operate only on finite chains, but note, that even when multiplyng the infinite ones, the n first bits always depends only on the first n bits of both numbers we multiply. So if we take the first 32 bits of 2-adic 1/3 and multiply the 32-bit integer by it, the low 32 bits of the result will be the correct first 32 bits of the result, N/3 in this case.

The first 32 bits of infinite 1/3 expansion are 0AAAAAAABh (and the whole expansion goes as if we were adding infinite amount of A digits to the left), therefore this code:

will calculate the N/3 for us in EAX.

When you want to divide by even number, you can factorize it to some power of two end off number and make one multiplication live above and then shift by the amount of bits equal to the power of two from the factorization. Always operating on the lower half of the result! This means you can use this method also with the IMUL instruction variants, which give you only the lower half of the result, like:

However this method has the substantial flow, that the number we want to divide has to de divisible by the value, whose reciprocal we use - otherwise we get the low bits of some infinite expansion of rational number.

The method discussed above doesn't have such disadvantage. But let's see how the two method are related to each other. If you use Svin's tool to get the magic number for dividing by 3, you will see that it's actually the same as the low 32 bits of p-adic 1/3 we used above. But this time you are told to get the HIGH 32 bits of result and shift them right by one bit.

Indeed, if you check it, when you multiply by 0AAAAAAABh the 32-bit N divisible by 3, you will get N/3 in low 32 bits, and 2N/3 in the high 32 bits. To see how the 2-adic arithmetics can explain this, let's analyze in terms of p-adic rational numbers, what you exactly get, when you do such multiplication. The 0AAAAAAAB is incomplete 1/3, what we cut off is the infinite amount of A digits starting from ninth one (33rd bit), therefore:

N*0AAAAAAABh = N*(1/3 - ...AAA00000000h)

note now that ...AAA00000000h = ...AAAAh * 100000000h = ...AAAAh * (2^32)

but ...AAAAh = ...AAABh - 1 = 1/3 - 1 = -2/3, so:

N*0AAAAAAABh = N * (1/3 - (-2/3)*(2^32)) = N*3 + (2N/3)*(2^32)

so we get exactly what we expected - after multiplyng by 0AAAAAAABh we get N/3 in low 32 bits and 2N/3 in the high bits.

But how does it come, that even when N is not divisible by 3, we get the correct rounded value of N/3 starting from the 34th bit?

If N=M+r, where r<3, then we can split the N*0AAAAAAABh to the sum of two products - for M we know we get 2M/3 in upper bits, so it suffices to show that r*0AAAAAAABh has zeros starting from the 34th bit - what is actually evident, since r is smaller than 2^2 and the 0AAAAAAABh is smaller than 2^32, so their product is smaller than 2^34. Well, there might be some overflow when M is very large, near to 2^32, but that's another story - I'm providing only very simplified analysis on single example, but I believe the analogous reasoning works for the general case.

Also, for the small values of M, the "middle" bits, just below the result in high bits, would even allow to recover the remainder from division - though I doubt it could be done efficiently. But if I design any nice algorithm for this, I'll let you know.

I hope I have not bored everyone with those theoretical divagations...

This approach is based on the theory of 2-adic numbers, if you need some elementary information on 2-adic number, you can find it in the Programming Tutorial I've once started.

For example, the 2-adic reciprocal of 3 is 110101010101... (infinite chain), if we multiply any integer N divisible by 3 by such infinite chain, we will get N/3. On computer we are able to operate only on finite chains, but note, that even when multiplyng the infinite ones, the n first bits always depends only on the first n bits of both numbers we multiply. So if we take the first 32 bits of 2-adic 1/3 and multiply the 32-bit integer by it, the low 32 bits of the result will be the correct first 32 bits of the result, N/3 in this case.

The first 32 bits of infinite 1/3 expansion are 0AAAAAAABh (and the whole expansion goes as if we were adding infinite amount of A digits to the left), therefore this code:

`mov eax,N`

mov ebx,0AAAAAAABh

mul ebx

will calculate the N/3 for us in EAX.

When you want to divide by even number, you can factorize it to some power of two end off number and make one multiplication live above and then shift by the amount of bits equal to the power of two from the factorization. Always operating on the lower half of the result! This means you can use this method also with the IMUL instruction variants, which give you only the lower half of the result, like:

`imul eax,0AAAAAAABh ; divide by 3`

However this method has the substantial flow, that the number we want to divide has to de divisible by the value, whose reciprocal we use - otherwise we get the low bits of some infinite expansion of rational number.

The method discussed above doesn't have such disadvantage. But let's see how the two method are related to each other. If you use Svin's tool to get the magic number for dividing by 3, you will see that it's actually the same as the low 32 bits of p-adic 1/3 we used above. But this time you are told to get the HIGH 32 bits of result and shift them right by one bit.

Indeed, if you check it, when you multiply by 0AAAAAAABh the 32-bit N divisible by 3, you will get N/3 in low 32 bits, and 2N/3 in the high 32 bits. To see how the 2-adic arithmetics can explain this, let's analyze in terms of p-adic rational numbers, what you exactly get, when you do such multiplication. The 0AAAAAAAB is incomplete 1/3, what we cut off is the infinite amount of A digits starting from ninth one (33rd bit), therefore:

N*0AAAAAAABh = N*(1/3 - ...AAA00000000h)

note now that ...AAA00000000h = ...AAAAh * 100000000h = ...AAAAh * (2^32)

but ...AAAAh = ...AAABh - 1 = 1/3 - 1 = -2/3, so:

N*0AAAAAAABh = N * (1/3 - (-2/3)*(2^32)) = N*3 + (2N/3)*(2^32)

so we get exactly what we expected - after multiplyng by 0AAAAAAABh we get N/3 in low 32 bits and 2N/3 in the high bits.

But how does it come, that even when N is not divisible by 3, we get the correct rounded value of N/3 starting from the 34th bit?

If N=M+r, where r<3, then we can split the N*0AAAAAAABh to the sum of two products - for M we know we get 2M/3 in upper bits, so it suffices to show that r*0AAAAAAABh has zeros starting from the 34th bit - what is actually evident, since r is smaller than 2^2 and the 0AAAAAAABh is smaller than 2^32, so their product is smaller than 2^34. Well, there might be some overflow when M is very large, near to 2^32, but that's another story - I'm providing only very simplified analysis on single example, but I believe the analogous reasoning works for the general case.

Also, for the small values of M, the "middle" bits, just below the result in high bits, would even allow to recover the remainder from division - though I doubt it could be done efficiently. But if I design any nice algorithm for this, I'll let you know.

I hope I have not bored everyone with those theoretical divagations...

i know that tool.is very good. but i?m asm beginner and therfore it?s very hard for me to read and anderstand the code.

forget "Magic Divider" it sucks, you'd better use this binary technic (here, we use the size of the register as a "pivot" (sorry i don't know the word in english...) :

real magic number = ( MAX register value / YOUR VALUE ) + 1

here, a small example with a 32 bits register :

2 000 000 / 127 773 = 15,652759190126239502868368121591 (respectively in eax and edx)

( 4 294 967 295 / 127 773 ) + 1 = 3 3615 (our magic value)

2 000 000 * 3 3615 = 67 230 000 000 (but this number exceed the eax registrer, so the corresponding value is send in edx)

and here is the magic => 67 230 000 000 / 4 294 967 296 = 15,6532041728496551513671875

(there is a lacks of precision here, but we don't care because we are only interested by the integer part in edx => 15)

advantages of this technic :

- easy to remember

- there's no need to pratice a shr on the result, etc... the result is directly in edx.

- it works with all registers size byte/word/dword/qword, etc..

- no limit for the size of the numbers (but of course, they can't exceed the size of the used registers)

to finish, if you want to obtain the fractionnal part, you just have to multiply your result by the original divider

15 * 127 773 = 1 916 595

and substract this value to the original value

2 000 000 - 1 916 595 = 83 405 (83 405 / 127 773 = 0,65275919012623950286836812159063)

(hey, we have a very good precision here...)

there are a lot of instructions excluding mul to be done, is it all really faster than one div ??

on a pentium 3, a div/idiv instruction takes 39 clock cycles and a mul/imul instruction just take 4 cycles...

PLEASE FORGIVE ME, FOR MY POOR POOR POOR ENGLISH

Noobies MUST remember one thing. This technique of multiplying instead of dividing is fine to improve execution speed

If the

Raymond

**IF AND ONLY IF**the*divider*is a constant which is used more than once and its reciprocal can be stored in memory to be used whenever required.If the

*divider*is a variable which can change throughout an application, using the "multiplication technique" would slow down execution speed because a division needs to be performed anyway to obtain the "magic" number in addition to a multiplication to obtain the quotient!!!Raymond

NightWare: yeah, that's a simple and neat trick, however it still may give wrong results sometimes. Example: 411287/17137 = 23,99994..., so the rounded down result of division should be 23 in this case, but the trick gives 411287*(4294967295/17137+1)/4294967296 = 24,0000...

Raymond: I have designed an algorithm to calculate 2-dic reciprocals without any division, though it might need a large amount of multiplications sometimes, so it won't be much more efficient. But just as a kind of curiosity - it is possible to do it without division.

Raymond: I have designed an algorithm to calculate 2-dic reciprocals without any division, though it might need a large amount of multiplications sometimes, so it won't be much more efficient. But just as a kind of curiosity - it is possible to do it without division.

however it still may give wrong results sometimes.

hmm... i didn't know that... it's a small error, but it's an error... now i have to verify/change some of my codes... but thanks for the info.

it should be easy to solve the problem... here, an attempt to solve the problem (not tested) :

; MagicNumber = ( MAX register value / OriginalValue ) + 1

; the integer part :

mov edi,X ;backup of the value to divide

mov eax,edi

mov edx,MagicNumber ;) use "mul MagicNumber" if you don't use an immediate value

mul edx ;)

; fraction part :

mov esi,edx ;backup of our integer result

mov eax,edx

mov edx,OriginalValue ;) use "mul OriginalValue" if you don't use an immediate value

mul edx ;)

sub edi,eax

; error solving :

jns NoError

dec esi ;our integer - 1

add edi,OriginalValue ;a small negative value + OriginalValue

NoError:

; here we have the integer in esi, and the fraction in edi

it should work, it still faster than a division (18/24 cycles insteed of 39 on a P3), but it's not as fast as the previous trick, hmm... magic divider give the good result... lol, maybe it's not so useless, after all...

If the

*divider*is a variable which can change throughout an application, using the "multiplication technique" would slow down execution speed because a division needs to be performed anyway to obtain the "magic" number in addition to a multiplication to obtain the quotient!!!

well, in fact it's not totally true... you're right on the principle, but since this otptimization only concern integer values, it's possible to use this trick with a variable by using a table, in this case the optimization still faster than a classic division (of course, here we're limited to small numbers, or a small range of numbers, otherwise it cost a lot of memory)

in some case, it could even be usefull if your cpu don't support sse instructions (i know it must be rare now...)

I'm using that program, trying to do a simple conversion from fahrenheit to celsius.

But my results are totally wrong...

; 9

MagicNumber = 3817748708

mov eax,X

mov edx, MagicNumber

mul edx

SHR edx, 3

MagicNumber = 3817748708

mov eax,X

mov edx, MagicNumber

mul edx

SHR edx, 3

` invoke GetDlgItemInt, hWnd, IDC_FIREN, NULL, FALSE`

; formula: (5*(x - 32))/9

sub eax, 32

lea eax,

mov edx, 3817748708

mul edx

shr edx, 3

invoke SetDlgItemInt, hWnd, IDC_CELSIUS, edx, TRUE

But my results are totally wrong...

It doesn't work if EAX has a negative number in it.

You can check the math with your Windows Calculator

EAX = 212

-32 = 180

*5 = 900

900 * 3817748708 = 3435973837200

In Hex = 00000320 00000190h

EAX = 190h

EDX = 320h

320h / 8 (shr 3) = 64h

64h = 100 decimal 212F = 100C

So the problem you're having is either in the API calls or your value in eax after 32 is subtracted from it is LESS than 0.

If the value in EAX is negative the MUL by magic number will fail

You can check the math with your Windows Calculator

EAX = 212

-32 = 180

*5 = 900

900 * 3817748708 = 3435973837200

In Hex = 00000320 00000190h

EAX = 190h

EDX = 320h

320h / 8 (shr 3) = 64h

64h = 100 decimal 212F = 100C

So the problem you're having is either in the API calls or your value in eax after 32 is subtracted from it is LESS than 0.

If the value in EAX is negative the MUL by magic number will fail