hi,

i often see people who do maths with "SHR, SHL, AND" and some other stuff instead of really multiplying or doing division. could anybody please explain me the system behind that?
the only thing i know (hopefully i remember that well) is:
division by 2 is the same as SHR value,2, right?? :grin:

hehe.. thanks
Posted on 2002-07-26 09:07:09 by NOP-erator
To explain it in more simple terms I'll use decimal....

When we shift left, we are basically adding a 0 to the front of our number.
This is effectivly multiplying by 10.

When we shift right, we remove the first "character".
This is basically dividing by 10, and removing the remainder.

In binary the same is true, except we are not multiplying by 10 we are multiplying by 2.

Shifting left one place is like multiplying by 2^1.
Shifting left two places is like multiplying by 2^2
Shifting left three places is like multiplying by 2^3
And so on.

The reason its fast is because it is a simple binary operation (no complex pipelineing).

So in fact a fast divide by two is to shift right by 1 place.

Mirno
Posted on 2002-07-26 09:27:17 by Mirno
Yeah what Mirno said,
however I would shr or shl by 33 and not 1.
Posted on 2002-07-26 14:57:32 by Graebel
Eh why? What's the advantage?
Posted on 2002-07-26 18:24:02 by iblis
Apparently this is a well known issue regarding SHR / SAR / SHL.
On 486's, there is an anomaly in the instructions.

(SHR / SAR / SHL) reg, 1 = latency 3
(SHR / SAR / SHL) reg, imm8 = latency 2

Based on the fact that all x86 processors since the 186 mask
shift counts modulo 32, this is an easy solution since imm8
can hold shift counts of 0 through 255. Basically we can code a
shift count of 33 to get an effective shift of 1 improving speed
ever so much. Every bit counts!
Posted on 2002-07-26 19:57:45 by Graebel
On the AMD Athlon, the only instruction this effects is RCL/RCR with the single shift versions being faster - latency of 1 (register) or 4 (memory) verses 5 or 6:
		; latency

RCL reg, 1 ; 1 direct path
RCL mem, 1 ; 4 direct path

RCL reg, imm8/CL ; 5 direct path
RCL mem, imm8/CL ; 6 vector path
Posted on 2002-07-26 22:17:23 by bitRAKE
That's... weird. Duly noted, thanks guys. :)

Although I will never understand why shifting a x_bit register by x+ bits doesn't zero the whole register. :boggle:
Posted on 2002-07-26 22:42:42 by iblis
The count is always masked: AND count, 1Fh (so says the Intel manual).
Posted on 2002-07-26 22:57:36 by bitRAKE
so is there a SAR/SHR/SHL version of

mov eax,2
mov ecx,12
mul ecx

???
can this be done faster? cause i need this to get values from a structure:

mov myvar.x,5

for example. but it is not possible that i use

mov myvar.x,5

it just works with 8 but not with 12... :confused:

bye
Posted on 2002-07-27 04:21:11 by NOP-erator
mov     eax, 3      ;EAX == 3

shl eax, 3 ;EAX == EAX * 2 ^ 3
3 * 2 ^ 3 <==> 2 * 12 <==> 24 :)
Posted on 2002-07-27 04:32:53 by stryker
and how to use that in a loop? if eax is incremented from zero to .. erm.. 50 for example? and eax should always be multiplied by 12? is that possible?
Posted on 2002-07-27 04:36:19 by NOP-erator
For example, when we say to multiply a number by 12. First, find the largest number you can use with shl. Since the largest possible number is 3. What is 2 ^ 3? It's 8. What's the difference of 12 and 8? It's 4. This 4 is the number of times you are going to add the original value to the "shifted value".
    mov     eax, 4

mov ecx, eax
shl eax, 3
add eax, ecx
add eax, ecx
add eax, ecx
add eax, ecx
Now assuming we want to multiply 4 by 12, all we have to do is find the largest value we can shift, and that is three and since the difference of 12 and 2 ^ 3 is 4, this is how many times we will add inside the loop ...

Sorry for not explaining this a little bit clearer. I'll post a code later, if you still want it. Right now, I have to get some sleep, it's 2:45 A.M. :)

P.S. Just remember multiply is just add, and division is just subtraction. ;)

Very Shagadelic. - Austin Powers :grin:
Posted on 2002-07-27 04:50:07 by stryker
hehe, it's about 12 P.M. here.. :grin:
yes, i would really appreceate some code! but: is the piece of code you just posted faster than multiplying?
Posted on 2002-07-27 05:00:48 by NOP-erator
What about the "and-math", have seen it to, please an explanation.

and:

P.S. Just remember multiply is just add, and division is just subtraction

?? or have I missed something ??
x+x+x = 3x <- ok
x-x-x = -2x != x/3 <- ...
x-x-x = -2x != 3/x <- ...
Posted on 2002-07-27 09:49:49 by scientica
yeah, in my first post here i asked for "AND"-maths, too.. thought that it wasn't true, cuz there was no answer on it..

i agree with scientica (with the division= substraction stuff :grin: )
Posted on 2002-07-27 09:52:34 by NOP-erator
Hi,

First Off, AND is a logical instruction. It's not arithmetic, like MUL. That's why the guys on the previous page didn't explained AND because it has nothing to do with math, maybe just a little(maybe in some special cases, rare ones). Of course you can change bits using AND but in practicality, AND is rarely used during mathematical solutions.

SHL ... is a shift instruction. It's not also arithmetic but since there is a binary pattern on the powers of 2 when using this instruction, that's why It was used.

This is how an AND works.

TRUE AND FALSE == FALSE
TRUE AND TRUE == TRUE
FALSE AND FALSE == FALSE

If I have a binary sequence like this.

0111 (7 in decimal) and I will AND 0111 with
0010 (2 in decimal) I will get
____
0010 (2 in decimal)

You see there is no mathematical use in most cases.

As for the question, "which is faster SHL-ADD or MUL"? I can't really say which one is faster but in some cases SHL-ADD is faster, sometimes MUL is faster.
[size=9].686

.MODEL FLAT, STDCALL
OPTION CASEMAP:NONE
INCLUDE \masm32\INCLUDE\windows.inc
INCLUDE \masm32\INCLUDE\kernel32.inc
INCLUDE \masm32\INCLUDE\user32.inc
INCLUDELIB \masm32\lib\user32.lib
INCLUDELIB \masm32\lib\kernel32.lib
INCLUDE \masm32\INCLUDE\masm32.inc
INCLUDELIB \masm32\lib\masm32.lib

.DATA?

g_ddTime1 DD ?
g_ddTime2 DD ?
g_dbBuffer DB 16 DUP(?)

.CODE

START:

;===========================================================================
;SHL-ADD (Original)
;===========================================================================

invoke QueryPerformanceCounter, OFFSET g_ddTime1

xor ebx, ebx

@@:

mov eax, ebx
mov ecx, ebx
shl eax, 3
add eax, ecx
add eax, ecx
add eax, ecx
add eax, ecx
inc ebx
cmp ebx, 50
nop
nop
jbe @B

invoke QueryPerformanceCounter, OFFSET g_ddTime2

mov eax, g_ddTime2
sub eax, g_ddTime1
invoke dwtoa, eax, OFFSET g_dbBuffer
invoke MessageBox, 0, OFFSET g_dbBuffer, 0, 0

;===========================================================================
;MUL
;===========================================================================

invoke QueryPerformanceCounter, OFFSET g_ddTime1

xor ebx, ebx
mov ecx, 12

@@:

mov eax, ebx
mul ecx
inc ebx
cmp ebx, 50
nop
nop
jbe @B

invoke QueryPerformanceCounter, OFFSET g_ddTime2

mov eax, g_ddTime2
sub eax, g_ddTime1
invoke dwtoa, eax, OFFSET g_dbBuffer
invoke MessageBox, 0, OFFSET g_dbBuffer, 0, 0


invoke ExitProcess,NULL

END START[/size]
As you can see, I used QueryPerformanceCounter, I'm not sure if the outputs are correct but I'll try maverick's profiler later. :)
Posted on 2002-07-27 10:23:20 by stryker
My approach is a little different than Stryker's.

To mul a number by twelve, first you've gotta think about the bit representation of 12. This is 1100b. I.E., 12=8+4. So why is this helpful?
Well, for all x

12*x = 8*x + 4*x

so if you wanna multiply eax by 12 you're gonna need to add 8*x and 4*x. We do this:


mov edx,eax ;use edx as a temp value
shl eax,3 ;mul eax by 8
shl edx,2 ;mul edx (the original x) by 4
add eax,edx ;add 8*x and 4*x to get 12*x in eax


I dunno about division being subtraction though... and multiplication being addition is a bit of a stretch too.

BTW, there's also multiplication using the LEA instruction

LEA does some shifting and addition at the same time. However, you can't do more than shl r,3

ex.


LEA eax,[8*eax] == shl eax,3
LEA eax,[4*eax+eax] == shl eax,2 / add eax all at once i.e., mul eax,5


so to mul by twelve we use 12=3*4


shl eax,2
lea eax,[2*eax+eax]


i.e., first we mul the value by 4 by shifting by 2. Then we use lea to mul the resultant by 3

Simple, eh?

:)

And to answer your question about speed... shifting is almost always faster. So do it whenever you can.

As for "and" math... I'm not sure what you mean. There are some tricks using "and" and "test" and "xor".
Ex. cmp eax,0 can be replaced with test eax,eax
mov eax,0 can be replaced with xor eax,eax
etc

--Chorus
Posted on 2002-07-27 10:30:07 by chorus
scientica,

As for DIV being SUB.

let's use a test case of 10/3. The answer to 10/3 is 3.3333.... (Non-Terminating)

Now look at the relationship between DIV and SUB

10 - 3 == 7
7 - 3 == 4
4 - 3 == 1

If we subtract 4 from 1 we will get a negative, so we stop. Now count how many times we subtracted from 10 by 3. That is 3 times, right? this is our quotient and the remaining value (1) will be our remainder.

Which is true if I used it this way.
mov eax, 10

mov ecx, 3
xor edx, edx
div ecx
You will get 3 in EAX and 1 in EDX.

But this isn't the right answer. The remainder is suppose to be a non-terminating 3. How do we fix this?

(Sorry for the wrong math terms Used - I seem to forgot most of the terms, I hope you'll get the idea)

Now let us look at the remainder first If you multiply that by 10 you'll get 10. Now do the same thing we did above and I guarantee you you'll get 3.3333... (Non-Terminating in the long run).

P.S. Don't mind the NOPs used on my code above, that was just a teaser for NOP-erator. :grin:

Very Shagadelic!!! ;)
Posted on 2002-07-27 10:44:22 by stryker
OffTopic:

Stryker, did you see GoldMember last night?? It was fricken hilarious :D

--Chorus
Posted on 2002-07-27 10:47:59 by chorus
I planned to watch but I was very busy with other projects. I'll try to watch this weekend(Sunday). :grin:

Yeah! Baby Yeah! :grin:
Posted on 2002-07-27 10:49:43 by stryker