Here is the code:

xor edx,edx
mov eax,estore
mov ecx, Width
div ecx    ;<==== this is bugger - 40 clock cycles with 32-bit numbers

mov MX,edx  ;MOD is stored here
mov MY,eax  ;RESULT is stored here

Here is the FPU version. Do I have it coded right?

xor edx,edx
mov eax,estore

fild estore
fidiv Width  ; = estore/Width
fistp MY      ; store value

mul MY  ;eax should still contain estore value
sub estore,eax  ;subtract to get modulus
mov MX,eax      ;move
Posted on 2005-07-11 05:55:40 by drarem
It would seem to be correct. I'd be curious why you wish to do things this way.
Posted on 2005-07-11 08:32:22 by Eóin
Is there a better way to get the modulus without another division? Can the FPU return the mod of a number too? I browsed through some of the FPU documentation on the website and didn't find a fimod, or where the remainder is stored in an integer division.
Posted on 2005-07-11 09:16:01 by drarem
If you are using FPU, you would not be interested in mod because it just does not make sense.
Posted on 2005-07-11 09:24:57 by roticv
The FPU does NOT do "integer" division. It does floating point division. Therefore, there is no "remainder" per se. The integer portion would be the result, and the fractional portion multiplied by the divisor would be the remainder.

A division with the FPU takes as much time as the integer division with the CPU. However, truncating the FPU result, subtracting that truncated result from the actual result to get the fractional portion, and then multiplying that by the divisor to get the "remainder" as an integer would take a lot longer.

BTW, your posted FPU version was faulty. Just try it with small numbers.

If, estore/Width = MY + remainder
Then, estore*MY = ????

Furthermore, the fistp instruction would give you a "rounded" result unless you change the rounding-control bits of the FPU Control Word to give you a truncated integer. (The default is round-to-nearest.)
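
To make the two problems above concrete, here is a minimal C sketch (not the original assembly). The helper names are hypothetical, and `quotient_rounded` only approximates what fistp does under the default control word: it assumes non-negative inputs and rounds exact ties up, whereas the FPU rounds ties to even.

```c
/* Hypothetical helper: quotient rounded to nearest, standing in for FISTP
   under the default control word. Assumes non-negative inputs; exact ties
   (x.5) are rounded up here, whereas the FPU rounds them to even. */
static long quotient_rounded(long n, long w) {
    return (long)((double)n / (double)w + 0.5);
}

/* Hypothetical helper: quotient truncated toward zero, as FISTP would store
   it with the rounding-control bits set to truncate (and as DIV returns). */
static long quotient_truncated(long n, long w) {
    return (long)((double)n / (double)w);
}

/* Remainder recovered as n - q * w. The multiply is by the divisor w,
   not by the original number n. */
static long remainder_from(long n, long w, long q) {
    return n - q * w;
}
```

With n = 11 and w = 4, the rounded quotient is 3 and the recovered "remainder" is -1, while the truncated quotient is 2 and the remainder is 3, which is what DIV would return.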

Posted on 2005-07-11 09:50:42 by Raymond
Use the FPREM instruction. That calculates the remainder.
Posted on 2005-07-11 10:56:14 by Sephiroth3
Integer divide with result and mod works fastest with the DIV opcode.
Trying to make an SSE solution for INT divide uses too many CVTxx2xx opcodes which makes it slower.
Using SSE instead of the FPU opcodes might improve speed.
To get the remainder and result with the FPU you need to use the fdiv and the fprem opcodes.
The fprem opcode does its own divide and does the calculation for the remainder by truncating and multiplying and then subtracting (aka slow).
Posted on 2005-07-11 12:37:57 by r22
The original post seemed to be only concerned with getting the modulus of a 32-bit number by a 32-bit modulus because of the xor edx,edx as the preliminary instruction before the division. Under such conditions, using the CPU integer operations is the fastest way.

However, getting the modulo of a 64-bit number in the EDX:EAX pair would cause the program to crash due to an overflow if the modulus is smaller than or equal to the content of the EDX register. Under those conditions, using the FPU would be the only choice to obtain the modulo with the fprem instruction. The quotient could then be obtained by subtracting the modulo from the original 64-bit number before using the fdiv instruction (to avoid changing the Control Word for truncating before storing the result).
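
The overflow condition can be illustrated with a small C sketch. It only models when a 32-bit DIV of EDX:EAX would fault; it is not the FPU approach itself. The function names are hypothetical, and the quotient and remainder are computed with C's 64-bit operators, leaving the instruction choice to the compiler.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when a 32-bit DIV of hi:lo by divisor
   would raise a divide error, i.e. when the quotient cannot fit in
   32 bits. This also covers divisor == 0. */
static int div32_would_overflow(uint32_t hi, uint32_t divisor) {
    return divisor <= hi;
}

/* Quotient and remainder of the 64-bit value hi:lo by a non-zero 32-bit
   divisor, using C's 64-bit operators. The quotient may need more than
   32 bits, so it is returned as a 64-bit value. */
static void divmod64(uint32_t hi, uint32_t lo, uint32_t divisor,
                     uint64_t *quotient, uint32_t *remainder) {
    uint64_t n = ((uint64_t)hi << 32) | lo;
    *quotient = n / divisor;
    *remainder = (uint32_t)(n % divisor);
}
```

For example, with hi = 2 and lo = 0x540BE400 (the value 10,000,000,000), dividing by 2 gives a quotient of 5,000,000,000, which does not fit in EAX, while dividing by 3 gives 3,333,333,333 with remainder 1, which does.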

Posted on 2005-07-11 21:59:44 by Raymond
Thanks for the replies. Yes, it should be MY * Width - my oops.

I'm just looking for the fastest way possible. It is for bitmaps: some of the GDI APIs require x/y and not a pointer to a video address, and on initialization I scan a bitmap to create an irregular region. 800x600 takes about 50 seconds. I was wondering if I could use WORDs instead, since it would save 20 cycles.

What if I were doing a full-screen blur or other video processing - it would take as long or longer.

I thought about tabulating the addresses into an indexed array, but the table would run to several MB: 800*600*SIZEOF DWORD*2 = 3,840,000 bytes.

I could throw a splash screen up and use a progress bar, or load the region data in as a resource - which I know little about at this point.
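
One way to avoid the per-pixel division when scanning a bitmap in order is to carry x and y as loop counters instead of deriving them from the offset. The sketch below is in C rather than assembly and rests on assumptions: an 8-bit-per-pixel buffer with no row padding, and a hypothetical helper name. It is not the region algorithm discussed in this thread.

```c
#include <stddef.h>

/* Hypothetical helper: scans a width*height buffer of 8-bit pixels in
   row-major order and reports the x/y of the first non-zero pixel.
   x and y are loop counters, so no DIV is needed per pixel.
   Returns 1 if such a pixel was found, 0 otherwise. */
static int first_opaque_pixel(const unsigned char *pixels,
                              int width, int height,
                              int *out_x, int *out_y) {
    size_t offset = 0; /* linear index, advanced alongside x and y */
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x, ++offset) {
            if (pixels[offset] != 0) {
                *out_x = x;
                *out_y = y;
                return 1;
            }
        }
    }
    return 0;
}
```

The same pattern applies to any full-frame pass: the division is only needed when starting from an arbitrary offset, not when walking the buffer sequentially.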
Posted on 2005-07-12 05:25:46 by drarem
Hmmmm... sounds like you're using a bad region creating algorithm. I posted a very fast one a while back.
Posted on 2005-07-12 05:29:22 by f0dder
I have to agree with f0dder. It is pointless to optimise, say, an O(n^2) algorithm when there exists an O(n lg n) algorithm (Disclaimer: I'm not sure about the bounds for region creation).
Posted on 2005-07-12 07:07:48 by roticv
Nice region creating algorithm, with discussion and all:
Posted on 2005-07-12 07:13:13 by f0dder
Thanks for the examples, it does run fast. When I plugged my 800x600 bitmap in there and did some source updates, it worked perfectly (the C++ version).

It looks like the masm source is a disassembly of your binary? I shall study both sources, thank you.
Posted on 2005-07-12 19:02:37 by drarem

It looks like the masm source is a disassembly of your binary? I shall study both sources, thank you.

Yeah, it is - I couldn't be bothered writing it in assembly when the C++ works so well, even on slow machines like a PII-350. Added the disassembly because some people don't like dealing with .obj's and don't have a C compiler :)
Posted on 2005-07-12 19:07:21 by f0dder