Here is the code:

xor edx,edx

mov eax,estore

mov ecx, Width

div ecx ;<==== this is bugger - 40 clock cycles with 32-bit numbers

mov MX,edx ;MOD is stored here

mov MY,eax ;RESULT is stored here

ret
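For anyone who reads C more easily than MASM, the routine above amounts to something like the following. This is only a sketch: `index_to_xy` and its parameter names are made up for illustration, with `index` standing in for `estore`, and `x`/`y` standing in for `MX`/`MY`. A compiler targeting x86 will typically emit a single `div` for the `/` and `%` pair.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a linear pixel index into column (x) and row (y).
   Mirrors the DIV routine: quotient -> MY, remainder -> MX. */
void index_to_xy(uint32_t index, uint32_t width, uint32_t *x, uint32_t *y)
{
    assert(width != 0);      /* DIV raises a divide error on a zero divisor */
    *y = index / width;      /* quotient  (EAX after DIV) */
    *x = index % width;      /* remainder (EDX after DIV) */
}
```

With a width of 800, index 1605 maps to x = 5, y = 2.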

Here is the FPU version. Do I have it coded right?

xor edx,edx

mov eax,estore

fild estore

fidiv Width ; = estore/Width

fistp MY ; store value

mul MY ;eax should still contain estore value

sub estore,eax ;subtract to get modulus

mov MX,eax ;move

ret

It would seem to be correct, but I'm curious: why do you wish to do things this way?

Is there a better way to get modulus without another division - can the FPU return the mod of a number too? I browsed through some of the FPU documentation on the website and didn't find a fimod, or find where the remainder is stored in an integer division.

If you are using FPU, you would not be interested in mod because it just does not make sense.

The FPU does NOT do "integer" division. It does floating point division. Therefore, there is no "remainder" per se. The integer portion would be the result, and the fractional portion multiplied by the divisor would be the remainder.
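To make that arithmetic concrete, here is a small sketch in C (not MASM). The function name is hypothetical, and it assumes 32-bit unsigned operands, which a `double` holds exactly. The fractional portion is scaled by the divisor (`width` here) and rounded, because the product is only approximately an integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: quotient and remainder from one floating-point divide. */
void fp_divmod(uint32_t n, uint32_t width, uint32_t *quot, uint32_t *rem)
{
    assert(width != 0);
    double q = (double)n / (double)width;   /* e.g. 17 / 5 = 3.4 */
    uint32_t whole = (uint32_t)q;           /* C casts truncate: 3 */
    double frac = q - (double)whole;        /* roughly 0.4 */
    *quot = whole;
    /* frac * width is only approximately an integer, so round to nearest. */
    *rem = (uint32_t)(frac * (double)width + 0.5);
}
```

The extra subtract, multiply and rounding steps are exactly the overhead that makes this slower than a plain integer divide.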

A division with the FPU takes as much time as the integer division with the CPU. However, truncating the FPU result, subtracting that truncated result from the actual result to get the fractional portion, and then multiplying that by the divisor to get the "remainder" as an integer would take a lot longer.

BTW, your posted FPU version was faulty. Just try it with small numbers.

If estore/Width = MY + remainder

Then estore*MY = ????

Furthermore, the **fistp** instruction would give you a "rounded" result unless you change the rounding bits of the Control Word of the FPU to give you a truncated integer. (The default is for rounding.)

Raymond
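The rounding point is easy to check with small numbers. The sketch below is in C rather than MASM and does not touch the FPU Control Word. Instead it hand-rolls round-to-nearest-even (the x87 default) for non-negative values and compares it with truncation. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round to nearest, ties to even (the x87 default rounding mode),
   hand-rolled for non-negative values so no FPU state is touched. */
static uint32_t round_nearest_even(double v)
{
    uint32_t t = (uint32_t)v;          /* C casts truncate toward zero */
    double frac = v - (double)t;
    if (frac > 0.5 || (frac == 0.5 && (t & 1u)))
        t += 1;
    return t;
}

/* Quotient as FISTP would store it under the default Control Word. */
uint32_t quotient_rounded(uint32_t n, uint32_t width)
{
    assert(width != 0);
    return round_nearest_even((double)n / (double)width);
}

/* Quotient as FISTP would store it with rounding control set to truncate. */
uint32_t quotient_truncated(uint32_t n, uint32_t width)
{
    assert(width != 0);
    return (uint32_t)((double)n / (double)width);
}
```

For 1599 / 800 the rounded quotient is 2 where integer division gives 1, so a remainder derived from the rounded value would come out negative.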

Use the FPREM instruction. That calculates the remainder.

Integer divide with result and mod works fastest with the DIV opcode.

Trying to make an SSE solution for INT divide uses too many CVTxx2xx opcodes which makes it slower.

BUT

Using SSE instead of the FPU opcodes might improve speed.

To get the remainder and result with the FPU you need to use the fdiv and the fprem opcodes.

The fprem opcode does its own divide and does the calculation for remainder by truncating and multiplying and then subtracting (aka slow).
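Taking that description at face value, the divide/truncate/multiply/subtract sequence looks like this when written out in C. This is a sketch of the arithmetic only, not of how the FPU implements `fprem` internally, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of "divide, truncate, multiply, subtract".
   Returns the remainder and stores the truncated quotient in *quot. */
uint32_t rem_trunc_mul_sub(uint32_t n, uint32_t width, uint32_t *quot)
{
    assert(width != 0);
    double q = (double)n / (double)width;   /* the divide */
    uint32_t whole = (uint32_t)q;           /* truncate toward zero */
    *quot = whole;
    return n - whole * width;               /* multiply, then subtract */
}
```

Note the multiply uses the divisor (`width`), not the original dividend.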

The original post seemed to be only concerned with getting the modulus of a 32-bit number by a 32-bit modulus because of the **xor edx,edx** as the preliminary instruction before the division. Under such conditions, using the CPU integer operations is the fastest way.

However, getting the modulo of a 64-bit number in the EDX:EAX pair would cause the program to crash due to an overflow if the modulus is smaller than or equal to the content of the EDX register. Under those conditions, using the FPU would be the only choice to obtain the modulo with the **fprem** instruction. The quotient could then be obtained by subtracting the modulo from the original 64-bit number before using the **fdiv** instruction (to avoid changing the Control Word for truncating before storing the result).

Raymond
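The overflow condition can be written down as a check. The C sketch below uses a hypothetical function name, with `hi` and `lo` standing in for EDX and EAX. It reports whether a 32-bit `div` on the pair would succeed, and computes the results with 64-bit arithmetic when it would.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 and fills quot/rem when a 32-bit DIV on hi:lo would succeed,
   0 when it would raise a divide error (divisor zero or quotient too wide). */
int div_edx_eax(uint32_t hi, uint32_t lo, uint32_t divisor,
                uint32_t *quot, uint32_t *rem)
{
    assert(quot && rem);
    if (divisor == 0 || hi >= divisor)
        return 0;                            /* DIV would fault here */
    uint64_t n = ((uint64_t)hi << 32) | lo;  /* the EDX:EAX pair */
    *quot = (uint32_t)(n / divisor);         /* fits because hi < divisor */
    *rem  = (uint32_t)(n % divisor);
    return 1;
}
```

The remainder always fits in 32 bits; it is only the quotient that can overflow.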

Thanks for the replies. Yes, it should be MY * Width; my oops.

I'm just looking for the fastest way possible. It is for bitmaps: some of the GDI APIs require x/y and not a pointer to a video address, and on initialization I scan a bitmap to create an irregular region. 800x600 takes about 50 seconds. I was wondering if I could use WORDs instead; it would save 20 cycles.

What if I were doing a full-screen blur or other video processing - it would take as long or longer.

I thought about tabulating the addresses into an indexed array, but the table would be megabytes in size: 800*600*SIZEOF DWORD*2 (3,840,000 bytes).

I could throw a splash screen up and use a progress bar, or load the region data in as a resource - which I know little about at this point.
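As a rough illustration of the lookup-table idea, here is a C sketch. The type and function names are hypothetical. It fills the table with running x/y counters, so no division is needed even while building it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t x, y; } PixelXY;   /* two DWORDs per pixel */

/* Hypothetical helper: build the index -> (x, y) table with running
   counters instead of a divide per pixel.
   Returns NULL on allocation failure; the caller frees the table. */
PixelXY *build_xy_table(uint32_t width, uint32_t height)
{
    assert(width != 0 && height != 0);
    PixelXY *table = malloc((size_t)width * height * sizeof *table);
    if (table == NULL)
        return NULL;
    size_t i = 0;
    for (uint32_t y = 0; y < height; ++y)
        for (uint32_t x = 0; x < width; ++x, ++i) {
            table[i].x = x;
            table[i].y = y;
        }
    return table;
}
```

For 800x600 the table takes 3,840,000 bytes. The same nested loops also suggest that a row-by-row scan could track x and y directly, with no table and no division at all.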

Hmmmm... sounds like you're using a bad region creating algorithm. I posted a very fast one a while back.

I have to agree with f0dder. It is pointless to optimise, say, an O(n^2) algorithm when an O(n lg n) algorithm exists. (Disclaimer: I'm not sure about the bounds for region creation.)

Nice region creating algorithm, with discussion and all: http://www.asmcommunity.net/board/index.php?topic=17519.15

Thanks for the examples. It does run fast: when I plugged my 800x600 bitmap in there and did some source updates, it worked perfectly (the C++ version).

It looks like the masm source is a disassembly of your binary? I shall study both sources, thank you.

Yeah, it is - I couldn't be bothered writing it in assembly when the C++ works so well, even on slow machines like a PII-350. Added the disassembly because some people don't like dealing with .obj's and don't have a C compiler :)