Greetings.

I'm developing a project in my spare time, and I'm somewhat stuck. I hope someone can offer some suggestions. I've benchmarked my program and found that 85% of the time is spent in this one algorithm, and unfortunately I must use this algorithm for compatibility.

The function is:

void hash(unsigned int in[2], unsigned int *out, int table[12])
{
    unsigned int vv = table[10];

    vv += in[0];
    vv *= table[0];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[1];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[2];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[3];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[4];
    out[1] = vv + table[11];

    vv += in[1];
    vv *= table[5];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[6];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[7];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[8];
    vv = (vv >> 16) | (vv << 16);
    vv *= table[9];
    out[1] += vv;
    out[0] = vv;
}

This seems to be because of the serialization, i.e. the compiler is generating:

    imul  r, mem
    rol   r, 16
    imul  r, mem
    rol   r, 16

I speculate that MMX or SSE* could possibly speed this up, since ROL used to be something to avoid. Maybe there are some algorithmic improvements that could make this go much faster?
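For reference, each multiply-plus-rotate round can be factored into a small helper, which makes it easier to experiment with alternatives. A sketch of an equivalent formulation (the helper name `mulrot` is my own, not from the original code):

```c
/* One round of the hash: 32-bit multiply, then rotate left by 16 bits. */
static unsigned int mulrot(unsigned int v, unsigned int m)
{
    v *= m;
    return (v >> 16) | (v << 16);
}

/* Equivalent formulation of the original function as two short loops. */
void hash(unsigned int in[2], unsigned int *out, int table[12])
{
    unsigned int vv = table[10];
    int i;

    vv += in[0];
    for (i = 0; i < 4; i++)      /* rounds with table[0..3] */
        vv = mulrot(vv, table[i]);
    vv *= table[4];              /* last multiply of the half has no rotate */
    out[1] = vv + table[11];

    vv += in[1];
    for (i = 5; i < 9; i++)      /* rounds with table[5..8] */
        vv = mulrot(vv, table[i]);
    vv *= table[9];
    out[1] += vv;
    out[0] = vv;
}
```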


What kind of values does the table array contain? If they are constant, you could avoid the memory-access bottleneck by hardcoding them.

The table values are only constant for a single call to this routine, so code generation techniques would be necessary to inline the values.

I hadn't considered memory bandwidth as the limiting factor, since I believed the table would be small enough to remain in a nearby cache level.


IMHO, you can only save a couple of cycles around

out[1] = vv + table[11];

"imul r, mem | rol r, 16" - this is the best you could do. Stalls are minimal to none on modern CPUs.

The problem is more about your algorithm. Doing all this hashing every 2 dwords, and then modifying the tables... are you in effect doing this on large arrays, after splitting them into 8-byte segments?


It's actually a bit worse, since each invocation of the function (the in[2]) depends on the previous 'out' values.

Would it be possible to use an MMX/SSE implementation to calculate two hashes in parallel?

Given the commutative and associative properties of multiplication, are there any other opportunities for parallelization?


Would it be possible to use an MMX/SSE implementation to calculate two hashes in parallel?

I don't think there is a way to optimize this with MMX/SSE, since each multiplication depends on the previous one.

IMO, your best bet would be to rewrite this in assembler in the hope of getting a tiny performance boost.

Compilers usually don't handle register allocation well (although this may no longer be true with the latest versions of gcc and cl). Try using EAX as the working register for the imul/rol sequences. It may speed things up.

Also, as previously mentioned, 'out[1] = vv + table[11]; / out[1] += vv;' could be optimized by initializing a register to table[11], adding to it twice, and writing back to memory once. The compiler probably handles this in the worst way possible.

Or, if other parts of your code permit it, try returning a 64-bit value instead of passing an output parameter each time.
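A sketch of that 64-bit-return variant, assuming the callers can be changed (the name `hash64` and the packing order are my own choices, not from the original code):

```c
#include <stdint.h>

/* Same computation as hash(), but the two outputs come back packed in
   one 64-bit value: out[1] in the high half, out[0] in the low half.
   This lets the partial sum for out[1] live in a register instead of
   being written to memory twice. */
static uint64_t hash64(const unsigned int in[2], const int table[12])
{
    unsigned int vv = (unsigned int)table[10] + in[0];
    unsigned int hi;

    vv *= table[0]; vv = (vv >> 16) | (vv << 16);
    vv *= table[1]; vv = (vv >> 16) | (vv << 16);
    vv *= table[2]; vv = (vv >> 16) | (vv << 16);
    vv *= table[3]; vv = (vv >> 16) | (vv << 16);
    vv *= table[4];
    hi = vv + (unsigned int)table[11];   /* keep out[1]'s sum in a register */

    vv += in[1];
    vv *= table[5]; vv = (vv >> 16) | (vv << 16);
    vv *= table[6]; vv = (vv >> 16) | (vv << 16);
    vv *= table[7]; vv = (vv >> 16) | (vv << 16);
    vv *= table[8]; vv = (vv >> 16) | (vv << 16);
    vv *= table[9];
    hi += vv;                            /* the deferred out[1] += vv */

    return ((uint64_t)hi << 32) | vv;    /* high = out[1], low = out[0] */
}
```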

The only optimization I see here is inlining, to skip the push/call/ret overhead:

__inline void hash(....


Even VS2005 will inline the function and use register returns, so I doubt a straightforward port to ASM would help.

From a naive reading of some optimization guides, it appears that I could do

    pmaddwd            (8 clocks)
    punpck??/pshufw    (2 clocks)
    padd               (2 clocks)
    pand               (2 clocks)

in less time than

    imul r, mem    (16 clocks)
    rol  r, 16     (4 clocks)

with the bonus that with MMX, all the operands would be in registers. I don't know if I can assume 90% of all processors have PSHUFW at this point?

It appears that given two 32-bit integers A and B, each broken into 16-bit halves (Ahi, Alo and Bhi, Blo),

    pmaddwd <Ahi><Alo><0><Ahi>, <Blo><Bhi><0><Blo>

followed by a rearrangement and an add has the same numeric result.
