My first solution was quite similar to this; it also used a rotational approach.
And yes, it was slowww.


I think the rcr/rcl are the main culprit.
Posted on 2009-11-24 11:21:16 by Scali
In this version it's not very fast, not even if you consider that it actually does two numbers instead of one. Even if you take half the measurement, it's still not with the fastest routines.

True. MMX is not supposed to be used in 'single-iteration' algorithms but in stream processing. The most costly thing, I guess, is the actual extraction of the return value. "movd" is kinda slow, IIRC. Please confirm, if possible.
Posted on 2009-11-24 13:10:10 by ti_mo_n
I think both the movd and the emms required will cause some overhead.
Regardless, I've tried this alternative MMX algo:
multiplier3 dq  0080000100800001h

movd mm0, dword ptr
pxor mm1, mm1
pand mm0,
punpcklbw mm0, mm1
pmaddwd mm0,
movq mm1, mm0
psrlq mm0, 18
paddd mm0, mm1
movd eax, mm0


I get:
test14-Scali MMX              ...000655 ticks (effective 19.650 clk/iteration)

So still not great, but it's got potential :)
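
For anyone following along without an MMX reference at hand, here is a scalar C model of what the routine above appears to do. It is only a sketch: the pand mask is elided in the post, so 7F7F7F7Fh is an assumption, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Scalar model of the punpcklbw / pmaddwd / psrlq / paddd sequence.
   Assumption: the elided pand mask is 7F7F7F7Fh (7 bits kept per byte). */
uint32_t pack28_pmaddwd_model(uint32_t x)
{
    x &= 0x7F7F7F7Fu;                     /* pand mm0, mask (assumed value) */

    /* punpcklbw mm0, mm1 (mm1 = 0): each byte becomes a 16-bit word */
    uint32_t w0 = x & 0xFFu;
    uint32_t w1 = (x >> 8) & 0xFFu;
    uint32_t w2 = (x >> 16) & 0xFFu;
    uint32_t w3 = x >> 24;

    /* pmaddwd with the words 0001h, 0080h, 0001h, 0080h of multiplier3 */
    uint32_t d0 = w0 * 0x0001u + w1 * 0x0080u;   /* 14-bit low half  */
    uint32_t d1 = w2 * 0x0001u + w3 * 0x0080u;   /* 14-bit high half */

    /* psrlq mm0, 18: a 64-bit shift that moves d1 down to bit 14
       of the low dword (32 - 18 = 14); d0 >> 18 is zero */
    uint64_t q = ((uint64_t)d1 << 32) | d0;
    uint32_t shifted_low = (uint32_t)(q >> 18);

    /* paddd mm0, mm1 followed by movd eax, mm0: low dwords are added */
    return shifted_low + d0;
}
```

The 18-bit shift is the interesting part: because each pmaddwd result only occupies 14 bits, one 64-bit shift is enough to line the high half up directly above the low half.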
Posted on 2009-11-24 16:02:55 by Scali
Here's an even better one... Figured out a way to get around the limitations of pmaddubsw:
multiplier1	dq	8001800180018001h
multiplier2 dq 4000000140000001h

movd mm0, dword ptr
movq mm1,
pand mm0,
pmaddubsw mm1, mm0
pmaddwd mm1,
movd eax, mm1


test14-Scali MMX              ...000608 ticks (effective 18.240 clk/iteration)

The high part of mm1 is ignored in the output in this case, but theoretically a second number could be in there, if you load a qword rather than a dword at the start of the routine.
I think if you start unrolling this MMX routine, it could become a real winner, as it doesn't need a lot of registers to process the numbers. You can unroll it 4 times pretty easily.
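
A scalar C model of the pmaddubsw/pmaddwd pair may make the trick easier to see. This is a sketch under assumptions: the pand mask is elided above, so 7F7F7F7Fh is assumed, and the function names are hypothetical. pmaddubsw treats its destination bytes as unsigned and its source bytes as signed, and saturates each 16-bit sum. Keeping the multiplier (with its 80h bytes) in the destination and the masked 7-bit data in the source avoids both problems, which is presumably the limitation being worked around.

```c
#include <stdint.h>

/* Signed 16-bit saturation, as applied by pmaddubsw to each sum. */
static int16_t sat16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Scalar model of: pand / pmaddubsw mm1, mm0 / pmaddwd mm1, multiplier2.
   Assumption: the elided pand mask is 7F7F7F7Fh. */
uint32_t pack28_pmaddubsw_model(uint32_t x)
{
    /* low four bytes of multiplier1, read as UNSIGNED (destination operand) */
    static const uint8_t mul1[4] = { 0x01, 0x80, 0x01, 0x80 };
    int8_t b[4];
    int i;

    x &= 0x7F7F7F7Fu;
    for (i = 0; i < 4; i++)               /* data bytes, read as SIGNED */
        b[i] = (int8_t)((x >> (8 * i)) & 0xFFu);

    /* pmaddubsw: unsigned * signed, adjacent pairs summed with saturation.
       The largest possible sum is 127 + 128 * 127 = 16383, so no clamping. */
    int16_t w0 = sat16(mul1[0] * b[0] + mul1[1] * b[1]);
    int16_t w1 = sat16(mul1[2] * b[2] + mul1[3] * b[3]);

    /* pmaddwd with the words 0001h, 4000h of multiplier2 */
    int32_t d0 = (int32_t)w0 * 0x0001 + (int32_t)w1 * 0x4000;

    return (uint32_t)d0;                  /* movd eax, mm1 */
}
```

If the 80h bytes sat in the signed operand instead, they would be read as -128 and the sums would come out negative.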
Posted on 2009-11-24 16:56:02 by Scali
I just answered my own problem...
emms causes the overhead. Why do we need emms? Because MMX shares the FPU stack.
What does SSE2 do? It allows us to use the dedicated SSE registers with regular MMX operations.

So I made this:
	movd		xmm0, dword ptr 
movdqa xmm2, xmmword ptr
movdqa xmm1, xmmword ptr
movdqa xmm3, xmmword ptr
pand xmm0, xmm2
pmaddubsw xmm1, xmm0
pmaddwd xmm1, xmm3
movd eax, xmm1


No more need for emms, which cuts lots of overhead...
I also found a small mistake... I accidentally had two emms instructions, so it was slowed down even more. I've fixed that now, and MMX doesn't look *that* disastrous anymore:
test01-Scali1                 ...000297 ticks (effective 8.910 clk/iteration)
test02-Scali2                 ...000265 ticks (effective 7.950 clk/iteration)
test03-Scali3                 ...000281 ticks (effective 8.430 clk/iteration)
test04-sysfce2-1              ...000515 ticks (effective 15.450 clk/iteration)
test05-Ultrano                ...000359 ticks (effective 10.770 clk/iteration)
test06-ANSI C 1               ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2               ...000297 ticks (effective 8.910 clk/iteration)
test08-ANSI C 2 handoptimized ...000280 ticks (effective 8.400 clk/iteration)
test09-r22 LUT                ...000671 ticks (effective 20.130 clk/iteration)
test10-drizz                  ...000265 ticks (effective 7.950 clk/iteration)
test11-sysfce2-2              ...000296 ticks (effective 8.880 clk/iteration)
test12-sysfce2-3              ...000920 ticks (effective 27.600 clk/iteration)
test13-ti_mo_n                ...000484 ticks (effective 14.520 clk/iteration)
test14-Scali MMX              ...000390 ticks (effective 11.700 clk/iteration)
test15-Scali SSE2             ...000265 ticks (effective 7.950 clk/iteration)

It is already the fastest solution, and that is only counting one solution. In reality it is calculating 4 solutions in parallel (the registers are 128 bits wide now, instead of MMX's 64 bits, which gave us two solutions).
So technically we only spend about 2 clks on a single solution (7.950 / 4 ≈ 1.99).

I've also tried the MMX code without any emms at all. It was about as fast as the SSE2 then, but obviously there were exceptions in the float code later.

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-24 17:46:03 by Scali
Look Up Table
It wasn't the MOVZX reg,  that was slowing the look up table solution down,
it was the MOVZX reg, word ptr that caused the giant penalty.

I think the LUT can be a little more competitive now...

align 16
Int32ToInt28:
       mov     eax, dword
       movzx   ecx, ax
       shr     eax, 16
       movzx   ecx, word
       movzx   eax, word
       shl     eax, 14
       or      eax, ecx
       ret     4



I figured a Multiply + Correction solution would be the fastest.

I've noticed that with SSE2, using the xmmreg, mem form of an instruction can sometimes be faster than loading the value into another register first and then using the xmmreg, xmmreg form. You might want to try this for the SSE2 version.
Posted on 2009-11-24 20:52:03 by r22
That is indeed MUCH better, r22:
test01-Scali1                 ...000296 ticks (effective 8.880 clk/iteration)
test02-Scali2                 ...000281 ticks (effective 8.430 clk/iteration)
test03-Scali3                 ...000280 ticks (effective 8.400 clk/iteration)
test04-sysfce2-1              ...000515 ticks (effective 15.450 clk/iteration)
test05-Ultrano                ...000344 ticks (effective 10.320 clk/iteration)
test06-ANSI C 1               ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2               ...000281 ticks (effective 8.430 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT                ...000234 ticks (effective 7.020 clk/iteration)
test10-drizz                  ...000265 ticks (effective 7.950 clk/iteration)
test11-sysfce2-2              ...000281 ticks (effective 8.430 clk/iteration)
test12-sysfce2-3              ...000920 ticks (effective 27.600 clk/iteration)
test13-ti_mo_n                ...000483 ticks (effective 14.490 clk/iteration)
test14-Scali MMX              ...000406 ticks (effective 12.180 clk/iteration)
test15-Scali SSE2             ...000234 ticks (effective 7.020 clk/iteration)

Yours is now the fastest, together with the SSE2 code (and theoretically the MMX code if you divide the measurement by 2 since it calcs two numbers).

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-25 04:25:09 by Scali
Here are some results for the Pentium 4:
test01-Scali1                ...000422 ticks (effective 16.028 clk/iteration)
test02-Scali2                ...000391 ticks (effective 14.850 clk/iteration)
test03-Scali3                ...000610 ticks (effective 23.168 clk/iteration)
test04-sysfce2-1              ...000610 ticks (effective 23.168 clk/iteration)
test05-Ultrano                ...000469 ticks (effective 17.813 clk/iteration)
test06-ANSI C 1              ...000390 ticks (effective 14.812 clk/iteration)
test07-ANSI C 2              ...000406 ticks (effective 15.420 clk/iteration)
test08-ANSI C 2 handoptimized ...000390 ticks (effective 14.812 clk/iteration)
test09-r22 LUT                ...000375 ticks (effective 14.243 clk/iteration)
test10-drizz                  ...000421 ticks (effective 15.990 clk/iteration)
test11-sysfce2-2              ...000484 ticks (effective 18.382 clk/iteration)
test12-sysfce2-3              ...001297 ticks (effective 49.260 clk/iteration)
test13-ti_mo_n                ...000718 ticks (effective 27.270 clk/iteration)
test14-Scali MMX              ...000734 ticks (effective 27.877 clk/iteration)
test15-Scali SSE2            ...000484 ticks (effective 18.382 clk/iteration)

It would appear that I have cheated a bit. I used pmaddubsw, which isn't an MMX or SSE2 instruction; it is SSSE3, so only Core2 Duo and newer.
For these Pentium 4 results, I had to rewrite the algo a bit to avoid the pmaddubsw, making it a tad slower.
r22's LUT appears to be the best performer on Pentium 4 as well... Gotta hand it to Intel, they sure know how to make caches. I didn't expect the CPUs to be this fast... The test cases are all random, so you need to rely on the whole table, which won't fit in L1 cache (especially not on Pentium 4, this model only has 16KB L1 cache). Apparently the L2 prefetching is so good that you barely notice.
Posted on 2009-11-25 05:29:13 by Scali
Brute-force gives well thought-out and optimized solutions a kick in the shin :D
If you wanted to get crazy with wasting memory you could have 2 LUTs and the second would be 65536 Dwords ?????000h so you could avoid the SHL reg, 14 for the high half.
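
In C terms the two-table idea could look roughly like the sketch below. The table names, the init function and the choice of a word-sized low table are assumptions made for illustration, not the layout used in the benchmark.

```c
#include <stdint.h>

/* Hypothetical "mega-LUT" layout: one table per 16-bit half.
   lut_lo holds the packed 14-bit value, lut_hi holds the same value
   already shifted left by 14, so the lookup needs no SHL. */
static uint16_t lut_lo[65536];   /* 128 KB */
static uint32_t lut_hi[65536];   /* 256 KB */

void mega_lut_init(void)
{
    uint32_t i;
    for (i = 0; i < 65536u; i++) {
        /* keep 7 bits of each byte and close the one-bit gap */
        uint32_t v = (i & 0x7Fu) | ((i >> 1) & 0x3F80u);
        lut_lo[i] = (uint16_t)v;
        lut_hi[i] = v << 14;
    }
}

uint32_t pack28_mega_lut(uint32_t x)
{
    return lut_hi[x >> 16] | lut_lo[x & 0xFFFFu];
}
```

Call mega_lut_init() once before the first lookup. The trade is one shift per call against 256 KB of extra table, so whether it wins depends on how much of both tables stays in cache.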

That stack alignment penalty is really large, everyone should take note of that.
Word sized reads from stack memory need to be aligned on 4 bytes or you'll take a heck of a hit in performance.

BAD
movzx eax, word
movzx ecx, word

GOOD
mov eax, dword
movzx ecx, ax
shr eax, 16
Posted on 2009-11-25 07:42:39 by r22
I don't think the mega-LUT really works... The L2 cache is good, but not THAT good:
test01-Scali1                ...000296 ticks (effective 8.880 clk/iteration)
test02-Scali2                ...000265 ticks (effective 7.950 clk/iteration)
test03-Scali3                ...000280 ticks (effective 8.400 clk/iteration)
test04-sysfce2-1              ...000515 ticks (effective 15.450 clk/iteration)
test05-Ultrano                ...000359 ticks (effective 10.770 clk/iteration)
test06-ANSI C 1              ...000327 ticks (effective 9.810 clk/iteration)
test07-ANSI C 2              ...000297 ticks (effective 8.910 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT                ...000234 ticks (effective 7.020 clk/iteration)
test10-drizz                  ...000281 ticks (effective 8.430 clk/iteration)
test11-sysfce2-2              ...000296 ticks (effective 8.880 clk/iteration)
test12-sysfce2-3              ...000905 ticks (effective 27.150 clk/iteration)
test13-ti_mo_n                ...000484 ticks (effective 14.520 clk/iteration)
test14-Scali MMX              ...000437 ticks (effective 13.110 clk/iteration)
test15-Scali SSE2            ...000265 ticks (effective 7.950 clk/iteration)
test16-Scali MMX+SSSE3        ...000437 ticks (effective 13.110 clk/iteration)
test17-Scali SSSE3            ...000266 ticks (effective 7.980 clk/iteration)
test18-r22 mega-LUT          ...000265 ticks (effective 7.950 clk/iteration)

Of course another issue is the caching itself. In this case we just run tons of iterations of the same code, so the LUT remains in cache all the time. If you only need to call the function every now and then, and the cache contains other data in the meantime, the other options will probably be more attractive. They are better 'one shot' functions so to say, since you won't get a penalty for a cache-miss.
Posted on 2009-11-25 08:12:00 by Scali
Should be faster: :)

option prologue:none
option epilogue:none
align 16
C32to28a proc num:dword
pop  ecx
mov  edx, 7Fh
pop  eax
mov  ecx, 7F00h
and  edx, eax
and  ecx, eax
lea  edx,
mov  ecx, 7F0000h  
and  ecx, eax 
and  eax, 7F000000h
lea  edx,
lea  eax,
shr  eax, 3
jmp  dword ptr
C32to28a endp
;
align 16
C32to28b proc   num:dword
pop   ecx
pop   eax
pshufd   xmm1, oword ptr , 0E4h
movd   xmm0, eax
pand   xmm0, oword ptr
pmaddubsw xmm1, xmm0
pmaddwd  xmm1, oword ptr
movd   eax,  xmm1
jmp   ecx
C32to28b endp
option prologue:prologuedef
option epilogue:epiloguedef
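
The lea operands in the first routine did not survive in the listing, so the following C sketch is only a guess at the idea behind it: mask each 7-bit field in place, scale the fields so they end up contiguous starting at bit 3, then drop the three spare bits with the final shr. The scale factors (8, 4, 2, 1) and the function name are assumptions, chosen because they are what lea can encode and because they fit the shr eax, 3 at the end.

```c
#include <stdint.h>

/* Mask-and-scale sketch. Assumed lea scales: 8, 4, 2 and 1. */
uint32_t pack28_scaled(uint32_t x)
{
    uint32_t b0 = x & 0x0000007Fu;   /* bits  0..6  */
    uint32_t b1 = x & 0x00007F00u;   /* bits  8..14 */
    uint32_t b2 = x & 0x007F0000u;   /* bits 16..22 */
    uint32_t b3 = x & 0x7F000000u;   /* bits 24..30 */

    uint32_t acc = b0 * 8 + b1 * 4;  /* b0 -> bits 3..9,  b1 -> bits 10..16 */
    acc += b2 * 2;                   /* b2 -> bits 17..23                   */
    acc += b3;                       /* b3 stays at bits 24..30             */

    return acc >> 3;                 /* shr eax, 3 */
}
```

Scaling up and shifting down once at the end means every field moves with a multiply lea can do for free, instead of needing three separate right shifts.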

Posted on 2009-11-25 12:45:37 by lingo12
The mega-LUT seems to work better when both tables are dwords. Still it's no faster than the single LUT, but it's tied now.
lingo12's solution is also equally fast...

Edit: sorry lingo12, I didn't see that you put two routines in your post. There was a scrollbar, but I didn't notice it because the first routine fit exactly into the code block.
I've added the SSSE3 routine as well:
test17-Scali SSSE3            ...000234 ticks (effective 7.020 clk/iteration)
test18-r22 mega-LUT          ...000234 ticks (effective 7.020 clk/iteration)
test19-lingo12                ...000234 ticks (effective 7.020 clk/iteration)
test20-lingo12 SSSE3          ...000234 ticks (effective 7.020 clk/iteration)
test09-r22 LUT                ...000249 ticks (effective 7.470 clk/iteration)
test10-drizz                  ...000265 ticks (effective 7.950 clk/iteration)
test02-Scali2                ...000266 ticks (effective 7.980 clk/iteration)
test03-Scali3                ...000281 ticks (effective 8.430 clk/iteration)
test08-ANSI C 2 handoptimized ...000281 ticks (effective 8.430 clk/iteration)
test15-Scali SSE2            ...000281 ticks (effective 8.430 clk/iteration)
test01-Scali1                ...000296 ticks (effective 8.880 clk/iteration)
test11-sysfce2-2              ...000296 ticks (effective 8.880 clk/iteration)
test07-ANSI C 2              ...000297 ticks (effective 8.910 clk/iteration)
test06-ANSI C 1              ...000312 ticks (effective 9.360 clk/iteration)
test05-Ultrano                ...000359 ticks (effective 10.770 clk/iteration)
test16-Scali MMX+SSSE3        ...000405 ticks (effective 12.150 clk/iteration)
test14-Scali MMX              ...000437 ticks (effective 13.110 clk/iteration)
test13-ti_mo_n                ...000484 ticks (effective 14.520 clk/iteration)
test04-sysfce2-1              ...000514 ticks (effective 15.420 clk/iteration)
test12-sysfce2-3              ...000920 ticks (effective 27.600 clk/iteration)

Apparently it just doesn't get faster than 7.020 clks... Quite a few routines get that result now.
Posted on 2009-11-25 16:09:09 by Scali
Well, it's been quiet for a few days...
I'll upload the latest yodel package. If nobody has anything to add, I suppose we can just run the benchmarks on a few machines and call out winners in every category... or something :)

Here's the latest version on Pentium 4:
test09-r22 LUT                ...000359 ticks (effective 13.635 clk/iteration)
test18-r22 mega-LUT          ...000375 ticks (effective 14.243 clk/iteration)
test02-Scali2                ...000391 ticks (effective 14.850 clk/iteration)
test06-ANSI C 1              ...000391 ticks (effective 14.850 clk/iteration)
test07-ANSI C 2              ...000391 ticks (effective 14.850 clk/iteration)
test04-sysfce2-1              ...000406 ticks (effective 15.420 clk/iteration)
test08-ANSI C 2 handoptimized ...000406 ticks (effective 15.420 clk/iteration)
test03-Scali3                ...000421 ticks (effective 15.990 clk/iteration)
test19-lingo12                ...000421 ticks (effective 15.990 clk/iteration)
test10-drizz                  ...000437 ticks (effective 16.597 clk/iteration)
test01-Scali1                ...000438 ticks (effective -1.#IO clk/iteration)
test15-Scali SSE2            ...000468 ticks (effective 17.775 clk/iteration)
test11-sysfce2-2              ...000484 ticks (effective 18.382 clk/iteration)
test05-Ultrano                ...000516 ticks (effective 19.598 clk/iteration)
test13-ti_mo_n                ...000719 ticks (effective 27.308 clk/iteration)
test14-Scali MMX              ...000734 ticks (effective 27.877 clk/iteration)
test12-sysfce2-3              ...001297 ticks (effective 49.260 clk/iteration)
Posted on 2009-11-30 09:17:17 by Scali
And here are the results on my old Athlon XP1800+:
test14-Scali MMX              ...000640 ticks (effective 9.882 clk/iteration)
test18-r22 mega-LUT          ...000661 ticks (effective 10.206 clk/iteration)
test09-r22 LUT                ...000671 ticks (effective 10.360 clk/iteration)
test04-sysfce2-1              ...000711 ticks (effective 10.978 clk/iteration)
test02-Scali2                ...000771 ticks (effective 11.904 clk/iteration)
test07-ANSI C 2              ...000771 ticks (effective 11.904 clk/iteration)
test08-ANSI C 2 handoptimized ...000771 ticks (effective 11.904 clk/iteration)
test10-drizz                  ...000771 ticks (effective 11.904 clk/iteration)
test19-lingo12                ...000771 ticks (effective 11.904 clk/iteration)
test01-Scali1                ...000841 ticks (effective -1.#IO clk/iteration)
test03-Scali3                ...000841 ticks (effective 12.985 clk/iteration)
test13-ti_mo_n                ...000841 ticks (effective 12.985 clk/iteration)
test11-sysfce2-2              ...000842 ticks (effective 13.000 clk/iteration)
test06-ANSI C 1              ...000901 ticks (effective 13.911 clk/iteration)
test05-Ultrano                ...001042 ticks (effective 16.088 clk/iteration)
test12-sysfce2-3              ...001101 ticks (effective 16.999 clk/iteration)

I suppose the conclusion so far is something like this:
Core2 Duo:
- Plain:
1) lingo12
2) r22
3) drizz

- MMX/SSE:
1) lingo12/Scali SSSE3
2) Scali SSE2
3) Scali MMX+SSSE3

Pentium 4:
- Plain:
1) r22
2) Scali2/C compiler
3) sysfce2/C compiler

- MMX/SSE:
1) Scali SSE2
2) ti_mo_n
3) Scali MMX

Athlon XP:
- Plain:
1) r22
2) sysfce2-1
3) lingo12/Scali2/C compiler/drizz

- MMX/SSE:
1) Scali MMX
2) ti_mo_n

Well, the results are quite interesting at any rate... Some routines are quite consistent across these different architectures... Some are great on one architecture and horrible on another. You can also see quite clearly that the Athlon has an improved emms instruction. The MMX doesn't have the overhead that it has on Pentium 4 (and I believe you'll find much the same overhead with Pentium 2/3). They already had an 'femms' for that purpose on the K6, but on Athlon they made emms as fast as femms. The result: suddenly the MMX routine is at the top, rather than at the bottom of the list.
The C compiler didn't even do all that badly, it managed to get into the top 3.
The routines that perform best are the LUT and the 'packed add' routine.
Posted on 2009-11-30 13:05:45 by Scali
Oh, and just for kicks, I figured I'd write a C version of the LUT...
The compiler came up with the 'perfect' code:
test21-ANSI C LUT            ...000234 ticks (effective 7.020 clk/iteration)

_text:0002003C                 mov     ecx, 
_text:00020040                mov    eax, ecx
_text:00020042                shr    eax, 10h
_text:00020045                movzx  eax, ds:_LUT
_text:0002004D                movzx  ecx, cx
_text:00020050                movzx  edx, ds:_LUT
_text:00020058                shl    eax, 0Eh
_text:0002005B                or      eax, edx
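
A C LUT along these lines might look like the sketch below. It is not the source that produced the listing above: the table name, the 16-bit element type (guessed from the movzx loads) and the init function are assumptions.

```c
#include <stdint.h>

/* Hypothetical single-table LUT: 65536 word entries (128 KB), indexed
   by a 16-bit half of the input, each holding the packed 14-bit value. */
static uint16_t LUT[65536];

void lut_init(void)
{
    uint32_t i;
    for (i = 0; i < 65536u; i++)
        LUT[i] = (uint16_t)((i & 0x7Fu) | ((i >> 1) & 0x3F80u));
}

uint32_t pack28_lut(uint32_t x)
{
    uint32_t hi = LUT[x >> 16];       /* shr eax, 10h / movzx eax, LUT[...] */
    uint32_t lo = LUT[x & 0xFFFFu];   /* movzx ecx, cx / movzx edx, LUT[...] */
    return (hi << 14) | lo;           /* shl eax, 0Eh / or eax, edx */
}
```

lut_init() has to run once before the first call. The function body is only two loads, a shift and an or, which leaves the compiler very little room to do anything other than the sequence shown in the disassembly.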
Posted on 2009-11-30 15:36:44 by Scali
Here's what I get on my work Pentium 4 (2.8 GHz).

######## Yodel version 0.6, 2009/11/24, 11:00
## WARNING: unable to read coulomb.ini, using defaults
## Test parameters: 100000000 iterations of 2048 muls, total 100000000 muls
## Boosting priority: your computer might appear frozen - don't panic.
## Retrieving (NT) or calculating (9x) clockspeed...2793 MHz
/] running conformance tests
test01-Scali1                ...conforms
test02-Scali2                ...conforms
test03-Scali3                ...conforms
test04-sysfce2-1              ...conforms
test05-Ultrano                ...conforms
test06-ANSI C 1              ...conforms
test07-ANSI C 2              ...conforms
test08-ANSI C 2 handoptimized ...conforms
test09-r22 LUT                ...conforms
test10-drizz                  ...conforms
test11-sysfce2-2              ...conforms
test12-sysfce2-3              ...conforms
test13-ti_mo_n                ...conforms
test14-Scali MMX              ...conforms
test15-Scali SSE2            ...conforms
test16-Scali MMX+SSSE3        ...func uses unsupported instructions
test17-Scali SSSE3            ...func uses unsupported instructions
test18-r22 mega-LUT          ...conforms
test19-lingo12                ...conforms
test20-lingo12 SSSE3          ...func uses unsupported instructions
/] running performance tests
test01-Scali1                ...000547 ticks (effective -1.#IO clk/iteration)
test02-Scali2                ...000531 ticks (effective 14.831 clk/iteration)
test03-Scali3                ...000578 ticks (effective 16.144 clk/iteration)
test04-sysfce2-1              ...000562 ticks (effective 15.697 clk/iteration)
test05-Ultrano                ...000625 ticks (effective 17.456 clk/iteration)
test06-ANSI C 1              ...000546 ticks (effective 15.250 clk/iteration)
test07-ANSI C 2              ...000547 ticks (effective 15.278 clk/iteration)
test08-ANSI C 2 handoptimized ...000531 ticks (effective 14.831 clk/iteration)
test09-r22 LUT                ...000500 ticks (effective 13.965 clk/iteration)
test10-drizz                  ...000594 ticks (effective 16.590 clk/iteration)
test11-sysfce2-2              ...000656 ticks (effective 18.322 clk/iteration)
test12-sysfce2-3              ...001781 ticks (effective 49.743 clk/iteration)
test13-ti_mo_n                ...001078 ticks (effective 30.109 clk/iteration)
test14-Scali MMX              ...001000 ticks (effective 27.930 clk/iteration)
test15-Scali SSE2            ...000625 ticks (effective 17.456 clk/iteration)
test16-Scali MMX+SSSE3        ...doesn't conform(!)...skipping
test17-Scali SSSE3            ...doesn't conform(!)...skipping
test18-r22 mega-LUT          ...000515 ticks (effective 14.384 clk/iteration)
test19-lingo12                ...000578 ticks (effective 16.144 clk/iteration)
test20-lingo12 SSSE3          ...doesn't conform(!)...skipping


Just out of curiosity, how long does the shld version take?
I haven't been able to get it to compile (well, mostly link) as it seems I don't have the C libraries for it and I haven't had a chance to go through to rewrite those sections.
Posted on 2009-12-01 14:12:20 by sysfce2
Here's what I got on my home Pentium 4.

######## Yodel version 0.6, 2009/11/24, 11:00
## WARNING: unable to read coulomb.ini, using defaults
## Test parameters: 100000000 iterations of 2048 muls, total 100000000 muls
## Boosting priority: your computer might appear frozen - don't panic.
## Retrieving (NT) or calculating (9x) clockspeed...2400 MHz
/] running conformance tests
test01-Scali1                ...conforms
test02-Scali2                ...conforms
test03-Scali3                ...conforms
test04-sysfce2-1              ...conforms
test05-Ultrano                ...conforms
test06-ANSI C 1              ...conforms
test07-ANSI C 2              ...conforms
test08-ANSI C 2 handoptimized ...conforms
test09-r22 LUT                ...conforms
test10-drizz                  ...conforms
test11-sysfce2-2              ...conforms
test12-sysfce2-3              ...conforms
test13-ti_mo_n                ...conforms
test14-Scali MMX              ...conforms
test15-Scali SSE2            ...conforms
test16-Scali MMX+SSSE3        ...func uses unsupported instructions
test17-Scali SSSE3            ...func uses unsupported instructions
test18-r22 mega-LUT          ...conforms
test19-lingo12                ...conforms
test20-lingo12 SSSE3          ...func uses unsupported instructions
/] running performance tests
test01-Scali1                ...000766 ticks (effective -1.#IO clk/iteration)
test02-Scali2                ...000766 ticks (effective 18.384 clk/iteration)
test03-Scali3                ...000937 ticks (effective 22.488 clk/iteration)
test04-sysfce2-1              ...000765 ticks (effective 18.360 clk/iteration)
test05-Ultrano                ...000938 ticks (effective 22.512 clk/iteration)
test06-ANSI C 1              ...000766 ticks (effective 18.384 clk/iteration)
test07-ANSI C 2              ...000766 ticks (effective 18.384 clk/iteration)
test08-ANSI C 2 handoptimized ...000750 ticks (effective 18.000 clk/iteration)
test09-r22 LUT                ...000719 ticks (effective 17.256 clk/iteration)
test10-drizz                  ...000922 ticks (effective 22.128 clk/iteration)
test11-sysfce2-2              ...000953 ticks (effective 22.872 clk/iteration)
test12-sysfce2-3              ...001922 ticks (effective 46.128 clk/iteration)
test13-ti_mo_n                ...001454 ticks (effective 34.896 clk/iteration)
test14-Scali MMX              ...001281 ticks (effective 30.744 clk/iteration)
test15-Scali SSE2            ...000781 ticks (effective 18.744 clk/iteration)
test16-Scali MMX+SSSE3        ...doesn't conform(!)...skipping
test17-Scali SSSE3            ...doesn't conform(!)...skipping
test18-r22 mega-LUT          ...000687 ticks (effective 16.488 clk/iteration)
test19-lingo12                ...000750 ticks (effective 18.000 clk/iteration)
test20-lingo12 SSSE3          ...doesn't conform(!)...skipping


And on an AMD Athlon 2100+:

######## Yodel version 0.6, 2009/11/24, 11:00
## WARNING: unable to read coulomb.ini, using defaults
## Test parameters: 100000000 iterations of 2048 muls, total 100000000 muls
## Boosting priority: your computer might appear frozen - don't panic.
## Retrieving (NT) or calculating (9x) clockspeed...1734 MHz
/] running conformance tests
test01-Scali1                ...conforms
test02-Scali2                ...conforms
test03-Scali3                ...conforms
test04-sysfce2-1              ...conforms
test05-Ultrano                ...conforms
test06-ANSI C 1              ...conforms
test07-ANSI C 2              ...conforms
test08-ANSI C 2 handoptimized ...conforms
test09-r22 LUT                ...conforms
test10-drizz                  ...conforms
test11-sysfce2-2              ...conforms
test12-sysfce2-3              ...conforms
test13-ti_mo_n                ...conforms
test14-Scali MMX              ...conforms
test15-Scali SSE2            ...conforms
test16-Scali MMX+SSSE3        ...func uses unsupported instructions
test17-Scali SSSE3            ...func uses unsupported instructions
test18-r22 mega-LUT          ...conforms
test19-lingo12                ...conforms
test20-lingo12 SSSE3          ...func uses unsupported instructions
/] running performance tests
test01-Scali1                ...000761 ticks (effective -1.#IO clk/iteration)
test02-Scali2                ...000691 ticks (effective 11.982 clk/iteration)
test03-Scali3                ...000751 ticks (effective 13.022 clk/iteration)
test04-sysfce2-1              ...000641 ticks (effective 11.115 clk/iteration)
test05-Ultrano                ...000931 ticks (effective 16.144 clk/iteration)
test06-ANSI C 1              ...000811 ticks (effective 14.063 clk/iteration)
test07-ANSI C 2              ...000691 ticks (effective 11.982 clk/iteration)
test08-ANSI C 2 handoptimized ...000691 ticks (effective 11.982 clk/iteration)
test09-r22 LUT                ...000581 ticks (effective 10.075 clk/iteration)
test10-drizz                  ...000691 ticks (effective 11.982 clk/iteration)
test11-sysfce2-2              ...000751 ticks (effective 13.022 clk/iteration)
test12-sysfce2-3              ...000982 ticks (effective 17.028 clk/iteration)
test13-ti_mo_n                ...000751 ticks (effective 13.022 clk/iteration)
test14-Scali MMX              ...000581 ticks (effective 10.075 clk/iteration)
test15-Scali SSE2            ...000581 ticks (effective -1.#IO clk/iteration)
test16-Scali MMX+SSSE3        ...doesn't conform(!)...skipping
test17-Scali SSSE3            ...doesn't conform(!)...skipping
test18-r22 mega-LUT          ...000580 ticks (effective 10.057 clk/iteration)
test19-lingo12                ...000691 ticks (effective 11.982 clk/iteration)
test20-lingo12 SSSE3          ...doesn't conform(!)...skipping
Posted on 2009-12-01 21:53:07 by sysfce2
Here's another method:  I combined a little bit of the partial registers, addition, and drizz's thinking.

	add al,al
mov esi,00000000011111110000000000000000b
movzx edx,ax
and esi,eax
xor eax,esi
add eax,edx
lea eax,
and eax,01111111000000011111111111111000b
lea eax,
shr eax,3


I think it still needs a little tweaking, but at least it's working:  I've been trying to think of a better way to do the xor/add/lea in the middle - you guys have some ideas?  It works with some numbers without the xor, but once initial numbers with too many bits set are used it overflows.
Posted on 2009-12-01 22:04:28 by sysfce2
Oooh, that's a really interesting approach, I like it.
Posted on 2009-12-02 01:32:46 by Homer
This is the original version:  I was trying to figure out a way to eliminate a partial register.
	add al,al
mov edx,00000000011111111111111111111100b
add ax,ax
and edx,eax
and eax,01111111000000000000000000000000b
lea eax,
shr eax,3
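
To see why the two partial-register adds work, here is a scalar C model of the routine above. It is a sketch with one assumption: the lea operand is missing from the listing, so eax + edx*2 is a guess that fits the masks and the final shr. Variable names mirror the registers; the function name is made up.

```c
#include <stdint.h>

/* Scalar model of the add al,al / add ax,ax routine.
   Assumption: the elided lea is "lea eax, [eax + edx*2]". */
uint32_t pack28_partial_regs(uint32_t eax)
{
    uint32_t edx;

    /* add al,al: double the low byte only; its top bit falls off,
       so byte 0's field moves to bits 1..7 */
    eax = (eax & 0xFFFFFF00u) | (((eax & 0xFFu) * 2) & 0xFFu);

    /* add ax,ax: double the low word; byte 1's top bit falls off,
       leaving byte 0's field at bits 2..8 and byte 1's at bits 9..15 */
    eax = (eax & 0xFFFF0000u) | (((eax & 0xFFFFu) * 2) & 0xFFFFu);

    edx = eax & 0x007FFFFCu;   /* the three low fields, now contiguous at bits 2..22 */
    eax &= 0x7F000000u;        /* top field, still at bits 24..30 */

    eax = eax + edx * 2;       /* assumed lea: low fields move up to bits 3..23 */
    return eax >> 3;           /* shr eax,3 */
}
```

Each partial add closes one of the single-bit gaps between the 7-bit fields and throws away the unused top bit of that byte at the same time, which is why only one and per group of fields is needed afterwards.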

I guess I've got to try to get some timing code myself so I can try these out without relying on you guys (BTW, thanks a lot Scali for that).
Posted on 2009-12-02 19:44:23 by sysfce2