I was wondering why that CWDE is there - extending the sign bit does not contribute to the solution, especially given that we've already masked out that bit specifically, and given that we then go on to overwrite it!

And I'd be willing to bet that opcode is precisely what slows down that solution, too.

Wonder what the speed is like with that opcode omitted.

Scali, what is your c2d? Such different results from mine.

I.e. ANSI C 1 is 6.56 cycles here, and the ratios are all inconsistent with your bench results.

The CWDE is only there to clear the high order bit. I used it as the instruction is only one byte long and I wasn't sure if we were going for size or speed. It can easily be replaced with an and:

EDIT: I had assumed that the high order nibble had to be cleared (for future ORing, etc.) - if not important, it can be omitted. My crude benchmarking (i.e. repeated execution between two GetTickCounts) points toward about a 10% time penalty.

```asm
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
shl eax,14
; and eax,0x0FFFFFFF
add eax,edx
retn
```

I find it hard to believe that code would be even competitive with all the dependencies it has - I figured it might be size-competitive though, as I tried to avoid dword ands.

Here's another bit manipulation version with some ands:

```asm
shld edx,eax,31
shld esi,eax,30
shld edi,eax,29
and eax,0x0000007F
and edx,0x00003F80
and esi,0x001FC000
and edi,0x0FE00000
or eax,edx
or esi,edi
or eax,esi
```

I think, though, that shld might be kind of slow - it does appear to be faster than my first one, although slower than the rest of the competition ;-). I can't really test it for speed as I don't have a good benchmarking routine. I might try porting Ultrano's to NASM though.

I'm also an asm newbie, so I'm learning a bunch just reading (and comparing to mine) all you guys' solutions - the completely different/elegant approaches are quite impressive.

> Scali, what is your c2d? Such different results from mine.
>
> I.e. ANSI C 1 is 6.56 cycles here, and ratios are all inconsistent with your bench-results.

I have a Core2Duo E6600 running at 3 GHz.

The testing framework tests the routines as function calls, so there's calling overhead as well. That's probably why all results are ~2 clks higher than yours.

I'll attach the framework as it is now, still messy, but if you run src/yodel.exe, you get the exact same stuff I was testing with.

Edit: removed attachment, get the latest file elsewhere in the thread.

Results from my 1.5 GHz Core2 Duo laptop:

```
test01-Scali1 ...000609 ticks (effective 9.111 clk/iteration)
test02-Scali2 ...000562 ticks (effective 8.408 clk/iteration)
test03-Scali3 ...000702 ticks (effective 10.502 clk/iteration)
test04-sysfce2 ...001060 ticks (effective 15.858 clk/iteration)
test05-Ultrano ...000702 ticks (effective 10.502 clk/iteration)
test06-ANSI C 1 ...000608 ticks (effective 9.096 clk/iteration)
test07-ANSI C 2 ...000655 ticks (effective 9.799 clk/iteration)
```

And my 1.6 GHz Celeron (Northwood) laptop:

```
test01-Scali1 ...001141 ticks (effective 18.188 clk/iteration)
test02-Scali2 ...001141 ticks (effective 18.188 clk/iteration)
test03-Scali3 ...001432 ticks (effective 22.826 clk/iteration)
test04-sysfce2 ...001142 ticks (effective 18.203 clk/iteration)
test05-Ultrano ...001432 ticks (effective 22.826 clk/iteration)
test06-ANSI C 1 ...001142 ticks (effective 18.203 clk/iteration)
test07-ANSI C 2 ...001141 ticks (effective 18.188 clk/iteration)
```

What strikes me most is that VS (and generally modern compilers) can very often produce code so good that it's actually hard to beat it. "Optimized by hand in assembly" no longer means what it used to mean.

This is only when the C++ coder knows how a cpu works, and how to optimize in asm :P .

> What strikes me most is that VS (and generally modern compilers) can very often produce code so good that it's actually hard to beat it. "Optimized by hand in assembly" no longer means what it used to mean.

Well, that's partly because of CPU design itself.

Since most programmers don't know how to code in asm anymore, most code is written in compiled languages. Ever since the Pentium, Intel has been working on optimizing only the 'common subset' of instructions that compilers use, rather than the more esoteric instructions that are mainly for hand-optimized assembly.

As a result, basic operations like add/sub/or/xor and all that have become faster, while 'weird' instructions like inc/dec/movsd/aaa and all that have been replaced with 'software'... they are basically being 'emulated' in microcode by generating a sequence of basic operations.

As a result, the code that compilers generate is always reasonably optimal. And the 'weird' instructions that assembly programmers once used to cut corners and optimize their code, have now often become slower than doing it the 'naive' way.

The 'trick' that I pulled here... whether that's an assembly trick is debatable. As you can see, I could easily implement the same trick in C. It doesn't take advantage of things like carry flags and such, which would only work in assembly, and not in C. You need to know how basic adding and carrying works, but technically you wouldn't have to know how to write asm, it's more of a 'mathematical' thing, number theory or something like that.

Edit:

In fact, it was already like that about 6 years ago, when f0dder originally developed the yodel package for some other code. Here are the old threads if anyone is interested:

http://www.asmcommunity.net/board/index.php?topic=12764.0

http://www.asmcommunity.net/board/index.php?topic=12817.0

The short version: even back then, most assembly versions got beaten by the C compiler.

When I looked at the second algo that the compiler generated, I noticed that if you swapped the usage of eax, ecx and edx around a bit, you could remove one instruction...

So I got to this:

```asm
mov eax,
mov ecx, eax
and eax, 7F7F7F7Fh
and ecx, 7F7F7Fh
add eax, ecx
add ecx, ecx
and ecx, 0FEFEh
add eax, ecx
add ecx, ecx
and ecx, 1FCh
add eax, ecx
shr eax, 3
```

Which is indeed faster. Same speed as my first implementation.

However, since I had the VS2010 beta installed as well, I figured I'd try that one.

It came up with this:

```asm
_text:00000000 mov eax,
_text:00000004 mov ecx, eax
_text:00000006 and eax, 7F7F7F7Fh
_text:0000000B and ecx, 7F7F7Fh
_text:00000011 lea edx,
_text:00000014 lea eax,
_text:00000017 and eax, 0FEFEh
_text:0000001C add edx, eax
_text:0000001E add eax, eax
_text:00000020 and eax, 1FCh
_text:00000025 add eax, edx
_text:00000027 shr eax, 3
```

This code is just as fast as the fine-tuned version I came up with (the first ANSI C routine resulted in the exact same code as VS2008):

```
test01-Scali1 ...000296 ticks (effective 8.880 clk/iteration)
test02-Scali2 ...000281 ticks (effective 8.430 clk/iteration)
test03-Scali3 ...000343 ticks (effective 10.290 clk/iteration)
test04-sysfce2 ...000530 ticks (effective 15.900 clk/iteration)
test05-Ultrano ...000343 ticks (effective 10.290 clk/iteration)
test06-ANSI C 1 ...000296 ticks (effective 8.880 clk/iteration)
test07-ANSI C 2 ...000296 ticks (effective 8.880 clk/iteration)
test08-ANSI C 2 handoptimized ...000296 ticks (effective 8.880 clk/iteration)
```

Has the Look Up Table method been tried?

The 131 KB look up table is definitely a space hog, but once the LUT is in the processor cache it would likely be the fastest method.

Generating the LUT would be some preprocessor/macro annoyance. But you wouldn't need to actually fill the LUT to perform the benchmark.

```asm
LUT:
REPEAT 65536
    dw ((% & 0x7F00) >> 1) | (% & 0x7F)
END REPEAT
```

```asm
MOVZX ecx, word
MOVZX eax, word
MOV cx, word
MOV ax, word
SHL ecx, 14
OR eax, ecx
```

I've converted your code to MASM and added it, here are the results:

```
test01-Scali1 ...000312 ticks (effective 9.360 clk/iteration)
test02-Scali2 ...000296 ticks (effective 8.880 clk/iteration)
test03-Scali3 ...000359 ticks (effective 10.770 clk/iteration)
test04-sysfce2 ...000530 ticks (effective 15.900 clk/iteration)
test05-Ultrano ...000343 ticks (effective 10.290 clk/iteration)
test06-ANSI C 1 ...000297 ticks (effective 8.910 clk/iteration)
test07-ANSI C 2 ...000297 ticks (effective 8.910 clk/iteration)
test08-ANSI C 2 handoptimized ...000281 ticks (effective 8.430 clk/iteration)
test09-r22 LUT ...000624 ticks (effective 18.720 clk/iteration)
```

It seems the cache just isn't fast enough. I've also tried with 5 times as many iterations, but still it was considerably slower.

I've attached the latest yodel, with the extra tests.

Edit: removed attachment, get the latest file elsewhere in the thread.

Impressively slow :)

I guess the underlying algorithm is already too compact to benefit from the 'brute force' solution.

I was also thinking of a MUL by magic number approach, but I don't see it being an improvement.

> I guess the underlying algorithm is already too compact to benefit from the 'brute force' solution.
>
> I was also thinking of a MUL by magic number approach, but I don't see it being an improvement.

I think the main problem is that the LUT is too large. You need to go to L2 a lot of the time.

If you could stay within L1, you'd have a lookup delay of about 3-4 cycles, depending on the exact CPU.

But L2 is more like 15 cycles or more.

> If you could stay within L1, you'd have a lookup delay of about 3-4 cycles, depending on the exact CPU.
>
> But L2 is more like 15 cycles or more.

here's my try:

```asm
mov edx,00000000000000000000000001111111b
mov ecx,00000000000000000111111100000000b
mov ebx,00000000011111110000000000000000b
and edx,eax
and ecx,eax
and ebx,eax
and eax,01111111000000000000000000000000b
lea eax,
lea eax,
lea eax,
shr eax,3
```

Very good one, drizz:

```
test01-Scali1 ...000297 ticks (effective 8.910 clk/iteration)
test02-Scali2 ...000265 ticks (effective 7.950 clk/iteration)
test03-Scali3 ...000265 ticks (effective 7.950 clk/iteration)
test04-sysfce2 ...000515 ticks (effective 15.450 clk/iteration)
test05-Ultrano ...000359 ticks (effective 10.770 clk/iteration)
test06-ANSI C 1 ...000328 ticks (effective 9.840 clk/iteration)
test07-ANSI C 2 ...000296 ticks (effective 8.880 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT ...000671 ticks (effective 20.130 clk/iteration)
test10-drizz ...000266 ticks (effective 7.980 clk/iteration)
```

I've modified my third routine to use some of your ideas.

I think yours is the fastest routine technically, since I had to add push/pop ebx to make it run from the yodel framework, and still it's as fast as the Scali2 routine. Without the push/pop it would be faster.

Here's a slight variation on drizz's code:

```asm
lea edx,
lea esi,
lea edi,
and edx,00000000111111100000000000000000b
and esi,00000000000000011111110000000000b
and edi,00000000000000000000001111111000b
and eax,01111111000000000000000000000000b
or edx,esi
or eax,edi
or eax,edx
shr eax,3
```

I haven't yet got the Yodel routine to compile - I'm using Pelle's C. What compiler are you using so I can try to tweak the settings?

And just for fun, here's another one using rotate instructions (really slow):

```asm
mov edx,eax
rcr eax,8
mov esi,0xFE00007F
rcl edx,9
and eax,esi
and edx,esi
rol eax,7
ror edx,11
or eax,edx
```

My first solution was quite similar to this, it also used a rotational approach.

And yes, it was slowww.

I've added the two routines from sysfce2:

```
test01-Scali1 ...000297 ticks (effective 8.910 clk/iteration)
test02-Scali2 ...000280 ticks (effective 8.400 clk/iteration)
test03-Scali3 ...000281 ticks (effective 8.430 clk/iteration)
test04-sysfce2-1 ...000531 ticks (effective 15.930 clk/iteration)
test05-Ultrano ...000358 ticks (effective 10.740 clk/iteration)
test06-ANSI C 1 ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2 ...000280 ticks (effective 8.400 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT ...000671 ticks (effective 20.130 clk/iteration)
test10-drizz ...000265 ticks (effective 7.950 clk/iteration)
test11-sysfce2-2 ...000281 ticks (effective 8.430 clk/iteration)
test12-sysfce2-3 ...000921 ticks (effective 27.630 clk/iteration)
```

And I'll attach the updated yodel routine... I've also made the conformance test actually WORK this time, which meant I also fixed some minor bugs in some code, but no impact on performance.

By the way, the yodel framework uses Microsoft's cl.exe and ml.exe. You can use the mk.bat file. Should be reasonably easy to modify it for other compilers (there's also some stuff in there for Intel's icl.exe, and we used nasm instead of masm last time round. I just moved back to masm because I don't have nasm installed now).

Edit: removed attachment, get the latest file elsewhere in the thread.

I've added ti_mo_n's mmx routine:

```
test01-Scali1 ...000327 ticks (effective 9.810 clk/iteration)
test02-Scali2 ...000281 ticks (effective 8.430 clk/iteration)
test03-Scali3 ...000297 ticks (effective 8.910 clk/iteration)
test04-sysfce2-1 ...000577 ticks (effective 17.310 clk/iteration)
test05-Ultrano ...000390 ticks (effective 11.700 clk/iteration)
test06-ANSI C 1 ...000359 ticks (effective 10.770 clk/iteration)
test07-ANSI C 2 ...000343 ticks (effective 10.290 clk/iteration)
test08-ANSI C 2 handoptimized ...000312 ticks (effective 9.360 clk/iteration)
test09-r22 LUT ...000749 ticks (effective 22.470 clk/iteration)
test10-drizz ...000296 ticks (effective 8.880 clk/iteration)
test11-sysfce2-2 ...000327 ticks (effective 9.810 clk/iteration)
test12-sysfce2-3 ...001030 ticks (effective 30.900 clk/iteration)
test13-ti_mo_n ...000796 ticks (effective 23.880 clk/iteration)
```

I had to modify it slightly to make it fit the format of the test-suite (input on stack, output in eax).

In this version it's not very fast, not even if you consider that it actually does two numbers instead of one. Even if you take half the measurement, it's still not among the fastest routines.

Edit: removed attachment, get the latest file elsewhere in the thread.
