I was wondering why that CWDE is there - extending the sign bit does not contribute to the solution, especially given that we've already masked out that bit specifically, and given that we then go on to overwrite it!
And I'd be willing to bet that opcode is precisely what slows down that solution, too.
Wonder what the speed is like with that opcode omitted.

Posted on 2009-11-22 22:52:13 by Homer
Scali, what is your C2D? Such different results from mine.

E.g. ANSI C 1 is 6.56 cycles here, and the ratios are all inconsistent with your bench results.
Posted on 2009-11-23 00:59:31 by Ultrano
The CWDE is only there to clear the high-order bit.  I used it because the instruction is only one byte long and I wasn't sure if we were going for size or speed.  It can easily be replaced with an and:
EDIT:  I had assumed that the high-order nibble had to be cleared (for future ORing, etc.) - if that's not important, it can be omitted.  My crude benchmarking (i.e. repeated execution between two GetTickCounts) points toward about a 10% time penalty.

movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
shl eax,14
; and eax,0x0FFFFFFF
add eax,edx
retn
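
For reference, the kind of crude GetTickCount benchmarking mentioned in the edit above would look roughly like this in C (a sketch; the pack28 helper is just an illustrative stand-in, not the assembly above):

#include <windows.h>
#include <stdio.h>

/* Illustrative stand-in for the routine being timed: pack the low 7 bits
   of each byte of x into a contiguous 28-bit value. */
static unsigned int pack28(unsigned int x)
{
    return  (x        & 0x0000007Fu)
          | ((x >> 1) & 0x00003F80u)
          | ((x >> 2) & 0x001FC000u)
          | ((x >> 3) & 0x0FE00000u);
}

int main(void)
{
    volatile unsigned int sink = 0;  /* keeps the loop from being optimized away */
    unsigned int i;
    DWORD start, elapsed;

    start = GetTickCount();
    for (i = 0; i < 100000000; i++)  /* GetTickCount is coarse (roughly 10-16 ms), so loop a lot */
        sink ^= pack28(i);
    elapsed = GetTickCount() - start;

    printf("%lu ms total (sink=%u)\n", (unsigned long)elapsed, sink);
    return 0;
}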


I find it hard to believe that code would be even competitive with all the dependencies it has - I figured it might be size-competitive though, as I tried to avoid dword ands.

Here's another bit manipulation version with some ands:
shld edx,eax,31
shld esi,eax,30
shld edi,eax,29
and eax,0x0000007F
and edx,0x00003F80
and esi,0x001FC000
and edi,0x0FE00000
or eax,edx
or esi,edi
or eax,esi


I think, though, that shld might be kind of slow - it does appear to be faster than my first one, although slower than the rest of the competition ;-) - but I can't really test it for speed as I don't have a good benchmarking routine.  I might try porting Ultrano's to NASM though.
I'm also an asm newbie, so I'm learning a bunch just reading (and comparing to mine) all you guys' solutions - the completely different/elegant solutions are quite impressive.
Posted on 2009-11-23 01:02:42 by sysfce2

Scali, what is your C2D? Such different results from mine.

E.g. ANSI C 1 is 6.56 cycles here, and the ratios are all inconsistent with your bench results.


I have a Core2Duo E6600 running at 3 GHz.
The testing framework tests the routines as function calls, so there's calling overhead as well. That's probably why all the results are about 2 clks higher than yours.

I'll attach the framework as it is now (still messy), but if you run src/yodel.exe, you get the exact same stuff I was testing with.

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-23 05:23:42 by Scali
Results from my 1.5 GHz Core2 Duo laptop:
test01-Scali1                ...000609 ticks (effective 9.111 clk/iteration)
test02-Scali2                ...000562 ticks (effective 8.408 clk/iteration)
test03-Scali3                ...000702 ticks (effective 10.502 clk/iteration)
test04-sysfce2                ...001060 ticks (effective 15.858 clk/iteration)
test05-Ultrano                ...000702 ticks (effective 10.502 clk/iteration)
test06-ANSI C 1              ...000608 ticks (effective 9.096 clk/iteration)
test07-ANSI C 2              ...000655 ticks (effective 9.799 clk/iteration)

And my 1.6 GHz Celeron (Northwood) laptop:
test01-Scali1                ...001141 ticks (effective 18.188 clk/iteration)
test02-Scali2                ...001141 ticks (effective 18.188 clk/iteration)
test03-Scali3                ...001432 ticks (effective 22.826 clk/iteration)
test04-sysfce2                ...001142 ticks (effective 18.203 clk/iteration)
test05-Ultrano                ...001432 ticks (effective 22.826 clk/iteration)
test06-ANSI C 1              ...001142 ticks (effective 18.203 clk/iteration)
test07-ANSI C 2              ...001141 ticks (effective 18.188 clk/iteration)
Posted on 2009-11-23 08:45:35 by Scali
What strikes me most is that VS (and modern compilers in general) can very often produce code so good that it's actually hard to beat. "Optimized by hand in assembly" no longer means what it used to mean.
Posted on 2009-11-23 10:25:01 by ti_mo_n
That's only the case when the C++ coder knows how a CPU works, and how to optimize in asm :P
Posted on 2009-11-23 10:27:35 by Ultrano

What strikes me most is that VS (and modern compilers in general) can very often produce code so good that it's actually hard to beat. "Optimized by hand in assembly" no longer means what it used to mean.


Well, that's partly because of CPU design itself.
Since most programmers don't know how to code in asm anymore, most code is written in compiled languages. Ever since the Pentium, Intel has been working on optimizing only the 'common subset' of instructions that compilers use, rather than the more esoteric instructions that are mainly for hand-optimized assembly.
As a result, basic operations like add/sub/or/xor and all that have become faster, while 'weird' instructions like inc/dec/movsd/aaa and all that have been replaced with 'software'... they are basically being 'emulated' in microcode by generating a sequence of basic operations.

As a result, the code that compilers generate is always reasonably optimal. And the 'weird' instructions that assembly programmers once used to cut corners and optimize their code, have now often become slower than doing it the 'naive' way.

The 'trick' that I pulled here... whether that's an assembly trick is debatable. As you can see, I could easily implement the same trick in C. It doesn't take advantage of things like carry flags and such, which would only work in assembly and not in C. You need to know how basic adding and carrying works, but technically you don't have to know how to write asm - it's more of a 'mathematical' thing, number theory or something like that.
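
To make that concrete, here's a rough C sketch of that style of masked-add packing (my own rendering, and not necessarily identical to the ANSI C routines in the test suite), assuming the job is to squeeze the low 7 bits of each byte of a dword into a contiguous 28-bit value:

/* Work on all four 7-bit payloads in parallel, nudging the lower ones up
   one bit position per step, then drop the final <<3 with a single shift. */
static unsigned int pack28_add(unsigned int x)
{
    unsigned int a = x & 0x7F7F7F7Fu;   /* all four payloads        */
    unsigned int c = x & 0x007F7F7Fu;   /* the lower three payloads */

    a += c;                /* payloads 0..2 doubled (moved up one bit) */
    c += c;
    c &= 0x0000FEFEu;      /* keep payloads 0 and 1                    */
    a += c;                /* payloads 0 and 1 moved up one more bit   */
    c += c;
    c &= 0x000001FCu;      /* keep payload 0                           */
    a += c;                /* payload 0 moved up one more bit          */

    return a >> 3;         /* b0 | b1<<7 | b2<<14 | b3<<21             */
}

Nothing in there relies on the carry flag or anything asm-specific: it's just masks, adds and one final shift, which is exactly the kind of code a compiler digests well.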

Edit:
In fact, it was already like that about 6 years ago, when f0dder originally developed the yodel package for some other code. Here are the old threads if anyone is interested:
http://www.asmcommunity.net/board/index.php?topic=12764.0
http://www.asmcommunity.net/board/index.php?topic=12817.0
The short version: even back then, most assembly versions got beaten by the C compiler.
Posted on 2009-11-23 11:02:36 by Scali
When I looked at the second algo that the compiler generated, I noticed that if you swapped the usage of eax, ecx and edx around a bit, you could remove one instruction...
So I got to this:
mov    eax, [esp+4]    ; input dword (argument on the stack)
mov    ecx, eax
and    eax, 7F7F7F7Fh  ; all four 7-bit payloads
and    ecx, 7F7F7Fh    ; the lower three payloads
add    eax, ecx        ; payloads 0..2 doubled (moved up one bit)
add    ecx, ecx
and    ecx, 0FEFEh     ; keep payloads 0 and 1
add    eax, ecx        ; payloads 0 and 1 moved up one more bit
add    ecx, ecx
and    ecx, 1FCh       ; keep payload 0
add    eax, ecx        ; payload 0 moved up one more bit
shr    eax, 3          ; drop the <<3 to get the packed 28-bit value


Which is indeed faster. Same speed as my first implementation.

However, since I had the VS2010 beta installed as well, I figured I'd try that one.
It came up with this:
_text:00000000                 mov     eax, [esp+4]
_text:00000004                 mov     ecx, eax
_text:00000006                 and     eax, 7F7F7F7Fh
_text:0000000B                 and     ecx, 7F7F7Fh
_text:00000011                 lea     edx, [eax+ecx]
_text:00000014                 lea     eax, [ecx+ecx]
_text:00000017                 and     eax, 0FEFEh
_text:0000001C                 add     edx, eax
_text:0000001E                 add     eax, eax
_text:00000020                 and     eax, 1FCh
_text:00000025                 add     eax, edx
_text:00000027                 shr     eax, 3


This code is just as fast as the fine-tuned version I came up with (the first ANSI C routine resulted in exactly the same code as under VS2008):
test01-Scali1                ...000296 ticks (effective 8.880 clk/iteration)
test02-Scali2                ...000281 ticks (effective 8.430 clk/iteration)
test03-Scali3                ...000343 ticks (effective 10.290 clk/iteration)
test04-sysfce2                ...000530 ticks (effective 15.900 clk/iteration)
test05-Ultrano                ...000343 ticks (effective 10.290 clk/iteration)
test06-ANSI C 1              ...000296 ticks (effective 8.880 clk/iteration)
test07-ANSI C 2              ...000296 ticks (effective 8.880 clk/iteration)
test08-ANSI C 2 handoptimized ...000296 ticks (effective 8.880 clk/iteration)
Posted on 2009-11-23 13:23:11 by Scali
Has the Look Up Table method been tried?

MOVZX ecx, word [esp+6]    ; upper 16 bits of the input (assuming the input dword is on the stack)
MOVZX eax, word [esp+4]    ; lower 16 bits of the input
MOV cx, word [LUT+ecx*2]   ; translate the upper half through the table
MOV ax, word [LUT+eax*2]   ; translate the lower half through the table
SHL ecx, 14
OR eax, ecx

The 131 KB look-up table is definitely a space hog, but once the LUT is in the processor cache it would likely be the fastest method.

Generating the LUT would be some preprocessor/macro annoyance, but you wouldn't need to actually fill the LUT to perform the benchmark.
LUT:
REPEAT 65536
   dw ((% & 0x7F00) >> 1) | (% & 0x7F)
END REPEAT
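
In case the repeat-macro syntax above doesn't map directly onto your assembler, here's a rough C version of the same table and lookup (a sketch; the names are mine):

#include <stdio.h>

/* One 14-bit entry per possible 16-bit input half (two bytes carrying
   7 payload bits each), same contents as the macro above. */
static unsigned short lut[65536];

static void build_lut(void)
{
    unsigned int i;
    for (i = 0; i < 65536; i++)
        lut[i] = (unsigned short)(((i & 0x7F00) >> 1) | (i & 0x7F));
}

/* Table-driven pack: translate each 16-bit half, then splice the two
   14-bit results together. */
static unsigned int pack28_lut(unsigned int x)
{
    return (unsigned int)lut[x & 0xFFFF]
         | ((unsigned int)lut[x >> 16] << 14);
}

int main(void)
{
    build_lut();
    /* 0x7F7F7F7F carries payload 127 in every byte -> prints 0FFFFFFF */
    printf("%08X\n", pack28_lut(0x7F7F7F7F));
    return 0;
}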
Posted on 2009-11-23 14:18:33 by r22
I've converted your code to MASM and added it; here are the results:
test01-Scali1                 ...000312 ticks (effective 9.360 clk/iteration)
test02-Scali2                 ...000296 ticks (effective 8.880 clk/iteration)
test03-Scali3                 ...000359 ticks (effective 10.770 clk/iteration)
test04-sysfce2                ...000530 ticks (effective 15.900 clk/iteration)
test05-Ultrano                ...000343 ticks (effective 10.290 clk/iteration)
test06-ANSI C 1               ...000297 ticks (effective 8.910 clk/iteration)
test07-ANSI C 2               ...000297 ticks (effective 8.910 clk/iteration)
test08-ANSI C 2 handoptimized ...000281 ticks (effective 8.430 clk/iteration)
test09-r22 LUT                ...000624 ticks (effective 18.720 clk/iteration)

It seems the cache just isn't fast enough. I've also tried with 5 times as many iterations, but it was still considerably slower.

I've attached the latest yodel, with the extra tests.

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-23 14:48:10 by Scali
Impressively slow :)
I guess the underlying algorithm is already too compact to benefit from the 'brute force' solution.

I was also thinking of a MUL by magic number approach, but I don't see it being an improvement.

Posted on 2009-11-23 15:56:42 by r22
I think the main problem is that the LUT is too large: 65536 word entries is 128 KiB, which doesn't come close to fitting in a Core 2's 32 KiB L1 data cache, so you need to go to L2 a lot of the time.
If you could stay within L1, you'd have a lookup delay of about 3-4 cycles, depending on the exact CPU.
But L2 is more like 15 cycles or more.
Posted on 2009-11-23 16:01:56 by Scali
here's my try:
mov edx,00000000000000000000000001111111b
mov ecx,00000000000000000111111100000000b
mov ebx,00000000011111110000000000000000b
and edx,eax
and ecx,eax
and ebx,eax
and eax,01111111000000000000000000000000b
lea eax,[eax+ebx*2]   ; payload 2 up to bit 17
lea eax,[eax+ecx*4]   ; payload 1 up to bit 10
lea eax,[eax+edx*8]   ; payload 0 up to bit 3
shr eax,3
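
For comparison, the same mask-and-scale idea in rough C (a sketch; the function name is mine) - the scaled adds are what the lea instructions express:

/* Isolate each 7-bit payload in place, line the four groups up with
   scaled adds, then shift everything right by 3. */
static unsigned int pack28_scaled(unsigned int x)
{
    unsigned int b0 = x & 0x0000007Fu;  /* payload 0 at bit 0  */
    unsigned int b1 = x & 0x00007F00u;  /* payload 1 at bit 8  */
    unsigned int b2 = x & 0x007F0000u;  /* payload 2 at bit 16 */
    unsigned int b3 = x & 0x7F000000u;  /* payload 3 at bit 24 */

    /* b3 + b2*2 + b1*4 + b0*8 is the packed result shifted left by 3 */
    return (b3 + b2 * 2 + b1 * 4 + b0 * 8) >> 3;
}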
Posted on 2009-11-23 16:36:52 by drizz
Very good one, drizz:
test01-Scali1                ...000297 ticks (effective 8.910 clk/iteration)
test02-Scali2                ...000265 ticks (effective 7.950 clk/iteration)
test03-Scali3                ...000265 ticks (effective 7.950 clk/iteration)
test04-sysfce2                ...000515 ticks (effective 15.450 clk/iteration)
test05-Ultrano                ...000359 ticks (effective 10.770 clk/iteration)
test06-ANSI C 1              ...000328 ticks (effective 9.840 clk/iteration)
test07-ANSI C 2              ...000296 ticks (effective 8.880 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT                ...000671 ticks (effective 20.130 clk/iteration)
test10-drizz                  ...000266 ticks (effective 7.980 clk/iteration)

I've modified my third routine to use some of your ideas.
I think yours is technically the fastest routine: I had to add a push/pop of ebx to make it run in the yodel framework, and it's still as fast as the Scali2 routine. Without the push/pop it would be faster.
Posted on 2009-11-23 17:16:02 by Scali
Here's a slight variation on drizz's code:

lea edx,[eax*2]
lea esi,[eax*4]
lea edi,[eax*8]
and edx,00000000111111100000000000000000b
and esi,00000000000000011111110000000000b
and edi,00000000000000000000001111111000b
and eax,01111111000000000000000000000000b
or edx,esi
or eax,edi
or eax,edx
shr eax,3


I haven't yet got the yodel framework to compile - I'm using Pelle's C.  What compiler are you using, so I can try to tweak the settings?
Posted on 2009-11-24 00:02:49 by sysfce2
And just for fun, here's another one using rotate instructions (really slow):

mov edx,eax
rcr eax,8
mov esi,0xFE00007F
rcl edx,9
and eax,esi
and edx,esi
rol eax,7
ror edx,11
or eax,edx
Posted on 2009-11-24 00:06:31 by sysfce2
My first solution was quite similar to this; it also used a rotational approach.
And yes, it was slowww.



Posted on 2009-11-24 00:30:15 by Homer
I've added the two routines from sysfce2:
test01-Scali1                 ...000297 ticks (effective 8.910 clk/iteration)
test02-Scali2                 ...000280 ticks (effective 8.400 clk/iteration)
test03-Scali3                 ...000281 ticks (effective 8.430 clk/iteration)
test04-sysfce2-1              ...000531 ticks (effective 15.930 clk/iteration)
test05-Ultrano                ...000358 ticks (effective 10.740 clk/iteration)
test06-ANSI C 1               ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2               ...000280 ticks (effective 8.400 clk/iteration)
test08-ANSI C 2 handoptimized ...000265 ticks (effective 7.950 clk/iteration)
test09-r22 LUT                ...000671 ticks (effective 20.130 clk/iteration)
test10-drizz                  ...000265 ticks (effective 7.950 clk/iteration)
test11-sysfce2-2              ...000281 ticks (effective 8.430 clk/iteration)
test12-sysfce2-3              ...000921 ticks (effective 27.630 clk/iteration)

And I'll attach the updated yodel framework... I've also made the conformance test actually WORK this time, which meant I also fixed some minor bugs in some of the code, but with no impact on performance.

By the way, the yodel framework uses Microsoft's cl.exe and ml.exe. You can use the mk.bat file. It should be reasonably easy to modify it for other compilers (there's also some stuff in there for Intel's icl.exe, and we used nasm instead of masm last time round; I just moved back to masm because I don't have nasm installed now).

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-24 04:05:35 by Scali
I've added ti_mo_n's mmx routine:
test01-Scali1                 ...000327 ticks (effective 9.810 clk/iteration)
test02-Scali2                 ...000281 ticks (effective 8.430 clk/iteration)
test03-Scali3                 ...000297 ticks (effective 8.910 clk/iteration)
test04-sysfce2-1              ...000577 ticks (effective 17.310 clk/iteration)
test05-Ultrano                ...000390 ticks (effective 11.700 clk/iteration)
test06-ANSI C 1               ...000359 ticks (effective 10.770 clk/iteration)
test07-ANSI C 2               ...000343 ticks (effective 10.290 clk/iteration)
test08-ANSI C 2 handoptimized ...000312 ticks (effective 9.360 clk/iteration)
test09-r22 LUT                ...000749 ticks (effective 22.470 clk/iteration)
test10-drizz                  ...000296 ticks (effective 8.880 clk/iteration)
test11-sysfce2-2              ...000327 ticks (effective 9.810 clk/iteration)
test12-sysfce2-3              ...001030 ticks (effective 30.900 clk/iteration)
test13-ti_mo_n                ...000796 ticks (effective 23.880 clk/iteration)

I had to modify it slightly to make it fit the format of the test-suite (input on stack, output in eax).
In this version it's not very fast, not even if you consider that it actually does two numbers instead of one. Even if you take half the measurement, it's still not among the fastest routines.

Edit: removed attachment, get the latest file elsewhere in the thread.
Posted on 2009-11-24 06:21:47 by Scali