Heh, fun to see my Yodel stuff being used again, last time I touched it was... April 2003, I think :)
Posted on 2009-12-03 03:40:49 by f0dder
Stunned at this reaction to such a simple problem!
Apparently, we're not being challenged enough.
Posted on 2009-12-03 04:47:40 by Homer
Well, I haven't done any actual asm in ages, so it was good fun.
The thing with this problem is that it *appears* simple, but current CPUs really don't have any adequate instructions for handling it. Even with SSSE3 you don't get *exactly* what you want. You can shift the individual bytes by multiplying, but that only works for left shifts... so you need to correct the final result with a right shift (unless of course you design the rest of your logic to also work with values that are shifted up by 3 bits, so it won't matter).
Also you still need two steps to get from bytes to dwords. You need to do an intermediate word step first.
Posted on 2009-12-03 05:53:10 by Scali
Well, my brother had problems with his computer, and decided to replace his CPU, motherboard and memory.
His old stuff wasn't really that old, it was a Core i7 860 with 8 GB of memory.
I got it cheap, and it was a nice upgrade from my trusty old Core2 Duo.
So far I haven't had any problems with it... The memory was a tad finicky: it's high-performance memory from Corsair, with heatsinks and all that, and it requires overvolting to run at the advertised speeds, which are also outside the spec of the Core i7's memory controller. I think I have found stable settings, though.

Anyway, I wanted to see if the turbo was working correctly, and find out just how fast it is... so I ran this yodel stuff on it:
test21-ANSI C LUT             ...000202 ticks (effective 5.672 clk/iteration)
test09-r22 LUT                ...000203 ticks (effective 5.700 clk/iteration)
test18-r22 mega-LUT           ...000203 ticks (effective 5.700 clk/iteration)
test19-lingo12                ...000203 ticks (effective 5.700 clk/iteration)
test20-lingo12 SSSE3          ...000203 ticks (effective 5.700 clk/iteration)
test17-Scali SSSE3            ...000219 ticks (effective 6.150 clk/iteration)
test02-Scali2                 ...000234 ticks (effective 6.571 clk/iteration)
test07-ANSI C 2               ...000234 ticks (effective 6.571 clk/iteration)
test08-ANSI C 2 handoptimized ...000234 ticks (effective 6.571 clk/iteration)
test15-Scali SSE2             ...000234 ticks (effective 6.571 clk/iteration)
test10-drizz                  ...000249 ticks (effective 6.992 clk/iteration)
test03-Scali3                 ...000250 ticks (effective 7.020 clk/iteration)
test11-sysfce2-2              ...000265 ticks (effective 7.441 clk/iteration)
test06-ANSI C 1               ...000266 ticks (effective 7.469 clk/iteration)
test01-Scali1                 ...000281 ticks (effective 7.890 clk/iteration)
test05-Ultrano                ...000312 ticks (effective 8.761 clk/iteration)
test16-Scali MMX+SSSE3        ...000343 ticks (effective 9.631 clk/iteration)
test14-Scali MMX              ...000374 ticks (effective 10.502 clk/iteration)
test13-ti_mo_n                ...000436 ticks (effective 12.243 clk/iteration)
test04-sysfce2-1              ...000437 ticks (effective 12.271 clk/iteration)
test12-sysfce2-3              ...000889 ticks (effective 24.963 clk/iteration)


One problem though: the turbo means it is cheating!
Yodel requests the clockspeed of the CPU, which is reported as 2.8 GHz. This clockspeed is fixed though... Intel has made RDTSC tick at a fixed frequency, so neither downclocking (power saving modes) nor overclocking (turbo modes) will be reflected in RDTSC.
I have however checked with CPU-Z on the side, since that reports the actual clockspeed, and then you see the turbo kicking in, going up to 3.4 GHz.
So that means Yodel's cycle count estimates are off... and it's going to be quite difficult to get accurate readings when the clockspeed is variable. I suppose the best way is to disable both SpeedStep and turbo, so the CPU is fixed at its stock speed of 2.8 GHz.
Nevertheless, the absolute times in ticks seem pretty nice. 202 ticks for the fastest run, where my Core2 Duo at 3 GHz did 234 ticks.
It also seems that the LUT approach pulled a bit further ahead of the others, so I guess caching is more efficient on this CPU.

Edit: and here are some results of a friend's AMD quadcore:
test19-lingo12                ...000327 ticks (effective 7.871 clk/iteration)
test04-sysfce2-1              ...000359 ticks (effective 8.641 clk/iteration)
test14-Scali MMX              ...000344 ticks (effective 8.280 clk/iteration)
test10-drizz                  ...000359 ticks (effective 8.641 clk/iteration)
test01-Scali1                 ...000374 ticks (effective -1.#IO clk/iteration)
test07-ANSI C 2               ...000374 ticks (effective 9.002 clk/iteration)
test15-Scali SSE2             ...000375 ticks (effective 9.026 clk/iteration)
test02-Scali2                 ...000390 ticks (effective 9.387 clk/iteration)
test03-Scali3                 ...000405 ticks (effective 9.748 clk/iteration)
test06-ANSI C 1               ...000405 ticks (effective 9.748 clk/iteration)
test08-ANSI C 2 handoptimized ...000406 ticks (effective 9.772 clk/iteration)
test05-Ultrano                ...000437 ticks (effective 10.519 clk/iteration)
test13-ti_mo_n                ...000437 ticks (effective 10.519 clk/iteration)
test09-r22 LUT                ...000531 ticks (effective 12.781 clk/iteration)
test11-sysfce2-2              ...000546 ticks (effective 13.142 clk/iteration)
test18-r22 mega-LUT           ...000546 ticks (effective 13.142 clk/iteration)
test12-sysfce2-3              ...000592 ticks (effective 14.249 clk/iteration)
Posted on 2011-08-14 15:27:31 by Scali