You are given a 32-bit value in which the most significant bit of each byte (bit 7) must be ignored, thus encoding a 28-bit value. Write an optimal function to decode the value (strip the ignored bits).
Well, I guess I can start this off with the obvious approach (assuming the value to convert is in eax):
I wouldn't say it's optimized though.
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
cwde
shl eax,14
add eax,edx
First attempt:
main proc
local time1,time2
local var1
mov var1,455445454
START_TEST time1
mov eax,var1
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
cwde
shl eax,14
add eax,edx
END_TEST
START_TEST time2
mov eax,var1
mov edx,eax
mov esi,eax
mov edi,eax
and eax,7fh
and edx,7f00h
and esi,7f0000h
and edi,7f000000h
shr edx,1
shr esi,2
shr edi,3
or eax,edx
or esi,edi
or eax,esi
END_TEST
PrintLine
ret
main endp
Bench output (in cycles) :
time1 = 12.67311
time2 = 5.459498
----------------------------------------
Ultrano's method using MMX:
Takes about 10% more time to execute (on Core 2 Duo E6750), but processes 2 numbers simultaneously in 1 iteration (can be easily extended to process 4 numbers). Result in mm0.
align 4
param2 dq 1b258bce1b258bceh ;+0 <-- decimal: 455445454; 455445454;
mask1 dq 0000007f0000007fh ;+8
mask2 dq 00007f0000007f00h ;+16
mask3 dq 007f0000007f0000h ;+24
mask4 dq 7f0000007f000000h ;+32
align 4
mov eax, offset param2
movq mm0, [eax]     ; two packed input values
movq mm1, mm0
movq mm2, mm0
movq mm3, mm0
pand mm0, [eax+8]   ; mask1
pand mm1, [eax+16]  ; mask2
pand mm2, [eax+24]  ; mask3
pand mm3, [eax+32]  ; mask4
psrld mm1, 1
psrld mm2, 2
por mm0, mm1
psrld mm3, 3
por mm0, mm2
por mm0, mm3
Maybe something like this:
mov eax, num
mov edx, eax
and edx, 0x7FFFFF
add eax, edx
mov edx, eax
and edx, (0x7FFF SHL 1)
add eax, edx
mov edx, eax
and edx, (0x7F SHL 2)
add eax, edx
shr eax, 3
I suppose it might be possible to make it more efficient by using more registers, and preparing some of the sums in parallel. I'll leave that for someone else, I just wanted to present this algorithm.
Edit: Just realized that I assume that the MSB is 0 in the highest byte. If not, add an extra and at the bottom :) Or rather, an and 7F7F7F7F at the top, even better.
Edit: Okay, I lied... I couldn't leave it for someone else, too much fun... here's one with more efficient use of registers:
mov eax, num
mov edx, num
and eax, 0x7F7F7F7F
and edx, 0x7F7F7F
add eax, edx
add edx, edx
and edx, (0x7F7F SHL 1)
add eax, edx
add edx, edx
and edx, (0x7F SHL 2)
add eax, edx
shr eax, 3
Or what about this freaky one:
mov eax, num
mov ecx, num
mov edx, num
mov ebx, num
and eax, 0x7F7F7F7F
and edx, 0x7F7F7F
and ecx, 0x7F7F
and ebx, 0x7F
add eax, edx
add ecx, ecx
add ebx, ebx
add eax, ecx
add ebx, ebx
add eax, ebx
shr eax, 3
Timing results are as follows:
Scali1: ~14% slower than Ultrano's (and incorrect result due to assumption that the MSB = 0)
Scali2: about 0.1% faster than Scali1, correct result
Scali3: about 5% faster than Scali2 (9% slower than Ultrano's), correct result
BTW, what are the prizes? ^^
If you want to compare speed, I think the architecture is important as well.
If I were to hazard a guess, I'd say Ultrano's code mostly runs well on CPUs with fast shifting and 3 or 4 ALUs (eg K7, K8, K10, Core2/Nehalem), while my code probably works better on CPUs with shift bottlenecks and less superscalar abilities (eg Pentium, P6, P4). I could have told you beforehand that Ultrano's code would be faster on Core2.
Homer never specified what 'optimal' meant... Are we talking speed, number of instructions, number of bytes...? And what target architecture?
Easy, I was just comparing speeds on my machine ^^ No offense intended. Every attempt is a good one - it not only lets you pick the one that best suits your needs, it's also fun to actually take the "challenge". And actually, if you have time, then please time all the posted functions. It will let everyone interested see how they perform on different machines.
If someone provides a nice testing framework (f0dder?), we could all perform the same tests on our machines, with universal results.
I have a Pentium 4 myself, and an Athlon XP.
I'd like to test my theory of my code being reasonably favourable to Pentium 4.
Some interesting solutions :thumbsup:
By optimal, I meant fastest execution (not necessarily shortest code).
Just to mention that this algorithm is used a lot in compressed audio streams to mark framesizes.
They tend to use '255' to denote a SYNC marker, and this encoding ensures that all bytes of the FrameSize tag are less than '128' (80h) which means that an application can handle a 'broken/damaged' mpeg file by scanning for the next available SYNC marker ('resynchronize') with less chance of finding false positives.
While my code takes 5.5 cycles, on the same PC Scali's 3rd version takes 4.7 cycles. I really like the idea of mask-adding to shift left in groups :D.
c2d E8500 @3.8GHz.
These are my benching macros:
;=====[[ Benchmarking macros START_TEST/END_TEST >>===\
;--[ configuration ]---[
TEST_ITERS equ 10000000
TEST_SUBCYCLE = 1 ; 0 or 1, whether to show FP result
;----------------------/
TEST_ID = 0 ; internal thingie
START_TEST macro where:REQ
local where2
@START_TEST_VAR textequ <where>
mov where,-1
invoke GetModuleHandle,0
invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadAffinityMask,eax,1
invoke SwitchToThread
TEST_ID = TEST_ID + 1
rdtsc
push eax
push edx
mov ecx,TEST_ITERS
align 16
@CatStr(<testlabl>,%TEST_ID,<:>)
endm
END_TEST macro
dec ecx
jnz @CatStr(<testlabl>,%TEST_ID)
nop
nop
nop
nop
rdtsc
sub eax,[esp+4]
sbb edx,[esp]
if TEST_SUBCYCLE EQ 0
add esp,8
mov ecx,TEST_ITERS
div ecx
.if eax<@START_TEST_VAR
mov @START_TEST_VAR,eax
.endif
invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
%@CatStr(<print >,@START_TEST_VAR)
else
mov [esp+4],edx
mov [esp],eax
fild qword ptr [esp]
mov dword ptr [esp],TEST_ITERS
fidiv dword ptr [esp]
fstp @START_TEST_VAR
add esp,8
invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
%@CatStr(<PrintFloat >,@START_TEST_VAR)
endif
endm
;=======/
Of course, testing code with lots of cache-misses this way will lead to possibly bad decisions on which version to take. There I plug-in other RDTSC-based macros/calls around code in a complete big app, and measure min/max/avg/avg2/last/count .
Thanks for all the replies - when I have to invent problems to solve, the world is in a good state!!
I'll try to make some effort to post more of these simple problems, and I hope to hear from some more beginners next time as well :) (not that I mind the regulars showing everyone how it's done, but it's much more rewarding for me to know that someone pushed their envelope a little bit, or that someone gained something from the exercise).
A bit of fixup and completion. In my previous post I meant Scali's 2nd version, not the 3rd. Too many post-edits there :) .
I added "add var2,eax" to verify calculations don't get skipped and skewed by the out-of-order exec.
main proc
local time1,time2,time3,time4,time5
local num,var2
local tmp
mov num,455445454
START_TEST time1
mov eax,num
mov var2,eax
END_TEST
START_TEST time1
mov eax,num
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
cwde
shl eax,14
add eax,edx
add var2,eax
END_TEST
START_TEST time2
mov eax,num
mov edx,eax
mov esi,eax
mov edi,eax
and eax,7fh
and edx,7f00h
and esi,7f0000h
and edi,7f000000h
shr edx,1
shr esi,2
shr edi,3
or eax,edx
or esi,edi
or eax,esi
add var2,eax
END_TEST
START_TEST time3
mov eax, num
mov edx, eax
and edx, 7FFFFFh
add eax, edx
mov edx, eax
and edx, (7FFFh SHL 1)
add eax, edx
mov edx, eax
and edx, (7Fh SHL 2)
add eax, edx
shr eax, 3
add var2,eax
END_TEST
START_TEST time4
mov eax, num
mov edx, num
and eax, 7F7F7F7Fh
and edx, 7F7F7Fh
add eax, edx
add edx, edx
and edx, (7F7Fh SHL 1)
add eax, edx
add edx, edx
and edx, (7Fh SHL 2)
add eax, edx
shr eax, 3
add var2,eax
END_TEST
START_TEST time5
mov eax, num
mov esi, num
mov edx, num
mov ebx, num
and eax, 7F7F7F7Fh
and edx, 7F7F7Fh
and esi, 7F7Fh
and ebx, 7Fh
add eax, edx
add esi, esi
add ebx, ebx
add eax, esi
add ebx, ebx
add eax, ebx
shr eax, 3
add var2,eax
END_TEST
PrintLine
ret
main endp
c2d E8500 @3.8GHz, DDR3@1.6GHz (clk7) :
time1 = 2.861777 ; <null,warmup>
time1 = 13.15349 ; sysfce2
time2 = 6.553432 ; ultrano
time3 = 6.410781 ; scali1
time4 = 6.218256 ; scali2
time5 = 7.705081 ; scali3
----------------------------------------
By optimal, I meant fastest execution (not necessarily shortest code).
But on which architecture?
I tried to build a test app from your macros, but it seems I'm missing the PrintLine and PrintFloat stuff?
It's modified VKDebug stuff. I use MASM, and some libs that come with the masm32 package. The two macros are:
PrintFloat macro Var:REQ
local szDN,float1
.data
szDN byte 24 dup(0)
float1 real8 ?
.code
pushfd
pushad
push Var
fld real4 ptr [esp]
pop eax
fstp float1
invoke FloatToStr,float1,addr szDN ; <------ need to implement this
invoke strlen,addr szDN
add eax,@SizeStr(&Input)+4
invoke GlobalAlloc, GMEM_FIXED+GMEM_ZEROINIT, eax ; yuck
mov ebx,eax
FillMem ebx,&Var
mov dword ptr , 203D20h
invoke lstrcat,eax,addr szDN
invoke DebugPrint,ebx
invoke GlobalFree,ebx
popad
popfd
endm
PrintLine macro
prints "--------------"
endm
With f0dder's old 'yodel' framework slightly reworked to suit this test, I got this from a P4:
test01-Scali1 ...000547 ticks (effective 20.775 clk/iteration)
test02-Scali2 ...000515 ticks (effective 19.560 clk/iteration)
test03-Scali3 ...000625 ticks (effective 23.737 clk/iteration)
test04-sysfce2 ...000578 ticks (effective 21.952 clk/iteration)
test05-Ultrano ...000610 ticks (effective 23.168 clk/iteration)
And this from a Core2 Duo:
test01-Scali1 ...000375 ticks (effective 11.250 clk/iteration)
test02-Scali2 ...000344 ticks (effective 10.320 clk/iteration)
test03-Scali3 ...000406 ticks (effective 12.180 clk/iteration)
test04-sysfce2 ...000608 ticks (effective 18.240 clk/iteration)
test05-Ultrano ...000452 ticks (effective 13.560 clk/iteration)
It's not entirely fair towards my third routine and Ultrano's routine, since they both use non-volatile registers, which have to be pushed and popped.
I've added the following two C versions, compiled with VS2008 with /Ox flag:
extern "C" unsigned int __stdcall time6(unsigned int num)
{
return (num & 0x7F) | ((num & 0x7F00) >> 1) | ((num & 0x7F0000) >> 2) | ((num & 0x7F000000) >> 3);
}
extern "C" unsigned int __stdcall time7(unsigned int num)
{
unsigned int result = num & 0x7F7F7F7F;
unsigned int tmp = num & 0x7F7F7F;
result += tmp;
tmp += tmp;
tmp &= (0x7F7F << 1);
result += tmp;
tmp += tmp;
tmp &= (0x7F << 2);
result += tmp;
return result >> 3;
}
test01-Scali1 ...000374 ticks (effective 11.220 clk/iteration)
test02-Scali2 ...000359 ticks (effective 10.770 clk/iteration)
test03-Scali3 ...000421 ticks (effective 12.630 clk/iteration)
test04-sysfce2 ...000624 ticks (effective 18.720 clk/iteration)
test05-Ultrano ...000468 ticks (effective 14.040 clk/iteration)
test06-ANSI C 1 ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2 ...000343 ticks (effective 10.290 clk/iteration)
First version appears to be this:
_text:00000000 mov ecx, [esp+4]
_text:00000004 mov eax, ecx
_text:00000006 shr eax, 1
_text:00000008 and eax, 3F800000h
_text:0000000D mov edx, ecx
_text:0000000F and edx, 7F0000h
_text:00000015 or eax, edx
_text:00000017 mov edx, ecx
_text:00000019 shr eax, 1
_text:0000001B and edx, 7F00h
_text:00000021 or eax, edx
_text:00000023 shr eax, 1
_text:00000025 and ecx, 7Fh
_text:00000028 or eax, ecx
Second one is this:
_text:00000000 mov eax, [esp+4]
_text:00000004 mov ecx, eax
_text:00000006 and eax, 7F7F7F7Fh
_text:0000000B and ecx, 7F7F7Fh
_text:00000011 add eax, ecx
_text:00000013 mov edx, eax
_text:00000015 lea eax, [ecx+ecx]
_text:00000018 and eax, 0FEFEh
_text:0000001D add edx, eax
_text:0000001F add eax, eax
_text:00000021 and eax, 1FCh
_text:00000026 add eax, edx
_text:00000028 shr eax, 3
Looking at that code reminded me that I forgot to disable MASM's prologue/epilogue code, so I got some gratuitous esp/ebp stuff.
Removed that, the results now sorta look like this on Core2 Duo:
test01-Scali1 ...000343 ticks (effective 10.290 clk/iteration)
test02-Scali2 ...000297 ticks (effective 8.910 clk/iteration)
test03-Scali3 ...000374 ticks (effective 11.220 clk/iteration)
test04-sysfce2 ...000562 ticks (effective 16.860 clk/iteration)
test05-Ultrano ...000374 ticks (effective 11.220 clk/iteration)
test06-ANSI C 1 ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2 ...000343 ticks (effective 10.290 clk/iteration)
And like this on Pentium 4:
test01-Scali1 ...000453 ticks (effective 17.205 clk/iteration)
test02-Scali2 ...000438 ticks (effective 16.635 clk/iteration)
test03-Scali3 ...000500 ticks (effective 18.990 clk/iteration)
test04-sysfce2 ...000422 ticks (effective 16.028 clk/iteration)
test05-Ultrano ...000515 ticks (effective 19.560 clk/iteration)
test06-ANSI C 1 ...000438 ticks (effective 16.635 clk/iteration)
test07-ANSI C 2 ...000437 ticks (effective 16.597 clk/iteration)
I suppose sysfce2's code clearly demonstrates why architecture matters. His code is the fastest on Pentium 4, but the slowest on Core2 Duo.
I wonder what causes that exactly... perhaps cwde is slow on Core2... or it could be the partial registers that P4 handles better (partial register stalls have been a problem ever since the first Pentium Pro).
I seem to have misjudged Ultrano's code a bit. I would have expected it to be faster on Core2 Duo, but it seems like his code has the same problem as my third version: You can try to take advantage of 3 or 4 ALUs in parallel, but you are limited by the decoder, so it doesn't really work unless the reorder buffer is already full... that won't happen here because we don't have any weird branches or memory accesses.
So despite my efforts of trying to take advantage of the 4 ALUs in the Core2 Duo in my 3rd variation, the 2nd variation remains the fastest for Core2 Duo.
At least I was right about my algorithm favouring the Pentium 4 though. Pentium 4 has trouble handling shifts, that is probably what makes Ultrano's version slower on the P4. My code performs more 'consistently' between Core2 Duo and Pentium 4.