You are given a 32-bit value in which the most significant bit of each byte (bit 7) must be ignored, so it encodes a 28-bit value. Write an optimal function to decode the value (strip the ignored bits).
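For reference, the decoding spelled out in plain C - byte n contributes its low 7 bits to bits 7n..7n+6 of the result. This is just the specification restated (function name arbitrary), not a speed contender:

#include <stdint.h>

/* Reference decode: strip bit 7 of every byte and pack the four 7-bit
   fields into a 28-bit value. */
uint32_t decode28(uint32_t v)
{
    return  (v        &  0x7F)
         | ((v >>  1) & (0x7Fu <<  7))
         | ((v >>  2) & (0x7Fu << 14))
         | ((v >>  3) & (0x7Fu << 21));
}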



Posted on 2009-11-20 19:19:19 by Homer
Well, I guess I can start this off with the obvious approach (assuming the value to convert is in eax):
I wouldn't say it's optimized though.
movzx edx,ax    ; edx = low 16 bits (bytes 0-1)
shr eax,15      ; eax = bits 15-31 (bytes 2-3 plus the ignored bit 15)
shl dx,1        ; drop the ignored bit 15 (MSB of byte 1)
shr ah,1        ; drop the ignored bit 23 (MSB of byte 2)
shr dh,1        ; drop the ignored bit 7 (MSB of byte 0)
shr ax,1        ; drop bit 15 again; ax = bytes 2-3 fields in 14 bits
shr dx,1        ; dx = bytes 0-1 fields in 14 bits
cwde            ; clears the upper half of eax (ax < 8000h here)
shl eax,14      ; move the upper 14 bits into position
add eax,edx     ; combine -> 28-bit result
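In C terms, the same split-into-halves idea looks roughly like this - a sketch rather than a literal translation of the register work above:

#include <stdint.h>

/* Sketch: handle each 16-bit half on its own, squeeze its two 7-bit
   fields into 14 bits, then recombine the halves. */
uint32_t decode28_halves(uint32_t v)
{
    uint32_t lo = v & 0xFFFFu;                    /* bytes 0-1 */
    uint32_t hi = v >> 16;                        /* bytes 2-3 */
    lo = (lo & 0x7F) | ((lo >> 1) & 0x3F80);      /* 14-bit field */
    hi = (hi & 0x7F) | ((hi >> 1) & 0x3F80);      /* 14-bit field */
    return (hi << 14) | lo;
}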
Posted on 2009-11-21 08:47:01 by sysfce2
First attempt:

main proc
local time1,time2
local var1



mov var1,455445454



START_TEST time1
mov eax,var1
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
cwde
shl eax,14
add eax,edx
END_TEST


START_TEST time2
mov eax,var1
mov edx,eax
mov esi,eax
mov edi,eax
and eax,7fh
and edx,7f00h
and esi,7f0000h
and edi,7f000000h
shr edx,1
shr esi,2
shr edi,3
or eax,edx
or esi,edi
or eax,esi


END_TEST





PrintLine

ret
main endp


Bench output (in cycles) :

time1 = 12.67311
time2 = 5.459498
----------------------------------------
Posted on 2009-11-21 10:05:23 by Ultrano
Ultrano's method using MMX:

align 4
param2  dq  1b258bce1b258bceh ;+0  <-- decimal: 455445454; 455445454;
mask1   dq  0000007f0000007fh ;+8
mask2   dq  00007f0000007f00h ;+16
mask3   dq  007f0000007f0000h ;+24
mask4   dq  7f0000007f000000h ;+32


align 4
                   mov     eax, offset param2
                   movq    mm0, [eax]        ; two packed input dwords (param2)
                   movq    mm1, mm0
                   movq    mm2, mm0
                   movq    mm3, mm0
                   pand    mm0, [eax+8]      ; mask1 - bits 0-6
                   pand    mm1, [eax+16]     ; mask2 - bits 8-14
                   pand    mm2, [eax+24]     ; mask3 - bits 16-22
                   pand    mm3, [eax+32]     ; mask4 - bits 24-30
                   psrld   mm1, 1
                   psrld   mm2, 2
                   por     mm0, mm1
                   psrld   mm3, 3
                   por     mm0, mm2
                   por     mm0, mm3


Takes about 10% more time to execute (on Core 2 Duo E6750), but processes 2 numbers simultaneously in 1 iteration (can be easily extended to process 4 numbers). Result in mm0.
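The same masking approach carries straight over to SSE2 intrinsics for the 4-at-a-time case; a rough sketch of that extension (not the code timed above):

#include <emmintrin.h>   /* SSE2 */

/* Decode four packed 32-bit values at once. */
static __m128i decode28_sse2(__m128i v)
{
    __m128i r = _mm_and_si128(v, _mm_set1_epi32(0x0000007F));
    r = _mm_or_si128(r, _mm_srli_epi32(_mm_and_si128(v, _mm_set1_epi32(0x00007F00)), 1));
    r = _mm_or_si128(r, _mm_srli_epi32(_mm_and_si128(v, _mm_set1_epi32(0x007F0000)), 2));
    r = _mm_or_si128(r, _mm_srli_epi32(_mm_and_si128(v, _mm_set1_epi32(0x7F000000)), 3));
    return r;
}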
Posted on 2009-11-21 10:44:10 by ti_mo_n
Maybe something like this:
mov eax, num
mov edx, eax
and edx, 0x7FFFFF
add eax, edx
mov edx, eax
and edx, (0x7FFF SHL 1)
add eax, edx
mov edx, eax
and edx, (0x7F SHL 2)
add eax, edx
shr eax, 3


I suppose it might be possible to make it more efficient by using more registers and preparing some of the sums in parallel. I'll leave that for someone else; I just wanted to present this algorithm.

Edit: Just realized that I'm assuming the MSB is 0 in the highest byte. If not, add an extra and at the bottom :) Or rather, an and 7F7F7F7F at the top - even better.

Edit: Okay, I lied... I couldn't leave it for someone else, too much fun... here's one with more efficient use of registers:
mov eax, num
mov edx, num

and eax, 0x7F7F7F7F
and edx, 0x7F7F7F

add eax, edx
add edx, edx

and edx, (0x7F7F SHL 1)

add eax, edx
add edx, edx

and edx, (0x7F SHL 2)

add eax, edx

shr eax, 3


Or what about this freaky one:
mov eax, num
mov ecx, num
mov edx, num
mov ebx, num

and eax, 0x7F7F7F7F
and edx, 0x7F7F7F
and ecx, 0x7F7F
and ebx, 0x7F

add eax, edx
add ecx, ecx
add ebx, ebx

add eax, ecx
add ebx, ebx

add eax, ebx

shr eax, 3
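If I've done the arithmetic right, the idea in all of these is that each add doubles the fields below the current boundary relative to the ones above it, so just before the final shift the register holds b3<<24 | b2<<17 | b1<<10 | b0<<3 - the decoded value times 8 - and the shr 3 puts it back in place. A small C check of that claim (a hypothetical helper, names arbitrary):

#include <assert.h>
#include <stdint.h>

static void check_add_mask_trick(uint32_t num)
{
    /* the second variant, written out in C */
    uint32_t r = num & 0x7F7F7F7F;
    uint32_t t = num & 0x007F7F7F;
    r += t;  t += t;  t &= (0x7F7Fu << 1);
    r += t;  t += t;  t &= (0x7Fu   << 2);
    r += t;

    /* straightforward decode for comparison */
    uint32_t d = (num & 0x7F) | ((num >> 1) & (0x7Fu << 7))
               | ((num >> 2) & (0x7Fu << 14)) | ((num >> 3) & (0x7Fu << 21));

    assert(r == (d << 3));      /* decoded value, shifted left by 3 */
    assert((r >> 3) == d);
}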
Posted on 2009-11-21 12:54:44 by Scali
Timing results are as follows:
Scali1: ~14% slower than Ultrano's (and incorrect result due to assumption that the MSB = 0)
Scali2: about 0.1% faster than Scali1, correct result
Scali3: about 5% faster than Scali2 (9% slower than Ultrano's), correct result


BTW, what are the prizes? ^^
Posted on 2009-11-21 15:46:47 by ti_mo_n

Timing results are as follows:
Scali1: ~14% slower than Ultrano's (and incorrect result due to assumption that the MSB = 0)
Scali2: about 0.1% faster than Scali1, correct result
Scali3: about 5% faster than Scali2 (9% slower than Ultrano's), correct result


BTW, what are the prizes? ^^


If you want to compare speed, I think the architecture is important as well.
If I were to hazard a guess, I'd say Ultrano's code mostly runs well on CPUs with fast shifting and 3 or 4 ALUs (eg K7, K8, K10, Core2/Nehalem), while my code probably works better on CPUs with shift bottlenecks and less superscalar abilities (eg Pentium, P6, P4). I could have told you beforehand that Ultrano's code would be faster on Core2.

Homer never specified what 'optimal' meant... Are we talking speed, number of instructions, number of bytes...? And what target architecture?
Posted on 2009-11-21 16:38:45 by Scali
Easy, I was just comparing speeds on my machine ^^ No offense intended. Every attempt is a good one - it not only lets you pick the one that best suits your needs, but it's also fun to actually take the "challenge". And, actually, if you have time then please time all the posted functions - it will let everyone interested see how they perform on different machines.
Posted on 2009-11-21 16:57:17 by ti_mo_n
If someone provides a nice testing framework (f0dder?), we could all perform the same tests on our machines, with universal results.
I have a Pentium 4 myself, and an Athlon XP.
I'd like to test my theory of my code being reasonably favourable to Pentium 4.
Posted on 2009-11-21 17:24:10 by Scali

Some interesting solutions  :thumbsup:
By optimal, I meant fastest execution (not necessarily shortest code).

Just to mention that this algorithm is used a lot in compressed audio streams to mark frame sizes.
They tend to use '255' to denote a SYNC marker, and this encoding ensures that all bytes of the FrameSize tag are less than '128' (80h). That means an application can handle a broken/damaged mpeg file by scanning for the next available SYNC marker ('resynchronize'), with less chance of finding false positives.
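The encode side is just the mirror image - roughly like this (a sketch; the actual byte order and layout depend on the container spec):

#include <stdint.h>

/* Spread a 28-bit frame size over 4 bytes, 7 bits per byte, so no byte
   ever reaches 80h and can be confused with part of a SYNC marker. */
uint32_t encode28(uint32_t framesize)   /* framesize < 2^28 */
{
    return  (framesize        &  0x7F)
         | ((framesize <<  1) & (0x7Fu <<  8))
         | ((framesize <<  2) & (0x7Fu << 16))
         | ((framesize <<  3) & (0x7Fu << 24));
}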


Posted on 2009-11-21 19:57:41 by Homer
While my code takes 5.5 cycles, on the same PC Scali's 3rd version takes 4.7 cycles. I really like the idea of mask-adding to shift left in groups :D.
c2d E8500 @3.8GHz.
Posted on 2009-11-21 23:44:48 by Ultrano
These are my benching macros:

;=====[[ Benchmarking macros START_TEST/END_TEST >>===\
;--[ configuration ]---[
TEST_ITERS equ 10000000
TEST_SUBCYCLE = 1 ; 0 or 1, whether to show FP result
;----------------------/

TEST_ID = 0 ; internal thingie
START_TEST macro where:REQ
local where2
@START_TEST_VAR textequ <where>
mov where,-1


invoke GetModuleHandle,0
invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadAffinityMask,eax,1

invoke SwitchToThread

TEST_ID = TEST_ID + 1
rdtsc
push eax
push edx
mov ecx,TEST_ITERS
align 16
@CatStr(<testlabl>,%TEST_ID,<:>)
endm
END_TEST macro
dec ecx
jnz @CatStr(<testlabl>,%TEST_ID)
nop
nop
nop
nop

rdtsc
sub eax,[esp+4]  ; subtract the starting TSC (pushed eax)
sbb edx,[esp]    ; ... and its high dword (pushed edx)
 if TEST_SUBCYCLE EQ 0
add esp,8
mov ecx,TEST_ITERS
div ecx
.if eax<@START_TEST_VAR
mov @START_TEST_VAR,eax
.endif

invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<print >,@START_TEST_VAR)
else
mov [esp+4],edx  ; store the 64-bit cycle delta back on the stack
mov [esp],eax
fild qword ptr [esp]
mov dword ptr [esp],TEST_ITERS
fidiv dword ptr [esp]
fstp @START_TEST_VAR
add esp,8

invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<PrintFloat >,@START_TEST_VAR)
endif
endm

;=======/


Of course, testing code with lots of cache misses this way can lead to bad decisions about which version to take. There I plug in other RDTSC-based macros/calls around code in a complete big app, and measure min/max/avg/avg2/last/count.
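In C, a rough equivalent of the cycles-per-iteration measurement would be something like this (just a sketch built on the compiler's __rdtsc intrinsic - no affinity/priority handling like the macros above, and very short sequences still get skewed by out-of-order execution):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>          /* __rdtsc */
#else
#include <x86intrin.h>       /* __rdtsc */
#endif

double cycles_per_iter(uint32_t (*fn)(uint32_t), uint32_t arg, uint32_t iters)
{
    volatile uint32_t sink = 0;          /* keep the calls from being optimized away */
    uint64_t start = __rdtsc();
    for (uint32_t i = 0; i < iters; i++)
        sink += fn(arg);
    uint64_t stop = __rdtsc();
    (void)sink;
    return (double)(stop - start) / (double)iters;
}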
Posted on 2009-11-21 23:48:21 by Ultrano
Thanks for all the replies - when I have to invent problems to solve, the world is in a good state !!
I'll try to make some effort to post more of these simple problems, and I hope to hear from some more beginners next time as well :) (not that I mind the regulars showing everyone how it's done, but it's much more rewarding for me to know that someone pushed their envelope a little bit, or that someone gained something from the exercise).



Posted on 2009-11-22 00:11:38 by Homer
A bit of fixup and completion. In my previous post I meant Scali's 2nd version, not the 3rd. Too many post-edits there :) .
I added "add var2,eax" to verify calculations don't get skipped or skewed by the out-of-order exec.

main proc
local time1,time2,time3,time4,time5
local num,var2
local tmp

mov num,455445454

START_TEST time1
mov eax,num
mov var2,eax
END_TEST



START_TEST time1
mov eax,num
movzx edx,ax
shr eax,15
shl dx,1
shr ah,1
shr dh,1
shr ax,1
shr dx,1
cwde
shl eax,14
add eax,edx
add var2,eax
END_TEST



START_TEST time2
mov eax,num
mov edx,eax
mov esi,eax
mov edi,eax
and eax,7fh
and edx,7f00h
and esi,7f0000h
and edi,7f000000h
shr edx,1
shr esi,2
shr edi,3
or eax,edx
or esi,edi
or eax,esi

add var2,eax

END_TEST


START_TEST time3
mov eax, num
mov edx, eax
and edx, 7FFFFFh
add eax, edx
mov edx, eax
and edx, (7FFFh SHL 1)
add eax, edx
mov edx, eax
and edx, (7Fh SHL 2)
add eax, edx
shr eax, 3
add var2,eax
END_TEST


START_TEST time4
mov eax, num
mov edx, num

and eax, 7F7F7F7Fh
and edx, 7F7F7Fh

add eax, edx
add edx, edx

and edx, (7F7Fh SHL 1)

add eax, edx
add edx, edx

and edx, (7Fh SHL 2)

add eax, edx

shr eax, 3
add var2,eax
END_TEST



START_TEST time5
mov eax, num
mov esi, num
mov edx, num
mov ebx, num

and eax, 7F7F7F7Fh
and edx, 7F7F7Fh
and esi, 7F7Fh
and ebx, 7Fh

add eax, edx
add esi, esi
add ebx, ebx

add eax, esi
add ebx, ebx

add eax, ebx

shr eax, 3
add var2,eax

END_TEST





PrintLine

ret
main endp


c2d E8500 @3.8GHz, DDR3@1.6GHz (clk7) :

time1 = 2.861777 ; <null,warmup>
time1 = 13.15349 ; sysfce2
time2 = 6.553432 ; ultrano
time3 = 6.410781 ; scali1
time4 = 6.218256 ; scali2
time5 = 7.705081 ; scali3
----------------------------------------

Posted on 2009-11-22 01:03:03 by Ultrano

By optimal, I meant fastest execution (not necessarily shortest code).


But on which architecture?
Posted on 2009-11-22 03:35:09 by Scali

These are my benching macros:


I tried to build a test app from your macros, but it seems I'm missing the PrintLine and PrintFloat stuff?
Posted on 2009-11-22 03:48:40 by Scali
It's modified VKDebug stuff. I use MASM, and some libs that come with the masm32 package. The two macros are:

PrintFloat macro Var:REQ
local szDN,float1
.data
szDN byte 24 dup(0)
float1 real8 ?
.code
pushfd
pushad
push Var
fld real4 ptr [esp]
pop eax
fstp float1
invoke FloatToStr,float1,addr szDN ; <------ need to implement this
invoke strlen,addr szDN
add eax,@SizeStr(&Input)+4
invoke GlobalAlloc, GMEM_FIXED+GMEM_ZEROINIT, eax ; yuck
mov ebx,eax
FillMem ebx,&Var
mov dword ptr [ebx+@SizeStr(&Var)], 203D20h ; append " = " after the name
invoke lstrcat,eax,addr szDN
invoke DebugPrint,ebx
invoke GlobalFree,ebx
popad
popfd
endm

PrintLine macro
prints "--------------"
endm
Posted on 2009-11-22 06:06:47 by Ultrano
With f0dder's old 'yodel' framework slightly reworked to suit this test, I got this from a P4:
test01-Scali1                ...000547 ticks (effective 20.775 clk/iteration)
test02-Scali2                ...000515 ticks (effective 19.560 clk/iteration)
test03-Scali3                ...000625 ticks (effective 23.737 clk/iteration)
test04-sysfce2                ...000578 ticks (effective 21.952 clk/iteration)
test05-Ultrano                ...000610 ticks (effective 23.168 clk/iteration)

And this from a Core2 Duo:
test01-Scali1                ...000375 ticks (effective 11.250 clk/iteration)
test02-Scali2                ...000344 ticks (effective 10.320 clk/iteration)
test03-Scali3                ...000406 ticks (effective 12.180 clk/iteration)
test04-sysfce2                ...000608 ticks (effective 18.240 clk/iteration)
test05-Ultrano                ...000452 ticks (effective 13.560 clk/iteration)

It's not entirely fair towards my third routine and Ultrano's routine, since they both use non-volatile registers, which have to be pushed and popped.
Posted on 2009-11-22 09:32:46 by Scali
I've added the following two C versions, compiled with VS2008 with /Ox flag:

extern "C" unsigned int __stdcall time6(unsigned int num)
{
return (num & 0x7F) | ((num & 0x7F00) >> 1) | ((num & 0x7F0000) >> 2) | ((num & 0x7F000000) >> 3);
}

extern "C" unsigned int __stdcall time7(unsigned int num)
{
unsigned int result = num & 0x7F7F7F7F;
unsigned int tmp    = num & 0x7F7F7F;

result += tmp;
tmp += tmp;
tmp &= (0x7F7F << 1);
result += tmp;
tmp += tmp;
tmp &= (0x7F << 2);
result += tmp;

return result >> 3;
}


test01-Scali1                ...000374 ticks (effective 11.220 clk/iteration)
test02-Scali2                ...000359 ticks (effective 10.770 clk/iteration)
test03-Scali3                ...000421 ticks (effective 12.630 clk/iteration)
test04-sysfce2                ...000624 ticks (effective 18.720 clk/iteration)
test05-Ultrano                ...000468 ticks (effective 14.040 clk/iteration)
test06-ANSI C 1              ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2              ...000343 ticks (effective 10.290 clk/iteration)

First version appears to be this:
_text:00000000                 mov     ecx, [esp+4]    ; num
_text:00000004                mov    eax, ecx
_text:00000006                shr    eax, 1
_text:00000008                and    eax, 3F800000h
_text:0000000D                mov    edx, ecx
_text:0000000F                and    edx, 7F0000h
_text:00000015                or      eax, edx
_text:00000017                mov    edx, ecx
_text:00000019                shr    eax, 1
_text:0000001B                and    edx, 7F00h
_text:00000021                or      eax, edx
_text:00000023                shr    eax, 1
_text:00000025                and    ecx, 7Fh
_text:00000028                or      eax, ecx


Second one is this:
_text:00000000                 mov     eax, [esp+4]    ; num
_text:00000004                mov    ecx, eax
_text:00000006                and    eax, 7F7F7F7Fh
_text:0000000B                and    ecx, 7F7F7Fh
_text:00000011                add    eax, ecx
_text:00000013                mov    edx, eax
_text:00000015                lea    eax, [ecx+ecx]
_text:00000018                and    eax, 0FEFEh
_text:0000001D                add    edx, eax
_text:0000001F                add    eax, eax
_text:00000021                and    eax, 1FCh
_text:00000026                add    eax, edx
_text:00000028                shr    eax, 3
Posted on 2009-11-22 14:39:53 by Scali
Looking at that code reminded me that I forgot to disable MASM's prologue/epilogue code, so I got some gratuitous esp/ebp stuff.
Removed that, the results now sorta look like this on Core2 Duo:
test01-Scali1                ...000343 ticks (effective 10.290 clk/iteration)
test02-Scali2                ...000297 ticks (effective 8.910 clk/iteration)
test03-Scali3                ...000374 ticks (effective 11.220 clk/iteration)
test04-sysfce2                ...000562 ticks (effective 16.860 clk/iteration)
test05-Ultrano                ...000374 ticks (effective 11.220 clk/iteration)
test06-ANSI C 1              ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2              ...000343 ticks (effective 10.290 clk/iteration)

And like this on Pentium 4:
test01-Scali1                ...000453 ticks (effective 17.205 clk/iteration)
test02-Scali2                ...000438 ticks (effective 16.635 clk/iteration)
test03-Scali3                ...000500 ticks (effective 18.990 clk/iteration)
test04-sysfce2                ...000422 ticks (effective 16.028 clk/iteration)
test05-Ultrano                ...000515 ticks (effective 19.560 clk/iteration)
test06-ANSI C 1              ...000438 ticks (effective 16.635 clk/iteration)
test07-ANSI C 2              ...000437 ticks (effective 16.597 clk/iteration)

I suppose sysfce2's code clearly demonstrates why architecture matters. His code is the fastest on Pentium 4, but the slowest on Core2 Duo.
I wonder what causes that exactly... perhaps cwde is slow on Core2... or it could be the partial registers that P4 handles better (partial register stalls have been a problem ever since the first Pentium Pro).
I seem to have misjudged Ultrano's code a bit. I would have expected it to be faster on Core2 Duo, but it seems like his code has the same problem as my third version: You can try to take advantage of 3 or 4 ALUs in parallel, but you are limited by the decoder, so it doesn't really work unless the reorder buffer is already full... that won't happen here because we don't have any weird branches or memory accesses.
So despite my efforts of trying to take advantage of the 4 ALUs in the Core2 Duo in my 3rd variation, the 2nd variation remains the fastest for Core2 Duo.
At least I was right about my algorithm favouring the Pentium 4 though. Pentium 4 has trouble handling shifts, that is probably what makes Ultrano's version slower on the P4. My code performs more 'consistently' between Core2 Duo and Pentium 4.
Posted on 2009-11-22 14:52:38 by Scali