You are given a 32-bit value in which the most significant bit of each byte (bit 7) must be ignored, so it encodes a 28-bit value. Write an optimal function to decode the value (strip the ignored bits and pack the remaining 7-bit groups together).

Posted on 2009-11-20 19:19:19 by Homer
Well, I guess I can start this off with the obvious approach (assuming the value to convert is in eax). I wouldn't say it's optimized, though:

```asm
	movzx edx,ax
	shr   eax,15
	shl   dx,1
	shr   ah,1
	shr   dh,1
	shr   ax,1
	shr   dx,1
	cwde
	shl   eax,14
	add   eax,edx
```
Posted on 2009-11-21 08:47:01 by sysfce2
First attempt:
```asm
main proc
	local time1,time2
	local var1

	mov var1,455445454

	START_TEST time1
	mov eax,var1
	movzx edx,ax
	shr eax,15
	shl dx,1
	shr ah,1
	shr dh,1
	shr ax,1
	shr dx,1
	cwde
	shl eax,14
	add eax,edx
	END_TEST

	START_TEST time2
	mov eax,var1
	mov edx,eax
	mov esi,eax
	mov edi,eax
	and eax,7fh
	and edx,7f00h
	and esi,7f0000h
	and edi,7f000000h
	shr edx,1
	shr esi,2
	shr edi,3
	or eax,edx
	or esi,edi
	or eax,esi
	END_TEST

	PrintLine
	ret
main endp
```

Bench output (in cycles) :

time1 = 12.67311
time2 = 5.459498
----------------------------------------
Posted on 2009-11-21 10:05:23 by Ultrano
Ultrano's method using MMX:
```asm
align 4
param2  dq  1b258bce1b258bceh ;+0  <-- decimal: 455445454; 455445454
mask1   dq  0000007f0000007fh ;+8
mask2   dq  00007f0000007f00h ;+16
mask3   dq  007f0000007f0000h ;+24
mask4   dq  7f0000007f000000h ;+32

align 4
	mov     eax, offset param2
	movq    mm0, [eax]
	movq    mm1, mm0
	movq    mm2, mm0
	movq    mm3, mm0
	pand    mm0, [eax+8]
	pand    mm1, [eax+16]
	pand    mm2, [eax+24]
	pand    mm3, [eax+32]
	psrld   mm1, 1
	psrld   mm2, 2
	por     mm0, mm1
	psrld   mm3, 3
	por     mm0, mm2
	por     mm0, mm3
```

Takes about 10% more time to execute (on Core 2 Duo E6750), but processes 2 numbers simultaneously in 1 iteration (can be easily extended to process 4 numbers). Result in mm0.
Posted on 2009-11-21 10:44:10 by ti_mo_n
Maybe something like this:
```asm
	mov eax, num
	mov edx, eax
	and edx, 0x7FFFFF
	add eax, edx
	mov edx, eax
	and edx, (0x7FFF SHL 1)
	add eax, edx
	mov edx, eax
	and edx, (0x7F SHL 2)
	add eax, edx
	shr eax, 3
```

I suppose it might be possible to make it more efficient by using more registers, and preparing some of the sums in parallel. I'll leave that for someone else, I just wanted to present this algorithm.

Edit: Just realized that I'm assuming the MSB is 0 in the highest byte. If not, add an extra `and` at the bottom :) Or rather, an `and` with 7F7F7F7Fh at the top, even better.

Edit: Okay, I lied... I couldn't leave it for someone else, too much fun... here's one with more efficient use of registers:
```asm
	mov eax, num
	mov edx, num
	and eax, 0x7F7F7F7F
	and edx, 0x7F7F7F
	add eax, edx
	add edx, edx
	and edx, (0x7F7F SHL 1)
	add eax, edx
	add edx, edx
	and edx, (0x7F SHL 2)
	add eax, edx
	shr eax, 3
```

```asm
	mov eax, num
	mov ecx, num
	mov edx, num
	mov ebx, num
	and eax, 0x7F7F7F7F
	and edx, 0x7F7F7F
	and ecx, 0x7F7F
	and ebx, 0x7F
	add eax, edx
	add ecx, ecx
	add ebx, ebx
	add eax, ecx
	add ebx, ebx
	add eax, ebx
	shr eax, 3
```
Posted on 2009-11-21 12:54:44 by Scali
Timing results are as follows:
Scali1: ~14% slower than Ultrano's (and incorrect result due to assumption that the MSB = 0)
Scali2: about 0.1% faster than Scali1, correct result
Scali3: about 5% faster than Scali2 (9% slower than Ultrano's), correct result

BTW, what are the prizes? ^^
Posted on 2009-11-21 15:46:47 by ti_mo_n

> Timing results are as follows:
> Scali1: ~14% slower than Ultrano's (and incorrect result due to assumption that the MSB = 0)
> Scali2: about 0.1% faster than Scali1, correct result
> Scali3: about 5% faster than Scali2 (9% slower than Ultrano's), correct result
>
> BTW, what are the prizes? ^^

If you want to compare speed, I think the architecture is important as well.
If I were to hazard a guess, I'd say Ultrano's code mostly runs well on CPUs with fast shifting and 3 or 4 ALUs (e.g. K7, K8, K10, Core2/Nehalem), while my code probably works better on CPUs with shift bottlenecks and weaker superscalar abilities (e.g. Pentium, P6, P4). I could have told you beforehand that Ultrano's code would be faster on Core2.

Homer never specified what 'optimal' meant... Are we talking speed, number of instructions, number of bytes...? And what target architecture?
Posted on 2009-11-21 16:38:45 by Scali
Easy, I was just comparing speeds on my machine ^^ No offense intended. Every attempt is a good one: not only does it let you pick the one that best suits your needs, it's also fun to actually take the "challenge". And, actually, if you have the time, please do time all the posted functions; it will let everyone interested see how they perform on different machines.
Posted on 2009-11-21 16:57:17 by ti_mo_n
If someone provides a nice testing framework (f0dder?), we could all perform the same tests on our machines, with universal results.
I have a Pentium 4 myself, and an Athlon XP.
I'd like to test my theory of my code being reasonably favourable to Pentium 4.
Posted on 2009-11-21 17:24:10 by Scali

Some interesting solutions  :thumbsup:
By optimal, I meant fastest execution (not necessarily shortest code).

Just to mention that this algorithm is used a lot in compressed audio streams to mark frame sizes.
They tend to use 255 to denote a SYNC marker, and this encoding ensures that all bytes of the FrameSize tag are less than 128 (80h). That means an application can handle a broken/damaged MPEG file by scanning for the next available SYNC marker ('resynchronizing') with less chance of finding false positives.

Posted on 2009-11-21 19:57:41 by Homer
While my code takes 5.5 cycles, Scali's 3rd version takes 4.7 cycles on the same PC. I really like the idea of mask-adding to shift left in groups :D.
c2d E8500 @3.8GHz.
Posted on 2009-11-21 23:44:48 by Ultrano
These are my benching macros:
```asm
;=====[[ Benchmarking macros START_TEST/END_TEST >>===\
;--[ configuration ]---[
TEST_ITERS equ 10000000
TEST_SUBCYCLE = 1 ; 0 or 1, whether to show FP result
;----------------------/

TEST_ID = 0 ; internal thingie

START_TEST macro where:REQ
	local where2
	@START_TEST_VAR textequ <where>
	mov where,-1

	invoke GetModuleHandle,0
	invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
	invoke GetCurrentThread
	invoke SetThreadAffinityMask,eax,1
	invoke SwitchToThread

	TEST_ID = TEST_ID + 1
	rdtsc
	push eax
	push edx
	mov ecx,TEST_ITERS
	align 16
	@CatStr(<testlabl>,%TEST_ID,<:>)
endm

END_TEST macro
	dec ecx
	jnz @CatStr(<testlabl>,%TEST_ID)
	nop
	nop
	nop
	nop

	rdtsc
	sub eax,[esp+4]
	sbb edx,[esp]
  if TEST_SUBCYCLE EQ 0
	add esp,8
	mov ecx,TEST_ITERS
	div ecx
	.if eax<@START_TEST_VAR
		mov @START_TEST_VAR,eax
	.endif

	invoke GetModuleHandle,0
	invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

	%@CatStr(<print >,@START_TEST_VAR)
  else
	mov [esp+4],edx
	mov [esp],eax
	fild qword ptr [esp]
	mov dword ptr [esp],TEST_ITERS
	fidiv dword ptr [esp]
	fstp @START_TEST_VAR
	add esp,8

	invoke GetModuleHandle,0
	invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

	%@CatStr(<PrintFloat >,@START_TEST_VAR)
  endif
endm
;=======/
```

Of course, testing code with lots of cache misses this way can lead to bad decisions about which version to take. For that case I plug in other RDTSC-based macros/calls around the code in a complete, big app, and measure min/max/avg/avg2/last/count.
Posted on 2009-11-21 23:48:21 by Ultrano
Thanks for all the replies - when I have to invent problems to solve, the world is in a good state !!
I'll try to make some effort to post more of these simple problems, and I hope next time to hear from some more beginners as well :) (not that I mind the regulars showing everyone how it's done, but it's much more rewarding for me to know that someone pushed their envelope a little bit, or that someone gained something from the exercise).

Posted on 2009-11-22 00:11:38 by Homer
A bit of fixup and completion: in my previous post I meant Scali's 2nd version, not the 3rd. Too many post-edits there :) .
I added `add var2,eax` to verify that the calculations don't get skipped and skewed by the out-of-order execution.
```asm
main proc
	local time1,time2,time3,time4,time5
	local num,var2
	local tmp

	mov num,455445454

	START_TEST time1
	mov eax,num
	mov var2,eax
	END_TEST

	START_TEST time1
	mov eax,num
	movzx edx,ax
	shr eax,15
	shl dx,1
	shr ah,1
	shr dh,1
	shr ax,1
	shr dx,1
	cwde
	shl eax,14
	add eax,edx
	add var2,eax
	END_TEST

	START_TEST time2
	mov eax,num
	mov edx,eax
	mov esi,eax
	mov edi,eax
	and eax,7fh
	and edx,7f00h
	and esi,7f0000h
	and edi,7f000000h
	shr edx,1
	shr esi,2
	shr edi,3
	or eax,edx
	or esi,edi
	or eax,esi
	add var2,eax
	END_TEST

	START_TEST time3
	mov eax, num
	mov edx, eax
	and edx, 7FFFFFh
	add eax, edx
	mov edx, eax
	and edx, (7FFFh SHL 1)
	add eax, edx
	mov edx, eax
	and edx, (7Fh SHL 2)
	add eax, edx
	shr eax, 3
	add var2,eax
	END_TEST

	START_TEST time4
	mov eax, num
	mov edx, num
	and eax, 7F7F7F7Fh
	and edx, 7F7F7Fh
	add eax, edx
	add edx, edx
	and edx, (7F7Fh SHL 1)
	add eax, edx
	add edx, edx
	and edx, (7Fh SHL 2)
	add eax, edx
	shr eax, 3
	add var2,eax
	END_TEST

	START_TEST time5
	mov eax, num
	mov esi, num
	mov edx, num
	mov ebx, num
	and eax, 7F7F7F7Fh
	and edx, 7F7F7Fh
	and esi, 7F7Fh
	and ebx, 7Fh
	add eax, edx
	add esi, esi
	add ebx, ebx
	add eax, esi
	add ebx, ebx
	add eax, ebx
	shr eax, 3
	add var2,eax
	END_TEST

	PrintLine
	ret
main endp
```

c2d E8500 @3.8GHz, DDR3@1.6GHz (clk7) :

time1 = 2.861777 ; <null,warmup>
time1 = 13.15349 ; sysfce2
time2 = 6.553432 ; ultrano
time3 = 6.410781 ; scali1
time4 = 6.218256 ; scali2
time5 = 7.705081 ; scali3
----------------------------------------

Posted on 2009-11-22 01:03:03 by Ultrano

> By optimal, I meant fastest execution (not necessarily shortest code).

But on which architecture?
Posted on 2009-11-22 03:35:09 by Scali

> These are my benching macros:

I tried to build a test app from your macros, but it seems I'm missing the PrintLine and PrintFloat stuff?
Posted on 2009-11-22 03:48:40 by Scali
It's modified VKDebug stuff. I use MASM, and some libs that come with the masm32 package. The two macros are:
```asm
PrintFloat macro Var:REQ
	local szDN,float1
	.data
	szDN byte 24 dup(0)
	float1 real8 ?
	.code
	pushfd
	pushad
	push Var
	fld real4 ptr [esp]
	pop eax
	fstp float1
	invoke FloatToStr,float1,addr szDN ; <------ need to implement this
	invoke strlen,addr szDN
	add eax,@SizeStr(&Input)+4
	invoke GlobalAlloc, GMEM_FIXED+GMEM_ZEROINIT, eax ; yuck
	mov ebx,eax
	FillMem ebx,&Var
	mov dword ptr , 203D20h ; " = " (destination operand lost in the original post)
	invoke lstrcat,eax,addr szDN
	invoke DebugPrint,ebx
	invoke GlobalFree,ebx
	popad
	popfd
endm

PrintLine macro
	prints "--------------"
endm
```
Posted on 2009-11-22 06:06:47 by Ultrano
With f0dder's old 'yodel' framework slightly reworked to suit this test, I got this from a P4:
test01-Scali1                ...000547 ticks (effective 20.775 clk/iteration)
test02-Scali2                ...000515 ticks (effective 19.560 clk/iteration)
test03-Scali3                ...000625 ticks (effective 23.737 clk/iteration)
test04-sysfce2                ...000578 ticks (effective 21.952 clk/iteration)
test05-Ultrano                ...000610 ticks (effective 23.168 clk/iteration)

And this from a Core2 Duo:
test01-Scali1                ...000375 ticks (effective 11.250 clk/iteration)
test02-Scali2                ...000344 ticks (effective 10.320 clk/iteration)
test03-Scali3                ...000406 ticks (effective 12.180 clk/iteration)
test04-sysfce2                ...000608 ticks (effective 18.240 clk/iteration)
test05-Ultrano                ...000452 ticks (effective 13.560 clk/iteration)

It's not entirely fair towards my third routine and Ultrano's routine, since they both use non-volatile registers, which have to be pushed and popped.
Posted on 2009-11-22 09:32:46 by Scali
I've added the following two C versions, compiled with VS2008 with /Ox flag:

```cpp
extern "C" unsigned int __stdcall time6(unsigned int num)
{
	return (num & 0x7F) | ((num & 0x7F00) >> 1) | ((num & 0x7F0000) >> 2) | ((num & 0x7F000000) >> 3);
}
```

```cpp
extern "C" unsigned int __stdcall time7(unsigned int num)
{
	unsigned int result = num & 0x7F7F7F7F;
	unsigned int tmp    = num & 0x7F7F7F;

	result += tmp;
	tmp += tmp;
	tmp &= (0x7F7F << 1);
	result += tmp;
	tmp += tmp;
	tmp &= (0x7F << 2);
	result += tmp;

	return result >> 3;
}
```

test01-Scali1                ...000374 ticks (effective 11.220 clk/iteration)
test02-Scali2                ...000359 ticks (effective 10.770 clk/iteration)
test03-Scali3                ...000421 ticks (effective 12.630 clk/iteration)
test04-sysfce2                ...000624 ticks (effective 18.720 clk/iteration)
test05-Ultrano                ...000468 ticks (effective 14.040 clk/iteration)
test06-ANSI C 1              ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2              ...000343 ticks (effective 10.290 clk/iteration)

First version appears to be this:
```asm
_text:00000000                 mov     ecx, [esp+4]
_text:00000004                 mov     eax, ecx
_text:00000006                 shr     eax, 1
_text:00000008                 and     eax, 3F800000h
_text:0000000D                 mov     edx, ecx
_text:0000000F                 and     edx, 7F0000h
_text:00000015                 or      eax, edx
_text:00000017                 mov     edx, ecx
_text:00000019                 shr     eax, 1
_text:0000001B                 and     edx, 7F00h
_text:00000021                 or      eax, edx
_text:00000023                 shr     eax, 1
_text:00000025                 and     ecx, 7Fh
_text:00000028                 or      eax, ecx
```

Second one is this:
```asm
_text:00000000                 mov     eax, [esp+4]
_text:00000004                 mov     ecx, eax
_text:00000006                 and     eax, 7F7F7F7Fh
_text:0000000B                 and     ecx, 7F7F7Fh
_text:00000011                 add     eax, ecx
_text:00000013                 mov     edx, eax
_text:00000015                 lea     eax, [ecx+ecx]
_text:00000018                 and     eax, 0FEFEh
_text:0000001D                 add     edx, eax
_text:0000001F                 add     eax, eax
_text:00000021                 and     eax, 1FCh
_text:00000026                 add     eax, edx
_text:00000028                 shr     eax, 3
```
Posted on 2009-11-22 14:39:53 by Scali
Looking at that code reminded me that I forgot to disable MASM's prologue/epilogue code, so I got some gratuitous esp/ebp stuff.
Removed that, the results now sorta look like this on Core2 Duo:
test01-Scali1                ...000343 ticks (effective 10.290 clk/iteration)
test02-Scali2                ...000297 ticks (effective 8.910 clk/iteration)
test03-Scali3                ...000374 ticks (effective 11.220 clk/iteration)
test04-sysfce2                ...000562 ticks (effective 16.860 clk/iteration)
test05-Ultrano                ...000374 ticks (effective 11.220 clk/iteration)
test06-ANSI C 1              ...000312 ticks (effective 9.360 clk/iteration)
test07-ANSI C 2              ...000343 ticks (effective 10.290 clk/iteration)

And like this on Pentium 4:
test01-Scali1                ...000453 ticks (effective 17.205 clk/iteration)
test02-Scali2                ...000438 ticks (effective 16.635 clk/iteration)
test03-Scali3                ...000500 ticks (effective 18.990 clk/iteration)
test04-sysfce2                ...000422 ticks (effective 16.028 clk/iteration)
test05-Ultrano                ...000515 ticks (effective 19.560 clk/iteration)
test06-ANSI C 1              ...000438 ticks (effective 16.635 clk/iteration)
test07-ANSI C 2              ...000437 ticks (effective 16.597 clk/iteration)

I suppose sysfce2's code clearly demonstrates why architecture matters. His code is the fastest on Pentium 4, but the slowest on Core2 Duo.
I wonder what causes that exactly... perhaps cwde is slow on Core2... or it could be the partial registers that P4 handles better (partial register stalls have been a problem ever since the first Pentium Pro).
I seem to have misjudged Ultrano's code a bit. I would have expected it to be faster on Core2 Duo, but it seems his code has the same problem as my third version: you can try to take advantage of 3 or 4 ALUs in parallel, but you are limited by the decoders, so it doesn't really work unless the reorder buffer is already full; that won't happen here, because we don't have any weird branches or memory accesses.
So despite my efforts of trying to take advantage of the 4 ALUs in the Core2 Duo in my 3rd variation, the 2nd variation remains the fastest for Core2 Duo.
At least I was right about my algorithm favouring the Pentium 4 though. Pentium 4 has trouble handling shifts, that is probably what makes Ultrano's version slower on the P4. My code performs more 'consistently' between Core2 Duo and Pentium 4.
Posted on 2009-11-22 14:52:38 by Scali