Hello...
I decided to look more deeply at some sections of my code to try to optimize it...

I have coded three different versions that all do the same thing:



mov esi, MemPointer
xor eax, eax
lodsb
shl eax, 8
mov edx, eax
xor eax, eax
lodsb
add eax, edx


16 bytes long
375 300 cycles



mov esi, MemPointer
xor edx, edx
mov dl, byte ptr [esi]
shl edx, 8
xor eax, eax
mov al, byte ptr [esi+1]
add eax, edx


17 bytes long
233 600 cycles



mov esi, MemPointer
movzx edx, byte ptr [esi]
shl edx, 8
movzx eax, byte ptr [esi+1]
add eax, edx


15 bytes long
200 300 cycles

Note : each portion of code has been run 50 000 times in a loop in order to have *accurate* timings... (are they ?)

Now, I would like to ask if the third code is really the fastest ?
It is the first time I clock code "seriously" and I would like to know if the results I obtained are reliable...

PS : If somebody has a faster version than the three presented here, I'm interested ! :tongue:

Thanks !
Posted on 2002-02-02 12:23:20 by JCP
Well if the aim is just to have the low word of eax contain the bytes in reverse order to the way they were stored then this code is very slightly faster, but then again if you need edx to contain that value then this is not a replacement. ;)

mov esi, MemPointer
xor eax,eax
mov ah,
mov al,

BTW, when I time your code I get 159,000 clocks if I use edi as a loop counter and 250,000 if I use a memory variable. And by faster I meant that mine times in at 152,000 with edi, which isn't much faster at all.

Oh yeah, if I ignore the looping and just time the instructions (which many have argued is inaccurate), they time at 3 clocks for mine to 4 for yours. But as I've said, no one seems to trust those timings. :confused:
Posted on 2002-02-02 13:28:57 by Eóin
Thanks for the reply Eóin,

What I want to do is in fact
MemPointer[0] * 256 + MemPointer[1]

Here is how I clocked it:



restart:
mov ecx, 50000
TESTCLOCKS_ON
@@:
mov esi, MemPointer
xor eax, eax
lodsb
shl eax, 8
mov edx, eax
xor eax, eax
lodsb
add eax, edx
dec ecx
jnz @B
TESTCLOCKS_OFF
SHOWCLOCKS_RESULT
jmp restart


The TESTCLOCKS* thingies are just macros playing with rdtsc...
Posted on 2002-02-02 15:09:51 by JCP
Readiosys,

Just from looking at the code, the third version using MOVZX should be faster, as it is a well optimised instruction on later processors and the third algo has the shortest instruction path.

Just a suggestion, try XOR ESI, ESI before the line "mov esi, MemPointer" to see if the Intel internal optimisation for DWORD to BYTE makes it faster.

The SHL instruction will probably be a bottleneck but there may not be another way to do it.

Regards,

hutch@movsd.com
Posted on 2002-02-02 15:17:40 by hutch--
mov eax,

bswap eax

mov edx,eax
shr eax,16
and edx,0FFFFh


This code does two at once. If you interleave two of these routines to do four words at once you can eliminate all forward dependencies and get some pretty good speed!
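
For readers who want to check the arithmetic, here is a minimal C sketch of the same idea: load four bytes, reverse them, then split the result into two words. The helper names `bswap32` and `two_words_be` are hypothetical and used only for illustration; the 32-bit load is written out byte by byte so that it behaves like a little-endian x86 load on any host.

```c
#include <stdint.h>

/* Portable stand-in for the BSWAP instruction: reverse the four bytes. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Hypothetical helper: converts two consecutive big-endian words at p. */
static void two_words_be(const unsigned char *p,
                         uint32_t *first, uint32_t *second)
{
    /* Same value a 32-bit load from p would give on little-endian x86. */
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    v = bswap32(v);          /* bswap eax */
    *first = v >> 16;        /* shr eax, 16   : p[0] * 256 + p[1] */
    *second = v & 0xFFFFu;   /* and edx, 0FFFFh : p[2] * 256 + p[3] */
}
```

With the bytes 12h 34h 56h 78h in memory, the first word comes out as 1234h and the second as 5678h.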
Posted on 2002-02-02 15:19:22 by bitRAKE
bitRAKE, why the


mov edx, eax
shr eax, 16
and edx, 0FFFFh



Why not:


mov eax, [esi]
bswap eax
rol eax, 16


Mirno
Posted on 2002-02-03 06:21:41 by Mirno
The problem seems extremely simple, logical and thus straightforward:

MOV AX,
XCHG AL,AH

Should be the fastest even on P6, i.e. no partial register stall.

Greets,
Maverick
Posted on 2002-02-03 07:52:28 by Maverick
Well, if you don't want to use a variable but a pointer then change it simply to:

MOV EDX,
MOV AX,
XCHG AH,AL

In general you should try to avoid extra memory references when they aren't really indispensable. Create a MACRO with the XCHG instruction and use it where it's necessary. For 32bit values use BSWAP instead.

There was an undocumented form of BSWAP which worked on 16bit values like this:

EAX = $12345678

becomes, IIRC:

EAX = 78345612

But it didn't work on my first Athlon.. so since then I just forgot how to do it and experimented elsewhere.


Greets,
Maverick
Posted on 2002-02-03 07:59:07 by Maverick
To avoid MOV AX, partial register stalls use MOVZX, etc..

Blah.. I better go back to coding my optimizer for my compiler ;)

Sorry for the 3 posts, but I was doing 10 other things at the same time.

Greets,
Maverick
Posted on 2002-02-03 08:02:15 by Maverick
A last alternative (you will take care to benchmark them all, and please report) is, once you have your 16bit value in AX, to simply do:

ROL AX,8

instead of:

XCHG AH,AL
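
A small C sketch, under the assumption that only the 16-bit result matters, showing that rotating a 16-bit value by 8 and exchanging its two bytes give the same result. The function names are hypothetical; this says nothing about which instruction is faster on a given CPU, which still has to be measured.

```c
#include <stdint.h>

/* Equivalent of ROL AX,8: rotate a 16-bit value left by 8 bits. */
static uint16_t rol16_by_8(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Equivalent of XCHG AH,AL: exchange the high and low bytes. */
static uint16_t xchg_bytes(uint16_t v)
{
    uint8_t lo = (uint8_t)(v & 0xFFu);
    uint8_t hi = (uint8_t)(v >> 8);
    return (uint16_t)(((uint16_t)lo << 8) | hi);
}
```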

Greets,
Maverick
Posted on 2002-02-03 08:16:10 by Maverick

There was an undocumented form of BSWAP which worked on 16bit values like this:

EAX = $12345678

becomes, IIRC:

EAX = 78345612

Nah.. now I recall, it did -> $34127856, but, as I said, it wasn't documented by Intel and didn't work on other CPUs, so I just stopped using it. If I recall correctly you had to add the $66 prefix to get this form. It was a long time ago anyway.

Greets,
Maverick
Posted on 2002-02-03 08:22:09 by Maverick
Mirno, EAX holds one word, and EDX holds another. It might not relate to the problem at all. :) If Readiosys needs to convert more than one word at a time, I have no idea. Sorry, I did not explain more.
mov eax,[esi]

mov ecx,[esi+4]

bswap eax
bswap ecx
mov edx,eax
mov ebx,ecx
shr eax,16
shr ecx,16
and edx,0FFFFh
and ebx,0FFFFh
Not counting the load/store, the operation takes 1 cycle per word!
Posted on 2002-02-03 10:34:51 by bitRAKE

What I want to do is in fact
MemPointer[0] * 256 + MemPointer[1]


mov ah,MemPointer[0]
mov al,MemPointer[1]

1 clock if the data is in the cache
Posted on 2002-02-04 02:02:17 by The Svin
Thanks for all your advice... and please excuse my late reply, I have been way too busy this week to experiment with it.

The problem is that most of your solutions don't allow me to do the famous shl reg, 8 (as the two bytes I want are in the same register... the result will not be what I want), which is indispensable to my routine... so I don't think I can read two bytes at once into the same register in this case... :(

I tried to optimize it again: I came up with 166 000 cycles, but it won't work in all cases. :(

In fact here is the clear "algorithm"

32 bit register = (MemPointer[0] * 256) + MemPointer[1]
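
As a reference point for checking any of the assembly variants, the formula above can be written as a minimal C sketch. The function name `load_be16` is hypothetical; the two byte reads are widened to 32 bits before combining, much like the two MOVZX loads in the third version.

```c
#include <stdint.h>

/* Reads two bytes and combines them as mem_pointer[0] * 256 + mem_pointer[1].
   The result is zero-extended into a 32-bit value. */
static uint32_t load_be16(const unsigned char *mem_pointer)
{
    return ((uint32_t)mem_pointer[0] << 8) + mem_pointer[1];
}
```

For the bytes 12h, 34h this yields 1234h, and the upper 16 bits of the result are always zero.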

I have a lame question :

mov esi, MemPointer
movzx eax, byte ptr
movzx edx, byte ptr

This works as wished...

This does not:
movzx eax, byte ptr
movzx eax, byte ptr

Shouldn't the result be the same? :eek:

Thanks.

Bye.
Posted on 2002-02-09 06:38:04 by JCP
I have a lame question :

mov esi, MemPointer
movzx eax, byte ptr
movzx edx, byte ptr

This works as wished...

This does not:
movzx eax, byte ptr
movzx eax, byte ptr

Shouldn't the result be the same?


Now that I've had some sleep: it must be because MemPointer is defined as a LOCAL, isn't it?
Posted on 2002-02-10 06:40:47 by JCP
shouldn't the second line be movzx edx, byte ptr ?

Thomas
Posted on 2002-02-10 08:34:39 by Thomas
You are right Thomas, It was a typo when I copied the code (too poor to do copy paste ? :rolleyes: ) but it doesn't change the problem (even if I place MemPointer in the .data? section...).
Posted on 2002-02-10 10:09:25 by JCP
If somebody cares, I got rid of the MemPointer var (hurray !)... so the use of is no longer a problem but it is still weird to me and if someone has an explanation, I would be glad to hear it. ;)

Thanks :)
Posted on 2002-02-11 01:29:15 by JCP
I see that you don't know assembler totally.

The second code is converted into:

movzx eax, byte ptr
movzx edx, byte ptr

(0XX is the offset to the LOCAL from ebp)

That means that you move from the actual pointer instead of from where the pointer points.
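
The difference can be sketched in C, assuming a pointer variable that holds the address of the data. The function names are hypothetical. The first function reads the bytes the pointer points to; the second reads the first two bytes of the pointer variable itself, which is what the failing code did.

```c
#include <stdint.h>
#include <string.h>

/* Intended behaviour: follow the pointer, then read two bytes of data. */
static uint32_t read_target(const unsigned char *mem_pointer)
{
    return ((uint32_t)mem_pointer[0] << 8) + mem_pointer[1];
}

/* The mistake: read the first two bytes of the pointer variable itself,
   i.e. part of the stored address rather than the data it points to. */
static uint32_t read_pointer_variable(const unsigned char *const *pp)
{
    unsigned char raw[sizeof *pp];
    memcpy(raw, pp, sizeof *pp);   /* copy the address bytes, not the data */
    return ((uint32_t)raw[0] << 8) + raw[1];
}
```

The second function returns a value built from the address, so its result changes with where the data happens to live and generally has nothing to do with the two bytes stored there.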
Posted on 2002-02-13 03:22:54 by gliptic
I never said nor pretended to know assembly totally... (few people here can say they know all about assembly).
and yes, I figured out what you said after debugging it more deeply (the similar syntax confused me, I suppose)...
Posted on 2002-02-14 04:55:42 by JCP