I'd like to ask several questions regarding assembly.  I am a C++ student using MSVC++ 2005's inline __asm {} to learn the basics of assembly.
The code I'm trying to optimize:
template<class T>
int CBinaryHeap<T>::moveUp(int index, int newKey)
{
  register int current2 = index>>1;
  while (index > 1) {//while there is a parent to compare
    if (BinaryHeap[current2].key > newKey) {
      BinaryHeap[index] = BinaryHeap[current2];//move parent down
      index = current2;//on the next iteration, compare the parent
      current2 >>= 1;//divided by 2, points to the parent of the parent
    }
    else {
      break;//we can stop comparing, we've found the index to insert at
    }
  }
  return (index);
}//end moveUp


A binary heap is a partially ordered structure, basically a tree stored inside an array, whose indices serve as pointers to tree members.  To insert an element into the heap, it is placed at the end of the array.  The newly placed element is then iteratively compared with its parent node and filters up, the parent moving down each time.  To avoid temporary objects and extra assignments, the insertion and the moving up have been separated into two functions: the new element is not actually placed on the binary heap until its index has been determined and all its parents have moved down.  I have implemented the binary heap with one extra element, so the root of the tree is at Heap[1], its two children are at 2 and 3, and so on; this saves a bit of pointer calculation.  The datatype of both data and key is int.

In the moveUp function, index is the heapSize, i.e. the index of the last element, or where the new element would be placed if following the standard algorithm.  current2 is the parent, index / 2.

Now, the moveDown function, which is used when removing an element from the heap, is more complex.  It needs to compare both children, and move down.  Because the children are at index*2 and index*2+1, I wanted to optimize this.  I found that I couldn't in C++; anything I tried made it slower.  So I tried rewriting moveDown in assembly, but again things were slower, this time much worse!  So I tried rewriting moveUp in assembly instead, because it is simpler.  This too is slower than what MSVC++ compiles on its own.

So I looked at how MSVC++ compiled the above, and I got this:

mov  eax, dword ptr
sar  eax, 1
mov  dword ptr, eax

while_current#1:
cmp  dword ptr, 1
jle  end#1;

mov  ecx, dword ptr;
mov  edx, dword ptr;
mov  eax, dword ptr;
mov  ecx, dword ptr;
cmp  ecx, dword ptr;
jle  end#2;

mov  edx, dword ptr;
mov  eax, dword ptr;
mov  ecx, dword ptr;
mov  edx, dword ptr;
mov  eax, dword ptr;
mov  ecx, dword ptr;
mov  ecx, dword ptr;
mov  esi, dword ptr;
mov  dword ptr, edx;
mov  dword ptr, eax;

mov  edx, dword ptr;
mov  dword ptr, edx;
mov  eax, dword ptr;
sar  eax, 1;
mov  dword ptr, eax;
jmp  while_current#2;

end#2:
jmp  end#1;

while_current#2:
jmp while_current#1;

end#1:
mov  eax, dword ptr

I have a few questions as to how / why MSVC++ compiled it this way, but first here is my rewritten assembly, copying and pasting some of this code:

__asm {
mov  ebx, index;
shr  ebx, 1;

while_current:
cmp  index, 1;
jle  end;

mov  ecx, dword ptr ;
mov  edx, dword ptr ;
mov  eax, ebx;
mov  ecx, dword ptr ;
cmp  ecx, dword ptr ;
jle  end;

mov  edx, dword ptr ;
mov  eax, dword ptr ;
mov  ecx, ebx;
mov  edx, dword ptr ;
mov  eax, dword ptr ;
mov  ecx, dword ptr ;
mov  ecx, dword ptr ;
mov  esi, dword ptr ;
mov  , edx;
mov  , eax;

mov  index, ebx;
shr  ebx, 1;
jmp  while_current;

end:
}
return index;

So a few things were changed that in my mind should yield extra performance:
1. I removed the double jumps.
2. I replaced current2 with ebx.
3. I've cut down some moves when assigning index = current2; and current2 >>= 1;.
4. Other changes, but this is the current state of my code and it only includes things that really seem like they would work.

However, after compiling my optimized code, it is slower by about 25 - 33%.
Now that you have all the information, here are my questions:

1: Are there any multiplications or additions in statements like this: mov edx, dword ptr [eax+ecx*8]? I see many moves with pointer arithmetic, but I'm not sure how much such a move costs in ops. Or to put it another way, aren't there really two moves and an add in a pair like: mov ecx, dword ptr [eax+ecx*8]; mov edx, dword ptr [eax+ecx*8];?
2: Would it be any faster to store eax+ecx*8 in a free register, so that it can be reused on the next move?
3: Why the double jumps? It looks like it's to handle the break, but surely it could / should have been eliminated.
4: Could it possibly be that the code is slower because it doesn't know how to handle the assembly in templates?
5: Could it possibly be that the code is slower because with assembly there aren't any global optimizations being done? (this is where a profiler would help, but my trial edition doesn't seem to have this, I'm just going by time)
6: Do you see any way you can improve on the MSVC++ assembly here, and/or are any of my optimizations improving or hurting?
7: I thought [] means to load from memory using the address stored inside the [], so why is it doing things like mov edx, dword ptr [index]?
8: Where online can I find resources for AMD and Intel that show cost in cycles for ops, for up to date processors?
9: Would compiling in something like MASM32 be better than inlining?
10: How can I measure execution time in clocks rather than time?

Thanks for your help!
Posted on 2006-05-04 00:05:26 by aalaardb

Hello,

I don't know the answer to all your questions but hopefully I can help.

In answer to question 7: in MASM, mov reg, [label] is the same as mov reg, label. A plain variable name is already a memory operand, so the brackets around it are redundant.
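For example, with a variable declared as var dd 0, the first three of these all assemble to the same memory load:

```asm
mov eax, var              ; a plain label is already a memory operand in MASM
mov eax, [var]            ; the brackets are redundant here
mov eax, dword ptr [var]  ; explicit size, still the same load
mov eax, offset var       ; different: loads the ADDRESS of var, not its value
mov eax, [ebx]            ; brackets do matter when a register holds the address
```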

For question 8: you can get the Intel and AMD developer manuals as PDFs for free from their websites. They can also be obtained as printed versions for free (though not at the moment, but soon).

Also, if you're a student, check out thespoke.net, where you can register for a small fee (about 20 / 30) and then download many Microsoft products for free, like Windows Server 2003, SQL Server, and Visual Studio 2005 Professional (which has a profiler).

And some general comments: this function seems too basic to be worth optimising, tbh. Also, using inline asm can stop the compiler from doing other optimisations, like inlining the function into its callers. And the "register" keyword is of little use these days (and even discouraged, I think).

hth
Posted on 2006-05-04 05:28:45 by stormix
I agree with Stormix, the function looks like it's too small to be worth optimizing; you should probably focus your efforts elsewhere. And when you do need assembly, it's often better to write an external .asm module instead of inline assembly (you need "large" blocks before it's worthwhile, and as Stormix pointed out, inline assembly can confuse some compilers).
Posted on 2006-05-04 05:34:00 by f0dder
The goal of optimizing this is not a matter of effort/reward, but one of learning.  Also, the available optimizations are better in the moveDown function, but I've posted moveUp here for simplicity.
Posted on 2006-05-04 16:00:24 by aalaardb