Hi,

I finally decided to learn some assembler, mainly to understand what my C/C++ compiler (gcc) is doing behind the scenes, and how I can improve the performance of my code.
One thing I've read again and again is that jumps are bad performance-wise and should be avoided. To test that, I implemented a (rather nonsensical) example in C in two ways and looked at the assembler output from gcc. Here's the C code of the first implementation:

int main()
{
    int a = 0;
    int i = 0;
    int j = 0;
    int k = 0;

    for (i = 0; i < 100000; i++)
    {
        for (j = 0; j < 100000; j++)
        {
            k = 4;
            if (a != 0)
                k = 5;
        }
    }

    return 0;
}


And here's the second implementation:
int main()
{
    int a = 0;
    int i = 0;
    int j = 0;
    int k = 0;

    for (i = 0; i < 100000; i++)
    {
        for (j = 0; j < 100000; j++)
        {
            k = 5;
            if (a == 0)
                k = 4;
        }
    }

    return 0;
}


Semantically, the two are the same. The only difference is that the condition in the if clause is false in the first case, and true in the second. The corresponding assembler code created by gcc is identical except for the if-clause, obviously. For the first implementation, it is
...
movl $4, -24(%ebp)
cmpl $0, -12(%ebp)
je L6
movl $5, -24(%ebp)
L6:
movl $0, -28(%ebp)
...


For the second implementation, the assembler code is
...
movl $5, -24(%ebp)
cmpl $0, -12(%ebp)
jne L6
movl $4, -24(%ebp)
L6:
movl $0, -28(%ebp)
...


This means that in the first implementation the jump is executed every single time, whereas it is never executed in the second implementation (at the cost of an extra mov instruction). However, the first implementation runs significantly FASTER than the second! This would imply that the extra mov instruction of the 2nd implementation is worse than the jump... Why is that?

I'm on Mac OS X, by the way.

Thanks,
Alex
Posted on 2011-07-24 10:18:53 by awe
Fewer instructions is faster...
Not sure what your question is?
You are not sure about how a jump is executed by the CPU?
As long as the CPU can predict a jump properly, it only takes a single cycle (on some CPUs even 0 cycles, depending on how the pipeline is designed). In both cases the jump is easy to predict, since its outcome is always the same (it doesn't matter whether that outcome is taken or not-taken).
So the fact that the first version's jump skips the second mov means it executes fewer instructions, which makes it faster.

Problems with performance are more related to mispredictions.
Aside from that, a jump still takes 1 cycle even when it is predicted properly... So if you can design your code in a way that doesn't need the jump at all, you save a cycle (loop unrolling, for example, saves a number of jumps).
Posted on 2011-07-24 10:37:12 by Scali
Hi Scali,

yes, I was aware that fewer instructions usually means faster code. However, pretty much every assembler book I've seen preaches that jumps should be avoided as much as possible. That made me think that jumps are much more expensive than just one clock cycle.
One book on performance that I've read suggests that high-level code should be written in such a way that the more likely outcome of an if-clause should be coded first to avoid the jump in most cases. It went on to state that in some cases it might even still be faster to have two assignments if it means that the jump can be skipped. That's what I wanted to test with my second implementation.

Out of curiosity, roughly how many clock cycles does the mov instruction take in this case?

Thanks,
Alex
Posted on 2011-07-24 18:45:13 by awe
What book is that, and how old is it? Sounds like it's a bit outdated. You have to realize that CPUs are a moving target. I often see assembly programmers still using instruction sequences that were optimal for 386 CPUs back in the late 80s, but are disastrous on today's machines. Each CPU has its own optimization rules, so be sure to learn the right rules for your CPU.

An if-clause is more than just a jump however. It's also a comparison. And again, the problem is predictability.

Your example has perfect predictability, so you're not measuring the cost of a mispredicted jump. If the jump were mispredicted half the time, you'd see something completely different. In your case, it's no different from an unconditional jump: the CPU just has to update the instruction pointer, which is not a very expensive operation.

A mov from memory takes two to three cycles on most modern x86 CPUs, assuming the data is in L1-cache (which it is in your case).

Hardly anyone writes about assembly optimizations anymore. The most up-to-date resources I know of are the Intel Optimization Manuals and Agner Fog's resources. (Yes, AMD has some optimization manuals as well, but they aren't very good: mostly just basic examples, without much in-depth information about the architecture.)
Posted on 2011-07-25 02:36:25 by Scali
With respect to conditional jumps, if you plan on hand-optimizing asm then you should be aware of branch-prediction predicates.
You can basically give the CPU a hint as to which logical branch will be taken MOST of the time, and it will use that when predicting the branch.
Posted on 2011-07-25 08:52:55 by Homer
With respect to conditional jumps, if you plan on hand-optimizing asm then you should be aware of branch-prediction predicates.
You can basically give the CPU a hint as to which logical branch will be taken MOST of the time, and it will use that when predicting the branch.
Do those branch hints actually make a difference on any existing CPUs? I like the idea of them, but was under the impression they didn't really mean much in the real world?
Posted on 2011-07-27 14:15:36 by f0dder

Do those branch hints actually make a difference on any existing CPUs? I like the idea of them, but was under the impression they didn't really mean much in the real world?


They were introduced on the Pentium 4, if I'm not mistaken?
No idea if Intel kept them in later CPUs, or if AMD adopted them.
If I recall correctly, the idea was that a 'cs' or 'ds' prefix byte in front of a conditional jump would force the prediction logic to predict the jump as 'always taken' or 'never taken'.
Then again, perhaps prediction has become so good that it doesn't really add much anymore.
I don't think I've ever seen a compiler emit these predicates in actual code.
Posted on 2011-07-27 15:52:51 by Scali
Sparc seems to rely heavily on branch prediction. I don't know whether branch prediction was a part of RISC design from the beginning or not, but I believe it has been around for a while.

I don't have time right now, but if I remember I will try to post some assembly output from Solaris's compiler for Sparc, and we should be able to see if it uses branch prediction or not.
Posted on 2011-07-28 04:51:05 by JoeCoder

However, pretty much every assembler book I've seen preaches that jumps should be avoided as much as possible. That made me think that jumps are much more expensive than just one clock cycle.


I don't know, because I haven't seen the books you have seen, but I could easily believe that kind of opinion is the result of parroting the same brainless mantra of never using GOTOs in code. In some HLLs GOTO is appropriate, in others not. In assembly you see how the machine works, and you quickly realize that jumps/branches are among the most commonly used instructions. You have very little way of controlling program flow without explicit jumps, and on certain processors no way at all.

One book on performance that I've read suggests that high-level code should be written in such a way that the more likely outcome of an if-clause should be coded first to avoid the jump in most cases. It went on to state that in some cases it might even still be faster to have two assignments if it means that the jump can be skipped. That's what I wanted to test with my second implementation.


The whole point of a high-level language and an optimizing compiler for that language is to free the HLL coder from worrying about what's best on the machine, and to let you focus on coding a correct solution in your problem domain. If an HLL can't provide abstraction, and the compiler for that HLL can't optimize it, you may as well be coding in assembly.

Posted on 2011-07-28 05:01:02 by JoeCoder

Sparc seems to rely heavily on branch prediction. I don't have time right now, but if I remember I will try to post some assembly output from Solaris's compiler for Sparc, and we should be able to see if it uses branch prediction or not.


Well, I was talking about x86, where predication (*not* prediction) is a relatively new feature (predication being a static 'hint' or 'predicate' embedded in the code to tell the CPU what to expect for a given branch instruction).
Clearly, on CPUs where predication has been part of the instruction set for much longer, you will see compilers emitting code for it. On Itanium, for example, it is quite crucial to use predication, since the CPU was not designed to do branch prediction at all. It depends strictly on the compiler to predict branches; in the worst case, it will execute both sides of the branch in parallel (and thus at half speed), until the result of the branch is known and the incorrect side can be terminated.
Posted on 2011-07-28 05:02:01 by Scali

I don't know because I haven't seen the books you have seen but I could easily believe that kind of opinion is the result of parroting the same brainless mantra of not using GOTOs ever in code. In some HLL GOTO is appropriate, in others not. In assembly you will see how the machine works and you quickly realize jumps/branches etc. are one of the most commonly-used instructions. You have very little way of controlling program flow without explicit jumps, and on certain processors you have no way to control flow without jumps.


In fact, if you read Agner Fog's recent Sandy Bridge additions to his optimization manuals, he actually argues for MORE jumps...
That is, Sandy Bridge has a micro-op cache (much like the trace cache on the Pentium 4), which can cache pre-decoded x86 instructions for loops.
Agner Fog argues that since this cache is relatively small, you should not unroll loops. This way you can fit more into the cache, which will effectively be faster.
So the CPU will actually take more jumps this way.


The whole point of a high level language and an optimizing compiler  for that language is to free the HLL coder from having to worry about issues like how to code your program according to what's best on the machine and allow you to focus on coding the correct solution in your problem domain. If an HLL can't provide abstraction, and the compiler for that HLL can't optimize it, you may as well be coding in assembly.


Garbage in == garbage out.
The problem with programming languages is that they only describe the algorithms, not the actual input and output. The compiler does a static analysis of the code, but not a dynamic one.
So in most cases, a compiler cannot reasonably predict what the most likely path will be.
You'll need profiling of actual code at runtime to make certain optimizations to the code. And even then, it's mostly lies and statistics.
Take for example something like a movie encoder/decoder. The type of movie you will use to profile it, may greatly affect your profiling results. If you take an entirely black image, then it will be encoded massively differently from a movie with lots of high-frequency noise.
You need human intelligence and creativity for certain tasks.

A pet peeve of mine: people who think the compiler will solve everything, and as such don't even bother to think about optimization issues in the first place (which means they'll never learn).
So I certainly don't agree that you might as well use assembly for things the HLL doesn't solve. There are many aspects that an HLL won't solve for you, and most of them have to do with developing software in general. Writing the code is just the last stage of a long and complex process; whether you use an HLL or assembly shouldn't matter all that much. Assembly only lets you tweak the last few minor details.
Posted on 2011-07-28 05:15:04 by Scali
Solaris Studio compilers do indeed emit branch prediction hints on Sparc. I'm going to post a few fragments of the assembly and disassembly generated by Solaris Studio F95 on a Towers of Hanoi program I found on the net.

First a piece of the assembly. You can see the branch prediction hints if you look at the BLE instructions: PT means "predict taken", PN means "predict not taken".


.L900000140:
/* 000000   7 */ save %sp,-392,%sp

!    8       !      starting_post, goal_post)
!    9       ! 
!  10       !  integer, intent (in) ::  &
!  11       !  number_of_disks, starting_post, goal_post
!  12       !  integer :: free_post
!  13       !  ! all_posts is the sum of the post values 1+2+3
!  14       !  ! so that the free post can be determined
!  15       !  ! by subtracting the starting_post and the
!  16       !  ! goal_post from this sum.
!  17       !  integer, parameter :: all_posts = 6
!  18       !
!  19       !  if (number_of_disks > 0) then

/* 0x0004   19 */ ld [%i0],%l7
/* 0x0008     */ cmp %l7,0
/* 0x000c     */ ble,pn %icc,.L77000046

!  20       !      free_post =  &

/* 0x0010   20 */ or %g0,6,%l1
                     
! predecessor blocks: .L900000140

.L77000041:
/* 0x0014   20 */ ld [%i2],%l5
/* 0x0018     */ ld [%i1],%l6

!  21       !      all_posts - starting_post - goal_post
!  22       !      call hanoi (number_of_disks - 1,  &

/* 0x001c   22 */ add %l7,-1,%l2
/* 0x0020   0 */ sethi %hi(hanoi_module.hanoi.STR$4),%i4
/* 0x0024   20 */ sub %l1,%l5,%l0
/* 0x0028   0 */ add %i4,%lo(hanoi_module.hanoi.STR$4),%i4
/* 0x002c   20 */ sub %l0,%l6,%l4
/* 0x0030   19 */ cmp %l2,0
/* 0x0034     */ ble,pn %icc,.L77000057
/* 0x0038   20 */ st %l4,[%fp-16]

! Registers live out of .L77000041:
! o0 sp l1 l2 l4 l5 l6 l7 i0 i1 i2 i4 fp i7 gsr
!
                     
! predecessor blocks: .L77000041

.L77000050:
/* 0x003c   20 */ add %l4,%l6,%o5
/* 0x0040   22 */ add %l7,-2,%i5
/* 0x0044   20 */ sub %l1,%o5,%l3
/* 0x0048   22 */ sllx %i5,32,%o3
/* 0x004c     */ srl %l3,0,%o4
/* 0x0050     */ add %fp,-44,%o2
/* 0x0054     */ or %o3,%o4,%o0
/* 0x0058     */ stx %o0,[%fp-48]
/* 0x005c     */ or %g0,%i1,%o1
/* 0x0060     */ call hanoi_module.hanoi_ ! params =  %o0 %o1 %o2 ! Result =
/* 0x0064     */ add %fp,-48,%o0

/* 0x0214     */ call __f90_slw_ch ! params =  %o0 %o1 %o2 ! Result =
/* 0x0218     */ add %fp,-112,%o0
/* 0x021c     */ ld [%i2],%l5
/* 0x0220     */ add %fp,-112,%o0
/* 0x0224     */ call __f90_slw_i4 ! params =  %o0 %o1 ! Result =
/* 0x0228     */ or %g0,%l5,%o1
/* 0x022c     */ call __f90_eslw ! params =  %o0 ! Result =
/* 0x0230     */ add %fp,-112,%o0
/* 0x0234   19 */ cmp %i3,0
/* 0x0238     */ ble,pt %icc,.L77000046
/* 0x023c     */ nop


Now here's a piece of disassembly. I should have lined up both these pieces but I just wanted to show the different instructions generated for BLE,PT and BLE,PN.


hanoi_module.hanoi_()
    hanoi_module.hanoi_:      9d e3 be 78  save      %sp, -0x188, %sp
    hanoi_module.hanoi_+0x4:  ee 06 20 00  ld        [%i0], %l7
    hanoi_module.hanoi_+0x8:  80 a5 e0 00  cmp      %l7, 0x0
    hanoi_module.hanoi_+0xc:  04 40 00 ba  ble,pn    %icc, +0x2e8  <hanoi_module.hanoi_+0x2f4>
    hanoi_module.hanoi_+0x10:  a2 10 20 06  mov      0x6, %l1
    hanoi_module.hanoi_+0x14:  ea 06 a0 00  ld        [%i2], %l5
    hanoi_module.hanoi_+0x18:  ec 06 60 00  ld        [%i1], %l6
    hanoi_module.hanoi_+0x1c:  a4 05 ff ff  add      %l7, -0x1, %l2
    hanoi_module.hanoi_+0x20:  39 00 00 6d  sethi    %hi(0x1b400), %i4
    hanoi_module.hanoi_+0x24:  a0 24 40 15  sub      %l1, %l5, %l0
    hanoi_module.hanoi_+0x28:  b8 07 23 60  add      %i4, 0x360, %i4
    hanoi_module.hanoi_+0x2c:  a8 24 00 16  sub      %l0, %l6, %l4
    hanoi_module.hanoi_+0x30:  80 a4 a0 00  cmp      %l2, 0x0
    hanoi_module.hanoi_+0x34:  04 40 00 37  ble,pn    %icc, +0xdc  <hanoi_module.hanoi_+0x110>
    hanoi_module.hanoi_+0x38:  e8 27 bf f0  st        %l4, [%fp - 0x10]

    hanoi_module.hanoi_+0x218: 90 07 bf 90  add      %fp, -0x70, %o0
    hanoi_module.hanoi_+0x21c: ea 06 a0 00  ld        [%i2], %l5
    hanoi_module.hanoi_+0x220: 90 07 bf 90  add      %fp, -0x70, %o0
    hanoi_module.hanoi_+0x224: 40 01 1a a9  call      +0x46aa4      <__f90_slw_i4>
    hanoi_module.hanoi_+0x228: 92 10 00 15  mov      %l5, %o1
    hanoi_module.hanoi_+0x22c: 40 01 1a aa  call      +0x46aa8      <__f90_eslw>
    hanoi_module.hanoi_+0x230: 90 07 bf 90  add      %fp, -0x70, %o0
    hanoi_module.hanoi_+0x234: 80 a6 e0 00  cmp      %i3, 0x0
    hanoi_module.hanoi_+0x238: 04 48 00 2f  ble,pt    %icc, +0xbc  <hanoi_module.hanoi_+0x2f4>
    hanoi_module.hanoi_+0x23c: 01 00 00 00  nop
Posted on 2011-07-28 06:54:23 by JoeCoder
Oh, by the way... there's a more basic kind of 'predication' that x86 processors have been using since... the classic Pentium I believe.
Quite simply: If the target address for a conditional jump is lower than the current address, predict that the jump will be taken, else predict that it will not be taken.

The basic idea for that is simple: It is the structure for a loop (jump back N times, until the loop is finished, then fall through). This way a loop will have at most one bad prediction: at the last iteration.
So when you build conditional jumps, you can place the jump target either above or below your jump, depending on whether you think the most common case is to take the jump or not.
Posted on 2011-07-30 11:43:50 by Scali