Hi there,

I recently read a blog post suggesting that the performance of the "lock cmpxchg" instruction is significantly worse on Sandy Bridge than on Nehalem processors. The author states that "with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions":

http://mechanical-sympathy.blogspot.com/2011/09/adventures-with-atomiclong.html

He's posted a sample C++ program and invited people to investigate the difference between Sandy Bridge and Nehalem for themselves. The post has received comments from someone called Doug Lea, who one might assume is the same Doug Lea who contributed the java.util.concurrent packages to the JDK. Esteemed company indeed.

Can this forum's readers see any flaws in the methodology or explain why the author may be seeing these results? Can anyone reproduce his findings on appropriate kit?

Regards

Michael
Posted on 2011-09-16 14:36:56 by michaelg
I suppose that is something that only Intel can really answer.
They significantly improved the performance of locked instructions with Nehalem (it would be interesting to compare Nehalem and Sandy Bridge against other x86 architectures: is Nehalem just exceptionally good, or is Sandy Bridge exceptionally bad?).
It could be that not all of Nehalem's improvements in this area were carried over to Sandy Bridge, for whatever reason (they may have been incompatible with other improvements).

On the other hand, Sandy Bridge has some new improvements for partial register stalls. Instead of stalling, it inserts an extra micro-op into the pipeline to resynchronize partial registers.
cmpxchg8b suffers from partial updates to the flags register (xadd does not, and neither should regular cmpxchg).
If they use cmpxchg8b, that might explain the difference with xadd. It may also give some insight into why there's a difference with Nehalem: Sandy Bridge has to perform more operations.
Normally that should not be a problem: the latency of the extra flags-recombining op should be hidden by the rest of the code. But if you write software that does nothing but hammer the CPU with cmpxchg, things look different. There is no other code to hide the latency of the extra micro-op (whereas Nehalem would never stall here, since nothing ever reads back the partial register; the recombine, and hence the stall, is only triggered by a read, and writes are 'free').
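To illustrate that read-versus-write point with a generic example (nothing to do with their benchmark; just GCC inline asm showing the principle): add writes CF, inc then overwrites every flag except CF, and adc reads CF, so the flag state has to be combined from two producers. That combine is where older cores would stall, and where Sandy Bridge inserts its merge micro-op instead.

#include <stdint.h>

// Generic partial-flags illustration, not the blog's code:
// ADD sets CF, INC then writes all flags *except* CF, and ADC reads CF,
// so the CPU must merge flag state from two different instructions.
static inline uint64_t partial_flags_demo(uint64_t a, uint64_t b)
{
    uint64_t sum = a;
    asm("add %[b], %[sum]\n\t"  // full flags write, sets CF
        "inc %[b]\n\t"          // partial flags write, CF untouched
        "adc %[b], %[sum]"      // reads CF, forcing the flags merge
        : [sum] "+r"(sum), [b] "+r"(b)
        :
        : "cc");
    return sum;
}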

Anyway, just a theory. Without an actual assembly dump, I cannot see whether they use cmpxchg8b or not. If they use regular cmpxchg, I am not quite sure what is going on. Sure, cmpxchg needs to do a conditional move where xadd is unconditional, so it will always have more latency. But since Nehalem's CAS and XADD variations perform reasonably close to each other (both remain quite flat up to 4 threads, with CAS degrading slightly above that, while Sandy Bridge keeps XADD almost flat from 3 threads up, but its CAS climbs fast), it is strange that Sandy Bridge's CAS runs away that far.
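For reference, this is roughly what the XADD flavour of such a counter loop looks like at the source level (my own sketch using GCC's __sync builtins, not their actual code): the increment always succeeds in a single lock xadd, so there is no retry path at all.

#include <stdint.h>

static volatile uint64_t counter = 0;

// Sketch of an XADD-based counter loop, assuming GCC __sync builtins.
// __sync_fetch_and_add compiles to a single "lock xadd"; it cannot fail,
// so unlike a CAS loop there is no branch back to retry the operation.
static void* run_xadd_sketch(void*)
{
    uint64_t old;
    do {
        old = __sync_fetch_and_add(&counter, 1);  // lock xadd
    } while (old < 500000000ULL);  // arbitrary fixed number of increments
    return 0;
}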

In short: it's not something I'd lose sleep over. It wouldn't be the first time that a new CPU is slower at certain operations than its predecessors. And as said above, I doubt it would be much of a problem in practical situations. If you're constantly hammering on atomics, your code needs to be rewritten anyway, because that's never going to be fast; at that point a single thread would be quicker. Having 8 cores waiting on the same data is expensive, and lock itself is expensive. The quickest way to lock data is not to lock it.
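To make that last point concrete (my own sketch, nothing from the blog): have each thread count privately and touch the shared total once at the end, and the lock prefix disappears from the hot loop entirely.

#include <stdint.h>
#include <pthread.h>

static uint64_t total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

// Each thread counts in a private local; the shared total is touched
// only once per thread, so the hot loop has no locked instruction and
// no cache-line ping-pong between cores.
static void* run_private(void* arg)
{
    uint64_t iterations = *(uint64_t*)arg;
    uint64_t local = 0;
    for (uint64_t i = 0; i < iterations; ++i)
        ++local;  // plain increment, no bus traffic

    pthread_mutex_lock(&total_lock);  // one synchronization per thread
    total += local;
    pthread_mutex_unlock(&total_lock);
    return 0;
}

(A good optimizer will of course collapse that inner loop; in a real benchmark there would be actual work in there.)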
Posted on 2011-09-16 15:57:47 by Scali
Hi Scali, thanks for the extensive reply - as always, much appreciated. I'll add a comment to the blog linking to this discussion.

The author posted a comment containing what looks like the output of objdump (the formatting was all over the place, so I've done what I can to make it more readable):
http://mechanical-sympathy.blogspot.com/2011/09/adventures-with-atomiclong.html#c8126119963606304366

atomic_cas:     file format elf64-x86-64
...
0000000000400c70 <_Z7run_casPv>:
  400c70:  48 8b 15 89 07 20 00     mov    0x200789(%rip), %rdx    # 601400
  400c77:  48 8d 4a 01              lea    0x1(%rdx), %rcx
  400c7b:  48 89 d0                 mov    %rdx, %rax
  400c7e:  f0 48 0f b1 0d 79 07     lock cmpxchg %rcx, 0x200779(%rip)    # 601400
  400c85:  20 00
  400c87:  75 e7                    jne    400c70 <_Z7run_casPv>
  400c89:  48 81 fa ff 64 cd 1d     cmp    $0x1dcd64ff, %rdx
  400c90:  76 de                    jbe    400c70 <_Z7run_casPv>
  400c92:  f3 c3                    repz retq
  400c94:  66 66 66 2e 0f 1f 84     data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400c9b:  00 00 00 00 00
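For anyone who doesn't read AT&T syntax, the loop above is equivalent to roughly the following (my reconstruction, with my own variable names; I'm guessing he used GCC's __sync builtins rather than hand-written assembly):

#include <stdint.h>

static volatile uint64_t counter = 0;  // the 8-byte value at 0x601400

// Reconstruction of _Z7run_casPv, i.e. run_cas(void*), from the dump:
// load the counter, attempt a lock cmpxchg of old+1, and retry on CAS
// failure (jne) or while the old value is <= 0x1dcd64ff (jbe), i.e.
// until roughly 500 million increments have been performed in total.
static void* run_cas(void*)
{
    uint64_t old;
    do {
        old = counter;  // mov 0x200789(%rip), %rdx
        // __sync_val_compare_and_swap emits the lock cmpxchg; it returns
        // the prior value, so the CAS succeeded iff that equals 'old'.
    } while (__sync_val_compare_and_swap(&counter, old, old + 1) != old
             || old <= 0x1dcd64ffULL);
    return 0;
}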


The performance of the CAS variant on Nehalem is _so_ good that I do wonder whether he deployed the right binary (you'll note his test code needs to be recompiled for each of the run_xadd, run_add and run_cas variants)!
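One way to rule that out would be to select the variant at runtime from a single binary, along these lines (a hypothetical tweak; the signatures are guessed from the mangled symbol _Z7run_casPv, i.e. void* run_cas(void*)):

#include <string.h>

// Hypothetical harness change, not the author's code: link all three
// variants into one binary and pick the thread function by name, so a
// stale build can't be benchmarked by mistake.
void* run_cas(void*);
void* run_xadd(void*);
void* run_add(void*);

typedef void* (*run_fn)(void*);

static run_fn pick_variant(const char* name)
{
    if (strcmp(name, "cas") == 0)  return run_cas;
    if (strcmp(name, "xadd") == 0) return run_xadd;
    return run_add;                // default to the plain add variant
}

Then pick_variant(argv[1]) is what gets handed to pthread_create.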

Cheers

Michael
Posted on 2011-09-16 16:23:15 by michaelg