Hi all,

According to the AMD optimization manual, it is advised to add one or two 66h prefixes to nop (90h) to create fast 2 or 3 byte nops. But the Intel manuals specify something else. So I was wondering if the AMD approach is a good compromise.

Intel says that 90h is handled as a true nop, without dependencies on eax (even though it's really xchg eax, eax). If this is still true when adding 66h prefixes then I think these are optimal on Intel processors as well.
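For reference, here is a minimal Python sketch of the byte sequences in question. It only builds padding out of the 1-, 2- and 3-byte forms (90h, 66h 90h, 66h 66h 90h). The helper name nop_padding and the "longest form first" splitting are assumptions made for illustration, not something taken from either manual.

```python
# Hypothetical helper: build n bytes of padding from the prefixed NOP forms
# discussed above. The greedy "longest form first" split is an assumption.
NOP_FORMS = {
    1: bytes([0x90]),              # nop
    2: bytes([0x66, 0x90]),        # 66h prefix + nop
    3: bytes([0x66, 0x66, 0x90]),  # two 66h prefixes + nop
}

def nop_padding(n):
    """Return exactly n bytes of padding made of 1-, 2- and 3-byte NOPs."""
    if n < 0:
        raise ValueError("padding length must be non-negative")
    out = bytearray()
    while n > 0:
        size = min(n, 3)       # use the longest available form
        out += NOP_FORMS[size]
        n -= size
    return bytes(out)

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, " ".join("%02X" % b for b in nop_padding(n)))
```

Whether the prefixed forms are actually faster than plain 90h runs on a given processor still has to be measured.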

Any insights?

Posted on 2006-11-21 07:52:54 by C0D1F1ED
I can't see how the prefixes would hurt any cpu, really. You could always benchmark :) It might not be faster than a 100% nop slide on intel, but if AMD says it's better, well, better AMD performance won't hurt you.
Posted on 2006-11-21 16:43:16 by ector

I can't see how the prefixes would hurt any cpu, really. You could always benchmark :) It might not be faster than a 100% nop slide on intel, but if AMD says it's better, well, better AMD performance won't hurt you.

If it only gives ho-humm better performance, but introduces dependencies on intel... well, you do the math :)
Posted on 2006-11-21 17:38:17 by f0dder
Could anyone with an Intel processor do a quick test of whether these nops with prefixes are slower than just nops (same number of bytes)? Thanks a bunch.
Posted on 2006-11-22 03:43:29 by C0D1F1ED
I prefixed the NOP with two RETs and the RETs were grouped with the NOP in Delphi and literally eliminated from the code. I then prefixed the NOP instruction with the same two RETs in MASM and one of the RETs was eliminated but one was compiled.

The below code took 1 clock cycle to execute:

  DB      66h
  DB      66h
  DB      90h

obviously because two RETs were eliminated from the code. The below code took 4 clock cycles to execute while Delphi still eliminated the two RETs:

  OR      EAX , EAX
  DB      66h
  DB      66h
  DB      90h

Removing two RETs yielded 2 clock cycles less, meaning that this code executed in 2 clock cycles:

  OR      EAX , EAX
  DB      90h

The below code took 2 clock cycles to execute:

  DB      66h
  DB      66h
  DB      90h
  OR      EAX , EAX

The code is aligned on an even address. I have an Intel PIII with a 512 MB RAM.
Posted on 2006-11-22 10:10:42 by XCHG
Thanks XCHG!

I'm a bit unsure what to conclude though. Am I right that a NOP with 66h prefixes executes no slower than just the same number of NOPs?
Posted on 2006-11-29 01:14:50 by C0D1F1ED
66 = RET?
what about C3?
Posted on 2006-11-29 12:27:30 by vid
NOP can be created in two ways; 6690h and 90h. RET has the opcode 66 so if you put this in your code:

DB    66h
DB    90h

Your assembler might just convert it to a simple 90h, because it is a NOP either way, whether it is 6690h or a simple 90h. Clearly, if you want your code to be aligned on an even address, you are better off using the 2-byte long opcode of NOP. If you for example write this in your code:

DW    9066h

You will get a NOP just as you will get it when you write this:

DB    90h

Then the great thing about RET (66h) followed by a NOP (90h) is that they can group into one 2-byte long instruction equivalent of NOP.
Posted on 2006-11-29 14:35:34 by XCHG
hang on, 0x66 / 66h is a prefix isn't it, not a ret, ret is 0xC3 / 0C3h

0x66 / 66h = operand size prefix..

even checked with hiew...

00001000: 6690                        nop
00001002: 90                          nop
00001003: 6690                        nop
00001005: 90                          nop
00001006: 6690                        nop
00001008: 66                          ???
00001009: 6690                        nop
0000100B: CF                          iretd
0000100C: 66CF                        iret
0000100E: C3                          retn
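
The same grouping can be sketched in a few lines of Python. This is a toy splitter, not a real disassembler: it only knows the three opcodes that appear in the listing above, and the name split_instructions is made up for this example. Unlike the hiew output, it attaches every leading 66h to the opcode that follows, on the assumption that the processor decodes 66h 66h 90h as one instruction (the 3-byte nop form from the AMD manual mentioned at the top of the thread).

```python
OPERAND_SIZE_PREFIX = 0x66

# Only the opcodes seen in the listing above; everything else is rejected.
MNEMONICS = {0x90: "nop", 0xC3: "ret", 0xCF: "iret"}

def split_instructions(code):
    """Split raw bytes into (offset, hex bytes, mnemonic) tuples.

    Any run of 66h bytes is treated as prefixes of the next opcode.
    """
    result = []
    i = 0
    while i < len(code):
        start = i
        # Consume leading operand-size prefixes.
        while i < len(code) and code[i] == OPERAND_SIZE_PREFIX:
            i += 1
        if i == len(code):
            raise ValueError("prefix without an opcode at offset %d" % start)
        opcode = code[i]
        if opcode not in MNEMONICS:
            raise ValueError("unknown opcode %02Xh at offset %d" % (opcode, i))
        i += 1
        result.append((start, code[start:i].hex(), MNEMONICS[opcode]))
    return result

if __name__ == "__main__":
    for offset, raw, name in split_instructions(bytes.fromhex("669090666690c3")):
        print("%08X: %-8s %s" % (offset, raw, name))
```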
Posted on 2006-11-29 16:13:46 by evlncrn8

The things I said were just based on Delphi's debug window.
Posted on 2006-11-29 22:35:50 by XCHG
well then it seems delphi's debugger is a bit crap :), 66 prefix is a prefix, not an opcode per se and it's definitely not a ret... seems either you're not reading the window right or delphi's little debugger has a few bugs

the whole thing is similar to other opcodes, like rep nop for spinlocks and so on, sure they might be faster in some cases / processors or slower in others, really comes down to what you wish to use them for and ideally which processor you're using etc...
Posted on 2006-11-30 13:48:02 by evlncrn8
that doesn't mean 66h = ret  :P

that means there is prefix 66h before RET, which is considered invalid by disassembler.
Posted on 2006-11-30 15:15:41 by vid
66h has always been a prefix, just like 67h, 0F0h, 0F2h, 0F3h, 2Eh, 36h, 3Eh, 26h, 64h, 65h.  ;)
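
As a quick lookup table, here is a small Python sketch of that list with the usual name of each prefix. The names LEGACY_PREFIXES and describe are invented for this example.

```python
# The prefix bytes listed above, with their conventional meanings.
LEGACY_PREFIXES = {
    0x66: "operand-size override",
    0x67: "address-size override",
    0xF0: "LOCK",
    0xF2: "REPNE/REPNZ",
    0xF3: "REP/REPE/REPZ",
    0x2E: "CS segment override",
    0x36: "SS segment override",
    0x3E: "DS segment override",
    0x26: "ES segment override",
    0x64: "FS segment override",
    0x65: "GS segment override",
}

def describe(byte):
    """Return the prefix name for a byte, or None if it is not a prefix."""
    return LEGACY_PREFIXES.get(byte)

if __name__ == "__main__":
    for value in (0x66, 0x90, 0xC3):
        print("%02Xh: %s" % (value, describe(value) or "not a prefix"))
```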
Posted on 2006-12-01 23:45:13 by roticv