Part 1: Prefixes
(analyzing one reverse string proc)
I hope this section will be practical, so allow me to start with a
simple and obvious statement.
Assembly language is a mnemonic representation of machine language, therefore
an asm programmer MUST understand machine language. It is NOT OPTIONAL.
To increase his power he MAY use macros and HLL statements, but he MUST truly
understand exactly what those statements will produce at the machine level.
If he is not able to yet, he needs to study it; he is a beginner as an asm programmer.
If he thinks it's not important, he'll never become one.
That is the major difference between asm programmers and other kinds of programmers;
otherwise we are the same. We think out the algorithm in machine language; they think it out
in an HLL and then let the compiler translate it to machine language.
Each group has its own advantages and disadvantages, and I'm not going to discuss them here,
'cause this is an asm msgboard and the choice for participants is obvious.
I spent almost a month formulating the words above, 'cause it was something that needed to be
said here. There was a lot of evidence that some of us are not only unaware of machine basics
but also not on the way to studying them.
I'll discuss the practical side of 32-bit Intel asm programming by analyzing some procs and progs
written in Win32 asm, focusing on the common weak spots of procs and demos being released nowadays.
Let's start then.
------------------------------------------------
For beginners:
What is a prefix?
Prefixes are one or more bytes that precede an instruction and modify the operation of the instruction.
These prefixed opcodes cause penalties or pairing restrictions: lock, segment override,
address size, second opcode map (0F), and operand size. In particular, 16-bit instructions executing
in 32-bit mode require an operand-size prefix.
Let's illustrate it with the code of Mirno's reverse string func:
OpCodes   Instructions
668B07    mov ax, WORD PTR [edi]
668B11    mov dx, WORD PTR [ecx]
66C1C808  ror ax, 08h
66C1CA08  ror dx, 08h
668901    mov WORD PTR [ecx], ax
668917    mov WORD PTR [edi], dx
Note:
1. The first byte of every opcode is 66.
2. All the instructions use 16-bit registers.
As you probably guessed, 66 is the operand-size prefix.
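To make the size cost concrete, here is a small C sketch (C rather than asm only so it can be compiled and checked anywhere). It walks the six instructions from the listing above as raw bytes and counts how many start with 66h. The helper name `count_opsize_prefixes` and the hand-counted length table are made up for this illustration, and the operands in the comments are decoded from the ModRM bytes of the opcodes.

```c
#include <stddef.h>

/* The six prefixed instructions from the listing, as raw bytes. */
const unsigned char mirno_block[] = {
    0x66, 0x8B, 0x07,        /* mov ax, WORD PTR [edi] */
    0x66, 0x8B, 0x11,        /* mov dx, WORD PTR [ecx] */
    0x66, 0xC1, 0xC8, 0x08,  /* ror ax, 08h */
    0x66, 0xC1, 0xCA, 0x08,  /* ror dx, 08h */
    0x66, 0x89, 0x01,        /* mov WORD PTR [ecx], ax */
    0x66, 0x89, 0x17         /* mov WORD PTR [edi], dx */
};

/* Length of each instruction in bytes, prefix included (counted by hand). */
const size_t mirno_lengths[] = {3, 3, 4, 4, 3, 3};

/* Hypothetical helper: counts the instructions whose first byte is 66h. */
size_t count_opsize_prefixes(const unsigned char *code,
                             const size_t *lengths, size_t n) {
    size_t offset = 0, count = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[offset] == 0x66) {
            count++;
        }
        offset += lengths[i];   /* step to the start of the next instruction */
    }
    return count;
}
```

Calling `count_opsize_prefixes(mirno_block, mirno_lengths, 6)` gives 6: of the 20 bytes in these six instructions, 6 are prefixes.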
------------------------------------------------------------------------------------------
It makes each instruction which uses a 16-bit reg 1 byte longer.
That's about size, so what about speed?
To understand it better let's put the whole block into VTune (the best profiler available):
OpCodes   Instructions                   Clocks  Penalties and Warnings         Pairing Issues
83EF02         sub edi, 02h              1
668B07    Size mov ax, WORD PTR [edi]    3       Exp_AGI_U_Pen:1, Prefix_Pen:1  Exp_AGI, Not_in_Fetch, Prefix
668B11    Size mov dx, WORD PTR [ecx]    4       Prefix_Pen:3                   Not_in_Fetch, Prefix
66C1C808  Size ror ax, 08h               3       Prefix_Pen:2                   NP_Inst, Not_in_Fetch, Prefix
66C1CA08  Size ror dx, 08h               4       Prefix_Pen:3                   NP_Inst, Prev_PV/NP
668901    Size mov WORD PTR [ecx], ax    4       Prefix_Pen:3                   Prev_PV/NP
668917    Size mov WORD PTR [edi], dx    4       Prefix_Pen:3                   Not_in_Fetch, Prefix
83C102         add ecx, 02h              1 - 1
3BCF           cmp ecx, edi              1
Totals: 9 instructions, 22.22% pairing, 24 total cycles
28 bytes.
24 clocks!
Let's try to reorganize the code to do the same task, but without 16-bit registers.
Here is one possible way:
Address  Label   OpCodes  Instructions                  Clocks
1:156    L_156:  8B11     mov edx, DWORD PTR [ecx]      1
1:158            8B47FE   mov eax, DWORD PTR [edi-2]    1 - 1
1:15b            8821     mov BYTE PTR [ecx], ah        1
1:15d            8877FE   mov BYTE PTR [edi-2], dh      1 - 1
1:160            884101   mov BYTE PTR [ecx+1], al      1
1:163            8857FF   mov BYTE PTR [edi-1], dl      1 - 1
1:166 83C
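For anyone who wants to check that the byte-register version moves the same data as the 16-bit one, here is a C model of one step of each. It is only a sketch: the function names are made up, the operands are the ones decoded from the opcode bytes in the two listings, `lo` stands for the address in ecx and `hi` for the word at the other end of the string, and the loop control around the step is left out because it is not fully shown above.

```c
#include <stdint.h>

/* Rotate a 16-bit value right by 8, as "ror ax, 08h" does. */
uint16_t ror16_by8(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

/* One step of the 16-bit version. */
void swap_step_16(unsigned char *lo, unsigned char *hi) {
    uint16_t ax = (uint16_t)(hi[0] | (hi[1] << 8));  /* mov ax, WORD PTR [edi] */
    uint16_t dx = (uint16_t)(lo[0] | (lo[1] << 8));  /* mov dx, WORD PTR [ecx] */
    ax = ror16_by8(ax);                              /* ror ax, 08h */
    dx = ror16_by8(dx);                              /* ror dx, 08h */
    lo[0] = (unsigned char)(ax & 0xFF);              /* mov WORD PTR [ecx], ax */
    lo[1] = (unsigned char)(ax >> 8);
    hi[0] = (unsigned char)(dx & 0xFF);              /* mov WORD PTR [edi], dx */
    hi[1] = (unsigned char)(dx >> 8);
}

/* One step of the byte-register version: no 16-bit registers, no rotates. */
void swap_step_8(unsigned char *lo, unsigned char *hi) {
    unsigned char dl = lo[0], dh = lo[1];  /* low two bytes of mov edx, [ecx]   */
    unsigned char al = hi[0], ah = hi[1];  /* low two bytes of mov eax, [edi-2] */
    lo[0] = ah;                            /* mov BYTE PTR [ecx], ah   */
    hi[0] = dh;                            /* mov BYTE PTR [edi-2], dh */
    lo[1] = al;                            /* mov BYTE PTR [ecx+1], al */
    hi[1] = dl;                            /* mov BYTE PTR [edi-1], dl */
}
```

On the buffer "abcdefgh" with `lo` at the first byte and `hi` at the seventh, both functions leave "hgcdefba". The byte swap that the two rotates did is now done for free by choosing which byte register goes to which address.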
Alex,
Compliments, this is good stuff. This type of analysis is what makes
assembler go faster and I guess that's why most people are here,
performance and size.
I agree with the view of not using 16 bit registers but I know that
there are times when they are needed. The general view is that you
should ALWAYS use a 32 bit register when you can (counters etc.),
as they are faster. The later series of Intel processors are native
32 bit devices and while they will work with 8 and 16 bit registers,
they pay a huge speed penalty doing so.
This alone is a good enough reason not to just port 16 bit code
but to rewrite it so that it takes advantage of the additional
performance.
Regards,
hutch@pbq.com.au
This section is a keeper. It's proving itself already :)
Steve,
8 bit registers are OK.
They don't need prefixes in either 16 bit or 32 bit mode.
And they don't need the one extra clock to decode which could make
them NP.
So the problem is only 16 bit regs in 32 bit mode.
The Svin.
xchg al,ah
xchg dl,dh
ror eax, 16
ror edx, 16
xchg al,ah
xchg dl,dh
Too bad that these instructions don't pair :( Is there a faster way to do a 32 bit version of Mirno's routine above? I don't have VTune, is it really worth the price?
bitRAKE
This message was edited by bitRAKE, on 3/26/2001 5:37:55 PM
What's the timing of BSWAP?
It was introduced in the 486.
I wrongly assumed it was bad. Michael Abrash's book as well as other sources on the web say it takes 1 cycle on the P/PII, but 4 on the AMD K6 (wonder about the Athlon?) It'd be an excellent choice here!
bitRAKE
*Athlon has a latency of 1
0x0F 0xC8+reg
I think
bswap esp
would be a very rare instruction :)
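The xchg/ror sequence posted earlier and BSWAP both reverse the four bytes of a 32-bit register. Here is a C model of the two to check that they agree. It is a sketch with made-up function names, it models one register only (the posted sequence interleaves eax and edx), and it says nothing about timing.

```c
#include <stdint.h>

/* Models "xchg al, ah": swap the two low bytes, keep the upper 16 bits. */
uint32_t xchg_low_bytes(uint32_t r) {
    return (r & 0xFFFF0000u) | ((r & 0x000000FFu) << 8) | ((r & 0x0000FF00u) >> 8);
}

/* Models "ror eax, 16": swap the upper and lower 16-bit halves. */
uint32_t ror32_by16(uint32_t r) {
    return (r >> 16) | (r << 16);
}

/* xchg al,ah / ror eax,16 / xchg al,ah */
uint32_t reverse_xchg_ror(uint32_t r) {
    r = xchg_low_bytes(r);
    r = ror32_by16(r);
    r = xchg_low_bytes(r);
    return r;
}

/* Models "bswap eax": reverse the byte order in one step. */
uint32_t bswap_model(uint32_t r) {
    return (r >> 24) | ((r >> 8) & 0x0000FF00u) |
           ((r << 8) & 0x00FF0000u) | (r << 24);
}
```

For 0x11223344 both `reverse_xchg_ror` and `bswap_model` return 0x44332211, so one BSWAP can stand in for the three-instruction sequence on a 486 or later.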
This message was edited by bitRAKE, on 3/26/2001 9:02:38 PM
To bitRAKE:
Forget about xchg.
The Svin.