We are about to finish with prefixes.
The last two classes of prefixes are the
"segment override" and "LOCK" prefixes.
The term "segment override" might be somewhat confusing to
beginners, but it fits perfectly with the "default" approach of the Intel
opcode system.
Time to open our testopcode app in OllyDbg.
Type MNEMONIC:
mov eax,[ebx]
You can see:
8B03 MOV EAX,DWORD PTR DS:[EBX]
Type next MNEMONIC:
mov eax,GS:[ebx]
Now you can see:
8B03 MOV EAX,DWORD PTR DS:[EBX]
65:8B03 MOV EAX,DWORD PTR GS:[EBX]
65 is the segment override prefix that changes the DEFAULT segment for
data from DS to GS, but only for this one instruction.
To access any data, the processor needs to know both the segment and
the offset of that data.
But to specify the segment directly we would need an additional
field in every opcode, and that would make opcodes much larger than
they are in the present opcode system.
The most common operations are operations on data in memory.
So instead of naming the data segment in every opcode,
a different approach was taken:
operations were divided into groups,
and for each group a DEFAULT segment register was chosen for the data
those operations use:
CS: for the EIP pointer (code)
ES: for the destination operand of chain (string) operations (movs, cmps,
stos etc.), and DS: for their source
SS: for stack operations (push, pop etc.) and for memory operands
addressed through EBP or ESP
DS: for the rest of data operations (data handling other than in chain ops)
The logic the processor uses to decide which segment to use for a particular
operation is simple and direct:
if there is a "segment override" prefix - use the segment specified by the
prefix
else - use the segment that is DEFAULT for that kind of operation.
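
The if/else rule above can be sketched as a small Python model. This is only
an illustration, not how a real decoder is written: the function name
effective_segment and the "kind" labels are made up for this example, and only
segment override prefixes are looked at (66, 67, F0, F2, F3 are ignored). The
prefix byte values themselves are the standard IA-32 ones.

```python
# Segment override prefix bytes (standard IA-32 encodings).
SEGMENT_PREFIXES = {0x26: "ES", 0x2E: "CS", 0x36: "SS",
                    0x3E: "DS", 0x64: "FS", 0x65: "GS"}

# Default segment per kind of memory access (the labels are hypothetical).
DEFAULT_SEGMENT = {"data": "DS", "stack": "SS", "string_source": "DS"}


def effective_segment(code, kind="data"):
    """Return the segment used for the memory operand of one instruction.

    `code` holds the raw instruction bytes. Only leading segment override
    prefixes are examined; the rest of the instruction is not decoded.
    """
    segment = DEFAULT_SEGMENT[kind]          # else-branch: the DEFAULT
    for byte in code:
        if byte not in SEGMENT_PREFIXES:
            break                            # first non-prefix byte: stop
        segment = SEGMENT_PREFIXES[byte]     # if-branch: prefix wins
    return segment


print(effective_segment(bytes.fromhex("8B03")))    # mov eax,[ebx]
print(effective_segment(bytes.fromhex("658B03")))  # mov eax,gs:[ebx]
```

The same lookup table doubles as a quick reference for the prefix values.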

TYPE OPCODES:
AC
3E AC
As you can see in the debugger, both AC and 3E AC show the same mnemonic:

AC LODS BYTE PTR DS:[ESI]
3E:AC LODS BYTE PTR DS:[ESI]

3E is the prefix that explicitly specifies segment DS,
but in this case the processor would use DS even if it were not
specified, because DS is the DEFAULT segment for the LODS operand address.

For you as a programmer this means that each time you use a segment that is
not the default for the operation, it costs you 1 byte per instruction
and 1 additional clock for the instruction to execute (actually to "decode",
not to "execute").
The same goes for all other prefixes that change some DEFAULT of an operation:
66 - default operand size (using 16 bit registers or 16 bit memory
operands in Win32 apps costs you an additional byte and an additional clock
for each opcode that uses them)
67 - default address mode (when you use 16 bit addressing)

Writing Win32Asm user mode apps you have neither much chance nor much need
to work with segment registers; nevertheless, low level programming is about
turning "black boxes" into lit ones.
Let's have a little info about the specifics of segment registers in Win32asm
user mode apps.
CS:
CS is the same for all user mode apps.
It is 1Bh in NT family OSs and 227h in 9x.
If any of you remember the discussion about a macro for an absolute address
jump, I can now give the answer: a very simple absolute jump format,
but it is different for NT and 9x.
OK, let's have a look at the long direct jump opcode.
EA - byte "code" telling the processor that this is a long direct jump opcode.
When the processor encounters that "code" it assumes that what follows is a
48 bit address: the low 32 bits are the offset and the high 16 bits are the
segment selector.
Since we know that for NT (2000, XP) the segment selector for code is 1Bh,
to make an absolute direct jump to address 12345678h we could write:
db 0EAh ;long jump
dd 12345678h ;offset where to jump
dw 1Bh ;segment selector for code in NT
For 9x the jump looks the same except for the code selector word:
db 0EAh ;long jump
dd 12345678h ;offset where to jump
dw 227h ;segment selector for code in 9x

So it would be fine if you are going to use the app only in NT or only in 9x
(actually there are lots of such apps - for example by M.Russinovich -
that are meant to be used in only one Windows OS family).
We can then write a macro like this:
;NT=1
absjmp macro addr
ifdef NT
db 0EAh
dd addr
dw 01Bh
else
db 0EAh
dd addr
dw 227h
endif
endm
use:
absjmp 401000h

Then comment\uncomment the line NT=1 depending on which system we are
going to use this long absolute jump on.
Maybe bitRake can figure out some better way to handle the conditions.

Try to write a raw opcode to jump to some "nop" in your debugger.
For example, you want to jump to the line
00401013 |. 90 NOP
in your debugger.
Type the opcode (ctrl-e):
EA 13 10 40 00 and the remaining two bytes will be 1B 00 on NT or 27 02 on 9x
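
To double check the byte order before typing it into the debugger, the
encoding can be reproduced with Python's struct module. A minimal sketch:
far_jump is a hypothetical helper name, and the selector values are simply the
ones quoted in this post.

```python
import struct


def far_jump(offset, selector):
    """Encode a long direct jump: opcode EA, 32 bit offset, 16 bit selector."""
    # "<" = little-endian without padding, B = 1 byte, I = 4 bytes, H = 2 bytes
    return struct.pack("<BIH", 0xEA, offset, selector)


code = far_jump(0x00401013, 0x1B)          # 1Bh: the NT selector quoted above
print(" ".join(f"{b:02X}" for b in code))  # bytes in the order you type them
```

Both the offset and the selector are stored low byte first, which is why the
selector 001Bh ends up in memory as 1B 00.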
to be continued...
Posted on 2002-12-23 16:21:46 by The Svin
An important thing to remember is that the DS, SS and CS segments are alias
segments in a Win32 user mode app. That means that you can address the same
data using any of those segment registers.
If you tried the "String to Dwords" util you know that one way to
place a constant string where you need it is to use stack operations: the
stack pointer is pointed directly at the .data section and push operations
place the constant string there.
In the same way you can use the .code section for data,
in such code as
.code
msg db 'Some text',0
start:
invoke MessageBox,0,offset msg,....

or take such code, where all the mov lines do the same thing:
.data
somedw dd 12343
......
.code
....
mov eax,somedw
mov eax,dword ptr offset somedw
mov eax,dword ptr DS:offset somedw
mov eax,dword ptr CS:offset somedw
mov eax,dword ptr SS:offset somedw
.......

You must remember that data in some section may be protected
(this is done at the page protection level; to change access rights
use the E, R, W keys with the /section: option while linking).
Nevertheless, if you can address data as ptr somedata
and succeed, you can also address it specifying any of DS, CS, SS
as the selector.
We will discuss it in detail when we come to the system aspects of programming.

Now what about the rest of the registers?
The first one to come to our attention is the ES register,
which is used in chain operations; they still matter when you need compact code.
In a DOS program you would need to make sure that ES=DS if you work in the same
segment for both destination and source data.
In Win32 user mode programming you don't need to, but that doesn't mean
that ES isn't used with chain operations anymore.
Before we check it out let's make some simple notes about chain operations.
ES is used in those chain operations that have two memory operands: source
and destination,
for example
movsd = move dword ptr DS:ESI to ES:EDI
As we can see there are two memory operands, so if we use a "segment override"
prefix, which segment selector is changed? The selector for the source or for
the destination?
Type OPCODE
A5 65 A5
A5 MOVS DWORD PTR ES:[EDI],DWORD PTR DS:[ESI]
65:A5 MOVS DWORD PTR ES:[EDI],DWORD PTR GS:[ESI]

So as you can see the "segment override" prefix affects the source.
Changing the destination segment register is not allowed.
If you want to change the selector for the destination - change the value
of ES, not the register used as selector.
The same goes for all "two memory operands" chain instructions.
BTW if you have started to use or look at opcodes you have surely noticed
that all chain operations are short: 1 byte if you don't use words (two bytes
with words because of the 66 prefix). Use them when speed doesn't matter - it
leads to very compact code.

As was said before, ES is still used, though you don't need to
initialize it with the DS value yourself - the system does it for you when it
loads your app.
Nevertheless you can easily spoil it, and the bad result then comes immediately.
Let's now check how the selector value affects a chain operation.
We can do it without writing an app - we are low level coders, are we not ;)
Remember how we use local vars?
We sub esp by their size. They say: "The stack grows down" - that
means that we can use stack address space below the current esp
without risk of spoiling some data (return address, arguments etc.)
For example if you see in OllyDbg the top of the stack as

0012FFC4 77F1B9EA RETURN to KERNEL32.77F1B9EA
0012FFC8 0012E2D4
0012FFCC 77F92CD4 ntdll.77F92CD4
0012FFD0 7FFDF000
then you can use addresses from 0012FFC0 and lower for your local data
(remember that the size of memory given to the stack (physical memory mapped to
stack space) is 1 MB, you can change it using the stack key while linking,
and don't ever use the first KB - it's for error catching).
If you feel uneasy about it yet - just type and execute some push\pop
operations, looking at the top of the stack to get a feeling for how and
where it grows.
The stack window is in the bottom right corner of OllyDbg.

So let's use some of the space for our local data to check how movs works
with different selectors.
What are we going to do?
1. Fill stack space with a short string
2. Copy the string to a different nearby stack region step by step (one byte
each step), specifying a different segment register in each step
- to see that the selectors in DS, SS, CS point to alias segments
- to check if ES still matters by spoiling it in the last step
Type MNEMONIC
lea edi,[esp-4]
This points edi to space for local data so we won't spoil anything
that was placed on the stack such as return address, args etc.
We are going to use addresses that are <=esp-4
Type next MNEMONIC
std
We set the direction flag to 1 so that all chain operations move toward lower
addresses. (When you change DF to 1 in your window callback procedure -
remember to set it back to 0 before your proc returns - your proc returns to
system code and that code assumes that DF=0)
Type next MNEMONICs
mov al,5
stosb
dec al
Type OPCODE
75 FB
This code is the analog of what in asm source would look like
lea edi,[esp-4]
std
mov al,5
@@:
stosb
dec al
jne @B
As you can see there is no problem typing code right in the debugger
to check some code, try ideas and find answers to simple questions
(for example how some opcode works).
The only problem for a beginner might be the short jump backward (75 FB in our
case).
Format of the opcode (short relative conditional jump) in binary:
0111tttn:imm8
The first 4 bits, 0111, identify it as a relative short conditional jump.
In the debugger they show as 7, the first hex digit of the opcode.
The next 4 bits (the lower 4 bits of the first byte) are a bit field that
specifies the condition. It is called "tttn" and has the same format in all
instructions that check flags for a condition (all jcc, cmovcc, setcc etc.)
The second byte is a signed value that is added to EIP after the jcc
instruction is decoded, when EIP already points to the instruction following
the jcc opcode.
Let's "decode" our 75 FB opcode.
7 - sign of a short relative conditional jump
5 - tttn, 0101 in binary. 0100 is e (ZF set), 0101 is ne (ZF clear); as you
can see, changing the last bit changes the condition to NOT(condition),
that's why it is called "tttn".
FB - it's -5. -5 'cause we need to set EIP back 5 bytes:
1 byte AA STOS
2 bytes FEC8 DEC AL
2 bytes 75 FB JNZ SHORT
----------
5 bytes
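
The decoding steps above can be checked with a few lines of Python. This is a
sketch only: decode_short_jcc is a made-up helper, and the address 00401003h
used for the jump is an assumption chosen so that the loop starts at 00401000h.
The condition table is the standard tttn assignment.

```python
# Condition mnemonics indexed by the 4 bit tttn field.
TTTN = ["O", "NO", "B", "NB", "E", "NE", "BE", "NBE",
        "S", "NS", "P", "NP", "L", "NL", "LE", "NLE"]


def decode_short_jcc(code, address):
    """Decode a 2 byte short conditional jump located at `address`.

    Returns (condition, target). The displacement is a signed byte that is
    added to the address of the instruction following the jump.
    """
    opcode, imm8 = code[0], code[1]
    if opcode >> 4 != 0b0111:
        raise ValueError("not a short conditional jump")
    condition = TTTN[opcode & 0x0F]                  # lower 4 bits: tttn
    rel8 = imm8 - 0x100 if imm8 >= 0x80 else imm8    # sign-extend the byte
    target = (address + 2 + rel8) & 0xFFFFFFFF       # +2: length of the jcc
    return condition, target


condition, target = decode_short_jcc(bytes.fromhex("75FB"), 0x00401003)
print(condition, hex(target))
```

Flipping the lowest bit of the first byte (75 -> 74) gives the opposite
condition with the same target.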

If you want to know more about tttn - have a look at tttn.exe, and
not only at the app itself but mostly inside the source.
The source is written to illustrate the inner links inside the tttn fields.

OK, back to our "segment override" topic.
Switch to the data window.
Go to the esp-4 address (press ctrl-G, type "esp-4" and press Enter).
Scroll one line up so you can see lower addresses, 'cause we are going to
fill the string in reverse direction.
Execute the code step by step (use F8) and see how and where the 01 02 03 04 05
bytes are placed.
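
If you want to predict what the data window will show before stepping, the
loop can be modelled in Python. A rough sketch under stated assumptions: a
16 byte bytearray stands in for the stack, index 12 plays the role of esp-4,
and fill_down is a hypothetical name for the std / stosb / dec al / jne loop.

```python
def fill_down(memory, edi, al):
    """Model of: std / @@: stosb / dec al / jne @B. Returns the final EDI."""
    while True:
        memory[edi] = al        # stosb stores AL at [EDI] ...
        edi -= 1                # ... and, with DF=1, steps EDI downwards
        al = (al - 1) & 0xFF    # dec al
        if al == 0:             # jne falls through once ZF is set
            return edi


stack = bytearray(16)           # toy memory; index 12 stands for esp-4
final_edi = fill_down(stack, 12, 5)
print(stack[8:13].hex(), final_edi)
```

The first byte stored (05) lands at the highest address, so read in ascending
address order the string is 01 02 03 04 05, and edi ends one byte below it.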
Now we have a string in memory and are ready to check how it can be transferred
with different selectors specified by "segment override" prefixes.
First we check if DS, CS, SS are alias segment selectors.

Type MNEMONIC
lea esi,[edi+5]
This sets esi (the source register) to the place where edi was when
we started filling the string (remember, we fill and move the string in reverse
direction, so to get back to the start we add, not subtract).
edi is already at the (end of string)-1 position so we can start copying.
Opcode for movsb = A4.
A4 = COPY BYTE FROM DS:[ESI] TO ES:[EDI]
Using a "segment override" prefix before A4 we can change the selector
(segment) for the source (we cannot do it for the destination segment - it is
always the one given by the value in ES).
Prefixes to specify: CS-2E, SS-36, ES-26
First check the DEFAULT.
Type opcode:
A4
Run using F8, look at the data window to see if it has been copied all right.
Now check if CS will pass as the SOURCE segment selector.
Type opcode:
2E A4
Run by F8. Check the result in the data window.
Do the same with prefixes 36 and 26.
If you did everything correctly you can see that the first 4 bytes of our
5 byte string are successfully copied. That shows that SS, DS, CS and ES all
point to the same alias segment through different selectors.
Try any other segment override prefix - and you will get an error.
The last byte of our string we use to check if ES still matters for
32 bit code.
Let's change ES.
Type OPCODE
66 6A 00
it's push WORD 0
Type MNEMONIC
pop ES
Type the opcode (A4) for movsb or the mnemonic itself (for our exercises, the
more you type raw opcodes the better).
Now run all these 3 instructions using F8.
When it comes to movsb you cannot execute the opcode, and the status bar of
OllyDbg says that there is a problem.
As we can now see, ES is actually still used for 2 mem chain instructions,
though we don't need to set the correct value in it - the system does it for us.
Let's fix the problem that we created with ES by setting it equal to DS.
Type in place of the last movsb that failed to execute:
push DS
pop ES
Execute these two instructions.
Type A4 again.
Now everything should work OK.
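
The whole experiment can be replayed as a toy model in Python. It is only a
sketch: ToyCpu is a hypothetical class, a selector is reduced to a base
address (or None for the null selector), and the assumption that DS, CS, SS
and ES share one flat base is exactly what the exercise is meant to show.

```python
FLAT_BASE = 0  # assumption: DS, CS, SS and ES all describe one flat segment


class ToyCpu:
    """Hypothetical model: a selector is a base address, or None if null."""

    def __init__(self, memory, esi, edi):
        self.memory, self.esi, self.edi = memory, esi, edi
        self.seg = {name: FLAT_BASE for name in ("DS", "CS", "SS", "ES")}

    def movsb(self, source="DS"):
        """One movsb step with DF=1; only the source can be overridden."""
        src_base, dst_base = self.seg[source], self.seg["ES"]
        if src_base is None or dst_base is None:
            raise RuntimeError("memory access through a null selector")
        self.memory[dst_base + self.edi] = self.memory[src_base + self.esi]
        self.esi -= 1
        self.edi -= 1


memory = bytearray(16)
memory[8:13] = bytes([1, 2, 3, 4, 5])     # the string filled in earlier
cpu = ToyCpu(memory, esi=12, edi=7)
for source in ("DS", "CS", "SS", "ES"):   # A4, 2E A4, 36 A4, 26 A4
    cpu.movsb(source)

cpu.seg["ES"] = None                      # push word 0 / pop es
try:
    cpu.movsb()
    faulted = False
except RuntimeError:
    faulted = True

cpu.seg["ES"] = cpu.seg["DS"]             # push ds / pop es
cpu.movsb()                               # the fifth byte now goes through
print(faulted, memory[3:8].hex())
```

Note that the override argument never touches the destination: in the model,
as on the real processor, the write always goes through ES.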
In user mode maybe the only segment register that you would use
for something other than educational purposes is FS.
It is used for SEH. And now you remember that every time you
use FS to specify the segment for an operation on data it costs you 1
additional byte and 1 clock.
Nevertheless in system programming the use of segment registers might make
some sense, depending on the purpose of your driver.

As to a reference for the values of segment override prefixes - you know you
can always look them up in your debugger ;)
to be continued...
Posted on 2002-12-25 13:09:18 by The Svin
Next prefix: LOCK.
There was a nasty story connected with this prefix on the
Pentium and Pentium MMX :)
Known as the F00F bug: using it could freeze the processor
(for example F00FC7C8).
Those who have a good understanding of protected mode systems can read a
detailed explanation of the nature of the bug in
Dr. Dobb's Journal.
There is not much special for me to say about using the LOCK prefix. It's
explained well in the Intel reference. I will just
quote the description to finish with the "classic" prefixes.
Before that, at the end of the prefixes topic, I want to say something about the
"improper" use of some prefixes, which was discussed in the
"prefixes" part of the tuts: though in models up to and including the PIII
improper use of a prefix just leads to the processor ignoring it, Intel claims
that in new models it can have a new special meaning, and improper use could
lead to unpredictable behavior.
About the new generation of prefixes, it is worth mentioning
the 3E "hint" prefix that is used with JCC to help branch prediction.

Now Intel about LOCK:

Causes the processor's LOCK# signal to be asserted during execution of the accompanying
instruction (turns the instruction into an atomic instruction). In a multiprocessor environment,
the LOCK# signal insures that the processor has exclusive use of any shared memory while the
signal is asserted.
Note that in later IA-32 processors (including the Pentium 4, Intel Xeon, and P6 family processors),
locking may occur without the LOCK# signal being asserted. See IA-32 Architecture
Compatibility below.
The LOCK prefix can be prepended only to the following instructions and only to those forms
of the instructions where the destination operand is a memory operand: ADD, ADC, AND,
BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR,
XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source
operand is a memory operand, an undefined opcode exception (#UD) may be generated. An
undefined opcode exception will also be generated if the LOCK prefix is used with any instruction
not in the above list. The XCHG instruction always asserts the LOCK# signal regardless of
the presence or absence of the LOCK prefix.
The LOCK prefix is typically used with the BTS instruction to perform a read-modify-write
operation on a memory location in shared memory environment.
The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory
locking is observed for arbitrarily misaligned fields.
IA-32 Architecture Compatibility
Beginning with the P6 family processors, when the LOCK prefix is prefixed to an instruction
and the memory area being accessed is cached internally in the processor, the LOCK# signal is
generally not asserted. Instead, only the processor's cache is locked. Here, the processor's cache
coherency mechanism insures that the operation is carried out atomically with regards to
memory. See "Effects of a Locked Operation on Internal Processor Caches" in Chapter 7 of IA-32
Intel Architecture Software Developer's Manual, Volume 3, for more information on
locking of caches.
Posted on 2003-02-28 18:00:30 by The Svin
Svin, I do not understand why lock is needed. Could you help by explaining it to me?
Posted on 2003-03-14 08:20:38 by roticv
It is used to prevent two processors from updating the same data location at the same time.
Posted on 2003-03-14 10:15:33 by wizzra

Svin, I do not understand why lock is needed. Could you help by explaining it to me?

Do you mean in other words than Intel did? :)
It's needed only in multiprocessor systems.
And only when, during the execution of one opcode, a memory operand
is:
- read from memory
- changed by an ALU operation
- written back to memory
And only for particular instructions.
You see, there is a gap between the 1st and 3rd stages, in which another
processor can fetch the data that is being processed and will be overwritten
at the end.
LOCK raises a signal (LOCK#) which, through the bus arbiter, blocks access by
any other processor(s)
to shared memory until the result of the instruction is written back.
It is needed mostly to synchronize system work, shared resources etc.
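
The gap between the stages can be shown with a tiny deterministic simulation
in Python. It is only a model, nothing here touches the real LOCK prefix: the
function run, the schedule list and the "owner" variable standing in for the
LOCK# signal are all made up for the illustration. Two CPUs each do one
increment of the same memory cell, and the schedule says whose stage runs next
(it must contain exactly three entries per CPU).

```python
def run(schedule, locked):
    """Two CPUs each perform read, modify, write on one shared cell.

    With locked=True a CPU that has started its read keeps the bus until
    its write completes; stages of the other CPU are postponed meanwhile.
    """
    memory = 0                  # the shared memory operand
    register = [0, 0]           # each CPU's private copy of the operand
    stage = [0, 0]              # 0 = read, 1 = modify, 2 = write
    owner = None                # CPU currently holding the modelled LOCK#
    pending = list(schedule)
    while pending:
        cpu = pending.pop(0)
        if locked and owner is not None and owner != cpu:
            pending.append(cpu)             # bus is locked: retry later
            continue
        if stage[cpu] == 0:
            register[cpu] = memory          # 1st stage: read from memory
            owner = cpu
        elif stage[cpu] == 1:
            register[cpu] += 1              # 2nd stage: ALU operation
        elif stage[cpu] == 2:
            memory = register[cpu]          # 3rd stage: write back
            owner = None
        stage[cpu] += 1
    return memory


interleaved = [0, 1, 0, 1, 0, 1]
print(run(interleaved, locked=False), run(interleaved, locked=True))
```

Without the lock both CPUs read the old value and one increment is lost; with
it the second CPU cannot read until the first has written its result back.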
Do you need a more detailed explanation?
Posted on 2003-03-16 05:27:56 by The Svin
This should do I suppose. Thanks anyway.
Posted on 2003-03-16 06:56:16 by roticv