Hi all,

I was wondering what the sfence, lfence and mfence instructions are really used for. According to the documentation, they ensure that non-temporary memory operations (i.e. movntq and movntps) have finished before execution continues.

So am I correct that sfence should be used for a memory store function (e.g. filling with zeros), and mfence for a memory copying function? Do I have to add them at the top as well as the bottom of the function? I have used non-temporary memory operations for quite some time without fence instructions. Was I just lucky that this never resulted in disaster?

I also wonder whether these instructions have any extra purpose with multi-core CPUs. For example, say thread A has written data to a buffer, and it sets a bit to indicate that it has finished. Thread B spin loops waiting for the bit so it can start further processing the data once the bit is turned on. Does the CPU guarantee that all writes to the buffer have finished before the bit is turned on and thread B starts, or do I have to insert a fence instruction anywhere?

Any other memory synchronisation gotchas I should be aware of?

Thanks a lot,

c0d1f1ed
Posted on 2007-11-29 06:27:00 by C0D1F1ED
Re: "non-temporary"

You mean non-temporal. The idea being that some data may be written to memory but not needed any time soon, so it can be beneficial to avoid interacting with the normal memory cache hierarchy.

Re: mfence, lfence and sfence

The main purpose of these is synchronization of shared memory accesses within multiprocessor systems, because modern CPUs can execute instructions out of order. In shared memory situations it can become important that certain reads or writes happen in a well-defined order.

For a detailed example where these instructions would be needed, google for Peterson's Algorithm.
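
As a rough illustration (not taken from any particular reference), here is a minimal C++ sketch of Peterson's algorithm for two threads. The names wants, turn, peterson_lock, peterson_unlock, locked_increment and shared_counter are made up for this example. It assumes std::atomic with its default sequentially consistent ordering, which x86 compilers typically implement with a locked instruction or an mfence after the store. With plain variables, the load of the other thread's flag could be performed before the thread's own flag store becomes visible, and both threads could enter the critical section.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names, chosen only for this sketch.
std::atomic<bool> wants[2] = {{false}, {false}};  // wants[i]: thread i wants to enter
std::atomic<int> turn{0};                          // which thread has to wait on a tie
int shared_counter = 0;                            // data protected by the lock

void peterson_lock(int me) {
    assert(me == 0 || me == 1);
    const int other = 1 - me;
    wants[me].store(true);  // seq_cst store: must be visible before the loads below
    turn.store(other);      // give priority to the other thread
    // Spin while the other thread wants in and it is its turn.
    while (wants[other].load() && turn.load() == other) {
    }
}

void peterson_unlock(int me) {
    wants[me].store(false);
}

// Critical section: increment the shared counter as thread `me` (0 or 1).
void locked_increment(int me) {
    peterson_lock(me);
    ++shared_counter;
    peterson_unlock(me);
}
```

To try it, call locked_increment(0) in a loop on one thread and locked_increment(1) on another; the final count should equal the total number of calls.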
Posted on 2007-11-29 07:22:10 by Rockoon
Every read (lfence) or write (sfence) operation is guaranteed to be (physically) finished before the next instruction after the s/l/fence instruction is executed.

In other words:

mov eax, [address] ; load from some address
lfence
...         ; here, the load is guaranteed to have been physically performed before any further loads (in other words, the RAM/cache (in this case) is supposed to have already processed the load request)


Modern CPUs don't read/write immediately when the corresponding read/write instructions are being fetched (and sometimes they don't load/store even after they're executed). This may cause problems with some devices. For example: imagine that you want to communicate with a device. This device requires that you write "0" to address "4" and then you can write a dword to this device, mapped to address 8.

Let's analyze the following code:
mov eax, 4 ; load the pointer. not really necessary, but it's somewhat clearer this way
mov ebx, 8 ; load the second pointer. as above: not necessary

mov dword [eax], 0 ; write "0" at address '4'
mov dword [ebx], 7 ; write "7" at address '8'


In this code, the second write operation could be issued to the device BEFORE the first write operation. And that's what the fences are for:

mov eax, 4 ; load the pointer. not really necessary, but it's somewhat clearer this way
mov ebx, 8 ; load the second pointer. as above: not necessary

mov dword [eax], 0 ; write "0" at address '4'
sfence ; fence all stores
mov dword [ebx], 7 ; here, the previous store is guaranteed to have already been executed (more precisely: it's guaranteed that this write will FOLLOW the previous one)



(PS: to be precise, communication with devices is usually performed via memories with so-called 'strong ordering' and then fences are not really required ^^' But in multi-CPU systems, fences help make synchronizations easier, shorter and faster)
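
For the non-temporal case from the original question, the same store-then-fence pattern can be sketched in C++ with compiler intrinsics. This is only a sketch under assumptions: zero_fill_nt is a made-up helper, and the code assumes an x86 compiler that provides <emmintrin.h> (SSE2). On other targets it falls back to ordinary stores so that it still compiles. Intel documents non-temporal stores as weakly ordered, so the sfence at the end is what keeps them ahead of any store issued after the function returns, such as a "done" flag.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (movnti); also declares _mm_sfence
#define USE_NT_STORES 1
#else
#define USE_NT_STORES 0
#endif

// Hypothetical helper: fill `count` ints with zero, bypassing the cache where possible.
void zero_fill_nt(int* dst, std::size_t count) {
    assert(dst != nullptr || count == 0);
#if USE_NT_STORES
    for (std::size_t i = 0; i < count; ++i) {
        _mm_stream_si32(dst + i, 0);  // non-temporal store of one 32-bit value
    }
    // Make the non-temporal stores globally visible before any later store.
    _mm_sfence();
#else
    // Fallback for non-x86 targets: ordinary stores, no non-temporal hint.
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = 0;
    }
#endif
}
```

One fence at the end of the fill is enough for this pattern; a fence at the top of the function would only matter if earlier non-temporal stores had to be ordered before the ones issued here.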
Posted on 2007-11-29 07:32:04 by ti_mo_n
You mean non-temporal. The idea being that some data may be written to memory but not needed any time soon, so it can be beneficial to avoid interacting with the normal memory cache hierarchy.

Thanks for noticing the typo. I know what these instructions are supposed to do, but is it necessary to use the fence operations before and/or after them? How does not storing the data in cache create a need for a fence instruction? Other memory operations can be out-of-order as well.
The main purpose of these is synchronization of shared memory accesses within multiprocessor systems, because modern CPUs can execute instructions out of order. In shared memory situations it can become important that certain reads or writes happen in a well-defined order.

Then do Win32 synchronization methods (e.g. WaitForSingleObject) already use these? I can't imagine we should be forced to write assembly in C++ when going multi-threaded on a multi-core CPU.
For a detailed example where these instructions would be needed, google for Peterson's Algorithm.

I've used Peterson's algorithm before without using fence instructions...

Thanks for the information, but I'm still confused exactly when these instructions are needed.
Posted on 2007-11-29 13:48:28 by C0D1F1ED
Every read (lfence) or write (sfence) operation is guaranteed to be (physically) finished before the next instruction after the s/l/fence instruction is executed.

I assume it doesn't flush the entire pipeline, just doesn't start any new memory operation before the queued memory operations have finished?

I'm not working with other (synchronous) devices. I'm just worried about memory operation order within one thread when using non-temporal operations, and across threads with a multi-core CPU.

Thanks a lot!
Posted on 2007-11-29 14:01:33 by C0D1F1ED

Thanks for noticing the typo. I know what these instructions are supposed to do, but is it necessary to use the fence operations before and/or after them?


No.


How does not storing the data in cache create a need for a fence instruction?


It doesn't.


I've used Peterson's algorithm before without using fence instructions...


Issues like this even caught Microsoft with their pants down when true multi-core systems were introduced. There wasn't an issue with single-core multi-threading because thread switches made the issue irrelevant: the pipeline of thread A was fully retired long before thread B got its chance.


In a single-core environment (as well as in multi-core environments without out-of-order execution), only three cases can exist after a switch from thread A to thread B.

Thread B can witness:

o Operations A1 and A2 were both executed.
o Only operation A1 was executed.
o Neither A1 nor A2 were executed.

In some multi-core out-of-order execution environments, a new case may be observed by thread B:

o Only operation A2 was executed.

It is this case where the fence instructions can play a role, by preventing it from happening.
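
To make the four cases concrete, here is a hedged C++ sketch of the buffer-plus-flag handoff from the original question. All names (payload_a1, ready_a2, thread_a, thread_b, run_handoff) are invented for the example. It uses the standard std::atomic_thread_fence rather than raw sfence/lfence; the compiler emits whatever the target needs, which for ordinary stores on x86 is usually no fence instruction at all, only a restriction on compiler reordering. With the two fences in place, thread B can no longer observe "only A2".

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state for the sketch.
int payload_a1 = 0;                 // written by operation A1
std::atomic<bool> ready_a2{false};  // set by operation A2

void thread_a() {
    payload_a1 = 42;                                      // A1: write the data
    std::atomic_thread_fence(std::memory_order_release);  // A1 may not move below A2
    ready_a2.store(true, std::memory_order_relaxed);      // A2: publish the flag
}

int thread_b() {
    // Spin until A2 is visible.
    while (!ready_a2.load(std::memory_order_relaxed)) {
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // later reads stay after the flag read
    return payload_a1;                                    // A1 is guaranteed to be visible here
}

// Runs A and B on two threads and returns what B observed.
int run_handoff() {
    int seen = 0;
    std::thread b([&seen] { seen = thread_b(); });
    std::thread a(thread_a);
    a.join();
    b.join();
    return seen;
}
```

Calling run_handoff() should return 42 on every run. Without the fences (and with a plain bool flag), the program would contain a data race and the result would not be guaranteed.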

Posted on 2007-11-30 02:01:41 by Rockoon