CRITICAL SECTIONS seem to be the way to go -
Unlike events, mutexes, and semaphores, which are also used for multithreaded synchronization, critical sections don't always perform an expensive control transfer to kernel mode. As you'll see later, acquiring an unheld critical section requires, in effect, just a few memory modifications and is very quick. Only if you try to acquire an already-held critical section does it jump into kernel mode.

This is from an MSDN Magazine article http://msdn.microsoft.com/msdnmag/issues/03/12/CriticalSections/default.aspx
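
For reference, the usage pattern the article describes looks roughly like this in C (a minimal sketch; worker, g_cs, g_shared and the spin count are made up for illustration):

#include <windows.h>

static CRITICAL_SECTION g_cs;   /* protects g_shared */
static int g_shared = 0;        /* hypothetical shared data */

DWORD WINAPI worker(LPVOID param)
{
    int i;
    for (i = 0; i < 100000; i++) {
        /* Uncontended: a few interlocked memory operations in user mode.
           Contended: the thread ends up waiting on a kernel object inside the API. */
        EnterCriticalSection(&g_cs);
        g_shared++;
        LeaveCriticalSection(&g_cs);
    }
    return 0;
}

int main(void)
{
    HANDLE h[2];

    /* Spin a little under contention before falling back to a kernel wait. */
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

    h[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);

    DeleteCriticalSection(&g_cs);
    return 0;
}

InitializeCriticalSectionAndSpinCount lets a contended acquire spin briefly in user mode before it falls back to the kernel wait.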
Posted on 2006-10-13 08:35:10 by sinsi
Hi
Implementing your own synchronisation scheme may seem cool, but there are other considerations to take into account. If you implement something like pure spinlocks, you will waste CPU resources and the total effect will be worse than a sequential implementation. To avoid this problem there are APIs that can be used to switch to another thread and give up the rest of the time slice, but if you compare the execution of the other threads you will see that the OS is not able to schedule them smoothly.

The best way to synchronise the code is to use OS objects like critical sections and co.

Biterider
Posted on 2006-10-13 08:45:12 by Biterider
I was thinking about something like custom locking: when we don't get the lock, we give up our time slice with some OS call and retry the lock in a loop.
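
A rough C sketch of that idea; lock_acquire, lock_release, g_lock and the spin threshold are invented for illustration, not an existing API:

#include <windows.h>

static volatile LONG g_lock = 0;    /* 0 = free, 1 = held (hypothetical lock variable) */

void lock_acquire(void)
{
    int spins = 0;
    /* InterlockedExchange is a locked XCHG: atomically grab the lock or see it held. */
    while (InterlockedExchange(&g_lock, 1) != 0) {
        if (++spins < 1000) {
            YieldProcessor();   /* PAUSE: spin politely for a short while */
        } else {
            spins = 0;
            SwitchToThread();   /* give up the rest of the time slice, then retry */
        }
    }
}

void lock_release(void)
{
    InterlockedExchange(&g_lock, 0);
}

The uncontended path is a single locked XCHG; only when the lock stays held does the loop start yielding.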
Posted on 2006-10-13 08:49:04 by vid
Hi
You can use the SwitchToThread API, but again, you will see that the scheduler will not work as it should.

Biterider
Posted on 2006-10-13 08:51:23 by Biterider
vid: look at critical sections... no reason to reimplement the wheel :)
Posted on 2006-10-13 09:07:51 by f0dder
Unfortunately, SwitchToThread seems to ignore multi-processors as far as the calling thread goes -
Note  The yield of execution is limited to the processor of the calling thread. The operating system will not switch execution to another processor, even if that processor is idle or is running a thread of lower priority.
Posted on 2006-10-13 09:11:10 by sinsi
f0dder: true, as long as the wheel is implemented with roughly the same usage on every OS you want to code for, and with comparable or better speed on all of them...
Posted on 2006-10-13 09:57:41 by vid

Sorry, but this thread is not about XCHG at all (unless I'm really missing something). It's about splitting a task over two threads with minimal overhead.

So far, spin looping seems like the best solution, even though it burns clock cycles while there is no work. Lowering the thread priority and using PAUSE allows other processes to get some execution time and prevents the CPU from heating up unnecessarily.
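
As a rough sketch of that combination in C, assuming a hypothetical g_has_work flag that the other thread sets when it hands over data (YieldProcessor expands to PAUSE on x86):

#include <windows.h>

static volatile LONG g_has_work = 0;   /* hypothetical flag set by the producer thread */

void wait_for_work(void)
{
    /* Let other runnable threads in ahead of this busy-wait. */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);

    /* Spin until the flag flips to 1, then atomically take it back to 0. */
    while (InterlockedCompareExchange(&g_has_work, 0, 1) == 0)
        YieldProcessor();   /* PAUSE: keeps the spin cool and frees shared
                               execution resources on a hyper-threaded core */
}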

and when you exit you get a big penalty for branch misprediction
Posted on 2006-10-14 03:04:23 by daydreamer
and when you exit you get a big penalty for branch misprediction

Big is relative. Calling kernel functions takes several thousand clock cycles, while a mispredicted branch takes 10-20 clock cycles. Besides, it's totally unavoidable as far as I know. The pipelines have to be filled with new instructions either way.
Posted on 2006-10-14 06:50:50 by C0D1F1ED
How often is it recommended to synchronize?
This may sound stupid, but I am writing a multithreaded test app and have no specific algorithm that dictates when data needs to be transferred between threads or how much work should be split between them.
So I want to minimize synchronization, since it can mess up the cache, and reading 128 bytes too often takes a hit on performance; on the other hand I don't want to synchronize too seldom either, because that won't represent a real-world application.
I just use a macro that lets you fill in your own test code, which is run in both threads, and timing is added to the test run with one thread or two threads.
Posted on 2006-10-14 11:40:43 by daydreamer
You might want to look at using IO Completion Ports.
They are designed to be used for high-performance asynchronous file or network IO, but let's be clear, they are a mechanism for queuing 'jobs' for a pool of worker threads, and for dealing with notification of 'job completions', and they can be used for practically anything.
The idea with these is that you don't write a thread loop as such, you write some handlers/callbacks for various operations, and let the operating system deal with thread management.
Not exactly a low-level solution, but certainly worth looking into.
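
As a bare-bones sketch, here is a completion port used purely as a job queue, with no file handles involved (the JOB struct, the two-worker pool and the NULL-packet quit convention are made up for illustration):

#include <windows.h>
#include <stdio.h>

/* Hypothetical job record; the OVERLAPPED member doubles as the queue token. */
typedef struct JOB {
    OVERLAPPED ov;
    int        payload;
} JOB;

static HANDLE g_port;

DWORD WINAPI worker(LPVOID param)
{
    DWORD       bytes;
    ULONG_PTR   key;
    OVERLAPPED *pov;
    JOB        *job;

    /* Block until a 'job' is posted; the OS wakes one waiting worker per packet. */
    while (GetQueuedCompletionStatus(g_port, &bytes, &key, &pov, INFINITE)) {
        if (pov == NULL)              /* NULL overlapped = our quit convention */
            break;
        job = (JOB *)pov;
        printf("worker %u handled job %d\n", (unsigned)GetCurrentThreadId(), job->payload);
        HeapFree(GetProcessHeap(), 0, job);
    }
    return 0;
}

int main(void)
{
    HANDLE threads[2];
    int    i;

    /* Not bound to any file handle: just a kernel-managed job queue plus thread pool. */
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 2);

    for (i = 0; i < 2; i++)
        threads[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);

    for (i = 0; i < 8; i++) {
        JOB *job = (JOB *)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeof(JOB));
        job->payload = i;
        PostQueuedCompletionStatus(g_port, 0, 0, &job->ov);
    }

    for (i = 0; i < 2; i++)           /* one quit packet per worker */
        PostQueuedCompletionStatus(g_port, 0, 0, NULL);

    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    CloseHandle(g_port);
    return 0;
}

Passing INVALID_HANDLE_VALUE to CreateIoCompletionPort creates a free-standing port, and PostQueuedCompletionStatus lets you enqueue arbitrary packets for the worker pool.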
Posted on 2006-10-14 12:33:44 by Homer

You might want to look at using IO Completion Ports.
They are designed to be used for high-performance asynchronous file or network IO, but let's be clear, they are a mechanism for queuing 'jobs' for a pool of worker threads, and for dealing with notification of 'job completions', and they can be used for practically anything.
The idea with these is that you don't write a thread loop as such, you write some handlers/callbacks for various operations, and let the operating system deal with thread management.
Not exactly a low-level solution, but certainly worth looking into.

Thanks for the tip! It sounds like it could be better suited for my specific situation.
Posted on 2006-10-14 13:24:42 by C0D1F1ED
Anyone ever used fibers? Are they just threads that you need to schedule yourself, or are they truly more primitive than threads and faster?
Posted on 2006-10-22 10:18:54 by C0D1F1ED

Anyone ever used fibers? Are they just threads that you need to schedule yourself, or are they truly more primitive than threads and faster?

That is my understanding of fibers - but I've never played with them.
Posted on 2006-10-22 10:26:19 by f0dder
Fibers run on one thread. Their switching is basically:

mov esi, pCurrentFiber
mov edi, pFiberToSwitchTo

xchg [esi].Fiber.esp_, [edi].Fiber.esp_ ; swap the saved stack pointers
xchg [esi].Fiber.ebp_, [edi].Fiber.ebp_ ; swap the saved frame pointers

ret ; goes back to pFiberToSwitchTo's calling function (different esp, ...)

(metaphorically "speaking").
Thus, there is no way to use a second CPU with fibers.
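
For what it's worth, the Win32 calls involved look like this (a minimal sketch; fiber_proc and the parameter string are invented for illustration):

#include <windows.h>
#include <stdio.h>

static LPVOID g_main_fiber;   /* the fiber the original thread was converted into */

static VOID CALLBACK fiber_proc(LPVOID param)
{
    printf("hello from fiber %s\n", (const char *)param);
    /* A fiber must switch somewhere; if this routine returns, the thread exits. */
    SwitchToFiber(g_main_fiber);
}

int main(void)
{
    LPVOID worker;

    g_main_fiber = ConvertThreadToFiber(NULL);             /* the thread becomes a fiber      */
    worker = CreateFiber(0, fiber_proc, (LPVOID)"worker"); /* allocates a stack and a context */

    SwitchToFiber(worker);   /* cooperative: runs fiber_proc on this same thread */
    printf("back in the main fiber\n");

    DeleteFiber(worker);
    return 0;
}

Everything runs on the single thread that called ConvertThreadToFiber; SwitchToFiber is just a user-mode register and stack swap, which is why a second CPU never gets involved.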

For two threads to run concurrently, I recommend using custom FIFO chains of commands, balanced between the threads. I've seen it run almost perfectly, with up to 30-50 ms difference in execution time. Or at least it appeared so.
The FIFO chain uses a CriticalSection for access (write, poll, read).
Without a dual-core at home, and for other reasons, I haven't toyed with such FIFO chains beyond using them for MIDI-command handling and the like.
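
A sketch of that kind of FIFO in C, assuming a hypothetical fixed-size ring of 'commands' (the CMD fields and FIFO_SIZE are made up):

#include <windows.h>

#define FIFO_SIZE 256

/* Hypothetical command record passed from the producer thread to the consumer. */
typedef struct CMD { int opcode; int arg; } CMD;

typedef struct FIFO {
    CRITICAL_SECTION cs;
    CMD  ring[FIFO_SIZE];
    int  head, tail;             /* head = next to read, tail = next to write */
} FIFO;

void fifo_init(FIFO *q)
{
    InitializeCriticalSection(&q->cs);
    q->head = q->tail = 0;
}

/* Producer side: returns FALSE if the chain is full. */
BOOL fifo_write(FIFO *q, const CMD *c)
{
    BOOL ok = FALSE;
    EnterCriticalSection(&q->cs);
    if ((q->tail + 1) % FIFO_SIZE != q->head) {
        q->ring[q->tail] = *c;
        q->tail = (q->tail + 1) % FIFO_SIZE;
        ok = TRUE;
    }
    LeaveCriticalSection(&q->cs);
    return ok;
}

/* Consumer side (poll): returns FALSE if the chain is empty. */
BOOL fifo_read(FIFO *q, CMD *out)
{
    BOOL ok = FALSE;
    EnterCriticalSection(&q->cs);
    if (q->head != q->tail) {
        *out = q->ring[q->head];
        q->head = (q->head + 1) % FIFO_SIZE;
        ok = TRUE;
    }
    LeaveCriticalSection(&q->cs);
    return ok;
}

The critical section only guards the index updates, so it is held very briefly on each write or poll.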
Posted on 2006-10-22 16:38:14 by Ultrano
30-50 ms? That sounds like a lot, when my message boxes pop up and show at most 10 ms between threads. And I don't know if that is the delay in Windows message handling between the two SendMessage calls: the message coming through the message queue and wndproc, and Windows' internal message handling.
Posted on 2006-10-29 04:42:04 by daydreamer
I've gotta agree on three of those points. The CPU should do scheduling in hardware, multicore processors are on the way, and synchronization takes more of my time even when it doesn't take more computing time.
Posted on 2006-11-22 09:04:06 by Jeronimo0d0a
Problem is that "scheduling" is more than just selecting a thread to run. There are different algorithms for different situations, some OSes do more on a thread switch than others, and there are more situations than just the hardware clock interrupt where a re-schedule is needed...
Posted on 2006-11-22 09:08:29 by f0dder