Hi all,

I was wondering if anyone here knows if there are any low-level optimizations for multithreading. I've done some experiments with my dual-core Athlon, and the results are rather disappointing. Using Win32's SetEvent and WaitForSingleObject, flipping between two threads takes thousands of clock cycles. Using spin loops is a lot faster, but burns clock cycles even when there is no work.

So, are there any faster ways to do synchronization between concurrent threads?

Thanks,

c0d1f1ed
Posted on 2006-10-12 12:45:19 by C0D1F1ED
You would use InterlockedExchange and the InterlockedXXX APIs
or
just use the "lock" prefix directly
Posted on 2006-10-12 14:46:50 by Dite
read up on the AMD docs
use the PAUSE instruction in your spinloops - it works on the P4 and newer, not sure when AMD started supporting it
it ensures the loop doesn't check the variable more often than a memory read can actually happen, and it works as a NOP on all systems that don't support it
you can also use the OS timing services to check the variable less often
the OS itself can issue a HLT instruction so the thread consumes nothing (well, you asked for low-level)
when the OS determines that the synchronization condition is satisfied, the thread is woken up
ownership of shared data is per cacheline: when one CPU writes to a cacheline, the other CPU's copy is invalidated and it must re-read it through the memory and cache system, so write data shared with the other CPU as seldom as possible
this also means the data should be aligned so each CPU's data fits on its own cacheline, with no other data on it
so you should pad it with dummy variables
it also means your spinloop variable should be alone on a 128-byte cacheline, to avoid false sharing, which puts a lot of unnecessary load on the system bus
Posted on 2006-10-12 15:59:56 by daydreamer
You might want to check out these two links:
http://www.mikusite.de/pages/x86.htm
http://board.flatassembler.net/topic.php?t=5122

Of course this is a VERY specific thing so it won't work all the time, but it does show how nicely fractal rendering scales with SMP - and how much Core2Duo kicks ass :)
Posted on 2006-10-12 16:13:23 by f0dder

You would use InterlockedExchange and the InterlockedXXX APIs
or
just use the "lock" prefix directly

Thanks, but that doesn't halt or resume threads...

My problem is this: the bottleneck of my application takes roughly 60% of the total processing time. So with dual-core I can, theoretically, split that job in two and do 30% on the other core, making the application run 43% faster. But when using SetEvent/WaitForSingleObject to start the second thread and wait for it to finish the job (to synchronize with the primary thread), there's a lot of overhead caused by those functions, likely because they invoke the kernel. Using spin loops largely removes the overhead and lets me reach the theoretical performance, but CPU usage is constantly at 100% while the processor is only really working 71.5% of the time.

This not only causes the CPU to heat up more than it should, it also takes away clock cycles from other applications. While the first problem can apparently be solved by using PAUSE instructions (thanks daydreamer), the second problem can only really be solved by putting the thread in a waiting state at the kernel level. Spin loops are especially nasty when multiple applications use them and hog the entire system.

So I was wondering if there's any alternative to SetEvent/WaitForSingleObject that is quicker to respond (and keeps the overhead in the idle time). What I'm looking at now is asynchronous procedure calls, and altering the thread's priority to leave 'idle' time to other processes. Anyone got some experience with these?
Posted on 2006-10-12 16:39:17 by C0D1F1ED

You might want to check out these two links:
http://www.mikusite.de/pages/x86.htm
http://board.flatassembler.net/topic.php?t=5122

Of course this is a VERY specific thing so it won't work all the time, but it does show how nicely fractal rendering scales with SMP - and how much Core2Duo kicks a** :)

Downloading the code gives me a 404, but I assume Mandelbrot rendering is a highly parallelizable task...

My application really only has one central bottleneck that is suited for parallelization. So outside of that bottleneck it would be fair to leave the second core to other processes.
Posted on 2006-10-12 16:45:35 by C0D1F1ED
Too bad it gives 404 (hunt the fasm board for attachments?). But yes, it is *highly* parallelizable - you could do it per-pixel if you want to. No real locking needs to be done, "LOCK XADD" for the "next scanline to render" is all it takes.

Anyway, look up CRITICAL SECTIONS - it tries spinlooping for a short time, then uses an event. It's one of the better generic all-round synchronization primitives in win32.
Posted on 2006-10-12 17:00:22 by f0dder
Anyway, look up CRITICAL SECTIONS - it tries spinlooping for a short time, then uses an event. It's one of the better generic all-round synchronization primitives in win32.

Ah, sounds like exactly what I need! I was already wondering whether there wasn't some way to 'schedule' a sleep operation: spin loop until the next time the kernel's scheduler runs, and only then really stop executing. Hopefully I can do that with critical sections. Thanks!
Posted on 2006-10-12 17:20:37 by C0D1F1ED
The idea of critsects is of course that it's cheap to spinlock for a short amount of time (far cheaper than r3->r0->r3 and a wait), but that spinlocks are expensive in the long run (they hog the CPU, generate heat, ...).

They're not the end-all-be-all, but it's definitely worth trying out.

I wish all stuff was as easy to parallelize as the Mandelbrot - then we'd have really outstanding performance, with linear speed increases :)
Posted on 2006-10-12 17:48:02 by f0dder
To the ineffable all,

    What is dual core anyway?  Is it two CPUs in one chip, or something more than that?  A sentence or two explaining it would be nice to read.  Ratch
Posted on 2006-10-12 23:50:18 by Ratch

What is dual core anyway?  Is it two CPUs in one chip, or something more than that?  A sentence or two explaining it would be nice to read.  Ratch

It's both :D

Intel's first dual-core chips basically bolted two full Pentium 4 cores onto a single chip, but (iirc) they still used the normal bus to communicate.

Then came dualcore AMD64, which uses an internal high-speed bus between the cores for communication.

There's more to it than that, like how the cache is done (how much is per-core, how much is shared, how it is shared, etc.).
Posted on 2006-10-13 03:36:14 by f0dder
Dual-core is very much like a dual-processor system. The processors are just in one package. So they have to call it dual-core because there's physically only one processor. It behaves completely like a dual-processor system, except that communication between the cores is a bit faster because they are so close together.
Posted on 2006-10-13 03:39:57 by C0D1F1ED


What is dual core anyway?  Is it two CPUs in one chip, or something more than that?  A sentence or two explaining it would be nice to read.  Ratch

It's both :D

Intel's first dual-core chips basically bolted two full Pentium 4 cores onto a single chip, but (iirc) they still used the normal bus to communicate.

Then came dualcore AMD64, which uses an internal high-speed bus between the cores for communication.

There's more to it than that, like how the cache is done (how much is per-core, how much is shared, how it is shared, etc.).


I am working on a dual-thread test app right now; I was thinking of experimenting with how different data structures etc. affect things
Posted on 2006-10-13 03:45:03 by daydreamer
I wish all stuff was as easy to parallelize as the mandelbrot, then we'd have really outstanding performance with linear speed increase :)

In my experience, finding things that can run in parallel is not that hard. The real problem is fast synchronization. You can often simply select a bunch of calculations that are independent, try to run them on a second thread, and then find that synchronization is taking longer than the calculations themselves. If you're 'lucky', things are faster, but still far from the theoretical performance.

This is becoming a very fundamental problem, because in the not so distant future we'll have quad-core, octa-core, etc. So I believe it's the responsibility of chip makers and OS writers to come up with much faster ways to do synchronization. The PAUSE instruction is definitely a step in the right direction. Ideally, the CPU should do thread scheduling in hardware, and synchronization calls should be single instructions...
Posted on 2006-10-13 03:55:25 by C0D1F1ED
by the way - about locking.
is it really necessary to do locking with (expensive) kernel objects etc.?

we asm coders can do something like this:

locked db 0
...
mov al, 1
xchg [locked], al
cmp al, 0
jne .cant_lock
...
mov [locked], 0


is there any problem with that? (except that threads do not get lock in order they tried to take lock, but i don't see that as some serious problem)
Posted on 2006-10-13 04:09:44 by vid
In case there are multiple processors in the system you should do "lock xchg".
Posted on 2006-10-13 05:59:05 by Tomasz Grysztar
According to the Intel docs, XCHG with a memory operand automatically asserts the processor's LOCK signal.
Posted on 2006-10-13 06:43:31 by sinsi
Sorry, but this thread is not about XCHG at all (unless I'm really missing something). It's about splitting a task over two threads with minimal overhead.

So far, spin looping seems like the best solution even though it burns clock cycles while there is no work. Lowering the thread priority and using PAUSE allows other processes to get some execution time and prevents the CPU from heating up unnecessarily.
Posted on 2006-10-13 06:53:58 by C0D1F1ED
sorry if i went off-topic - i was thinking that to split an algo over 2 threads you may also need to synchronize them, and locking is a synchronization mechanism, and locking is done via XCHG ;)
Posted on 2006-10-13 08:11:31 by vid
vid: that's just one way of doing things - spinlock.
Posted on 2006-10-13 08:14:41 by f0dder