I was wondering how much (re-)scheduling of instructions is worth these days. I did a few rough experiments with automatic scheduling and got only 5% speedup in general. So I was wondering if this is due to the out-of-order execution of modern processors or does my algorithm just suck? I'm mainly trying to reduce dependencies by scheduling high-latency instructions as soon as possible and doing some independent instructions in between before scheduling the instructions that use the result.
Doing some tests myself on very short code loops showed that I could gain up to 50% performance by just swapping the instruction order. But on longer code sequences this wasn't the case. Either that's because the processor has more freedom for out-of-order execution, or I missed the optimal order...
Is anybody scheduling manually or do you just focus on using the least number of instructions and going for low latency? If you schedule manually, what are the things you focus on? What's your method, your algorithm? Or do you just keep trying and timing until you've reached the optimum, even for longer code?
Ok so nobody schedules any more these days...
I think there might be a bunch of people who don't see posts in 'The Heap' after Hiro added the toggle option whether to include forums under 'Other' in 'view new posts' - and lots of people don't view the board during weekends. Don't jump to conclusions ^_^
I would say that reducing dependencies and scheduling instructions are two different things.
An ooo CPU can schedule instructions pretty efficiently at runtime, so the exact instruction order generally doesn't matter.
However, you should try to use a minimum of micro-ops (not necessarily instructions), and especially a minimum of dependent micro-ops. And know the limitations of the CPU(s) you're targeting...
For example, if I'm not mistaken, the Athlon cannot reschedule loads and stores. And there's a maximum to the number of dependent instructions (dependency chains) that a CPU can handle efficiently. Sometimes it's better to write a value to a local var, and continue from the local var later with the calculation, than to keep it stored in the register, because as you 'pin down' the register, you limit the CPU in the options it has for rescheduling code.
So well, in short, sometimes scheduling matters, often it doesn't, as long as you keep the dependencies under control.
Also, note that some CPUs handle some instructions in multiple dependent micro-ops... For example, I discovered that lea reg, [reg+reg*scale] takes 1 clk longer on Athlon than lea reg, [reg+reg] or lea reg, [reg*scale]. It seems to break it up into a shift and an add internally, and executes them one after another. If you break such instructions up yourself, you can sometimes get it faster.
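To illustrate (just a sketch; register choices are arbitrary, and it assumes ebx may be clobbered):

    ; combined form: shift + add hidden inside one lea, executed serially
    lea   eax, [eax+ebx*4]

    ; manual split: if ebx is ready early, the shl can run ahead, so only the
    ; final add sits on eax's critical path
    shl   ebx, 2
    add   eax, ebx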
The only thing I'd add to Bruce's good advice is that all OOO CPUs have an effective range, or granularity, to the reordering - it's not as general as the term OOO makes it seem. The OOO is just there to reduce the granularity that compilers must concern themselves with during optimization (ASM programmers should know how to order instructions).
I hardly do it; I only reorder to software-pipeline FPU/3DNow! etc. and integer calculations as much as possible. You'd be surprised how quickly I got my Pentium 133MHz running my FPU mixer just by interleaving FPU and integer instructions. It seems to be quite effective on most processors, even on non-OoO CPUs.
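Roughly like this (a made-up mixer fragment, not my actual code; the integer pointer and counter updates overlap the latency of the FPU multiply, which helps even on an in-order Pentium):

mix_loop:
    fld   dword ptr [esi]       ; load the next sample
    fmul  st(0), st(1)          ; scale by the gain assumed to sit in st(1)
    add   esi, 4                ; integer work executes while the fmul is in flight
    add   edi, 4
    fstp  dword ptr [edi-4]     ; store the scaled sample
    dec   ecx
    jnz   mix_loop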
I think there might be a bunch of people who don't see posts in 'The Heap' after Hiro added the toggle option whether to include forums under 'Other' in 'view new posts' - and lots of people don't view the board during weekends. Don't jump to conclusions ^_^
I know. :grin: Just wanted to bump it a little and get a few more reactions. I think it worked. ;)
Originally posted by Bruce
I would say that reducing dependencies and scheduling instructions are two different things.
An ooo CPU can schedule instructions pretty efficiently at runtime, so the exact instruction order generally doesn't matter.
That's actually what my question is all about. Is it possible to do better than the CPU? As far as I know, the processor selects -the first- instruction that has all operands available and whose execution unit is available. But I think there might be a lot of situations where it's not the first instruction that is the best candidate for scheduling. A dumb example is a nop or any other banal instruction followed by a high-latency instruction. It's clear that if both instructions are available for scheduling, it's best to execute the high-latency instruction first, right?
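Something like this is the kind of reordering I mean (made-up fragment, latencies are rough figures):

    ; naive order: the long-latency divide is issued last, and the multiply
    ; that needs its result waits the full latency
    add    eax, 4
    xor    edx, edx
    divss  xmm0, xmm1           ; ~20+ clk
    mulss  xmm2, xmm0

    ; hand-scheduled: start the divide as early as possible so the cheap,
    ; independent integer work overlaps part of its latency
    divss  xmm0, xmm1
    add    eax, 4
    xor    edx, edx
    mulss  xmm2, xmm0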
Now, I believe that I've asked this question here before, but how can I know the latency, throughput and the execution units for every instruction, independent of the architecture? With the latter I mean I'm not in the mood to write a table with 5000 instruction variants for every chip. :( So I was looking for a way to construct such a table automatically using a 'profiling session'.
However, you should try to use a minimum of micro-ops (not necessarily instructions), and especially a minimum of dependent micro-ops. And know the limitations of the CPU(s) you're targeting...
For example, if I'm not mistaken, the Athlon cannot reschedule loads and stores. And there's a maximum to the number of dependent instructions (dependency chains) that a CPU can handle efficiently. Sometimes it's better to write a value to a local var, and continue from the local var later with the calculation, than to keep it stored in the register, because as you 'pin down' the register, you limit the CPU in the options it has for rescheduling code.
I'm mainly targeting optimal performance for the Pentium III and Pentium 4. If it can be optimized for Athlons too then that would be great, but it's not a primary goal.
There's little I can do about the register allocation I'm afraid. It's done automatically and registers are freed as soon as possible to avoid false dependencies. On the other hand... I always select a free register with the lowest index. Maybe it's better to select the register that hasn't been used the longest? Thanks for the idea!
So well, in short, sometimes scheduling matters, often it doesn't, as long as you keep the dependencies under control.
Also, note that some CPUs handle some instructions in multiple dependent micro-ops... For example, I discovered that lea reg, [reg+reg*scale] takes 1 clk longer on Athlon than lea reg, [reg+reg] or lea reg, [reg*scale]. It seems to break it up into a shift and an add internally, and executes them one after another. If you break such instructions up yourself, you can sometimes get it faster.
Yeah that would be the kind of information that would be handy to have too. It would be a horrible job to analyse it manually for every instruction though... Any ideas?
Thanks!
The only thing I'd add to Bruce's good advice is that all OOO CPUs have an effective range, or granularity, to the reordering - it's not as general as the term OOO makes it seem. The OOO is just there to reduce the granularity that compilers must concern themselves with during optimization.
Thanks for the advice. The code I work with is mostly 100 instructions or longer, with very little branching. I've read somewhere that the reorder buffer of a Pentium III is only 40 micro-instructions long, so I thought it might be useful to schedule beyond that limit as well. Because sometimes I have groups of a few dozen instructions that form a dependency chain, and then another such group that is totally independent.
ASM programmers should know how to order instructions
Sorry, but I hardly know how to do it manually (let alone do it automatically). I have a source file with 2000 lines of code that is all equally as much a bottleneck and I wouldn't know where to start optimizing it. I have the necessary framework for rescheduling but not the actual optimal algorithms.
I hardly do it; I only reorder to software-pipeline FPU/3DNow! etc. and integer calculations as much as possible. You'd be surprised how quickly I got my Pentium 133MHz running my FPU mixer just by interleaving FPU and integer instructions. It seems to be quite effective on most processors, even on non-OoO CPUs.
I'm afraid the easy days of the Pentium I are over. I don't know if this is true but I've heard that mixing integer and floating-point operations on modern processors has several penalties?
Is it possible to do better than the CPU?
In short, no... The CPU always reorders anyway. And it has more information than a human does (for example, you cannot predict when a task-switch occurs, which resets the ooo-buffer and cache, and what effect this has).
Assuming the code is not complete trash to begin with (valid assumption with compiler-generated code, and most hand-written code, based on the Pentium/P6 optimization rules, even if written rather naively), then I think the CPU will generally manage just fine.
You should focus on minimizing micro-ops, dependencies and decoding-problems on Pentium Pro and higher (any ooo-CPU in other words), I'd say. The actual order of execution is determined by the CPU anyway. Just don't assume that the ooo-window is infinitely large :)
But in general, your inner-loops will fit in the ooo-buffer entirely... I'll have to look up how large the ooo-buffer is on different CPUs, but I vaguely recall a figure of 40, for Pentium Pro.
Now, I believe that I've asked this question here before, but how can I know the latency, throughput and the execution units for every instruction, independent of the architecture? With the latter I mean I'm not in the mood to write a table with 5000 instruction variants for every chip.
There's no way to measure these things properly on a CPU, as far as I know... In fact, even the figures listed by the manufacturers aren't always exact... There are usually some conditions for reaching these numbers, and sometimes they're just worst-case numbers.
In fact, AMD doesn't list any latency-figures at all, as far as I know, only throughput. But if you write Pentium Pro-code for Athlon, it generally works very well. The design is quite similar.
Personally I don't worry about these things too much... I just ignore instructions that are known to be 'bad' on some CPUs (such as inc/dec) and let the ooo handle the exact order.
There's little I can do about the register allocation I'm afraid. It's done automatically and registers are freed as soon as possible to avoid false dependencies. On the other hand... I always select a free register with the lowest index. Maybe it's better to select the register that hasn't been used the longest? Thanks for the idea!
That doesn't matter. The registers the programmer sees (eax, ebx etc) are just renamed anyway. What I meant was, don't tie up registers with values... If you need a value a few clks later, try saving it to a local variable, rather than keeping it in the register. You'll see that this is often faster, because you make the instructions independent to the CPU. The dependency is now moved to the cache, so to say. In other words... Try to make the time between reading a value into a register, and writing it back to memory as short as possible. L1 cache is really fast anyway. Sometimes it's even faster to do a single loop in 2 passes... First pass does 'half the loop', then writes to a temporary buffer, which is forced into L1 cache...
The second loop reads back the L1 cache, and does the remaining operations, then stores to the final location.
Since the total dependency is shorter, the CPU will now reorder the code more efficiently.
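A tiny sketch of the local-variable idea (the stack offset is made up, just to show the pattern):

    mov   dword ptr [esp+16], esi   ; spill the value to a local; the store is cheap
    ; ... esi is free for other work in the meantime ...
    add   eax, dword ptr [esp+16]   ; read it back later as a memory operand,
                                    ; 1-2 clk from L1 cache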
Yeah that would be the kind of information that would be handy to have too. It would be a horrible job to analyse it manually for every instruction though... Any ideas?
Not really, I just found this out by rewriting the same piece of code in a few ways, and trying to find out why one version was faster than the other.
I'm afraid the easy days of the Pentium I are over. I don't know if this is true but I've heard that mixing integer and floating-point operations on modern processors has several penalties
I don't know of any. What I do know is that on PIII and above, you should try to avoid the x87 altogether, since it's just there for legacy support. Virtually all common x87 code can be rewritten in SSE/SSE2, and is considerably faster, even when not taking advantage of the parallelism. This is what Intel wants you to do.
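For scalar float work the translation is usually mechanical; something like this (assuming esi, edi and ebx point at floats):

    ; x87: stack-based, every value has to pass through st(0)
    fld   dword ptr [esi]
    fmul  dword ptr [edi]
    fstp  dword ptr [ebx]

    ; SSE scalar: flat registers, no stack juggling, and the same pattern
    ; extends naturally to the packed (4-wide) instructions
    movss xmm0, dword ptr [esi]
    mulss xmm0, dword ptr [edi]
    movss dword ptr [ebx], xmm0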
I have the necessary framework for rescheduling but not the actual optimal algorithms.
If it was as simple as just implementing an algorithm, we wouldn't be here. I use assembly for one reason only: compilers cannot generate optimum code.
The behaviour of modern x86 CPUs is too complex to just put into an algorithm. Else we'd have those algorithms in compilers already. The only remedy is to use simpler CPUs. This is what Intel is doing with the Itanium2. It does no rescheduling by itself, everything is now the responsibility of the compiler (including branch-behaviour. The only fuzzy part remaining is the cache/memory subsystem). The CPU is very simple and the rules of execution are very clear.
This is a very important requirement for automated code scheduling. x86 just doesn't work, it's too old. It was designed before compilers were commonly used, and therefore it isn't designed in a way that will make it cooperate with a compiler efficiently. Which is why we are still coding asm.
Originally posted by Bruce
In short, no... The CPU always reorders anyway.
But not optimally. I am getting slight performance increases from my simple scheduler. The problem is that it has to be consistently better than not scheduling, because sometimes it makes things worse. :mad: I'm just wondering if it's achievable to have an algorithm that always improves performance (given as input the straightforward generated code).
And it has more information than a human does (for example, you cannot predict when a task-switch occurs, which resets the ooo-buffer and cache, and what effect this has).
What does a context switch have to do with it? Surely it can run several iterations of my loop (about 100 clock cycles) without being interrupted by anything?
But in general, your inner-loops will fit in the ooo-buffer entirely... I'll have to look up how large the ooo-buffer is on different CPUs, but I vaguely recall a figure of 40, for Pentium Pro.
It doesn't fit in the reordering window. I'm compiling processing pipelines for graphics.
Personally I don't worry about these things too much... I just ignore instructions that are known to be 'bad' on some CPUs (such as inc/dec) and let the ooo handle the exact order.
Oh, I didn't know that. Why exactly is inc bad? I'm using it several times...
Sometimes it's even faster to do a single loop in 2 passes... First pass does 'half the loop', then writes to a temporary buffer, which is forced into L1 cache...
The second loop reads back the L1 cache, and does the remaining operations, then stores to the final location.
Since the total dependency is shorter, the CPU will now reorder the code more efficiently.
That's not the case for my situation. The inner loops take approximately 'only' 100 clock cycles. Splitting the loop up and writing all general-purpose, MMX and SSE registers to memory is definitely going to increase that.
I don't know of any. What I do know is that on PIII and above, you should try to avoid the x87 altogether, since it's just there for legacy support. Virtually all common x87 code can be rewritten in SSE/SSE2, and is considerably faster, even when not taking advantage of the parallelism. This is what Intel wants you to do.
That's what I already do. Vertex processing makes excellent use of the parallelism offered by SSE, and pixels fit perfectly in MMX registers. I have an SSE emulator that used pure x87 code but it's terribly slow in comparison because of all the emms instructions required.
If it was as simple as just implementing an algorithm, we wouldn't be here. I use assembly for one reason only: compilers cannot generate optimum code.
And I use a compiler because I can't write dozens of optimized processing pipelines in a nanosecond. :wink:
See, it would be easy if I had just one loop of 20 instructions that I had to schedule optimally. After a lot of experimentation I would close in on the optimal performance anyway. But it's different now because I have quite long blocks of code that are put together to form a processing pipeline. So I can't know in advance in what order the instructions will appear so I can't do any scheduling for the blocks. It simply has to be done afterwards, preferably using the same algorithms that people use for manual scheduling.
Hasn't anybody ever done anything similar with scheduling? If I recall correctly the gcc compiler has a very rudimentary scheduler but it doesn't support MMX/SSE. I'm actually focusing on those instruction sets; I mainly only use general-purpose registers for addressing...
Interesting thread. :alright:
Some random bits (not necessarily addressing the problem of rescheduling)
* C0D1F1ED, if you made your code 5% faster than the compiled output from high-quality optimizing compiler, then, I would say, you rock. :) (I'm referring to Intel's own compiler or comparable. Don't stick Borland C or gcc into the list. An intermediate-level ASM coder can beat them easily.) I suppose you reduced the size by hand-coding, and I think you did a good job.
* Whether SSE is faster/better than FPU depends on what you do. If your main coding does not care about the precision of the result (accumulated rounding error), then SSE is good. On the other hand, if your code needs accuracy, then SSE is much worse than FPU.
I like to give this example when people say they are doing their numerical analysis using SSE: complex absolute value (or, Euclidean norm if complex numbers are not introduced to them yet). Implementing it using FPU is shorter, faster, and protected from overflow (well, up to double precision input value - and let's not talk about 80 bit precision, for SSE does not provide it). Implementing it using SSE is either longer and slower if it should provide accuracy, or inaccurate if it is fast.
* inc/dec issue: Pentium Pro (and later CPUs) has issues with inc/dec only when the flag is read right after it. (Pentium 4 optimization manual recommends add/sub because of it.) This is what Agner called "partial flag stall". If you don't read the flag or you update the flag right after inc/dec, then there is no penalty. But, I do use add/sub when adding 2 more bytes makes the next loop naturally aligned at 16 byte boundary, even if I don't read the flag.
* Personally, I schedule my code to utilize all decoders (P6 specific 4-1-1 rule). Although there is no need for this on Pentium 4 (or, should I say, Pentium 4 is incapable of doing it?) I don't see any harm (unless a dependency chain can be broken by breaking the 4-1-1 rule). I confess I don't have a firm understanding of execution parallelism, and I just hope the CPU does it for me. :)
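For reference, a sketch of what I mean by the 4-1-1 grouping (decoder 0 takes instructions of up to 4 uops, decoders 1 and 2 take single-uop instructions only; the uop counts below are the usual P6 figures):

    ; group 1
    add    eax, [esi]           ; load + add = 2 uops -> decoder 0
    paddw  mm0, mm1             ; 1 uop -> decoder 1
    add    esi, 8               ; 1 uop -> decoder 2
    ; group 2
    movq   [edi], mm0           ; store = 2 uops (address + data) -> decoder 0
    pxor   mm2, mm3             ; 1 uop -> decoder 1
    add    edi, 8               ; 1 uop -> decoder 2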
But not optimally. I am getting slight performance increases from my simple scheduler. The problem is that it has to be consistently better than not scheduling, because sometimes it makes things worse. I'm just wondering if it's achievable to have an algorithm that always improves performance (given as input the straightforward generated code).
You are missing the point here. The CPU reordering is optimal. It batches up micro-ops, and dispatches them as soon as the respective execution unit becomes available.
What you are saying is "When I feed it garbage, it does worse than when I feed it code that is reasonably okay".
What I am saying is "The CPU reorders anyway, as long as the instructions are 'sorta' in the right order, it will run properly. Putting them in the ACTUAL right order doesn't matter, because the CPU will find that order anyway".
What does a context switch have to do with it? Surely it can run several iterations of my loop (about 100 clock cycles) without being interrupted by anything?
It was just an example of where the CPU knows better than you.
It doesn't fit in the reordering window. I'm compiling processing pipelines for graphics.
Don't you have hardware for that these days? :)
Anyway, the window is 'streaming'. As long as you make your code faster to decode than to execute, it will virtually fit in the reordering window. Just make sure there are no long dependency chains, as I already said. That will severely trouble the ooo-window in moving forward.
That's not the case for my situation. The inner loops take approximately 'only' 100 clock cycles. Splitting the loop up and writing all general-purpose, MMX and SSE registers to memory is definitely going to increase that.
You'd be surprised. And you don't write to memory, you write to L1 cache, which has a 1 or 2 clk latency on modern CPUs. It's about as good as using registers, without the dependency problems.
I have an SSE emulator that used pure x87 code but it's terribly slow in comparison because of all the emms instructions required.
Perhaps you should read the manual. emms is only required for MMX. It resets the FPU stack state. Since SSE has its own registers, emms has no effect, and is not required.
Besides, if you use SSE anyway, you never actually need the x87, so you never actually need to reset the FPU state. So your MMX code could also do without emms in that case.
See, it would be easy if I had just one loop of 20 instructions that I had to schedule optimally. After a lot of experimentation I would close in on the optimal performance anyway. But it's different now because I have quite long blocks of code that are put together to form a processing pipeline. So I can't know in advance in what order the instructions will appear so I can't do any scheduling for the blocks. It simply has to be done afterwards, preferably using the same algorithms that people use for manual scheduling.
The best you can do, is probably to detect the different dependency chains in the code, interleave them, and perhaps cut them up in multiple parts, separated by store/load to local variable, when they get too long.
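As a trivial sketch of the interleaving (two independent chains; with a large enough ooo window the CPU would find this on its own, the point is to help it when the chains get long):

    ; two independent chains written one after the other: inside each chain the
    ; addps has to wait out the full mulps latency
    mulps  xmm0, xmm4
    addps  xmm0, xmm5
    mulps  xmm1, xmm4
    addps  xmm1, xmm5

    ; interleaved: the second chain's multiply issues while the first chain's
    ; result is still in flight, so the latencies overlap
    mulps  xmm0, xmm4
    mulps  xmm1, xmm4
    addps  xmm0, xmm5
    addps  xmm1, xmm5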
Again, if it were so simple to generate optimum code, we wouldn't be needing asm, because compilers would give us the optimal result anyway. x86 doesn't work.
* Whether SSE is faster/better than FPU depends on what you do. If your main coding does not care about the precision of the result (accumulated rounding error), then SSE is good. On the other hand, if your code needs accuracy, then SSE is much worse than FPU.
I don't agree completely here... As far as I know, SSE also follows the IEEE standards, and it has rounding control in the control word. So the results when dealing with floats in SSE should be the same as the results with the fpu. Same goes for SSE2 and doubles.
Sure, SSE can't do 80 bit precision, then again, apart from the x86, no CPU can, as far as I know, and there are plenty of non-x86 systems being used for numerical analysis and the like, so apparently it's quite possible to make do with 'just' 64 bits.
Although there is no need for this on Pentium 4 (or, should I say, Pentium 4 is incapable of doing it?)
Well, P4 doesn't work like any previous CPU. It doesn't actually cache x86 code. It decodes into micro-ops before it stores into L1 cache... So the entire problem of decoding in a loop is gone anyway. The execution backend never actually sees x86-code at all.
Originally posted by Starless
Interesting thread. :alright:
Thanks!
* C0D1F1ED, if you made your code 5% faster than the compiled output from high-quality optimizing compiler, then, I would say, you rock. :) (I'm referring to Intel's own compiler or comparable. Don't stick Borland C or gcc into the list. An intermediate-level ASM coder can beat them easily.) I suppose you reduced the size by hand-coding, and I think you did a good job.
It's not my intention to let it optimize just any input. It's a lot more complicated to link and create an executable than to keep everything at run-time.
* inc/dec issue: Pentium Pro (and later CPUs) has issues with inc/dec only when the flag is read right after it. (Pentium 4 optimization manual recommends add/sub because of it.) This is what Agner called "partial flag stall". If you don't read the flag or you update the flag right after inc/dec, then there is no penalty. But, I do use add/sub when adding 2 more bytes makes the next loop naturally aligned at 16 byte boundary, even if I don't read the flag.
Thanks for the explanation! I vaguely remember Agner's paper. I'll look it up again...
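If I understood it right, the difference looks roughly like this (P6-style behaviour, placeholder labels):

    ; no stall: jnz reads ZF, and dec writes ZF
    dec   ecx
    jnz   inner_loop

    ; partial flag stall: inc leaves CF untouched, so the jc has to wait for
    ; the old CF to be merged with the flags that inc just wrote
    inc   esi
    jc    carry_was_set

    ; recommended when the older CF isn't actually needed: add writes all flags
    add   esi, 1
    jc    carry_was_set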
Personally, I schedule my code to utilize all decoders (P6 specific 4-1-1 rule). Although there is no need for this on Pentium 4 (or, should I say, Pentium 4 is incapable of doing it?) I don't see any harm (unless a dependency chain can be broken by breaking the 4-1-1 rule). I confess I don't have a firm understanding of execution parallelism, and I just hope the CPU does it for me. :)
Ah yes, another optimization rule from Agner. Maybe he has all the answers to my questions? ;)
MMX instructions are mostly only 1 micro-instruction, while most SSE instructions are longer, so it might be a good idea to keep strict 4-1-1 scheduling. That would automatically put some time between the higher latency SSE instructions as well! Thanks for the advice!
Originally posted by Bruce
You are missing the point here. The CPU reordering is optimal. It batches up micro-ops, and dispatches them as soon as the respective execution unit becomes available.
What you are saying is "When I feed it garbage, it does worse than when I feed it code that is reasonably okay".
What I am saying is "The CPU reorders anyway, as long as the instructions are 'sorta' in the right order, it will run properly. Putting them in the ACTUAL right order doesn't matter, because the CPU will find that order anyway".
I think I'm still missing your point. Either that or you're missing mine... I don't think the CPU can find the best order itself. If it has multiple instructions that are ready for execution, it will most probably just pick the first one. I don't think it needs a formal proof that this greedy algorithm, although quite good in practice, is still sub-optimal. Most probably it's just done that way because it's easiest to implement in hardware. However, what this thread is all about is spending a little extra time in the code generation phase to present the code to the CPU in an order that makes it likely to take a more efficient route. So the question is, what is this more optimal instruction order?
Starless already mentioned a very nice optimization for Pentium 3's. It's not related to the out-of-order execution but nonetheless it's a scheduling optimization the CPU can't make.
It was just an example of where the CPU knows better than you.
I understand there are several things the CPU knows better, but I just take these things for granted. What I'm trying to do is optimize the things the CPU -doesn't- know, like the decoder scheduling, or reducing dependencies beyond the 40 micro-instruction boundary. See my point? There are several things we can do better than the CPU, because we have more information about it, and we can spend more time on it.
Don't you have hardware for that these days? :)
If you have the money and can stand the noise it generates, yes. But for example my modern Centrino laptop has an integrated graphics card which doesn't even run Unreal Tournament properly (but it runs fine at 1024x768 in software mode). So that's the market I'm targeting. And many professionals, Tim Sweeney among them, agree with me that eventually graphics processing might become the task of the CPU again, or at least that the GPU will become as flexible and as programmable as the CPU. And I would like to be in the front row when that happens.
Anyway, the window is 'streaming'. As long as you make your code faster to decode than to execute, it will virtually fit in the reordering window. Just make sure there are no long dependency chains, as I already said. That will severely trouble the ooo-window in moving forward.
Unfortunately these long dependency chains do exist in my code. So probably it's best to first schedule for dependencies and then do a 'second pass' for decoder scheduling?
But how do I reduce dependencies? Obviously, I have to schedule dependent instructions later, but not that late that all dependent instructions get moved to the end of the code block so they form a new, possibly worse dependency chain. I guess I'll at least need latency information for every instruction first.
You'd be surprised. And you don't write to memory, you write to L1 cache, which has a 1 or 2 clk latency on modern CPUs. It's about as good as using registers, without the dependency problems.
Well it's still a lot of 'redundant' move operations. And I don't think writing all MMX and SSE registers to L1 cache is that fast.
Perhaps you should read the manual. emms is only required for MMX. It resets the FPU stack state. Since SSE has its own registers, emms has no effect, and is not required.
Besides, if you use SSE anyway, you never actually need the x87, so you never actually need to reset the FPU state. So your MMX code could also do without emms in that case.
No it wouldn't, maybe you should read my code. :tongue: I have an SSE -emulator-, meaning that if I try to generate an SSE instruction at run-time on a Pentium II, it will get replaced by the equivalent FPU code, inserting emms instructions when required. And I've really put a lot of work in reducing the emms instructions to a minimum but there's hardly a way around it unless I would write a Pentium II version of every processing pipeline which is really mad. It's not meant to be really efficient, but more like a "look what you're missing without SSE" mode. :grin:
The best you can do, is probably to detect the different dependency chains in the code, interleave them, and perhaps cut them up in multiple parts, separated by store/load to local variable, when they get too long.
Again, if it were so simple to generate optimum code, we wouldn't be needing asm, because compilers would give us the optimal result anyway. x86 doesn't work.
You mean generating a dependency 'graph' and interleaving the branches that have a common starting and ending node? Sounds nice, but it doesn't take instruction latency into account, although that can be arranged... Thanks, just had some great ideas!
I do believe that it's much different from a regular compiler. See, I'm working with 100 instructions on average, and I'm willing to spend several milliseconds optimizing it to the absolute maximum. It doesn't really matter if the algorithm scales badly. A real compiler can't afford to do that, as it has to be able to compile hundreds of files, each with several hundred lines of code, in just a few minutes. So, even in a 'release' build, people have to be content with the results from heuristics that give relatively good results in the least time. For most applications it also doesn't matter if it's 20% slower than the optimum, because most people just blame the hardware. In my situation I'm stressing the CPU a lot, and to be an acceptable alternative to hardware rendering every single percent counts. Besides, if compilers already do a bit of scheduling, I want to do at least as well and not just leave things the way they are.
Last but not least, I'm an energetic student and I see this as an exciting challenge. And saying "it can't be done" has never prevented me from doing the impossible. Actually all my projects are achievements that people said wouldn't work. So please show me some optimism and show me the best that can be done...
I don't think the CPU can find the best order itself. If it has multiple instructions that are ready for execution, it will most probably just pick the first one. I don't think it needs a formal proof that this greedy algorithm, although quite good in practice, is still sub-optimal.
Perhaps it does? :)
I don't think the order would matter, it has to execute the instructions anyway... And only the dependent instructions matter (either dependent on results, or dependent on the same execution resources).
I think the only difference is that you can shift some dependent chains back and forth, but the entire piece of code will take just as long... But in general, you will already have put the code that should retire first, before the other code, so I don't think this would ever be an issue anyway.
What I'm trying to do is optimize the things the CPU -doesn't- know, like the decoder scheduling, or reducing dependencies beyond the 40 micro-instruction boundary. See my point? There are several things we can do better than the CPU, because we have more information about it, and we can spend more time on it.
Yes, but I already gave some advice on that, didn't I?
And many professionals, Tim Sweeney among them, agree with me that eventually graphics processing might become the task of the CPU again, or at least that the GPU will become as flexible and as programmable as the CPU. And I would like to be in the front row when that happens.
Well, read about DirectX Next (see http://www.beyond3d.com)... Doesn't look like the CPU will ever take over graphics again :)
And the GPU is already very programmable... besides, I think you'll have trouble beating even the slowest of accelerators with a CPU :)
You'll either have to sacrifice quality (resolution, filtering, shading etc) or speed.
And I don't understand your "front row" statement.
Unfortunately these long dependency chains do exist in my code. So probably it's best to first schedule for dependencies and then do a 'second pass' for decoder scheduling?
As I said, cut the chains up by using local variables. And yes, you must first attack the dependencies. I'd say the decoder scheduling is 'optional' anyway. On P4 it doesn't matter altogether, and on other CPUs the decoder can most probably stay ahead of the execution backend anyway, so you will probably gain a lot less here than by just reducing dependencies and making the ooo work at maximum efficiency.
Well it's still a lot of 'redundant' move operations. And I don't think writing all MMX and SSE registers to L1 cache is that fast.
Try it, you'll be surprised... Besides, you can use the variable as a memory operand on the next instruction, so you don't need a mov there. This is actually preferred in most cases anyway... Instructions with load+execute are better than separate mov and execute pairs. As long as the load is far away enough from the store, the variable can be prefetched from L1 cache, and the instruction has the same throughput as if it used a register.
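I.e. something like this (the stack offset is made up):

    ; separate load + use: two instructions, and ecx is tied up as scratch
    mov   ecx, dword ptr [esp+20]
    add   eax, ecx

    ; folded form: still a load uop plus an add uop internally, but only one
    ; instruction to decode and no scratch register
    add   eax, dword ptr [esp+20]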
I have an SSE -emulator-, meaning that if I try to generate an SSE instruction at run-time on a Pentium II, it will get replaced by the equivalent FPU code, inserting emms instructions when required.
Oh I see... I thought you meant you emulated x87 with SSE.
Excuse me for saying this... but this doesn't seem like a very smart thing to do. SSE-code is written differently from x87 code... Just like hardware-accelerated 3d is different from software. There are different things that can be done quickly in both cases, so you should focus on the strong points of either method.
It would perhaps make more sense to build a virtual language above SSE/x87... Something where you indicate parallel operations, but don't tie it directly to specific instructions... And then you just use 'optimal' rules to generate either SSE or x87 code...
Or well, the way I write 3d stuff, I only use floats in the T&L pipe anyway, not in the trifiller... so virtually all float-related code is matrix/vector, and you can just create macros for those... at a high level.
You mean generating a dependency 'graph' and interleaving the branches that have a common starting and ending node? Sounds nice, but it doesn't take instruction latency into account, although that can be arranged...
Yes, something like that... basically you create linked lists of dependent instructions... then these linked lists themselves can freely be moved forward or back in the code, and can be interleaved with other lists, to make maximum use of the execution resources. This is how I write my asm code manually anyway... I'll also pay some attention to the execution units being used then... You could probably make some simple rules for that as well.
Note also that latency is usually not important, since the common ALU, SSE, MMX instructions all have the same latency.
Oh, and something I just thought of... Something that may be overlooked often...
Sometimes it's as fast, or faster, to re-calc a value rather than to store it...
That is, if you need a+b multiple times, and you already have a in a register, you can just do the add again with b as a memory operand. You have the value in 1 clk, and you didn't need to pin it down in a register, or save it to a local var, to be used again later. This is sometimes a great help in reducing dependencies, or keeping the register count low.
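So instead of keeping the sum alive in a register across a long stretch, you get something like this (b_local is a hypothetical local that sits in L1 cache):

    ; first use of a+b (a is in eax)
    mov   ecx, eax
    add   ecx, [b_local]        ; ecx = a+b, used and then forgotten

    ; ... much later, instead of having kept ecx pinned all along:
    mov   edx, eax
    add   edx, [b_local]        ; recompute a+b on the spot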
Last but not least, I'm an energetic student and I see this as an exciting challenge. And saying "it can't be done" has never prevented me from doing the impossible. Actually all my projects are achievements that people said wouldn't work. So please show me some optimism and show me the best that can be done...
Well, no offence, but you are trying to tackle the very problem that the CPU manufacturers have been dealing with ever since the first compilers... Their solution was not to make better compilers, but to make better CPUs, which would allow better compilers.
The problem is, x86 predates this... And even the MMX/SSE/SSE2 instruction sets aren't very modern... For example, they still use a 2-operand model, while 3 operands would be many times more powerful.
(OT. If anyone wants to reply to this, it would be better to create a new thread.)
Originally posted by Bruce-li
I don't agree completely here... As far as I know, SSE also follows the IEEE standards, and it has rounding control in the control word. So the results when dealing with floats in SSE should be the same as the results with the fpu. Same goes for SSE2 and doubles.
Sure, SSE can't do 80 bit precision, then again, apart from the x86, no CPU can, as far as I know, and there are plenty of non-x86 systems being used for numerical analysis and the like, so apparently it's quite possible to make do with 'just' 64 bits.
I guess that your computation deals mostly with a limited range of numbers. Any numerical analysis textbook teaches how to avoid overflow in the case of complex absolute value. The FPU does not need that trick, but SSE does. Try coding it in SSE and see what I mean. (And, for the record, Sparc needs that trick because it supports only DP on its FPU.)
Or, someone can tell me an SSE libm implementation which passes the IEEE test and faster than FPU libm. (Or, not much slower so that it could be faster when vectorized.) Then, I'll take back what I said and I myself will use SSE for numerical analysis. Intel MKL is fast with vectors, but error bound is not so tight compared to FPU. (My experience of MKL is limited to old free versions 5.x. If someone has newer MKL and sees something different from what I said, please tell me.)
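To put rough numbers on it (a made-up double-precision case, using the naive formula sqrt(x*x + y*y)):

    x = y = 1e200
    x*x = 1e400                       -> overflows the double range (max ~1.8e308), so naive SSE2 returns +inf
    true result = sqrt(2)*1e200       -> about 1.414e200, perfectly representable
    x87 with 80-bit temporaries       -> intermediate range goes up to ~1.2e4932, so no overflow
    rescaled SSE2: m = max(|x|,|y|),  m*sqrt((x/m)^2 + (y/m)^2)   -> safe, but extra work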
Well, I haven't done any actual testing, but since I've seen SSE/SSE2 code being used even in the MS Visual Studio libs, and since afaik SSE/SSE2 follow IEEE specs for 32/64 bit floats, I thought they would be just as accurate as the FPU within the same precision.
Comparing the 80 bit mode of the FPU is not fair of course.
Perhaps it does? :)
You've got me there... I really thought it would be trivial to find an example that proves out-of-order scheduling makes a difference but I couldn't find any. :sweat: So, now it's up to you to prove that greedy out-of-order scheduling is optimal. :grin:
Anyway, thanks for opening my eyes that I'm probably totally on the wrong track. The results I'm getting from my experiments (both positive and negative) are probably due to other scheduling limitations and penalties of the CPU, like the 40 micro-instruction limit and 4-1-1 decoding. Probably the Pentium 4 doesn't even suffer from the latter...
Well, this is quite sad. I had hoped for a 15% performance increase or so 'for free' after I implemented the optimal algorithm. But now I realize I'm probably close to the optimum already. That leaves some mixed feelings though. :rolleyes: