A vague point, but nevertheless: is it worth anything to write code optimized for a specific CPU make/model and choose between versions based on the available hardware? Since there is no way to switch contexts that is non-interruptive from the CPU's point of view, modern OSes seem to ruin the entire concept of instruction-level optimization. Reducing the working set has much more impact on performance (unless your program is heavy on calculations).
Your opinions?
Well, there's quite a few sides to this story... Most of which have little to do with the OS, some not even with the CPU itself.
Firstly, 'unless your program is heavy on calculations'... well isn't that pretty much a given before you start any kind of optimization at all? I mean, why bother optimizing something like Notepad? You could try to squeeze every last cycle out of it, but there's little point as all operations are so trivial that the user is the limiting factor in its performance, not the CPU.
So yes, unless something actually takes enough time that the user will notice, it's not worth optimizing.
Secondly, I think reducing the working set is always a good optimization, regardless of whether you are using a multitasking OS or not. Even with just a single task, you will get better performance from your cache and memory, because you minimize memory bandwidth and maximize coherency, and hence efficiency.
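Just as a minimal sketch of what 'reducing the working set' can buy you, even in a single-threaded program; the particle struct and its fields are made up purely for illustration. Both loops compute the same count, but the second one streams one byte per element instead of dragging the whole record through the cache:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical record: only 'alive' is needed in the hot loop, but scanning
 * the full struct pulls the whole record (dozens of bytes) per element
 * through the cache. */
struct particle {
    double  pos[3], vel[3];
    int32_t id;
    uint8_t alive;
};

/* Working set = n * sizeof(struct particle). */
size_t count_alive_structs(const struct particle *p, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += p[i].alive;
    return count;
}

/* Same result from a separate flag array: working set = n bytes, so far more
 * elements fit per cache line and the whole array may stay in cache. */
size_t count_alive_flags(const uint8_t *alive, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += alive[i];
    return count;
}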
Thirdly, I would argue that context switches are not a big deal. As long as the user doesn't run multiple heavy processing programs together, the extra overhead is not that large. I remember that back when I first moved from 32-bit DOS to Win9x on 486 and Pentium machines, even then the actual impact of the multitasking OS wasn't that big a deal. The total effect on performance was usually in the < 5% range. Barely noticeable. And that was on those old CPUs with < 100 MHz clock speeds and just a single core, where 256 KB of cache was considered a lot.
These days with multiple cores and multiple MBs of cache on the CPU, the overhead of some background tasks in the OS is really no big deal at all... They generally only affect a single core at a time, and only a small part of the cache.
And even if the user does decide to run multiple programs at a time, I think optimizations still matter, because it still gets the job done as fast as possible, under the circumstances.
Lastly, I think that literally 'CPU-specific' or 'architecture-specific' optimizations aren't that important anymore. Back in the day, with 486 vs Pentium for example, the Pentium was a completely different CPU, and if you didn't pay special attention to the Pentium's architecture, you'd only get about 50-60% of its maximum performance. Likewise, there was quite a big difference between Pentium 4 and Athlon, so paying special attention to each architecture could really pay off. These days, Core2, Core i5/i7 and Athlon/Phenom are all pretty similar CPUs... Code that is optimal for one of them will also run very well on the others.
However, taking advantage of new instruction set extensions can really pay off in some cases. A good example was the new SSE4 in the Core2 Penryn series, which could drastically improve video encoding performance, for example. So it's a very good optimization to write some SSE4 code in your encoder, run it when SSE4 is detected, and fall back to a standard SSE2/SSE3 code path on everything else.
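As a hedged illustration of that kind of dispatch (the encoder function names here are hypothetical, not from any real codec): on GCC/Clang you can query the SSE4.1 bit with CPUID leaf 1 (ECX bit 19) and pick the code path once at startup, something like this:

#include <cpuid.h>   /* GCC/Clang; on MSVC use __cpuid from <intrin.h> instead */
#include <stdio.h>

/* Hypothetical encoder kernels, stand-ins for the real SSE2 and SSE4 code paths. */
void encode_block_sse2(void)  { puts("SSE2/SSE3 path"); }
void encode_block_sse41(void) { puts("SSE4.1 path"); }

/* CPUID leaf 1: ECX bit 19 = SSE4.1, ECX bit 20 = SSE4.2. */
static int has_sse41(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 19) & 1;
}

int main(void)
{
    /* Pick the code path once at startup and call it through a pointer. */
    void (*encode_block)(void) = has_sse41() ? encode_block_sse41
                                             : encode_block_sse2;
    encode_block();
    return 0;
}

The same pattern extends to SSE4.2, AVX and so on, as long as you keep a plain SSE2 fallback for everything else.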
Beyond making choices such as SSE vs. no SSE, instruction-level optimization in assembly language outside of critical code/loops is essentially a waste of time. If you are working with ultra-low-power (i.e. embedded real-time) devices, however, then the effort might be worth it.
I would recommend looking into things like LLVM and seeing how bytecode/JIT compilation can benefit you.
In general, you can still do some pretty decent optimizations at the algorithm/implementation level in assembly language. Despite all of the fancy pipeline/cache optimizations and things such as out-of-order execution, execution time still correlates directly with how much code you throw at the processor relative to its "speed". Just remember that all the optimization in the world cannot help slow I/O-bound operations, so make the appropriate trade-off between optimizing and spending your time on more important things.
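As a rough sketch of how to check that trade-off before reaching for assembly (do_work is just a placeholder for whatever you are measuring): compare CPU time against wall-clock time, and if the CPU share is small, you are mostly I/O bound and instruction-level work won't help much:

#include <stdio.h>
#include <time.h>

/* Stand-in for the work being measured; replace with the real operation. */
static void do_work(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += i * 0.5;
}

int main(void)
{
    clock_t cpu_start  = clock();
    time_t  wall_start = time(NULL);

    do_work();

    double cpu_secs  = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
    double wall_secs = difftime(time(NULL), wall_start);

    /* If CPU time is only a small fraction of wall time, the code is mostly
     * waiting on I/O, and instruction-level optimization won't help much. */
    printf("CPU: %.2fs, wall: %.2fs, CPU share: %.0f%%\n",
           cpu_secs, wall_secs,
           wall_secs > 0 ? 100.0 * cpu_secs / wall_secs : 100.0);
    return 0;
}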
As for context switches, they by their very nature have to be "interruptive". There are some techniques to mitigate the overhead, such as using the SYSCALL instruction for Ring 3 -> Ring 0 switches instead of an INT. To be honest, however, most of your overhead is going to be from paging and the resultant cache flushing, among other non-code related things.
My rules of thumb for optimization are (in order in which they should be executed):
1. Optimize only those functions which take a noticeable amount of time, i.e. if you can actually SEE that your program is 'working' (high CPU usage for at least a few seconds).
2. Optimize the abstract algorithm first, not the code. Most of the time there is a better way to achieve the goal. Try googling first; don't waste your time reinventing the wheel.
3. Optimize the data first, not the code. Most of the time the CPU is waiting for external devices to supply data. Proper precaching/structuring of the data can give big speed boosts.
4. Optimize 'by hand' in assembly using MMX/SSE. Usually this means writing a DLL containing only the few most time-consuming functions.
5. Create different code paths and optimize them for a specific CPU. Same as point 4, only this time you have one DLL for each supported CPU (a minimal sketch of this kind of dispatch follows below).
So, yeah - optimizations for a specific CPU can give you some speed boost, but ONLY after you have completed steps 1-4. By then, though, your program should already be fast enough.
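For steps 4-5, here is a minimal sketch of what the per-CPU DLL dispatch could look like on Windows; the DLL names and the exported 'process' function are made up for illustration, and the CPU detection that picks the name is left out:

#include <windows.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical signature shared by every per-CPU build of the hot function. */
typedef void (*process_fn)(float *data, size_t count);

int main(void)
{
    /* Pick a DLL name based on whatever detection you use (CPUID etc.);
     * "process_generic.dll" is the fallback. The names are made up here. */
    const char *dll_name = "process_sse2.dll";

    HMODULE mod = LoadLibraryA(dll_name);
    if (!mod)
        mod = LoadLibraryA("process_generic.dll");
    if (!mod) {
        fprintf(stderr, "no suitable DLL found\n");
        return 1;
    }

    process_fn process = (process_fn)GetProcAddress(mod, "process");
    if (!process) {
        fprintf(stderr, "DLL does not export 'process'\n");
        return 1;
    }

    float data[4] = {1, 2, 3, 4};
    process(data, 4);

    FreeLibrary(mod);
    return 0;
}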
On a related note, I have developed an open source library for exactly this purpose:
http://www.asmcommunity.net/board/index.php?topic=29464.0
I think the most useful feature is that it can tell you what the cache sizes are (for assembly programmers, detecting something basic like SSE shouldn't be that difficult anyway).
You can tweak your code at runtime to make the most of the CPU architecture it's running on.
So yes, I do think there's a point in optimizing for specific CPUs, up to a certain point.
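I don't know the library's actual API beyond what's described above, but as an illustration of the kind of information involved: the L2 cache size can be read directly from CPUID leaf 0x80000006 (ECX bits 31:16 give the size in KB on AMD and recent Intel CPUs), and the result can then drive a runtime tuning decision such as picking a block size:

#include <cpuid.h>   /* GCC/Clang */
#include <stdio.h>

/* CPUID leaf 0x80000006: ECX[31:16] = L2 cache size in KB,
 * ECX[7:0] = cache line size in bytes. */
static unsigned int l2_cache_kb(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx >> 16;
}

int main(void)
{
    unsigned int l2_kb = l2_cache_kb();
    printf("L2 cache: %u KB\n", l2_kb);

    /* Example of runtime tuning: size a processing block so the working set
     * stays inside L2 (the divisor is an arbitrary safety margin). */
    unsigned int block_bytes = l2_kb ? (l2_kb * 1024) / 2 : 64 * 1024;
    printf("using %u-byte blocks\n", block_bytes);
    return 0;
}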