Hi all,

I wonder whether there are any (free) x86 optimizers. What I mean is a library that can take a function (pointer) as an argument and return an optimized version of it.

I found this: x86 Optimizer, but they don't provide a download, nor do they mention any prices, so I doubt it's going to be cheap.

Any suggestions? Thanks!
Posted on 2007-06-06 17:10:59 by C0D1F1ED
Hi!

They appear to know what they're doing, but their benchmarks are kinda misleading.  The compiler did a pretty crummy job on some of them.  Then again, compilers have a really hard time vectorizing stuff, even if you code it specifically to try to make it easy for them to see how to vectorize it.  Some of the stuff, though, should've been easy for a compiler to figure out, like moving back the updating of "loop-counter"-esque registers.  Most of the cases look like they could be sped up a lot if you only need single-precision floating-point, and again by forcing 16-byte or 64-byte alignments on the arrays to get the cache performance for a 2x unroll (or 4x if you've got the 16 SSE registers on 64-bit processors).
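
Just to sketch the kind of rewrite I mean (single precision, 16-byte aligned arrays, 2x unroll with SSE) - this is a made-up kernel, not one of their benchmarks:

#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical kernel, just to show the shape: c = a + b over n floats.
// Assumes a, b and c are 16-byte aligned (e.g. __declspec(align(16)))
// and n is a multiple of 8, so the loop can use aligned loads/stores
// and be unrolled 2x.
void addArrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m128 a0 = _mm_load_ps(a + i);
        __m128 a1 = _mm_load_ps(a + i + 4);
        __m128 b0 = _mm_load_ps(b + i);
        __m128 b1 = _mm_load_ps(b + i + 4);
        _mm_store_ps(c + i,     _mm_add_ps(a0, b0));
        _mm_store_ps(c + i + 4, _mm_add_ps(a1, b1));
    }
}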

I haven't heard of any program like this one before, so I'm not much help in that regard, sorry.
Posted on 2007-06-06 21:10:58 by hackulous
Simply contact them :)
There was a topic earlier about x86 optimizers - the kind that fix up the output of obsolete compilers, which produce things like
mov eax,val1
mov val1,eax
mov eax,val1
mov val1,eax
>_<
Posted on 2007-06-06 21:20:47 by Ultrano

mov eax,val1
mov val1,eax
mov eax,val1
mov val1,eax


That's AWESOME! :lol: That's from about 20 clocks to nothing, meaning it's infinitely faster! lol
Posted on 2007-06-06 21:36:26 by hackulous
Hmm, running an optimizer on binary code? A compiler has so much more info than binary analysis ever will, so this sounds pretty silly to me; if your compiler isn't doing a good enough job, you switch compilers or hand-code assembly instead.

I did use "sallys peephole optimizer" in the past when I wrote Pascal code, but there weren't many other alternatives back then :)
Posted on 2007-06-07 04:19:48 by f0dder

Hmm, running an optimizer on binary code? A compiler has so much more info than binary analysis ever will...

The idea is adaptive optimization. When a function has been executed, say, 10000 times, it should be worth it to disassemble it, apply some aggressive optimizations to it, and recompile it. This is comparable to what JIT compilers do, except that they keep the intermediate code around at all times. For native code that's a waste, because the binary code is already a compact representation of the (actual) instructions. Disassembling should be really fast.
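
A rough sketch of that trigger (hypothetical names, with reoptimize() standing in for the disassemble/optimize/re-emit step):

typedef void (*FuncPtr)();

// Hypothetical wrapper around one generated function.
struct HotFunction
{
    FuncPtr code;         // current native code
    unsigned callCount;   // bumped on every call

    HotFunction(FuncPtr f) : code(f), callCount(0) {}

    // Stand-in for the binary reoptimizer: disassemble, apply the
    // aggressive optimizations, re-emit. Here it just returns the input.
    static FuncPtr reoptimize(FuncPtr original) { return original; }

    void call()
    {
        if (++callCount == 10000)   // the "executed 10000 times" threshold
            code = reoptimize(code);
        code();
    }
};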

It's like taking two steps forward, one step back, and two steps forward again. It should be better than taking one step forward twice (JIT) or two steps forward once (native). ;)
Posted on 2007-06-07 05:53:39 by C0D1F1ED
Imho working from an IM representation is still better, since you have more data about the code. With binary optimization, all you can really do is some peephole optimizations, even if heuristics are applied - it's much more difficult to, say, inline functions or reduce expressions.

Current VMs and JITters still aren't optimal, but I do think it's what the future has in store for us, and the right thing to do once the technology is mature enough. Deliver apps in IM form, and let them go through an "optimize for size, include profiling hooks" translate-to-native step during installation.

During use, code is profiled, and a re-translation (from the IM representation!) occurs - translate most code for size and the important hot spots for speed, and possibly remove the profiling hooks to reduce overhead.

If a new and improved JIT compiler is released, the binary cache can be cleared and the cycle repeated, for even better results.
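
Roughly this lifecycle, just as a sketch - every name below is made up, it's not any real API:

// Purely a sketch of the flow; all types and functions here are hypothetical.
struct IMCode { /* platform-independent intermediate code shipped with the app */ };
struct NativeImage { /* cached native translation */ };
struct Profile { /* hit counts gathered by the profiling hooks */ };

enum TranslateMode { SizeWithHooks, HotSpotsForSpeed };

// Stand-in for the actual JIT/translator.
NativeImage translate(const IMCode &, TranslateMode, const Profile * = 0)
{
    return NativeImage();
}

NativeImage installStep(const IMCode &im)
{
    // At install time: optimize for size, insert profiling hooks.
    return translate(im, SizeWithHooks);
}

NativeImage retranslateStep(const IMCode &im, const Profile &profile)
{
    // Later, from the IM representation again: most code for size,
    // the measured hot spots for speed, hooks dropped where they hurt.
    return translate(im, HotSpotsForSpeed, &profile);
}

void onJitUpgrade(NativeImage &cache, const IMCode &im)
{
    // New and improved JIT released: clear the binary cache and repeat.
    cache = installStep(im);
}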

Yes, afaik this is how dotNET already works, possibly some java implementations too, but they're just not good enough yet. There's massive memory bloat and, at least in the case of java, some not-super-optimal class library implementations as well.

I don't believe native code would (or should!) ever be completely abolished, though. For really intensive stuff making use of SSE/whatever, handwritten assembly is still quite useful. Yes, compilers do have some intrinsic support, but it's still not good enough, and one of the big ideas of IM code is to be largely platform independent, something you lose with instruction-set intrinsics.
Posted on 2007-06-07 06:16:28 by f0dder
personally i don't believe their software really can do what they state. They would need some extra-great analyzer to understand all the mess compilers can create.

remember that they say they operate only on optimized code!
Posted on 2007-06-07 07:34:32 by vid
I'm not really talking about JIT-compilation (just mentioned it in the context of adaptive optimization).

I'm approaching things from the other end. Compile to already fairly optimized native code. Let the programmer decide which functions are candidates for reoptimization and should get 'profiling hooks'. In particular I'm talking about adaptive optimization of dynamically generated specialized functions. 8)

Anyway, all details aside, I really am looking for a binary code reoptimizer. :) I can write one myself, but if such a (free) project already exists then obviously I'd like to try that first. And yes, the optimizations would be mainly machine-specific peephole optimizations; the rest has all been done on the intermediate code...
Posted on 2007-06-07 07:58:36 by C0D1F1ED
Wouldn't it be smarter to enhance your code generation instead? Or are you (still) generating from something too low-level to get more efficient output code? (one of the first "code generators" I saw merely stitched predefined blocks together, at least you have things like register allocation iirc :) ).
Posted on 2007-06-07 08:02:34 by f0dder
Code generation has been and still is being improved. However, that's part of the problem: because of the optimizations, compilation times are now several times higher, and since this is intended for real-time applications, that can be an issue. New functions also generally have to be generated in 'bursts': either I don't have to generate any new functions at all, or I have to generate dozens of them within a couple of milliseconds.

So the idea is to use adaptive optimization. First generate functions with only the most crucial optimizations (aiming for, say, 75% of optimal performance). Then, during a 'quiet' period, perform more advanced/expensive optimizations to get that extra bit of performance. Furthermore, the optimization effort would be focused on the functions that get used most frequently (or take the most cycles).

One approach is to keep all the intermediate code around and perform the more advanced optimizations on that. However, that's tens of megabytes of intermediate code, without even taking any other data structures for analysis into account. Furthermore, intermediate code is not machine-specific, so there's typically not much left to optimize at that level. The biggest opportunity is at the x86 level.

So the approach I'm currently researching is to perform the peephole optimizations directly on x86 binary code. I could even split it into separate passes to spread out the optimization workload even more. It really doesn't matter that taking one step back to take two steps forward is a detour; it only happens when there's some extra time available. Also, it's only done for the functions that really need it, and I'm saving megabytes of memory for more useful purposes.
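
Roughly something like this, sketch only - peepholePass() is a hypothetical stand-in for the actual x86 rewriting:

#include <cstddef>
#include <queue>
#include <vector>

// Sketch of the scheduling, not of the optimizer itself.
struct Function
{
    unsigned char *code;   // x86 binary code of the generated function
    size_t size;
    unsigned cycles;       // profiling data: how hot the function is
    int pass;              // how many extra optimization passes it has had
};

// Hypothetical: one cheap peephole pass over f.code (redundant load/store
// removal first, then scheduling, etc.), small enough for a quiet slice.
void peepholePass(Function &f, int pass)
{
    // match and rewrite patterns in f.code ...
}

struct ByHotness
{
    bool operator()(const Function *a, const Function *b) const
    {
        return a->cycles < b->cycles;   // hottest function comes out first
    }
};

typedef std::priority_queue<Function*, std::vector<Function*>, ByHotness> WorkQueue;

// Called whenever the real-time app has a few spare milliseconds.
void quietPeriod(WorkQueue &work)
{
    if (work.empty()) return;

    Function *f = work.top();
    work.pop();

    peepholePass(*f, f->pass++);   // one pass per quiet slice

    if (f->pass < 3)               // spread a few passes out over time
        work.push(f);
}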
Posted on 2007-06-07 09:40:34 by C0D1F1ED

one of the first "code generators" I saw merely stitched predefined blocks together

You mean something like this:
int stitchFoobar(void *buffer)
{
    int byteLength;

    __asm
    {
        jmp endStitch           ; skip the template at run time, we only want its bytes
    beginStitch:
        foo eax, ebx            ; placeholder instructions standing in for
        bar ebx, ecx            ; whatever the predefined block contains
    endStitch:
        mov edi, buffer         ; destination
        mov esi, beginStitch    ; source: start of the template
        mov ecx, endStitch
        sub ecx, beginStitch    ; ecx = size of the template in bytes
        mov byteLength, ecx
        rep movsb               ; copy the template into the caller's buffer
    }

    return byteLength;
}

It's a damn shame Visual C++ doesn't support inline assembly any more for x86-64. :sad:
Posted on 2007-06-07 09:45:13 by C0D1F1ED

personally i don't believe their software really can do what they state. They would need some extra-great analyzer to understand all those mess compilers can create.

Not really. Compilers always miss some opportunities for optimization. I once read a paper where a superoptimizer was used to find thousands of useful peephole optimizations. This may sound complicated, but it really isn't. And as soon as one of the patterns is matched, the code gets a bit more optimized.
remember that they say they operate only on optimized code!

All this means is that the common high-level optimizations are assumed to already have been done by the compiler. Their reoptimizer only focuses on the low-level optimization opportunities that are left.

So if you compare their optimizer applied to unoptimized code against compiler-optimized code, the latter should win; their optimizer can't perform the high-level optimizations that make the biggest difference. They can, however, take the output of the optimizing compiler and optimize it even more.
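
As a toy example of a single such pattern at the binary level (not their optimizer - just the store-then-reload pair from Ultrano's snippet above, assuming non-volatile memory and that you already know the instruction boundaries):

#include <cstddef>
#include <cstring>

// One hypothetical peephole on raw 32-bit x86 bytes:
//   A3 xx xx xx xx    mov [moffs32], eax
//   A1 xx xx xx xx    mov eax, [moffs32]   ; same address -> the reload is redundant
// Removes the reload and returns the new code length.
size_t removeRedundantReload(unsigned char *code, size_t length)
{
    for (size_t i = 0; i + 10 <= length; )
    {
        if (code[i] == 0xA3 && code[i + 5] == 0xA1 &&
            std::memcmp(code + i + 1, code + i + 6, 4) == 0)   // same moffs32
        {
            // Shift the tail left over the redundant 5-byte reload.
            std::memmove(code + i + 5, code + i + 10, length - i - 10);
            length -= 5;
        }
        else
        {
            i++;
        }
    }

    return length;
}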

Anyway, this is what they claim, but I don't see much reason to have any doubt about it.
Posted on 2007-06-07 14:35:48 by C0D1F1ED
I found this: DIABLO.

It's an open-source linker that disassembles everything and reoptimizes things. If I can figure out how to feed it one binary function and retrieve an optimized version, that would be exactly what I'm looking for. :D
Posted on 2007-06-08 02:31:36 by C0D1F1ED
i wasn't talking about the optimization part. I was talking about the analysis of binary code. It's not possible to analyze everything properly; there's always the possibility of something the analyzer won't understand or will misinterpret.

What does the optimizer do then?

And even analyzing the output of common compilers is extremely hard - just look at the blog of Ilfak Guilfanov (the author of IDA).
Posted on 2007-06-08 05:48:31 by vid

I found this: DIABLO.

It's an open-source linker that disassembles everything and reoptimizes things. If I can figure out how to feed it one binary function and retrieve an optimized version, that would be exactly what I'm looking for. :D


Let me know if you find such a program that disassembles code and replaces sub-optimal algorithms with better ones :)
Posted on 2007-06-08 05:57:09 by SpooK
vid: the DIABLO thing codified mentions works on object module input, though; you have a lot of extra information available at that time, which helps analysis a lot. I still think optimization is better performed on IM code, but hey :)
Posted on 2007-06-08 09:22:50 by f0dder
f0dder: yeah, of course i agree. But i think they would have a problem with the GPL if they built it into gcc.
Posted on 2007-06-08 16:32:52 by vid
Hm, why problems with the GPL? Unless the DIABLO license is incompatible with GPL...
Posted on 2007-06-09 12:15:16 by f0dder
i was talking about the first one :]
and i bet they don't want to make its source available
Posted on 2007-06-09 18:48:19 by vid