Is there such an application that will allow you to highlight certain lines in an Assembly program for optimizing? eg. in a 20 line code segment, identify which instructions must be after which ones. It would then create all 200 etc permutations of the lines which give the exact same result but run at different speeds. Then you can assemble all 200 asm files and test for speed.

This would effetively serve to hand optimize routines without hand optimizing them per se by checking instruction latency, ports, etc.

Thanks.
Posted on 2005-12-25 12:44:39 by V Coder
VTUNE?
Posted on 2005-12-25 16:22:13 by The Svin
Yeah - Vtune is cool. It can tell you how to write the code so the app will perform best on all intels. Alternatively you can dig Agner Fogg's tutorials on optimization and do the job yourself (more fun IMHO).
Posted on 2005-12-26 20:37:18 by ti_mo_n
Thanks.

I tried vtune briefly. Short trial period iirc - and I was studying early in the trial period and had little time to experiment with it. Aldo, I don't yet want to spend that kind of money on a non income-generating project. I'll look at it again.

I've used the Agner Fog optimization guidelines and the Intel/AMD optimization manuals. I'm actually optimizing code for multiple platforms. I'm just wondering if I can squeeze another 1 or 2 or 5 or 10% out of the routines.

Yet I was thinking more of a brute force optimizer. No software hints, no Agner Fog: Load the asm program into a this app, and it produces 200+ asm variants of the code you have entered. Same instructions in different order. You run ml etc on all 200 and you test the speed yourself.

Is there an assembly program to do that?

Posted on 2005-12-27 20:20:37 by V Coder
It looks like writing something between a logic analyzer and an assembler. The assembler part shouldn't be very difficult, but the first part, where you must analyze the logic of a proc and then generate n routines that maintain that logic, is quite a challenge. Try reading any C(++) optimiizng compiler sources (like MS visual c++) to get the idea how it works. Unfortunately I don't know any open-source C(++) optimizing compiler (I just use MS's VC++).
Posted on 2005-12-27 22:13:17 by ti_mo_n
Well actually, I was thinking just an programmable text editor.

Lets say I have a program:
A
B
C**
D**
E**
F**
G**
H

Instruction E must follow C, F must follow D, and G must follow E, and I want to optimize that 5 instruction sequence, leaving A, B, H in place.

I can have {D, F, C, E, G}; {C, D, F, E, G}; {C, E, D, F, G}; {C, E, G, D, F}, plus several other arrangements. I want the program to produce the how ever many options in separate asm files. Then we'll need a batch file to run the assembler.

The program I'm describing need not understand anything about assembly. It must just obey instructions.
Posted on 2005-12-27 22:55:40 by V Coder
This way you'll gain no more than 1%. Simply rearranging the instructions gives (almost) no gain because nowadays CPUs execute the instructions out of order.
Posted on 2005-12-28 07:12:19 by ti_mo_n

  I have found in my experience my P4 can give quite a decent speed boost by re-arranging the instructions ( much more than 1%).  The P4 by no means does a fantastic job of instruction re-ordering, otherwise I wouldn't have seen decent speedups from swapping instructions around.  I had planned on writing a program exactly like you are proposing, but I ran out of time.  I wrote something similar that finds the optimal prefetch distance for the "prefetch" instruction.  It would try all combinations of the location of the instruction in a loop, plus all different offsets into the buffer you are trying to prefetch.  I wrote a program that would automatically compile all those different combinations, and run them, saving the output of the speed of the routine in cycles into one big huge text file.  I sorted the text file to find the fastest one.
Posted on 2005-12-29 10:27:55 by mark_larson
I was hoping to get that 1% out of an Athlon routine, plus that x% from the Pentium III and Pentium 4 routines. I've recently optimzed stuff for a Pentium III, Athlon and P4. My program chooses the appropriate processor at run-time.

The Athlon appears to reorder aggressively, since running the base algorithm took 144 seconds but manual optimization eventually got it down to 140 seconds. I was wondering if I could get that down further. I tried loop unrolling, but I would need to optimize the unrolled loop manually. The Pentium III and Pentium 4 do not appear to reorder as well, and manual optimization got good results. I have no doubt that my manual optimization is not yet optimal, and thus I was looking for an easier way to do this.

Mark, do you expect to write such a program? And how soon? Thanks.

I would be interested in that prefetch optimization routine as well. Thanks.
Posted on 2006-01-07 18:31:18 by V Coder
New AMD chips are better than new Intel chips, that's just how it is for now.

r22's QUICK REFERENCE OPTIMIZATION POST

Rroutines,
-If the routine can be made parallel use SIMD instructions
-Use CMOV's and other branch avoiding techniques especially inside loops
-If you're moving a lot of data (ie > 4096 bytes) prefetchnta
-Post your routine for other people with experience to check out, if a decent optimization is there usually someone will tell you about it.

Memory
-Take/Give as much data as possible on reads and writes
-Unaligned data movs use 32byte chunks
-Aligned data movs use SIMD chunks as big as you can ~64bytes
-Aligned and Unaligned, align the READ and keep the WRITE unaligned
-Prefetching big READs/WRITEs, use a loop with 2 prefetch instructions inside,
prefetch ~4096bytes with the loop then use another loop for your READs

mov ecx,32
Prefnontemp:
prefetchnta byte
prefetchnta byte
add edi,128
dec ecx
jnz Prefnontemp
sub edi,128*32

Math
-Use LEA instead of Mul and Add combos
-Integer: Div or Mul by power of 2 use a shift
-Integer: Div x/y if y is known use magic number multiply
-Integer: Div x/y if y is NOT known USE FPU for 32 or 64bit ints

div latency 39/71 for 32bit and 64bit
idiv latency 42/74 for 32bit and 64bit
fild 2x latency 12 (6+6)
fdiv latency 20/24 for 32bit and 64bit reals
fistp latency 4
FPU int divide 36/40 for 32bit and 64bit integers
-SIMD: Don't use DIV, use RCP and MUL combo
-SIMD: Use LUT approximations or taylor expansion approximations for trig functions

Unrolling
-Unroll after you've optimized yoru code
-Unroll 2x first and reoptimize the structure of the code
-Try to seperate consecutive WRITEs to the same register or memory location

BAD
inc ecx
mov ebx,eax
add ebx,10
BETTER
mov ebx,eax
inc ecx
add ebx,10
-Try to keep 32bit and 16bit instructions apart from eachother
-Unrolling too much can sometimes make the code run slower, so benchmark

That's a good start
Posted on 2006-01-07 23:10:26 by r22
New AMD chips are better than new Intel chips, that's just how it is for now.

Not necessarily. Depends on algorithm and its implementation. In regard to GPR integer and MMX speed, my own tests and optimization of compute-bound code has found that at the same clock speed the Pentium 4 is outclassed by the AMD Athlon 64, AMD Athlon, Pentium M, Pentium III, AMD k6-2 and Pentium MMX. However, the Pentium 4 has an advantage, or is that a saving grace, it does NOT run at the same clock speed as these processors - it runs faster. The Pentium 4 also processes 64 bit numbers with paddq whereas the Athllon, Pentium III (and lower processors) would need to use paddd and propagate carries from dword to dword. This reduces the length of the code from 63 instructions (Athlon) or 56 instructions (Pentiium III) to 46 instructions. The Athlon code does not use paddd to and MMX to propagate the carry. It uses integer operations and therefore has more instructions than the Pentium III code.

The Pentium III/M execute MMX instructions with 1 cycle latency, but cannot often execute more than one at a time. The Pentium 4 executes one MMX at a time, with a latency of 2 cycles. The Athlon can execute up to two MMX instructions at a time, with a latency of 2 cycles. Athlon XP/64 and Pentium III/M each have 3 integer execution units. However the Athlons more likely execute 3 integer instructions per cycle than the Pentium III/M, and the Athlons also apparently reorder instructions agressively. Furthermore the Athlons execute 32bit adc in 1 cycle whereas the Pentium III/M take 2 cycles. This is why the Athlon is stronger with my Athlon (integer laden) code than my Pentium III (MMX heavy) code, and the Athlon 64 stronger with the longer Athlon code than the shorter Pentium 4 (MMX & SSE2) code.

Interestingly, the Athlon XP runs the Pentium III (MMX heavy) code slower than the Pentium III, and about 20% slower than it does the Athlon code, and the Pentium III runs the Athlon (integer) code slower than the Athlon, and about 20% slower than it does the Pentium III code. All in all, the Athlon executes its own 63 instructions in 10% less time than the Pentium III executes its 56 instructions.

The Pentium M is almost identical to the Pentium III, except that it can also run Pentium 4 code (SSE2 instructions). My Pentium 4 code uses 46 instructions including paddq. At the same clock speed, the Pentium M executes the 46 instructions in 83% of the time the Pentium III takes to execute its 56 instructions (almost linear ratio - it does not execute more instructions simultaneously). This is 94% of the time the Athlon takes to execute 63 instructions, ie, the Athlon strictly also executes instructions faster than the Pentium M. However, since the Pentium M uses the shorter instruction sequence whereas the Athlon uses a longer instruction sequence, the Pentium M is slightly faster than the Athlon at the same clock speed.

The Athlon 64 inherits the Athlon's weakness in MMX code and iirc runs the shorter 'SSE2' code slower than it runs the Athlon code. Indeed, if the Athlon 64 were to halve its MMX instruction latency from 2 cycles to 1, it would probably be the processor that cannot be beat at anything.

Time to complete a benchmark:
Pentium 4 HT 2400MHz - 261 seconds
Pentium III 1066MHz - 312 seconds. A 2000MHz Pentium 4 competes with this.
Athlon XP 1666MHz - 179 seconds. You would need a 3500MHz Pentium 4 Desktop to compete with this desktop.
Pentium M 1600MHz - 169 seconds. You would need a 3700MHz Pentium 4 Desktop to compete with this notebook.

The Pentium 4 beat my Pentium III notebook but nothing else. It would beat a 1GHz but not a 1.1GHz Pentium M notebook.

Dealing with ultimate processors:
2200MHz Pentium M matches 5100MHz Pentium 4.
2200MHz Athlon XP matches 4600MHz Pentium 4
2800MHz Athlon 64 (running Athlon XP code not Pentium 4 SSE2 code) matches 5900MHz Pentium 4.

That is, all things being equal.

IIRC Prescott eliminated some of the advantages of the double pumped ALU of the Pentium 4, negatively affecting some integer but not so much MMX code. The latency, not the throughput was increased from 0.5 cycles to 1 cycle, that is the result of those instructions would be available 0.5 cycle later. This would not matter so much in an optimized MMX routine which uses sub or add at the end of the loop to advance the pointers. This may negatively affect code which uses tightly arranged integer instructions. Again, the Athlon handles tightly arranged integer instructions very well.

What about Core Duo/Solo?
The SSE/SSE2 units of the Pentium M are apparently optimized. I'm not sure if it now would run more MMX instructions in parallel, in which case it would strengthen its position against the Athlon XP/64. I'm eagerly awaiting a look at the new instruction latencies.

As it stands I plan to get an Athlon X2 probably next quarter. I'm guess I should wait for the M2 since the 939 is at the end of its line. Or I might get a 939 X2 when the prices come down as 939 is phased out.
Posted on 2006-01-10 19:36:29 by V Coder
One question:

Pentium 4 HT 2400MHz - 261 seconds


Have you split the job for 2 threads of execution? Or is this only 1 unit's time?
Posted on 2006-01-11 00:02:32 by ti_mo_n
With HT and multi-core/smp it's also worth knowing whether a single-thread operation is affinity-limited to one processor, or allowed to jump around... jumping around has some negative consequences (check the "dualcore musings" at http://f0dder.reteam.org/ )
Posted on 2006-01-11 05:49:19 by f0dder
Yeah. In intel's manuals it's stated clearly that it's much wiser to bind a thread to a specific execution unit.

So to finish my previous post: you should split the job between all available execution units and then present the slowest time. for example: if a CPU has 2 units, then you
a) split the job in half,
b) ,
c) start the jobs.

If the jobs execute in 210ms and 230ms respectively, then you can say that the whole work took 230ms, NOT 440ms.

V Coder, please clarify how you did the test on HT CPUs.
Posted on 2006-01-11 12:07:00 by ti_mo_n
All machines ran a single-threaded program performing the same calculations, etc. No threading magic done.
Posted on 2006-01-11 22:45:52 by V Coder
So the HT CPUs' overall performance is about twice the mentioned :)
Posted on 2006-01-11 23:09:05 by ti_mo_n

So the HT CPUs' overall performance is about twice the mentioned :)
Nope. AFAIK HT never doubles performance. Gains are usually about 10-20%

In this case HT gave 40% performance increase when two copies of that program ran simultaneously on the HT, they both operated iirc at about (a bit less than) 70% of the speed of a single copy running. Just goes to show how much of the potential bandwidth of the Pentium 4 is normally wasted.
Posted on 2006-01-13 15:08:04 by V Coder