Back in the days of the Pentium 1, it was easy to pair instructions and know for sure that you had optimal code. But now that all processors have out-of-order execution, how can you do the scheduling?

I want to know this because I'm writing an automatic scheduler for my code generator. I've been able to set up a dependency graph on which I can do the scheduling operations, but I don't know which operations result in faster code. Measuring the speed isn't practical because it takes lots of time and isn't accurate.

I'm just looking for some heuristics or rules that generally work well for an out-of-order architecture. Thanks!
Posted on 2003-05-19 04:27:15 by C0D1F1ED
Anyone know how to optimize?
Posted on 2003-05-20 02:18:11 by C0D1F1ED
Bump! Nobody can help me?
Posted on 2003-05-20 07:43:04 by C0D1F1ED
Study the pipeline and target a processor.

Reduce latency (dependancy, memory, instruction).

It is not a simple thing, but I'd like to talk about it. :)
Posted on 2003-05-20 07:56:32 by bitRAKE
I'd like it to be nearly processor independent (Athlon XP, Pentium III, Pentium 4) by providing some parameters about the processor and maybe latency/troughput per instruction. The code is run-time generated so it can be compiled specifically for the processor it is running on.

Problem is, how do I start scheduling? These out-of-order processors don't keep things, euh... in order :grin: So I don't know at what moment a certain instruction can start and when it terminates and when the execution unit is available again and how many instructions fit in the pipeline, etc. That's all details that very much depend on the situation and probably many things are kept secret by the processor manufacturer (like micro-instructions). Now, suppose I knew all these little details and scheduled the code perfectly. Then it's still possible that a cache miss or a task switch or whatever influences the order of instructions I had wished.

So it quickly became clear to me that -the- ideal scheduling like with the Pentium 1 is impossible. So I wondered what heuristics give me a good scheduling quickly without knowing much about the pipeline structure or the micro-instructions. Also a problem is the register scheduling. Should I do it before, after, or even while scheduling?

Your help would be much appreciated.
Posted on 2003-05-20 09:13:47 by C0D1F1ED
First the code must be broke into pieces - you cannot schedule the whole program at once. Where will we create the breaks? Jcc is a good choice, another would be memory accesses - both.

Now we have little bits of code to schedule. We must track the dependancies between these little bits (register,flags,constant memory). I'd start with the instructions with high latency - trying to hide non-dependant instruction in their wake.

Also, many algorithms with SMD are memory bound -- knowing the latency for each of the caches and main memory will be a big factor in speeding up code. From this you will know how many instructions to cram between memory accesses, and how to patition data to be worked in groups - while it is in the cache.
Posted on 2003-05-20 16:54:30 by bitRAKE
bitRAKE,
Continue please...
It sounds very interesting...he he...
Regards,
Lingo
Posted on 2003-05-20 17:10:35 by lingo12
I already have the code broken up and a method to detect dependencies (assuming no memory aliasing). Selecting instructions with high latency first seems like a good rule of thumb!

I'm not sure if memory throughput is going to be a bottneck though, especially with the newest Pentium 4. Anyway, scheduling memory accesses early is probably a good idea...

Thanks!
Posted on 2003-05-20 18:08:51 by C0D1F1ED

I'm not sure if memory throughput is going to be a bottneck though, especially with the newest Pentium 4. Anyway, scheduling memory accesses early is probably a good idea...
Surely, you jest! 92 cycles between memory accesses is not a good thing if you ask me. Run a test (or use this): dump the caches and read as fast as possible from contigous memory larger than cache size - tell me the latency between instructions. :) This will be your primary concern in the long run.

The P4 can execute a load and store each cycle, so they should be right next to each other (same cycle). :) Same with the Athlon, but an added dynamic is that the Athlon can also do two loads per cycle. I've gained a couple cycles a loop doing this - from some actual code:


mov eax, [esi - 1*4]
mov ebx, [esi - 2*4]
mov ecx, [esi - 3*4]
mov edx, [esi - 4*4]

_0:

...

mov eax, [esi - 1*4]
mov ebx, [esi - 2*4]

movntq [ebp - 2*8], mm0

mov ecx, [esi - 3*4]
mov edx, [esi - 4*4]

movntq [ebp - 1*8], mm1

jne _0
...the moves at the end of the loop are 'free' when paired with the stores. :) I read extra data on the last itteration - important to note just to ensure data is there to read.

lingo12, please join there is more to say...

There are some nice articles here:
http://chip-architect.com/mid_home.html
Posted on 2003-05-20 20:15:37 by bitRAKE

Study the pipeline and target a processor.

Reduce latency (dependancy, memory, instruction).

It is not a simple thing, but I'd like to talk about it. :)


One suggestion I would have is, to create several versions of the same code, optimized to target *different* processors, then at install-time/setup-time see which version works fastest for the computer.

Prolly lotsa work though.
Posted on 2003-05-21 02:45:23 by AmkG
Thanks a lot bitRAKE!

I'll focus some more at optimizing the memory accesses. I don't need much throughput to RAM, but I could definitely use some cache optimizations. Thanks for those links too.
Posted on 2003-05-21 07:59:42 by C0D1F1ED
C0D1F1ED, there are free books provided by Intel, which clearly descirbe process of optimization for Intel Pentium 4. You can order them here for free (no delivery charges -- totally free)
Posted on 2003-05-21 08:52:17 by masnick[CCCP]
I need a source-code optimizer, since I want to generate ASM code with a C program, and I need a tool to reorder all the lines (in order to compute the optimal pairing).

I found:
http://www.imada.sdu.dk/~jews/optimizer/
but didn't test it yet.
(Several peephole optimizers exist on the Amiga)

Are you aware of such a project ?

JC
Posted on 2003-06-03 04:42:49 by MCoder