Hi all,

I have several situations where I can choose between these two forms of memory access:

mov eax, [esi+ecx*4]
mov eax, [address+ecx*4]

The first instruction is three bytes long, while the second is seven bytes long. But on the other hand, the first one uses up a precious general-purpose register and needs esi to be loaded first, while the second instruction can execute immediately.

So, which one is better? The static address is used several times, so the one-time cost of loading it into esi can be neglected.

Thanks,

c0d1f1ed
Posted on 2006-07-05 14:20:30 by C0D1F1ED
- Using the registers is almost always faster and it's always shorter
- Shorter instructions are almost always faster (than their longer equivalents).

The CPU does everything within its ultra-small memory area called 'registers'. Calculating the effective address from the argument is no different. It's faster to add one register to another than to copy the argument to a temporary register and then add it to another.

At least theoretically, that is, because nowadays CPUs pre-decode instructions, so simply changing from one option to the other won't make any noticeable difference (in most cases). Everything depends on the whole code, not on its micro-parts. First optimize the algorithm (mathematically), then try to write it using the most suitable data structures, then optimize all functions, and then and only then try to 'cut' some instructions. The first two optimizations give you ~80% of all you can get, in most cases.

It's best seen in Intel's JPEG library: the plain IDCT algorithm executes in 30 s on my CPU (1024x768x24-bit image). An aggressively optimized version of the plain algorithm executes in 3 s. The AA&N algorithm (a mathematically reworked variant which produces the same results with CPU-friendly operations) executes in ~100-140 ms. The one implemented by Intel in their library (heavily MMX- and SSE-optimized) executes in ~70-80 ms. As you can see, optimizing the algorithm itself takes you to 0.00(3)* of the original execution time, while optimizing the code afterwards only contributes a further ~0.635 factor.



So the lesson is: don't bother about instructions, unless your algorithm is perfect and you use ideally-suited data structures and functions.




*1.0 means 'no change', 0.5 means '2 times faster', 0.1 means '10 times faster', 0.00(3) means '300 times faster'.
Posted on 2006-07-06 02:08:27 by ti_mo_n
Haha, I thought that's what computer scientists have been preaching? Using a superior algorithm is definitely the right way to go.

Personally I think it is better to make use of registers, but it is best to test it out with a profiler yourself.
Posted on 2006-07-07 08:31:09 by roticv
It's faster to add 1 register to another, than copy the argument to a temporary register, then add it to another.

Doesn't the immediate address enter the execution pipeline directly? x86 micro-instructions are very long so I'd expect this to be included.
At least theoretically that is, because nowadays CPU's pre-decode instructions...

Only the Pentium 4 uses a trace cache. The soon-to-be-released Intel Core 2 Duo processor won't even use one any more.
So the lesson is: don't bother about instructions, unless your algorithm is perfect and you use ideally-suited data structures and functions.

My algorithm is perfect. 8) It has a lot of 'global' data though (tables with constants and such), so I have the option to encode the addresses into the memory operations directly or work indirectly with a register.

So I want to know if anyone has any specific experience with different addressing modes. It's quite likely that it doesn't have much of an influence, but I want to ask anyway...
Posted on 2006-07-08 08:46:10 by C0D1F1ED
I agree on "everything depends on the whole code"...
So the question is whether your algorithm could be sped up in other parts by using esi or not.
I was thinking of a third option, if your code looks about like this:
you have seven accesses of the form address+ecx*4, and one of address2+ecx*4.
Rework it to use only ecx: initialize ecx just once, before the loop, to address+ecx*4,
in the loop exchange add ecx,1 for add ecx,4,
and cmp ecx against the end address to check when you're done.

I myself like to use mov [address+ecx],eax / add ecx,4 over
mov [address+ecx*4],eax / add ecx,1:
it's easier to customize to any stepping and easier to type, and if you target P4s you use add instead of inc anyway.

In theory, wouldn't an access through an indirect register a few times be faster than depending on calculating address+ecx*4 or esi+ecx*4 every time?
Especially esi+ecx*4, as the CPU can't count on esi being constant.


Posted on 2006-07-08 09:33:05 by daydreamer
Another voice sings the praise of the "whole code" concept. Another thing to think about is the order and alignment of data in the DATA section of your program. The caching algorithm, especially on the P6 architecture, is optimized for DWORD-aligned data and cache reads on contiguous data (the size is processor specific). So if you have read from an address close to the one containing your memory offset, the read-ahead algorithm functions more effectively, speeding things up. As for register or memory: the internal cache of the processor is just as fast as a register, and with read-ahead and well-thought-out code structure the memory-based address will be ready before it is needed, so the savings from using a register are negligible.

Donkey
Posted on 2006-07-08 10:40:27 by donkey
(...)

You got me wrong. I'm not talking about caching already-decoded instructions, which doesn't speed things up very much (that's why it's been deprecated). I'm talking about the pre-decoding process. The CPU decodes instructions into micro-operations and executes them. Nowadays CPUs decode ahead, in parallel with execution. The trace cache wasn't very good, because chaotic code had to be decoded and stored inside the cache every time it was executed. A normal L1 cache is better, because it stores the chaotic code itself, ready to be decoded and executed. An immediate address must be decoded and stored inside a temporary register; only then can it be added to anything else. That's slower than adding two registers. The main idea behind registers is to give the CPU a super-fast memory area. That's why we have registers in the first place. We could just as well operate only on immediates and memory operands, like in Java and other HLLs*.

My algorithm is perfect. 8) It has a lot of 'global' data though (tables with constants and such), so I have the option to encode the addresses into the memory operations directly or work indirectly with a register.

So you should bother about data structures :)

So I want to know if anyone has any specific experience with different addressing modes. It's quite likely that it doesn't have much of an influence, but I want to ask anyway...

If you want a simple and plain answer, then: use the registers. They are better.

The whole rant above was to explain that nowadays CPUs are VERY complicated and it's hard to predict whether 'this' or 'that' will be better. There is really no simple answer to whether you should use 'this' or 'that'. It depends on the algorithm, the data organization, and even the place inside the code. But yes - try to use registers as often as you can, and try to design your algorithms and data structures so that they can use the registers as often as possible. The registers reside in the innermost part of the CPU; that's why there are so few of them. Intel or AMD could add 512 128-bit registers (or even more) to their CPUs if they wanted to, but then such CPUs would cost... ugh O_O ...TOO MUCH, and they really wouldn't be worth the price, because we would require incredibly intelligent compilers to utilize those CPUs' full power (most of the world codes in HLLs).




*Explanation about this "Java and other HLLs": I know that HLLs get compiled and assembled and then they DO USE registers. What I mean is the languages >themselves<. They don't use registers: only immediates and memory operands.
Posted on 2006-07-09 02:15:45 by ti_mo_n
You just made me think for a minute... Why aren't compilers more intelligent than they are? I know it's possible to create a much more intelligent compiler than we have now, especially with all the advances in AI that we have... Maybe someone (I'm not directly suggesting anyone here) should go out and use AI to make a compiler that can actually optimize stuff the way an assembly coder optimizes, and maybe even better (figure out how to make an algorithm work better, etc.)
Posted on 2006-07-10 05:12:41 by Bobbias
Instead of just guessing or relying on experience, I made a little benchmark:

; code 1, takes <result1> cycles
mov esi,offset offs1
mov edx,1000
@@:
mov eax,edx
and eax,3
dec edx
mov [esi+edx*4],eax
jnz @B

against

; code 2, takes <result2> cycles
mov edx,1000
@@:
mov eax,edx
and eax,3
dec edx
mov offs1[edx*4],eax
jnz @B

Ran it at realtime-priority-class dozens of times, and the results are always:

result1 = 3045
result2 = 3044
result1 = 3045
result2 = 3045
result1 = 3045
result2 = 3045
result1 = 3046
result2 = 3046
result1 = 3045
result2 = 3046
result1 = 3046
result2 = 3046

In other words, at least on my CPU, the two versions perform the same.
(Sempron 2200+.) I think it's best to actually benchmark your code when feeling such doubts, and select the faster implementation (you might need to test it on different CPUs).

Bobbias: I think GCC 4.1 boasts about that - they're doing it by adding an "abstract assembler", iirc, between the *c++.exe and *as.exe. But anyway, I hope the need for asm is always present.
Posted on 2006-07-14 17:14:32 by Ultrano

You just made me think for a minute... Why aren't compilers more intelligent than they are? I know it's possible to create a much more intelligent compiler than we have now, especially with all the advances in AI that we have... Maybe someone (I'm not directly suggesting anyone here) should go out and use AI to make a compiler that can actually optimize stuff the way an assembly coder optimizes, and maybe even better (figure out how to make an algorithm work better, etc.)


It is well within our technological means to create a compiler that generates perfectly optimized code. However, the algorithms to do so are NP-complete, and generating the perfectly optimized program would take a *very* long time (i.e., longer than the age of the universe). Therefore, such compilers won't be very practical to use :-)
Cheers,
Randy Hyde
Posted on 2006-07-24 09:00:53 by rhyde
I'd say use the register.
In simple code it doesn't make much difference, but with more complex code the CPU is decoding, typically, 16-byte lines of data in parallel, and the higher the code density in those 16 bytes, the more scope it has to do things in parallel.
If you have 7-byte instructions you'll only have room for 2 or 3 of them; with 3-byte instructions you'll have room for 5 or 6, nearly three times the code density and three times the opportunity for the CPU to find savings.

Paul.
Posted on 2006-07-24 11:48:04 by pdixon