The VC++ 6 compiler generates the following code for this line:
for(int a = 0; a < 0x1000000; ++a) b = a;

XOR EAX, EAX
MOV ECX, VARIABLE
AGAIN:
MOV DWORD PTR [ECX], EAX
INC EAX
ADD ECX,4
CMP EAX,1000000h
JL SHORT AGAIN

And I thought it could be improved:
xor eax, eax
mov ecx, VARIABLE
again:
mov [ecx+eax*4], eax
inc eax
cmp eax, 1000000h
jne short again

By counting cycles, my method seems faster. But when it comes to real testing, the VC-generated code is a bit faster. Could anyone explain this for me? Thanks!
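(As a sanity check, the two fragments do compute the same fill; below is a minimal C model of both loops. The function names and the size parameter are illustrative, not from the posted code.)

```c
#include <stdint.h>

/* C model of the VC++ 6 output: a running pointer (ECX) advanced by 4
   each pass, plus a separate counter (EAX) compared against the bound. */
void fill_vc_style(int32_t *b, int32_t n)
{
    int32_t *p = b;                    /* mov ecx, VARIABLE */
    for (int32_t a = 0; a < n; ++a) {  /* cmp eax, n / jl */
        *p = a;                        /* mov dword ptr [ecx], eax */
        p++;                           /* add ecx, 4 */
    }
}

/* C model of the hand-written version: one counter used both as the
   loop index and as the scaled array index. */
void fill_indexed(int32_t *b, int32_t n)
{
    for (int32_t a = 0; a != n; ++a)   /* cmp eax, n / jne */
        b[a] = a;                      /* mov [ecx+eax*4], eax */
}
```

Both fill b[a] = a for every a in [0, n), so any speed difference is purely a codegen/microarchitecture question.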
Posted on 2003-02-12 04:29:32 by C.Z.
I'm not sure about it. Maybe the instruction "mov [ecx+eax*4], eax" is the reason. The address calculation ecx+eax*4 actually costs more CPU cycles than ADD ECX,4.

my English is poor:)
Posted on 2003-02-12 05:19:25 by HookWorm

bomb01:
It may seem irrelevant to you, but it's not: what CPU are you testing this on?
There are dependencies and instruction decoding (and more) to take into account, depending on the specific CPU that is being tested on both programs.
Posted on 2003-02-12 06:06:51 by Maverick
In my opinion,

in the VC++ 6 code,



mov dword ptr [ecx], eax
inc eax



read and write-after-read, so the pair is OK and each instruction can go in its own pipe... so mov dword ptr [ecx], eax = 2 cycles, and eax is incremented at the same time, so the piece of code above executes in 2 cycles...

your code


mov [ecx+eax*4], eax
inc eax


cannot be paired


here is a smaller / faster solution (i think :tongue: ) :


mov eax, VARIABLE
mov ecx, 999999
again:
mov dword ptr [eax+ecx*4], ecx // 2 cycles
dec ecx U pipe
jns again V pipe


6 bytes for the loop
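(A C model of this count-down loop, with the element count as a parameter for illustration; the original post hard-codes the decimal value 999999. The store happens before the decrement, and jns falls through only when dec produces -1, so every index from n-1 down to 0 is written exactly once.)

```c
#include <stdint.h>

/* C model of the dec/jns count-down fill: the counter doubles as the
   scaled index, and the sign flag replaces a separate cmp. */
void fill_countdown(int32_t *b, int32_t n)   /* n = element count */
{
    for (int32_t i = n - 1; i >= 0; i--)     /* dec ecx / jns again */
        b[i] = i;                            /* mov dword ptr [eax+ecx*4], ecx */
}
```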
Posted on 2003-02-12 16:21:51 by DarkEmpire

bomb01: writing asm code without a deep knowledge of the hardware behind it is optimization suicide.. a big waste of time.. and food for those C++/Java zealots that only wait for excuses to say that compilers outperform human asm coders.
Posted on 2003-02-12 16:37:59 by Maverick
bomb01,
Can you post your test code, please?

Regards,
Lingo
Posted on 2003-02-12 17:48:20 by lingo12



mov eax, VARIABLE
mov ecx, 999999
again:
mov dword ptr [eax+ecx*4], ecx // 2 cycles
dec ecx U pipe
jns again V pipe



i think you need to change your second line... the original program executes the loop 1000000h times (a = 0 to 0FFFFFFh)
yours executes the loop from 999999 down to 1... it should read: mov ecx, 1000000h (and execute from 1000000h down to 1)

i could be wrong :)

but yours certainly looks faster than the compiler :alright:
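(The two iteration counts being compared can be checked with a throwaway C sketch; the function names are mine, just for illustration. Note the original bound is hex, 0x1000000 = 16,777,216, while the count-down loop uses decimal 999999.)

```c
/* Count how many times each loop form iterates. */
long count_original(void)      /* for (a = 0; a < 0x1000000; ++a) */
{
    long n = 0;
    for (long a = 0; a < 0x1000000; ++a) n++;
    return n;
}

long count_countdown(void)     /* mov ecx, 999999 ... dec ecx / jns */
{
    long n = 0;
    for (long c = 999999; c >= 0; --c) n++;   /* jns keeps the c == 0 pass */
    return n;
}
```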
Posted on 2003-02-12 18:01:58 by jademtech
try this one :grin: :stupid:
xor     eax, eax

__again:
mov DWORD PTR b[eax*4], eax
lea edx, DWORD PTR [eax+1]
lea ecx, DWORD PTR [eax+2]
mov DWORD PTR b[eax*4+4], edx
lea edx, DWORD PTR [eax+3]
mov DWORD PTR b[eax*4+8], ecx
mov DWORD PTR b[eax*4+12], edx
lea edx, DWORD PTR [eax+4]
mov DWORD PTR b[eax*4+16], edx
add eax, 5
cmp eax, 1000000
jl __again
you can always use sub and test if the sign flag is set (like DarkEmpire's code), plus reverse the ordering of the code. Or you can add more lines of code inside the loop above and increase the value in the add eax, 5.... Or you can try to unroll the loop completely and start doing something like this
mov     DWORD PTR b[eax], 0

mov DWORD PTR b[eax+4], 1
mov DWORD PTR b[eax+8], 2
...
but you have to sacrifice code space if you want to unroll... a macro would be a good choice for this. :grin: :stupid:
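(A C model of the 5-way unrolling idea above, for illustration; the function name is mine. Five stores per pass means the loop-control overhead of add/cmp/jl is paid once per five elements. This sketch assumes n is a multiple of 5, which 1000000 is; an odd count would need remainder handling, as discussed later in the thread.)

```c
#include <stdint.h>

/* 5-way unrolled fill: one add/cmp/branch per five stores. */
void fill_unrolled5(int32_t *b, int32_t n)   /* n must be a multiple of 5 */
{
    for (int32_t a = 0; a < n; a += 5) {     /* add eax, 5 / cmp / jl */
        b[a]     = a;
        b[a + 1] = a + 1;
        b[a + 2] = a + 2;
        b[a + 3] = a + 3;
        b[a + 4] = a + 4;
    }
}
```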


Posted on 2003-02-12 18:09:22 by arkane
Bomb,


mov [ecx+eax*4], eax
inc eax
cmp eax, 1000000h

This looks like a dependency problem: a pipeline delay on the 1st line as EAX is in both the source and the destination, a further operation on EAX on the next line with INC, and a compare against EAX on the third line.

Regards,

hutch@movsd.com
Posted on 2003-02-12 18:15:53 by hutch--
Here is a shorter version that is slower on my Internet AMD 550.


; =========================================

Bomb1 proc VARIABLE:DWORD

mov ecx, VARIABLE
mov eax, -1000000h

again:
mov [ecx+eax*4+1000000h*4], eax
inc eax
jnz again

ret

Bomb1 endp

; =========================================

Will see if I can do a faster one later, but this code is still strangled by the dependency on EAX.

Regards,

hutch@movsd.com
Posted on 2003-02-12 18:43:45 by hutch--
20 arrays per loop :grin: :stupid:
pcmpeqd     mm1, mm1

psrld mm1, 31
movq mm0, mm1
paddd mm1, mm1
psllq mm0, 32

xor eax, eax
__again:
movq QWORD PTR b[eax*4], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+8], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+16], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+24], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+32], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+40], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+48], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+56], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+64], mm0
paddd mm0, mm1
movq QWORD PTR b[eax*4+72], mm0
paddd mm0, mm1
add eax, 20
cmp eax, 1000000
jl __again
try using movntq(inside the loop) if processor supports SSE. :grin:

Dunno, if that will be faster. :stupid:
Posted on 2003-02-12 19:24:43 by arkane
jademtech :
i use the conditional jump jns, not jnz, so the loop body runs even when ecx = 0 ;)

arkane
Your solution is fast because we know the number of loops (1000000), but imagine that this number is a function parameter... wouldn't your solution become too complex? It's true that i am not using MMX / SSE, so i don't really understand these instructions...

hutch--


again:
mov [ecx+eax*4+1000000h*4], eax
inc eax
jnz again


with that code we have: b[0] = -1000000h
b[1] = -0FFFFFFh
don't we ?
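(That observation can be checked with a small C model of the negative-counter loop; the size is reduced here and the function name is mine. The index is biased by the array length so it runs over the whole array, but the value stored is the raw negative counter, so the result differs from the original b[a] = a fill.)

```c
#include <stdint.h>

/* C model of the biased negative-counter fill:
   stores b[a + n] = a for a = -n .. -1, i.e. b[k] = k - n. */
void fill_negative_counter(int32_t *b, int32_t n)
{
    for (int32_t a = -n; a != 0; ++a)   /* inc eax / jnz again */
        b[a + n] = a;                   /* mov [ecx+eax*4+n*4], eax */
}
```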
Posted on 2003-02-13 03:22:42 by DarkEmpire
HookWorm: Thank you for the explanation, but I'm not sure if that's the point?

Maverick: Tested it on an AMD Thunderbird.
Yes, you are absolutely right. I've yet to squeeze out the time to read the Intel manuals and Agner Fog's tutorial. :)

DarkEmpire: Thank you for your very fast code. I'm quite a newbie, shame on my sloppy code. :D
About pairing I don't know, I have yet to read, read and read.
Your code outperforms the VC code, but I somehow don't think we should depend on the array being <= 2G in size.
Will it be equal in speed if:


mov eax, VARIABLE
mov ecx, NUMBER_OF_BYTES
again:
dec ecx
mov dword ptr [eax+ecx*4], ecx
jne again
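(This dec-before-store variant does still cover the whole array; here is a C model of it, with an illustrative name. Because mov does not touch the flags, jne still sees the zero flag set by the dec, and the loop exits right after writing index 0.)

```c
#include <stdint.h>

/* C model of the dec-first fill: decrement, then store, so the
   writes cover n-1 down to 0 in n iterations. */
void fill_dec_first(int32_t *b, int32_t n)   /* n = element count */
{
    int32_t i = n;
    do {
        i--;             /* dec ecx */
        b[i] = i;        /* mov dword ptr [eax+ecx*4], ecx (flags untouched) */
    } while (i != 0);    /* jne again */
}
```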


jademtech: yes, his code is 30% faster than the VC code.

lingo12: It's attached below.

arkane: You gave me an idea I hadn't thought of before, thanks zealot!

hutch--: Thank you for your clear explanation, now I begin to understand what makes it slow.
Posted on 2003-02-13 05:51:04 by C.Z.
Bomb,

I played around with the original VC code, your replacement, the one I posted, and another variation on my PIV later, and the one I posted is about 5% faster than the two you had. I did another expanded RISC-style version that was faster by about the same amount, but the real problem is that memory speed seems to be the restriction here. There are no unpredictable jump problems, and the dependency problem cannot be beaten, so each memory write leaves a big hole that something else can be plugged into, but nothing will beat the memory limit.

I got at best about a 5-10% improvement on my PIV; on the AMD it was slightly slower than yours, but the VC version was slow on all tests.

I have found over time that complex addressing mode and instruction reduction seems to be the best average across different processors.

Regards,

hutch@movsd.com
Posted on 2003-02-13 07:05:29 by hutch--
Should he read the AMD manual or the Intel manual first? Would he run into any confusion later if he started one before the other? Does it really matter?
Posted on 2003-02-13 08:16:53 by drhowarddrfine
I am having the same problem that I asked about earlier (remember, Hutch?), but the difference is that I use a different register from those in the address reference.
I recommend that you plan the usage of the registers before you start writing a tight loop. Choose carefully; you only have 6 to work with :)
Posted on 2003-02-13 10:46:06 by x86asm

Will it be equal in speed if:




mov eax, VARIABLE
mov ecx, NUMBER_OF_BYTES
again:
dec ecx
mov dword ptr [eax+ecx*4], ecx
jne again



Yes, you're right
Posted on 2003-02-13 11:13:58 by DarkEmpire
Originally posted by DarkEmpire

would your solution not become too complex ?
it will make it a bit complex, especially if we are dealing with arrays whose length is odd (5, 7...). I'll code something later... :)
Posted on 2003-02-13 14:38:10 by arkane
Some people don't care about optimization methodology and depend on speed measurement only.
When they "optimize" the code they only try to reduce the number of mops. There are a lot of examples
in the "Algorithms & Source Code" section, but I don't want to offend their authors, because I like their code
and respect their efforts in this area.
In general there is no Align 16 directive in their code, hence we can't analyze it properly:
the test code will start at an accidental memory address rather than putting the ifetch boundaries where we want them.

"... you have to know where each ifetch block begins. This is quite a tedious job. First you need to make your code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then you have to look at the output listing from your assembler to see how long each instruction is. (It is recommended that you study how instructions are coded so that you can predict the lengths of the instructions.) If you know where one ifetch block begins then you can find where the next ifetch block begins in the following way: Make the block 16 bytes long. If it ends at an instruction boundary then the next block will begin there. If it ends with an unfinished instruction then the next block will begin at the beginning of this instruction..." by A.Fog

Here is an example of optimization methodology:

"Here we have reduced the number of mops to 6 by using the same register as counter and index. The base pointers point to the end of the arrays so that the index can count up through negative values to zero.

Decoding: There are two decode groups in the loop so it will decode in 2 clocks.
Instruction fetch: A loop always takes at least one clock cycle more than the number of 16 byte blocks. Since there are only 11 bytes of code in the loop it is possible to have it all in one ifetch block. By aligning the loop entry by 16 we can make sure that we don't get more than one 16-byte block so that it is possible to fetch in 2 clocks.

Register read stalls: The ESI and EDI registers are read, but not modified inside the loop. They will therefore be counted as permanent register reads, but not in the same triplet. Register EAX, ECX, and flags are modified inside the loop and read before they are written back so they will cause no permanent register reads. The conclusion is that there are no register read stalls.

Execution:
port 0 or 1: 2 mops
port 1: 1 mop
port 2: 1 mop
port 3: 1 mop
port 4: 1 mop
Execution time: 1.5 clocks.

Retirement:
6 mops = 2 clocks.
Conclusion: this loop takes only 2 clock cycles per iteration.
If you use absolute addresses instead of ESI and EDI then the loop will take 3 clocks because it cannot be contained in a single 16-byte block. " by A.Fog

To answer the question of why one piece of code is faster than another, we have to analyze
their assembly output listings. Here is an example: http://www.asmcommunity.net/board/showthread.php?threadid=8330&perpage=15&pagenumber=1


Regards,
Lingo
Posted on 2003-02-13 19:31:21 by lingo12
Lingo has a good point here. When I set up the test piece to clock the different algos I had, I put the same algo in a couple of different places to test whether the random start address affected the algo's speed, and it did, so I aligned all of them to 16 bytes and the variation stopped.

I will be interested to see if anyone can get any substantial speed increase over the basic functionality of the VC code that bomb posted. I could get about 10% on my PIV, but this was different on the old AMD I use for the Internet, so hardware differences and memory limits are what I see as the effective limits on speed increases.

Regards,

hutch@movsd.com
Posted on 2003-02-13 19:59:51 by hutch--