I wrote some ASM which I thought was very efficient, but it turned out not to be. I execute it a billion times in a loop and subtract the time for the empty loop. I whittled it down to the following, which takes more than 28 clock cycles...

``````
movzx  eax, word ptr [edi]
imul   eax, 92
mov    [edi],ax
``````

IMUL seemed to be the culprit so I replaced it with equivalent code...

``````
movzx  eax, word ptr [edi]
lea    edx, [eax*4]
lea    eax, [eax*2 + eax]
shl    eax, 5
sub    eax, edx
mov    [edi],ax
``````

This code takes 23 clock cycles. When I comment out the SHL it only takes 11.5 clock cycles. Can someone help me understand what is happeneing here? Intel's docs for Pentium 4 timings say SHL has a latency of 4 and a throughput of 1. Am i doing something wrong?
Posted on 2003-04-23 08:37:11 by Ekted
You could try the floating point fmul but I doubt it would be any faster as I think it uses thwe same circuitry for multiplications.

The combination of LEA and shifts will probably be hard to improve on, multiplications are generally slow in comparison to other integer operators.

Regards,

hutch@movsd.com
Posted on 2003-04-23 09:27:06 by hutch--
Give this a try,
``````
push eax
lea ecx, [eax+eax*8]    ; x 9
lea edx, [ecx+ecx*8]    ; edx = / 81
lea ecx, [eax+eax*4]
lea eax, [ecx+ecx]
pop edx
``````

It seems to work OK but I have no idea if its faster. Input into EAX and output in EAX.

Regards,

hutch@movsd.com
Posted on 2003-04-23 09:50:36 by hutch--
Using your code I get 21 cycles now. But what I am asking is: how is it that SHL adds 12 cycles when docs say 1? How can I possibly make intelligent decisions about which ops to use when the docs don't match reality?
Posted on 2003-04-23 09:59:45 by Ekted
Ekted,

Timings are always tricky to get precisely. The LEA can have stalls with the multiple use of the same registers in each instruction. I have tried to alternate them but you will get a few effects like this unfortunately.

I am a bit toobrain dead to do much more as its 1 am but you may be able to fiddle the sequence to get the time down a bit.

Regards,

hutch@movsd.com
Posted on 2003-04-23 10:05:17 by hutch--
Maybe it is due to AGI stall as lea is prone to that.
Posted on 2003-04-23 10:10:10 by roticv
humm, do you need to work on a single short (16bit unsigned word), or an array of them?
For multiplying an array of 1024 shorts by 92, I tinkered a bit with the intel C++ compiler.
It produced the following output, which has neither been tested nor timed :)

``````
mov       eax, 2048                                     ;11.11

; LOE eax ebp esi edi
.B1.2:                          ; Preds .B1.1 .B1.2
movzx     edx, WORD PTR ?blah@@3PAGA[eax-2]             ;13.8
lea       ecx, DWORD PTR [edx+edx]                      ;13.8
sub       ecx, edx                                      ;13.8
sub       ecx, edx                                      ;13.8
movzx     edx, WORD PTR ?blah@@3PAGA[eax-4]             ;13.8
mov       WORD PTR ?blah@@3PAGA[eax-2], cx              ;13.8
lea       ecx, DWORD PTR [edx+edx]                      ;13.8
sub       ecx, edx                                      ;13.8
sub       ecx, edx                                      ;13.8
movzx     edx, WORD PTR ?blah@@3PAGA[eax-6]             ;13.8
mov       WORD PTR ?blah@@3PAGA[eax-4], cx              ;13.8
lea       ecx, DWORD PTR [edx+edx]                      ;13.8
sub       ecx, edx                                      ;13.8
sub       ecx, edx                                      ;13.8
movzx     edx, WORD PTR ?blah@@3PAGA[eax-8]             ;13.8
mov       WORD PTR ?blah@@3PAGA[eax-6], cx              ;13.8
lea       ecx, DWORD PTR [edx+edx]                      ;13.8
sub       ecx, edx                                      ;13.8
sub       ecx, edx                                      ;13.8
mov       WORD PTR ?blah@@3PAGA[eax-8], cx              ;13.8
jne       .B1.2         ; Prob 99%                      ;11.2
``````
Posted on 2003-04-23 10:10:40 by f0dder
Try this one, it has a lower instruction count.
``````
lea ecx, [eax*8]
lea edx, [ecx*4]
lea ecx, [edx+edx*2]
lea edx, [eax*4]
sub ecx, edx
``````

EAX in, ECX out.

Regards,

hutch@movsd.com
Posted on 2003-04-23 10:17:38 by hutch--
1000000 iterations of 2048 muls (2048000000 total) took 3515 ticks
That's on a 2.53GHZ Pentium4 - you do the maths and tell me the average cycles/mul.

MMX might be beneficial here...

assuming that we're working on an array of values.
Posted on 2003-04-23 10:23:52 by f0dder
That code is about 35 clock cycles. I made a version not using LEA at all:
``````
mov   ecx,  eax             ; ecx = n
add   ecx,  ecx             ; ecx = n * 2
add   ecx,  ecx             ; ecx = n * 4
add   ecx,  ecx             ; ecx = n * 8
add   ecx,  ecx             ; ecx = n * 16
add   ecx,  ecx             ; ecx = n * 32
sub   ecx,  eax             ; ecx = n * 31
mov   edx,  ecx             ; edx = n * 31
add   ecx,  ecx             ; ecx = n * 62
add   ecx,  edx             ; ecx = n * 93
sub   ecx,  eax             ; ecx = n * 92
``````

which takes only 11.5 clock cycles. Maybe the solution is just to use ADD,SUB,MOV for eveything and completely avoid LEA and SHL.
Posted on 2003-04-23 10:24:20 by Ekted
``````
lea ecx, [eax*8+eax]; n* 9
shl eax,7;n*128
nop
shl ecx,2;n*36
sub eax,ecx;n*128 - n*36
``````
Posted on 2003-04-23 10:38:01 by roticv
That's 28 clocks. What's the NOP for?
Posted on 2003-04-23 10:44:32 by Ekted
My logic is probably pretty flawed. That, or the compiler has optimized
away something that breaks the results (don't think so though, I put
the innerloop in a separate .cpp file, and I'm not doing cross-module
nor whole program optimization). If anything is flawed, most likely
it's my calculations ;). I VirtualAlloc a 4k array (2048 words = 4k),
virtualalloc so alignment definitely wont be an issue (it shouldn't anyway).
The memory is filled with zeroes - this way memory is pretouched and
should in cache. The contents of the memory shouldn't really matter,
I guess. Process + thread priority is raised to realtime/timecritical,
then a Sleep(0) is done, and the mainloop runs for 1000000 iterations.
The mainloop calls "time1", which multiplies all entries of the 2048-word
array with CONSTANT (defined as 92).

Results:
1000000 iterations of 2048 muls (2048000000 total) took 3500 ticks
(Possibly flawed) logic:
2.53 ghz ~= 2530000000 cycles/sec
that's 2530000 cycles/milisec
3500 tics is (3500*2530000) 8855000000 cycles

To get approx cycles/instruction, divide cycles used by muls completed.
That would be 8855000000/2048000000, so effectively ~4,3237 cycles/mul
Posted on 2003-04-23 10:44:40 by f0dder
``````
; 1000000 iterations of 2048 muls (2048000000 total) took 5235 ticks
_time2@4:
push	esi
push	edi

mov		esi, [esp+12]
mov		edi, 2048 - 1

.loop:
movzx	eax, word [esi + (edi*2)]
; his code
mov		ecx,  eax             ; ecx = n
add		ecx,  ecx             ; ecx = n * 2
add		ecx,  ecx             ; ecx = n * 4
add		ecx,  ecx             ; ecx = n * 8
add		ecx,  ecx             ; ecx = n * 16
add		ecx,  ecx             ; ecx = n * 32
sub		ecx,  eax             ; ecx = n * 31
mov		edx,  ecx             ; edx = n * 31
add		ecx,  ecx             ; ecx = n * 62
add		ecx,  edx             ; ecx = n * 93
sub		ecx,  eax             ; ecx = n * 92
mov		[esi + (edi*2)], cx

dec		edi
jnz		.loop

pop		edi
pop		esi
ret 4
``````
Posted on 2003-04-23 10:49:39 by f0dder
Ekted,

What processor are you running as it may effect what code works well.

try the second LEA version as it has a lower instruction count and shorter complex addressing which should bring its time down some.

Regards,

hutch@movsd.com
Posted on 2003-04-23 10:49:44 by hutch--
Yes. The timings I am posting include my load/store instructions:
``````
movzx eax,  word ptr [edi]
...
mov   [edi], ax
``````
Posted on 2003-04-23 10:50:27 by Ekted
I am using a P4 2.4GHz.
Posted on 2003-04-23 10:52:51 by Ekted
Maybe my logic is flawed. I was hoping that nop makes my shifts pairable. But it is indeed puzzling that the shifts are so slow. Are you using Athlon?
Posted on 2003-04-23 10:56:30 by roticv
O my P4 2.53ghz, the sample code snippet I posted (adjusted a bit from your code snippet, and obviously not unrolled) ran quite somewhat slower than the previously posted compiler output.

What I'm wondering is, if my "logic" stuff seems correct, or flawed?
Posted on 2003-04-23 10:56:56 by f0dder
``````
;1000000 iterations of 2048 muls (2048000000 total) took 4421 ticks
_time3@4:
push	esi
push	edi

mov		esi, [esp+12]
mov		edi, 2048 - 1

.loop:
movzx	eax, word [esi + (edi*2)]
; roticv code begin
lea		ecx, [eax*8+eax]	; n * 9
shl		eax, 7				; n * 128
shl		ecx, 2				; n * 36
sub		eax, ecx			; n * 128 - n*36
; roticv code end

mov		[esi + (edi*2)], ax
dec		edi
jnz		.loop

pop		edi
pop		esi
ret 4
``````

Of course not unrolled, so again comparison isn't entirely fair.
Posted on 2003-04-23 11:03:01 by f0dder