This is inspired by a comment from Scali in the "need help with stacks" thread. He observes - correctly, AFAIK - that code using "aam" and "aad" and the like is very slow. Okay, but "how slow?" (preliminary indications are "a lot"!). I've gotten some "funny" answers (IMO) trying to investigate similar things. For example, a "short" way to convert a nybble to a hex digit involves "das". I know it's "slow", but how bad? Putting it into my "timing harness", it jumps around all over the place! Initially, I was investigating push/pop vs mov on a K6-300. push/pop was faster (slightly, I don't recall the numbers). On a K6-2-500, it was reported that mov was twice as fast. My current P-IV (yeah, I know) shows mov faster, as expected, but it seems to matter how many times you do it! That isn't too unexpected, I guess... but "more is faster" sure is!

I'll attach the entire thing - Linux-specific, and uses "-f bin", no linker... if there's any interest, I can try to come up with a more "portable" version - but here's the meat of it:

    xor eax, eax
    push edx
    push eax

; code to time

;times REPT_COUNT mov eax, 0
;times REPT_COUNT xor eax, eax

;times REPT_COUNT dec eax
; times REPT_COUNT sub eax, byte 1

;push eax
;pop eax

;mov eax,
;mov , eax

;    das



    xor eax, eax
    pop ebx
    sub eax, ebx
    pop ecx
    sbb edx, ecx
    sub eax, THIS_EMPTY_LOOP
    sbb edx, 0

    mov edi, ascbuf
    call u64toda

; will need a suitable "printstring" and "exit" for 'doze

As the comments indicate, "THIS_EMPTY_LOOP" is defined to a number that mostly gives zero for the code with everything commented out, as shown. "REPT_COUNT" can be most conveniently defined on the command line...

; "tune" this so an empty loop gives zero, most often,
; on your machine.

; define this on the command line to Nasm,
; or uncomment it here, if you prefer...
; REPT_COUNT equ 1 ; 10000

Then we uncomment, say the push/pop pair, assemble it with "nasm -f bin -dREPT_COUNT=1 timeit.asm" - gives 8, on my P-IV. Re-assemble with "-dREPT_COUNT=2" - gives 4. Wha? "REPT_COUNT=3" gives 12.

Okay, comment the push/pop lines again and uncomment the mov pair. Assembled with "-dREPT_COUNT=1" gives 8 - same as push/pop. REPT_COUNT=2 gives 8 again. REPT_COUNT of three gives 4! This isn't cycles-per-iteration, this is total cycles for three repetitions of the code! At this rate, we'll be doing it in negative time! But no, REPT_COUNT of 4 gives 12...

At least the numbers usually repeat (barring the occasional outlier). Uncommenting "das" gives results all over the lot, no matter the REPT_COUNT. 104, 104, 104, 96, 96, 64, 104, 96, 96, 76 - for one instance of "das"! "aam" looks more consistent...

Can anyone explain these "strange" results? Am I doing something wrong?


Posted on 2010-04-28 10:34:19 by fbkotler
Better check your timing framework :) .

mov     ax, 10 shl 8
mov     al, 8
sub     al, 3
xchg    ah, al

C2D E8500

This takes 10 cycles  per loop. When I put it 10 times in the loop-body, it takes 103 cycles, so no surprises.
Measuring the aad/das alone results in 2-3 cycle latency.
If I procomment the "mov ax," making the pipeline stall or create input values that are possibly nasty for the instructions, the 103 cycles become 177 or 230.

A real benchmark remains: a real usage of the instructions, with real data to feed it.

Btw, aad/das,etc are not documented anymore in the latest Intel optimization manuals. (no info on latency/throughput)
Posted on 2010-04-28 12:09:49 by Ultrano

Better check your timing framework :) .

I guess! 10 cycles, you say? That seems unexpectedly fast, no? I get 88 cycles for that code, 76 cycles for a naive implementation of approximately the same calculation using "mul" and "div". Curiously, cranking "REPT_COUNT" up to 10, they nearly even out at 1028 vs 1020. (Suggesting that maybe "Baldr's code" isn't as slow as Scali - or I - would have expected.)

Thanks for the feedback, Ultrano! I'll keep lookin'...


Posted on 2010-04-28 13:41:25 by fbkotler

I guess! 10 cycles, you say? That seems unexpectedly fast, no?

Depends on how you look at it.
10 cycles for 7 instructions on a CPU that has a maximum throughput of 4 instructions per cycle? Doesn't sound very good.
I would say that on average you should be able to get 2 instructions per cycle in practice.
It is well possible that the optimal solution is longer in terms of instructioncount, but executes much more efficiently.
One of the fastest routines in the last small challenge was 14 instructions long, and executed in 7 cycles, and that included calling overhead.
Posted on 2010-04-28 14:35:08 by Scali
Okay. When you said "VERY SLOW" (which matches my expectations), I was expecting worse than that. I still don't know where 10 cycles comes from - f0dder's "yodel"? ...and I don't know why I'm getting 88 cycles...

This is why I generally like to shoot for "small" code - easier to keep score. :)


Posted on 2010-04-28 21:27:45 by fbkotler