hi,

i found this thing,

http://www.onversity.com/load/d-loop.pdf

its someone who claims having found a way to do quicker loops if they have many ending conditions.

he writes it in C and says he has beaten the standard C librairies that are (well thats what he says) optimized in asm. he wants ppl to spread his method, makes conferences (lectures) about it, has written a big pdf and a statistic model (!) on when its applicable, he shows some asm code the compiler generates, but i m not sure he understands them.

basically i think his trick is to have only one ending condition in the inner loop, for ex:

you want to count a char in a string, you have

-a while for the end char (null),
-in which you have an if(currenchar==searchedchar){count++}

so that makes 2 tests each time
(most of the times, one branch FW (not taken) and one BW(taken) i think)
and its baad for new processors, so he makes initialized structures (didnt really go into it) and makes an inner loop with just 1 test.

the idea is interesting, but i would say its useful only when the branch prediction mechanism is the bottleneck, but he seems to have results.

in asm you could well make test the two conditions sequentially and make just one branch for most cases, i think


l1:
sub currentchar, zerochar
save notZF into eax.b0 ;rcl or adc or something (slow!)
;or you could add the result of sub to eax
sub currentchar, searchedchar
save notZF into eax.b1

cmp eax,0
jz l1 ; most of the times, taken
{here you check if it was
-end of str: jmp end
-char found: charcount++; jmp l1}

so most of the times you ve just 1 codepath fork, (jz taken)

you get the idea, cmov and friends would ease and speed up things.

do you think he has made something valuable or ... (?)
what do you think of it?

the strlen version he shows in his pdf is worth giving a look :)
Posted on 2004-12-04 08:32:28 by HeLLoWorld
or in the special caseof the char_count func:

al : 0 (end)
ah : searched char
ecx: result(init 0)
esi: str

l1:
xor ebx,ebx
cmp byte ,ah
rcl ebx,1 ;might be a cmovz ebx,1
add ecx,ebx
cmp byte ,al ; guess its in the cache...
;maybe better to put it into a reg the 1st time...
jnz l1


whaddya think of it?

we could also have long code snippets with each instr prefixed by a condition on a flag(there could be many user-settable flags)...
so theres no branches anymore...(but hey then we ve got to decode all those instr without jumping so its not better :) )

isnt it the road taken by ia64?
Posted on 2004-12-04 08:42:42 by HeLLoWorld
It is a good article - stuff I've been typing around here for years. :) There is a trade-off between memory speed and processor branching costs in most loops with multiple branches. D-Loop optimization will fail when prep-time and memory costs are greater than branch savings.

Another way to state this: (almost) all code can be replaced with data.

Let me give an example, say we have a very complex operation that be have coded: it takes EAX and does some wonderful stuff with it then returns.
Magic PROC

; lots of branching and complex code
Magic ENDP
Now we don't need to use all that complex code. Instead we can replace it with some pre-processing that computes all the possible results of the complex PROC, and just move the result into EAX:
PrepMagic PROC

mov ecx, -1
@@:
; lots of branching and complex code

mov [PREP_DATA][ecx*4], eax
sub ecx, 1
jnc @B
retn
PrepMagic ENDP


Magic PROC
mov eax, [PREP_DATA][eax*4]
retn
Magic ENDP
Please, don't take this literally - I'm trying to demostrate two extremes so we may think about generating an optimal solution for our work somewhere in the middle. Obviously, we are not going to store/compute 2^32 values for the example, but usually we don't have to. In the best situations no pre-compution/memory (at runtime) is required at all.
Posted on 2004-12-04 10:58:15 by bitRAKE
Looks like snake-oil.

I tested it abit (function - time in seconds):
    count_char() - 0.92
    new_count_char() - 1.38
    new_count_char() (sec. 7.1) - 1.19
    std. lib. strstr() - 0.85
    new_strstr() - 1.44
    As for his new_strlen(), he breaks a pretty hard requirement. I didn't even test it.

    I used VC7.1 and the standard library supplied with it for the testing. The test ran 100 rounds of the function with a 1mb null-terminated random string. For new_strstr()/strstr(), a specific word was placed around 734kb and searched for. I use a pIII 733MHz. The tests were run several times.

    It's logical for it to be slow, as well... it accesses a lookup table in memory. While this technique sometimes fits well, it doesn't work in these cases.
    Using an internal loop for a common condition is a well-known optimization.

    His coding style shows that he is not well versed in the C programming language.
Posted on 2004-12-04 11:10:54 by death