There is nothing new here, but there is a bit of size reduction and order to extend the dword optimizations by Agner Fog and Jens Duttke to the following functions:
Posted on 2002-10-07 07:29:32 by pelaillo
Hola Pelaillo, good work :alright:
Just a note.. the backslash ( \ ) didn't appear in the part I'm reporting in red:


; ---------------------------------------------------------------
; | String Functions:
; | Based on strlen optimizations by Agner Fog and Jens Duttke
; | str_end: edi points to 0 terminator
; | str_len: eax -> len in bytes of edi (preserved)
; | str_add: edi <- edi + esi
; | str_copy: edi <- esi
; | str_dir_sep: edi ends with single ''
; ---------------------------------------------------------------
str_end:
mov ecx,
add edi,4
lea edx,
xor ecx,edx
and ecx,80808080h
[..]

So, for those who didn't understand what's the purpose of str_dir_sep: it adds a \ to the end of a string.
Posted on 2002-10-07 08:02:31 by Maverick
pelaillo, I have fixed the board feature Maverick informs us about.
Posted on 2002-10-07 20:21:22 by bitRAKE
Thanks Maverick and BitRAKE
Posted on 2002-10-08 08:49:05 by pelaillo
Two more functions to get rid of bytescan. These are smaller and quicker than those found in stdlib.
Posted on 2002-10-09 14:30:26 by pelaillo
Thanks to Maverick's PROFILER v2.0, I have tested str_ucase and there are very large savings of clock cycles with respect to the Masm32 lib Ucase (look attachment)

~2.1 times on a Pentium
~3.6 times on Athlon

And improves with large number of chars.

Reasons for improving ucase?
-Accented and other non-english chars
-For case-insensitive searchs on large data
-I am learning

There are still some :grin: limitations:
1. data must be dword-aligned
2. this chars (`{|}~) get converted respectively into (@[\]^) (not critical for case-insensitive searches)

Some ideas to avoid this second point without loosing most of performance?
Posted on 2003-01-16 08:20:02 by pelaillo
The codes look very interesting. But what does it do exactly? I must
admit that I have not yet read Agners Optimization guide. Could
anyone add descriptions to these string algos? I would really like
to understand what it actually does? or maybe there is a commented
source on the orginal strlen by Agner Fog? any suggestions appreciated. :alright:

I cant really tell you that its a job well done pelaillo. Since I dont understand
all of the code. However it does look and sound like damn good code. ( ;) )
Posted on 2003-01-16 08:46:00 by natas
natas:
Don't miss this excelent thread:
http://www.asmcommunity.net/board/index.php?topic=4058
Commenting this code confuss it instead of help. I suggest you to download the demo app and load it with OllyDbg.
The hint is the ASCII table itself.

Regards,
Posted on 2003-01-16 09:14:14 by pelaillo
pelaillo,
I'm wondering (If you know well this "excelent" thread:
http://www.asmcommunity.net/board/...=&threadid=4058)
why you play with slower code when you can use a faster one
Consider this:



xor edx,edx ; edx=0
C2_loop: ;
mov eax, [esi+edx] ; get a dword (buffer is aligned)
lea ecx, [eax-1010101h];sub 1 from each byte in eax
add edx, 4 ; ready for next dword
and ecx, 80808080h ; test sign
jz C2_loop ; if not loop again
;
test eax, 000000FFh ; is al zero?
jz C2_minus4 ;
test eax, 0000FF00h ; is ah zero?
jz C2_minus3 ;
test eax, 00FF0000h ; is zero?
jz C2_minus2 ;
test eax, 0FF000000h ; is zero?
jnz C2_loop ; if not zeroes loop again
lea eax, [edx-1] ; eax= length of string
ret ;
C2_minus2: ;
lea eax, [edx-2] ; eax= length of string
ret ;
C2_minus3: ;
lea eax, [edx-3] ; eax= length of string
ret ;
C2_minus4: ;
lea eax, [edx-4] ; eax= length of string
ret ;


Regards,
Lingo
Posted on 2003-01-16 15:47:11 by lingo12
Lingo12,
You are right, buliaNaza's strlen is faster (I need to check it in Athlon too).

But, do you have some hints on my str_ucase question?
Posted on 2003-01-17 04:04:20 by pelaillo
continuing with my usual silly interruptions on threads in this forum.

lingo12,
could you please explain how exactly does this code you posted works and how is it faster?
Posted on 2003-01-18 06:28:39 by clippy
That is because you do not scan byte by byte for the null-terminator. lingo12's method loads the data into dword and scan the dword for null-terminator, hence its speed. I presume it is better for longer strings, and that for shorter strings byte scanners are better

Correct me if i am wrong.:stupid:
Posted on 2003-01-18 09:29:15 by roticv
pelaillo,
Thanks for replay!


gladiator,

"..how exactly does this code you posted works.."
There is a similar code in A.Fog's book with explanation.
(Example 2.8: _strlen). Unfortunately, I can't explain better!

" and how is it faster?"
If you have read A.Fog's book I'll try to explain why, but if you are
a HLA "assembly" coder just skip my post and ask the right person
in the right thread.

So, here is our two main loops in memory:


Let's assume that:
- we have a PPro, PII or PIII CPU
- and we haven't any problems with Decoding, Instruction fetch, Register read stalls, Retirement, etc

[B]A.pelaillo's main loop in memory[/B]
00401150 8B 0F mov ecx, dword ptr [edi] ; [B]D0[/B] len=2 p2rEDIwECX 1 mop 1-1-0
00401152 83 C7 04 add edi, 4 ; D1 len=3, p01rwEDXwF 1 mop
00401155 8D 91 FF FE FE FE lea edx, [ecx-1010101h] ; [B]D0[/B] len=6 p0wEDXrECX 1 mop 1-0-0
0040115B 33 CA xor ecx, edx ; [B]D0[/B] len=2 p01rwECXrEDXwF 1 mop 1-0-0
0040115D 81 E1 80 80 80 80 and ecx, 80808080h ; [B]D0[/B] len=6 p01rwECXwF 1 mop 1-1-0
00401163 74 EB je 00401150 ; D1 len=2 p1rF 1 mop

We can expect pelaillo's main loop instructions to be executed in 4 clocks (we have four D0's)

[B]B.buliaNaza's main loop in memory[/B]
00401170 8B 04 32 mov eax, dword ptr [edx+esi]; [B]D0[/B] len=3 p2rESIrEDXwEAX 1 mop 1-0-0
00401173 8D 88 FF FE FE FE lea ecx, [eax-1010101h] ; [B]D0[/B] len=6 p0rEAXwECX 1 mop 1-1-0
00401179 83 C2 04 add edx, 4 ; D1 len=3, p01rwEDXwF 1 mop
0040117C 81 E1 80 80 80 80 and ecx, 80808080h ; [B]D0[/B] len=6 p01rwECXwF 1 mop 1-1-0
00401182 74 EC je 00401170 ; D1 len=2 p1rF 1 mop

We can expect buliaNaza's main loop instructions to be executed
in 3 clocks (we have 3 D0's), i.e. we have reducing the number of mops

Can we improve buliaNaza's main loop? It is a challenge, but let's try:

We have:
xor edx,edx ; edx=0
C2_loop: ;
mov eax, [esi+edx] ; get a dword (buffer is aligned)
lea ecx, [eax-1010101h]; sub 1 from each byte in eax
add edx, 4 ; ready for next dword
and ecx, 80808080h ; test sign
jz C2_loop ; if not loop again
;
We modify to:
mov eax, [esi] ; get a dword (buffer is aligned)
mov ebx, 80808080h ; we'll use register ebx rather then immediate 80808080h
xor edx, edx ; edx=0
C2_loop: ;
lea ecx, [eax-1010101h] ; sub 1 from each byte in eax
inc edx ; ready for next dword
and ecx, ebx ; test sign ; ebx= 80808080h
mov eax, [esi+edx*4] ; get next dword
jz C2_loop ; if not loop again

and here is
[B]C."Improved" buliaNaza's main loop in memory[/B]

00401197 8B 06 mov eax, dword ptr [esi];
00401199 BB 80 80 80 80 mov ebx, 80808080h ; we'll use register ebx (2 bytes) instead of
0040119E 33 D2 xor edx, edx ; immediate (6 bytes) to avoid decoding problems
; later, i.e. we manipulate instruction lengths(see A.Fog)
004011A0 8D88FFFEFEFE lea ecx, [eax-1010101h] ;[B] D0[/B] len=6 p0rEAXwECX 1 mop 1-1-0
004011A6 42 inc edx ; D1 len=1 p01rwEDXwF 1 mop
004011A7 23 CB and ecx, ebx ; [B]D0[/B] len=2 p01rwECXrEBXwF 1 mop 1-1-1
004011A9 8B 04 96 mov eax, dword ptr [esi+edx*4]; D1 len=3 p2rESIrEDXIwEAX 1 mop
004011AC 74 F2 je 004011A0 ; D2 len=2 p1rF 1 mop
004011AE ;

We can expect buliaNaza's main loop instructions to be executed in
2 rather then 3 clocks (we have two D0's only)


Notes:
- Our improvements change the logic of the buliaNaza's proc, i.e. we must modify the rest of code too!
- We can substitute and ecx, ebx ; len=2 p01rwECXrEBXwF 1 mop
with test ecx, ebx too ; len=2 p01rECXrEBXwF 1 mop




Regards,
Lingo
Posted on 2003-01-18 15:08:42 by lingo12
Lingo12:
You have improved buliaNaza !! great !!!:alright:
I will provide a version of all dword string functions based on this enhancement.

Gladiator:
These functions scans strings four by four characters to take advantage of the processor's 32 bits.
The strlen loops until one of the bytes on the dword is zero and then identifies wich of them is zero to calculate the lenght.
The ucase, lcase clears or sets bit 6 (ascii table) of each byte if the char is alphabetical symbol.

Best Regards,
Posted on 2003-01-20 03:19:46 by pelaillo
considering you have to deal with non-english languages, wouldn't the best approach be using xlat tables? (not necesarrily the xlat instruction, though).
Posted on 2003-01-20 04:08:42 by f0dder
f0dder:
Doing so we will loose all of performance...:(
Posted on 2003-01-20 04:43:53 by pelaillo
foreign language characters suck :/. gets even worse with queer charsets like hebrew, chinese etc which require unicode. Oh well, I've chosen to limit myself to "sane" languages ;)
Posted on 2003-01-20 09:22:40 by f0dder
pelaillo,

You may be onto a good idea here if you can get the algo to work on all characters, the u/lcase algos are ones I wrote about 3 - 4 years ago and they work OK as byte scanners go.

I would if it was a real performance matter write it as a table and then compare the table access speed to the byte scanner to see if it was any faster. The table is probably more flexible in that it can easily be made to handle different code pages.

Keep up the good work.

Regards,

hutch@movsd.com
Posted on 2003-01-21 04:37:24 by hutch--
Korean does not have the concept of upper case. I doubt Chinese, or Japanase have that concept either.

Hebrew and Arabic I'll guess don't. Cyrillc? We got quite few Russian members, they'd know.
Posted on 2003-01-21 05:08:33 by ThoughtCriminal
thanks roticv, lingo12 and pellalio for trying to help me.:alright:

lingo12:
I am pretty new to asm and i cant understand most of agner fog's manual. Although i have never used HLA but i do code in HLLs:)

What exactly do these things in your hex code mean????:confused:


D0 len=2 p2rEDIwECX 1 mop 1-1-0


Whats D0, p2rEDIwECX , 1 mop 1-1-0????:confused:

Parts of your code I couldnt understand( I havent tried reading any further than this)


mov eax, [esi+edx] ; get a dword (buffer is aligned)
lea ecx, [eax-1010101h]; sub 1 from each byte in eax
add edx, 4 ; ready for next dword
and ecx, 80808080h ; test sign
jz C2_loop ; if not loop again

Why do you subtract 1 from each byte in eax???
Also what 'sign' are you checking for and why??? I know i look extremly idiotic while saying this but what exactly are signs in chars? After all this is a character array you are scanning?(or not)?
Posted on 2003-01-22 04:31:48 by clippy