That means that th code you compared can not be compared.
One checks 256 possibility range the other 64 possibilities.
In PMMX btw your profile shows 5 ticks for bt memory version.
It can be compared because Jens' code and bitRAKE reply was about a 64 bit lookup table method. Even adding a CMP/JAE makes my method faster, so what the heck?

I hope you don't want to turn this thread into a compo about the fastest HEX validity check routine, I could join but I'd have a better use of my time than that.

I gave you a method to perform 64 (or more) bit two state lookups, you're free to use or not to use it. If you want to make extra validity checks, by any means you're free to add them.. it'll still be fast.

It was not me who wanted to make comparisons that specific way, I am here to contribute and help, not to make compos.
Posted on 2002-03-27 07:40:44 by Maverick
Also, if you want to be fair, you will notice that when I tested Jens's and bitRAKE's code to mine, I deliberately removed their extra instructions (like SUB '0') to make an *extremely* fair comparison. As well as I used the best case (perfectly cacheline aligned, 100% cached, etc..) for your BT ,ECX "method" (which rather than a method is, really, an Intel instruction). Also, it's not my "fault" if bitRAKE's method had "incidentally" out of bounds control, and I credited him for that.

So I don't understand what do you complain against my "fairness". Perhaps that I wasn't extremely fair (re-read all of the above), and you found a CPU where my Athlon cycles count was a bit worse? Again, if you re-read my posts you will find only extreme honesty in comparisons, and I clearly stated that the method was for P6 and not the old P5 (or 386, nor 8088 for the matter).

Next time that I read your:


Jens, I've analyzed your method.
Man... You are very talanted!
I'm glad I asked you for comments.


and bitRAKE's:

Very creative approach Jens Duttke! I like it, and will be using something similar in the future. Eliminates two-state small table lookups.


you teach me I better use my time for myself instead for you. Right?
Posted on 2002-03-27 07:58:29 by Maverick
First of all:
1. My credits was for his(Jens) comments not for code.
His code doesn't work, if you read my followed posts carefully.
2. Jens tried (though failed) to analyze whole 256 byte range.
Though his mask was just 64 bits he tried by prologue to using mask
and triky following code to handle all 256 possibilities.
Posted on 2002-03-27 08:12:24 by The Svin
1) I know that Jens code doesn't work - if you read my post carefully I (politely) stated that. I still think you appreciated the method.. that was what I meant, if you read between the lines.
2) I don't see what's special in 256 byte ranges. Do you limit yourself to lone ASCII (specific) applications? This method is "bitmap register lookup", it has nothing to do per se with ASCII or any specific application.
3) I see I wasted enough of my time on this whole thing.

Regards.
Maverick
Posted on 2002-03-27 08:16:51 by Maverick
I see I wasted enough of my time on this whole thing

You have a talant man to turn any sientific discussion into personal things :)
Meveric, when somebody tries to analyze(or critisize) my method (or thoughts) or beat my code I take it as sign of respect, 'cause it is ferm fact that he wasted his time (found worthy to waste it) to follow and study my things.
Why almost any negative comments about your things you take as a personal insults?

I'll leave you alone if you want, but I thought we here is not as nominant for a Nobel medal, so the only advantage of this to seek for new thought, methods, analyze them, critisize them etc. in a single goal - to make our pro level better.

Anyway if it's for good I'll leave the section for a while :)
It's been a long time since my last feedback Ewayne about bugs in his nice ResEdit :)
I 'm coming Ewayne!!! Ugly evel Svin is going to report a lot of bugs about your darm groupbox logic !!!! :)
He always thanks for critics (or maybe it's just template for
msg post? Steve always write Regards and Ewayne Thanks?!)
;)
be well...
Posted on 2002-03-27 08:34:58 by The Svin

Also, it's not my "fault" if bitRAKE's method had "incidentally" out of bounds control, and I credited him for that.
"By Design", silly - it wasn't an accident. :cool:

Also, two instructions are hard to get an accurate timing on
Athlon when three instruction can execute in parallel. :)
Posted on 2002-03-27 08:39:46 by bitRAKE
The Svin: constructive criticisms are always welcome. But it's not to waste my time that I write affirmations (like "ecx must be 0..63"), so I don't consider too much constructive a criticism that ignores what I explicitly wrote, or that makes seem less fair a comparison that I've made where extreme fairness was the very first goal.
Maybe you haven't read with enough attention the above, so I will repeat: constructive criticisms are always welcome.

bitRAKE: I've put "incidentally" into quotes not by accident as well. :rolleyes:
Posted on 2002-03-27 08:59:47 by Maverick
constructive criticisms are always welcome

I always say what I mean without hidden allusions:
You compared things wich can not be compared - no more no less.
To compare things you need to bring it to one ethalon.
If you want to compare bt to your code - you can do it
but bring table to 64 bits that's it.
Then compare.
That's all I said.
Not about what method is better or not, not about fairness or some personal shit.
Pure math logic - pointing out to incorrect compareson conditions.
Could you just concentrate on technical details and take all this personal reflection apart?
Posted on 2002-03-27 09:10:54 by The Svin
bitRAKE wrote:Also, two instructions are hard to get an accurate timing on Athlon when three instruction can execute in parallel.
Do you mean your method is faster, or what? I don't follow you..


The Svin wrote:I always say what I mean without hidden allusions:
You compared things wich can not be compared - no more no less.
To compare things you need to bring it to one ethalon.
If you want to compare bt
to your code - you can do it
but bring table to 64 bits that's it.Are you blind or what? That's what I did.. I compared all methods on 64 iterations, not 256. So what are you trying to say now?

You want to enforce me into making you buy that I would really use my method to do HEX validity check? Ok.. here you're served:

ISHEX? (lowercase and uppercase supported, covers whole ASCII characters map)
;input: AL = character to test


LEA ECX,[EAX-'0']
MOV EBX,MASK1
CMP CL,'g'-'0'
JAE .notvalid
MOV EAX,MASK2
CMP CL,32
CMOVC EAX,EBX
BT EAX,ECX
JNC .notvalid


256 iterations: 2.34375 cycles on average per iteration.

Happy now?


Pure math logic - pointing out to incorrect compareson conditions.
Where did I ever do that?? I compared them on 64 iterations.. (shall I quote it? "64 iterations of which, on my Athlon, take xxx cycles") now if you wanted to be blind, it's not my fault.
Posted on 2002-03-27 09:48:43 by Maverick
Are you blind or what? That's what I did.. I compared all methods on 64 iterations

Are you drunk or what?
64 bits is size not time.
Posted on 2002-03-27 09:51:49 by The Svin

Do you mean your method is faster, or what? I don't follow you..
No, I'm speaking of the method that buliaNaza has posted BT/JNC. It is the only two instruction method posted above - I could call you 'blind' here, but it would only serve to admit my extreme dis-like for your words to Svin - he can take care of himself. :)
Posted on 2002-03-27 09:56:23 by bitRAKE


Are you drunk or what?
64 bits is size not time.
Dammit, you must be stupid.. 64 iterations test for all 64 possible bits in the bitmap.

Hello? Is anybody home? :grin:

;)
Posted on 2002-03-27 10:02:40 by Maverick

No, I'm speaking of the method that buliaNaza has posted BT/JNC. It is the only two instruction method posted above - I could call you 'blind' here, but it would only serve to admit my extreme dis-like for your words to Svin - he can take care of himself. :)
Look bitRAKE, if you put a JNC after a BT then there's no many miracles the CPU can do to make it fast. JC will have to wait for BT to complete, simple as that.
Also, I wouldn't call an instruction that Intel has designed *precisely* for this purpose an "algorithm", or a "method". I didn't see buliaNaza claim anything like that, so I don't see why you should make it seem that way.

About dis-liking words, you should take your own responsabilities, as well as The Svin against me. There are many posts above to attest how things went, who was blind and who called drunk who.
Posted on 2002-03-27 10:06:36 by Maverick

Also, I wouldn't call an instruction that Intel has designed *precisely* for this purpose an "algorithm", or a "method". I didn't see buliaNaza claim anything like that, so I don't see why you should make it seem that way.
True, but what should I call it?
Are there instructions before the BT?

This discussion digresses quickly. I am to blame as any.
Posted on 2002-03-27 10:28:27 by bitRAKE
bitRAKE:

I see where all the problem arised: my fault is that I had too much
consideration for The Svin to believe that he doesn't understand such a
simple thing as:

"64 iterations of which on my Athlon runs at an average of 2.000 cycles per iteration."
in the very context of testing the average performance on all 64 possible checks.

I do not want to think he's stupid (although he calls me "drunk", this as
demonstration that is me who is being personal, not him), but I have it hard
to believe that he doesn't really understand a statement like the above which
was written several times (unless he didn't even bother to read it, hence my
"if one wants to be blind it's not my fault").

Not only he acts and writes as if I never wrote all those specifications, but
he even accuses me of doing the opposite of what I did. In fact I removed
instructions from yours and Jens's code to "downsize" them (so they run
faster) as much as possible to a simple, generic, 64 bit bitmap register
lookup method.. because that was I what explicitly comparing - and where it
wasn't fully possible, in your code's case, I credited the extra capabilities
(which were useless in that specific test I was interested in, anyway).

But then all of the above gets ignored and The Svins forces it to be not a
generic bitmap register lookup method.. but a specific ASCII-validity one.

He went off topic, and accused me of doing so.. while I always stated that
checking for out of bounds or subtracting '0' is a specific, and thus EXTRA,
application .. to which I wasn't interested.

Even so, 2.34375 cycles per check on average on my method is still quite good,
maybe I could even improve it but I don't want nor need to bother.

Also, bitRAKE, don't teach me education: it's you in some other thread that
went personal against me saying "That is based on your level of paranoia",
in response to a simple question of mine with a big disclaimer "I don't know
about the subject, I only heard rumors", so that the paranoia, if of anybody,
can't be mine anyway. Even if I was harsh against Microsoft, I wasn't against
anybody personally: but you were, calling me paranoid.

As you used the adjective "silly" on me regarding an expression I used inside
quotes, and not for no reason. You read what you want to read, but you omit
what it's comfortable to omit.

So, be clear, when you blame on my words read some posts above and see what
you write and what The Svin wrote before (no, I was not drunk.. I don't drink
alchoolics at all The Svin: think about yourself).

Anyway, I gave you the technical satisfaction you asked: "2.34375 cycles" is
my last word on this subject, and my last reply to YOUR offensive and personal
assumptions or deliberate omissions and misreadings.

It's sad anyway that there can't be friendship between me and The Svin: I've
always, although all of the past, tryed to be positive and polite towards him,
and to build instead of destroy.

Same thing for you bitRAKE, that I esteem as coder but something I don't really
like how you act.. as calling me paranoid when I stated they were rumors I've
heard. Or in other occasions.

I really had no time for that, but since The Svin and you seemed to be interested
in this kind of two-state lookups I really tryed to be friendly and help as much
as I could. Sure I've got my bad character at times, but I don't really think I'm
the one to blame for all of this situation. I was upset that my very explicit
specifications were ignored, as if I wrote for nothing, and opposite claims
were done where instead my goals were clearly others.
Posted on 2002-03-27 11:01:23 by Maverick
Are there instructions before the BT?
Why are you and The Svin so unfair? When I was forced to test the Intel method I did so..
I tested only the BT/JNC, I removed the bound checking instructions that were before.
It took 12.421875 cycles per iteration, on average (all 64 possibilities).
I don't want to offend anybody, so I cheated and inverted JNC to JC.. result: 8.3125, which is less offensive to say the same advice "stay away from this method, is not fast".
After all my lone goal was to help and give good advice, not to look like a smartass.

So you are calling me unfair???

I think that enough is enough.. I better code my own stuff and do my own business.

I have much work behind.
Posted on 2002-03-27 11:20:13 by Maverick
Maverick, I am not calling you anthing - I do not know you. We are just speaking here of some code and the use of that code. I am just curious of how you reach your conclusions. My tests show that BT mem,reg has a latency of 3 cycles, where as BT reg,reg has a latency of 1 cycle. Also, MOVcc performs very nicely on Athlons. Hence, your combination of instructions is very favorable within the range where it works. If more complete coverage of the input value is required, then the sequence of instructions I present perform well. I would like to thank, Jens for moving the discussion in this direction of register bitmaps, for it lead to the other developments.
Posted on 2002-03-27 11:59:48 by bitRAKE
I've re-read all of the posts (a thing I advice you to do as well), and I
discovered that Jens code was meant to check for >=64 bounds - I didn't notice
that before, I noticed it only for bitRAKE one (and credited for it).

But I don't see where is the point: I've always said that extra features like
out of bounds checkings and SUB '0' are anyway extra, so since I've always
claimed I wasn't interested in any extra feature, but to talk about the generic
method itself, anyway I don't see why I had to be forced to implement these
extra features as well.

If I got since the start that Jens's code attempted to do so, I would have
skipped directly my 2.000 cycles example and show since the start (also) the
2.34375 cycles version, which features extra things (like bound checkings)
that don't belong to the generic register bitmap method we were discussing.

I thought that me and Jens were both doing it the "generic way", and that
bitRAKE had extra features useful in certain specific applications. I thought
(I could be wrong) that he got the extra features for free, and that 2 vs 1 was
enough to not have to make a routine that checked for bound limits (then I did
it, as you know). I thought, if bitRAKE may get a generic routine (i.e. without
out of bound checks) faster than his one that features that extra thing, he
would surely post also that. Sure there are a lot of cases where the extra
out of bound checks aren't necessary.

I always acted trying to be as much honest as possible, if I knew since the
start that Jens code attempted out of bounds protection, I would have posted
*both* of my routines at once, instead of posting the other later (which also
does SUB '0' that I had removed from both Jens and bitRAKE routines in my speed
tests .. but I'm not acting like a kid now and bothering about it and being
picky .. I wish I could say the same about others in this thread as well).

Yet I thought it was pointless that The Svin attacked me, since my first
routine was so fast that even adding traditional bound checkings wasn't going
to change the scenario. Not that I wanted to claim better performances of my
routine than the real ones: the opposite, I removed as many instructions as
possible from the other 2 routines to test, so to make a fair compare (again,
I didn't know that Jens code attempted bound control, and I credited bitRAKE
because I knew his one did); my last routine does one more thing than Jens
and bitRAKE stripped down versions I used; I also said 8.3125 cycles for your
code, The Svin, (better, Intel code) while in reality it was 12.421875.

Because of all of the above I was pissed to be attacked on the fairness side..
I think you could be indulgent a fraction of what I've been. Let away the "just
want to be helpful" extra stuff, of what personally I'm sick now. Never again.

Anyway..
Posted on 2002-03-27 12:18:16 by Maverick
bitRAKE wrote:
Maverick, I am not calling you anthing - I do not know you. We are just speaking here of some code and the use of that code. I am just curious of how you reach your conclusions. My tests show that BT mem,reg has a latency of 3 cycles, where as BT reg,reg has a latency of 1 cycle.
You are entitled to all the doubts that you want, but I'm quite sure I can profile code in a proper way. Anyway, here is how to finally test all of the four ISHEX? methods (uncomment one of the 4 routines to test it):

Also, MOVcc performs very nicely on Athlons. Hence, your combination of instructions is very favorable within the range where it works. If more complete coverage of the input value is required, then the sequence of instructions I present perform well.
You contraddict yourself: in a previous post you said that my method scales well, 1 cycle per additional DWORD mask, but now you said the opposite. Ehm?




MASK1 equ 007E03FFh
MASK2 equ 007E0000h
; let's make sure the table is cached in the best possible way
ALIGN 64
hexmask: ;used only by routine 4
DD 0,1FF0000h,7Eh,7Eh,0,0,0,0

; ---

ALIGN 64
testme:

; ---

;routine 1 (Jens)
;sub cl,48
;xor edx,edx
;mov eax,1
;shld edx,eax, cl
;shl eax,cl
;and eax,MASK1
;and edx,MASK2
;or eax,edx
;jz .not_a_hex_digit

; ---

;routine 2 (bitRAKE)
;sub ecx,'0'
;mov edx,MASK1
;xor eax,eax
;bt edx,ecx
;mov edx,MASK2
;rcl eax,1
;bt edx,ecx
;rcl eax,1
;shr ecx,5
;bt eax,ecx
;jnc .not_a_hex_digit ;NOTE: jc = .is_a_hex_digit

; ---

;routine 3 (Maverick)
;LEA ECX,[ECX-'0']
;MOV EBX,MASK1
;CMP ECX,'g'-'0'
;JAE .not_a_hex_digit
;MOV EAX,MASK2
;CMP CL,32
;CMOVC EAX,EBX
;BT EAX,ECX
;JNC .not_a_hex_digit
; ---

;routine 4 (table and code provided by The Svin)
; .data
; hexmask dd 0,1FF0000h,7Eh,7Eh,0,0,0,0
; .code
; bt hexmask,edx
; jnc nothex
;BT U32 [hexmask],ECX
;JNC .not_a_hex_digit

; -----------------------------
; common code
RET
.not_a_hex_digit: RET

; -----------------------------------------------------------------------------

MAIN:

; perform all possible 256 cases and update a "total cpu cycles" counter.

XOR ECX,ECX
MOV U32 [total],ECX
.loop: PUSH ECX
PROFILE testme
POP ECX
MOV EAX,U32 [PROFILE.CYCLES]
ADD U32 [total],EAX
INC ECX
CMP ECX,256 ; IMHO it should be 64, but no bother
JNE .loop

Now display total/256 to get the average performance of the tested routine.


Results (my machine is a Athlon-XP 1800+, DDR 292MHz):

as a reference:
1 to 3 NOPs 1.00000000 cycles per call on average
4 NOP 2.00000000 cycles per call on average
etc..

routine 1 (Jens) 7.00000000 cycles per call on average
routine 2 (bitRAKE) 4.25390625 cycles per call on average
routine 3 (Maverick) 2.21484375 cycles per call on average
routine 4 (The Svin) 15.74609375 cycles per call on average

routine 4 assumes table is perfectly aligned and fully cached.. an unrealistic scenario.
I'm very disappointed by routine 4 though, I myself had foreseen it would have performed
decently++, given the ideal data cache scenario it was offered (intuitively I thought that
from 128 bit on, memory would be faster.. although I'd use for longer strings, due to
the fact that it's unrealistic to think that the table will be in the data cache all the
time, and loading memory can cost thousands cycles.. let away loading it from swap file.
Still carefully used, I strongly advocated in my previous post the use of BT ,reg).
I suspect that BT may stall the pipelines for some weird reason, since the execution
times are way too high. I will investigate more.

Regards,
Maverick

Posted on 2002-03-27 17:15:34 by Maverick
Maverick, You contraddict yourself: in a previous post you said that my method scales well, 1 cycle per additional DWORD mask, but now you said the opposite. Ehm?
Solve this problem for me: How can both of my comments be correct? :) I am not in the habit of contradicting myself. You make broad assumptions of Svin and myself. Not that I am not guilty of the same. I would hope that we can learn to communicate well, for there is much that we should learn from each other. :) There is nothing more important for me!

I was very surprised of the BT mem,reg execution as well. The manual states a latency of 8 cycles. I can execute nine instructions before the BT without them having any additional cost, but none after without it costing more cycles! BT mem,reg takes 5 cycles to complete. This is on a Athlon TB - I know you have an XP. Of the 5 cycles, the first three can pair with other instructions, but the latter two execute alone.
Posted on 2002-03-27 23:43:15 by bitRAKE