hi!


No misunderstanding. Read the quote above from the Intel Manual. Memory lookups are different than register lookups.


Ohh ... after reading it 5 times (damn complex writing, if English isn't your native language :grin: ), I understand it.
And yes, I was wrong and you are right. :)

And again, I learned something new about the x86 ... ;)

But now the difference between the size of your macro and this short code with memory access is even larger ... and it's nearly unbelievable that 140 bytes can be read/handled faster than 60 bytes, so I will test it with the code I used for the strlen routines.

Cu, Jens
Posted on 2002-03-26 10:47:19 by Jens Duttke
Certainly, there is going to be a 'crossover' point for each processor - where it is better to use the table over using straight code. It will be interesting to see what this point is for each processor. This is the same reason 'compiled sprites' are faster than the sprites stored in data - Maverick knows what I'm talking about here - he mentioned he was using compiled sprites.

Jens, I am glad that you learned something - I too was caught by this instruction. In 16-bit code the count can only access 32k - whereas one would think it would be able to access 64k. I spent a couple of hours debugging this one! :)
Posted on 2002-03-26 11:19:35 by bitRAKE
I'd like to extend what I wrote some posts above about advanced uses of the
carry flag. When you check for a bound, e.g. from 0 (included) to 32 (excluded),
you can do, as we saw in that post, a simple:


CMP ECX,32

now the carry flag will be set if ECX is within 0 (included) to 32 (excluded).
(note that the "from included to excluded" convention is what is typically used in bounds
checking, for example for arrays).
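
For example (just an illustrative sketch with made-up bounds), the same single unsigned
compare also works for a range that doesn't start at 0, if you rebase it first:


LEA ECX,[EAX-10]   ; rebase: the lower bound (10) now maps to 0
CMP ECX,32         ; 42-10 = 32 values in the range [10,42)
JNC .out_of_bounds ; carry clear -> EAX was outside the range

This is exactly what the LEA ECX,[EAX-'0'] does in the ISHEX? example further down.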

We can now exploit another trick:


SBB EAX,EAX

regardless of what was in EAX before, now EAX will be all ones (i.e. FFFFFFFF hex) if the carry
flag was set by CMP (which is like saying "if we were inside the bounds"), but
EAX will be all zeros if the carry was clear (i.e. if we were
out of bounds).

Now we can use the EAX mask we obtained to do many interesting tricks, using
for example the AND, OR and XOR instructions.
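
A small made-up example of using the mask (not from the routines below, just an
illustration): force a register to zero when the index is out of bounds, and leave
it alone otherwise:


CMP ECX,32
SBB EAX,EAX   ; EAX = FFFFFFFF if ECX is in [0,32), 0 otherwise
AND EDX,EAX   ; in bounds: EDX untouched; out of bounds: EDX = 0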

To remain on topic with this thread, we may want to select one of two 32-bit
values depending on whether ECX is within the bounds or not.


On P6 and higher:


MOV EBX,MASK1   ; value to use when ECX is in bounds
MOV EAX,MASK2   ; value to use when ECX is out of bounds
CMP ECX,32
CMOVC EAX,EBX   ; carry set (in bounds): EAX becomes MASK1

Now EAX will contain MASK1 or MASK2, depending on whether ECX is within bounds or not.


For CPU's <= P5 we can do instead:


MOV EAX,MASK2           ; start from the out-of-bounds value
CMP ECX,32
SBB EBX,EBX             ; EBX = FFFFFFFF if in bounds, 0 otherwise
AND EBX,MASK1^MASK2     ; MASK1 xor MASK2
XOR EAX,EBX             ; in bounds: MASK2 xor (MASK1 xor MASK2) = MASK1

How does it work? It's simple.. and maybe it's about time to remove the magic
also from the tricks that use another very useful instruction: XOR.

So, we already know that, after SBB, EBX will be either all 0's or all 1's,
depending on whether ECX is within the bounds we've specified or not.
Now, for example, do you recall the formula to implement linear interpolation?

It's just a:

result = (A*(1-alpha))+(B*alpha) ; where alpha is from 0.0 to 1.0

but the above can be rewritten this way, which is perfectly equivalent (but simply more optimized):

result = A+(alpha*(B-A))

(expanding the first form shows they match: A*(1-alpha) + B*alpha = A - A*alpha + B*alpha = A + alpha*(B-A))

i.e. we start from A and then account only for the difference between B and A,
"to fill the gap".

We can use the same technique to set a value, and then fix it up later as
much as needed to transform it into the other value. We could have written it
also this way:


MOV EAX,MASK2           ; start from the out-of-bounds value
CMP ECX,32
SBB EBX,EBX             ; all ones if in bounds, zero otherwise
AND EBX,MASK1-MASK2     ; difference between MASKs
ADD EAX,EBX             ; in bounds: MASK2 + (MASK1-MASK2) = MASK1

It's just that I prefer to use logical instructions instead of arithmetic
ones, when possible (there are good reasons to do so, but I do not want to
make this post too long).

---

So now we have a P6 (using CMOV) or P5 method to select the right mask.. and
we can finally test the bit:


BT EAX,ECX
JNC .notvalid

But BT may be a slow instruction in some CPU's .. so here's an alternative
method:


SHR EAX,CL
JNC .notvalid

Just remember two things if you use the above:
1) ECX must not be higher than 31 .. because otherwise unfortunately Intel does
not *guarantee* a correct carry flag :( So place an AND ECX,31 before the SHR.
2) You must increment CL by 1 to obtain the same result you get with BT.
You may account for this in the LEA instruction, or rotate the masks, etc.. (see the sketch below)
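
Here's a tiny sketch of the SHR variant with both points applied (the bit index 5 is
made up, purely for illustration):


MOV ECX,5     ; bit we want to test (must stay in 0..30 here)
INC ECX       ; point 2: CF receives the *last* bit shifted out, so shift by index+1
SHR EAX,CL
JNC .notvalid ; carry = former bit 5 of EAX, same sense as the BT version

Index 31 is the awkward case: after the +1 the count becomes 32, which the CPU masks
back to 0 (leaving the flags untouched), so for that one you really need the
rotated-masks approach instead.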

---

To sum it all up, this is my code for "bitmap register lookup". I have to admit
that my own programming language's compiler/optimizer gave me some hints,
because I was too lazy to think about some of the above things myself. ;)


MOV EBX,MASK1
MOV EAX,MASK2
CMP ECX,32 ; ECX is a 0..63 index
CMOVC EAX,EBX
BT EAX,ECX
JNC .notvalid

Maybe it can be improved further.. but I've no time right now. I will try
later perhaps, since I'd like to send this stuff to Thomas' snippets section;
together with the profile code (BTW, grv575, since I'd like to give you proper
credit for your MASM translation, could you please give me your full name,
or anyway the name you prefer to be credited with?); and maybe also send the
stuff about CMP and the carry flag. Later I may add other contributions,
since I'm trying to be less of an asshole and maybe share the little I know. Also,
I proved my own ignorance of the x86 with the wrong ROL assumptions.. I've
really got to clean my mind a bit of the 68000 and start doing some serious
x86 assembly instead.. like learning which flags each opcode affects, etc.. laziness
sucks. :)


ISHEX? (lowercase and uppercase supported)
;input: AL = character to test


LEA ECX,[EAX-'0']       ; rebase so that '0' becomes index 0
MOV EBX,007E03FFh       ; MASK1: bits 0..9 = '0'..'9', bits 17..22 = 'A'..'F'
MOV EAX,007E0000h       ; MASK2: bits 17..22 = 'a'..'f' (indexes 49..54, minus 32)
CMP CL,32
CMOVC EAX,EBX           ; index 0..31 -> MASK1, index 32..63 -> MASK2
BT EAX,ECX              ; BT on a register uses the offset mod 32
JNC .not_a_hex_number
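
Just to show it in context, here is an illustrative sketch that scans a whole string
with the routine above (hexstring and the two labels are hypothetical, and it inherits
the same 0..63 index assumption discussed further down, so bytes above '0'+63 are not
properly range-checked):


MOV ESI,OFFSET hexstring     ; hypothetical zero-terminated string
next_char:
MOVZX EAX,BYTE PTR [ESI]
TEST AL,AL
JZ all_hex                   ; hit the terminator: every char was hex
LEA ECX,[EAX-'0']
MOV EBX,007E03FFh            ; MASK1
MOV EAX,007E0000h            ; MASK2
CMP CL,32
CMOVC EAX,EBX
BT EAX,ECX
JNC not_a_hex_number
INC ESI
JMP next_char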


Finally, to know if a character is numeric, we can use the other technique
I showed earlier:

ISNUMERIC?
;input: CL = character to test


LEA EAX,[ECX-'0']
CMP AL,9
JNC .not_a_number


Really gotta go now.. have a nice day.
Posted on 2002-03-26 12:11:11 by Maverick
You know, I'm tired and stressed and I always leave errors behind.

My last code:


LEA EAX,[ECX-'0']
CMP AL,9
JNC .not_a_number


should have CMP AL,10 instead of CMP AL,9

Sorry for the error, I was in a hurry and overlooked some things.
But all the other code should be error free.

---

Now, in reply to bitRAKE:

Certainly, there is going to be a 'crossover' point for each processor - where it is better to use the table over using straight code. It will be interesting to see what this point is for each processor. This is the same reason 'compiled sprites' are faster than the sprites stored in data - Maverick knows what I'm talking about here - he mentioned he was using compiled sprites.
It's quicker to just profile the two routines under _real_ conditions, and then see which one performs better. In a loop, I believe yours will be much faster.. but since profiling is such a quick and reliable operation, why not just profile it where/when/how you will use it? A different scenario can make one better or worse, so it's always better to keep more routines.. someday you'll have a use for any of them.

One thing that we all (Athlon owners) must not forget when evaluating these techniques, though, is that the Athlon has much more generous L1 I/D caches than Intel CPUs.
That's why, for my fonts/sprites routines (as in most of my work), I prefer to simply have a routine that generates the code for the print_font / draw_sprite routine.
So when I get my hands on a Pentium IV I'll know whether it's worth differentiating between generating precompiled code and, e.g., RLE sprites.
Also, having a code generator makes profiling possible *at run time*, and allows self-adaptation to the host CPU. This last one is a technique I use a lot.
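
As a side note, the simplest flavour of that self-adaptation (no code generation
involved) can be sketched like this. routine_a / routine_b are hypothetical stand-ins
for two interchangeable implementations, and RDTSC is used very crudely (low dword
only, no serialization), so take it only as the idea:


.data
best_routine dd routine_a       ; default choice until we measure
.code
routine_a:                      ; trivial stand-ins for the two real implementations
ret
routine_b:
ret

pick_fastest:
rdtsc
mov esi,eax                     ; start cycles (low dword only)
call routine_a
rdtsc
sub eax,esi
mov edi,eax                     ; cycles spent in routine_a
rdtsc
mov esi,eax
call routine_b
rdtsc
sub eax,esi                     ; cycles spent in routine_b
cmp eax,edi
jae keep_a                      ; routine_b not faster: keep the default
mov best_routine,OFFSET routine_b
keep_a:
ret

; from then on the program simply does: call dword ptr best_routine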

The problem is always time.. so many things to do, so many ideas.. so little free time for them.
I wish my contracts would expire soon.. lotsa boring non-assembly-coding stuff there.
Posted on 2002-03-26 12:45:50 by Maverick
Maverick, you have explained my code above, nicely. :)
See post: http://www.asmcommunity.net/board/showthread.php?s=&postid=31018.msg30890

Still, it is limited in that ECX must be in the range [0-63] + '0'. Otherwise there is overlap of the masks - for example, if ECX = 128.
Posted on 2002-03-26 12:48:57 by bitRAKE
Maverick, you don't have to credit me it was just a straightforward translation. I'm just glad you took the time to post this code. Kinda surprised me that it gets exact cycle counts on my machine even in ring 3.

Btw, if anyone knows offhand how to get 64bit alignment for the code & data segments maybe you could post that. Don't know how important that was for the algo so I just used align 4 for the .data and align 16 for the .code. Does it make a difference when you switch processors?
Posted on 2002-03-26 14:13:00 by grv575

Btw, if anyone knows offhand how to get 64bit alignment for the code & data segments maybe you could post that. Don't know how important that was for the algo so I just used align 4 for the .data and align 16 for the .code. Does it make a difference when you switch processors?
Yes it makes a difference in many ways, check out this post:
http://www.asmcommunity.net/board/index.php?topic=3941.msg26814
Posted on 2002-03-26 14:35:20 by bitRAKE

Maverick, you have explained my code above, nicely. :)
See post: http://www.asmcommunity.net/board/showthread.php?s=&postid=31018.msg30890

Still, it is limited in that ECX must be in the range [0-63] + '0'. Otherwise there is overlap of the masks - for example, if ECX = 128.
If you refer to my code, yes, I already said that.. in the line:


CMP ECX,32 ; ECX is a 0..63 index

It's easy to extend my method to have more than two masks, anyway.. but from a certain point on, I'd rather prefer a memory-based one.
Posted on 2002-03-26 17:10:23 by Maverick

If you refer to my code, yes, I already said that.. in the line:


CMP ECX,32 ; ECX is a 0..63 index

It's easy to extend my method to have more than two masks, anyway.. but from a certain point on, I'd rather prefer a memory-based one.
Yes, you say that in the comment, but your code responds incorrectly to values out of range. The last version I present responds correctly to values out of range. I'm sorry, my rushed replies are not explicit enough.
Posted on 2002-03-26 17:35:53 by bitRAKE
Maverick, you don't have to credit me it was just a straightforward translation. I'm just glad you took the time to post this code. Kinda surprised me that it gets exact cycle counts on my machine even in ring 3.
I'd like to insist, not only because of the work on the MASM translation, but also because I saw other posts of yours (just to name one, the Chi square deviation test for random generators, but there were others) which were very interesting and IMHO a very good thing for the board.
Btw, if anyone knows offhand how to get 64bit alignment for the code & data segments maybe you could post that. Don't know how important that was for the algo so I just used align 4 for the .data and align 16 for the .code. Does it make a difference when you switch processors?
It matters a lot. For example, if you keep some variables near your code (which is bad anyway, because you waste instruction & data cache by duplicating the same stuff, thus using only half of each cache) you run a much higher risk: if you write to a variable which is in the same cache line as your code, it will be "misidentified" as self-modifying code, and you will incur *major* penalties (I'm talking about thousands of cycles for each write here!). Having it at least in another cache line fixes some, although not all, of the problems.
Also, on my Athlon, if you align loops at multiples of 2 they will be faster than if they sit at odd addresses. In general an alignment of 16 is very important for instruction decoding, especially for loops, which impose even more severe "restrictions".
Sorry, I can't help with how to achieve alignment under MASM.. but I can with any memory allocator (e.g. C/C++'s malloc()) which by itself doesn't support aligned allocations ( laaaame :) ). Also, making things cache-line aligned lets you exploit the cache better.

A thing that I use a lot but have never seen elsewhere (although it has probably been used by plenty of coders) is to specify not only the alignment, but also the offset.. for higher control. I use this e.g. to get an inner loop already aligned (without having to pad with NOPs, equivalent instructions, or even worse an explicit JMP), but also because the limited associativity of caches sometimes requires, to exploit the caches well, different low-bit values in the addresses of the multiple data buffers you access.

PS: This is some old WatcomC/C++ code of mine to fix the limitations of malloc():

---
/****************************************************************************

;*
;* HEAPALLOCATE
;*
;* INPUT:
;* U32> Size of requested memory (in bytes)
;* U32> Memory alignment, e.g. 4 = LONGWORD aligned, 65536 = $xxxx0000
;* U32> Memory alignment offset, e.g. $8000 = from half of ^^^^^^^^^
;*
;* OUTPUT:
;* P32> ^ to the allocated memory
;*
;* NOTE: Memory alignment must be a power of 2 (example: 4,16,32,256..)
;* NOTE: The bank will always be at least 32bytes aligned automatically
;* NOTE: The constant MEMSAFE is used to allow out of limit R/W
;*
;* FAILED:
;* 0 = All right. The P32 returned is valid.
;* (error) 1 = Out of memory. The P32 returned is 0.
;*
*/

P32 _HEAPALLOCATE(U32 Size, U32 Alignment, U32 Offset) {
  P32 truept, mempt;
  // --
  if (Size==0) return(0);
  if (Alignment<32) Alignment=32;
  truept=(P32)malloc(4+Alignment+Size+MEMSAFE);
  if (truept==NULL) {
    ERROR("Not enough memory.");
    return(0);
  } else {
    // round up past the 4 bytes reserved for the original pointer, then add the requested offset
    mempt=(P32)(((U32)truept+4+Alignment)&(-Alignment))+(Offset&(Alignment-1));
    *(P32U32)(mempt-4)=(U32)truept; // stash malloc()'s pointer just below the returned block
    SUCCESS();
    return(mempt);
  }
}

P32 HEAPALLOCATE(U32 Size, U32 Alignment, U32 Offset) {
return(_HEAPALLOCATE(Size,Alignment,Offset));
}

P32 HEAPALLOCATE(U32 Size, U32 Alignment) {
return(_HEAPALLOCATE(Size,Alignment,0));
}

P32 HEAPALLOCATE(U32 Size) {
return(_HEAPALLOCATE(Size,0,0));
}

/*
;* End of HEAPALLOCATE routine
;****************************************************************************
;*
;* HEAPFREE
;*
;* INPUT:
;* P32> ^ to the allocated memory
;*
*/

void HEAPFREE(P32 Pointer) {
  if (Pointer!=0) {
    free((P32)*(P32U32)(Pointer-4)); // release via the original pointer stored by HEAPALLOCATE
  }
}

/*
;* End of HEAPFREE routine
;****************************************************************************
;*
;* HEAPAVAILABLE
;*
;* OUTPUT:
;* U32> size of available block
;*
*/

U32 HEAPAVAILABLE() {
  U32 tn, m, n;
  P32 pt;
  // --
  // probe malloc() with shrinking steps: m accumulates the size known to be allocatable
  tn=0; m=0; n=2<<28;
loop:
  tn++;
  pt=(P32)malloc(m+n);
  if (pt!=0) free(pt); else { n>>=1; goto loop; }
  m=m+n; n>>=1;
  if (n>65536) goto loop;
  // --
  return(m+n);
}

/*
;* End of HEAPAVAILABLE routine
;***************************************************************************/
---

Under Windows VirtualAlloc() returns memory with an alignment of at least 4096.. which will be enough for the usual needs.

Posted on 2002-03-26 18:10:39 by Maverick

Yes, you say that in the comment, but your code responds incorrectly to values out of range. The last version I present responds correctly to values out of range.
Yes, your method never branches for out of bounds, but it is twice as slow as mine.. so I don't see the benefit (even adding a cmp/jae would not make my method twice as slow). Also, bounds checking is a requirement that I don't have, since I wouldn't use that method to check if a char is HEX anyway (it would be very easy to add, though, probably without even having to resort to a stupid cmp/jae).
My method was just in response to this last code I saw from Jens, sorry if it seemed like something else instead:




xor edx, edx
mov eax, 1
shld edx, eax, cl
shl eax, cl
and eax, MASK1
and edx, MASK2
or eax, edx
jz @invalid

That needs a 0..63 index, and 64 iterations of it on my Athlon run at an average of 11.84375 cycles per iteration. Also, it doesn't really work, since one mask will interfere with the other, giving wrong results (the shift counts are masked mod 32, so for indexes 32..63 the bit ends up in EAX and gets tested against MASK1 instead of MASK2).


Then (but I saw it after writing mine) you wrote this routine:


mov edx,MASK2 ; load flags
xor eax,eax ; initialize result store
bt edx,ecx ; test
mov edx,MASK1 ; load flags
rcl eax,1 ; store result
bt edx,ecx ; test
rcl eax,1 ; store result
shr ecx,5 ; which bit?
bt eax,ecx ; test proper bit of results
jc INVLD ;5 cycles + branch on Athlon

64 iterations of which on my Athlon run at an average of 4.000 cycles per iteration.
(I had to invert the masks on your routine and change JNC to JC in mine below just to make results consistent: it doesn't change the execution speeds).


I simply explained step by step the reasoning that instead made me write this very different code:


MOV EBX,MASK1
MOV EAX,MASK2
CMP ECX,32 ; ECX is a 0..63 index
CMOVC EAX,EBX
BT EAX,ECX
JC .notvalid

64 iterations of which on my Athlon run at an average of 2.000 cycles per iteration.


(note that in all 3 cases the cycle counts refer to the 007E0000007E03FF test mask, just because it was the first example available).


I should point out that checking whether e.g. a char is alphanumeric using your method (with e.g. n masks to cover the whole 0..255 range, or as much of it as is required) or Jens's new memory-based one is overkill. Traditional LEA/CMP/Jxx methods (checking for _invalid_ ranges) will be faster anyway, and have the further advantage of possibly exiting at the first *statistically* convenient condition.
My post was just about having 64-bit lookup tables (because that was the topic, and I wrote my post after seeing Jens's code and The Svin's comments that followed).. and the applications were just examples to explain the technique better. As I said before, I wouldn't test for HEX or alphanumeric using this technique, but 64-bit (or 96, etc..) bitmap lookup tables based on registers can be useful elsewhere. So let's not cling to an example, it was just a practical way to explain things better.
I wish that our posts weren't read only by me, you and some others.. but that even beginners could fully learn not just to copy n' paste, but how these things really work inside, at an intuitive level. That's the reason for my dumb-style explanations of things like CMP/LEA, carry flags, SBB, XOR and so on. I may write e.g. an explanation of why the triple XOR trick (to swap variables) works, and if I write it in a dumbo style and with dumbo examples it's just because being understood also by beginners here is our highest goal.
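
For reference, since it comes up here, the triple XOR swap is simply this (it exchanges
EAX and EBX without needing a scratch register):


XOR EAX,EBX   ; EAX = A xor B
XOR EBX,EAX   ; EBX = B xor (A xor B) = A
XOR EAX,EBX   ; EAX = (A xor B) xor A = B

It works because XOR is its own inverse, so each step cancels one of the original values back out.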
Posted on 2002-03-26 19:12:02 by Maverick
Hi Maverick,
I like your code with BT JC but you have a problem with MASK1,MASK2 and registers..
Try to use memory:

MASKS dd 0,1,2,3,4,5,6,bla,bla...

and
BT MASKS,ecx
jnc @invalid
Posted on 2002-03-26 19:32:06 by buliaNaza


.data
hexmask dd 0,3FF0000h,7Eh,7Eh,0,0,0,0 ; bits 48-57 ('0'-'9'), 65-70 ('A'-'F'), 97-102 ('a'-'f')
.code
bt hexmask,edx
jnc nothex
Posted on 2002-03-26 19:37:43 by The Svin
Hi buliaNaza :)

Yes, when using tables in memory BT is a full featured instruction.

Or maybe I didn't get what you really meant?

I simply presented a 2-register-based 64-bit method, just because I saw Jens's one and The Svin's reaction to it, and thus I thought it might be of interest to many.

I'd sure use BT straight from (precached) memory in many situations, I never excluded that option ;) (nor did I exclude bitRAKE's, for that matter.. each solution has its applications... one should always test all the possible ones on the _real_ problem, and profile to find the best one).
Posted on 2002-03-26 19:39:26 by Maverick



.data
hexmask dd 0,3FF0000h,7Eh,7Eh,0,0,0,0 ; bits 48-57 ('0'-'9'), 65-70 ('A'-'F'), 97-102 ('a'-'f')
.code
bt hexmask,edx
jnc nothex


I hate to say that.. but if my code took 2.000 cycles per iteration, the above takes (on pre-cached, perfectly aligned, etc.. data) 8.3125 cycles per iteration.

I'd use the memory method on tables considerably bigger than 64 bits. My method may be extended beyond 64 bits, of course, up to where it's still useful.
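
To give an idea, this is what "extended beyond 64 bit" could look like with a third
mask (MASK1..MASK3 are placeholders for three 32-bit bitmaps covering bits 0..31,
32..63 and 64..95, and ECX is a 0..95 index):


MOV EAX,MASK3
MOV EBX,MASK2
CMP ECX,64
CMOVC EAX,EBX   ; index < 64: drop down to MASK2
MOV EBX,MASK1
CMP ECX,32
CMOVC EAX,EBX   ; index < 32: drop down to MASK1
BT EAX,ECX      ; BT on a register uses the offset mod 32
JNC .notvalid

Each extra 32 bits of table costs one more MOV/CMP/CMOVC triple, which is why past a certain size the memory version wins anyway.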
Posted on 2002-03-26 19:43:13 by Maverick
Impressive solution, Maverick, and as you said, each has its uses. Your method scales very well on the Athlon, taking a cycle for each dword.

Updated macro:
; bitmap register lookup

reglu MACRO INVLD,min,masks:VARARG
    LOCAL y,msk

    y = 0 - (min AND -32)
    IF (min + y) NE 0
        sub ecx, min + y
    ENDIF

    FOR msk, <&masks>
        IF y LE 0
            y=0-y
            mov eax,msk
        ELSE
            mov edx,msk
            cmp ecx,y
            cmovc eax,edx
        ENDIF
        y = y + 32
    ENDM
    bt eax,ecx
    jnc INVLD
ENDM

;Example:
reglu @Invalid, 30h, 007E03FFh,007E0000h
This only works for the range of masks provided - no testing is done for values greater or less - and you save two cycles over the macro above. Also, the 32-bit masks must be in reverse order compared to the macro above. There is a more optimized way to handle the constants, trying to ensure that the SUB/CMP use sign-extended bytes and that the mask moves minimize instruction size, but it is harder and I'll hand-code it. :tongue:
Posted on 2002-03-26 22:51:26 by bitRAKE
About my MASK1/MASK2 select on CMP code:
On P6 of course the CMOVC method is still the fastest.. but for P5 & Co. change my:


MOV EAX,MASK2
CMP ECX,32
SBB EBX,EBX
AND EBX,MASK1^MASK2 ; MASK1 xor MASK2
XOR EAX,EBX

to:


CMP ECX,32
SBB EBX,EBX
MOV EAX,MASK2
AND EBX,MASK1^MASK2
XOR EAX,EBX

or


CMP ECX,32
SBB EBX,EBX
MOV EAX,MASK2
AND EBX,MASK1-MASK2
ADD EAX,EBX

(which is the same thing, but may look simpler to some).

The 2nd version will probably pair better on the Pentium, and it's really faster even on the Athlon.
Posted on 2002-03-27 06:33:49 by Maverick
I hate to say that.. but if my code took 2.000 cycles per iteration

It's OK, Maverick.
But I'm again completely lost as to what your code means.
Could you give a full example piece of your code checking if Hex?
I saw just part of it, which uses a 64-bit mask; that's not the whole range of a
byte - where is the beginning?
Posted on 2002-03-27 06:33:51 by The Svin
With "2nd version" I mean the 2nd and 3rd snippets, i.e. the newer version with the instruction order rearranged.
The "1st version" is implied to be my older one.

The Svin:
Damn me for making an applicative example of that technique.. it was not really meant to be anything other than an example to see the technique in use.

What I really mean is that this is a 64 (or 96, 128, etc..) "bitmap register lookup", and nothing else. I'm not really discussing possible applications here; if I did, it was just to provide an example. I was only interested in offering an equivalent of Jens' code that you and bitRAKE seemed to find useful.. the method is a generic bitmap register lookup table, the applications are up to you.

About "ISHEX?" serious code.. well.. personally I wouldn't even check if a string is HEX, lowercase or uppercase.
I'd just make sure that the conversion routine already gets all uppercase chars, for example. But here we're really going off topic.

If I wrote and showed my simple routine it's just because you and bitRAKE said you'd have other uses for it than a simple HEX validity check.

Mine was just meant as a generic contribution; the applications or special optimizations are up to you.. I just showed an algorithm and an implementation.. but unfortunately also a badly chosen example application.

I believe there are a lot of various uses for this algorithm.
Posted on 2002-03-27 06:43:15 by Maverick
That means that the code you compared cannot really be compared.
One checks a 256-possibility range, the other 64 possibilities.
In PMMX, btw, your profile shows 5 ticks for the bt memory version.
Posted on 2002-03-27 07:26:48 by The Svin