Did a search of the site and found like a million resources for string parsing!

The problem is I'd like to create a parser (much like one done with html or asm code sytax highlighting)...

Basically I want to search a line for "keywords" and if found then execute code based on that keyword... (This is really for my winsock program, however, i'm sure it will come in handy in other examples)

For example

KeyWords2: ADC ADD AND CALL CBW CLC CLD CLI CMC CMP
CMPS CMPSB CMPSW CWD DAS DEC DIV DAA ESC HLT IDIV IMUL IN INC INT INTO IRET JA JAE JB JBE JC JCXZ JE JG JGE JL JLE
JMP JNA JNAE JNB JNBE JNC JNE JNG JNGE JNL JNLE JNO JNP JNS
JNZ JO JP JPE JPO JS JZ LAHF LDS LEA LES LODS

(using asm as an example)

If the word was "LEZ" it would check the first char of all the keyword items and then hitting LEA would start to check the individual words finding that LEZ isn't a keyowrd would do nothing...

I ask this question because I'm sure someone has done this before and I'm trying to learn how this is done... I just want to apply it to individual lines...

Sliver


-----EDIT-----
I should mention I don't know how to create an array of keywords in masm either, but I'm search for help on that right now... but if you get it first ... please point me in the right direction
Posted on 2002-01-30 10:12:41 by Sliver
There are a number of ways to approach this problem:

You could use the immediate values test:
  cmp [esi],"L"

jne @Next
cmp [esi+1],"E"
jne @Next
cmp [esi+2],"A"
jne @Next
It's really bad that there is so much code, but macros could be wrote to simplify the process. The good part is that there isn't any data to access besides the source buffer. SpASM used this method for instruction matching the last time I looked.

Another method would be to have a table of some type. Could be strings in alphabetical order. Or like in FASM, sort the strings by string length. After a positive match the table usually has pointers to code to execute, and/or a value passed to the function that responds to that string.

This thread shows how to build an array of strings:
http://www.asmcommunity.net/board/index.php?topic=3290

Also look for the Jump Table Helper macro.

There are so many ways to tackle this problem - it's best just to get some experience with different methods, so that you know their strengths and weaknesses. I'd advise using macros and flexible coding techniques unless your absolutely sure your syntax isn't going to change. (It usually does ;))

Links:
http://www.cs.vu.nl/~dick/PTAPG.html
Posted on 2002-01-30 12:22:24 by bitRAKE
I think a hash table wold be helpfull on this.
Posted on 2002-01-30 14:52:51 by dxantos
Please forgive my naiveness... What's a hash table?

I really would like to have a set-up like:


char Keywords[][80] =
{
"ADD",
"ADC",
"AND",
"CALL",
"CBW",
"CLC"
}


Sliver
Posted on 2002-01-30 19:16:13 by Sliver
Hash table is a method of quickly looking up data based upon a key value.

Pro:
It's really fast on random accesses when tuned correctly.

Cons:
You have to know the general parameters of your data to tune it correctly.
Poor at sequential access.

General idea:
You use a function that turns the key value into an array index. But note the "pigeonhole priciple" which is... if you got n beds & n+1 people then someone is bunking with someone else... or if your array has n elements & you have n+1 keys then you've got a slight problem. Solution: use a VERY SHORT linked list (or other method) to allow the same array index to hold multiple key values.

I posted a VERY PRIMITIVE one a while back in asm that should get you started. It's not the best in the world but it's something.
Thread is here

Have a google for more info.

rafe
Posted on 2002-01-31 11:58:58 by rafe
Do you havethe example you used?

downloadables tend to be deleted from the serer after a few weeks

Sliver
Posted on 2002-01-31 12:47:49 by Sliver
Oops! I'll repost what I still have when I get home.

:stupid:
Posted on 2002-01-31 12:57:23 by rafe
A sorted array can be searched in log N time without a hash,
but it's not very programmer friendly. ;)
Posted on 2002-01-31 13:10:16 by bitRAKE
actually, i've found binary search to be easier to work with.

& given that O(log n) is only a bit bigger than O(n) in this case (log(n) ~ 8-10 or so). a properly coded binary search has a low mulplicative constant & both have similar setup times (additive constants)

sooo.... binary search may be a viable option in this case. I'd recommend presorting the data & then putting the literals presorted into your table if you go this route.

Unless you're thinking of storing all the variables too. then the log(n) gets into the ~15-16 range (look at the big includes). The variable names can get large to so keeping the array orthogonal would reqire an array of pointers. still doable (& may be preferable) but you're getting into the range where hashing can help or could be considered.

i've done this math before. ;)

rafe
Posted on 2002-01-31 13:35:43 by rafe
This is an intersting thread. (I dont see much talk of "order"s ie. O(n) outside of class :) ). Im surprised you remember to use such stuff (not to say i think its wrong, just i couldnt never find the interest to remember it all :) )

Anywho, my two cents was to use a double linked list. But reading Rafe's thoughts on a binary tree, the need for the "double" is not all that important since there is only 26 odd entries (A-Z).

Then it came to mind that perhaps you can take the first letter, subtract off 0x41 (hex), and use this as an index offset into a table of individual linked lists. Where the list will be all the different "commands" that start with that letter.

So this would be a pseudo-double-linked list, as there is not an initial list storing other list pointers, but rather a table of 26 pointers to the lists.

Hmmm... Sliver, If your still looking for help next week, bring this topic up again, and i'll wip together a double linked list and a binary tree (Alrready have a double linked list written). But its coded with our object model tho.. Objects are the best way to implement these things.

:alright:
NaN
Posted on 2002-01-31 16:47:19 by NaN
My current assembler finds the instruction in eight compares, with support for 208 instruction mnemonics. It's all pretty nasty code. There is separate handling for labels using hashing.
Posted on 2002-01-31 17:17:05 by bitRAKE
Then it came to mind that perhaps you can take the first letter, subtract off 0x41 (hex), and use this as an index offset into a table of individual linked lists. Where the list will be all the different "commands" that start with that letter
This is definitely a hashing algorithm, where the hash function returns as a result -- the first character less 0x41.

There are at least three ways to organize a hash table. The above is exactly what rafe initially described. Another way is to make the full table the same as the hash table -- this requires a way to handle collisions, the situation where an array entry is already occupied by a valid value. And there is a variation on the linked list version, where each node on the list contains not one, but several values held in a bucket. The last version can reduce the delays caused by cache misses. (Big if the cache is a disk cache.)
Posted on 2002-01-31 19:59:42 by tank
I'm trying to implement a keyword/parser...

How it works (based on icz's tut35)

have a keyword list file...

parse the file and store words into a structure (the word/the size of word/the color (not used)/and pointer to next structure)

The problem is... I can't seem to find the pointer to the first structure...

I can find the first structure (compared the first word in the list with with an exact match)...

I've gotten like 180 GPF errors in trying to solve this... So I'd appreciate a point in the right direction... I've included the source, but I'll quote some source here too



ParseBuffer proc uses edi esi hHeap:DWORD,pBuffer:DWORD, nSize:DWORD, ArrayOffset:DWORD,pArray:DWORD
LOCAL buffer[128]:BYTE
LOCAL InProgress:DWORD
mov InProgress,FALSE
lea esi,buffer
mov edi,pBuffer
invoke CharLower,edi
mov ecx,nSize
SearchLoop:
or ecx,ecx
jz Finished
cmp byte ptr [edi]," "
je EndOfWord
cmp byte ptr [edi],9 ; tab
je EndOfWord
mov InProgress,TRUE
mov al,byte ptr [edi]
mov byte ptr [esi],al
inc esi
SkipIt:
inc edi
dec ecx
jmp SearchLoop
EndOfWord:
cmp InProgress,TRUE
je WordFound
jmp SkipIt
WordFound:
mov byte ptr [esi],0
push ecx
;========================================================
; store the word in a WORDINFO structure
;========================================================
invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,sizeof WORDINFO
mov RootNode, eax
push esi
mov esi,eax
assume esi:ptr WORDINFO
invoke lstrlen,addr buffer
mov [esi].WordLen,eax
push ArrayOffset
pop [esi].pColor
inc eax
invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,eax
mov [esi].pszWord,eax
mov edx,eax
invoke lstrcpy,edx,addr buffer
mov eax,pArray
movzx edx,byte ptr [buffer]
shl edx,2 ; multiply by 4
add eax,edx
.if dword ptr [eax]==0
mov dword ptr [eax],esi
.else
push dword ptr [eax]
pop [esi].NextLink
mov dword ptr [eax],esi
.endif

invoke lstrcmpi,ADDR buffer ,ADDR test2
or eax,eax
.if (eax == 0)


invoke wsprintf,addr tempBuffer,offset _str,[esi].pszWord
invoke MessageBox,0,addr tempBuffer,addr tempBuffer,MB_OK
.endif

pop esi
pop ecx
lea esi,buffer
mov InProgress,FALSE
jmp SkipIt
Finished:
.if InProgress==TRUE
;========================================================
; store the word in a WORDINFO structure
;========================================================
invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,sizeof WORDINFO
push esi
mov esi,eax
assume esi:ptr WORDINFO
invoke lstrlen,addr buffer
mov [esi].WordLen,eax
push ArrayOffset
pop [esi].pColor
inc eax
invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,eax
mov [esi].pszWord,eax
mov edx,eax
invoke lstrcpy,edx,addr buffer
mov eax,pArray
movzx edx,byte ptr [buffer]
shl edx,2 ; multiply by 4
add eax,edx
.if dword ptr [eax]==0
mov dword ptr [eax],esi
.else
push dword ptr [eax]
pop [esi].NextLink
mov dword ptr [eax],esi
.endif
pop esi
.endif

mov edx, RootNode
assume eax:ptr WORDINFO
push (WORDINFO ptr [edx]).NextLink
pop eax
invoke MessageBox, 0, [eax].pszWord, [eax].pszWord, MB_OK

assume eax:nothing
ret
ParseBuffer endp

FillHiliteInfo proc uses edi
LOCAL buffer[1024]:BYTE
LOCAL pTemp:DWORD
LOCAL BlockSize:DWORD
;===================================================================
; Zero out the array
;===================================================================
invoke RtlZeroMemory,addr ASMSyntaxArray,sizeof ASMSyntaxArray
;===================================================================
; obtaining the path of this program instance
;===================================================================
invoke GetModuleFileName,hInstance,addr buffer,sizeof buffer
invoke lstrlen,addr buffer
mov ecx,eax
dec ecx
lea edi,buffer
add edi,ecx
std
mov al,"\"
repne scasb
cld
inc edi
mov byte ptr [edi],0
invoke lstrcat,addr buffer,addr WordFileName
;==================================================================
; Check whether the file exists
;==================================================================
invoke GetFileAttributes,addr buffer
.if eax!=-1
;===================================================================
; allocate a block of memory from the heap for the strings
;===================================================================
mov BlockSize,1024*10
invoke HeapAlloc,hMainHeap,0,BlockSize
mov pTemp,eax
@@:
invoke GetPrivateProfileString,addr ASMSection,addr C1Key,addr ZeroString,pTemp,BlockSize,addr buffer
.if eax!=0
inc eax
.if eax==BlockSize ; the buffer is too small
add BlockSize,1024*10
invoke HeapReAlloc,hMainHeap,0,pTemp,BlockSize
mov pTemp,eax
jmp @B
.endif
mov edx,offset ASMColorArray
invoke ParseBuffer,hMainHeap,pTemp,eax,edx,addr ASMSyntaxArray
.endif
invoke HeapFree,hMainHeap,0,pTemp
.endif
ret
FillHiliteInfo endp
Posted on 2002-01-31 20:27:27 by Sliver
Here's the hash table dll skeleton. It's not god's gift to hash tables, testing was light, but it should give you the general layout.

This implementation is for a static hash table. that is, one where you set up your data up front & then don't change it. which should suit your needs well enuf.

I've also worked on a dynamic hash table, one more like Perl's hashes, but it's not done (tested at all)... short attention span (-:
Posted on 2002-02-01 08:06:01 by rafe
Do you have an example to go with the dll application?

I'm still not sure how you can use a hash table for parsing information and such...

Thanks in advance
Sliver
Posted on 2002-02-01 19:45:22 by Sliver
The hashing is used to speed up the the search within an array of items - it doesn't do the parsing. You need to code the parsing algorithm manually (yuck!), or build the parser from a definition in another language with another tool (ie lex).
Posted on 2002-02-01 20:53:26 by bitRAKE
Then how does this relate to my initial question?
:(

Can someone answer my other question about my keyword parser then?

Hash tables seems a little too much to tackle if I haven't got the parser working yet...

Sliver
Posted on 2002-02-02 00:11:45 by Sliver