Now that I have your attention :P

Attached is the source code for a new OA32 Object.
It's a BaseClass: you're meant to derive specializations from it, which I have, and can demonstrate, but I digress... this class is for PARSING INPUT FILES.
It does not give a hoot about structure; it's a glorified byte fetcher.
Support for massive files is built in, and there are two modes of operation.
You can choose from "PUSH" and "PULL" modes.

When you choose "push" (driving) mode, everything is automagical.
The main entry function (ParseFile) won't return until the whole file is parsed.
All the input characters are presented to a dynamic 'onParseChar' method, which you would redefine in your derived classes... I designed this thing with "push" mode in mind as the default, and added "pull" mode as an afterthought.

In "pull" (driven) mode, ParseFile returns immediately, and you must make repeated calls to ParsenextChar until it returns NULL. In this case, 'onParseChar' is NOT automatically called for you.. its up to you what to do with each character you parse.

If you like this code, buy me a beer.
And if you have any bug reports, feature requests, or criticisms, let me know :)
Attachments:
Posted on 2007-09-25 06:00:12 by Homer
Thanks for the new object Homer, I'll definitely have a use for it.
Next time I'm in OZ, I'll buy you that beer. Last time I was there was in
1986, so you might have a little wait. :lol:

bFeedMeSeymour
  :lol: Little Shop of Horrors?

Rags
Posted on 2007-09-25 08:30:46 by rags

No free sex? I've been fooled!

Posted on 2007-09-25 10:02:04 by HeLLoWorld
Thanks to this post I now have a terrible rash!  :shock:
Posted on 2007-09-25 15:32:54 by madprgmr
> No free sex?


- What is love?
- Love was invented by Russians to not pay money

a very old anecdote :)
Posted on 2007-09-26 03:54:44 by Shoo
Since there's at least a passing interest in this code, I will offer to post a derived class for parsing plaintext which utilizes the other dynamic handlers. It collects LINEFEED-delimited "statements", supports the use of a "multiline delimiter" (like MASM's), and manages a database of unique 'Terms' (where a Term is a whitespace-delimited 'word' from a Statement), so that each Statement is represented by a collection of references to unique Terms. It also strips extraneous whitespace, comments and empty lines from the input stream.

I am playing with this code at the moment in regard to my quirky cross-assembler project, but I have generalized everything into useful, reusable modules because, like anyone, I get sick of writing the same code again and again...
So far I've used the "StatementParser" class in a GUI demo which builds a 'Sentence Tree', whose structure is based on the Statements collected from a source code input file.
I don't seriously intend to use this Tree in this form, but I thought it was interesting to see an entire source code presented in visual Tree form, while being 'compressed' in regard to duplicate Terms.

So I offer both - a Statement parser derived from the Parser baseclass, and an odd little demo that uses it.
Put up your hand if you like free stuff :)

And yes Rags, you are correct about the reference to LSOH.
Some of my peers complain about having trouble inventing meaningful names for things.
I don't see the problem, just get your sense of humour out and use it :P

Posted on 2007-09-26 08:46:01 by Homer
My Hands are up. :D
Posted on 2007-09-26 12:15:14 by rags
Attached are two files.
StatementParser is an OA32 Object which derives from Parser, as described previously.
Its main job is to collect Statements, tag them with the File and LineNumber from which they were sourced, and get rid of whitespace and Comments - but it does a fair bit more.

NOTE!! The Dynamic Methods in Parser, such as onParseChar, are 'directly' replaced with the new ones in this class, so that while Parser.ParseFile THINKS it is calling Parser.onParseChar, it is in fact now calling StatementParser.onParseChar instead :)
The Ancestor is driving its Descendant. That's confusing at first, but you'll get over it...
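In OA32 terms, the derived object declares the replacement roughly like this (a from-memory sketch; the exact macro spellings and the onParseChar signature may differ from the attachment):

;From-memory sketch of the override declaration - not verbatim
;from the attachment, and the signature may differ.
Object StatementParser, StatementParserID, Parser
    RedefineMethod    onParseChar          ;replaces Parser.onParseChar
    ;...additional members and methods...
ObjectEnd

Method StatementParser.onParseChar, uses esi, dChar:DWORD
    SetObject esi                          ;esi -> this instance
    ;...collect dChar into the Current Statement...
MethodEnd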

Each Statement is represented by a collection of Terms, and the Terms themselves are represented by Tokens, where a Token is a container struct, similar to a Statement.
StatementParser will attempt to tag each Token (each Term) with a value which loosely represents its Type.
The Types handled natively by StatementParser are: Decimal, Hex, Binary, FloatingPoint, Operator, and "Other".
Operator includes anything that is made of one or more non-alphanumeric characters, like "!=".
Other implies that we could not (yet) determine the Type of the Term (implying that we might be able to do so at a later stage).
If we positively identify a Term as any kind of Immediate Value, we tag the Token with that Value, and if we can't identify the Term as being a Value, we store a pointer to the Term string (within the collection of unique Terms).
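Conceptually, a Token looks something like this (a hypothetical layout for illustration only; the real struct is in the attachment and its field names will differ):

;Hypothetical Token layout - illustration only, not the real struct.
TT_OTHER    equ 0        ;Type not (yet) determined
TT_DECIMAL  equ 1
TT_HEX      equ 2
TT_BINARY   equ 3
TT_FLOAT    equ 4
TT_OPERATOR equ 5

TOKEN struct
    dType       dd ?     ;one of the TT_* equates above
    union
        dValue  dd ?     ;immediate Value, if positively identified
        pTerm   dd ?     ;else pointer to the unique Term string
    ends
TOKEN ends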

The second attached file is called 'Helpers', and contains the functions required to identify the natively handled Term types and some other junk.

I won't post the demo app just yet as I want to clean it up... you probably don't NEED to see these objects implemented in order to use them. At the moment it implements a third-tier Parser object called RuleLearner, which knows how to read my Grammar Rules from my custom text file; I don't want to post that object at this point in time, as I am far from happy with it.

In fact I'm not happy with this one either - if anyone reworks it, I ask that you repost it here. Meanwhile I'll be working on the RuleLearner class, and on a new ancestor for the Parser class called something like 'FormatRecognizer', whose job is to discern between ASCII and UTF-8 at parse time.
Currently, Parser only supports 8-bit ASCII.
I believe I can, with very few changes, support UTF-8 and ASCII in the same framework, noting that UTF-8 (in its original form) supports code points up to 31 bits wide, consuming all of the wide formats in its wake.
Having said that, handling input stream conversions from other wide formats to UTF-8 is a piece of cake, so all the formats can be supported by wrapping the front with yet another layer.
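To give an idea of the per-byte test FormatRecognizer will need to make, here's a rough sketch (my own illustration, not code from the attachment; the original 5- and 6-byte UTF-8 forms are omitted for brevity):

;Returns in eax the byte length of the UTF-8 sequence whose lead
;byte is passed in dChar, or 0 for a continuation/invalid lead byte.
Utf8SeqLen proc dChar:DWORD
    mov al, byte ptr dChar
    test al, 80h
    jz len1             ;0xxxxxxx => plain 7-bit ASCII, 1 byte
    mov ah, al
    and ah, 0E0h
    cmp ah, 0C0h
    je len2             ;110xxxxx => 2-byte sequence
    mov ah, al
    and ah, 0F0h
    cmp ah, 0E0h
    je len3             ;1110xxxx => 3-byte sequence
    mov ah, al
    and ah, 0F8h
    cmp ah, 0F0h
    je len4             ;11110xxx => 4-byte sequence
    xor eax, eax        ;10xxxxxx continuation byte, or invalid lead
    ret
len1:
    mov eax, 1
    ret
len2:
    mov eax, 2
    ret
len3:
    mov eax, 3
    ret
len4:
    mov eax, 4
    ret
Utf8SeqLen endp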

Anyway, I'm more interested in the top end of the machine at the moment, i.e. grammar analysis and response; I'll fiddle with the input stage more as I go, and post updates periodically.


I just realized that I didn't finish implementing all the Types, because I had yet to (re)write conversion functions such as StringToFloat - but it's easy enough to complete that.
Posted on 2007-09-27 03:37:33 by Homer
I found a few little bugs in StatementParser - nothing that will prevent it from operating, just bugs related to the identification of Term types.

I'm not happy with the scheme used for multi-line statements, or with the general assumption that a LineFeed marks the end of a Statement by default.
I might let the user set up optional delimiters which trigger calls to user callbacks or something.

I'm also not happy with the naive collecting of non-alphanumerics into Terms.
There are some exceptions where we'd rather not have the parser join up these characters.
It's OK to think of, say, "!=" or "::" that way, but stuff like Braces should always be single characters, yes?
We don't wanna interpret "((" as a single Term, we wanna see it as two Braces.

Thoughts?
Posted on 2007-09-27 11:46:25 by Homer

> I found a few little bugs in StatementParser - nothing that will prevent it from operating, just bugs related to the identification of Term types.
>
> I'm not happy with the scheme used for multi-line statements, or with the general assumption that a LineFeed marks the end of a Statement by default.
> I might let the user set up optional delimiters which trigger calls to user callbacks or something.


In what way are you collecting tokens? How I did it is to take lines of text ending on a linefeed and turn them into tokens, and any line whose last token is "\" forces the scanner to read the next line and append its tokens to the list.
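Roughly like this, in outline (hypothetical helper names; ReadLine, TokenizeLine and LastTokenIs stand in for whatever routines you already have, and szContinue is assumed to hold "\"):

;Outline of the "\" continuation scheme - all helper names here
;are placeholders, not routines from any posted code.
NextLogicalLine proc
@@: invoke ReadLine, addr szLineBuf, LINEBUF_MAX      ;eax=0 on EOF
    test eax, eax
    jz @F
    invoke TokenizeLine, addr szLineBuf, addr TokenList
    invoke LastTokenIs, addr TokenList, addr szContinue
    test eax, eax
    jnz @B              ;last token is "\": read and append the next line
@@: ret
NextLogicalLine endp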


> I'm also not happy with the naive collecting of non-alphanumerics into Terms.
> There are some exceptions where we'd rather not have the parser join up these characters.
> It's OK to think of, say, "!=" or "::" that way, but stuff like Braces should always be single characters, yes?
> We don't wanna interpret "((" as a single Term, we wanna see it as two Braces.
>
> Thoughts?


I'd rather see "((" as two Parentheses :P

Depends on how you intend to use such tokens later on. For generic purposes, I would say to keep non-alphanumerics that have no distinction as separate entities.
Posted on 2007-09-27 12:47:07 by SpooK
Homer,

I've only really glanced over this thread a few times because, as you know, I avoid OOP like the plague, but in the case of '((' and '!=' style parsing, I believe we discussed this some time ago. To refresh you on my stance (and to catch the board up as well): I believe it would be best NOT to catch such items as singular tokens as you are doing. Rather, have the user set up a sub-parser for the token handler '!' which sets a flag meaning '=' is expected as the next token; this flag should be cleared by your primary parser if it reads a whitespace token before obtaining the '=' token. If the whitespace token is obtained, a callback can be used to let '!' (or whatever) finish processing and return to the primary parser before handling the just-read (or soon-to-be-read) following token. This way you only have handlers for the ones you implement; the others get dealt with as normal singular tokens. I'm not sure how you could do this modularly, but you're a smart guy, I'm sure you could figure it out.

Also, a small optimization on that method (just a quick thought I'm having atm): instead of using the normal procedure to change states, it might be worth setting up a call table containing the token, the offset of a single-token handler, the offset of a special-case token handler, and the flag. This could possibly be done as a class (I'm trying to think modularly here, work with me :p). The table/child class/whatever could be used to perform a quick and dirty scan upon obtaining one of the single-character tokens. If the token is present in the table, save its index somewhere, update its flag, then do the normal check for the next character. On whitespace, call the 'single token handler' from the table; if not, call the 'special case token handler'. In the instance of a child class (or child of Collections), the two handlers would be virtual methods that could be redefined by the user, plus a static method to set the token... you'd also need a list of its "special cases" in there somewhere, can't forget about that.

Sorry if all of that is kinda scatter-brained... I started out just refreshing on an old discussion, and before I hit post I started actually thinking about it and just typed as I thought. Hope it still makes sense after I post it. ;)

Regards,
Bryant Keller
Posted on 2007-09-27 16:54:13 by Synfire
You said sub-parser; I was avoiding that.

Some other parsers are designed with subclasses for handling things as inane as one particular character; that's just dumb, wasteful, and template-bloated.

I believe we can handle specific cases in a dynamic way, without resorting to the 'urge to over-OOP' which is displayed by some high-level programmers. There is a break-even point where the overheads of OOP outweigh its benefits, and that point can only be reached via clinically poor design premises and ignorant implementations. OOP is only bad for you if you abuse it, and over-engineering is the main problem there.

So, we want to be able to describe and handle arbitrary conditions at any 'layer' of the parsing machine? I suggest that the best solution would be to introduce a class whose purpose is to 'register' conditions and their handlers, via one or more 'contracts' (the obvious one being that the method param format must be vanilla, and flexible).

I know Biterider has played around with the notion of registered functional callbacks in the past, in this case we're also registering a Condition which should trigger them, that Condition being described as a Grammar Rule.

On another note:
I have decided that I absolutely do not need to detect LineFeeds.
The parser can produce, from the input ByteStream, a stream of Tokens.
We can deduce Statements from the Token Stream via grammar analysis, with no linefeeds at all, as some HLLs do - and we can do that for ANY language.
We just need to watch the token stream for matching Grammar Rules.
If we don't find them, it's a Syntax Error.
And if we do, we can create a 'rule-ified' version of the input by substituting rule references for the matching sequences. For a purely illustrative example: given a Rule EXPR := TERM op TERM, the token sequence "a + b" collapses into a single EXPR reference, which may in turn participate in a larger Rule. At some point, that 'rule-ified' version should perfectly match some complex Rule (complete statement parsed), or partially match one or more complex Rules (statement parse in progress); otherwise, again, we have a Syntax Error.

Some rules will have Semantic Actions associated with them, and they are carried out linearly.

I took a closer look at YACC.
Man, it's a top-down recursive parser with infinite lookahead, just like I have been describing ("sentence tree").
Not only are my ideas solid, but everything I have coded lately has been done before.
Version 1 of XASM turns out to be a 'PackRat Interpreter', despite my best efforts to remain detached from any specific implementation pathway.

Well, I'm now more determined to do stuff differently just to be different :|
Posted on 2007-09-28 05:18:59 by Homer
> You said sub-parser; I was avoiding that.


I also said that a small optimization on that would be to use whatever normal procedure you are using for tokenizing to perform the state changes, together with a kind of state call table. You are right, sub-parsers are more the norm in the HL method of development, and that's mostly because in HL design you want to try and separate things so that everything is "easier to find". But in this case, you will gain much better performance by using a table of known first tokens ('!', '=', '<', '>', '+', '^', etc.), an associated flag, and the required callback. For example:

XASM_MSCT STRUCT
    lpIdentifier DD    ?  ; Pointer to a string for the token
    dwCondition  DD    ?  ; State change flag, gets updated when you
                          ; see the token (set to 1) and if it's followed by a space (set to 0)
    lpfnCallback DD    ?  ; The callback handler for the identifiers.
XASM_MSCT ENDS


That's an overly simplified idea of it; changes can (and most likely should) be made to that structure to better suit your design. Anyway, the basic idea is still the same: when the procedure that reads tokens finds the '!' character, it matches it against XASM_MSCT.lpIdentifier and sets XASM_MSCT.dwCondition to a true state. The next character is then read from the token stream. If that character is a whitespace (or any other kind of delimiter that isn't accepted to follow the first token), XASM_MSCT.dwCondition is set to false and XASM_MSCT.lpfnCallback is called to handle the, until now, unhandled token.

The XASM_MSCT.lpfnCallback routine takes as an argument its index in the table, so it can look up the XASM_MSCT.dwCondition value (or that value can be passed directly, personal preference really) and perform the proper actions. When the procedure pointed to by XASM_MSCT.lpfnCallback completes, it returns to the procedure that read the last token, which can then either continue processing that token or start over by reading a new character (which it will need to do if XASM_MSCT.dwCondition was set to true). The best way to do that would be to copy XASM_MSCT.dwCondition into EAX before clearing it at the end of the XASM_MSCT.lpfnCallback procedure, then return and let the parser check EAX to know whether it should start over.
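In rough code, that protocol looks like this (a hypothetical fragment; it assumes esi already points at the matching table entry and al holds the character just read):

;Hypothetical fragment - esi is assumed to point at the matching
;XASM_MSCT entry, al holds the character just read.
    mov [esi].XASM_MSCT.dwCondition, 1          ;saw '!', arm the flag
    ;...read the next character from the token stream into al...
    cmp al, ' '                                 ;whitespace follows?
    jne @F                                      ;no - keep building '!=' etc.
    mov [esi].XASM_MSCT.dwCondition, 0          ;yes - '!' stands alone
    call dword ptr [esi].XASM_MSCT.lpfnCallback ;handle the singular token
@@: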

As you can see, using that second method there is no need for OOP at all. I did suggest using a child class or collection, but truth is there's no real reason to use OOP; that was more for your benefit, since I figured that with so much of the project already in OOP, you might decide to make that part more modular as well. For instance, someone might want to derive various parts of it to change the conditions to their liking - say they prefer the VB-style '<>' instead of '!=', they could modify that. Personally though, I wouldn't go through the trouble; I say let them mod the table if they want that stuff. Or possibly create a method within your current object which gives users the option to update the table (even at runtime if need be) - can't ask for more dynamic than that. Erm, if you are going to do that, you might want to make it a linked list instead of a normal table, just to be safe... but for normal processing a table would be suitable.

> Well, I'm now more determined to do stuff differently just to be different


Good, there is always a need for originality. While you are searching through different compiler construction tools, you might want to check out Bison and Lemon. Those two are fairly popular as well.
Posted on 2007-09-28 13:47:17 by Synfire
I think I should describe parser behaviours within my Syntax File (where the Grammar Rules live).
I can introduce a few at the beginning, to be used while parsing the Grammar Rules themselves, and then at the end describe the 'default' parser behaviours to be used for parsing the given source language.
I.e., the parser behaviours can be described as 'inline switches' within the Syntax File.

When my parser object parses a Syntax File, it can modify its own behaviour based on the 'Parser Switches' it sees, and retain the state of those switches when it completes, leaving it in a fit state for parsing Language X files.

Now I don't need to specialize my Parser for this or that task; I need to Train it at Parse time.
And if we wished to switch parser contexts WHILE parsing an input file - for example, to parse INLINE ASM in C - we can create a Grammar Rule whose semantic action is 'directly mess with Parser switches', or alternatively 'include a Syntax File which contains a set of Parser Switch statements'.
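Purely as a sketch (I haven't settled on a format yet, so these switch names are invented on the spot), the inline switches might look something like:

.parser statement_end LF
.parser multiline \
.parser comment ;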

What do you think?
Posted on 2007-09-30 03:55:25 by Homer
That sounds fine. I'm not sure it's the most optimal solution, but it does give the most flexibility, and in this case flexibility is much more important, I suppose. I think I'm a little confused though: is this part of XASM? In other words, are you going to be loading the grammar files at runtime to compile/script some managed code? Or is this just part of a distributable object? I'm just wondering because the more we discuss it, the more "selective" (for lack of a better term) it's becoming. ;)
Posted on 2007-09-30 16:06:14 by Synfire
Obviously I intend to use it for XASM, but I have other things on my mind also.
In particular, I plan to implement a game console which allows the developer to view and edit the private fields of arbitrary objects at runtime, to script the behaviours of objects, etc.
Posted on 2007-09-30 17:49:58 by Homer
Attached is a small update of the StatementParser object.

Several small bugs squished.
Behavior has been changed slightly... any 'Special' (non-alphanumeric) symbol, such as a Brace or Math Operator, is now treated as a Term in its own right, with the exception of the "." period character.
I allow this character to appear as part of a Term.
This is done by omitting it from the hardcoded szTermDelims string.
A future update may allow the user to elect a set of 'Special' characters.
If the Period is not allowed to be part of a Term, we cannot identify Terms of FloatingPoint Type.
I retained the ability (from XASM v1) to detect various Types of Terms, and to do so 'loosely'.
For example, Hex values can use a C prefix ("0x1A") or an Asm suffix ("1Ah") - or both! - and FloatingPoint values can have an "f" suffix ("30.0f"), or you can omit it, provided there is a Period.

Example: ".IF Axx!=30.0" is delimited into Terms as follows:
.IF
Axx
!
=
30.0

As Bryant pointed out, I was beginning to specialize this code a little too much; handling of special sequences such as "!=" should be performed at a later stage.

XASM v1's parser was adept at picking out special sequences of non-alphanumeric characters, but it was very much geared toward parsing ASM source code; it was very specific and not at all generic.
I have moved this implementation away from XASM v1 and made it more flexible, at the cost of extra memory and processing overhead during the analysis phase.

Statements are now being Tokenized as I had planned.
Each plaintext input term is represented by a Token struct, and tagged with a Type.
This struct contains a pointer to a Term string, stored as a unique Lexeme.
Each plaintext input statement is represented by a Statement struct, and tagged with a LineNumber and SourceFile.
This struct contains a collection of Tokens, as well as a copy of the entire plaintext input string.
When you call StatementParser.Init, it builds a collection of Statement structs representing the entire source code, and then returns to its caller.
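Usage amounts to something like this (a sketch; the New/OCall spellings and the Init parameters here are from memory and may not match the attachment):

;Sketch of driving StatementParser - assumed signatures.
    mov pSP, $New(StatementParser)          ;create an instance
    OCall pSP::StatementParser.Init, addr szSrcFile
    ;on return, the object's pStatements collection holds one
    ;Statement struct per input statement, each tagged with its
    ;LineNumber and SourceFile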

The only other change I'd like to make so far is to add an option to handle LineFeeds differently, because we might want to parse a 'block language' like C, where statements are TYPICALLY multiline and delimited by 'bookend' context decorations.
But that can be done by the User in their Derived object.
I would recommend redefining the onParseLine and onParseTerm methods.
If you redefine onParseLine with a dummy method that does nuffin, the Current Statement will never be Collected into the pStatements collection.
And similarly, you can change the way Terms are handled by redefining onParseTerm.
None of this will alter the tracking of Line Numbers, which is done by onParseChar.
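Something like this would do it (sketch only; check the object's source for the real onParseLine signature):

;Sketch of a do-nothing override - the real signature may differ.
Method MyParser.onParseLine
    ;deliberately empty: the Current Statement is never Collected
MethodEnd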
Attachments:
Posted on 2007-10-03 02:08:01 by Homer
Another small update - the Token struct's Value union was extended to handle real8 (64-bit) floating point values.
The a2fp function header now properly preserves the registers it clobbers.
The only 'Term Type' not being handled correctly now is 'Binary'.
Would anyone care to supply functions to convert from BinaryString to dword (or better yet, qword) and back again? It's an easy and fun thing to code :)
Posted on 2007-10-03 03:58:27 by Homer
Hi
For the last proc, look at dword2bin.asm in the OA32 code library.

Regards

Biterider
Posted on 2007-10-03 15:24:40 by Biterider
That's a nice implementation! But I really needed the counterpart (ASCII string to dword / qword) at this time :P

Since we don't know the string length, we have to bytescan :|
I am limiting the length to 32 bits here.
Arbitrarily longer conversions can be performed as batches of 32 bits.
This version ignores the ASM suffix 'b' if it is found.
It assumes that the characters are only '0' and '1' (30h and 31h), and simply checks bit zero of each character.


;Convert binary string (up to 32 chars) into binary value (up to 32 bits)
;Returns eax=binary value, ecx=#Bits
bin2dword proc uses ebx pBuffer:DWORD
    xor eax,eax
    xor ecx,ecx
    xor edx,edx
    mov ebx,pBuffer
    ;Walk the string until NULL, a 'b' suffix, or 32 bits have been consumed
    .while byte ptr [ebx]!=0 && byte ptr [ebx]!="b"
        .break .if ecx==32
        shl eax,1               ;this does nothing the first time, since eax=0
        mov dl,byte ptr [ebx]   ;fetch the character ('0'=30h, '1'=31h)
        and dl,1                ;isolate bit zero of the character
        jz @F                   ;'0' - leave the new low bit clear
        or al,1                 ;'1' - set the new low bit
    @@:
        inc ebx
        inc ecx
    .endw
    ret
bin2dword endp



Posted on 2007-10-04 09:01:27 by Homer