Just finished writing a framework for DFA/LALR(n) parsing engine if anyone is interested.
If so let me know and I'll post source+demo
Posted on 2010-04-09 07:52:14 by Homer
My parsing engine takes a pluggable Grammar, which you can define yourself, so it has many uses ;)
Grammar is imported from a 'CGT' (Compiled Grammar Tables) file, which you can generate with 'Gold Builder' (an editing/compiling tool for your grammar).
Once the engine has loaded a grammar, you can feed it an input textfile to be parsed.
The parser will completely consume the file, or terminate parsing at the first syntax error.
If successful, it will return a ParseTree, which represents a tokenized and strongly-typed version of the input, which we should 'walk' in order to interpret the input.

The parsing engine has two main components:

Stage 1: DFA Lexer = Deterministic Finite Automata
Deterministic because the output from each step sets the input state of the machine for its next step.
Finite because there are a limited (although varying) number of state choices at each step.
Automata because the system can operate without intervention until/unless it reaches the 'exit state' (in our case, EOF).

The DFA takes input plaintext one character at a time, and checks it against a number of character sets to determine whether the next input character is acceptable according to the rules of our Grammar.
Entire whitespace-delimited 'words' are gathered and 'tokenized'.

Stage 2: LALR(n) Parser = Look Ahead (n tokens) Left to Right

Both of these things are driven by 'state tables'.
There are tables of data representing the current 'state' of the engine.
Each time we consume a piece of input data, we consult the current 'state' data to see which 'state' we should go to next - this is exactly what 'deterministic finite automata' means.

The parser takes one input token at a time, and builds sequences of tokens.
As each token is input, its symbol is compared to a list of potentially-acceptable symbols (associated with the current state of the engine) to determine whether the current input sequence matches the rules of our grammar.
Whenever a sequence of tokens matches a grammar rule exactly, it is removed, and replaced with a single token representing the replacement - this is known as 'reducing a rule' (actually reminds me of data compression).
Eventually, the entire input will have been reduced to a single token - whose reduction rule will match the 'start symbol' of our grammar, and represents the root node of a tree.
The engine builds a tree out of all the tokens it 'removed' whenever a 'reduction' was performed.
This tree represents all the reductions, in the reverse of the order they were performed - but it also represents the input text in a compressed form, and in a form well suited to interpretation.

Posted on 2010-04-12 00:16:57 by Homer
Yes, I am interested.

James
Posted on 2010-04-12 11:41:24 by jcfuller
Cool, I'm just making a few finishing touches, in particular the ParseTree was being constructed incorrectly.
So give me a couple more days, I'd rather not post in a broken state.
Take the time to familiarize yourself with the Gold Builder tool and with the BNF style that Gold grammar employs.
There's a number of example grammars to look at on the website.

My implementation is based on all of the example engine implementations posted on that site.
I took the features which I liked from each implementation and rolled them together using ObjAsm32/RadASM.
Especially, I greatly improved the memory management of the DFA section.
Posted on 2010-04-14 00:07:38 by Homer
Attached is a beta demo - I want feedback before I post the source.
Choose a file to parse (text.txt) and if successful, you should see the 'parse tree' appear on the app's treeview control.
This demo will leak memory like a bitch if you open multiple files, the demo isn't robust, there's no garbage collection code at this point. Basically I just wanna hear 'yeah, it works, i even tried a few other inputs and got a few to work'.
I don't wanna hear 'It started then disappeared' or 'It crashed when I opened the file'. :D
Note that its a debug build - the true size of the exe is a lot smaller.

Included are:
- tokenizer.exe demo 32bit executable
- test.cgt compiled grammar tables file
- test.grm plaintext containing the grammar before it was compiled
- test.txt example sourcecode to feed the parser

The grammar provided is for the original C (1973) specification, I think it's by Devin Cook.
I added grammar for ObjAsm32 object definitions which is what the test.txt is testing - but you can try throwing some simple C stuff at it, and it should be accepted.

Forgot to mention - after opening a source file, you can either Step through the parsing process, or Run the parser until it finishes - but the big buttons should have made it obvious :P

Attachments:
Posted on 2010-04-15 00:47:13 by Homer

I should also mention - the current implementation expects the input sourcefile to be american ascii.
The entire engine is unicode based, except for one thing, which I'll probably change...
Currently, characters are assumed to be one byte each, and input plaintext is stored as ascii strings.

I'll probably add code to check for the presence of a Byte Order Marker which would indicate more clearly the plaintext encoding we are dealing with, this project isn't finished but it''s not far off :)
Posted on 2010-04-15 00:56:39 by Homer
I've implemented the garbage collection now, made a few optimizations and tested more experimental grammars - the demo grammar now supports multiple opcode statements on c-style multiline statements, mixed inline with old C statements, at the cost of the asm ';' single line comment symbol (new grammar, new syntax, live with it? I can)

Example of asm codeblock:

{
nop ; mov eax, something; test eax,eax; jz someplace
}


So basically what I've been testing is a parser for asm which is not 'linebased' but rather based on grammar.
My previous work with XASM was quite different in some regards, since it would enter a new context without looking for closure to said context, only generating an error some time later.
Posted on 2010-04-16 01:44:39 by Homer
My intention is unclear - I will be using this for a runtime interpreter / script engine, but i really do wanna write my own assembler / compiler / compembler for multiple syntaxes under a common grammar.
And I know for a fact that I can do it.
It might be best if I began with a simple 'pretty printer' / preprocessor, but chances of me taking the easy path are slim.
Posted on 2010-04-16 01:48:11 by Homer
Go big or go home :P
Posted on 2010-04-16 12:58:00 by SpooK

{
nop ; mov eax, something; test eax,eax; jz someplace
}


That actually reminds me of bash scripting to an extent.

for song in /home/bkeller/media/music/*
do
mplayer ${song} ; read -p "Should we keep this song (y/n)?"
[ "$REPLY" == "n" ] || rm ${song}
done


Kinda like it actually. :)
Posted on 2010-04-17 11:49:14 by Synfire

At the moment I am just making some small experiments in grammar.
But ultimately I'd like to see a community-backed effort to develop a nicer grammar!
This might start with a black and white list of most useful/most annoying grammar features of some common assemblers.
Today I'll finish implementing native unicode support - the parser automatically detects a number of types of plaintext encoding, and imports the input characters as utf16 (little endian).

Posted on 2010-04-17 21:08:30 by Homer

OK, this version supports Ascii, UTF16-LE and UTF16-BE - it detects UTF8 but its not supported.
The native string representation is UTF16-LE throughout.

Also included are the files for the grammar experiments I've mentioned.
Let me know if it works ok !!
Attachments:
Posted on 2010-04-17 22:09:09 by Homer
Perfect

Biterider
Posted on 2010-04-18 00:17:13 by Biterider
Cool! ... need to make a grammar with full assembler commands =) , sad that the project Gold Parser is not written in assembler
Posted on 2010-04-18 04:37:58 by slick
The attached grammar was extended to support Directives, starting with the include directive.
Directives may appear anywhere (inside procs, macros, inside other statements, just by themselves, whatever).
You can put multiple directives on a single statement, with OR WITHOUT ';', the '{}' block delimiters, or whatever .. look at this:


include @Environ(OA32_Path)/mything.inc
include this.inc ; include that.inc ; include another.inc
include alpha.asm include ../path/file.inc include path/file.asm


Interesting times :)

Out of interest, I've been spending half my time working on an educational project involving reverse-engineering of OBJ files for the purpose of gaining insight into the behaviors of the backends of various assemblers.
Attachments:
Posted on 2010-04-18 23:52:05 by Homer
And this version supports <braced string literals> in both include statements and as a general Value...
It also supports mixed use of forward and/or backslash in filepaths:


include <My Folder Has Spaces In It/Wheres My File\Here.inc> include ../AppData.inc include My\AppCode.inc



Attachments:
Posted on 2010-04-19 00:34:51 by Homer
I've decided today to post the guts of the parser, in its current form ...
Like any of my projects, it's likely to change without notice - it's a work in progress.

The past couple of days have been spent trying to write a generic and extensible interpreter baseclass.
Maybe I'll post something about that in the near future :)

I've also began writing a new grammar from the ground up... My experiments with extending the C73 grammar have recently produced a lot of compile warnings due to ambiguities in the grammar.
This has shown me that my grammar must be constructed carefully and with thought, it's unwise to add things in an ad-hoc way. One big benefit of constructing my own grammar is that I can simply write my interpreter in parallel with the emerging grammar.
Attachments:
Posted on 2010-04-21 06:56:39 by Homer
While writing the base interpreter class, I was thinking forwards... You know, with respect to the output of my Parser (a parsetree), out of all the possible constructs of all the infinite possible grammars, theres only one exceptional case in terms of interpretation...

99.999 percent of the time, we can 'simplify' the children of a parsetree node, then solve the 'problem' posed by the node, with our pre-simplified arguments.

But theres ONE special case!

The IF Directive (buildtime, not runtime).

When we see an IF, we need to simplify and resolve the Conditional, test the result, and MAYBE we'll continue to process the children of the IF node... we do NOT need to reach down to the root of the IF subtree everytime.
This allows our interpreter to skip entire subtrees when it encounters failed IF tests during the first pass (assuming theres multiple passes, I will talk more about that).

But here's the catch:
IF belongs to some grammar or other, not ALL grammars will have a buildtime IF directive.
So it doesn't belong in my interpreter baseclass (?).

Anyway, its the ONLY special case I can think of.
In all other cases, we will reach down to the roots of the tree, generating persistant entities as we encounter them, and solving problem nodes on our return journey.

My gift for today will be the latest version of my own 'from the ground up' grammar file.
If you want to compile it for use with the previously posted demo, go download Gold Builder !
I endorse and recommend Gold - and I'm not being paid to say it.
Attachments:
Posted on 2010-04-22 02:29:28 by Homer
Brainfart!

I want to state that I have not coded for three days.
When I run into this kind of brick wall, I walk away and wait for the solution to reveal itself.
And it did.

Every node in the reduction tree represents a problem to be solved, right?
Well, IF is no different in this respect.
Let's take a look what the Parse Subtree for IF looks like..
In the case of the IF statement, we have something like:


                                <Statement>
                                          |
if <Expression> <Statements> [] endif



For those who care, means 'optional components' and [] means 'optional and repeatable'.

So for any complexity of IF statement, we can boil it down to:

TERMINAL <Statements> []

Yes?

IE, in order to process an IF Directive, we skip the current Terminal, test the Case in the next token, if its false we try more cases, if its true we swallow the statements inside that case.

So we don't really need to look ahead for the 'special case' of an IF Directive.
We can process Everything from a common handler, if we wanted to.

Posted on 2010-04-23 01:01:55 by Homer
This is cool! should have an interpreter demo up shortly :)
First goal will be to handle complex math expressions.
Actually easier than it sounds !!
You'll see it here first :D
Posted on 2010-04-23 01:29:40 by Homer