Well guys, any major dramas?
If you messed with the test sourcecode file, chances are you managed to trigger an int 3 trap - I've left dozens of them throughout the project specifically to alert me to unhandled cases.
Does anyone have any questions (or even suggestions) in regards to the project source?
Posted on 2010-05-25 23:40:11 by Homer
I'll make it a point to check it out this weekend :)
Posted on 2010-05-26 10:24:52 by SpooK
Just wanted to report that it works fine on Win7 64-bit.
It's a shame I don't have more time for thorough tests.
Posted on 2010-05-26 16:46:14 by ti_mo_n
Thanks for the feedback, good to know there are no HUGE problems already!

Next mission is to write code for the data declarations that have all the nested braces like this:

MyData SomeKindaStruct <ThisData, ThatData, <Some other embedded thing, a few zeroes maybe>, couldbemore>

I've thought for a couple days about the two ways to do it.
One of them is to do it during parsetree recursion, using a heap pointer to the current reference struct.
The other is to expand the whole thing into a statement similar to the example, and analyze it linearly, which could involve using the procedure stackframe as a heap for struct pointers.
After some soul-searching, I'm inclined to go with the latter. Even though it's 'less cool', it conforms more closely to the structure declaration field formatting, and should be easier to implement.
I'm struggling to find the motivation to get much done, kinda taking a breather.
Posted on 2010-05-29 03:23:46 by Homer
I've implemented code to express data declarations from simple structures:

viking struct
x db ?
y dw ?

mydata viking <255,255>

Where possible, the assembler will attempt to conform the user's declared data to the datatype expressed by the associated reference structure field:

viking struct
x db ?
y dw ?

mydata viking <"h","i">
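
To illustrate the conforming step, here is a minimal Python sketch. It is not the project's code: the field table VIKING and the helpers conform and emit_struct are hypothetical names, and it assumes little-endian output with db = 1 byte and dw = 2 bytes.

```python
# Hypothetical field table for the 'viking' struct: (name, size in bytes).
VIKING = [("x", 1), ("y", 2)]

def conform(value, size):
    """Coerce one user value to a little-endian field of `size` bytes."""
    if isinstance(value, str):
        if len(value) != 1:
            raise ValueError(f"string {value!r} does not fit a scalar field")
        value = ord(value)                 # "h" is conformed to its character code
    if not 0 <= value < 1 << (8 * size):
        raise ValueError(f"{value} does not fit in {size} byte(s)")
    return value.to_bytes(size, "little")

def emit_struct(fields, values):
    """Emit bytes for a flat (non-nested) struct initializer."""
    return b"".join(conform(v, size) for (_, size), v in zip(fields, values))
```

With this sketch, both <255,255> and <"h","i"> produce three bytes: one for x and two for y.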

Not yet handled is the case of a struct reference appearing as a field of the current reference struct.
This will require some kind of local stack push, as mentioned in the previous post.
But since all the simple cases are now handled, I should be able to implement this last thing relatively cleanly.

Also not done is to count the size of the labelled data declaration, and mark the label with that size.
This is important, since the Segment object will append any subsequent unlabelled data to the labelled data entity, so if we want to implement a 'sizeof' directive, we're going to need a permanent record of the size of the labelled data entity, taken at the time that the entity was declared. It's ok to tag the Label itself with the sizeof value, as the label is created once only.
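
A rough Python sketch of that idea, assuming a much simplified Segment and Label (the real classes surely carry more state, and the method names here are made up for illustration):

```python
class Label:
    """A label tagged once, at declaration time, with the size of its data."""
    def __init__(self, name, offset, size):
        self.name, self.offset, self.size = name, offset, size

class Segment:
    def __init__(self):
        self.data = bytearray()
        self.labels = {}

    def declare(self, payload, name=None):
        """Append data; only a labelled declaration records a size."""
        if name is not None:
            self.labels[name] = Label(name, len(self.data), len(payload))
        self.data += payload              # unlabelled data simply follows on

    def sizeof(self, name):
        return self.labels[name].size     # unaffected by later appends
```

Because the size is stored on the label, appending unlabelled data afterwards does not change what sizeof reports.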

Anyway, I'll hold back on posting more code until after the weekend, as I am still eager to hear from people who checked out the most recent demo I posted.

Posted on 2010-05-30 00:54:56 by Homer
I've implemented all the code to handle those complex data declarations of nested struct references.
You know, these horrible things:

mydata mystruct <data,data,<data,data,<data>,data>,data,data>

I did this by expanding the reduction completely into terminal tokens, and then analyzing the token sequence while simultaneously tracking the current reference struct and field.
The trick is to switch reference structs, using a stack to retain prior nesting levels.
In my case, I was able to just use the procedure stackframe, and PUSH/POP the current struct and field index for that.

The result is that I can detect when the user's data matches the struct declaration, and whether the user's input terminates earlier or later than dictated by the struct declaration.
All data is "flexible" in that user input will be conformed to the structfield type whenever possible (else error).
And it is not a terminal error for the user data to terminate earlier or later than dictated by the struct - the assembler will simply generate a suitable warning and deal with the situation.
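
The linear walk can be sketched in Python as follows. This is only an illustration under assumptions: the STRUCTS table, the token format (commas already dropped) and the zero-padding on early termination are my own choices, and an explicit list stands in for the procedure stackframe.

```python
# Hypothetical struct table: a field is a byte size or the name of another struct.
STRUCTS = {
    "inner": [1, 1],             # two byte fields
    "outer": [1, "inner", 2],    # byte, embedded struct, word
}

def sizeof(field):
    return field if isinstance(field, int) else sum(sizeof(f) for f in STRUCTS[field])

def emit_nested(root, tokens):
    out, warnings = bytearray(), []
    stack = []                   # saved (fields, next index) of outer nesting levels
    fields, index = [root], 0    # pseudo-level whose only field is the root struct
    for tok in tokens:
        if tok == "<":
            if index >= len(fields) or not isinstance(fields[index], str):
                raise ValueError("'<' does not match a struct field")
            stack.append((fields, index + 1))
            fields, index = STRUCTS[fields[index]], 0
        elif tok == ">":
            if not stack:
                raise ValueError("unbalanced '>'")
            if index < len(fields):
                warnings.append("data ends early; padding with zeroes")
                out += bytes(sum(sizeof(f) for f in fields[index:]))
            fields, index = stack.pop()
        elif index >= len(fields):
            warnings.append(f"extra value {tok!r} ignored")
        elif isinstance(fields[index], str):
            raise ValueError("expected '<' for an embedded struct")
        else:
            out += int(tok).to_bytes(fields[index], "little")
            index += 1
    return bytes(out), warnings
```

Switching reference structs is just a push on '<' and a pop on '>', and both early and late termination surface as warnings rather than errors.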

The code was completely hooked up to emit all data to the output Segment class via Assembler.pCurrentSegment, so I'm happy to say that directives for data declarations of all types appear to be complete (probably missed a bunch of stuff, just haven't triggered any traps lately).

Posted on 2010-05-30 03:53:32 by Homer
Corrected a small bug in the Main Grammar (and corresponding handlers) which disallowed declaring a comma-delimited series of values of a 'known' (machine) datatype.
The Main Grammar was also modified to allow for UNLABELLED data declarations and struct fields.

These example cases show what wasn't working before and is now allowed.

dat1 BYTE 0,0,0
      db <0,"hello",0,<0,0>,0,0>

Note that for data declarations of 'known datatype', the '<' and '>' characters are essentially IGNORED.
There's no need to track thru ref structs or any of that complex stuff - I simply pretend they are not there.
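
In Python terms, the 'known datatype' case boils down to something like this sketch. The (kind, value) token tuples are an assumed representation, not the project's token format.

```python
def emit_known_type(tokens):
    """Emit bytes for a `db` declaration, skipping '<' and '>' entirely."""
    out = bytearray()
    for kind, value in tokens:
        if kind == "bracket":
            continue                      # pretend the brackets are not there
        if kind == "string":
            out += value.encode("ascii")
        else:
            out.append(value)             # integer, must fit in one byte
    return bytes(out)
```

Fed the tokens of db <0,"hello",0,<0,0>,0,0>, it emits one zero byte, the five characters of the string, and five more zero bytes.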

So... what next?
If I have to ask that question, I must be getting somewhere :)

Posted on 2010-05-30 08:46:23 by Homer
The sizeof() directive was implemented within the Complex Expressions part of the Main Grammar.
You can freely use sizeof(X) or sizeof X within expressions:

x *= sizeof y*2+j
y += sizeof(j)+32

The handler as implemented only supports using sizeof with Structs - this needs extending to at least support sizeof on a 'known type'.
Posted on 2010-05-30 21:20:00 by Homer
The ++ and -- modifiers were implemented in the Evaluator class.
They can be used anywhere in an expression, except at the start.
x = 3.14159
y = x++ / 2

Precedence: all other math operators have precedence over the ++ and -- modifiers.
Thus, the above example will calculate y=x/2 , and then increment x.
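
The evaluation order can be sketched in Python like this. It is a stand-in for the Evaluator class, not a port of it: the regex handling of ++/-- and the use of Python's ast module for the arithmetic are shortcuts I am assuming for the illustration.

```python
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node, env):
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported expression")

def evaluate(expr, env):
    """Evaluate `expr`, applying postfix ++/-- only after the arithmetic is done."""
    pending = re.findall(r"([A-Za-z_]\w*)\s*(\+\+|--)", expr)
    plain = re.sub(r"([A-Za-z_]\w*)\s*(?:\+\+|--)", r"\1", expr)
    result = _eval(ast.parse(plain, mode="eval").body, env)
    for name, op in pending:
        env[name] += 1 if op == "++" else -1
    return result
```

Calling evaluate("x++ / 2", env) returns the old value of x divided by 2, and only then bumps env["x"] by one.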

Currently, only 'literals' (buildtime variables) of floating, integer and hex types can be modified, and only by +/- 1
It would be nice to extend this later, to be able to increment say, a pointer by the size of the type it points to.
I'm sure the asm puritans out there are scoffing and shaking their heads solemnly.
Hey, you don't HAVE to use any of the high level directives I'm implementing, that's your choice :P

Posted on 2010-05-31 02:04:40 by Homer
Implemented some COMMENT symbols.

/* starts a block comment
*/ ends a block comment
// is comment for rest of line (equivalent to masm ';')

These were actually a little tricky to implement - my Parser needed some slight changes to the input stage in order to suppress activity within comment blocks.
It's done now.
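
For anyone curious what suppressing input inside comments looks like, here is a small Python sketch of the idea. It works on raw text rather than on the Parser's input stage, and it does not try to protect comment symbols that appear inside string literals.

```python
def strip_comments(source):
    """Remove /* ... */ blocks and // line comments, keeping newlines."""
    out = []
    i, in_block = 0, False
    while i < len(source):
        two = source[i:i + 2]
        if in_block:
            if two == "*/":
                in_block = False
                i += 2
                continue
            if source[i] == "\n":
                out.append("\n")      # keep line numbers stable for error reports
            i += 1
        elif two == "/*":
            in_block = True
            i += 2
        elif two == "//":
            while i < len(source) and source[i] != "\n":
                i += 1                # skip to end of line, leave the newline
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Keeping the newlines from inside block comments is a design choice so that later diagnostics still point at the right source line.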

Here's the most current test sourcecode:

/*some comments

ramble ramble

y= x++ * 0.5 //im allowed to
echo "this"

moose struct
x db ?
x = (sizeof moose*2)

viking struct
x moose <?>
y dw ?

db "hello",0, z

Starting to look more like some sourcecode, yes?
And I mean, this is junk, it's just to test everything.
It was certainly gratifying to see that the data declaration at the bottom emits a byte with a value of 12 :)

I may change the comment symbols later.

Posted on 2010-05-31 05:10:46 by Homer
The Main Grammar was modified to implement data declarations of 'array' type.

xxx db 32 dup(0)
      db 26 dup ("repeat this string",13,10,0)
      real8 12 dup (13.2)
yyy viking dup (<>)

The handler to implement this is not coded yet, but should be a trivial extension of existing code.
Fun, fun!!

Every time I sit down to this project, I add one more thing, or two.
It's like reading a chapter or two of a book.
It's been that linear.
How refreshingly sane to code within such a clean and progressive framework!

Since I've been steeped in MASM syntax for so long, the syntax I've been implementing thus far is very masm-like.
But it's not the same - I can implement whatever I can imagine, so I'm hoping to hear some suggestions when I get this thing to the point that it can emit object files ;)

Posted on 2010-06-02 04:18:34 by Homer
Some slight changes were made to the <DefineData> rule, in the Main Grammar.
These were made so that the reductions for Array data declarations look very similar to reductions of regular data declarations.... so that my <DefineData> handler can deal with both array and non-array data declarations...

Implemented the code for expressing data declarations of Array type, as previously discussed.
This was done by making some modifications to the <DefineData> handler, such that it can optionally iterate when evaluating the expanded version of the declared data values.
It's not as graceful as it could be, but it works just fine, and it works for arbitrarily complex inputs, while avoiding the need to deal with all kinds of special cases.
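
As a rough Python illustration of the 'optionally iterate' approach (byte-sized data only; emit_db and emit_dup are hypothetical helpers, not the <DefineData> handler itself):

```python
def emit_db(values):
    """Emit one pass over a list of byte values and strings."""
    out = bytearray()
    for v in values:
        out += v.encode("ascii") if isinstance(v, str) else bytes([v])
    return bytes(out)

def emit_dup(count, values):
    """Emit `count dup (values...)` by repeating the ordinary emission."""
    if count < 0:
        raise ValueError("dup count must be non-negative")
    out = bytearray()
    for _ in range(count):        # a non-array declaration is the count == 1 case
        out += emit_db(values)
    return bytes(out)
```

The point is that the array case reuses the regular emission path and only adds a loop around it.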

One side-effect of implementing this code was the adding of a new method to the Interpreter ancestor class.
Interpreter.ClipTokens is able to eliminate one or more (terminal) tokens from anywhere within an (expanded) reduction. This is useful for forcing a 'recognized' reduction to look like some similar, but more simple reduction... which in turn allows us to use existing code to do SOME of the work of resolving special cases.
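
The real signature of Interpreter.ClipTokens is not shown here, so the following Python sketch only illustrates the idea, with an assumed (start, count) interface over a flat token list:

```python
def clip_tokens(tokens, start, count=1):
    """Return a copy of `tokens` with `count` tokens removed at `start`."""
    if start < 0 or count < 0 or start + count > len(tokens):
        raise IndexError("clip range outside the reduction")
    return tokens[:start] + tokens[start + count:]
```

For example, clipping two tokens at index 2 from ["xxx", "db", "32", "dup", "(", "0", ")"] leaves ["xxx", "db", "(", "0", ")"], which has the shape of a plain data declaration.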

Also, the Evaluator.Get_Value method has been made a little more intelligent, as it can now recognize various kinds of literal tokens *without* having first evaluated them via the <Literal> handler. And the handler for tagging literal values was modified such that we tag them at a lower level in the grammar .. so we don't rely on passing through any particular handler to get our tokens correctly tagged for us - and so we can rely more on tag values in higher-order handlers.

I'll do a little more testing, then I'll post another full update of the project.
Posted on 2010-06-03 02:03:26 by Homer
OK, here is a full update of the sourcecode.
This time, I have *not* included a binary.
I'd like to know if there's any problems building this project.

Also attached is a tool I've been playing with on the side, useful for exploring the internals of OBJ files.
Since that's my first target object file format, I needed to investigate what I'm expected to emit.
This tool will continue to receive updates, but will only ever be an educational tool.
Posted on 2010-06-03 02:59:09 by Homer
Handler code for the <MacroArgs> symbol was implemented in the Assembler class.
I really like how NASM lets you refer to "unnamed" macro arguments, via the %n directive.
But I like them having meaningful names too.
So I want to support both.
Anyway, now I'm correctly recording macro argument names, if present in a macro declaration.
I'm almost ready to implement code for executing instances of macros - the heart of the macro engine!
Posted on 2010-06-03 06:42:57 by Homer
Late last night, I tried to implement the second of the two macro execution modes.
Let me explain how my thoughts are going.

There are two kinds of macros, and two kinds of macro executions.

The first kind of macro is straightforward - it generates some lines of sourcecode, but does not 'return' anything to its caller.
This kind of macro can only be executed as an entire statement - not as part of another statement. In masm, these take the form of MACRONAME

The second kind of macro is able to return something to its caller (see masm's EXITM directive), and can only be executed as a subexpression of another statement. In masm, these take the form of MACRONAME [(ARG,ARGS)].
It is worth noting that the macro interpreter replaces the macro instance with the single token returned by the macro, and then reinterprets the affected subtree.

I had absolutely no trouble implementing the execution grammar for the second kind of macro - I slipped the grammar rule into the set of rules used for complex expressions, as it seemed appropriate to handle these 'inline macros' as subexpressions.

But I had a heck of a time implementing the simpler, regular, statement-based macro execution grammar.
I ran into all kinds of ambiguities involving other parts of the grammar (such as data declarations and even structure field grammars).

Then I was drawn by an error comment I generated in masm during experimentation.
It said something like "statement must appear within segment block".

I realized that masm has divided its grammar up into several groups of statements, each associated with one of the default segments (code, data, data?). This probably means that a lot of grammars are duplicated (which it turns out is an unavoidable consequence when implementing certain complex grammars) - but it also means that they are partitioned in a way that allows them to defeat a lot of the ambiguities which are plaguing me.

There appeared to be two possible solutions.
One of them is to try rewriting the grammar starting with my Segment Selector directives near the top, which sounds like a heck of a lot of work, basically a full rewrite (it's not too late!)
The other is just to bite the bullet and accept limitations in my grammar.
For example, I found that the form %MACRONAME is acceptable as a statement execution.
This I suppose would make it look like nasm in terms of macros, which probably won't be too bad.

Your thoughts?

Posted on 2010-06-03 22:53:47 by Homer

OK, I have implemented a pretty flexible EXITM directive.

- EXITM is a Statement, but can only appear inside a Macro.
- EXITM can return a list of one or more comma-delimited names, values or registers inside sharp braces.
- EXITM can alternatively return nothing at all, expressed either as <> (a pair of empty sharp braces) or by writing nothing after EXITM.
- EXITM can alternatively return a Complex Expression inside regular round braces.
  This last one allows us to return Literal Strings to the interpreter to be 'subparsed' - it allows us to use macros to 'construct' (possibly multiline) sentential sourcecode statements that will be interpreted as IF they appeared in the sourcecode directly - just for example, we could write a macro that literally writes a declaration for another macro at buildtime, based on some buildtime switches etc.
This is a VERY powerful feature which is lacking in some macro engines.

Anyway, here's some example EXITM statements that might appear in a macro:

EXITM <eax>
EXITM <j, counter, eax>
EXITM (x=x+j/2)
EXITM <("mov eax,"),j>

Any macro can contain zero or more EXITM statements, so any macro can return something.
If the macro is executed in 'statement form', the data returned by EXITM will be ignored.
But if the macro is executed in 'expression form', the data returned by EXITM will replace the macro execution tokens inline, and the resulting statement will be reparsed and reinterpreted inline...
Well, if we returned a literal string it needs to be reparsed - otherwise we can skip that part since we're not working with anything 'new'.

Sound good?

Posted on 2010-06-04 01:02:27 by Homer

In the end, the grammar for EXITM became even more flexible.
The surrounding <> are optional.
Braces are required around complex expressions, but not needed for literal strings.
Of course - I have yet to code the handler for EXITM, but I have begun writing code to perform a macro execution. So far I'm able to search for and find the macro by name with respect to its declaration scope (namespacing), and I have expanded the list of macro arguments (if any) given with the macro execution statement.
Now I need to clone the macro's payload of statements, and for each statement, perform any 'replacements' (of macro arg name references) and finally throw the statement to the Interpret() method.
Once I'm able to 'expand a macro' in this way, I'll be interested in handling the EXITM directive.

All smooth sailing!

Attached is an update of the Main Grammar, showing some new stuff for Macros.

Posted on 2010-06-04 01:38:33 by Homer

Successfully implemented code to execute (expand and interpret) "macro execution statements", complete with a mechanism for returning EXITM tokens (which I currently do nothing with).
But hey, it's a HUGE step forwards!
I'll probably now implement the code for handling "macro execution expressions", expecting to use at least some of the same code.

What I've really done is collect up the statements inside macro declarations (as a single, unexpanded reduction, equivalent to one token called <MacroStatements>), deferring their interpretation until the macro is actually executed. At that point I clone a copy of the macro's (still unexpanded) statements reduction.
I pass this copy of the unexpanded macro statements, along with a copy of the execution args, to a utility method whose purpose is to perform 'argument name replacements', before passing the (STILL unexpanded) statements to the default handler for standard interpretation.

Our macros can contain two kinds of statements.
They can contain <Statement>, and they can contain:
  'exitm' <MacroExit> <Terminator>

Normal <Statement> reductions never return any tokens - statements are wholly consumed by the interpreter.
But the exitm statement MAY return some tokens - so if we see a MacroExit statement, we need to:

#1 - stop expanding the macro statements
#2 - expand and return the <MacroExit> reduction.
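
Put together, the expansion loop looks roughly like this Python sketch. It simplifies heavily: statements are flat lists of string tokens instead of unexpanded reductions, and the macro record layout and the interpret callback are assumptions made for the example.

```python
import copy

def execute_macro(macro, args, interpret):
    """Expand one macro execution; return the tokens handed to exitm, if any."""
    bindings = dict(zip(macro["params"], args))
    for statement in copy.deepcopy(macro["body"]):      # work on a clone only
        # argument name replacement, then standard interpretation
        statement = [bindings.get(tok, tok) for tok in statement]
        if statement and statement[0].lower() == "exitm":
            return statement[1:]                        # stop expanding here
        interpret(statement)
    return []                                           # no exitm was reached
```

Statements after the first exitm that is reached are never interpreted, and a macro without exitm returns an empty token list.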

Posted on 2010-06-04 03:04:04 by Homer
Modified the grammar slightly - attached update.

Especially, see this:

<MacroX> ::=                     '%' <ID>  <ArgList> !Execute Macro statement
<Macro Execute InLine> ::= '%' <ID> '(' <ArgList> ')' !Execute Macro expression
         |    <ID> '(' <ArgList> ')' ! % is optional for expression-based macro executions

<ArgList> ::= <Value> ',' <ArgList> | <Value> |  !<--Can be nothing

There's our two forms of macro execution.
As you can see, I've decided that both kinds of macro executions should be able to begin with a '%' (for consistency), but that for 'expression based' macro executions, the '%' is entirely optional.
You can also see that I've gone to some effort to make these two reductions (MacroX and Macro Execute Inline) look very much like each other - I intend to use my new ClipTokens method to force one of them to conform strictly to the other and so be able to use the same code to handle the conformed reductions.
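
Here is a small Python sketch of what conforming the inline form to the statement form could look like on a flat token list. The function name and token layout are hypothetical, and the real work would go through the ClipTokens method rather than list slicing.

```python
def conform_inline(tokens):
    """Make an inline macro execution look like the statement form '%' ID args."""
    if not tokens or tokens[0] != "%":
        tokens = ["%"] + list(tokens)          # '%' is optional in the inline form
    if len(tokens) < 4 or tokens[2] != "(" or tokens[-1] != ")":
        raise ValueError("not an inline macro execution")
    return tokens[:2] + tokens[3:-1]           # clip the surrounding '(' and ')'
```

After this step, ["mymac", "(", "eax", ",", "4", ")"] and ["%", "mymac", "eax", ",", "4"] can be handled by the same code.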

Posted on 2010-06-04 04:29:09 by Homer

Both statement and expression-based macros have been implemented, with the exception of a missing function for performing 'argument-name replacements' prior to macro expansion and interpretation.

It's worth talking briefly about the behavior of the IF directive with respect to macros.

Normally, when the interpreter sees an IF directive, it will conditionally follow one of the blocks of casecode associated with the IF directive - that means it will always ignore the sourcecode in the cases that it did not follow - possibly all of them! And our interpreter is destructive - we'll never see this IF directive ever again (and we should not need to).

But macros defer the interpretation of any statements they contain.
So when the assembler sees our macro declaration and 'scoops up' the statements inside it, it ALSO scoops up any IF directives that are in there.
In turn, this means that the IF directives inside our macros are 'persistent' in that they exist permanently as statements of the macro, thus they will be interpreted every time the macro is executed.

Even though our interpreter is destructive, this won't destroy the content of a macro, because we hand a cloned copy of the content to the interpreter, rather than the original... thus, "permanent IFs".

This is a good thing, because although an IF condition may not be true at one point in time, it MAY be true at a later point in time, so a macro containing one or more IF directives can 'unfold' in more than one way, depending on some condition that we define.
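
A toy Python model of the 'permanent IF' behavior, assuming a made-up statement format where an IF is a tuple ("if", condition_name, block) and everything else is a plain string:

```python
import copy

def interpret(statements, env, out):
    """Destructively consume `statements`, following an IF block only if it holds."""
    while statements:
        stmt = statements.pop(0)              # destructive: seen once, then gone
        if isinstance(stmt, tuple) and stmt[0] == "if":
            _, name, block = stmt
            if env.get(name):
                interpret(block, env, out)
        else:
            out.append(stmt)

def execute_macro(body, env):
    out = []
    interpret(copy.deepcopy(body), env, out)  # the stored macro body stays intact
    return out
```

Executing the same macro body twice with different env values unfolds it differently each time, while the stored body keeps its IF statement.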

Posted on 2010-06-04 07:04:41 by Homer