I've done some really good work today.
Macro executions are now performing replacements of references to named arguments.
I was able to do this in a very efficient way...

When I enter a Macro Execution, I shove a reference to the macro declaration onto the 'macro declarations stack' (normally used to trace nested macro declarations). The rest of the application can easily see at any time which macro declaration, if any, is on top of that stack. I also mark the macro declaration with a temporary pointer to the user's list of execution args (more about that below).
As I leave the macro execution, I pop that reference from the stack.
It's worth noting that the macro on top of the stack sets the scope for subsequent macro namesearches - ie, the top macro determines the current namespace.

I realized that my macro execution code was about to call the Default Handler to interpret the macro's statements payload.
Here was my chance to perform the replacement of named macro arguments - during the conventional handling of the <ID> nonterminal !!!

So now, when the interpreter sees an <ID> token, it first type-tags the token as usual, then checks whether the macro-decls stack is empty. If it's not, it grabs the top macro reference and checks whether it's been marked with a list of execution args. If so, it checks whether the current ID's string matches any of the macro's fixed list of arg names, and on a match it replaces the <ID> token with the "nth execution arg" token. It naively replaces the whole token, not just the string, which means the replacement token retains its own type-tag and other info.

Thus, I have been able to implement the replacement of named macro arguments during the default expansion of the macro's content statements, and totally avoided the need to search for tokens suitable for testing as replacement candidates. Very efficient indeed :)
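
For anyone following along, here is a minimal Python sketch of that <ID> check. It is not the assembler's actual code: the names Token, MacroDecl, macro_decl_stack, handle_id and execute_macro are all made up for illustration, and the macro body is treated as a flat token list rather than a parse tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    type_tag: str = "ID"


@dataclass
class MacroDecl:
    arg_names: list                    # fixed list of declared argument names
    exec_args: Optional[list] = None   # temporary pointer, set only while executing


# Hypothetical stand-in for the 'macro declarations stack'.
macro_decl_stack = []


def handle_id(token):
    """Return the token to use in place of an <ID> token."""
    if not macro_decl_stack:
        return token                   # not inside a macro execution
    macro = macro_decl_stack[-1]       # top macro sets the current scope
    if macro.exec_args is None:
        return token                   # declaration only, nothing to substitute
    for n, name in enumerate(macro.arg_names):
        if name == token.text and n < len(macro.exec_args):
            # Replace the whole token, so the caller's type-tag survives.
            return macro.exec_args[n]
    return token


def execute_macro(macro, exec_args, body_tokens):
    """Push the macro, expand its body, then pop it again."""
    macro.exec_args = exec_args
    macro_decl_stack.append(macro)
    try:
        return [handle_id(t) if t.type_tag == "ID" else t for t in body_tokens]
    finally:
        macro_decl_stack.pop()
        macro.exec_args = None
```

The point of the sketch is that no separate search pass is needed: the substitution happens inside the handler that every <ID> token passes through anyway.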

Guess what - the macro engine is practically completed !!
Sure, we don't have any of the nifty directives that make macros really rumble, even simple stuff like string manipulation is missing, and we still can't get the interpreter to "reinterpret" a string after a macro has manipulated it. But these things are really asides to the main job of arg-ref replacements - the main things are all in place and hey, it's working!

Posted on 2010-06-05 02:51:47 by Homer
Another stone in place.

A far less important, but very useful addition.. the code for the echo directive was implemented - echo emits strings to stdout, and to a dedicated DebugCenter window.

I must again ask myself what's next?
Perhaps I should look at implementing the code for handling 'macro locals'.
Currently there's zero support for 'anonymous symbols' - this would be a good start on that.
And it would bring the current macro support to completion.
Very soon, I am going to have to take a good long look at the requirements of the back-end, and I'm going to have to bite the bullet and nail down the opcode-encoding pattern-matcher stuff.

D'oh - I missed my own anniversary - yesterday marks one month since this assembler project was initiated.

If anyone wishes to comment or ask questions, feel free.
Posted on 2010-06-05 04:41:23 by Homer

Corrected a small bug in Main Grammar regarding Macro Locals, and made a couple of other relevant but more subtle changes.
Implemented code to collect the list of macro local names and store it in the macro declaration.

Implementing code for handling references to macro locals will be interesting, not sure how I'll go about it yet.
Think I might leave it for tomorrow and stew over it for a while.

It's worth noting that a macro local only has ONE PURPOSE: to declare an anonymous label in the current segment, which may only be referenced by statements within the current macro. So this is not necessarily the time to implement a name aliasing scheme. We may simply wish the ExecuteMacro function to generate a unique label name for each local, and the <ID> handler to perform token-replacements for any references to local names.

Anyway, attached is the current Main Grammar file.
Posted on 2010-06-05 05:51:08 by Homer

I've decided that the solution to replacing of references to macro locals is similar to what I did for <ID>.
But this time I implement the replacement check within the <Label> handler - clearly the right place to do it.
I've put the code in place, and I'm already able to detect locals being used as labels in macros.
But I don't have any way to define unique (anonymous) symbols yet, so I guess until I do, I cannot continue.
Yay! Now I know what I'm doing next :P

Posted on 2010-06-05 07:04:05 by Homer
Macro engine is complete, more or less :D

A new method was added to the _Segment class to generate names unique to that segment.
The <Label> handler's code was extended to check if we're inside a Macro, and if so, whether the Label's ID matches any of the macro's named locals.
If we find a match, and haven't already done so, we ask the current segment to generate a unique name for this macro local, we record that name against the macro local so we can't accidentally redeclare it, and we emit a Labelled Code Element to the current segment which has the name we generated, but no other content.
We now have a mechanism for detecting subsequent references to the macro local ID which are within the macro, but which are NOT labels - and when we see them, we can replace their strings so that they refer to the local's generated name. That name is only valid within the current execution of the macro: next time we execute the macro, the locals' labels will have a whole new set of unique names generated.

I only just implemented that last part - the <ID> handler will search for and replace references to named locals with the Generated name - and will Generate a name for the named local NOW if the label hasn't yet been declared.

It's important to note that, regardless of how many locals the user declares in their macro declaration, we'll only generate a unique name when we reach a localname reference during a macro execution - we won't simply generate a name for every local that the user declared.
This means that later, when we start emitting symbols, we'll be generating as few as possible - we won't generate any unreferenced symbols, if we can help it :)
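
Here is a rough Python sketch of the lazy name generation described above. Everything in it is an assumption made for illustration: the class and method names (Segment, MacroExecution, generate_unique_name and so on) and the "??name_NNNN" naming scheme are invented, and it does not check for collisions with user-defined labels.

```python
import itertools


class Segment:
    def __init__(self, name):
        self.name = name
        self._counter = itertools.count()
        self.labels = []               # labels emitted to this segment, in order

    def generate_unique_name(self):
        # Unique within this segment only; the '??' prefix is an arbitrary choice.
        return f"??{self.name}_{next(self._counter):04X}"


class MacroExecution:
    """State for ONE execution of a macro; a new execution gets new names."""

    def __init__(self, local_names, segment):
        self.local_names = set(local_names)
        self.segment = segment
        self.generated = {}            # local name -> generated name
        self.declared = set()          # locals whose label was already emitted

    def _name_for(self, local):
        # Generate on first use only, so unreferenced locals cost nothing.
        if local not in self.generated:
            self.generated[local] = self.segment.generate_unique_name()
        return self.generated[local]

    def handle_label(self, ident):
        if ident not in self.local_names:
            self.segment.labels.append(ident)
            return ident
        name = self._name_for(ident)
        if ident not in self.declared:  # guard against redeclaring the local
            self.declared.add(ident)
            self.segment.labels.append(name)
        return name

    def handle_id(self, ident):
        # References (including forward references) get the generated name.
        if ident in self.local_names:
            return self._name_for(ident)
        return ident
```

A forward reference through handle_id and the later label through handle_label end up with the same generated name, while a local that is never referenced never gets one.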

I'd say it's about time I reposted the entire sourcecode for the project, so you guys can check out what's happening in there lately :) You'll find I've tidied up the code and improved commenting; this will be an ongoing thing for the life of the project. But I won't attach the update tonight, you can wait until tomorrow in case I have a brainwave :P

The day is coming soon when I will begin to document this assembler's (current) syntax, and separately document the sourcecode internals.
Posted on 2010-06-05 10:35:56 by Homer

OK I've attached a full update of the sourcecode.
I'm not sure that my comments sufficiently explain what's happening.
It's not really complex, but it does involve recursion, which can seem confusing.

Have a look at my name generator function, in the Segment class.
I'm not certain that name collisions are impossible; I might have to actually add code to check for that.
Posted on 2010-06-06 01:43:16 by Homer
Made the following small change to the Main Grammar:

! This is our 'Start Symbol' where our grammar begins.
! Program is defined as any number of Declarations or Statements.
<Program> ::=   NewLine  <Program>        !Deal with multiple leading CRLFS (at start of program)
| NewLine
| <Executable>
| <Statements>

!Executable programs will have the last statement = "end EntryPointName"
<Executable> ::=  <Statements>  'end' <EntryPoint> <Terminator>
 | <Statements>  'end' <EntryPoint>
<EntryPoint> ::= <ID>

There are two new symbols: <Executable> and <EntryPoint>.
If the last statement in the program is "end SomeLabel", then our parsetree will contain an <Executable> reduction right near the top (which is <Program>).
That is to say, our entire program will be a child of <Executable> (which contains <Statements>).
The assembler can immediately identify that the output is intended to be executable, simply by the presence (or absence) of the <Executable> symbol.
But more important is the <EntryPoint> symbol, which contains, via <ID>, the name of the program entrypoint label.

No code handler for this yet - but I won't be doing anything with the entrypoint for some time yet.
Anyway, it's nice to have it there in the grammar, something more that's done.

One cute side-effect of the new 'top grammar rules' is that the last statement, if it's an 'end label' statement, does not have to be terminated. Up until now I've needed there to be a linefeed after the last statement - I think it's a 'feature' of the grammar compiler I'm using, so it's nice to see there are ways around it.

Posted on 2010-06-06 03:26:17 by Homer

The code handler for <EntryPoint> was implemented.
For now, I just grab a copy of the name of the entrypoint label.
Later, the backend can search the segments for it.
I won't deal with this one any more until I need to.

Oh - there's no handler for <Executable>... I'm just letting the default handler deal with it.
The presence of <EntryPoint> tells us what we wanted to know (whether or not the program is executable).
The <Executable> rule still has value, it ensures that there can only ever be one "end label" statement, and that it must be the very last statement, if it appears at all.
Posted on 2010-06-06 06:23:29 by Homer
I am really enjoying this project.
My previous attempt at a macro assembler was very linear.
There was no parsetree, everything was flat.
And the grammar rules were implicit.
This time, I'm using compiler techniques, I feel like I will complete this project.
It is time to ask those of you who are obviously following this thread how interested you are.
Would you like to get involved? This could be YOUR assembler. Right now, it's exactly nothing.
But I am really enjoying creating it, so far it seems easy, which is not true of the previous incarnation.
Posted on 2010-06-07 03:15:29 by Homer
Another small change to the start of the Main Grammar.
Executable programs are now required to begin with a directive for choosing the valid set of opcodes, as WELL as ending with a "end EntryPoint" statement.

I've implemented masm's ".x86" directive (e.g. .486), whereby we limit the selection of opcode encodings to a given cpu instruction set (and below).

Allowing grammars such as ".486" caused a conflict with the definition of a FloatingLiteral Terminal (i.e. a floating point value). To get around this, I've insisted that floating point values must contain one or more digits BEFORE the decimal point, UNLESS they have an "f" suffix. That's all it took. So, 0.5 is valid, .5f is valid, .5 is not valid.
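
To make the floating point rule concrete, here is a small Python sketch using a regular expression. The pattern is only an assumed approximation of the rule as stated here (it ignores exponents and assumes digits are required after the decimal point); the real terminal lives in the grammar file, and the names FLOAT_LITERAL and is_float_literal are invented.

```python
import re

# Either digits on both sides of the point (optional 'f' suffix),
# or a leading point, which is only allowed with the 'f' suffix.
FLOAT_LITERAL = re.compile(r"(?:\d+\.\d+f?|\.\d+f)")


def is_float_literal(text):
    return FLOAT_LITERAL.fullmatch(text) is not None


print(is_float_literal("0.5"))   # True
print(is_float_literal(".5f"))   # True
print(is_float_literal(".5"))    # False
print(is_float_literal(".486"))  # False, so it stays free for the cpu directive
```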

!Executable programs will have the last statement = "end EntryPointName"
<Executable> ::=  <CPUSets> <Statements>  'end' <EntryPoint> <Terminator>
 | <CPUSets> <Statements>  'end' <EntryPoint>
<EntryPoint> ::= <ID>

!Executable programs will have their first statement(s) = cpu opcode family selection
<CPUSets> ::=  <CPUSet>  <MachineTypes> <Terminator>
  |    <CPUSet> <Terminator>

<CPUSet> ::= '.' <CPU>
<MachineTypes> ::= <MachineType> <MachineTypes> | <MachineType>
<CPU>     ::= 8086 | 186 | 286 | 386 | 486 | CYRIX | PENT | P6
<MachineType> ::= MMX | FPU | PRIV | UNDOC

My syntax is slightly different though: I let the user specify one or more OPTIONAL sets of instructions.
For example, going by the rules above, a line such as .486 FPU UNDOC PRIV would allow instructions from the 486 instruction set and below, including FPU instructions, including UNDOCUMENTED instructions, and including PRIVILEGED instructions - we have chosen not to use MMX ;)

Trivial code for handling the <CPU> nonterminal was implemented: I simply grab the name of the major cpu version eg "486" and store it for later.

No handler has been implemented for <MachineTypes> or <MachineType> yet...
Posted on 2010-06-08 02:46:11 by Homer
That grammar change was cleaned up a little more, and code was implemented to handle the <CPU> and <CPU_Optional> symbols - the presence of these will cause a corresponding bitflag to be enabled in the assembler's data variables.
The bitflag values come from a new enumeration, which allows any combination of the cpu instructionset identifiers.
This post will make sense once you've seen the next update ;)
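
Until then, here is a hedged Python sketch of what such a bitflag enumeration could look like. The member names are taken from the <CPU> and <MachineType> alternatives in the grammar, but the bit values, the InstructionSets name and the select_sets helper are assumptions for illustration, not the assembler's real definitions. It only sets the flag for the named cpu; any "and below" logic is left out.

```python
from enum import IntFlag


class InstructionSets(IntFlag):
    # Major cpu versions (prefixed, since identifiers can't start with a digit)
    CPU_8086 = 1 << 0
    CPU_186 = 1 << 1
    CPU_286 = 1 << 2
    CPU_386 = 1 << 3
    CPU_486 = 1 << 4
    CYRIX = 1 << 5
    PENT = 1 << 6
    P6 = 1 << 7
    # Optional instruction sets
    MMX = 1 << 8
    FPU = 1 << 9
    PRIV = 1 << 10
    UNDOC = 1 << 11


CPU_FLAGS = {
    "8086": InstructionSets.CPU_8086, "186": InstructionSets.CPU_186,
    "286": InstructionSets.CPU_286, "386": InstructionSets.CPU_386,
    "486": InstructionSets.CPU_486, "CYRIX": InstructionSets.CYRIX,
    "PENT": InstructionSets.PENT, "P6": InstructionSets.P6,
}
OPTIONAL_FLAGS = {
    "MMX": InstructionSets.MMX, "FPU": InstructionSets.FPU,
    "PRIV": InstructionSets.PRIV, "UNDOC": InstructionSets.UNDOC,
}


def select_sets(cpu, optional=()):
    """Enable the flag for the cpu plus one flag per optional set named."""
    flags = CPU_FLAGS[cpu]
    for name in optional:
        flags |= OPTIONAL_FLAGS[name]
    return flags


flags = select_sets("486", ["FPU", "UNDOC"])
print(InstructionSets.FPU in flags)   # True
print(InstructionSets.MMX in flags)   # False
```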

At this point, I am tempted to modify my OpCodes class so that it sorts the opcode encodings into families, perhaps stripping relevant tokens from the reductions. Soon I'm going to be working hard on the opcode encodings stuff, so this would be a positive step, and not just fiddling around the edges.

Posted on 2010-06-08 07:43:41 by Homer
Next step was to play around a bit more with the object file thingy.
I think I might redesign the GUI, maybe use a treeview or something. I'm not really happy with it, as much of the most useful information is still only available in debugcenter. But then I am only writing it as a learning tool, so I must not get carried away.
Perhaps someone else wants to pick up the COFF_VIEW project, maybe add a disassembler and relocation fixer-upperer...

But I am getting useful information from it, the kind that you don't really get from reading the specs.
For example, the specs state that a symbol with a class of 3 (static) and a value of 0 is a section identifier - but this is actually only true if the type and derived type are both also 0.
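
As a sketch of that observation in Python: the 18-byte record layout and the static class value of 3 come from the COFF symbol table format, while the function name is_section_symbol is invented, and the test itself only encodes what was observed in the obj files here, not an official rule.

```python
import struct

# One COFF symbol table record: Name(8) Value(4) SectionNumber(2)
# Type(2) StorageClass(1) NumberOfAuxSymbols(1) = 18 bytes, little-endian.
SYMBOL_FORMAT = "<8sIhHBB"
IMAGE_SYM_CLASS_STATIC = 3


def is_section_symbol(raw):
    """Apply the observed rule: static, value 0, and the Type field entirely 0."""
    name, value, section, sym_type, storage_class, num_aux = struct.unpack(
        SYMBOL_FORMAT, raw
    )
    return (
        storage_class == IMAGE_SYM_CLASS_STATIC
        and value == 0
        and sym_type == 0   # base type and derived type both zero
    )


section_sym = struct.pack(SYMBOL_FORMAT, b".text", 0, 1, 0x00, 3, 1)
static_func = struct.pack(SYMBOL_FORMAT, b"_helper", 0, 1, 0x20, 3, 0)
print(is_section_symbol(section_sym))  # True
print(is_section_symbol(static_func))  # False
```

The second record is a static function sitting at offset 0 of its section: class 3 and value 0, but a non-zero type, which is exactly the case the spec wording misses.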

And I understand for the first time something I read somewhere, which stated that "functions are only labels".
This is certainly true for any static (non-external) functions in our sourcecode.

This update gives much cleaner symbol data.
Posted on 2010-06-09 07:33:46 by Homer
Ah - this is MUCH better - The GUI layout is MUCH cleaner.
Support was added for looking at the Relocations table of any FileSection.
Some small bugfixes too.
(One small remaining bug, fixed here, was a GPF if you tried to view the stringtable before opening an obj file.)

I'll be spending all day today poking around in OBJ files, starting with the most simple possible (included as Simple.obj), and gradually adding things, so that I can decipher exactly what needs to be relocated, where, and why.
Posted on 2010-06-09 23:24:02 by Homer

This version addresses stability - the handling of 'auxiliary symbols' is now simpler and quite robust - and I'm handling some of those 'undocumented cases' which I mentioned.
If you can manage to trigger any INT 3, please post the offending OBJ file in this thread so I can see what happened!
Posted on 2010-06-11 02:20:01 by Homer
I got a GPF using Demo01A.obj (debug). The last listed symbol was:
ebx = 70t, Symbol#
eax -> \Code\Ma, Symbol Name
.external_syment.e_numaux = 108t, #auxiliary symbols
al = 01101111y

Posted on 2010-06-11 03:56:11 by Biterider
My bad - try this (a small bug was found after posting the demo, and corrected before your reply, I just hadn't updated).
Posted on 2010-06-11 07:13:38 by Homer
Much better. Ty

Posted on 2010-06-11 07:19:41 by Biterider
OK, I think I'm starting to get a grasp on relocs.

While interpreting the sourcecode, we will issue a reloc every time we reach a REFERENCE to a symbol('s address).
The reloc will be issued to the current segment which contained said reference.
If the symbol is declared in this object (any segment), we'll issue a type 6 reloc (Absolute 32 bit)
And if the symbol is external(ly declared), we'll issue a type 20 reloc (Relative 32 bit)
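
Written out as a Python sketch, the rule as proposed above would look like this. It only encodes the guess in this post, which is still an open question, so treat it as a statement of the idea rather than a verified rule. The constant names and the 10-byte record layout follow the COFF format for i386; make_reloc and its parameters are invented.

```python
import struct

IMAGE_REL_I386_DIR32 = 0x0006   # type 6: absolute 32 bit
IMAGE_REL_I386_REL32 = 0x0014   # type 20: relative 32 bit


def make_reloc(offset, symbol_index, is_external):
    """Build one relocation record using the rule proposed in this post.

    offset       - position of the reference within the current segment
    symbol_index - index of the referenced symbol in the symbol table
    is_external  - True if the symbol is declared outside this object
    """
    reloc_type = IMAGE_REL_I386_REL32 if is_external else IMAGE_REL_I386_DIR32
    # VirtualAddress(4) SymbolTableIndex(4) Type(2), little-endian
    return struct.pack("<IIH", offset, symbol_index, reloc_type)


record = make_reloc(0x10, 7, is_external=False)
print(len(record))                        # 10
print(struct.unpack("<IIH", record)[2])   # 6
```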

Does that sound right?
Posted on 2010-06-12 02:45:21 by Homer
In regards to the EXITM macro directive:

Code was added for handling the <MacroStatements> symbol.
This is our opportunity to look at the value returned from handling <MacroStatement>, and if we see that an EXITM was reached, to "bail out" from expanding this macro any further.
IE, we get to CONDITIONALLY expand the statements in a macro by taking some control over the expanding at this point. We're simply implementing the option to exit early from the expanding of a macro.

<MacroStatements> ::= <MacroStatement> <MacroStatements> | <MacroStatement>

I interpret the first token.
If we hit an exitm, I return immediately - I think I should clean up the unexpanded <Statements> token, and we'd really like to see the returned tokens copied into the current reduction and that child reduction deleted. Anyway...
If we didn't hit an exitm, I clip the first token.
If there's now one token left, it's <MacroStatements> - I pass it to the default handler and return.
Otherwise I release the current (<MacroStatements>) reduction, and return NULL (ok).

So - we will expand and process macro statements in order, until we reach EXITM, or run out of statements.
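
The early-exit idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: a <MacroStatements> reduction is modelled as a (first, rest) pair mirroring the right-recursive rule, statements are plain strings, and the names EXITM, handle_macro_statement and handle_macro_statements are invented.

```python
EXITM = object()   # sentinel returned when an exitm statement is reached


def handle_macro_statement(stmt, out):
    """Process one statement; report EXITM instead of emitting anything."""
    if stmt.lower() == "exitm":
        return EXITM
    out.append(stmt)
    return None


def handle_macro_statements(node, out):
    """node is (first, rest), where rest is another node or None."""
    first, rest = node
    if handle_macro_statement(first, out) is EXITM:
        return EXITM               # bail out; the rest is never expanded
    if rest is not None:
        return handle_macro_statements(rest, out)
    return None                    # ran out of statements


body = ("mov eax, 1", ("inc eax", ("exitm", ("dec eax", None))))
emitted = []
handle_macro_statements(body, emitted)
print(emitted)   # ['mov eax, 1', 'inc eax']
```

Because the EXITM result is passed straight back up through every nested <MacroStatements> level, nothing after the exitm is ever handed to the statement handler.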

Posted on 2010-06-12 04:06:31 by Homer
Today I am working on the Main Grammar again.
This time, I am improving the rules for opcodes, and specifically, the operands that are declared along with a mnemonic.
The more information I can extract from the parser input stage, the less work my opcode encoding matching function has to perform.

Firstly, a couple of the 'low order' rules for Data Declarations were slightly modified...
The new symbol <MachineDataType> defines the set of datatypes which are native to the machine.

<DefType> ::= db | dw | dd | dq | dt | <MachineDataType>
<MachineDataType> ::= byte | word | dword | qword | tword | real4 | real8 | real10

Next, I moved to the <OpCode> rules... the <Operand> rule was completely redefined.
The new <OpIndType> symbol, which implies some kind of MEMORY access, differentiates between native and user-defined datatypes.
The old <Immediate> symbol has been eliminated from the grammar, and its use in operands replaced by the more flexible <Expression> symbol (which can declare pretty much anything).

<Operand> ::= <OpIndType> <OpInd>
|      <OpInd>
| <Operandi>

<OpIndType> ::= <MachineDataType> 'PTR'
         | <UserType>        'PTR'

<OpInd> ::= '[' <Operandi> ']'
<Operandi> ::= <reg8> | <reg16> | <reg32> | <reg64> | <regfpu> | <Expression>
    |  <segreg> | <creg> | <dreg> | <treg> | <mmxreg>

I have not changed the <OpCode> rule itself, which has a great many alternatives, if you care to look at the main grammar for its definition. I'm using the parser input stage to enforce the number of operands that can appear with each mnemonic. It will be up to my code handler to pass as much information about the opcode as possible to my matching function, so it became imperative to type the operands as clearly as possible.

It was recently suggested to me that I should stop thinking about this encoding matching stuff, and just implement each opcode as a macro that emits some data to the current segment.
Sure, I could do that, in fact I expect users to be able to do that.
But the assembler will be much slower if every opcode is implemented as a macro.
And it will take me forever to write them all.

What do you think?
Posted on 2010-06-14 00:02:55 by Homer