A crapload more screwing around with type-tagging, and some light editing of the two grammar files, and I've finally got the assembler finding a matching encoding for my example input opcode statement :D

For the input "mov dword ptr,0"

the program is telling me I should be using this encoding:

#Identifier# MOV Tag0  #r/m32# r/m32 Tag7  #,# , Tag0  #imm32# imm32 Tag10  #;# ; Tag0  #o32# o32 Tag0  #Hex# C7 Tag0  #/# / Tag0  #Int# 0 Tag0  #id# id Tag0  #[# [ Tag0  #386# 386 Tag0  #]# ] Tag0  #NewLine#

which if you look closely, the opcode is C7.

MASM assembles this statement to:
C7 00 00000000

That 00 by itself is the R/M byte - and again, the encoding shows us that the encoding expects that the first operand be encoded using the r/m32 encoding table.

So - although I don't yet emit the actual encoding as per the descriptor, I am damn close, and I am a happy man.
Posted on 2010-05-16 07:18:01 by Homer

I've decided to make some dramatic changes to the Opcode grammar rules in the Main grammar.
I want to be a lot more expressive and descriptive in my opcode rules.

<reg8>  ::= AL | CL | DL | BL | AH | CH | DH | BH | R8
<reg16> ::= AX | CX | DX | BX | SP | BP | SI | DI | R16
<reg32> ::= EAX | ECX | EDX | EBX | ESP | EBP | ESI | EDI | R32
<reg64> ::= RAX | RCX | RDX | RBX | RSP | RBP | RSI | RDI | R64
<regfpu> ::= ST0 | ST1 | ST2 | ST3 | ST4 | ST5 | ST6 | ST7
<segreg> ::= CS | DS | ES  | FS | GS | SS | segreg
<creg> ::= CR0 | CR2 | CR3 | CR4 | 'CR0/2/3/4'
<dreg> ::= DR0 | DR1 | DR2 | DR3 | DR6 | DR7 | 'DR0/1/2/3/6/7'
<treg> ::= TR3 | TR4 | TR5 | TR6 | TR7 | 'TR3/4/5/6/7'
<mmxreg> ::= mmxreg !need to define these

<Mem8>  ::= byte ptr '[' <reg16> ']' | byte ptr '[' <reg32> ']' | byte ptr'[' ID ']' | '[' ID ']'
<Mem16> ::= word ptr '[' <reg16> ']' | word ptr '[' <reg32> ']' | word ptr'[' ID ']' | '[' ID ']'
<Mem32> ::= dword ptr '[' <reg16> ']' | dword ptr '[' <reg32> ']' | dword ptr'[' ID ']' | '[' ID ']'
<Mem64> ::= qword ptr '[' <reg16> ']' | qword ptr '[' <reg32> ']' | qword ptr'[' ID ']' | '[' ID ']'
<Mem80> ::= tword ptr '[' <reg16> ']' | tword ptr '[' <reg32> ']' | tword ptr'[' ID ']' | '[' ID ']'

<RegToReg8> ::= <reg8> ',' <reg8>
<RegToReg16> ::= <reg16> ',' <reg16>
<RegToReg32> ::= <reg32> ',' <reg32>

<RegToMem8>  ::= <Mem8>  ',' <reg8>
<RegToMem16> ::= <Mem16> ',' <reg16>
<RegToMem32> ::= <Mem32> ',' <reg32>

<MemToReg8> ::= <reg8>  ',' <Mem8>
<MemToReg16> ::= <reg16> ',' <Mem16>
<MemToReg32> ::= <reg32> ',' <Mem32>

<RegMem8>  ::= <Reg8> | <Mem8>
<RegMem16> ::= <Reg16> | <Mem16>
<RegMem32> ::= <Reg32> | <Mem32>

<ImmToRegMem8>  ::= <RegMem8>  ',' <imm8>
<ImmToRegMem16>  ::=  <RegMem16> ',' <imm16>
<Imm8ToRegMem16> ::=  <RegMem16> ',' <imm8>
<ImmToRegMem32>  ::=  <RegMem32> ',' <imm32>
<Imm8ToRegMem32> ::=  <RegMem32> ',' <imm8>

<RegMemToMem16> ::= <Mem16> ',' <RegMem16>
<RegMemToReg8>  ::= <Reg8>  ',' <RegMem8>
<RegMemToReg16> ::= <Reg16> ',' <RegMem16>
<RegMemToReg32> ::= <Reg32> ',' <RegMem32>

<RegToRegMem8> ::=  <RegMem8> ','  <Reg8> !r/m8,reg8
<RegToRegMem16> ::= <RegMem16> ',' <Reg16> !r/m16,reg16
<RegToRegMem32> ::= <RegMem32> ',' <Reg32> !r/m32,reg32

<Operands> ::= <RegToRegMem8>
<imm8> ::= <Immediate> ;I cant tell how large an integer is during the input parse stage
<imm16>::= <Immediate> ;so I will leave these tags in the parsetree
<imm32>::= <immediate> ;and make the comparison in the parsetree resolver

allows me to express my opcodes more clearly as:

<OpCode> ::= AAA
| AAD <imm8>
|      AAM <imm8>
| ADC <RegToRegMem8>
| ADC <RegToRegMem16>
| ADC <RegToRegMem32>
| ADC <RegMemToReg8>
| ADC <RegMemToReg16>
| ADC <RegMemToReg32>
| ADC <ImmToRegMem8>
| ADC <ImmToRegMem16>
| ADC <ImmToRegMem32>
| ADC <Imm8ToRegMem16>
| ADC <Imm8ToRegMem32>
| ADC AL  ',' <imm8> !Immediate size assumed by register size
| ADC AX  ',' <imm16>
| ADC EAX ',' <imm32>

Posted on 2010-05-17 06:16:12 by Homer

Argh, huge problems with the new grammar!
I am getting a lot of complaints from the compiler - they're not terminal errors (they are shift-reduce conflicts), but I'd rather they were not there.

There are a lot of ambiguities caused by the opcode encodings, heres a clear example:

ADC <RegMem16>  ',' <reg16> <Terminator>
ADC <reg16> ',' <RegMem16>  <Terminator>

Since <RegMem16>::= <reg16> | <Mem16> , the grammar compiler detects an ambiguity when it analyzes the second rule. Let's expand those so its even clearer:

ADC <reg16> | <Mem16>  ',' <reg16> <Terminator>
ADC <reg16> ',' <reg16> | <Mem16>  <Terminator>

It should be clear that BOTH rules cover the case "<reg16> ',' <reg16>"
This is the ambiguity the compiler is complaining about.
One way around it, although I don't like ANY of the workarounds I've found, is to clearly define the three cases separately.

ADC <reg>  ',' <reg16> <Terminator>
ADC <reg16> ',' <Mem16>  <Terminator>
ADC <Mem16> ',' <Reg16>  <Terminator>

Now we've gotten rid of the ambiguity - but now our opcode grammars dont exactly match the encodings.
We will have to rely on our opcode-encoding matching function to perform the necessary 'wildcarding' that will be needed to find a match for the opcode encodings' defined by the above three rules.

You'll notice the <terminator> symbol - that's new :)
Opcode statements can be delimited by the ';' character (multiple opcodes per line), and/or terminated by CRLF.
So you can have a single line with an opcode, optionally followed by ';', optionally followed by more opcodes.
This was actually required to eliminate some nasty Reduce-Reduce conflicts (those are terminal errors, ouchies) - I needed to stick something after opcodes - being able to delimit multiple opcodes on a single line is a nice side effect.
Posted on 2010-05-17 10:31:55 by Homer

That was an almost complete waste of one day of my life - I've reverted to the previously-posted Main Grammar.

Just spent an entire day battling reduce-shift and reduce-reduce conflicts in the grammar compiler.
Learned a few things, some "tricks" for evading/avoiding ambiguities etc, but the bottom line is clear.
It was impossible to write a grammar that defined ALL of the components of ALL the opcodes without producing a LOT of shift-reduce conflicts, and since the main grammar is still quite slim, I thought lots of conflicts was a bad idea - could begin to prevent me from reaching parts of the grammar later on.
And I figured that if I can't have ALL the components 'typed' for me by the LALR stage of the parser, then I have to perform my own analysis of input statements for SOME components - and if I have to do that, I may as well do it ALL myself, and have NO ambiguities in my grammar - I will allow very flexible and often "illegal syntax" sourcecode to pass the DFA/LALR stage, and catch it myself in the Interpreter stage.

So tonight I'll be writing a function whose job is to 'crack' an expanded OpCode reduction, analyze the form of the input in terms of the number, type and order of the components, and reformat the expression to a standard format when necessary, BEFORE passing it on to the opcode encoding matcher function.

Posted on 2010-05-18 02:16:56 by Homer
I've just been twiddling with the Main Grammar again (update attached), and while doing so, as an experiment, I decided to replace all the instances of NewLine with <Terminator>.

If you recall, NewLine is a terminal symbol that represents CRLF, a while back I made the grammar 'line based' by appending a NewLine symbol to the end of some of my grammar rules, to indicate the end of logical statements.

Well, <Terminator> is defined as follows:

<Terminator> ::= ';' | NewLine | ';' NewLine

ie, <Terminator> equals a semicolon, or a crlf, or a semicolon followed by a crlf.

Now that I've replaced all my NewLine references with <Terminator>, we are much more free to place our statements on single lines if we choose, example:

nop;if x==5;g=5;add dword ptr,2;x=x+2;endif

Again - we can use EITHER delimiter, so the last thing on a line does not require a trailing semicolon.
You can put one there if you want, but you don't have to.

So you can see that extending this <Terminator> rule across the grammar as a general delimiter has some interesting consequences.

There was also a subtle change to the expression evaluation rules in regards to Assignments (ie, meddling with buildtime variables via the =, +=, etc operators)
Posted on 2010-05-18 05:25:23 by Homer
I figured my next logical step was to be able to emit raw data to a segment (whether its code or data being irrelevant at that level... its all just bytes, yeah?)
With much hair-pulling, trial and error, I managed to find an EASY way to declare the grammar rules for "Data Declarations", for all the machine data types. ... it took me hours, and in the end, I did it with just THREE SMALL GRAMMAR RULES :D

The following example parses just fine using the attached update.

heya db "this";db "that",13,10,0;dword 200
byte 20,21,23
dw 10,9,8,4

So - it took a while, but it wasn't a total waste of a day to screw around with the grammar some more.
Next will be to implement the handler for the new nonterminal <DefineData> (see Main Grammar)
Posted on 2010-05-18 10:03:10 by Homer
I've implemented code for interpreting the 'data declarations' (with/without labels).
Very soon I will have to stop using the parsetree-evaluator for executing stuff.
In fact, I may already have overstepped the mark.
Anything that may only appear once in our sourcecode can be dealt with here.
Things like named data declarations are unique - so dealing with them now is ok.
We'll see when I begin to implement the macro engine, but I get the feeling that at some point, I'm going to be expanding token sequences and interpreting them flat like that - I hope not, because we're sure to lose a lot of contextual information that is implied by the tree structure.
It really depends how my parser feels about me forcefeeding it sequences of tokens to re-interpret.
I'm pretty sure that I can simply shove all my tokens onto the parser's Input Stack, and then call my Parse method to generate a parsetree - as long as the input stack is not empty, the DFA input stage will simply be bypassed - we already have the tokens, we just want the parser to parse them into a tree for us.
If its as easy to re-parse tokens as I think it is, then I won't ever have to deal with 'flat token sequences' unless I actually want to.... and it means that the macro engine will be able to spit out flat token sequences directly to the parser's input stack, without a care in the world about what they represent.

The Main Grammar received a little more work today too, mostly a tidy-up, removal of some unnecessary rules (simplification), and otherwise subtle changes designed to format the parsetree nicely - fewer and less-complex handlers for the nonterminals.

This brings up a good point - for anyone who intends to use my parser code at any stage, you may note your interest in a symbol of either termin or nonterminal type (or both) , declaring the type you are interested in if both exist by the same name.
But you can only register a Handler function for a NONTERMINAL grammar.
These represent the nodes of our parsetree - terminal tokens don't lead anywhere, so we don't need handlers for them.
Makes sense?

If you REALLY wish to catch terminal tokens, you need to write a handler for their nonterminal parent reduction.
IE, if you wanted to catch "StringLiteral", you would need to handle <Literal>, and in that handler, examine the child terminal token. So, it can be done, but if you have multiple Rules that contain the same terminal, you'd have to deal with all of those Rules in order to catch the terminal in all cases.
Not that I see any reason for wanting to pay so much attention to a terminal token.
If it was that important, we'd have replaced it with a nonterminal rule and handled it as usual.

Posted on 2010-05-19 07:30:02 by Homer
Did a bit more work on the data declarations stuff, its pretty complete now.
Supports string literals expressed as byte or word data, supports reals, qwords, etc, implemented code to check if numerical literals are within the binary range of the expressed datatype's size, etc.

I'm building this using OA32 / MASM.
MASM supports REAL10, but not TWORD.

Since my internal representation of integers is QWORD (and REAL8 for floats), I can't currently support 10-byte types with full numerical precision for internal maths (such as expression evaluation).
TWORD integers aren't even supported at the cpu level - but that does not mean that we can't implement them, or even larger datatypes - it just means that operations using them will require multiple instructions.

Currently I have three outstanding areas of work
#1 is the opcode encoder matching function - there's some code, but its quite incomplete, especially in the light of all the newly-supported symbols in the grammar for datatyping etc.
#2 is the COFF/OMF backend, which includes the Segment class... this has been receiving some work already.
The Segment class is absolutely naive as to the backend, which seemed to work well for Randy - it's just a mechanism for creating lists of elements , similar in some ways to his pretty-printer, but geared for runtime.
#3 is the macro engine, which currently does not exist as such - but I've written one of these before... I just have to be careful not to let my previous experience cloud my judgement wrt the current work.

OK, so I will be posting another demo + sourcecode within the next few days.
This time, the sourcecode will be complete and current.
At this time, I will ask anyone who has any passing interest in this thread to have a look over the code, get a feel for for what I'm doing and how I'm doing it, and make a few suggestions!

I do NOT want to write another macro assembler that implements existing syntaxes with their nuances - I believe it's high time we had something new and cool, and if it has to come from our own community, so much the better.
Posted on 2010-05-20 02:39:55 by Homer
<reg8>  ::= AL | CL | DL | BL | AH | CH | DH | BH | R8

Seems like using R8 would conflict with the 64bit register R8.
I assume you're not dealing with the R[8-15][|b|w|d] stuff yet.
Posted on 2010-05-20 11:42:38 by r22

Nice catch - that's just a remnant from when I backported part of the OpCode Encodings grammar back into the Main grammar. r8 Is an operand in opcode encodings as defined by intel... not meant to appear in the main grammar at all.

Posted on 2010-05-21 07:59:48 by Homer
Here's a small update to the Main Grammar, implementing support for Nested Macro Declarations, and some rules for dealing with multiple CRLF's (one or more empty lines).

To implement nested macro declarations, all I've done is move the macro declaration rule into the <Statement> group. Macro declarations contain statements - and statements can be macro declarations.

Posted on 2010-05-21 09:46:48 by Homer
Currently, I'm working on implementing handlers for Macro Declarations and their component nonterminals.
One idea I've adopted from my previous work is to keep a stack of pointers to macro declaration objects that we're still parsing (ie, a Context Heap - but just for macro declarations). This allows macro declarations to be implemented with a scope that is local to its parent declaration (if any).
The same logic can be applied when I implement structure declarations, and the counterpart is when I implement macro EXECUTIONS... here the scoping comes into its own, because we can ensure that a macro execution is only valid within the scope of the execution of its parent macro declaration  8)
And the fun part will be implementing switches to disable scoping :)

Anyway, progress is good, seems to be in the right direction, is methodical and hierarchical and most of all, consistant.
I'll be proud to show this sourcecode to the public in the coming days.

7:21:21 PM  _H_: im building a stack of pointers to macro declarations which are still being parsed, so that i can implement scoping of macro declarations, and thus their executions
7:22:04 PM  _H_: macro executions which have non-root scope must then occur within the execution scope of their parent macro declaration, or one of its ancestors

Posted on 2010-05-22 03:57:56 by Homer
OK its been implemented...

The assembler keeps a list of 'macros which have Root Scope' - this simply means that they appeared in the sourcecode alongside other stuff as usual - not nested inside another macro declaration.
Each 'root macro' can act as the head of a tree structure , where each Node is another macro declaration object.
This happens if we try to declare a macro from inside another macro declaration - ie, nested macros declarations.

Each 'root macro' acts as the head of a NAMESPACE.
If we are defining a nested macro, its name only needs to be unique to the namespace (subtree) it belongs to.

As for macro executions: when a macro is executed, the assembler will check if the execution is nested or not.
If it is, the search for a macro declaration by name will be restricted to the namespace of the macro whose execution we're already within, and its ancestors back to root.

I haven't yet introduced a switch to disable this scoping behavior.
Anyway guys, what ya think?
Posted on 2010-05-22 22:02:22 by Homer
Today I implemented the rules for Structs and Unions in the Main Grammar.
This took absolutely hours, I can't begin to describe how tedious it was, although the result looks so elegant (easy).
The main problem I had was trying to describe a masm-style "<ID> ends" rule.
After many hours, I was unable to find any way to do it cleanly, so I gave up and accepted a more simple syntax, as shown below. There is a way around it, the newest version of Gold supports something called 'Virtual Terminals' - but I wanted to avoid using that 'expert' mechanism, at least as long as possible.

Struct fields of 'known' type may end in (?),  ?  , () or nothing.
Struct fields of 'compound' type may end in <?>, <> or nothing.
Struct fields may be named, or not.
Union fields MUST be named - but otherwise are just like regular Struct fields.

The following example input is acceptable:

goose struct
myfield db ?
myfield2 dw
myfield3 dw


gander struct
grr goose
nose dive <>

As you can see, it's pretty flexible.
Next will be to implement the code handlers for this new grammar.
Some of the work has already been done with a fledgeling "_Struct object".

I am going to allow both Structs and Macros to use the same "work stack" for tracing nested declarations.
This is because I think my grammar rules already make it impossible for the declarations to ever become interleaved - we can say that we can only ever be inside one struct/union declaration at a time, so its safe to share the same stack.

The last rule - that union fields must be named - was an afterthought - we don't really need this, do we?
It just seemed pointless to have a bunch of unnamed fields who share the same physical data.

Why would you want an unnamed field in a union? Any takers?

Attached is an update of the Main Grammar containing the new stuff.

Posted on 2010-05-23 07:27:34 by Homer
Another small update to the Main Grammar... changes are to allow the Struct Declarations to contain nested struct and macro declarations, in addition to the usual fields and unions.
Macro declarations could already allow nested struct declarations, since a macro contains a bunch of <Statement>'s

gander struct
hello struct
         monza macro; echo mybum; endm
monza db ?
grr goose
nose dive <>

I'm trying to implement scoping of searches for nested declaration entities..
I've switched to keeping separate stacks for struct and macro declarations.
This lets me know that, in the example above, the macro called "monza" is in fact a root-level macro... its not nested inside another macro, so its not nested at all !!!
The structure called "hello" is nested within the structure called "gander". The idea is that "gander" is the root of a namespace, and that "hello" must be a unique name within the "gander" namespace - but otherwise could be duplicated.

I've implemented code for handling most of the new nonterminals, currently looking for a bug in my overly-complex storage scheme ... hopefully its something small (usually the case).
I'm quite happy to tear that storage scheme apart and rebuild it at this point, since I now need it 'online'.

Posted on 2010-05-24 02:12:25 by Homer

The bug was fixed, and the namespacing is working great :)

gander struct
hello struct
monza db ?
        gday hello <>
grr goose
nose dive <>

hello struct
bbb db ?

In the above example, the struct "hello" is NOT being redefined - nor is there any name collision.
The first instance is being declared as a "child of the gander namespace".
And the second is being declared as a root entity, alongside "gander".
Note that the nested structure declaration is actually being implemented in the next line... and that this structure definition is ONLY valid within the context of this structure, or its more-deeply-nested substructures.
That is to say, a structure declaration has access to the declarations of structures within the same scope, or higher, up to the root of the tree it is stored within (ie, root of a namespace).

I currently throw to an int3 if a name collision is detected for structs (no support yet for redefinition of anything) - macros need to be brought to the same stage as structs.

YAYYYYY ITS WORKING !!! (And so it should, it's a re-implementation of an approach I've used in another project).

Posted on 2010-05-24 03:00:29 by Homer
Macros now enjoy similar NameSpacing to structs.
Just finished implementing code handlers for the remaining new nonterminals ie <Union> and family.
Unions are not retained in my structures - rather, I collect their fields along with all the other fields of the struct, but I flag the unioned fields. This is not ideal and may change - I think I'd rather have a real container object for unions, which shares enough similarities with a <StructField> that it can be safely stored in the same list.

- Changed storage scheme for Unions..

Unions are containers for fields (like structs) and appear inside structs (like struct fields).
Structs now contain a list of two possible kinds of fields - union or structfield.
This is made possible simply by ensuring that both object classes contain an identifying marker at a common location.
Size and Offsets of all fields within a struct (including unioned fields) are calculated as each new field is added - the size of a union is assumed to be that of the largest member field in the union, with all its fields starting at the same offset, relative to start of struct.

Posted on 2010-05-24 07:39:24 by Homer
Another small update to the Main Grammar.
This time, I had noticed that a couple of grammars were causing problems with the Identifier Terminal... specifically, I was no longer able to declare names that are just one letter long.

You see, one of my rules in the Math Expressions mentions the character 'e', used for Exponents.
And my definition of the Identifier terminal was such that the parser could not distinguish between 'e' and 'Id'.
So now I have a quite elaborate definition for the Id terminal which prevents the user specifically from using the lone character 'e' (or E) as a naming identifier (at all), but allows any other single letter to be used alone, and generally allows all other number/letter/decorative combinations.

I was able to implement this stuff all within the 'Character Sets' par of the grammar (DFA part of the parser), so no new nonterminals were generated, although theres a couple of new terminals, but theyre just used internally, and are not particularly interesting, except as a cool example.

Sorry for the delay in posting an updated demo, I keep finding things to fix, and with frequency that I've found a little alarming. However, I suspect that is almost over.

Posted on 2010-05-25 04:10:31 by Homer
Yet another minor update to the Main Grammar.
This time, I noticed that I had not supported nested sharp braces in data declarations eg:

MyThing SomeKindaStruct <27,<"Escape",0">>

Another thing that caused a problem was the << and >> symbols, which were used for binary shifting.
Even though the rules for binaryshifting are in a totally different area of the grammar to those for data declarations, nonetheless the grammar compiler was claiming at least one 'ambiguity' existed - so the << and >> symbols just had to go. I've replaced them with shl/shr (as immediate operators, these are not opcodes).

If I hadn't,  we would have to whitespace our sourcecode to eliminate double sharp braces:
MyThing SomeKindaStruct <27,<"Escape",0"> >

I'd rather be lazy thanks.
Now I'll be able to implement some code to check the types and sizes of the input fields of initialized data structs.
For any Type, the size won't have to match, but it will have to be the same or less.
Posted on 2010-05-25 07:06:34 by Homer
At long last, here's the entire project, with all files up to date.
You can play with the demo executable although its best to have DebugCenter installed if you wanna see much.
Essentially, the demo reads "test.txt" and if it parses correctly, a little application window will appear (empty window).
If your sourcecode has something wrong, or that my grammar does not yet handle (at all, or partially), then you just won't get an application window - thats why you really need to have debugcenter installed.

It's possible to load the resource files into the previous project, which has a GUI that lets you step through the parsing process, but we are way beyond parsing now, and deep into the interpreting of the parser output.

If I've forgotten to include any files, I apologize profusely, and for that reason have included the binary executable of the project demo so you don't absolutely HAVE to build the project yourself - however I'd like to know that there's no problems in doing so if you can spare the 5 seconds or so it takes to build ;)

Posted on 2010-05-25 09:37:51 by Homer