Hello once again!

This thread will document my implementation of a general-purpose x86 assembler.
The AssemblerEngine class derives directly from MacroEngine.
Its main job is to select a binary encoding for an input opcode reduction and emit that data to an output segment, but it also needs to generate labels, etc.

The grammar defines the opcodes very loosely: just as a mnemonic and a number of operands.
It's up to the AssemblerEngine to observe operand types, and attempt to select an opcode that matches.
The opcode encodings themselves are parsed (by AssemblerEngine) from a custom text file into a searchable array.
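To give a rough idea of what I mean by a "searchable array", here's a toy sketch in Python (the real thing is an array of structs in assembly; the class and field names here are my own illustration, not the actual AssemblerEngine layout):

```python
# Sketch of a searchable opcode-encoding table. Indexing by
# (mnemonic, operand count) means a lookup only scans encodings
# that could possibly match the input statement.
from collections import defaultdict

class Encoding:
    def __init__(self, mnemonic, operand_types, opcode_bytes):
        self.mnemonic = mnemonic            # e.g. "mov"
        self.operand_types = operand_types  # e.g. ("r/m32", "imm32")
        self.opcode_bytes = opcode_bytes    # e.g. (0xC7,)

class EncodingTable:
    def __init__(self):
        self.index = defaultdict(list)

    def add(self, enc):
        self.index[(enc.mnemonic, len(enc.operand_types))].append(enc)

    def candidates(self, mnemonic, noperands):
        # All encodings sharing this mnemonic and operand count;
        # operand-type matching happens in a later pass.
        return self.index[(mnemonic, noperands)]
```

The later type-matching pass then narrows these candidates down to the one encoding to emit.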
Posted on 2010-05-04 05:19:40 by Homer
You might want to take a look at the "HLA Back Engine" (HLABE) while working on this project.  The HLABE program is an x86 object-code formatter that takes an intermediate format (mostly byte statements plus certain variable-sized instructions like conditional jumps) and translates the result into PE/COFF .obj files (Windows), ELF files (Linux and FreeBSD, among others), and Mach-O (Mac OSX).  If you conform your object output to what HLABE expects, you can automatically generate code for Windows, Mac OSX, Linux, and FreeBSD.
You can find all the source code to the HLABE on SourceForge here:

Randy Hyde
Posted on 2010-05-04 18:36:23 by rhyde
Thanks Randy, I'll be sure to take a good look at what you've been up to :)
It's interesting that you too have decided to distance your work from MASM.
I'm guessing it's for the same reason - the license restrictions regarding output for non-windows (and os2) platforms.
You didn't call it an assembler - but an object emitter ... an interesting choice.. I'm already curious to find out exactly what it provides - and what kind of license you chose.
I've already begun working on my own back end, as you can imagine.. so I'm quite looking forward to taking a peek.

Must be VERY new, there's no binary or source files available on the sourceforge page... although I didn't check SVN.
Posted on 2010-05-04 23:26:40 by Homer

Thanks Randy, I'll be sure to take a good look at what you've been up to :)
It's interesting that you too have decided to distance your work from MASM.
I'm guessing it's for the same reason - the license restrictions regarding output for non-windows (and os2) platforms.

Well, I originally moved away from MASM because MASM didn't run on Linux, FreeBSD, and Mac OSX (the other platforms that HLA supports). However, as time passed, I finally got tired of listening to people complain that "HLA isn't a *true* assembler because it doesn't directly emit object code." This was largely a historical misconception. Sometime around 2005 or 2006 I modified HLA to emit the actual opcodes for the instructions as BYTE statements in MASM/Gas/FASM/etc., so all that HLA was using the back-end assemblers for was as an object code file formatter (i.e., producing PECOFF, Elf, and Mach-O files from the binary data I'd generated) and as a jump displacement optimizer. A year or two ago, however, I bit the bullet and wrote my own back-end (HLABE) that converted an HLABE internal format directly into object output.

As for being concerned about the MASM license restrictions -- I stopped caring about that issue when I ported HLA to emit FASM, TASM, NASM, and Gas code many, many years ago.  Today, HLA's ability to emit source files for these assemblers is mainly educational -- sometimes it's nice to see what the HLA code looks like when compiled into the syntax of these other assemblers.

You didn't call it an assembler - but an object emitter ... an interesting choice.. I'm already curious to find out exactly what it provides - and what kind of license you chose.
I've already begun working on my own back end, as you can imagine.. so I'm quite looking forward to taking a peek.

Must be VERY new, there's no binary or source files available on the sourceforge page... although I didn't check SVN.

Sorry, it's all in SVN.  You can also grab a copy of HLABE as part of the HLA source code at this link:

There is no documentation for HLABE other than what is in the comments. Here is the basic format of the HLABE internal format (an ASCII memory file):

// Scan an HLABE (HLA back engine) assembly file.  Such files contain the
// following statements:
// .a  <x>           Alignment
// .b  <blist>       Byte data
// .c                Code section
// .d  <dlist>       Dword data (includes relocatable)
// .e  l1, l2        Equate
// .f  l             End of function
// .l  <llist>       Lword data
// .o  <string>      sOurce file name
// .p  l             Public symbol
// .q  <qlist>       Qword data
// .r  <x>           Reserve Storage
// .s                Static/Data section
// .t  <tlist>       TByte data
// .ub <x1>,<x2>     Duplicated byte data
// .uw <x1>,<x2>     Duplicated word data
// .ud <x1>,<x2>     Duplicated dword data
// .v                BSS section
// .w  <wlist>       Word data
// .x  lbl           External symbol
// .y                READONLY/CONST section
// .z                End of source
// :lbl              Defines label at current program counter location.
// Except for label (which is terminated by a newline character), there
// is always at least one space between the statement and any operands.

// Numbers beginning with '$' are hexadecimal values, decimal if no '$' prefix.
// Decimal numbers may contain chars 0..9 and '_'. Hexadecimal numbers may
// also contain 'a'..'f' and 'A'..'F'.
// Labels always begin with alpha or '_' character and may contain
// alphanumeric, '_', '$', '?', and "@" characters after the first char.
// In general, labels can be any length, but the object file format or specific
// linkers may enforce their own limits. As a general rule, symbols should be
// unique within the first 32 characters.
// <x>, <x1>, <x2>, and <x3> are simplified (absolute) arithmetic expressions
// defined as follows:
// $<hex digits> -- Hexadecimal value
// <dec digits> -- Decimal (base 10) value
// <x1> + <x2> -- Sum of two subexpressions
// <x1> - <x2> -- Difference of two subexpressions
// Evaluation of subexpressions is strictly left-to-right with no precedence.
// Relocatable expressions are a vector with a label component and an
// absolute expression component.  This is generally specified as <r+x>.
// The syntax for a relative expression is one of the following:
// <x> -- An absolute expression (which has a NULL relocatable value)
// id -- A relocatable identifier.
// id + <x> -- An identifier (relocatable) followed by an abs expr.
// id - <x> -- A relocatable identifier followed by an abs expr.
// Blank lines are permissible in the source file.
// Comments begin with a ';' and consume everything to the end of the line.
// .a <expr>
// The alignment statement will align the next byte emitted in the current
// section to some boundary that is a power of two. The operand is a single
// expression that evaluates to a small integer value. The alignment value
// must always be a power of two. It should also be in the range 1..16.  
// The alignment statement fills with zeros in the static/data section, with
// no-operations in a code section (this could be "multi-byte NOPs", not
// necessarily a sequence of individual NOP instructions), and does a
// reserve-storage operation in the BSS section.
// .c, .s, .v, .y
// These four statements begin (or continue) a code (.c), readonly/const (.y),
// data/static (.s), BSS (.v) section in the program. Note that multiple
// instances of each section statement may appear within a single source file.
// When multiple instances of a given section in the source file exist,
// HLABE will combine the different instances into a single section.
// Within a section, order of data/code is strictly maintained, but if multiple
// section declarations for the same section appear in a source file, there is
// no guarantee of the order the subsections will be combined. If strict ordering
// is required, the caller should combine the sections and emit them as a single
// section when creating the HLABE source file.
//  No explicit alignment is assumed when a section begins. Calling code
// must explicitly issue a ".a" statement if alignment is desired or required.
// .b, .w, .d, .q, .t, .l
// These directives emit bytes, words, doublewords, quadwords, tbytes, or
// lbytes (128-bit values) to the current section (code or data/static, these
// directives cannot appear in a BSS section).  These directives have the
// following syntax and semantics:
// .b <blist>
// <blist> is a list of one or more byte items. A byte item is either an
// expression that evaluates to a value in the range 0..$ff
// (or -128..+127) or a sequence of characters surrounded by
// quotes. If more than one byte item appears in a <blist>, the
// byte items are comma-separated. Note that quote
// characters never appear within a string (they must be converted
// to '$22' byte items). Also, only printable ASCII characters
// may appear within a quoted string (characters whose codes are
// in the range $20..$7e). All other characters must be converted
// to numeric byte item entries.
// .w <wlist>
// A <wlist> is a list of one or more word items. Word items are
// expressions that evaluate to 16-bit (or smaller) values. Multiple
// items in a <wlist> are comma-separated.  
// .d <dlist>
// A <dlist> is a list of one or more dword items. Dword items are
// relocatable or absolute expressions that evaluate to 32-bit (or smaller)
// values. Multiple items in a <dlist> are comma-separated. Pointer
// constants (relocatable objects) are also valid dword items.  A pointer
// constant is one of the following:
// lbl
// lbl+<x>
// lbl-<x>
// (lbl+<x>)
// (lbl-<x>)
// where "lbl" is a relocatable statement label and <x>
// is any valid dword expression. HLABE always emits a
// relocatable offset value for these items ( <r+x> ).
// Note that all dword constants are always a tuple. If
// the dword constant is absolute, then the relocatable component (<r>)
// is set to the NULL pointer.  
// .q <qlist>
// A <qlist> is a list of one or more qword items. Qword items are
// numeric operands that evaluate to 64-bit (or smaller) values. Multiple
// items in a <qlist> are comma-separated.  
// .t <tlist>
// A <tlist> is a list of one or more tbyte items. TByte items are
// numeric operands that evaluate to 80-bit (or smaller) values. Multiple
// items in a <tlist> are comma-separated.  
// .l <llist>
// A <llist> is a list of one or more lword items. Lword items are
// numeric operands that evaluate to 128-bit (or smaller) values. Multiple
// items in an <llist> are comma-separated.
// .ub <x1>,<x2>
// <x1> is a duplication count. <x2>  is a data value, which should be
// a byte. This directive, which is valid only in the
// code and static/data sections (illegal in the BSS section) is used to
// fill a block of memory with a specific value. The values must be
// absolute.
// .uw <x1>,<x2>
// <x1> is a duplication count. <x2>  is a data value, which should be
// a word. This directive, which is valid only in the
// code and static/data sections (illegal in the BSS section) is used to
// fill a block of memory with a specific value. The values must be
// absolute.
// .ud <x1>,<x2>
// <x1> is a duplication count. <x2>  is a data value, which should be
// a relocatable dword value. This directive, which is valid only in the
// code and static/data sections (illegal in the BSS section) is used to
// fill a block of memory with a specific value. The values can be
// absolute or relative (absolute dword values are <r+x> values with
// the relocatable field set to NULL).
// .r <x>
// Reserves <x> bytes of data at the current program counter location in
// the current section. If a code section, the reserved storage is filled
// with NOP-style instructions, if a data/static section, the reserved
// storage is filled with zeros. This statement is valid in, and is
// mainly intended for use in, a BSS section.
// .e lbl, <text>
// Equates simply do a textual substitution of the <text> operand for the label
// operand. I.e., everywhere "lbl" appears (in the example above), HLABE
// substitutes the remaining text on the line (up to the end of the line or up
// to a comment beginning with a ";") for the symbol. After substitution, HLABE
// continues processing the source line as though the <text> data originally
// appeared in place of the symbol. Note that if the <text> string contains
// other equate symbols, they will be processed as well. There is no check
// for infinite loops in the text substitution process. It is the responsibility
// of whoever created the equate(s) to ensure that a recursive definition
// does not exist. Note that only a single line of text substitution is
// possible (i.e., this is not a generalized macro facility).
// .f lbl
// Marks the end of a function. When generating ELF code, this will
// change the symbol type (of the corresponding label) in the symbol
// table and set the length of the function.
// In addition to the above statements, an HLABE program may also contain
// jmp, call, and any of the following conditional jump instructions:
// ja, jae, jb, jbe, jc, je, jg, jge, jl, jle, jna, jnae, jnb, jnbe,
// jnc, jne, jng, jnge, jnl, jnle, jno, jnp, jns, jnz, jo, jp, jpe, jpo,
// js, jz, jcxz, jecxz, loop, loope, loopne, loopz, or loopnz.
// Any number of spaces and/or tabs may precede these statements. Any number
// of spaces and/or tabs may appear between the instruction mnemonic and
// the single label operand. After the label, at least one space will appear.
// After each jump, call, or conditional jump instruction, there will always
// be a comment of the form:
// ";filename, line#"
// ";filename, line# ;filename line#; ..."
// This is a list of filenames and line numbers in the original source
// file where the statement that emitted this code can be found. If the
// statement was emitted from a macro or include file, there will be more than
// one filename/line# pair (with the last entry being the file/line# of the
// actual statement within the macro or include file). The HLABE compiler
// should parse this information and display it if there is an error
// compiling the statement (e.g., branch out of range).  Line numbers are always
// unsigned decimal integers.
// Note that call and jmp statements only appear in a source file for
// relative jumps and calls. Those that do indirect jumps or calls must be
// compiled directly to machine code by the caller.
// All reserved words use lower case characters only. Labels, however, may
// contain upper and lower case characters (and are case sensitive).
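To make two of the lexical rules above concrete, here's a toy sketch in Python -- my own illustration of a reader for this format, not HLABE code. It shows the strict left-to-right absolute-expression evaluator, and a helper that renders arbitrary bytes as a legal <blist>:

```python
# Toy sketches of two HLABE lexical rules (Python for brevity).
import re

def eval_abs_expr(s):
    # '$' prefix = hex; decimal may contain '_'; '+'/'-' evaluate
    # strictly left-to-right with no precedence, per the spec above.
    toks = re.findall(r'\$[0-9a-fA-F]+|[0-9_]+|[+-]', s.replace(' ', ''))
    def num(t):
        return int(t[1:], 16) if t.startswith('$') else int(t.replace('_', ''))
    val = num(toks[0])
    for op, t in zip(toks[1::2], toks[2::2]):
        val = val + num(t) if op == '+' else val - num(t)
    return val

def to_blist(data):
    # Render bytes as a legal <blist>: printable ASCII ($20..$7e) goes
    # inside quotes, everything else (including the quote character,
    # $22) becomes a numeric byte item.
    items, run = [], []
    def flush():
        if run:
            items.append('"%s"' % ''.join(run))
            del run[:]
    for b in data:
        if 0x20 <= b <= 0x7e and b != 0x22:
            run.append(chr(b))
        else:
            flush()
            items.append('$%x' % b)
    flush()
    return ', '.join(items)
```

So a string containing a quote and a NUL comes out as separate quoted runs and numeric items, exactly as the rules require.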

If you run the HLA compiler, you can tell it to produce HLABE output (rather than object code) by using the -source -hlabe command line options. For example, here's a test file I had lying around:

unit t;

#includeonce ("t.hhf")

?@nodisplay := true;
?@noalignstack := true;

// _VMT_tList___hla_ :dword; @external;

listebx :text := "(type fileNode_t)";

method tFileList.remove ( var lNode:baseNode_t);
begin remove;
mov (lNode, ebx);
str.free (listebx.fname);
if (listebx.iname <> 0) then
str.free (listebx.iname);
if (listebx.ipath <> 0) then
str.free (listebx.ipath);
//push (lNode);
//call ( tListVMT [@offset ( tList.remove)]);
super.remove( lNode );
end remove;

end t;

and here is the HLABE output that HLA produced for the above code using the -source -hlabe command line options:

; Assembly code emitted by HLA compiler
; Version 2.9 build 3590 (prototype)
; HLA compiler written by Randall Hyde
; HLA backend compatible output

.p tFileList_remove

.x HWexcept__hla_
.x abstract__hla_
.x Raise__hla_
.x shortDfltExcept__hla_


; procedure tFileList_remove

.b $55
.b $8b
.b $ec
.b $8b
.b $5d
.b $8
.b $ff
.b $73
.b $8
call STR_FREE ;t.hla,22
.b $83
.b $7b
.b $4
.b $0
je false__hla_1882 ;t.hla,23
.b $ff
.b $73
.b $4
call STR_FREE ;t.hla,24
.b $83
.b $7b
.b $c
.b $0
je false__hla_1883 ;t.hla,26
.b $ff
.b $73
.b $c
call STR_FREE ;t.hla,27
.b $ff
.b $75
.b $8
.b $8d
.b $3d
.d _VMT_tList_t___hla_
.b $ff
.b $17
.b $8b
.b $e5
.b $5d
.b $c2
.w $4
.f tFileList_remove


.x __imp__MessageBoxA@16
.x _VMT_tFileList___hla_
.x __imp__ExitProcess@4
.x _VMT_tBase_t___hla_
.x _VMT_tList_t___hla_
.a $4


One thing really nice about HLABE is that it does a really great job of optimizing branch displacements. Better than MASM, FASM, and most of the other assemblers I've seen (on par with NASM, which also does a great job).

BTW, HLABE is quite fast; you don't take a performance hit for using it to process the back end of your assembler's output. One of the main reasons the intermediate language is so simple is to make scanning and parsing each statement as trivial (and as fast) as possible.

Randy Hyde
Posted on 2010-05-05 16:19:47 by rhyde

The first step in building the assembler proper is to somehow load/import a description of how to form legal opcode encodings. For this, I have chosen to create a small text file containing all the opcode encoding data, and a specialized parser just to read that file.
Here I have attached the opcodes file, and the grammar for parsing it.

Posted on 2010-05-10 21:54:37 by Homer

That's half the story.
Now we need a derived Parser class which implements handlers for the nonterminals in our grammar (well, the ones that are important to us).

We want to expand each <OpCode> reduction in the parsetree, and store them out into some kind of runtime array of structs / objects. This will be the database from which we'll make opcode encoding selections later on.

It's important that we eliminate each <OpCode> reduction from the tree as soon as we've fully expanded it.
The reason is that if we don't, it will be handed back UP the tree to the parent <OpCodes> reduction, which will then do the same thing, effectively appending all the statements together into one massive statement.. that's not our goal.
Posted on 2010-05-11 07:10:03 by Homer

I've made a few more small improvements to make the parser framework rock solid.
There is now a direct line of descent from Parser, through Interpreter, etc.
I've just successfully tested a beta for the new derived 'OpCode Parser' class.
It's not much to look at yet, so no demos.. took a bit of work to recover from the changes I made.
All for the greater good :)

Anyway, I'm now at the point where I'm ready to store the parsed opcode encodings into some other form.
I'm going to stop there, because I'm not sure yet which form will prove to be most useful in terms of the pattern-matching required for selecting opcode encodings.

It's probably time to turn my attention once more to the main grammar - which I'm completely unhappy with right now.
I'll probably strip it right back to just try to implement what a crude assembler needs: opcodes, operands and little else.
This is mainly in response to some expected technical issues / implementation nightmares regarding the macro engine.

Posted on 2010-05-11 08:29:54 by Homer
Attached is an improved version of the grammar for parsing the opcodes sourcefile - note that there are fewer cases for each nonterminal (rule)... and especially, there are only two cases for <OpCode>, and they're almost identical.
This makes for less complex (smaller and faster) handling code in the parsetree resolver, but has a more dramatic effect on the overall time for tree recursion: it can make a huge difference because it makes the parsetree smaller and less deep, and so faster to resolve. Of course, if not used appropriately, it can have the opposite effect :D

Also attached is an update of the Parser baseclass, and Interpreter midclass.
Stability and speed have both been addressed, changes are minor but cumulative.

I've also extended the main grammar file, it now contains a simplified description of all of our opcodes.. in most cases, just the unique mnemonics, and number of operands. I'm not at ALL enforcing types in the dfa/lalr stage of the parser, this could change, but for now I'll be happy to accept everything and have my code try to guess the operand types.
(Basically, I did try to explicitly describe a handful of opcode encodings in the main grammar rules, and it caused so many ambiguities in the grammar that I decided this wasn't the place to handle it).

Also, the main grammar has been modified into a LINE BASED GRAMMAR.
Essentially, all that means is that we define a rule for 'new line', which we can use within our rules at our discretion, and we'll expect to see it appear as a token in our parsetree, at the end of logical statements and soforth.
(I found that a few well-placed <NewLine> delimiters in the 'high order entity' rules actually helped eliminate 99.9% of grammatic ambiguities, which became critical as the grammar grew in complexity).

I'm still of two minds as to how much I'll allow to appear on a single line, but statements delimited by something OTHER than a new line are certainly not out of the question. However, being able to detect these 'new line' tokens in the stream can still be useful as a tool for delimiting entities late in the parsing job (say, after some macro has spewed forth some inline code that our parser has never seen).

Posted on 2010-05-12 02:23:26 by Homer
Here's the latest incarnation of the main grammar file.
This one contains a 'brief' version of the opcodes, as well as some minor improvements to the grammar.
This simple grammar already generates over 1000 LALR states, and over 45,000 DFA States.
Uncompressed, this grammar compiles to a datatables file of 508 KB.

I'm implementing the following grammar:
#1-Directives must appear on their own line, terminated by <CR>,<LF>, or both.
#2-Multiple OpCodes can appear on a single line, if separated by the ';' character, and observing rule #1
#3-Complex Expressions can be used as Operands, if they eventually express a numerical value (ie #immediate).

This is perfectly legal:

a = 72.3
mov eax, a*32/12.4  ; push eax

"a" is a numerical (currently hex, dec, or float) literal representing an immediate numerical value, its not a stack local or a data variable.... we can't use those in complex expressions, only in complex addressing modes (which can involve complex expressions, heh).
In this example, I'm pushing a real4 immediate floating point value calculated at buildtime.
It's not a good or clever example, but it shows the flexibility of the syntax.
In fact it should be obvious that we could simply push (complex expression) - and in this case, we could declare the TYPE of the float we are pushing - up to real8, my current choice of internal representation.

push real8 a*32/12.4
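The build-time folding in that example can be sketched like so. This is Python for illustration only: fold_real8 is a hypothetical name, and eval stands in for the grammar's own expression reduction, which is what would really resolve the constant expression.

```python
import struct

def fold_real8(expr, literals):
    # Evaluate the constant expression at build time (eval is just a
    # stand-in for the assembler's expression resolver) and emit the
    # result as a real8: 8 little-endian bytes of an IEEE-754 double.
    value = eval(expr, {"__builtins__": {}}, dict(literals))
    return struct.pack('<d', value)
```

So `push real8 a*32/12.4` with `a = 72.3` would emit a push of those 8 computed bytes.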

I will implement the 'line continuation' symbol at a later time.

Right now, I'm playing with a class to represent an output segment, so I can get down to the dirty job of recognizing addressing forms in opcodes, and searching for matching opcode encoding descriptors.
The current demo loads the opcode descriptor grammar and then parses the opcode descriptors sourcefile using an embedded derived parser class... then it loads the main grammar file, and then attempts to parse and interpret an input sourcecode file using the main grammar... a start :)

The grammars are completely unrelated, that's why I used a separate parser-derived class for the opcode descriptors.
Those are purely internal, just my mechanism for getting that data into a form I can readily use in the main engine.
I believe that NASM uses a Perl script to generate its opcode tables from a plaintext description; I don't know what other people do, but I suspect that everyone, at some point, is parsing the data from plaintext descriptions.

I still haven't talked about the issues involved in opcode selection, that will happen soon :)

Posted on 2010-05-14 10:35:15 by Homer

OK - about opcode encoding selection.
There are two key issues: one-to-many mappings, and branch optimization.

Often, we won't have a choice - there will only be one suitable opcode encoding.
But sometimes, we'll have to choose between several possible ways to encode the same instruction.
Some will be longer in terms of #bytes - but may still execute more quickly.
Which do we choose? This heuristic should be something we can set.

We might think of the bytes being emitted to the Code Segment as a sequence of chunks of linear opcodes, delimited by (conditional or unconditional) branching instructions.
As an example, let's assume we have a number of opcodes, followed by a conditional jump (forwards), and the user hasn't specified what kind of jump.. should we be using 8, 16 or 32-bit offset? We can't know yet, because we don't know the 'distance' to the target.. There are two solutions in common use.
The first is to use the longest encoding, and try to shorten it later.. causing a forwards-ripple effect as we 'shrink' the code. Messy.
And the second is to defer this problem until we've scanned all of the input... known as 'multipass assembly'.
We don't actually rescan the input - we rescan our intermediate representation of it.
This allows us to resolve all the sizes of the linear codeblocks, and so optimize the size of the branch instructions.
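For what it's worth, a start-short-and-grow variant of that second approach can be sketched as follows. This is my own Python illustration: the item layout and the sizes (2-byte short jcc, 6-byte near jcc) are assumptions, not code from the project. Starting short and only ever growing is guaranteed to converge, whereas starting long and shrinking is what causes the forwards-ripple mess.

```python
# Multipass branch sizing over an intermediate representation:
# fixed-size code chunks plus branches whose size is not yet known.
def size_branches(items, labels):
    # items:  list of ('code', nbytes) or ('branch', target_label)
    # labels: dict mapping a label to the index of the item it precedes
    size = {i: 2 for i, it in enumerate(items) if it[0] == 'branch'}
    changed = True
    while changed:
        changed = False
        offs, pc = [], 0          # item offsets under current size guesses
        for i, it in enumerate(items):
            offs.append(pc)
            pc += it[1] if it[0] == 'code' else size[i]
        for i, it in enumerate(items):
            if it[0] != 'branch':
                continue
            disp = offs[labels[it[1]]] - (offs[i] + size[i])
            if size[i] == 2 and not -128 <= disp <= 127:
                size[i] = 6       # widen to near form (0F 8x + rel32)
                changed = True
    return size
```

Once the loop settles, every branch has its final size and the linear chunks can be emitted with correct displacements.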

I'd love to hear your thoughts on this, guys :)

Posted on 2010-05-14 23:06:59 by Homer
OK, so here is a quite poor implementation of a Segment object which implements "deferred optimization of branch instructions". It's not very complete - but the basic ideas are there.
Note that it has no base address, no notion of where it exists in memory, or on disk.

I make the assumption that all segments are code segments - because data segments are trivially handled by the same code.

The segment collects a list of two kinds of elements.
The first represents the binary data for a linear chunk of code - ie, a series of completely-resolved opcodes.
And the second represents a Branch instruction which we wish to defer.
Labels (local to the segment) are optionally supported, and any element may be tagged with a label which represents the start of that element's data.

The Insert method has been overridden: if the last element in the list is a code element, and the input element we're appending is also a code element and has no label, then we will merge the data for these two neighbouring code elements.

This is in fact the first step to 'resolving the list'.
Once the list is complete, we'll deal with the 'Unresolved Branch' elements that are preventing our list from collapsing into a single code element.

For each unresolved branch element:
If, between this element and its labelled target, there are NO unresolved branch elements, then we can resolve this branch element and merge it into the neighbouring code element.
If there's an unresolved branch in between, we'll leave this element and come back to it after we've resolved all that we can... I think there will be few, if any, cases which remain unresolved after this... we should find that we've only got one element, it's a code element, and it contains all the binary data for this segment, ready to write out to our object file.
Actually - we'll have a list of labeled and unlabeled code elements, and no unresolved branches - at which point we can safely merge the remaining elements... anyway, the goal is to be left with just one element, one block of binary data.
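The merge-on-insert rule described above can be sketched in Python like this. The class layout and names are my own illustration, not the attached Segment object:

```python
# A segment is a list of elements: resolved code chunks and
# deferred (unresolved) branches, each optionally labelled.
class Segment:
    def __init__(self):
        self.elements = []

    def insert(self, kind, data=b'', target=None, label=None):
        last = self.elements[-1] if self.elements else None
        # Merge rule: an unlabelled code element appended after a
        # trailing code element just concatenates its bytes.
        if kind == 'code' and label is None and last and last['kind'] == 'code':
            last['data'] += data
        else:
            self.elements.append(
                {'kind': kind, 'data': data, 'target': target, 'label': label})
```

With this in place, resolving a branch reduces to rewriting it as a code element and letting the same merge rule collapse its neighbours.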

I haven't mentioned object symbols, relocations, etc.
Guess that's coming up too :)

How am I doing guys? Do you think I'm on the right track with this? :)

Posted on 2010-05-15 00:46:13 by Homer

The main grammar was extended to allow the following directives:

.segment SegmentName [Flags]

Where the possible flags are 'Writeable', 'Executable', and 'BSS'.

If you select a segment that doesn't exist, it will be created.
If you specify no flags, it will be executable, read-only, and initialized (non bss).

If you select a 'known' segment (such as .Code), or a segment that you created earlier, any flags will be ignored.

The assembler now has a toplevel wrapper class to contain its components :)
This class embeds a collection of named Segment objects, and keeps a pointer to the current segment.

I haven't yet implemented the 'Reduction Handlers' for these new grammars, but I have implemented most if not all of the code that they'll be driving - so progress is good!

And anyway, writing reduction handlers has become a trivial thing, given that I now have a good number of existing handlers to serve as a kind of design template.

Here's a rough idea of what the assembler's current object hierarchy looks like:

;Assembler <- Evaluator <- Interpreter <- Parser
;   |
;   +- OpCodeBank <- OpCodes <- Interpreter <- Parser
;   |
;   +- Segments <- Collection (of Segment)
Posted on 2010-05-15 02:20:15 by Homer

I've implemented the handler for "segment selection" (.code, .data, .data?, userdefined).
Attached are the assembler toplevel class, and an update of the Segment class.
You can see how they work together.

Now that I'm able to create/select an output segment, it's time to take a look at handling opcodes in the main grammar.
We'll want to identify the number and type of operands, and pass that information to an as-yet unwritten function in the OpCode-Encodings Manager which will try to find at least one encoding whose argument count and type(s) are the same (or equivalent).

Posted on 2010-05-15 06:34:22 by Homer

This update to the Main Grammar contains some new NonTerminals which help to identify the type of operands appearing in our opcodes. And so, it also defines all the basic datatypes (byte, word, dword, qword, tword, and the reals).

I will be implementing a mechanism by which I can attach some useful information and/or payload entities to tokens in the parsetree, before handing them back up the tree.
Via this mechanism, I will be able to, at minimum,  tag tokens with type information implied by their parent reduction, and so retain information that I deliberately planted in the parsetree according to my grammar rules.

The attached image shows the parsetree generated by the input "mov dword ptr [eax],0"
Posted on 2010-05-15 08:18:28 by Homer
Btw why
"szDataSegment  db ".",0,"d",0,"a",0,"t",0,"a",0,0,0"
instead of simply using "dw" ?
Posted on 2010-05-15 08:55:34 by Ultrano
Ultrano - I have formed this habit during the design phase of this project, in order to make it absolutely clear to myself, from a distance, that a given set of strings are Wide, and not American Ascii ... The native format of strings throughout the engine is Wide, however there are a handful of methods and procedures which take an Ascii string - and a couple of times early in the work I managed to confuse them and cause problems.
Having resolved those problems, I left those string definitions looking like that as a future warning and reminder to myself.
There's no other reason.

Everyone - I've extended the Main Grammar to include all the 64-bit registers, and better/more Operand Type support.
I have extended the Token structure to allow it to carry 16-bit tags, or 32-bit pointers to user-structs.
This allows me to tag Tokens with context-based information, especially type information, but also generally I can now generate internal objects to be associated with tokens and passed back up the parsetree.

Token struct
    ParentSymbol dd ?      ; -> Symbol
    TokenData    dd ?      ; LPWSTR if Symbol is terminal, otherwise -> Reduction
    State        dd ?      ; LALR state
    union                  ; user-defined payload
        pPayload dd ?      ; if _high != 0, pPayload = ptr to MemAlloc'd struct
        struct
            Tag   word ?   ; if _high == 0, Tag = user-defined constant
            _high word ?
        ends
    ends
Token ends

Posted on 2010-05-15 09:30:03 by Homer
I've added a handler for the <Operand> nonterminal symbol.
When I see it, I grab the type of its content, then I resolve its content, then I tag the first of the N resulting tokens with the type id... look below, see the first token of "operand 1" has been tagged as "type 5", which is "Memory".

In fact,  I went ahead and also implemented 'tagging handlers' for <reg8> thru <reg64>, and also <Immediate>, <memory> , <regfpu> and <DataType>  ... I'll tag all those tokens, so things like "edx" will be tagged as "Reg32" etc... automatically.
But the <Operand> handler will overwrite the tag of the first token in each resolved operand, with the overall type of the operand (will become clear, read on).

The input statement "mov dword ptr [eax], 0" currently gets processed into this:

#MOV# mov Tag0   #dword# dword Tag5   #ptr# ptr Tag0   #[# [ Tag0   #EAX# eax Tag3   #]# ] Tag0   #,# , Tag0   #IntegerLiteral# 0 Tag6  

Where the format is "#SymbolName# terminalstring TagID", and the Tag values are from my Assembler constants:
TAG_REG8 equ 1
TAG_REG16 equ 2
TAG_REG32 equ 3
TAG_REG64 equ 4

I think I should have enough information available to start looking for matching opcode encodings - except that I still have not retained my encoding data, because I wasn't sure how or what I wanted to store.
Now I have all the pieces to make that decision.

It should be noted that, for OpCode statements at least, these tags are somewhat contextual - we've tagged "dword" as being "<Memory>" - but this is only because it's the first token in the resolved Operand of that Type.
Normally, "dword" would be receiving a tag indicating that it is in fact a <DataType>.
Posted on 2010-05-15 09:58:24 by Homer
So .. I've started tagging a bunch of Terminal tokens with type information, and I'm also identifying and type-tagging the operands in opcode expressions. All good stuff - some people call this "augmenting the parsetree", others call it "annotating" - I prefer to call it decorating the tree, because it reminds me of my childhood :)
Posted on 2010-05-15 10:57:13 by Homer

I may not get a lot of coding done today - we'll see.
Here's what I'm dealing with.

My example opcode statement:

mov dword ptr [eax], 0

How the Assembler is viewing that statement (see previous post for tag info)

#MOV# mov Tag0  #dword# dword Tag5  #ptr# ptr Tag0  #[# [ Tag0  #EAX# eax Tag3  #]# ] Tag0  #,# , Tag0  #IntegerLiteral# 0 Tag6 

The opcode encoding information which should be the only match for this example:

#Identifier# MOV Tag0  #r/m32# r/m32 Tag0  #,# , Tag0  #imm32# imm32 Tag0  #;# ; Tag0  #o32# o32 Tag0  #Hex# C7 Tag0  #/# / Tag0  #Int# 0 Tag0  #id# id Tag0  #[# [ Tag0  #386# 386 Tag0  #]# ] Tag0  #NewLine#

I need to write a function that detects that these two are a match, and no other match exists.
It might mean playing with the decorations a little.
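As a starting point, the match test could look something like this in Python. The tag names and the pattern-compatibility table are my own illustration of the idea, not the assembler's actual tables:

```python
# Each encoding operand pattern accepts a set of operand tags; an
# encoding matches when the counts agree and every operand is accepted.
ACCEPTS = {
    'r/m32': {'Reg32', 'Memory'},
    'r32':   {'Reg32'},
    'imm32': {'Immediate'},
}

def encoding_matches(operand_tags, encoding_patterns):
    # operand_tags:      types decorated onto the statement's operands
    # encoding_patterns: operand patterns from one encoding descriptor
    if len(operand_tags) != len(encoding_patterns):
        return False
    return all(tag in ACCEPTS.get(pat, ())
               for tag, pat in zip(operand_tags, encoding_patterns))
```

Run over all candidate encodings for the mnemonic, this should leave exactly one survivor for the example above: the r/m32, imm32 form of MOV.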

Posted on 2010-05-15 20:33:03 by Homer
Having added that code to tag Main Grammar tokens with type identifiers, I looked at the output of my OpCode Encodings Grammar, and decided it needed pretty much the same thing - to identify the overall types of each operand in the encodings, and to also identify as many other tokens as possible.

So, I've just done a crapload of work to support all the Types of the Operands in the OpCode Encodings - there's a bunch of 'compound types', special register types and other stuff that the Main Grammar currently doesn't know about - and likely won't need to.

But I still haven't 'quite' got the opcode encodings data in the form that best suits me for searches.
Posted on 2010-05-16 04:04:51 by Homer