I have been profiling the HLA compilation of the following source file:

program t;
#include( "w.hhf" )
begin t;
end t;

(w.hhf, for those who don't know, is the include file that pulls in
the Windows system constants, types, and API declarations).

When I first started seeing if I could speed this up, it took
about 25 seconds to compile on a 300MHz PII and about
17 seconds to compile on a 2GHz PIV (obviously, memory
access times are the bottleneck here; and processing the
w.hhf file touches so much memory that the cache gets
blown away).

Almost 90% of the time was spent in the symbol table lookup
routine, even with the hash table algorithms (prior to adding
the hash table, compilation of this file took over 40 seconds
on the PII machine). After hacking up the symbol table search
routine in assembly language, and a few other minor improvements,
I got the compile time down to 20 seconds on the PII and about
14.5 seconds on the PIV machine. Not a tremendous improvement.

After further research and profiling, I discovered that the record
declarations wind up doing a linear search for their data types, even
if they are part of a namespace. I'm looking into fixing this right now,
but in the meantime, it suggests a quick optimization you can make
to speed up the compilation of programs that include w.hhf.

The trick is to reorganize all the declarations in "windows.hhf".
This file consists of a bunch of constants followed by a bunch of
type declarations. Many of the type declarations are records, and
they're the ones that wind up doing a linear search. To help speed
things up a little bit, move all the type declarations from the end of the
"windows.hhf" header file to the beginning of the file. You'll have to
compile a sample program like my "t.hla" program above to discover
which constants need to be moved before certain type declarations
(i.e., there are several constants used as array bounds in the type
declarations, and obviously those constants must be declared before
their first use). This took me about 15 minutes to do. Compilations of
the t.hla program above went from about 20 seconds down to 14 seconds.
I would expect a similar improvement on the PIV machine
(though I'm not at that machine right now, so I can't run the test).
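
For instance, the pattern looks like this (a made-up pair of
declarations for illustration; these names don't appear in
windows.hhf):

const
MaxName := 256; // hypothetical constant used as an array bound below,
// so it must stay ahead of the type section that uses it

type
NameArray: char[ MaxName ]; // legal only because MaxName is already declared

Move a type like NameArray ahead of the constant it references and the
compile fails, so a few constants have to travel to the front of the
file along with the types.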

Once again, I looked at the feasibility of doing "pre-compiled header"
files. With the HLA v1.x design, this just isn't practical.
However, profiling the compilation of t.hla shows that about 86%
of the time is spent in the linear search routines, handling roughly
9,000 lookups, while only about 2% of the time is spent in the hash
table routines, handling roughly 13,000 lookups. Put another way, the
average linear lookup costs something like sixty times as much as the
average hash lookup. So it's pretty clear that I need to spend some
time moving certain searches from the linear lookups to the hash table
lookups (I've already sped up the linear search about as much as I
can; faster code isn't so much the issue here as avoiding a linear
search altogether).

I'll try to have something better by HLA v1.42. My goal is to get the
compilation time for t.hla below 10 seconds on my 300MHz PII.
The fact that there are so many record field declarations in windows.hhf,
and that these wind up running a linear search, is probably the main
reason this file compiles so slowly. So hopefully I'll be able to kludge
something together to avoid the linear search once the search moves
beyond the record fields.

Cheers,
Randy Hyde
Posted on 2003-02-25 14:25:32 by rhyde
Okay,
I've spent two days reworking the symbol table search routines
for HLA v1.42. HLA now compiles the w.hhf header file much more rapidly.

To recap HLA's symbol table history:

HLA started out as a prototype design sometime around 1996.
Expecting to quickly design the assembler and then rewrite it
in itself, I never paid much attention to the prototype's performance.
As I usually do when first writing a compiler, I used
a simple (easy to verify) linear search for the symbol table lookup routines.

Over the years, HLA's performance was never that bad. The big performance
loss in the past has always been macro expansion (which is incredibly
disgusting in the current implementation). For "normal" applications
(those that don't consist of thousands and thousands of lines of macro
invocations), HLA has always produced an executable in just a few seconds.
No big deal.

Then I ported Hutch's MASM32 windows.inc file from MASM to HLA.
An HLA program that includes "w.hhf" (which includes windows.hhf,
kernel32.hhf, gdi32.hhf, and user32.hhf) has something like 30,000
lines of code, and it's all very dense (averages just under a declaration
per line). Attempting to compile this makes the linear search rear
its ugly head.

The first time I attempted to compile the windows header file (sometime
back around HLA v1.38) the performance was absolutely dismal.
It took better than 40 seconds to compile just these header files into
an empty HLA program (on a 300MHz PII).

Unfortunately, HLA had grown over the years and the symbol table data
structure had come to rely upon the linear search algorithm used to
search (and build) the symbol table. A wholesale switch to a new
algorithm was out of the question.

After careful consideration, I figured out that while a generic algorithm
change was not possible, I could use a specialized algorithm when
searching through namespaces (because namespaces don't support
arbitrarily deep nested lex levels like the symbol tables outside a
namespace). So I devised a hash table algorithm for searching through
namespaces. The idea was that I could put the Windows declarations into
the "w" namespace and speed up searching through the symbol table
by a massive amount (the hash table has 256 entries for each namespace,
and my hash function does a fairly good job of distributing the symbols
through the hash space, so a speedup of approximately 100x seemed
doable and a speedup of 10x seemed eminently achievable).
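
Sketched as HLA declarations (the compiler itself is built with
FLEX/Bison, so the type and field names below are just my illustration
of the idea, not HLA's actual internals), the structure looks
something like this:

type
// sketch only: each namespace gets 256 collision chains
pSymNode: pointer to SymNode; // forward pointer reference
SymNode: record
symName: string; // the identifier's name
symNext: pSymNode; // next symbol whose name hashes to the same bucket
// ...plus the type, value, and other symbol attributes...
endrecord;
NamespaceTbl: record
buckets: pSymNode[ 256 ]; // one chain head per hash value
endrecord;

A lookup hashes the identifier to pick one of the 256 chain heads and
then walks only that chain, so as long as the hash function spreads the
symbols evenly, each search touches a handful of entries instead of
marching through the entire namespace.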

So I coded the hash table lookup algorithm for namespaces into HLA v1.39.
Compiles of the empty program that included w.hhf dropped from 40 seconds
to 25 seconds. Far less improvement than I was expecting.

So I did what any self-respecting assembly programmer would do: I recoded
the search routines (including the string compare) in assembly language.
This dropped compile times from 25 seconds down to about 21 seconds.
Not a real big improvement.

The next step was really cheating. I profiled the code and discovered
that my problem was that a lot of the time (even within the namespaces) the
linear search was still being used. It turns out that record fields and parameter
lists couldn't use the hash table algorithm; they reverted to a linear search,
even when searching through the symbols found in the namespace (e.g.,
if a record field or a parameter had a user-defined type appearing in the
namespace, HLA would still do a linear search through the namespace for
that symbol). As a result, my linear search function was still consuming 86%
of the compilation time (according to the Microsoft VC++ function profiler).
So, as I said, I cheated. Recognizing that a linear search *can* be fast if the
symbol you're searching for occurs early in the list, I rearranged a lot of the
type and constant declarations in windows.hhf to speed up the linear search.
With no changes to HLA, I was able to reduce the compile time from about
21 seconds down to around 13-14 seconds. Still, about 80% of the compilation
time was being spent in the linear search routines.

The big problem was that HLA always looks up each identifier in the symbol
table once the scanner (lexical analyzer) discovers it has an ID. Generally,
fields of a record and parameter declarations are undefined symbols.
Undefined symbols, of course, require a complete scan of the linear symbol
table list to determine that they are truly undefined. This gets ugly since
we know (for a correct program, anyway) that the symbol is undefined to
begin with. So I kludged up HLA *big-time* and added a new attribute,
"@fast" that you can attach to a namespace, e.g.,

namespace w; @fast;

<< declarations >>

end w;

The "@fast" attribute tells HLA not to bother checking record/union field
declarations and parameter declarations to see if the symbol is undefined.
The scanner skips the symbol lookup and immediately returns "undefined symbol".

This is a big kludge because you cannot just arbitrarily put "@fast" on
a namespace. Doing so may create some problems; consider the following:

namespace error; @fast;

const
somesym := 5;
somesym := 6;

end error;

Under normal circumstances, HLA should return a "duplicate symbol error"
when it hits the second occurrence of "somesym" in the const declaration
section. However, with the "@fast" option, HLA always returns "undefined symbol"
for the symbols you are declaring in the const section (as well as other sections),
so HLA will not properly report the error.

Therefore, you should *never* use the @fast attribute when developing
the declarations for a namespace. However, once the namespace declarations
become stable (e.g., for a set of library routines), then you can attach the
@fast attribute to the namespace to speed up compilation when building
an application that includes the namespace. Of course, you should take care to
delete the @fast attribute when making modifications to the namespace.
Indeed, I suggest doing something like this:

namespace ns; #if( @defined( fastNamespaces )) @fast; #endif

<<declarations>>

end ns;

Then, if you run HLA (presumably from a makefile) with a command
like the following, you will get a fast compile:

hla -DfastNamespaces fileContainingNS.hla

Adding the @fast option sped up HLA a little bit, but when compiling
the w.hhf header file (and all that it includes), there was still a tremendous
amount of linear searching taking place. Although the @fast attribute
spares HLA from looking up X and Y in the following example, HLA still
had to do a linear search on A and B:

namespace w; @fast;
type
A:int32;
B:qword;

r:record;
X:A;
i:int32;
endrecord;

p:procedure( Y:B );

end w;

So, just for the w.hhf file, I added another couple of *big* kludges
to HLA's symbol table lookup routines: restrict the linear search
to the symbols in the current parameter list or record field list, and
then switch back to the hash table search algorithm once the search
moves outside that list. There are still a few weird cases where HLA has to resort
to a linear search, but they rarely occur (certainly they are rare in
the w.hhf header file), so such uses of the linear search do not have
a big impact on compile time.

The end result?
Well, after adding @fast and modifying (i.e., kludging) HLA to
handle record/union field lists and parameter lists specially, the
compile time for the empty program that includes w.hhf fell from
about 13-14 seconds to under two seconds (still on a 300MHz PII).

Now, it is practical to simply include the entire w.hhf header file
whenever compiling an arbitrary Win32 application (e.g., a small
app like "Hello World").

In comparison to other assemblers, HLA is still several times slower.
For example, SpAsm, whose author claims it is "blazing fast,"
compiles a 1MB source file into a 300K PE file in 3.16 seconds
on a K6/200 under Win95 (I compile on a PII/300 under W2K). Now, the empty
program I'm compiling produces no object code, so this is far from
a direct comparison (OTOH, I have no idea what's contained in that
300K PE file, either, so a direct comparison isn't that easy).
Nevertheless, you can see that HLA compile times now compare
reasonably well on fast machines against assemblers that are
written in pure assembly language (note that my compile times
include running HLA.EXE, HLAPARSE.EXE, ML.EXE, and LINK.EXE,
whereas SpAsm's times are measured running the assembler
from within an IDE, which is going to be faster).

Now I'm not going to sit here and claim that HLA is as fast as
a well-written assembler that has been coded in assembly language.
Even with the changes I've made to HLA, I can still make it run
really slow (just create a file with thousands of user-written macro
invocations and you'll slow HLA down). Nonetheless, it is interesting
to see that a program written in a VHLL (FLEX/Bison) doesn't
do such a bad job.

BTW, for those who are interested, HLA now spends most of its
time in the parser (about 36% of the time) and the lexer (about
25%), while the linear symbol table search routine consumes
only about 2% of the total execution time; this
is about as it should be (note that there are no macro invocations
when compiling the w.hhf file by itself; I'm sure the lexer and
the macro expansion-related functions would consume the vast
majority of the time if there were a lot of macro invocations).

This new, faster version of HLA will appear in the HLA v1.42 release.
That's still a few weeks off: I need to test the FASM code generator
quite a bit more, and furthermore, a big change like the one I've made
to the symbol table routines is bound to break something. I've done very
little testing at this point, so I don't really know yet whether my
changes are working properly (compiling the w.hhf header file without
error is one thing; using those definitions is something else entirely).
In any case, just thought I'd keep you up to date on what's going on here.
Randy Hyde
Posted on 2003-02-26 12:18:33 by rhyde