Does anyone have source or any tutorial on creating lexers or parsers,manually? I mean without using YACC or Flex. I tried looking through a bit of flex generated source, but failed to get the main part of the algo which uses arrays heavily.
Posted on 2001-12-12 10:15:51 by MovingFulcrum
Here is the only link I have with me: ????
Parsing Expressions By Recursive Descent
Look at how to represent a langauge description: BNF, EBNF, grammar files, etc.. (Google) Look at the source to some assemblers: FASM, SpAsm, NASM. Usually, there is a table of names and branch addresses. You can combine that with other methods, too. If you can describe the syntax you want to parse then you can convert that to code, but it's usually better to use tools for the convertion because if the syntax changes you don't want to have to rewrite the whole thing!
Posted on 2001-12-12 15:55:35 by bitRAKE
thanks a zillion bitRake. those links really helped.:)

btw , it is lexing that i am more concerned abt right now.
how do i extract the tokens. all your links assume that there is a space between each token but look at this c statement and the ways it can be declared.

int i =f(as);
int i = f(as);
int i= f (as );
int i=f(as);


Do u understand what i mean?
Any solutions?:)
Posted on 2001-12-13 10:32:34 by MovingFulcrum
Did you read the pages on expression parsing? The tokens are separated by white space or operators. Look at FASM source - nice easy read.
Posted on 2001-12-13 12:39:26 by bitRAKE
One lexer method is to code up a finite state machine (FSM) that may change state based upon the currenly scanned char. There's also an emmiter function & you'll sometimes need to back up one char on a state change.

So to answer your question directly: A space will change state depending what state you're currently in. If you're currently processing a "string literal" then a white space char will not change state, but most times yes.

Then again if the current state is "variable name" then any char not allowed in a variable name will change state... to what depends on what you allow in the language you've created.

You can easily wind up with hundreds of states and then multiply that by your "alphabet size". Not all of this will wind up being a separate arc on the FSM but enuf to that it can be a pain to track & debug and is why lexx etc. was/is considered useful.

These things are often coded up with tables & big SWITCH statements. The SWITCH statement can handle the current state & emitter & backup, & the table will usually contain the state change info. But of course there's more than one way to do this.
Posted on 2001-12-13 15:26:03 by rafe
Assuming you don't need to process macros, and you don't want to build tables, and you're not creating a "general purpose" lexer.
Two standard "tricks" to lexing:

1) Lookahead. You don't test the "current" character. Instead, you test the "next" character. If the next character is part of a valid token (treat whitespace as an ignored token), accept it by copying it to your token buffer (if desired) and "advancing" your scanning (lexing) position. In some cases (e.g., escape and continuation characters), you may want to look at more than one lookahead character.

Make the scanning "pointer" point to the "next" character. You can avoid testing against a buffer length, if you can append, as a sentinel, an invalid character at the end of your buffer. Put in enough sentinels to cover the longest lookahead sequence. The invalid character is preferably one that won't get accidentally typed from the keyboard.

2) End tokens by testing for valid/invalid characters. Instead of testing for whitespace, you end identifiers and numbers by testing for digits, letters, and any other valid continuation characters. An invalid character is assumed to be the start of the next token. This assumption makes it very easy to change the set of tokens.
Posted on 2001-12-13 15:36:51 by tank
And the output was this

Please enter a string of characters
to have duplicate characters removed:
YOUUOU UHOAAYVYVEEH HITTE

YOU HAVE IT
Program completed. Hit any character to exit.

You need Visual C++ to run it and a macro and .mak file
I'll add the rest if anyone likes

but I sent the .asm file

just remove the macro calls and and add Masm output calls

to run in Masm32
Posted on 2001-12-13 17:07:02 by andy981
I did it like rafe explained, but I think I came up with the idea on my own, at the time.

Here is the

macro

http://www.cs.wisc.edu/~smoler/x86text/Pmacros.inc

and the mak file
below




# Set this macro to 'VCNT' to use the Visual C++ tools.
# Set it to anything else to use the Win32 SDK tools.

TOOLS = VCNT

!IF "$(TOOLS)" == "VCNT"

LINK = link

!ELSE

LINK = link32

!ENDIF


all:
ml /c /coff /Zi youhaveit.asm
$(LINK) -incremental:no -subsystem:console -entry:mainCRTStartup -out:youhaveit1.exe \
youhaveit.obj kernel32.lib libc.lib -DEBUG
Posted on 2001-12-13 17:16:26 by andy981
Thanks a lot everyone for replying. Thanks andy for posting the sources. They are pretty well commented.

The reason why i want to code lexer/parser by hand is cause i need to code IntelliSense for C/C++. I dont think code generated by bison/flex will be fast enough to exectue while typing. But then again there is very little that has to be parsed. All that i need is the name of-

Variables
Functions and their parameters
Classes
Macros

So would it be ok to use a flex/bison combo on this or should i code it by hand. I would really like someone's opinions on this.

Also i was wondering about when to run the parser. I mean its fine to run it each time while typing so that it starts from the last state it exited. But what to do if someone presses backspace. I mean how will the parser know that a token has been deleted and to remove it from the symbol table?

Also one last thing. How do i tell bison to chekc for a variable to be public,private or protected? Also how do i specify the hierarchy of classes?

Sorry for the long list of questions but i am a bit confused:)
Posted on 2001-12-15 06:18:28 by MovingFulcrum
I am trying to write some kind of discompiler.I want to show output on richedit control.However discompiled text is bigger than original input file and I cant know how much big.It changes according to how many and which functions used.So I just allocate memory which is twice of file size.Then I do my parsing operation in memory and output to richedit control.However when I read memory to richedit control some 0 charecters appears.Only way I can think is to count how many bytes I operated and then read as this count.Is there a any wiser approach for this?.I am using below functions toread from memory to richedit.Thanks.


INVOKE HeapCreate, HEAP_GENERATE_EXCEPTIONS, BytesO, 0
mov hMem, eax
INVOKE HeapAlloc, hMem, HEAP_GENERATE_EXCEPTIONS, BytesO
mov pMem, eax

invoke GetInfo,hWnd,gMem,addr info ;do may parse operation
invoke lstrcpy,pMem,addr info
...
and Hold, 0
push pMem
pop EditS.dwCookie
mov EditS.dwError, 0
mov EditS.pfnCallback, offset EditStreamReadM
INVOKE SendMessage, hREdit, EM_STREAMIN, SF_TEXT, addr EditS
; eax = bytes read
INVOKE HeapDestroy, hMem



EditStreamReadM PROC uses esi edi dwCookie:DWORD, pbBuff, cb, pcb
mov ecx, cb ; Block size for this callback
.if BytesO > ecx
sub BytesO, ecx ; BytesO = total bytes to read
.else
mov ecx, BytesO
and BytesO, 0
.endif
mov eax, pcb ; Pointer to pcb
mov dword ptr[eax], ecx ; Actual bytes read for this callback
mov esi, dwCookie ; Input
add esi, Hold ; Apply offset to the input buffer
mov edi, pbBuff ; OutPut, doesn't need the offset??
add Hold, ecx ; Hold = accum offset for input buffer
rep movsb
mov eax, 0 ; Must return Zero
ret
EditStreamReadM ENDP
Posted on 2001-12-15 08:01:01 by LaptoniC
MovingFulcrum,
Great question with an elegant solution... You use a table to reference your currently in use symbols this will have a smallish structure associated with the chars for the symbol. Classically, one uses a hash table for this (as I'm doing) but there's no need to stand on tradition use what's easiest for you. In this symbol structure you'd keep a reference count. Then when you see the symbol again you bump the reference count, when you back up & no longer need the symbol you decrement the reference count. You only delete the structure when the reference count goes to 0.

Run the parser when the user signals they are done inputing data for the line or field or whatever... or do it on command if you're writing a full blown compiler.

As far as class hierarchies & scoping I'm doing this for assembler so it's either global or local & nothing in between. But you can do various scoping schemes like linking your symbol structs in a push down stack manner etc. I'm currently working on a scheme for doing the struct.field resolution that will work for intellisense for asm but I'm not to sure that this will help too much for the C/C++.

I'm no expert in bison or flexx so I'm no help there sorry.
-----------
LaptoniC,
I can only tell you what I'm doing for this in a similar but not an identical situation... I actually use separate richEdits for source & opcodes & other things, about 4 or 5 in all. This then allows me to use a splitter bar to make the columns as wide as I need. So I'm just setting up the op codes in one richEdit with each line being however long as the hex2ascii needs (cr/lf). Brutally primitive but it's working so far. Don't know if this is what you need tho.
-----------

Hope this helps some.
Posted on 2001-12-15 18:47:37 by rafe