A Basic Lexer

I sometimes need to write a program which parses some kind of configuration and generates code. Usually the most time-consuming part for me is the parsing. To make that easier, I've written a lexer which I will use for all my projects from now on.

The lexer recognizes the following lexemes:

Numbers
Identifiers
Quoted strings
Operators

In addition, comments are allowed in the input. A comment starts with the '#' character and lasts until the end of the line.

Block quotes can also be parsed with this code. The lexer doesn't look for block quotes by itself; you need to parse them explicitly, as explained in a later section.

Only ASCII characters (bit 7 cleared) can be used for lexemes, except within quoted strings.

Numbers

The lexer recognizes integers in four bases: decimal, hexadecimal, octal and binary. Floating point lexemes are not implemented yet, but they would be quite easy to add when I have the need. You can group digits using the underscore character. For example:

  0xffff_ff_ff
  12_543
  0b1010_0011_1100_0_0_0_0

Here are the prefixes to be used for the different bases, together with the subtype (LEsub) reported for each:

decimal (LEsub=='9'): no prefix
binary (LEsub=='1'): "0b" or "0B"
octal (LEsub=='7'): "0o"
hexadecimal (LEsub=='f'): "0x" or "0X"
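
The rules above can be sketched as a small standalone function. This is only an illustration and not the code in lexer.c: the name parse_grouped_int is hypothetical, overflow is not checked, and it assumes that underscores are simply skipped wherever they appear after the first digit or the prefix.

```c
/* Hypothetical helper, not part of lexer.c. Parses one integer lexeme.
   Returns the subtype character ('9', '1', '7' or 'f') and stores the
   value in *out, or returns 0 if the text is not a valid number. */
int parse_grouped_int(const char *s, unsigned long *out)
{
    unsigned base = 10;
    int sub = '9';
    unsigned long v = 0;
    int digits = 0;

    if (*s < '0' || *s > '9') return 0;     /* a number starts with a digit */
    if (s[0] == '0') {
        switch (s[1]) {
        case 'b': case 'B': base = 2;  sub = '1'; s += 2; break;
        case 'o':           base = 8;  sub = '7'; s += 2; break;
        case 'x': case 'X': base = 16; sub = 'f'; s += 2; break;
        default: break;                     /* no prefix: assumed decimal */
        }
    }
    for (; *s != '\0'; s++) {
        unsigned d;
        if (*s == '_') continue;            /* grouping character, no value */
        if (*s >= '0' && *s <= '9')      d = (unsigned)(*s - '0');
        else if (*s >= 'a' && *s <= 'f') d = (unsigned)(*s - 'a') + 10;
        else if (*s >= 'A' && *s <= 'F') d = (unsigned)(*s - 'A') + 10;
        else return 0;                      /* not a digit in any base */
        if (d >= base) return 0;            /* digit too large for this base */
        v = v * base + d;
        digits++;
    }
    if (digits == 0) return 0;              /* a prefix with nothing after it */
    *out = v;
    return sub;
}
```

With this sketch, the three example lexemes shown earlier parse as hexadecimal, decimal and binary respectively, and the underscores do not affect the resulting values.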

Identifiers

Identifiers start with a letter or underscore. They may subsequently contain letters, digits or underscores. You may use the following convenience function to test for a specific identifier:

  int LEisIDENT(const char *v);

The return value is nonzero if and only if the current token is the given identifier.
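
The identifier rule itself is easy to state in code. The following is a minimal standalone sketch, not taken from lexer.c; is_identifier and its two helpers are hypothetical names. It uses explicit ASCII ranges rather than isalpha(), since only ASCII characters are allowed in lexemes.

```c
/* Hypothetical helpers, not part of lexer.c. */
static int is_ident_start(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_ident_char(int c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9');
}

/* Nonzero if s is a letter or underscore followed by any number of
   letters, digits or underscores. The empty string is rejected. */
int is_identifier(const char *s)
{
    if (!is_ident_start((unsigned char)*s)) return 0;
    for (s++; *s != '\0'; s++)
        if (!is_ident_char((unsigned char)*s)) return 0;
    return 1;
}
```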

Quoted Strings

Strings may be quoted by either double quote (ASCII 34) or single quote (ASCII 39) characters. The subtype (LEsub) contains the character used for quoting, 34 or 39.

Inside the quoted string, you may use any non-ASCII byte in addition to the normal ASCII set. A quoted string may also span multiple lines. Special characters are encoded using the backslash character. Here is a list of recognized escape sequences:

  \d    => double quote
  \q    => single quote
  \s    => backslash
  \n    => newline (ASCII 10)
  \t    => tab
  \e    => escape (ASCII 27)
  \xCC  => any byte, CC are hexadecimal digits

The value (LEval) contains the decoded form of the quoted string, without the surrounding quotes.
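
To make the escape table concrete, here is a minimal standalone decoder. It is a sketch rather than the lexer's own code: decode_escapes and hex_value are hypothetical names, and treating an unknown or truncated escape as an error is an assumption about how such input should be handled.

```c
/* Hypothetical helper: value of one hexadecimal digit, or -1. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper, not part of lexer.c. Decodes the body of a quoted
   string (without the surrounding quotes) into out, which must have room
   for strlen(in) + 1 bytes. Returns the decoded length, or -1 on an
   unknown or truncated escape sequence. */
long decode_escapes(const char *in, char *out)
{
    long n = 0;
    while (*in != '\0') {
        int c = (unsigned char)*in++;
        if (c == '\\') {
            switch (*in++) {
            case 'd': c = '"';  break;
            case 'q': c = '\''; break;
            case 's': c = '\\'; break;
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'e': c = 27;   break;
            case 'x': {
                /* Exactly two hexadecimal digits must follow. */
                int hi = hex_value((unsigned char)in[0]);
                int lo = hi < 0 ? -1 : hex_value((unsigned char)in[1]);
                if (lo < 0) return -1;
                c = hi * 16 + lo;
                in += 2;
                break;
            }
            default:
                return -1;      /* unknown escape, or backslash at the end */
            }
        }
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

The length is returned separately because \x00 can place a zero byte inside the decoded string.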

Operators

All other printing characters are classified as operators. Each operator consists of just one byte. This byte is stored in LEtok.

Block Quotes

The lexer doesn't have preset rules for block quotes. You need to call the function

int LEpair(int lp,int rp);

in order to parse a block quote. The argument lp is the character code for the "left parenthesis" and rp is the right one. When LEpair is called, the current token has to be equal to lp. The function then scans the input for a matching right parenthesis. While doing this, it skips over quoted strings but does not recognize comments, so a '#' has no special meaning inside the block. It also handles nested parentheses.

The return value is 0 if there was a matching right parenthesis, otherwise it's nonzero. In case of success, the variable LEval holds the contents of the block quote, without the enclosing parentheses. This content also includes any comments.

You should be careful not to leave any dangling quote character within block quoted text. For example, the following won't work:

  test_block_quote { 
    the following parenthesis is ignored because it is inside
    a quoted string "{". However, if you have some dangling
    quote characters, those will actually start a quoted
    string which won't end within the block quoted text
    and cause the closing parenthesis to be skipped.
  }

This also applies to quote characters inside comments within the block quoted text, because comments aren't recognized there.
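
The scanning behaviour described in this section can be sketched as follows. This is a standalone illustration working on a string in memory, not the actual LEpair code; find_block_end is a hypothetical name. It assumes that a quoted string ends at the next identical quote character, which fits the escape table since no escape sequence contains a literal quote.

```c
/* Hypothetical helper, not part of lexer.c. s points just past the
   opening lp. Returns the offset of the matching rp, or -1 if there
   is none. */
long find_block_end(const char *s, int lp, int rp)
{
    int depth = 1;
    long i = 0;

    while (s[i] != '\0') {
        int c = (unsigned char)s[i];
        if (c == '"' || c == '\'') {
            /* Skip a quoted string; parentheses inside it don't count. */
            int q = c;
            i++;
            while (s[i] != '\0' && (unsigned char)s[i] != q)
                i++;
            if (s[i] == '\0')
                return -1;      /* dangling quote swallowed the rest */
        } else if (c == lp) {
            depth++;            /* nested block */
        } else if (c == rp) {
            if (--depth == 0)
                return i;
        }
        /* '#' gets no special treatment: comments aren't recognized here. */
        i++;
    }
    return -1;
}
```

Given the text " don't }", the sketch returns -1, because the apostrophe opens a quoted string that never closes. The example block above fails in the same way.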

Initialization and I/O

You need to call the function

void LEinit();

in order to initialize the lexer. At compile time, you may modify buffer sizes for I/O using the macros

#define LEbufSIZE   256
#define LEinpSIZE   256

The former is the initial size of the token buffer. The latter is the size of the input buffer.

After you've initialized the lexer, you can give it some input using:

int  LEfile(char *fn);
void LEmemory(unsigned char *buf, int len,char *name);

The first one uses the given file and returns nonzero if there was a problem opening it. The second one uses a memory block and stores the given name so that you can refer to it later when printing error messages.

After setting the input, you need to call LEnext in order to get the first token. When you're done with the lexer, you should call

void LEclose();

in order to close the associated file and reset the lexer.

Running the Lexer

After the initial call to LEnext, you can begin parsing by referring to the variable LEtok. This will contain one of the following values, or an ASCII character (in case the recognized lexeme was an operator):

 LEtIDENT= 256, LEtNUMBER, LEtQUOTED, LEtEOF,
 LEtBADNUM, LEtBADCHAR, LEtUNTERMQ, LEtBADQ

LEsub will contain the subtype for the token, as described in the sections on numbers and quoted strings.

LEval will contain the string associated with the token.

LEfil and LElin can be used to determine the file name and line number for token position. Note that this isn't accurate for multi-line tokens such as quoted strings.

For tokens signalling end of input or errors, only LEtok is valid; the rest shouldn't be used. In case of an error, the lexer cannot recover. The only thing to do is to close the lexer and abandon processing.

When you're done with the current token, you should call LEnext() to get the next token. Calling LEnext() after end of input continues to generate LEtEOF tokens.
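
Since operators use their own ASCII code and every other token starts at 256, a parser can sort out a value of LEtok with simple comparisons. The sketch below is standalone: it repeats the token list from this section so that it compiles on its own, whereas a real program would take these names from lexer.h. The functions describe_token and is_error_token are hypothetical, and the descriptions of the error tokens are guessed from their names.

```c
#include <stdio.h>

/* Copy of the token list above, for a standalone build only. */
enum {
    LEtIDENT = 256, LEtNUMBER, LEtQUOTED, LEtEOF,
    LEtBADNUM, LEtBADCHAR, LEtUNTERMQ, LEtBADQ
};

/* Hypothetical helper: nonzero for the tokens that signal an error. */
int is_error_token(int tok)
{
    return tok >= LEtBADNUM && tok <= LEtBADQ;
}

/* Hypothetical helper: writes a short description of tok into buf. */
const char *describe_token(int tok, char *buf, size_t size)
{
    switch (tok) {
    case LEtIDENT:   snprintf(buf, size, "identifier");           break;
    case LEtNUMBER:  snprintf(buf, size, "number");               break;
    case LEtQUOTED:  snprintf(buf, size, "quoted string");        break;
    case LEtEOF:     snprintf(buf, size, "end of input");         break;
    case LEtBADNUM:  snprintf(buf, size, "bad number");           break;
    case LEtBADCHAR: snprintf(buf, size, "bad character");        break;
    case LEtUNTERMQ: snprintf(buf, size, "unterminated quote");   break;
    case LEtBADQ:    snprintf(buf, size, "bad quoted string");    break;
    default:
        if (tok > 0 && tok < 256)           /* single-byte operator */
            snprintf(buf, size, "operator '%c'", tok);
        else
            snprintf(buf, size, "unknown token %d", tok);
        break;
    }
    return buf;
}
```

A parsing loop would typically stop as soon as is_error_token(LEtok) is nonzero or LEtok equals LEtEOF, and call LEnext() otherwise.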

Download

lexer.c lexer.h