rxsheyi: the regular expression compiler
2 Introduction
Writing regular expressions is sometimes a pain in the ass, especially if you’re using the libc regxxx() functions. You have to remember which characters are special in which context, and then add another level of quoting for C; consider what it takes to match a literal \ in your expression.
rxsheyi is a regular expression compiler that does the dirty work for you. It compiles a BNF description into various forms (see Footnotes). You can augment this BNF notation by writing functors in C and calling them from your syntax.
2.1 Principle of Operation
When you give a file to rxsheyi, it generates a C program that builds the regular expressions, possibly using user-defined functors. This program is then compiled and executed, and its output is the main output of rxsheyi. You can control the compilation phase with the following flags:
- -t
- Set the temporary directory for intermediate files. Your home directory is used by default.
- -cc
- Set the compiler. "cc" is the default value.
- -cf
- Append an argument or flag to the compiler invocation. %i stands for the input file and %o for the output file. Give this option once for each compiler argument. The default argument sequence is -o %o %i.
Example usage: ./rxsheyi -t /home/sukru/tmp -cc gcc -cf -g -cf -Wall -cf -o -cf %o -cf %i
The output of the generated program then may be used in your own C code.
3 Input File
The input file consists of two parts. In the first one you write your rules describing each symbol and your directives to the compiler. In the second part, you write your functors and any helper functions you may need to use for them.
3.1 Rules and Expressions
Rules are written in a fashion similar to yacc or bison; each rule consists of a symbol and an expression. The expression consists of terms and operators applied to them. A term can be one of the following:
- string
- Strings are written as C string literals. Even when you mean a single character, it’s written as a string literal, with double quotes. Letter escape codes such as \r \n and octal codes are recognized. Hexadecimal or decimal escape codes aren’t.
- number
- Only integers make sense for our purposes. You can only use decimal notation. No funny business with length modifiers or other bases. Note that a number doesn’t make sense when it’s not an argument to a functor.
- identifier
- When you give an identifier, it evaluates to its definition. Easy, da? Todo: make sure that there are no cycles in the given expression. If rule A uses rule B, rule B should not directly or indirectly use A. A simple all-paths check will do it.
- functor
- This is the kick. You’ll learn more about functors below. For now, know that they look like a function call, albeit with [ and ] instead of the regular parentheses ( and ).
- an expression surrounded with ( and )
- The usual precedence override mechanism.
The operators are the usual regular expression operators: + * ? | and concatenation. As you may have guessed, you don’t write any character to concatenate two expressions; just put them side by side. Here is a sample rule:
a : ( Except[ everything, "a" ] | "aa" )+ "bb" ;
Note the semicolon ending the rule.
3.1.1 OneOf
Another operator is oneof. This operator takes any number of arguments (all of them single characters) and ORs them together. It can only be used as the sole operator in the definition of a symbol. Its purpose is to create character classes easily, without much quotation; therefore, its syntax is also a little different. Here is an example:
a: oneof + - * / = \; \n ;
Note that you don’t need to quote any of the characters that follow oneof. However, you do need to escape the ; character, since it normally terminates the declaration.
3.2 Functors
A functor is a function that transforms a regular expression, or a set of them, into another expression. For example, you may want to have exactly three copies of a regular expression concatenated together. You could do this by writing a super-concatenate functor and giving the expression to it along with 3.
Some functionality that is considered core in classical POSIX regular expressions is implemented as functors. This is because I couldn’t find a syntactical representation for it that is both clean and functional. One example is the Except functor shown in the sample rule above.
Note the syntax above. You give the name first, then the [ followed by any number of arguments, which can be regular expressions or numbers, and finish it with a matching ]. You can also nest them. Whitespace is always insignificant.
When you call a functor, the corresponding function is called with the arguments and it returns another regular expression, which may be a transformation of the arguments.
3.3 Built-in Functors
3.3.1 Range
Syntax : Range[ <number or single character string> , <number or .. >]
Range always takes two arguments. It’s equivalent to the regular expression [<firstchar>-<secondchar>]
A number stands for the character with that ASCII code. Note that the first argument can’t be greater than the second. Also, when giving ranges like this, be careful not to include NUL (character 0).
3.3.2 Except
Syntax : Except[ <character class 1> , <character class 2> ]
Except subtracts the second character class from the first one and returns the result. It is similar to the ^ operator in POSIX bracket expressions, but POSIX has no equivalent for the general case.
3.3.3 Save
Syntax : Save[ "savename", <expression> ]
This functor causes rxsheyi to generate a symbol savename with a prefix provided by the generate directive. This symbol is an index into the regmatch_t array you give to regexec(). For example, if you have:
x: Save[ "x" , "a" + ];
y: x "b"*;
.generate[ y , "RX_y", "RXS_y_"];
then you can find the matched sequence of a’s like this (rx is the compiled RX_y and str is the string being searched; both names are only illustrative):
if (!regexec(&rx, str, RXS_y_COUNT+1, matches, 0)) {
    /* matches[RXS_y_x].rm_so is the offset where the a’s start */
    /* matches[RXS_y_x].rm_eo is the offset just past their end */
}
3.4 User Defined Functors
You can also write your own functors in C. After your rules, put a %% on a line by itself and then write your functors to your heart’s content. Like this:
%%
t List(t x) {
    return Cat( x, Star( Cat( Str(",") , x )));
}
That’s right. Now that this is C-space, you use the proper names for the operators. Cat for concatenation, Plus for + , Star for * , Except for Except. You get the idea.
You can give more than one argument to a functor. In this case, the arguments are grouped into a single expression using the left-associative , operator. You can find more on the functor interface in the C Interface section.
3.5 Directives
In order for your file to do something, you need to write generate directives. The syntax for a generate directive is:
.generate[ <symbol> , "constmacroname", "saveprefix" ];
This directive generates a macro called constmacroname, which contains the string representation of the regular expression. You can use this string to compile the expression using regcomp().
It also generates a set of macros (defined to be integers). Each such macro consists of the saveprefix and the name of the saved part, as given to the Save functor.
3.5.1 Todo: Include Directives
When there are a lot of expressions that share some of their sub expressions (think RFCs), it’s convenient to just import the stuff from elsewhere.
4 The Output
The output is a set of macro definitions as described under Directives. Some options control the nature of the output.
- -s filename
- Prints the source code for the generator program to given file.
- -x filename
- Compiles the generator program into the given file.
- -nc
- If you give this flag, the generator’s source is not compiled. The output is just the source code for the generator.
- -nx
- If you give this flag, the generator program is not executed. The output is the source code for the generator and its compiled form.
There are some issues to consider about the output.
You can’t have a null character in a regcomp() regex, since a null character ends a regular expression. Therefore, you should be careful not to include one in your Ranges.
5 Example: docthing
Here is the input which is used in making the docthing.
linearspace: Range[ 1, "\t" ] | "\v" | "\f" | "\r" | " ";
nl: "\n";
whitespace: linearspace | nl;
everything: Range[1,255];
line_begin: nl linearspace* ;
title: by_itself [
    Save[ "titlebegin", "="+]
    Save[ "titledescr", Protect["="]+ ]
    Save[ "titleend", "="+ ]
];
code_open: by_itself [ "[[" ];
code_close: by_itself [ "]]" ];
table_header: by_itself[ "|+" "+"* Save[ "tablecaption" , Protect[ "|" | "+"]* ] "+"* "+|" ];
table_footer: by_itself[ "|-" "-"* "-|" ];
term: by_itself[
    Save[ "termbegin" , "("+ ]
    Save[ "termtext" , Protect[")"]* ]
    Save[ "termend", ")"+ ]
];
olistbegin: line_begin Save["olistmark","#"+] linearspace ;
ulistbegin: line_begin Save["ulistmark","-"+] linearspace ;
linktext: Save["linktext", Protect[ ":" | "]" ]*];
linkbody: ":" Save["linkbody", Protect["]"]+];
link: linearspace "[" linktext linkbody? "]";
emphasis: "*"
    Save[ "emphasistext",
        Protect["*"]+
    ]
    "*";
paragraph: nl linearspace* nl;
markup: Save["title", title ];
markup: Save["code_open", code_open];
markup: Save["table_header", table_header];
markup: Save["table_footer", table_footer];
markup: Save["term" , term ];
markup: Save["link", link];
markup: Save["paragraph", paragraph];
markup: Save["olistbegin", olistbegin];
markup: Save["ulistbegin", ulistbegin];
markup: Save["emphasis", emphasis];
markup: Save["nl", nl];
.generate[ markup, "RX_markup", "RXS_"];
.generate[ code_close, "RX_codeclose", "RXS_codeclose_"];
%%
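/* Matches x alone on a line, with optional linear whitespace around it. */
t by_itself(t x) {
    return Cat(
        Cat(
            Cat(
                Cat(
                    Str("\n"),
                    Star(__generate__linearspace())),
                x),
            Star(__generate__linearspace())),
        Str("\n"));
}
/* Any character other than x or a backslash, or a backslash followed by any character. */
t Protect(t x) {
    return Or(
        Except(__generate__everything(), Or( x, Str("\\"))),
        Cat( Str("\\"), __generate__everything() )
    );
}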
And here is the output from rxsheyi:
#define RX_markup "(\n[\001-\t\v\f\r ]*(=+)(([]-\377\001-<>-[]|[\\][\001-\377])+)(=+)[\001-\t\v\f\r ]*\n)|(\n[\001-\t\v\f\r ]*\\[\\[[\001-\t\v\f\r ]*\n)|(\n[\001-\t\v\f\r ]*\\|\\+\\+*(([]-{,-[\001-*}-\377]|[\\][\001-\377])*)\\+*\\+\\|[\001-\t\v\f\r ]*\n)|(\n[\001-\t\v\f\r ]*\\|--*-\\|[\001-\t\v\f\r ]*\n)|(\n[\001-\t\v\f\r ]*(\\(+)(([]-\377*-[\001-(]|[\\][\001-\377])*)(\\)+)[\001-\t\v\f\r ]*\n)|([\001-\t\v\f\r ]\\[(([\001-9;-[^-\377]|[\\][\001-\377])*)(:(([\001-[^-\377]|[\\][\001-\377])+))?\\])|(\n[\001-\t\v\f\r ]*\n)|(\n[\001-\t\v\f\r ]*(#+)[\001-\t\v\f\r ])|(\n[\001-\t\v\f\r ]*(-+)[\001-\t\v\f\r ])|(\\*(([]-\377+-[\001-)]|[\\][\001-\377])+)\\*)|(\n)"
#define RXS_title 1
#define RXS_titlebegin 2
#define RXS_titledescr 3
#define RXS_titleend 5
#define RXS_code_open 6
#define RXS_table_header 7
#define RXS_tablecaption 8
#define RXS_table_footer 10
#define RXS_term 11
#define RXS_termbegin 12
#define RXS_termtext 13
#define RXS_termend 15
#define RXS_link 16
#define RXS_linktext 17
#define RXS_linkbody 20
#define RXS_paragraph 22
#define RXS_olistbegin 23
#define RXS_olistmark 24
#define RXS_ulistbegin 25
#define RXS_ulistmark 26
#define RXS_emphasis 27
#define RXS_emphasistext 28
#define RXS_nl 30
#define RXS_COUNT 30
#define RX_codeclose "\n[\001-\t\v\f\r ]*\\]\\][\001-\t\v\f\r ]*\n"
#define RXS_codeclose_COUNT 0
6 C Interface
When you’re writing your functors, you need to interface with the C functions already present in the generator program. The prototype for a functor is:
t FunctorName( t x );
t is the type of an expression. The following functions are available for use by a functor.
- t Cat(t x,t y)
- Returns the concatenation of x and y.
- t Or(t x,t y)
- Returns x | y.
- t Star(t x)
- Returns x *.
- t Plus(t x)
- Returns x +.
- t Opt(t x)
- Returns x ?.
- t Str(char *x)
- Returns the expression that matches the string x.
- t Number(int x)
- Returns an expression that contains the number x. This expression is suitable for passing to Range.
- t Except(t x,t y)
- t Range(t x,t y)
- t Save(t x,t y)
- These do what is described in the Built-in Functors section.
In order to learn more about the internals of the expression type t (i.e. if you need to write multi-argument functors), execute rxsheyi with -nc and see the resulting source file.
7 Contact and Updates
If you have any questions about rxsheyi, you can contact me by email. rxsheyi lives at my homepage.
8 Footnotes
- various forms
- Currently, only C representations of POSIX extended regular expressions are generated. Some other forms are being considered:
- Perl expressions.
- Lua expressions.
- Table-based C code, like flex output. Perhaps with two flavors, depth-first and breadth-first.