text2code

t2c is a very simple literate programming tool. I wrote it because all other LP tools I have seen are geared towards generating beautifully typeset documents at the expense of the user. t2c has four very simple commands and doesn't fill up your source code with TeX or HTML stuff.

Other LP tools assume that what the programmer writes is the source for both the document and the program, and that the two need to be extracted separately (hence the "weaving" and "tangling"). This forces the source to be strictly structured (with sections, list items etc.) and clutters it with markup commands which are useful only for generating a document.

However, these documents aren't what a programmer uses most during daily work. Most of that time is spent on the source code itself. Therefore, the source code itself must be easy to read and shouldn't disturb a reader's focus with markup commands.

Another point I have noticed is that typesetting doesn't add much to a well-written text. I have read many books in plain ASCII format (thanks to Project Gutenberg) and haven't been bothered by the fact that they aren't typeset with lots of different fonts or that they aren't paginated.

Based on these points, I concluded that the best literate programming tool would let the programmer write a document in plain ASCII and then the tool would figure out the code from the document. The source text will normally need no further processing: it's the final product in the documentation branch. t2c is an implementation of this idea.

Installation

Change into the build directory and just run the given Makefile:
$ make 
Put the resulting binaries somewhere in your $PATH. I used the following platforms so far: and had no problems. It should work on pretty much any operating system with POSIX system calls, but I haven't tried. In particular, I implemented the filtering feature by using Linux man pages as a reference. Therefore, it may or may not work on other systems in the same way.

Command Line

When given the -v switch, t2c prints its version and exits.

The -f switch causes t2c to force all writes. Normally, t2c doesn't touch an output file if the write operation wouldn't change the file. This enables use of t2c with the 'make' program to re-build only the changed files after each modification.

When given the -f option, all output files will be touched even if the write operation wouldn't change anything. This can be used to update the modification times of output files.
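
The write-only-when-changed behaviour described above could be sketched in C roughly as follows. This is not taken from the t2c sources: the names same_contents and write_output are hypothetical, and the sketch assumes the whole output is already in memory.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the file at `path` holds exactly the `len` bytes in `data`. */
static int same_contents(const char *path, const char *data, size_t len)
{
    FILE *f = fopen(path, "rb");
    size_t i;
    int c;

    if (!f) return 0;                      /* missing file: it must be written */
    for (i = 0; i < len; i++) {
        c = fgetc(f);
        if (c == EOF || (unsigned char)c != (unsigned char)data[i]) {
            fclose(f);
            return 0;
        }
    }
    c = fgetc(f);                          /* the file must end here as well */
    fclose(f);
    return c == EOF;
}

/* Hypothetical helper: writes `data` to `path` unless the file is already
   identical and `force` is 0. Returns 1 if written, 0 if skipped, -1 on error. */
int write_output(const char *path, const char *data, size_t len, int force)
{
    FILE *f;

    assert(path != NULL && data != NULL);
    if (!force && same_contents(path, data, len)) return 0;
    f = fopen(path, "wb");
    if (!f) return -1;
    if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
    return fclose(f) == 0 ? 1 : -1;
}
```

Skipping the write leaves the modification time alone, which is what lets 'make' treat the file as up to date. Passing a non-zero force corresponds to the -f switch.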

All other arguments are taken to be names of input files. t2c processes the input files in the given order.

t2c cannot process the standard input stream. All input must be stored in files, which are then given to t2c as command line arguments.

Input Files

t2c was designed to work with ASCII files. UTF-8 should also be fine. Just make sure you don't use any non-ASCII whitespace characters in t2c commands.

An input file consists of blocks. There are two kinds of blocks: redirection blocks and file blocks.

Text in a redirection block is put into the given section. Text in a file block is put into the given file. Both redirection blocks and file blocks may contain section declarations. What this means will become evident later.

You can think of a section as an in-memory file. When you redirect some text into a section, the text is put into the in-memory file. After this, when you declare a position for the section, the contents of the in-memory file are inserted at that position.

Before going into more detail, let's look at a simple input file:

+ A
  Text to be put in section A

+ B
  Section B header
: A
  Section B footer

> file.out
  File header
: B
  File footer
Here, we have three blocks: two sections called 'A' and 'B', and a file block for "file.out". After reaching the end of the input, t2c starts to emit output files. In this case we have only one. If there were many, they would be processed in the order they were given.

When t2c processes the contents of "file.out", it sees that the section 'B' is needed. It finds that section in its internal list of sections and inserts its body at the current position. At this point, the "file.out" buffer contains the following:

 File header
 Section B header
: A
 Section B footer
 File footer
Now, it sees that the contents of section 'A' are needed. It searches for the section and inserts the contents of section A.
 File header
 Section B header
 Text to be put in section A
 Section B footer
 File footer
These are the final contents of the file. Since there are no more files to output, the program exits with a success status.
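
The walk-through above amounts to a recursive expansion of declare lines. The following C sketch illustrates the idea under simplifying assumptions: all sections are already fully assembled, section names are matched exactly after trimming, and a depth limit guards against a section that includes itself (whether t2c has such a guard is not stated here). The names struct section, find_body and expand are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

struct section { const char *name; const char *body; };

/* Finds the body of the section called name[0..len), or "" if there is none. */
static const char *find_body(const struct section *secs, size_t n,
                             const char *name, size_t len)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (strlen(secs[i].name) == len && strncmp(secs[i].name, name, len) == 0)
            return secs[i].body;
    return "";
}

/* Appends `text` to the string in `dst` (capacity `cap`), replacing each
   line that starts with ':' by the expanded body of the named section.
   `dst` must hold a valid string on entry. Returns 0 on success, -1 when
   `dst` is too small or the nesting is deeper than 32 levels. */
int expand(const char *text, const struct section *secs, size_t n,
           char *dst, size_t cap, int depth)
{
    assert(text != NULL && dst != NULL && cap > 0);
    if (depth > 32) return -1;
    while (*text) {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) + 1 : strlen(text);

        if (text[0] == ':') {
            const char *name = text + 1;
            const char *end = text + len;
            while (name < end && isspace((unsigned char)*name)) name++;
            while (end > name && isspace((unsigned char)end[-1])) end--;
            if (expand(find_body(secs, n, name, (size_t)(end - name)),
                       secs, n, dst, cap, depth + 1) < 0)
                return -1;
        } else {
            size_t used = strlen(dst);
            if (used + len >= cap) return -1;
            memcpy(dst + used, text, len);
            dst[used + len] = '\0';
        }
        text += len;
    }
    return 0;
}
```

Called with the body of "file.out" as text, an empty string in dst and depth 0, this produces the five lines shown above.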

Let's now get into the details. As previously stated, a file consists of blocks. Other commands are embedded inside the contents of blocks. A block starts with a redirection command and ends at the next redirection command or at the end of input file. The text between the command and the end of the block is the 'body' of the block.

It's an error to start a file without a redirection command. So, the first line in an input file must be such a command. See "Things to Do".

A command consists of a single character command code and a command argument. The command code must be at the first column of a line, without preceding whitespace. The argument is the rest of the line until the newline character. There is no way of continuing a command to the next line.

The body of a block starts after the newline terminating the associated command. Therefore, you will need an empty line after the command if you wish to insert a newline before the body.

The argument of a command is usually normalized before being passed on. The normalization process converts all whitespace and control byte sequences to a single space character (ASCII 32), then removes the leading and trailing spaces.
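
The normalization rule can be illustrated with a small in-place C routine. This is a sketch written from the description above, not the code t2c actually uses; the name normalize_arg is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Collapses every run of whitespace or control bytes in `s` into a single
   space and drops leading and trailing runs. Works in place; returns `s`. */
char *normalize_arg(char *s)
{
    char *in = s, *out = s;
    int pending = 0;                 /* a gap was seen and not yet written */

    assert(s != NULL);
    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        if (isspace(c) || iscntrl(c)) {
            pending = (out != s);    /* never remember a leading gap */
        } else {
            if (pending) *out++ = ' ';
            pending = 0;
            *out++ = (char)c;
        }
    }
    *out = '\0';                     /* a trailing gap is simply dropped */
    return s;
}
```

With this rule, "+ Map   Interface" and "+ Map Interface" name the same section.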

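The following are the commands recognized by the program:

  +  Append: adds the body of the block to a section.
  >  File: puts the body of the block into an output file.
  :  Declare: inserts the contents of a section at that point.
  <  Filter: passes its body through an external program.

The append command has some variations as well. We will see them in a moment.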

The Append Command

This command (+) takes a section name as its argument and then puts the body of the block at the end of the given section. If there are multiple appends to a given section, they are put into the destination section in the order they are seen in the input.

The argument to an append command cannot start with an asterisk (*) or a bang (!). These are reserved for template usage. A sole dot (.) as the section name signifies a document section, which is treated specially. This section name (a sole dot) should be used exclusively for document blocks.

This command has two variations: numbered append and PREV append.

Numbered Append

Normally, the append command adds the body to the end of the given section. However, if the argument ends with an integer, the integer is removed from the argument and remembered. The rest of the argument is the section name. The body is then inserted into that section, with the integer used as a position indicator, much like line numbers in some languages: if another append command for the same section has a smaller integer, its body will precede the body of this command. For example, if you have:
+ A 100
 Body 1

+ A 200
 Body 2

+ A 50
 Body 3
The contents of section 'A' will be:
 Body 3
 Body 1
 Body 2
If there is an append command for the same section without an integer suffix, then that command works normally, appending to the end of the section, after all numbered bodies.
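
Splitting the trailing integer off an append argument might look like the following C sketch. It assumes the argument has already been normalized and that the integer is separated from the name by a space; whether t2c also accepts something like "A100" is not stated here. The name split_order is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* If `name` ends with " <digits>", cuts that suffix off, stores the number
   in *order and returns 1. Otherwise leaves `name` alone and returns 0. */
int split_order(char *name, long *order)
{
    size_t len, i;

    assert(name != NULL && order != NULL);
    len = strlen(name);
    i = len;
    while (i > 0 && isdigit((unsigned char)name[i - 1])) i--;
    /* Require at least one digit, a space before it, and a non-empty name. */
    if (i == len || i < 2 || name[i - 1] != ' ') return 0;
    *order = strtol(name + i, NULL, 10);
    name[i - 1] = '\0';
    return 1;
}
```

Bodies can then be sorted by the stored number, with unnumbered bodies kept after all numbered ones in the order they were seen.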

Note that you can't use trailing numbers in Declare commands (:). For instance:

: Foo 300
is invalid usage. Please see "The Declare Command" for more details.

This feature is particularly useful for writing type definitions in document order while having them output in compiler order. The same goes for variables, functions without prototypes etc.

PREV Append

If the argument of an append command is PREV, then the body of the command will be appended to the body of the block prior to the previous block. So, it is used for alternating output. This is a relic from older versions. I experimented with it for a while until I found that it's error prone and confusing. I'm keeping it because some of my libraries may be making use of it.

The File Command

The file command (>) puts its body into the given file. File commands are processed in the order they are seen in the input. This becomes important if you're doing some postprocessing on the files after they are output.

A file command can be given multiple times for the same file. If this is the case, the later command will append to the file rather than re-opening it.

This command can take more than one argument. The first one is the file name. The others are options about the file. Two options are recognized:

nolines
  Disables the output of line number information for the given file. Normally t2c outputs CPP directives in order for the compiler to give error messages relative to the t2c input files, not the generated C files. This option may be useful when an output file isn't a C source or header file.
force
  Forces the file to be written on the disk even when the new contents are the same as the contents on the disk.

Since spaces separate file names from options, it's not possible to have space or control characters in file names.

The Declare Command

This command (:) inserts the contents of the given section at that point in the output stream. The declare command is not a block command; it consists of only the command line. Therefore, it's found in the bodies of append or file commands. A section can be inserted in many places.

The declare command is executed after all append commands have been executed. Therefore it isn't necessary to do all definitions related to a section before you can insert it somewhere.

The declare command can't be used with a section name ending with an integer. For example, if you have:

+ Types 100
 stuff
and later
> file.c
 A
: Types 100
 B
you will end up with only
 A
 B
in file.c. This is because the append command removes the integer suffix from its argument and there is no longer a section named 'Types 100' when the declare command runs. Therefore, nothing gets emitted for the declare command with argument "Types 100".

The Filter Command

The filter command (<) filters its body using the given program and inserts the output of the executed program in the output stream. This command is not a block command, i.e. it doesn't end a previously started block. However, it does have a body, which specifies the text to be fed to an external program. The end of the input is marked by an empty filter command: a '<' character on a line by itself. For example:
+ A
< perl
  print(" hello world!\n");
<
will put the line
 hello world!
in section A. The filter command gets executed after both the append and declare commands. Therefore, it's not possible to execute append or declare commands output by an external program. For instance:
< perl
 print(": Types\n");
<
Would insert the line
: Types
into the output stream literally and no processing for this line would be done since the program is way past that stage.

The body of a filter command is fed to the standard input stream of the executed program and is replaced by the contents of its standard output stream. If the external program fails, t2c will fail as well. For successful invocations, the standard error stream of the executed program is redirected elsewhere. Therefore, you won't see warning messages if an external program executes successfully with the given input but writes diagnostic messages to its standard error stream.

The filter command can also be used in a nested fashion. If this is the case, the commands are processed from inside out. First, the innermost command is executed, then the one enclosing that etc. For instance, this works properly:

< perl 
< perl
  print("print(\"hello\");");
< 
<
It's also possible to pass command line arguments to external programs. These arguments are separated by whitespace characters. If you wish to include space characters in an argument, enclose that argument in single quotes. This also means that single quotes surrounding an argument will be removed. There are no other escape mechanisms. For example, you can't have an argument which contains a single quote character.

If you need to pass such delicate arguments to an external command, it's best to put the invocation in a shell script and do the quoting there. The quoting mechanism of t2c is not very robust and is designed to be used only in the simplest of cases.
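
To make the quoting rule concrete, here is a C sketch of such an argument splitter. It is an illustration only: the name split_args is hypothetical, and the sketch assumes a quote is recognized only at the start of an argument, which may differ from what t2c does.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Splits `line` in place into whitespace-separated arguments. An argument
   that starts with a single quote runs up to the next single quote, and
   the quotes are dropped. `argv` has room for `max` pointers including the
   terminating NULL. Returns the argument count, or -1 on an unterminated
   quote or when there are too many arguments. */
int split_args(char *line, char **argv, int max)
{
    int argc = 0;
    char *p = line;

    assert(line != NULL && argv != NULL && max > 0);
    for (;;) {
        while (*p && isspace((unsigned char)*p)) p++;
        if (!*p) break;
        if (argc == max - 1) return -1;      /* keep one slot for NULL */
        if (*p == '\'') {
            argv[argc++] = ++p;
            while (*p && *p != '\'') p++;
            if (!*p) return -1;              /* closing quote is missing */
        } else {
            argv[argc++] = p;
            while (*p && !isspace((unsigned char)*p)) p++;
            if (!*p) break;                  /* last argument ends the line */
        }
        *p++ = '\0';
    }
    argv[argc] = NULL;
    return argc;
}
```

The resulting NULL-terminated array has the shape that execvp expects.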

As a final note, the filter command uses execvp to execute the external program. Therefore, shell features such as environment variable expansion, home directory expansion, globbing etc. are not available. If you need these, put them in a shell script and execute that from t2c.
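
For readers unfamiliar with this style of process handling, the C sketch below shows one way to run a filter with execvp on a POSIX system. It is not the t2c implementation: run_filter is a hypothetical name, the input is passed through a temporary file to avoid a pipe deadlock, the output is captured into a fixed-size buffer, and the standard error stream is left untouched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] (looked up in $PATH by execvp) with `input` on its standard
   input and stores its standard output in `out` (at most cap-1 bytes).
   Returns 0 if the program exited with status 0, -1 otherwise. */
int run_filter(char *const argv[], const char *input, char *out, size_t cap)
{
    int fd[2], status;
    size_t used = 0;
    ssize_t n;
    pid_t pid;
    FILE *tmp;

    assert(argv && argv[0] && input && out && cap > 0);
    tmp = tmpfile();                       /* holds the filter's input */
    if (!tmp) return -1;
    if (fputs(input, tmp) == EOF || fflush(tmp) == EOF ||
        lseek(fileno(tmp), 0, SEEK_SET) < 0 || pipe(fd) < 0) {
        fclose(tmp);
        return -1;
    }
    pid = fork();
    if (pid < 0) { close(fd[0]); close(fd[1]); fclose(tmp); return -1; }
    if (pid == 0) {                        /* child: wire up stdin and stdout */
        dup2(fileno(tmp), STDIN_FILENO);
        dup2(fd[1], STDOUT_FILENO);
        close(fd[0]);
        close(fd[1]);
        execvp(argv[0], argv);
        _exit(127);                        /* reached only if execvp failed */
    }
    close(fd[1]);
    fclose(tmp);
    while ((n = read(fd[0], out + used, cap - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fd[0]);
    if (waitpid(pid, &status, 0) < 0) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Because no shell is involved, an argument such as $HOME or *.c reaches the program literally, which is the limitation described above.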

Templates

This is a new feature in version 1.5. You can write template code which can be later customized to specific situations. This is done by a little bit of macro processing.

Within template code, you can have variables which will be replaced by some values later. The system also includes a couple of macros with arguments.

The template feature also lets you import a library in private or public scope.

Defining Templates

In order to define a template, you write append commands in a specific form. Here is the general syntax for it:
+* templatename.codesection

  .. code ..
Here, templatename is the identifier for the template. It should be a C identifier, but that restriction is not enforced yet. The codesection identifier should be one of the following:

Within the template code, you can use variables to generate new symbol names based on the settings of the library user. For instance:
  void $pfx_print();
can generate something like
  void foo_print();
if the user has set the variable pfx to value foo. Each variable consists of a dollar sign, followed by one letter and optionally more letters and digits. The underscore sign can not be part of a variable name.
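
The variable syntax can be illustrated with a small C sketch of the substitution step. It only handles plain variables: the builtin macros and generated variables described below are ignored, and treating an unset variable as an empty string is an assumption. The names struct binding, lookup and expand_vars are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

struct binding { const char *name; const char *value; };

/* Returns the value bound to name[0..len), or "" when it is not set. */
static const char *lookup(const struct binding *vars, size_t nvars,
                          const char *name, size_t len)
{
    size_t i;
    for (i = 0; i < nvars; i++)
        if (strlen(vars[i].name) == len && strncmp(vars[i].name, name, len) == 0)
            return vars[i].value;
    return "";
}

/* Copies `src` to `dst` (capacity `cap`), replacing each $name by its value.
   A name is one letter followed by letters and digits, so an underscore
   ends it. Returns 0 on success, -1 if `dst` is too small. */
int expand_vars(const char *src, const struct binding *vars, size_t nvars,
                char *dst, size_t cap)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL && cap > 0);
    while (*src) {
        const char *piece = src;           /* default: copy one character */
        size_t len = 1;

        if (src[0] == '$' && isalpha((unsigned char)src[1])) {
            const char *end = src + 2;
            while (isalnum((unsigned char)*end)) end++;
            piece = lookup(vars, nvars, src + 1, (size_t)(end - src - 1));
            len = strlen(piece);
            src = end;
        } else {
            src++;
        }
        if (used + len >= cap) return -1;
        memcpy(dst + used, piece, len);
        used += len;
    }
    dst[used] = '\0';
    return 0;
}
```

Since the underscore ends the name, $pfx_print is read as the variable pfx followed by the literal text _print.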

There are some builtins to help you turn some sections of your code on and off based on variable settings.

$def(expression)(conditional code)
  If the given expression results in a non-empty string, the conditional code is output. Otherwise, it's suppressed.
$nef(expression)(conditional code)
  This is the opposite of $def.
$dev(expression)(default value)
  If the expression results in a non-empty string, then that result is output. Otherwise, the default value is output.
$T(text)
  Internally used, not for library code.
$F(text)
  Internally used, not for library code.

The expressions here can be single variables or a combination of builtin macros. For instance, if you want to include some code only if both A and B are defined, you can do this:
  $def($def($B)($A)) (conditional code)
If you want to make an OR operator, you can simply write your expressions side by side:
  $def( $A $B ) (conditional code)
There are also generated variables. If the variable name looks like $Gxy where x is a lower case letter and y is an optional sequence of letters and digits, then the variable is a generated variable.

Such a variable is automatically assigned a value by the system. This value is suitable for use as a C identifier and has the general form _YQRZnumber. So, if you see such nonsense in your output, you probably capitalized some variable name, like $Gain or something.

Using Templates

In order to instantiate a template, you need to make an append command with the following form:
+! templatename
  var1  value1
  var2  value2
  ..
The variables are given without the dollar sign. Values are optional. If a variable is used only for deciding whether some code is to be output, you can omit the value part. The system will automatically assign the value 1 to the variable. Leading and trailing spaces are removed from values. No further processing is applied to values. Each variable assignment spans a single line; there is no way to continue it on the next line.

Scope and Code Placement

Templates are instantiated in public scope by default. You may override this by setting the scope variable to private.

When you do that, the public code sections are output at the same places as their private counterparts. For instance, public prototypes will be placed just before the private prototypes etc.

In addition to this, the variable static will be set to the value static. In public scope, this variable is set to an empty string. This means that you should write your public functions and public prototypes with the prefix $static, as follows:

+* map.public_functions
$static $pfx_new() 
  { .. do something .. }
This makes it possible for the public function to have the static storage specifier in case the user imports the template in private scope.

For the instantiated code to appear in your C files, you need to provide locations for different parts of the code. There are two ways of doing this.

First, you can use the IFACE and CODE variables to tell the program where the corresponding elements should go. Public definitions, types and prototypes go into the section given by the IFACE variable. The rest goes to the place defined by the CODE variable. For instance:

+! map
  IFACE  Map Interface
  CODE   Map Implementation
could be used to relocate the corresponding code into sections marked as ': Map Interface' and ': Map Implementation' respectively.

The second method is to micro-manage where each single part goes. To do this, you need to specify section names for each used template section. The variable names are the same as the corresponding codesection identifiers explained before. Here is an example:

+! map
  public_definitions  Map Header
  public_types        Map Types
  private_variables   Map Variables
  ..
You can actually skip some; there is some leniency here. However, you shouldn't use this method unless you absolutely have to.

Errors and Warnings

t2c gives a warning when a section is not output into any file. This prevents code from being lost inadvertently. Document sections are marked with a lone dot and they don't generate this warning if not output to a file.

Another warning is given when any empty sections are found. If you have

: Pretty Functions
and no text is redirected into 'Pretty Functions', then t2c will give this warning.

Some Tips

It's OK to not declare some sections. For example, document sections need not be output to any file since the source code is already the document. Therefore, I use very simple section names for documentation, such as a lone dot, or a minus sign.

You can declare a section in multiple places to repeat the content. This could be useful for making two slightly different programs sharing some of their private code.

If you want to include code samples within documentation sections, it's a good idea to write them all in uppercase in order to tell them apart from normal code sections.

When you write a program using literate programming techniques, it's good practice to concentrate on the document and make it as clear as possible. In my case, this causes me to interrupt function definitions in the middle and explain something. When that happens, I lose focus and I sometimes omit some critical piece of code from the interrupted function. Therefore, I now try to never interrupt a function with comments. Instead, I try to keep the functions small, less than one screenful if possible.

An Example

t2c requires commands to start at the first column so be careful if you copy-paste the example.
+.
This is an example file playing a game of 'guess the number'. In order
to have some amount of code to discuss, let's make it a non-standard
game which cheats and changes its secret number every turn to make it
hard to win. Of course, this needs to be done in a way consistent with
previous guesses and responses to make sense.

Let's start with the output file:

> guess.c
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

: Variables
: Functions

+.
This is pretty much how I use t2c to write files. I put nothing in the
body of the file command other than trivial stuff such as headers etc.
I have two sections here, since it's a simple program. More complex
programs typically have type sections, prototype sections etc. 

Here is the main function.
+ Functions

int main()
{
   int guess;
   int turns;
+.
Since we're trying to keep the user from finding out our secret number,
we're going to change it constantly. In fact, we don't even need a
secret number in the first place, just the boundaries established by
previous guesses.

+ PREV
   int upper=100, lower=1;
+.
Did you notice that I'm using a single dot as the section name for
comment blocks? This is so, because I won't be writing the comments to
any file at all.

Let's move on with the problem at hand. 

+ PREV
: Initialize
   for(turns=0;;turns++) {
     guess= get_guess(turns);
     if (guess==0)
     {
       printf("My number was %d. You made %d guess%s. Good day.\n",
           (lower+upper)/2, turns, turns>1? "es":"");
       return 1;
     }
     if (found(guess, &lower, &upper))
      {
        printf("Yes! My number was %d. You "
                "found it in %d guess%s.\n",
                  guess, turns+1, turns>0? "es" : "");
        return 0;
      }
   }
   return 1;
}
+.
As you can see, the PREV append command is not so useful if you use it
to split up a function. It's best to write the whole function in one go,
and then surround it with comments on both ends. For that to work well,
the function needs to be small. I did it here just to demonstrate PREV.

So let's implement the get_guess function.

+ Functions 100

int get_guess(int turns)
{
  int guess;
  if (turns) 
     printf("You made %d guess%s so far.\n",turns,turns>1?"es":"");
  printf("What is your %sguess?\n", turns? "next ": "");
  if (scanf("%d",&guess)!=1) return 0; /* unreadable input: quit */
  return guess;
}

+.
I added an integer suffix to the section name so that this function gets
emitted before main(). I do this because I don't want to write function
prototypes. This kind of tracking dependencies becomes tiresome after
some point. There is another program distributed along with t2c, called
t2c.proto. This program helps generate prototypes automatically.

Now, let's proceed with the 'found' function. This function simply
adjusts the upper and lower bounds and tells the user that he was
unsuccessful with his guess.

+ Functions 100

int found(int guess, int *lower, int *upper)
{
  int secret_is_lower;
  int valid_guess= 0;
  if (guess<*lower) secret_is_lower= 0;
  else if (guess>*upper) secret_is_lower= 1;
  else { 
      valid_guess= 1; 
      if (*lower==*upper) return 1;
      else if (guess-*lower>*upper-guess) secret_is_lower= 1;
      else secret_is_lower= 0;
  }
  
  if (secret_is_lower) {
    print_message(guess, secret_lower_message,
        sizeof(secret_lower_message)/sizeof(char*));
    if (valid_guess) {  *upper= guess-1; }
  } else {
    print_message(guess, secret_higher_message,
        sizeof(secret_higher_message)/sizeof(char*));
    if (valid_guess) { *lower= guess+1; }
  }
  return 0;
}

+.
We could have simply printed a string for the message, but I didn't
want the program to be boring :)

+ Functions 50

void print_message(int guess, char **message_set, int nmsg)
{
   int mno= rand() % nmsg;
   printf(message_set[mno], guess);
}

+.
We're writing the program top-down. Therefore our line numbers are
getting smaller and smaller as we go into more detail.

Now, we're using 'rand' for something completely different. Let's
put an srand call into main:
+ Initialize

  srand(time(NULL));
  printf("I have the number ready, let the game begin.\n");
  printf("Enter 0 any time to quit the game.\n\n");

+.
Now all that needs to be written is the message set. 

+ Variables

static char *secret_higher_message[]=
{
   "My secret number is higher than %d.\n",
   "You guessed %d, but it is too small. Try again.\n",
   "Guess a number greater than %d.\n",
   "I wouldn't make my secret number as small as %d, would I?\n"
};

static char *secret_lower_message[]=
{
   "%d is too big.\n",
   "You should try a number smaller than %d.\n",
   "I don't think my secret number is that big.\n",
   "This time, I chose a number smaller than %d.\n"
};
+.
So that's about it for this example. If you want to see a bigger
and much more complicated example, look for self_print.u in the
source distribution.

Things to Do

There are no bugs I know of. There are some things that would be nice to have, but I'm quite happy with the state the program is in.

Here is a list of things that could be nice to have.

An input file must start with a block command. It's an error to put anything before a block command at the top of the file since the program then doesn't know where to put that text. I should probably allow this for empty lines.

It would also be nice to be able to print t2c documents. Printing them as plain text files is possible, but those would not be so easy to read on paper. It would be nice to have a table of contents, section titles, indices, page numbers etc. Also, I could format code and comments differently to make them more pleasant to read.

Speaking of formatting, it could be nice to have an HTML formatter for t2c input files. This could hyperlink section names to append commands, format the code parts differently etc.

I should mention the prototype generator and other tools in this document.

The code is a mess since its original form was just a hack to see whether something like this could be useful. Then it grew bigger and I have many things that could go in properly named functions. I should clean it up some time.

This document also needs touch-up. A proper man page should be written, showing the synopsis for each command along with short descriptions.

A proper regression test suite should be made. I did only very basic testing and can't be confident in the program.

Things That Won't Be Done

The current system requires labels to match exactly, but tolerates differences in the amount of whitespace in the labels. It should also be possible to match labels in a case-insensitive way.
This will not be done, since it will entail a lot of work for UTF-8 documents. Meanings of glyphs can change with the language of the document they are used in. For example, in Turkish the lower case letter for 'I' is not 'i'. How do you recognize an English word embedded in Turkish text? Marking them up is definitely not the answer, since that's what we're trying to avoid here.

Change Log

1.0   Initial Build
1.1   May 2013
      - Cancelled the .w stuff, writing directly in C.
      - Separated the proc library from the main program,
        importing from alib.
      - Same for strbuf
1.2   May 2013
      - Added line number information to output files.
1.5   20170512
      - Templates

Hacking Guide

There are 3 major object types in the program: blocks, parts and files. When you use the append command, the related text gets stored in a part object. The same thing happens when you use the file command. An input file is thus separated into parts.

After parsing, for each append-part, we find the block which has the same name as the part's target. If not found, it's created. Then, the contents of the part is dumped into this block.

Finally, we process the output files. When a declare command is seen within the contents of an output file, the corresponding block is found. The contents of the block are output at that position.

Lines of text are represented by the line_t object. This object contains a field called kind. This field determines the function of the line. If it's one of + : < > then it's taken to be a command line. Otherwise, it can be T, Y or Z.

T
  This represents a normal text line.
Y
  This represents a line number information line. Lines which look like #line 23 "foo.c" are marked with Y so that the line number functions can tell where they are.
Z
  Lines output from the filter command have this kind. This tells the line enumerator to not bother trying to find the number and file of this line. Template output also uses this.

Companion Programs

The following programs are distributed along with the t2c sources. Eventually, I might incorporate them into the main executable since they are quite simplistic.

Prototype Generator

The t2c.proto program generates function prototypes. However, it's not for general use: it can be used only as a filter inside a t2c source file.

The program reads a set of function definitions from the standard input and prints their prototypes to standard output. Since this program doesn't parse the C code at all, it looks for a specific pattern to do its job. The program simply replaces text in top-level {} pairs with semicolons. For example

  int foo(int a) { body }
becomes
  int foo(int a) ;
This has some consequences. First, the input must consist of only function definitions. For example, things like:
  typedef struct foo { body } foo_t;
will be translated to:
  typedef struct foo ; foo_t;
which will fail spectacularly. In order to give only function definitions to this program, you can use the relocation feature of t2c. For instance, I use the following idiom.
> file.h
< t2c.proto
: Public Functions
< 

> file.c
#include "file.h"
: Public Functions
and later ..
+ Public Functions

 void foo()
 {
 }
This way, I get prototypes for all functions and no warnings :).

Another consequence of the program's algorithm is that you can't have struct declarations inside parameter declarations, but who does that?

Finally, the function definitions must not be generated by CPP macros. Since t2c.proto doesn't try to interpret the meaning of its input in any way, there is no way it can generate a declaration when the function is specified like:

 FUNC_BEGIN(foo)

 FUNC_END
or similar. However, the following does work:
 FUNC_DEF(foo) {
 }
Note that t2c.proto removes all comments and pre-processor directives from the input before generating prototypes. Therefore, things like the following will break:
 #ifdef _WIN32
  int print(DisplayDevice *device,char *str) 
 #else
  int print(XDisplay *dpy, char *str) 
 #endif
When you have such things, it's just easier to not relocate the corresponding function to the t2c.proto input.

Speaking of directives, since the program doesn't process them at all, it can be fooled by inactive '{' or '}' tokens. For example, the following will break:

 #if 0
  if (a==3) {
 #else
  if (a==5) {
 #endif
    stuff
 }
Since the program sees two '{' tokens, it expects two '}' tokens to close them. It doesn't find those, so the rest of the file is gobbled up into the first '{' block, until something equally weird happens.
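
The brace-matching rule described in this section is simple enough to sketch in a few lines of C. The sketch below only performs the replacement of top-level {} groups. Unlike t2c.proto it does not remove comments or pre-processor directives first, and the name make_prototypes is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Copies `src` to `dst` (capacity `cap`), replacing each top-level {...}
   group with a semicolon. Returns 0 on success, -1 if `dst` is too small
   or the braces do not balance. */
int make_prototypes(const char *src, char *dst, size_t cap)
{
    size_t used = 0;
    int depth = 0;

    assert(src != NULL && dst != NULL && cap > 0);
    for (; *src; src++) {
        char c = *src;
        if (c == '{') {
            if (depth++ > 0) continue;     /* nested brace: still skipping */
            c = ';';                       /* top-level body starts here */
        } else if (c == '}') {
            if (depth == 0) return -1;     /* closing brace without opener */
            depth--;
            continue;
        } else if (depth > 0) {
            continue;                      /* inside a body: drop the text */
        }
        if (used + 1 >= cap) return -1;
        dst[used++] = c;
    }
    dst[used] = '\0';
    return depth == 0 ? 0 : -1;
}
```

Feeding it the typedef from the earlier example yields "typedef struct foo ; foo_t;", which shows why the input must contain only function definitions.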

There are no recognized command line options or arguments. The regular use is as shown in the previous example:

+ Private Prototypes
< t2c.proto
: Private Functions
< 

Quoter

The program t2c.q quotes its standard input and emits it as a multi-part C string literal. By multi-part, I mean one string literal token per input line. For instance,
 Roses are red,
 Violets are blue,
 I made this program,
 Just for you.
will be translated to
" Roses are red,\n"
" Violets are blue,\n"
" I made this program,\n"
" Just for you.\n"
The program encodes special characters using octal escape sequences. There are no options or command line arguments; you can simply invoke it as:
< t2c.q
 Roses are red,
 Violets are blue,
 I made this program,
 Just for you.

: Poem Footer
< 
and such.
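
A quoting routine of this kind can be sketched in C as follows. Which characters t2c.q treats as special is not stated here, so the sketch makes an assumption: double quotes, backslashes and non-printable bytes become three-digit octal escapes, while a newline is written as \n and closes the current literal, matching the sample output above. The names put and quote_text are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Appends the string `s` to `dst`; returns -1 if it does not fit. */
static int put(char *dst, size_t cap, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (*used + len >= cap) return -1;
    memcpy(dst + *used, s, len + 1);
    *used += len;
    return 0;
}

/* Writes `src` into `dst` as one C string literal per input line.
   Returns 0 on success, -1 if `dst` is too small. */
int quote_text(const char *src, char *dst, size_t cap)
{
    size_t used = 0;
    int open = 0;                          /* currently inside a literal? */
    char piece[8];

    assert(src != NULL && dst != NULL && cap > 0);
    dst[0] = '\0';
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (!open && put(dst, cap, &used, "\"") < 0) return -1;
        open = 1;
        if (c == '\n') {
            strcpy(piece, "\\n\"\n");      /* escape, close quote, new line */
            open = 0;
        } else if (c == '"' || c == '\\' || !isprint(c)) {
            sprintf(piece, "\\%03o", (unsigned)c);
        } else {
            piece[0] = (char)c;
            piece[1] = '\0';
        }
        if (put(dst, cap, &used, piece) < 0) return -1;
    }
    if (open && put(dst, cap, &used, "\"\n") < 0) return -1;
    return 0;
}
```

Always writing three octal digits keeps an escape from swallowing a digit that follows it, since a C octal escape is at most three digits long.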

Downloads

Main program:
I keep the older versions in case I have messed up in the new version. I really should clean up this place.

Below are the links to yet another companion program, which worked something like the template feature. It's called t2c.ar.

Here are the sources for t2c.ar and here is a document describing the code within it.