Serialization Tool

genser generates the C code needed to marshal and unmarshal C structs. Its main purpose is to help create a binary communication protocol. As such, it supports defining new versions of a protocol incrementally. It can also be used for storing configuration or cached information.

Because of this focus, it does not handle recursive definitions and is not suitable for storing linked data structures.

Invocation

genser doesn't take any command line options. All arguments given on the command line are considered to be input file names, which are processed as if they were all concatenated to form a single input file.

Input File

The input file is a strictly ASCII text file consisting of object definitions and global directives.

Lines starting with the character # are ignored and can be used for comments.

Directives

Directives are line-based and start with the character %, followed by the directive name and its arguments. Here is the list of all directives:

%source source-file-name: Name of the generated C source file. Defaults to "sz.c". Do not use quotes.
%header header-file-name: Name of the generated C header file. Defaults to "sz.h".
%enum_prefix prefix_to_add prefix_to_remove: When generating enum values from struct names, the string given as prefix_to_add argument is added to the struct name and the second argument is removed from it if present. An argument given as a single dot character stands for the empty string.
%enum_suffix suffix_to_add suffix_to_remove: Same as above, but for suffixes.
%enum_case [upper|lower|same]: Defines the case of the generated enum values.
%enum_start value: This value is used for the first enum value.
%enum_end symbol: This symbol is added to the enum as its last element. You could then use this symbol to check that a given message tag is valid.
%func_prefix identifier: The generated functions will have the given prefix. Default is "sz".
%union_name identifier: All generated structs are combined into one union, which is then the argument to various functions. This is the name of that union. Defaults to "szObject".
%version integer: Declares a new version of the protocol. See the Versions section.
%table symbol: Specifies name of the variable containing the table.
%license: Verbatim directive, contents are put at the beginning of both the source file and header file.
%source-top: Verbatim text. Contents are output at the top of the source file, after license.
%source-bottom: Verbatim text.
%header-top: Verbatim text.
%header-bottom: Verbatim text.
%extra-fields struct_name: Verbatim text. Contents are output inside the typedef struct { } for the corresponding struct, after all field definitions. Fields declared in this way are not processed at all by the library.
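
As a rough illustration of how %enum_start and %enum_end can work together, the sketch below assumes an enumeration that starts at 100 and ends with a sentinel called MSG_END. The enum and all of its member names are hypothetical; the real names depend on your struct names and on the %enum_* directives you use.

```c
#include <stdint.h>

/* Hypothetical enum, as it might look with "%enum_start 100" and
   "%enum_end MSG_END". The member names are made up for illustration. */
enum {
    MSG_HELLO = 100,   /* first value, set by %enum_start */
    MSG_DATA,
    MSG_END            /* sentinel added by %enum_end */
};

/* Returns 1 if tag lies between the first value and the sentinel,
   0 otherwise. */
static int tag_in_range(uint32_t tag)
{
    return tag >= (uint32_t)MSG_HELLO && tag < (uint32_t)MSG_END;
}
```

A range check like this only rejects tags that fall outside the enum. Whether a tag inside the range is recognized by a particular protocol version is still up to the decoder.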

Verbatim text directives copy some input lines to the output. These start with one of the directives listed above and last until the line

%end

Text in between is not interpreted by the program in any way. Even comment lines (starting with #) are passed through.

Object Definitions

Here is the syntax for an object definition:

  object_definition : struct_name "deprecated" ";" 
  object_definition : name_specifier ";" 
  object_definition : name_specifier "{" field_list "}"

  name_specifier    : struct_name 
  name_specifier    : struct_name struct_tag 
  struct_name       : identifier 
  struct_tag        : "@" identifier

  field_list        : field_definition ";" field_list 
  field_list        : field_definition ";" 

  field_definition  : type_specifier field_name_list 
  field_name_list   : field_name 
  field_name_list   : field_name "," field_name_list 
  field_name        : identifier 

  type_specifier    : type_name  
  type_specifier    : type_name "[" "]" 
  type_specifier    : type_name "[" array_size "]" 
  array_size        : identifier 
  array_size        : decimal_number

In this syntax, quote characters are just there to emphasise that the quoted strings are input tokens. You don't need actual quote characters for any tokens.

The given struct name and tag are used as follows in the output:

  typedef struct struct_tag
  {
    fields
  } struct_name;

In the input, the struct tag is written with an @ sign at the front, without any spaces between.

A type specifier describes either a single field, a fixed size array (type_name[size]) or a variable sized array (type_name[]). The base type can be a basic type or a type specified using an object definition. Here are the basic types:

  Basic Type Name   Corresponding C Type
  int8              int8_t
  unt8              uint8_t
  int16             int16_t
  unt16             uint16_t
  int32             int32_t
  unt32             uint32_t
  int64             int64_t
  unt64             uint64_t
  string            char*

In case of a variable sized array, the generated struct contains an additional field called n_X, where X is the name of the original field. This is a uint32_t count of the items in the variable sized array.
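
To make the n_X convention concrete, here is roughly what the generated typedef might look like for a made-up object definition. The struct name, tag and field names are invented. The member order and the use of a plain pointer for the variable sized array are assumptions, and members the tool adds on its own (such as _type) are left out. Check your generated header for the real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input definition:
 *
 *   Sample @sample_s {
 *     unt16    id;
 *     int32[4] fixed;
 *     unt8[]   payload;
 *     string   label;
 *   }
 */
typedef struct sample_s
{
    uint16_t  id;
    int32_t   fixed[4];    /* fixed size array */
    uint32_t  n_payload;   /* item count for payload */
    uint8_t  *payload;     /* variable sized array (assumed to be a pointer) */
    char     *label;       /* string */
} Sample;

/* Attach an array to the struct and keep the count in sync with it. */
static void sample_set_payload(Sample *s, uint8_t *data, uint32_t count)
{
    assert(s != NULL);
    s->payload   = data;
    s->n_payload = count;
}
```

Setting the pointer and the count through one helper is just a habit that keeps the two from drifting apart.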

Output

The program outputs a header and a source file. The source file already contains everything in the header file and thus does not #include the latter.

The header contains an enumeration which identifies the type of an encoded object.

Following that, a typedef struct is output for each object definition. Finally, all structs are combined into a typedef union. This type is used for encoding/decoding functions.

In the source file, you have a table describing the objects. This table contains information about all versions. Following that, encoding, decoding and helper functions are output.

Versions

In the header, a pointer called sztab is declared. This points to the first entry in the table which corresponds to the default version in the input file. The default version is the first declared version or the set of objects without a preceding %version directive.

In order to use a different version, you should call:

  const szTable* szVersion(const szTable* table, int version);

This will return NULL if the required version is not found. You can call this function using sztab as the first argument or any other szTable* typed value obtained from szVersion.

In the input file, when you use the directive %version, a new version of the protocol is declared. This new version inherits all objects from the previous version. Any object definition following the directive will be inserted into the new version of the protocol. If you want to deprecate any message, you may use the deprecated object definition as shown before:

  object_definition : struct_name "deprecated" ";"

From this version on, the given object will no longer be recognized.

Encoding

Encoding is done by the functions:

uint8_t* szEncodePad
    (const szTable *table, szObject *obj,
     size_t hdr, size_t ftr, size_t *size);   
uint8_t* szEncode
    (const szTable *table, szObject *obj, size_t *size);

The first function, szEncodePad, additionally reserves header (hdr) and footer (ftr) space in the allocated buffer.

The return value is the allocated buffer. The allocated size is returned through the size pointer. If the given object is not recognized or if there is a memory allocation failure, the return value is NULL.

In order to encode a struct, you need to set its _type field to the corresponding enum value. This needs to be done only for the top level struct passed to Encode since the types of lower level structs are inferred from the top level struct.

The encoder is capable of handling 0 sized variable arrays and NULL strings.

The encoded buffer contains the following (after any header space):

byte index: 0 1 2 3 4 5 6 7 8 9 ....
            T T T T S S S S D D D ...... D

The first 4 bytes encode the enum value corresponding to the given struct (copied from _type). The second 4 bytes encode the size of the data portion, as shown by D bytes above. Therefore, this size is 8 less than the encoded size.

All integers are encoded little endian.
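
The 8-byte prefix is simple enough to inspect by hand. The sketch below reads the type and the data size from the start of an encoded object. The function names are mine and are not part of the generated library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little endian unsigned integer from p. */
static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Peek at the header of an encoded object without decoding it.
   Returns 0 on success, -1 if fewer than 8 bytes are available. */
static int peek_header(const uint8_t *buf, size_t len,
                       uint32_t *type, uint32_t *data_size)
{
    assert(buf != NULL && type != NULL && data_size != NULL);
    if (len < 8)
        return -1;
    *type      = rd32le(buf);       /* bytes 0..3: enum value */
    *data_size = rd32le(buf + 4);   /* bytes 4..7: size of the D bytes */
    return 0;
}
```

Something like this can be handy when reading from a stream, where you need to know how many more bytes to wait for before handing the buffer to the decoder.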

The encoded buffer is independent of the given object. All data stored in the object is copied to the buffer. The object may be freed without affecting the buffer.

When you're done with the buffer, you may call free() to deallocate it.

Decoding

The following function decodes an object from the given buffer and also advances the buffer parameters to point to the first byte past the encoded object.

szObject* szDecode
   (const szTable *table, uint8_t **buffer, size_t *length);

The initial contents of (*buffer) are expected to conform to the format described above: 4 bytes of type, followed by 4 bytes of size and the corresponding number of data bytes.

The return value is NULL if there are any encoding errors or memory allocation failures. The _type field of the returned object may be used to tell which message has arrived.
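
The way the buffer pointer and length advance can be illustrated without the generated code. The helper below, whose name is hypothetical, steps over one encoded object using only the 8-byte header. Leaving both arguments untouched on failure is a choice made for this sketch. It is not a statement about what szDecode does in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little endian unsigned integer from p. */
static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Skip one encoded object. On success, *buffer and *length are moved
   past the object and 0 is returned. If the object is incomplete, -1 is
   returned and both are left as they were. */
static int skip_object(const uint8_t **buffer, size_t *length)
{
    uint32_t data_size;

    assert(buffer != NULL && *buffer != NULL && length != NULL);
    if (*length < 8)
        return -1;                       /* header not complete */
    data_size = rd32le(*buffer + 4);
    if (data_size > *length - 8)
        return -1;                       /* data not complete */
    *buffer += 8 + (size_t)data_size;
    *length -= 8 + (size_t)data_size;
    return 0;
}
```

Calling this in a loop until it fails walks over every complete object in a receive buffer, which is the same pattern you would use with szDecode.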

The decoded object is independent of the decode buffer. They may be deallocated independently. Functions below are provided for your convenience.

Object Destruction

The functions

void szFree(const szTable *table, szObject *obj);
void szDestroy(const szTable *table, szObject *obj);

can be used to destroy objects created by the user or by the szDecode function. These simply call free() recursively on all pointers.

When you provide objects you have created yourself, make sure that all pointers point to separately allocated addresses. strdup() is your friend. Also, the _type field in the top level object must be correctly set.
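
Here is a small sketch of what "separately allocated" means in practice. The Note struct is a hypothetical stand-in for a generated struct (the _type member is omitted), and note_free only imitates a recursive free for this one struct. It is not the library's szFree. The dup_str helper does the same job as strdup() and is written out only to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct with one string and one variable sized array. */
typedef struct note_s
{
    char     *title;
    uint32_t  n_values;
    int32_t  *values;
} Note;

/* Same job as strdup(): return a heap copy of s, or NULL. */
static char *dup_str(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/* Build a Note in which every pointer owns its own heap block.
   Returns NULL on allocation failure. */
static Note *note_new(const char *title, const int32_t *values, uint32_t count)
{
    Note *n;

    assert(title != NULL);
    n = calloc(1, sizeof *n);
    if (!n)
        return NULL;
    n->title = dup_str(title);
    if (count > 0) {
        n->values = malloc(count * sizeof *n->values);
        if (n->values)
            memcpy(n->values, values, count * sizeof *n->values);
    }
    if (!n->title || (count > 0 && !n->values)) {
        free(n->values);
        free(n->title);
        free(n);
        return NULL;
    }
    n->n_values = count;
    return n;
}

/* Free every pointer in the struct, then the struct itself. */
static void note_free(Note *n)
{
    if (!n)
        return;
    free(n->values);
    free(n->title);
    free(n);
}
```

Pointing title at a string literal, or pointing values into the middle of some other allocation, would make a recursive free crash or corrupt the heap. That is the situation the advice above is meant to prevent.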

The second function, szDestroy, doesn't free the top level pointer itself.

Suggested Use

Since this tool generates a binary protocol, it's important that both the reader and writer of messages are built with the same specification. However, you may need different code for the two programs. In order to resolve this issue, I suggest the following.

License, object definitions and enum specifications should reside in one file, which will be the protocol specification. This file should ideally contain all the versions. Keeping additional versions in additional files means those files must always be given to the tool in the correct order.

Other specifications regarding code generation such as %extra-fields, source and header top/bottom code etc. should reside in a separate file for each program.

There should be some sort of version negotiation at the start of the protocol. Be careful to include messages related to this in the default version.
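
genser doesn't prescribe how the negotiation should work. One possible approach, sketched below under the assumption that version numbers are non-negative integers, is for each side to send the list of versions it supports and to pick the highest one they have in common. The function name and the list exchange are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the highest version present in both lists, or -1 if the two
   sides have no version in common. */
static int pick_version(const int *ours, size_t n_ours,
                        const int *theirs, size_t n_theirs)
{
    int best = -1;
    size_t i, j;

    assert(ours != NULL && theirs != NULL);
    for (i = 0; i < n_ours; i++)
        for (j = 0; j < n_theirs; j++)
            if (ours[i] == theirs[j] && ours[i] > best)
                best = ours[i];
    return best;
}
```

The result would then be passed to szVersion to obtain the table for the rest of the session. A -1 here, or a NULL from szVersion, means the two programs can't talk to each other.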

Download, Installation and Hacking

The program consists of a single file. Simply compile this file with gcc and you're done. There is no installation, just put it somewhere in your $PATH.

In order to develop genser further, you need to be aware that the source file also contains a shell script which is used to re-encode the library code into string literals within the rest of the code. These string literals are used to copy the library code to the output. If you make modifications to the library code, quit the editor and run:

$ sh genser.c

This will create a temporary tool in /tmp and update genser.c. The old version will be stored in genser.c.bak.

Within the library code, some prefixes are used to refer to functions or types. These are then replaced by the program during output.

__REP0: Function prefix.
__REP1: Table name.
__REP2: Union name.

Here is a tarball which contains

test input
test program
Makefile to build genser along with the test.

Encoded Data Format

In the encoded format, there are no tags indicating type. Encoding of data is as follows:

Each integer is encoded little endian.

A string is encoded as a 4-byte length, followed by string data, including the null terminator. The length also counts the null byte. If length is zero, then the string is NULL.
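
The string rule can be sketched in a few lines. This is an illustration of the format as described above and not the library's own encoder. The function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write v at p as a 32-bit little endian integer. */
static void wr32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Encode s at out and return the number of bytes written.
   out must have room for 4 bytes plus strlen(s) + 1 bytes. */
static size_t put_string(uint8_t *out, const char *s)
{
    uint32_t len;

    assert(out != NULL);
    if (s == NULL) {
        wr32le(out, 0);              /* length 0 means NULL */
        return 4;
    }
    len = (uint32_t)strlen(s) + 1;   /* count includes the null byte */
    wr32le(out, len);
    memcpy(out + 4, s, len);
    return 4 + (size_t)len;
}
```

Because the length counts the null terminator, an empty string is encoded with length 1, which keeps it distinct from a NULL string.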

A variable length array is encoded as a 4-byte number of elements, followed by each element.
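
For a variable sized array of unt16 values, the same idea looks like this. Again, this is only a sketch of the format with made-up function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v at p as a 32-bit little endian integer. */
static void wr32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Encode count 16-bit items at out and return the number of bytes
   written. out must have room for 4 + 2 * count bytes. */
static size_t put_u16_array(uint8_t *out, const uint16_t *items, uint32_t count)
{
    size_t off = 4;
    uint32_t i;

    assert(out != NULL);
    wr32le(out, count);                          /* number of elements */
    for (i = 0; i < count; i++) {
        out[off++] = (uint8_t)(items[i] & 0xff); /* low byte first */
        out[off++] = (uint8_t)(items[i] >> 8);
    }
    return off;
}
```

An empty array takes just the 4 bytes of the count, and the items pointer is never read in that case.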

Data is encoded in a depth-first fashion. If you have arrays, strings or variable sized arrays, data for these are encoded directly after the previous field's data. End of the encoded buffer corresponds to the end of the top level object.

Bugs and Things to Do

When the tool fails, it will leave partially written output files. This is not acceptable. Ideally, the writes should be done to temporary files which are subsequently renamed in case of success.

Although the generated library has no memory leaks, the same can't be said for the tool itself. It never frees any memory. Normally this is not an issue since this is a build tool and the input size is not likely to be huge. However, properly deallocating stuff sometimes exposes bugs that are not normally encountered by chance.

For example, if you have two things that are (supposed to be) copies of each other, deallocating one should not create a problem for deallocating the other. If you never deallocate, you won't find this bug easily. On the other hand, valgrind will happily tell you that a pointer is freed multiple times if you do the deallocations.

Error handling has been tested in neither the tool nor the library. It does work for simple cases, but I expect some bugs to emerge later.

In szFree and szDestroy, the table argument is redundant. These functions could easily use the all-versions table to locate the requested object definition. Also, these functions should really return an integer, indicating success(0)/failure(non-0). Failure happens when the library is unable to find the object definition.

Some checks are intentionally omitted. For instance, if you have an object which claims to have N elements in a variable sized array, the library doesn't check whether the corresponding pointer is NULL or not. It simply goes ahead with the encoding. If it ends up trying to read from a NULL address, the program will naturally die with a segfault. This is intentional because this inconsistency is a serious bug and allowing Encode to treat it as a regular error would make it harder to find the bug.