Binary File Format Tool

This tool makes save and load operations easier to implement. The generated code lets a program save only the data that has been modified.

The tool takes a description text file and generates code to interact with files of the format described therein. The generated code is C. Porting it to another language would not be easy, since the associated structs are also generated by the tool.

Files created using the generated code grow and shrink as needed. You can look at the allocation algorithm for more details.

Each file also contains an index which is loaded whole to memory when the file is opened. This makes it possible to have very fast lookups for reading.

The tool generates some C code to be used in your program. Necessary types are also generated by the tool, but you can modify them using some directives.

An object is a top level data structure which can be directly read or written. All data you read or write must be part of an object. For example, you can't simply read an integer value with a given identifier. The integer must reside inside an object.

There is no top-level 'main' object. All objects are equal in standing and are handled independently of each other. The object index is not visible to the user. If you want to maintain a list of things, you should do it yourself by storing some pointers in an object.

When you run the generated code, almost all functions return some error indication. The return value is non-zero if a function fails. Errors within the generated code are usually non-recoverable. An error occurs only when memory or disk fails, or when the binary file is corrupt. None of these can be recovered from gracefully. Therefore, in case of a failure within the generated code, the best option is to make a new file and store as much as you can into the new file before abandoning the current one.

Download

1.2 1.1

Command Line

Command line switches recognized by the program are:

-v: Prints version information and exits.
-s <filename>: Uses the given file as source code output.
-h <filename>: Uses the given file as header output.
-p <prefix>: Uses the given prefix for the generated code.

The rest of the command line consists of input file names. The first input file is considered to be the main input file and the rest, if any, are auxiliary.

The main input file describes the object types to be used for the file format. It can also contain other things, such as directives. A single main input file is all you need to generate the code for a given format.

If present, auxiliary input files modify the description given by the main input file. These files cannot contain new object descriptions but can modify the existing ones in certain ways. For example, an auxiliary input file can modify the 'raw' fields in an object. Indeed, when an auxiliary input file is given, all raw fields in all object definitions are removed from the main description.

Another use for auxiliary input files is to modify the compilation environment. In an auxiliary input file, you may define new code to be inserted before and after the source output, you may change the library prefix etc.

The main usage pattern is as follows. The main input file is supposed to be a 'pure' declaration of the file format, much like a grammar description. Following this main input file, you can have an auxiliary input file which adds raw fields for use in a specific program, along with compilation specifics such as header information, output file names etc.

This system makes it easier to share one file format description among several programs which operate on the same file format.

Types

The program can process the following basic types:

Name	C Type	Name	C Type
int8	int8_t	unt8	uint8_t
int16	int16_t	unt16	uint16_t
int32	int32_t	unt32	uint32_t
int64	int64_t	unt64	uint64_t
flo32	float	flo64	double
str8	uint8_t*	str16	uint16_t*
str32	uint32_t*

These are typically defined in <stdint.h>. The strings are null terminated, with no interpretation of the first elements. Some programs insert a sentinel value at the beginning of a multi-byte string to indicate its endianness. The generated code gives such a value no special treatment.
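As a small illustration of the string convention, the following sketch counts the elements of a null-terminated str16 value. The function name str16_len is made up for this example and is not part of the generated code. Note that a leading sentinel, if a program stores one, is counted like any other element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the elements of a null-terminated str16 string, excluding the
   terminator. The first element gets no special treatment, so a leading
   byte-order sentinel is counted like any other element.
   (str16_len is a hypothetical helper, not generated by the tool.) */
static size_t str16_len(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```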

You can also define arrays and dynamic arrays. Arrays are multi-dimensional C arrays with fixed dimensions. Dynamic arrays are one-dimensional arrays with variable size. There are some restrictions on what you can store inside an array element. An array or a dynamic array can have the following as its elements:

A basic type
An object or a struct

Arrays of arrays or arrays of pointers are not allowed.

Speaking of pointers, you can also have pointer fields in objects or structs. These can only point to objects. No double pointers, no pointers to anything else.

Objects and structs are containers in which you can store fields of the above mentioned types. These two types are identical except for one fact: Objects are top level structures which can be read and written independently. They have their own object identifier and read/write functions are generated for each object.

On the other hand, structs can only be handled as part of another object or struct. No read/write functions are generated for structs and they are not stored in the object index.

Both types are represented by C structs. C structs generated for objects have an extra field in them, called "_objid".

Input Files

Input files are quite simple line based text files. Each command spans one line and is separated from its arguments by whitespace. If not part of C code, a '#' character starts a comment which runs to the end of the line.

In an input file, you can specify object and struct declarations, directives and C code.

Object and Struct Declarations

An object is declared using the following syntax:

object <name>
  <field declaration>
  <field declaration>
  ..
  <field declaration>
  <raw field>
  <raw field>
  ..
  <raw field>
end

A 'raw field' is just a regular C struct member declaration. Raw declarations aren't processed by the tool. They are simply passed along to output. The syntax for a raw field is:

  raw <C declaration>

Normal fields are declared using the following syntax:

  field <name> <type>

If the type is a basic type, then the name in the first table is used in the expression. Other cases are listed below:

  array(*) <array element type>
  array(<dim1>,<dim2>, ... <dimN>) <array element type>
  ptr <object name>
  <object or struct name>

Array element types are explained in the Types Section.

Directives

The following directives are recognized:

output_source "file.c": Name of the output source file. This can be overridden from the command line, with the -s switch.
output_header "file.h": Similar to output_source. Can be overridden with the -h switch on the command line.
library_prefix pfx: The given prefix is used for function names and struct names in the resulting source and header files. The corresponding command line switch is -p.
magic "string": Magic string. Must be at most 16 characters.

In addition to these, you may specify some extra code to be inserted at the top or the bottom of header and source code output. This is done by using the following directives:

src_top / end_src_top
src_bot / end_src_bot
hdr_top / end_hdr_top
hdr_bot / end_hdr_bot

You simply put your C code between the given pair of directives. For instance:

src_top
#include "myincludes.h"
end_src_top

In the case of src_top, the given C code replaces the #include directives for the generated code. This is a good place to introduce your platform specific #includes.

Generated Code

The tool will generate some utility functions along with read/write functions for each object type. The default prefix is "bff_", in case you don't provide it. Below, it's represented as PFX. The following utility functions are generated:

int  PFXopen(const char *fn, int mode,bff_t **R);
int  PFXcreate(const char *fn, bff_t **R);
void PFXclose(bff_t *bf);
int  PFXerrno(bff_t *bf);

For PFXopen(), mode is 1 if you need write access and 0 otherwise. PFXopen() and PFXcreate() return non-zero if the operation failed. PFXerrno() returns the latest errno.

For each object type OBJ, you get one struct declaration and two functions:

typedef struct
{
  < fields go here >
} PFXOBJ_t;
int PFXread_OBJ (bff_t *bf, uint64_t objid, PFXOBJ_t **R);
int PFXwrite_OBJ(bff_t *bf, PFXOBJ_t *obj);

In PFXread_OBJ(), the object and its fields are allocated using malloc(). When you create a new object, it's important to set its _objid field to 0. This way, it will get a valid object identifier when it's written.

Binary File Format

A binary file starts with the file header, continues with some data, and ends with an object index. A file header looks like:

MAGIC  : 16 bytes
FILESIZ: uint64
NEXTID : uint64

The file header is padded with zeroes to size 128.

All data, including header and meta-data, is written in little endian order. For example, FILESIZ is an 8-byte unsigned integer written with the least significant byte first. All file offsets are relative to the beginning of the file.

In the file header, MAGIC is any sequence of 16 bytes. It's not interpreted in any way. FILESIZ is the offset of the object index. This is also where the data section of the file ends. NEXTID is the next available object identifier. These identifiers start from 32. Values below 32 are reserved.
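The header layout can be decoded by hand as in the sketch below. It assumes the three fields are stored back to back in the listed order (MAGIC at offset 0, FILESIZ at 16, NEXTID at 24), which the description suggests but does not state outright. The names file_header_t, get_u64_le and parse_header are invented for this example, and the plausibility checks at the end are an addition of the sketch, not something the generated code is known to do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HDR_SIZE   128  /* the header is zero-padded to this size */
#define MAGIC_SIZE 16

/* Hypothetical in-memory view of the file header. */
typedef struct {
    uint8_t  magic[MAGIC_SIZE];
    uint64_t filesiz;  /* offset of the object index */
    uint64_t nextid;   /* next available object identifier */
} file_header_t;

/* Decode an 8-byte little endian unsigned integer. */
static uint64_t get_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode the header from the first HDR_SIZE bytes of a file.
   Assumes the fields are packed in the listed order.
   Returns 0 on success, non-zero if the values are implausible. */
static int parse_header(const uint8_t *buf, file_header_t *h)
{
    assert(buf != NULL && h != NULL);
    memcpy(h->magic, buf, MAGIC_SIZE);
    h->filesiz = get_u64_le(buf + 16);
    h->nextid  = get_u64_le(buf + 24);

    /* The index cannot start inside the header, and identifiers
       below 32 are reserved. */
    if (h->filesiz < HDR_SIZE || h->nextid < 32)
        return 1;
    return 0;
}
```

Decoding byte by byte, rather than casting the buffer to a struct, keeps the sketch independent of the host's endianness and struct padding.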

At the end of the file, we have an object index. This index holds the positions of objects as well as free areas. The following is the format of the object index:

  NENTRIES: uint32
  <entry1>
  <entry2>
  ..
  <entryN>

Each entry has the following information:

 TYPE  : uint32
 OFFSET: uint64
 IDENT : uint64

TYPE is 0 for free areas. Otherwise, it identifies the type of the object. This is just an integer assigned by the system. The OFFSET is the offset of the object or free area within the file. IDENT is the object identifier. If the entry refers to a free area, then IDENT is the size of the free area.
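A reader for the index might look like the sketch below. It assumes that entries are packed without padding, so each one occupies 20 bytes (4 + 8 + 8) right after NENTRIES. The names index_entry_t, get_le and index_entry are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_SIZE 20  /* TYPE (4) + OFFSET (8) + IDENT (8), assumed packed */

/* Hypothetical in-memory view of one index entry. */
typedef struct {
    uint32_t type;    /* 0 means free area */
    uint64_t offset;  /* position within the file */
    uint64_t ident;   /* object identifier, or size of a free area */
} index_entry_t;

/* Decode a little endian unsigned integer of nbytes bytes (at most 8). */
static uint64_t get_le(const uint8_t *p, int nbytes)
{
    uint64_t v = 0;
    int i;

    for (i = nbytes - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode entry number i from an index image of len bytes.
   Returns 0 on success, non-zero if the entry is out of range. */
static int index_entry(const uint8_t *idx, size_t len, uint32_t i,
                       index_entry_t *e)
{
    uint32_t n;
    const uint8_t *p;

    assert(idx != NULL && e != NULL);
    if (len < 4)
        return 1;
    n = (uint32_t)get_le(idx, 4);                 /* NENTRIES */
    if (i >= n || (len - 4) / ENTRY_SIZE <= i)    /* beyond count or data */
        return 1;

    p = idx + 4 + (size_t)i * ENTRY_SIZE;
    e->type   = (uint32_t)get_le(p, 4);
    e->offset = get_le(p + 4, 8);
    e->ident  = get_le(p + 12, 8);
    return 0;
}
```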

The rest of the file consists of blocks. Each block has the following header:

 SIZE : uint64
 PREV : uint64
 NEXT : uint64

PREV and NEXT fields form a doubly linked list. These fields are meaningful only for blocks that are part of objects. For free areas, these are ignored.

The SIZE field tells us how big the block is. This size includes the size of the block header.

An object may be split into several blocks. For all blocks except for the last one, the amount of data stored in a block is the same as the block size minus the block header size. The last block may contain less data than indicated by SIZE. This happens when splitting off the unused remainder as a free area would leave a region too small to be useful for anything.

For free blocks, only the SIZE field is meaningful, which stores the same value as the associated index entry does.

As mentioned above, an object may span several blocks. The first block for an object contains the object header followed by some object data. The remaining blocks contain only object data within their payload section. An object header is:

 SIZE : uint64

Type and identifier information is already present in the index, so they are not repeated here. SIZE doesn't include the size of the object header. It's simply the size of the object data.

The object data can contain aggregate data such as strings or dynamic arrays. For these types of data, the number of elements is given before the elements themselves. For strings, the terminating null is not included in this count. The count is an unsigned 4 byte integer.
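As an illustration of the counted encoding, the sketch below decodes a str8 value from a buffer of object data. Whether a terminating null is also stored on disk after the elements is not specified here, so the sketch assumes it is not: it consumes only the count and the counted elements, and adds the terminator in memory. The function name decode_str8 is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode a counted str8 from buf, which holds avail bytes.
   On success, returns a malloc()ed, null-terminated copy and stores the
   number of bytes consumed in *used. Returns NULL on truncated input or
   allocation failure.
   Assumption: no terminating null is stored on disk after the elements. */
static uint8_t *decode_str8(const uint8_t *buf, size_t avail, size_t *used)
{
    uint32_t n;
    uint8_t *s;

    assert(buf != NULL && used != NULL);
    if (avail < 4)
        return NULL;

    /* The element count is a 4-byte little endian unsigned integer. */
    n = (uint32_t)buf[0]
      | ((uint32_t)buf[1] << 8)
      | ((uint32_t)buf[2] << 16)
      | ((uint32_t)buf[3] << 24);
    if (avail - 4 < n)
        return NULL;

    s = malloc((size_t)n + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, buf + 4, n);
    s[n] = 0;  /* the count excludes the terminator, so add it here */
    *used = 4 + (size_t)n;
    return s;
}
```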

Arrays are another form of aggregate data. Dimensions of arrays are not stored on disk since they are already known by the generated code. Elements of an array are stored row-first, the same order in which C lays out multi-dimensional arrays in memory.
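Reading "row-first" as C's row-major order, where the last index varies fastest, the position of an element within the stored sequence can be computed as in the sketch below. The function name row_major_index is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Linear position of element idx[0..ndim-1] in an array with dimensions
   dims[0..ndim-1], stored in row-major order (last index varies fastest).
   This is an interpretation of "row-first", not taken from the tool. */
static size_t row_major_index(const size_t *dims, const size_t *idx,
                              size_t ndim)
{
    size_t k, pos = 0;

    assert(dims != NULL && idx != NULL);
    for (k = 0; k < ndim; k++) {
        assert(idx[k] < dims[k]);
        pos = pos * dims[k] + idx[k];
    }
    return pos;
}
```

For an array declared as array(2,3,4), element [1][2][3] is the last of the 24 stored elements, at position 23.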