Welcome to Duncan White's Practical Software Development (PSD) Pages.
I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.
![]()
datadec: bringing Recursive (Inductive) Data Types to ANSI C.
Part 2: Understanding The Module Generated: datatypes.c
- In the first part of this article I discussed datadec - a tool I built 20 years ago - which reads an input file specifying one or more Functional Programming style Recursive (or Inductive) Data Types, such as:
TYPE { intlist = nil or cons( int head, intlist tail ); illist = nil or cons( intlist head, illist tail ); idtree = leaf( string id ) or node( idtree left, idtree right ); }and automatically writes a complete C module (definition part - a .h file - plus an implementation part - a .c file) that implements the above types in a standard way.- We looked at the header file (datatypes.h) that datadec wrote, and, using that knowledge of the interface, looked at two test programs (testintlist.c and testidtree.c) to see how we can construct lists and trees, and how we can take them apart - summing lists, traversing trees etc.
- Now, here in the second part, let's look inside datatypes.c - the implementation module that datadec produced - to see what sort of code datadec generates to define all the functions that the header file declared prototypes for, i.e how datadec actually implements various forms of RDT.
- Finally, in the third and final part I'll discuss the internals of datadec and how it actually works. Datadec is quite a complex piece of software, here's a chance to investigate how we might go about building such a piece of software.
datatypes.c: Includes, Recap and Constructors
- datatypes.c starts with a few includes, stuff we need and the datatypes.h interface - doing this is standard practice in C modular programming, to enable the compiler to check the consistency of the header declarations and the true function definitions in the .c file:
#include <stdio.h> #include <stdlib.h> #include "datatypes.h"- The next thing you'll notice inside datatypes.c is that each non-empty shape (for each type) is turned into a corresponding C constructor function, which malloc()s a block (using the NEW() macro we defined in datatypes.h) and then fills in the appropriate fields:
intlist intlist_cons( int head, intlist tail ) { intlist new = NEW(intlist); new->head = head; new->tail = tail; return new; }At this point, it may help to refer to datatypes.h and check the definition of the NEW() macro, the type declarations relating tostruct intlist
and the constructor declarations that datadec generated:#define NEW(t) ((t)malloc(sizeof(struct t))) .. typedef enum { intlist_is_nil, intlist_is_cons, } kind_of_intlist; struct intlist { int head; intlist tail; }; .. #define intlist_nil() ((intlist)NULL) extern intlist intlist_cons( int , intlist );Notice that intlist's nil() shape is empty (i.e takes no arguments), so datadec implemented the
intlist_nil()
constructor as a macro returning NULL. Hence, only theintlist_cons(int head,intlist tail)
constructor is given as a prototype above, with a corresponding definition in the C code.Aside: you can switch these optimizations off using datadec -n.., if you do this the
struct intlist
type definition will become a union of two structures (with one empty), andintlist_nil()
will become a proper function, with a prototype declaration in datatypes.h and a definition in datatypes.c, and the constructor code for both shapes will be more complex. This may help you to understand the regular method, without the special cases. In practice, however, I wouldn't bother with disabling the optimizations, as I haven't found a bug in the optimization logic for over 15 years:-)You'll also notice that the code datadec generates is neatly laid out - this was a deliberate design decision on my part, I don't see why a tool should write unpleasant code! I wanted datadec to automatically write the code that I would have written myself if I'd done it by hand.
- When it comes to our second datatype,
illist
, the list of integer lists, the data structures in datatypes.h are virtually identical, the choice to use(illist)NULL
as the value ofillist_nil()
is the same, and theillist_cons()
constructor is:illist illist_cons( intlist head, illist tail ) { illist new = NEW(illist); new->head = head; new->tail = tail; return new; }- When it comes to
idtree
, recall that the RDT definition was:idtree = leaf( string id ) or node( idtree left, idtree right );Because no idtree shape is empty, the NULL optimization cannot apply. Because there are at least two non-empty shapes, a tagged union is required (what Pascal-style languages would call a variant record). One small optimization does apply: the leaf shape has only a single argument (the id) so we don't need an inner leaf struct. Looking in datatypes.h we see:typedef enum { idtree_is_leaf, idtree_is_node, } kind_of_idtree; struct idtree { kind_of_idtree tag; union { string leaf_id; struct { idtree left; idtree right; } node; } u; }; .. extern idtree idtree_leaf( string ); extern idtree idtree_node( idtree , idtree );- The corresponding constructor definitions are simple given the choices that datadec has already made for
struct idtree
. First, theleaf(id)
constructor:idtree idtree_leaf( string id ) { idtree new = NEW(idtree); new->tag = idtree_is_leaf; new->u.leaf_id = id; return new; }If the lvaluenew->u.leaf_id
looks strange to you, reread thestruct idtree
declaration. There you'll see thatu
is the union, which when storing a leaf(), stores a string calledleaf_id
.
- Next, the
node(l,r)
constructor:idtree idtree_node( idtree left, idtree right ) { idtree new = NEW(idtree); new->tag = idtree_is_node; new->u.node.left = left; new->u.node.right = right; return new; }Now the left and right subtrees are contained inside a structure called node, inside the union called u, hence we assign tonew->u.node.left
andnew->u.node.right
.We've now covered all the constructors. I point again to the elegance of the code generated - if I was writing this by hand, I might line up the '=' signs within a constructor, otherwise it's exactly the code I would write.
datatypes.c: Shape testers and Deconstructors
- The next section of datatypes.c defines, for each type, a shape tester function and, for each non-empty shape, a deconstructor function that breaks an element of the appropriate shape into the constituent parameters.
- For
intlist
, we find the following shape tester:kind_of_intlist intlist_kind( intlist this ) { if( this == NULL ) { return intlist_is_nil; } return intlist_is_cons; }This function takes an intlist and returns a
kind_of_intlist
- look higher up this page and remind yourself that it's an enumeration. Note that the body has to compensate for the NULL optimization (turning NULL back intointlist_is_nil
), so that the user of the generated datatype module doesn't have to care about irregularities caused by optimizations: they simply test whether the kind of the list is nil or cons.- Then we find the deconstructors for non-empty shapes: (
nil()
is an empty shape, so needs no deconstructor), but the non-emptycons(head,tail)
needs the following deconstructor:void get_intlist_cons( intlist this, int *headp, intlist *tailp ) { *headp = this->head; *tailp = this->tail; }
get_intlist_cons()
takes an intlist (this) and two var parameters in Pascal-speak, i.e. pointers to theint head
andintlist tail
that we want to store the answers in, then the body returns the internal pieces of the intlist element to the caller by storing them in the "var parameters",*headp
and*tailp
. We do it this way because C can't return tuples - if it could, we would just writereturn (this->head,this->tail)
instead. But it can't:-(The multiple returns via pointer parameters technique is standard in C, but if you've having difficulty understanding what the above is doing, recall from the first part that a deconstructor is called as:
int h; intlist t; get_intlist_cons( l, &h, &t );Making the callget_intlist_cons( l, &h, &t )
setsthis=l, headp=&h
andtailp=&t
. The first assignment insideget_intlist_cons()
sets*headp = this->head
, i.e.*&h = l->head
, i.e.h = l->head
.Similarly, the second assignment sets:
*tailp = this->tail
, i.e.*&t = l->tail
, i.e.t = l->tail
.Overall, this function has the affect of writing l's head value into h, and l's tail pointer into t. Aren't pointers great!
- Now that we've seen the constructors, the shape tester and the deconstructors, it's worth spending a few minutes making sure that you are completely clear about how they work together. One way of doing this is to dry-run some simple example. For example, let's consider how a specific version of our sum up an intlist code from the first part works:
intlist l = intlist_nil(); /* build list [300,200,100] */ l = intlist_cons( 100, l ); l = intlist_cons( 200, l ); l = intlist_cons( 300, l ); int sum = 0; /* now sum it up */ intlist p = l; while( intlist_kind(p) == intlist_is_cons ) { int head; get_intlist_cons( p, &head, &p ); sum += head; } printf( "sum=%d\n", sum );I strongly recommend that you single-step through this whole program so you understand exactly how this works before proceeding further. You can use a debugger, but I'd recommend stepping through on paper, drawing pointer diagrams and modifying them as each assignment happens. It should take you about 15-20 minutes, require that you fill out several sheets of paper, and desperately make you wish for a tool that could automatically draw such pointer diagrams for you. Now there's a cool project for somebody to take on!
Continuing,
illist
's shape tester and deconstructors are, as you should expect, virtually identical tointlist
's, so let's skip them.- Turning now to
idtree
, things are rather different due to the tagged unionstruct idtree
definition. The shape test simply picks out the tag, which answers the "which shape am I" question:kind_of_idtree idtree_kind( idtree this ) { return this->tag; }- The
leaf()
deconstructor is as shown below. Note that, because a leaf has a single argument, we could have chosen to return it, but I decided to use a single "var parameter" instead, for interface consistency:void get_idtree_leaf( idtree this, string *id ) { *id = this->u.leaf_id; }- Our final deconstructor (
get_idtree_node()
) is similar:void get_idtree_node( idtree this, idtree *left, idtree *right ) { *left = this->u.node.left; *right = this->u.node.right; }Aside: You'll notice that we're not checking inside a deconstructor that the element has the right shape before deconstructing it. I reasoned that such checks were not necessary because the usual structure of calling these functions guarantees that the data is of the correct shape:
if( kind(t) is leaf ) { /* we know it's a leaf */ get_idtree_leaf( t, &id ); } else { /* we know it's a node */ get_idtree_node( t, &l, &r ); }If I was building datadec today, I might want to check this precondition explicitly via
assert(this->tag == idtree_is_leaf)
. At the time, computers were slower so unnecessary checks might significantly slow your program down - these days defensive programming is more important!Actually, less forgivably, you may have noticed that our constructors never check that the result of malloc()/NEW() is non NULL! Perhaps I'll add an option in the near future causing datadec to generate rather more checks. You're welcome to do this as an exercise yourself!
datatypes.c: Print functions
- The final part of the generated C file implements, for each type, a function that prints an element of the type to an open file. For an
intlist
the print function is as follows:void print_intlist( FILE *f, intlist p ) { if( p == NULL ) { fputs( "nil", f ); return; } fputs( "cons", f ); fputc( '(', f ); fprintf( f, "%d", p->head ); fputc( ',', f ); print_intlist( f, p->tail ); fputc( ')', f ); }Pretty cool, huh? print functions generated by datadec show the object in terms of constructor calls (just as most functional programming languages do automatically). With our 3-element list that we summed above, the print function generates:
cons(300,cons(200,cons(100,nil)))- In fact, datadec allows you to alter the print formatting. If you alter types.in to read:
TYPE { intlist = nil "" or cons( int head, intlist tail ) 1 "," 2; }Here we've added a series of literal strings and numbers following each shape - these are called print hints: a literal string is printed as-is, and a number N means "print the Nth argument here, in whatever natural format it should be printed".Having added the print hints, rerunning datadec will generate the following print function:
void print_intlist( FILE *f, intlist p ) { if( p == NULL ) { fputs( "", f ); return; } fprintf( f, "%d", p->head ); fputc( ',', f ); print_intlist( f, p->tail ); }which would print our 3-element list as:300,200,100,However, there's no way of suppressing the trailing comma, or enclosing the list in square brackets (say). If you want something like this, you'll have to write your own custom print function (with a different name). datadec allows you to embed additional code in your types.in file which is copied as-is into the generated header file or the generated C file, see the EXPORT and GLOBAL sections in datadec's manual for more details.
- The default print function for the idtree data type would be:
void print_idtree( FILE *f, idtree p ) { switch( p->tag ) { case idtree_is_leaf: fputs( "leaf", f ); fputc( '(', f ); fputs( p->u.leaf_id, f ); fputc( ')', f ); break; case idtree_is_node: fputs( "node", f ); fputc( '(', f ); print_idtree( f, p->u.node.left ); fputc( ',', f ); print_idtree( f, p->u.node.right ); fputc( ')', f ); break; default: fprintf( stderr, "print_idtree: impossible tag %d\n", p->tag ); exit(1); } }- In most cases, this would bring us to the end of our investigation of the structure of the generated datatypes.c file....
datatypes.c: experimental Free functions
- ... Unless of course you run datadec with the experimental -f option which adds a free function for each type, such as:
void free_idtree( idtree p ) { switch( p->tag ) { case idtree_is_leaf: free_string( p->u.leaf_id ); break; case idtree_is_node: free_idtree( p->u.node.left ); free_idtree( p->u.node.right ); break; default: fprintf( stderr, "free_idtree: impossible tag %d\n", p->tag ); exit(1); } free( p ); }Like a print function, this traverses the RDT, checking which shape the element is, freeing any arguments that need to be freed, before finally freeing the malloc()ed chunk itself. By the way,
free_string(s)
is a macro that you need to define in the GLOBAL section (at the top of types.in) as follows:GLOBAL { #define free_string(s) free(s) }While this mechanism basically works, freeing everything that was malloc()ed, it is depressingly easy to build something containing shared pointers - eg.
node(t,t)
builds a valid tree node, but both left and right pointers point to the same tree t. If t is itself anode()
thenfree_idtree()
will inevitably free() t twice. It is unclear how to fix this, or indeed if there's any way we could - given that C doesn't possess a reference counting garbage collector. Of course, you could write one yourself, but that's not an appealing thing to do!As a simple feature for now, I added an option: if you don't want an argument freed, you mark it in the input file with a `-', as in:
idtree = leaf( -string id ) or node( idtree left, idtree right );This suppresses the call to
free_string(leaf_id)
. This is useful, but of course this doesn't really help with the shared subtree problem. We could mark (say) the right subtree as never to be freed:idtree = leaf( string id ) or node( idtree left, -idtree right );but this affects all right subtrees in all idtrees we ever build rather than detecting whether a particular subtree has already been freed, while freeing all the other subtrees.
For now (and probably forever), this is a problem left for the user of datadec. Think carefully about how you construct your RDT values, to avoid shared pointers.
Unfortunately, shared pointers can easily emerge in other ways. Consider a list of strings. Now write a function to reverse the list. Most likely, you'll reuse the head strings (but in reverse order), but now both the original list and the reversed list share all the head string pointers! Now freeing both lists is going to go badly wrong! But if you don't free the second list, you never free the structure of the second list. In this case, you actually want two free functions: one which frees the heads, and one which doesn't. Or perhaps you want one function - with the ability, per call, to say "don't free the heads this time".
This really is tricky, and I am still thinking about it. I got as far as considering some sort of per-free call "set of RDT arguments not to free". This set would usually be empty, but you could add one or more elements to such a set before a free call, to suppress freeing particular parameters this time. For example, to prevent an idtree's leaf(id) being freed, you could add
idtree/leaf/id
to the set. The free routines generated would then need to check for set membership before each free call. For example, thefree_idtree
for freeing aleaf(id)
would be:case idtree_is_leaf: if( ! in_set( "idtree/leaf/id", excludeset ) { free_string( p->u.leaf_id ); } break;However, I don't want to have to ship a set implementation with datadec to implement this feature, so this remains an idea for now!Onto Part 3
- Ok, in the third and final part, I'll describe the internals of datadec itself, and explain how I built the tool.
d.white@imperial.ac.uk Back to the first part of this article Onto the third part of this article Back to PSD Top Written: Oct 2014