PSD: Duncan White's Practical Software Development Pages

Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.

See all my Practical Software Development (PSD) Pages

datadec: bringing Recursive (Inductive) Data Types to ANSI C.
Part 2: Understanding The Module Generated: datatypes.c
In the first part of this article I discussed datadec - a tool I built 20 years ago - which reads an input file specifying one or more Functional Programming style Recursive (or Inductive) Data Types, such as:
 TYPE {
        intlist =  nil or cons( int head, intlist tail );
        illist  =  nil or cons( intlist head, illist tail );
        idtree  =  leaf( string id )
                or node( idtree left, idtree right );
 }
and automatically writes a complete C module (definition part - a .h file - plus an implementation part - a .c file) that implements the above types in a standard way.
We looked at the header file (datatypes.h) that datadec wrote, and, using that knowledge of the interface, looked at two test programs (testintlist.c and testidtree.c) to see how we can construct lists and trees, and how we can take them apart - summing lists, traversing trees etc.

Now, here in the second part, let's look inside datatypes.c - the implementation module that datadec produced - to see what sort of code datadec generates to define all the functions that the header file declared prototypes for, i.e how datadec actually implements various forms of RDT.

Finally, in the third and final part I'll discuss the internals of datadec and how it actually works. Datadec is quite a complex piece of software, here's a chance to investigate how we might go about building such a piece of software.
datatypes.c: Includes, Recap and Constructors
datatypes.c starts with a few includes, stuff we need and the datatypes.h interface - doing this is standard practice in C modular programming, to enable the compiler to check the consistency of the header declarations and the true function definitions in the .c file:
 #include <stdio.h>
 #include <stdlib.h>
 #include "datatypes.h"
The next thing you'll notice inside datatypes.c is that each non-empty shape (for each type) is turned into a corresponding C constructor function, which malloc()s a block (using the NEW() macro we defined in datatypes.h) and then fills in the appropriate fields:
 intlist intlist_cons( int head, intlist tail )
 {
	intlist	new = NEW(intlist);
	new->head = head;
	new->tail = tail;
	return new;
 }
At this point, it may help to refer to datatypes.h and check the definition of the NEW() macro, the type declarations relating to struct intlist and the constructor declarations that datadec generated:
 #define NEW(t) ((t)malloc(sizeof(struct t)))
 ..
 typedef enum {
	intlist_is_nil,
	intlist_is_cons,
 } kind_of_intlist;

 struct intlist {
	int	head;
	intlist	tail;
 };
 ..
 #define intlist_nil() ((intlist)NULL)
 extern intlist intlist_cons( int , intlist );
Notice that intlist's nil() shape is empty (i.e takes no arguments), so datadec implemented the intlist_nil() constructor as a macro returning NULL. Hence, only the intlist_cons(int head,intlist tail) constructor is given as a prototype above, with a corresponding definition in the C code.
Aside: you can switch these optimizations off using datadec -n.., if you do this the struct intlist type definition will become a union of two structures (with one empty), and intlist_nil() will become a proper function, with a prototype declaration in datatypes.h and a definition in datatypes.c, and the constructor code for both shapes will be more complex. This may help you to understand the regular method, without the special cases. In practice, however, I wouldn't bother with disabling the optimizations, as I haven't found a bug in the optimization logic for over 15 years:-)
You'll also notice that the code datadec generates is neatly laid out - this was a deliberate design decision on my part, I don't see why a tool should write unpleasant code! I wanted datadec to automatically write the code that I would have written myself if I'd done it by hand.
When it comes to our second datatype, illist, the list of integer lists, the data structures in datatypes.h are virtually identical, the choice to use (illist)NULL as the value of illist_nil() is the same, and the illist_cons() constructor is:
 illist illist_cons( intlist head, illist tail )
 {
	illist	new = NEW(illist);
	new->head = head;
	new->tail = tail;
	return new;
 }
When it comes to idtree, recall that the RDT definition was:
 idtree  =  leaf( string id )
         or node( idtree left, idtree right );
Because no idtree shape is empty, the NULL optimization cannot apply. Because there are at least two non-empty shapes, a tagged union is required (what Pascal-style languages would call a variant record). One small optimization does apply: the leaf shape has only a single argument (the id) so we don't need an inner leaf struct. Looking in datatypes.h we see:
 typedef enum {
	idtree_is_leaf,
	idtree_is_node,
 } kind_of_idtree;

 struct idtree {
	kind_of_idtree	tag;
	union {
		string	leaf_id;
		struct {
			idtree	left;
			idtree	right;
		} node;
	} u;
 };
 ..
 extern idtree idtree_leaf( string );
 extern idtree idtree_node( idtree , idtree );
The corresponding constructor definitions are simple given the choices that datadec has already made for struct idtree. First, the leaf(id) constructor:
 idtree idtree_leaf( string id )
 {
	idtree	new = NEW(idtree);
	new->tag = idtree_is_leaf;
	new->u.leaf_id = id;
	return new;
 }
If the lvalue new->u.leaf_id looks strange to you, reread the struct idtree declaration. There you'll see that u is the union, which when storing a leaf(), stores a string called leaf_id.
Next, the node(l,r) constructor:
 idtree idtree_node( idtree left, idtree right )
 {
	idtree	new = NEW(idtree);
	new->tag = idtree_is_node;
	new->u.node.left = left;
	new->u.node.right = right;
	return new;
 }
Now the left and right subtrees are contained inside a structure called node, inside the union called u, hence we assign to new->u.node.left and new->u.node.right.
We've now covered all the constructors. I point again to the elegance of the code generated - if I was writing this by hand, I might line up the '=' signs within a constructor, otherwise it's exactly the code I would write.
datatypes.c: Shape testers and Deconstructors
The next section of datatypes.c defines, for each type, a shape tester function and, for each non-empty shape, a deconstructor function that breaks an element of the appropriate shape into the constituent parameters.
For intlist, we find the following shape tester:
 kind_of_intlist intlist_kind( intlist this )
 {
	if( this == NULL )
	{
		return intlist_is_nil;
	}
	return intlist_is_cons;
 }
This function takes an intlist and returns a kind_of_intlist - look higher up this page and remind yourself that it's an enumeration. Note that the body has to compensate for the NULL optimization (turning NULL back into intlist_is_nil), so that the user of the generated datatype module doesn't have to care about irregularities caused by optimizations: they simply test whether the kind of the list is nil or cons.
Then we find the deconstructors for non-empty shapes: (nil() is an empty shape, so needs no deconstructor), but the non-empty cons(head,tail) needs the following deconstructor:
 void get_intlist_cons( intlist this, int *headp, intlist *tailp )
 {
	*headp = this->head;
	*tailp = this->tail;
 }
get_intlist_cons() takes an intlist (this) and two var parameters in Pascal-speak, i.e. pointers to the int head and intlist tail that we want to store the answers in, then the body returns the internal pieces of the intlist element to the caller by storing them in the "var parameters", *headp and *tailp. We do it this way because C can't return tuples - if it could, we would just write return (this->head,this->tail) instead. But it can't:-(
The multiple returns via pointer parameters technique is standard in C, but if you've having difficulty understanding what the above is doing, recall from the first part that a deconstructor is called as:
	int     h;
	intlist t;
	get_intlist_cons( l, &h, &t );
Making the call get_intlist_cons( l, &h, &t ) sets this=l, headp=&h and tailp=&t. The first assignment inside get_intlist_cons() sets *headp = this->head, i.e. *&h = l->head, i.e. h = l->head.
Similarly, the second assignment sets: *tailp = this->tail, i.e. *&t = l->tail, i.e. t = l->tail.
Overall, this function has the affect of writing l's head value into h, and l's tail pointer into t. Aren't pointers great!
Now that we've seen the constructors, the shape tester and the deconstructors, it's worth spending a few minutes making sure that you are completely clear about how they work together. One way of doing this is to dry-run some simple example. For example, let's consider how a specific version of our sum up an intlist code from the first part works:
	intlist l = intlist_nil();	/* build list [300,200,100] */
	l = intlist_cons( 100, l );
	l = intlist_cons( 200, l );
	l = intlist_cons( 300, l );

	int sum = 0;			/* now sum it up */
	intlist p = l;
	while( intlist_kind(p) == intlist_is_cons )
	{
		int head;
		get_intlist_cons( p, &head, &p );
		sum += head;
	}
	printf( "sum=%d\n", sum );
I strongly recommend that you single-step through this whole program so you understand exactly how this works before proceeding further. You can use a debugger, but I'd recommend stepping through on paper, drawing pointer diagrams and modifying them as each assignment happens. It should take you about 15-20 minutes, require that you fill out several sheets of paper, and desperately make you wish for a tool that could automatically draw such pointer diagrams for you. Now there's a cool project for somebody to take on!
Continuing, illist's shape tester and deconstructors are, as you should expect, virtually identical to intlist's, so let's skip them.
Turning now to idtree, things are rather different due to the tagged union struct idtree definition. The shape test simply picks out the tag, which answers the "which shape am I" question:
 kind_of_idtree idtree_kind( idtree this )
 {
	return this->tag;
 }
The leaf() deconstructor is as shown below. Note that, because a leaf has a single argument, we could have chosen to return it, but I decided to use a single "var parameter" instead, for interface consistency:
 void get_idtree_leaf( idtree this, string *id )
 {
	*id = this->u.leaf_id;
 }
Our final deconstructor (get_idtree_node()) is similar:
 void get_idtree_node( idtree this, idtree *left, idtree *right )
 {
	*left = this->u.node.left;
	*right = this->u.node.right;
 }
Aside: You'll notice that we're not checking inside a deconstructor that the element has the right shape before deconstructing it. I reasoned that such checks were not necessary because the usual structure of calling these functions guarantees that the data is of the correct shape:
 if( kind(t) is leaf )
 {
       /* we know it's a leaf */
       get_idtree_leaf( t, &id );
 } else
 {
       /* we know it's a node */
       get_idtree_node( t, &l, &r );
 }
If I was building datadec today, I might want to check this precondition explicitly via assert(this->tag == idtree_is_leaf). At the time, computers were slower so unnecessary checks might significantly slow your program down - these days defensive programming is more important!
Actually, less forgivably, you may have noticed that our constructors never check that the result of malloc()/NEW() is non NULL! Perhaps I'll add an option in the near future causing datadec to generate rather more checks. You're welcome to do this as an exercise yourself!
datatypes.c: Print functions
The final part of the generated C file implements, for each type, a function that prints an element of the type to an open file. For an intlist the print function is as follows:
 void print_intlist( FILE *f, intlist p )
 {
	if( p == NULL )
	{
		fputs( "nil", f );
		return;
	}
	fputs( "cons", f );
	fputc( '(', f );
	fprintf( f, "%d", p->head );
	fputc( ',', f );
	print_intlist( f, p->tail );
	fputc( ')', f );
 }
Pretty cool, huh? print functions generated by datadec show the object in terms of constructor calls (just as most functional programming languages do automatically). With our 3-element list that we summed above, the print function generates:
 cons(300,cons(200,cons(100,nil)))
In fact, datadec allows you to alter the print formatting. If you alter types.in to read:
 TYPE {
        intlist =  nil                               ""
		or cons( int head, intlist tail )    1 "," 2;
 }
Here we've added a series of literal strings and numbers following each shape - these are called print hints: a literal string is printed as-is, and a number N means "print the Nth argument here, in whatever natural format it should be printed".
Having added the print hints, rerunning datadec will generate the following print function:
 void print_intlist( FILE *f, intlist p )
 {
	if( p == NULL )
	{
		fputs( "", f );
		return;
	}
	fprintf( f, "%d", p->head );
	fputc( ',', f );
	print_intlist( f, p->tail );
 }
which would print our 3-element list as:
 300,200,100,
However, there's no way of suppressing the trailing comma, or enclosing the list in square brackets (say). If you want something like this, you'll have to write your own custom print function (with a different name). datadec allows you to embed additional code in your types.in file which is copied as-is into the generated header file or the generated C file, see the EXPORT and GLOBAL sections in datadec's manual for more details.
The default print function for the idtree data type would be:
 void print_idtree( FILE *f, idtree p )
 {
	switch( p->tag )
	{
	case idtree_is_leaf:
		fputs( "leaf", f );
		fputc( '(', f );
		fputs( p->u.leaf_id, f );
		fputc( ')', f );
		break;
	case idtree_is_node:
		fputs( "node", f );
		fputc( '(', f );
		print_idtree( f, p->u.node.left );
		fputc( ',', f );
		print_idtree( f, p->u.node.right );
		fputc( ')', f );
		break;
	default:
		fprintf( stderr,
		 "print_idtree: impossible tag %d\n", p->tag );
		exit(1);
	}
 }
In most cases, this would bring us to the end of our investigation of the structure of the generated datatypes.c file....
datatypes.c: experimental Free functions
... Unless of course you run datadec with the experimental -f option which adds a free function for each type, such as:
 void free_idtree( idtree p )
 {
        switch( p->tag )
        {
        case idtree_is_leaf:
                free_string( p->u.leaf_id );
                break;
        case idtree_is_node:
                free_idtree( p->u.node.left );
                free_idtree( p->u.node.right );
                break;
        default:
                fprintf( stderr,
                 "free_idtree: impossible tag %d\n", p->tag );
                exit(1);
        }
        free( p );
 }
Like a print function, this traverses the RDT, checking which shape the element is, freeing any arguments that need to be freed, before finally freeing the malloc()ed chunk itself. By the way, free_string(s) is a macro that you need to define in the GLOBAL section (at the top of types.in) as follows:
  GLOBAL {
  #define free_string(s) free(s)
  }
While this mechanism basically works, freeing everything that was malloc()ed, it is depressingly easy to build something containing shared pointers - eg. node(t,t) builds a valid tree node, but both left and right pointers point to the same tree t. If t is itself a node() then free_idtree() will inevitably free() t twice. It is unclear how to fix this, or indeed if there's any way we could - given that C doesn't possess a reference counting garbage collector. Of course, you could write one yourself, but that's not an appealing thing to do!
As a simple feature for now, I added an option: if you don't want an argument freed, you mark it in the input file with a `-', as in:
 idtree  =  leaf( -string id )
         or node( idtree left, idtree right );
This suppresses the call to free_string(leaf_id). This is useful, but of course this doesn't really help with the shared subtree problem. We could mark (say) the right subtree as never to be freed:
 idtree  =  leaf( string id )
         or node( idtree left, -idtree right );
but this affects all right subtrees in all idtrees we ever build rather than detecting whether a particular subtree has already been freed, while freeing all the other subtrees.
For now (and probably forever), this is a problem left for the user of datadec. Think carefully about how you construct your RDT values, to avoid shared pointers.
Unfortunately, shared pointers can easily emerge in other ways. Consider a list of strings. Now write a function to reverse the list. Most likely, you'll reuse the head strings (but in reverse order), but now both the original list and the reversed list share all the head string pointers! Now freeing both lists is going to go badly wrong! But if you don't free the second list, you never free the structure of the second list. In this case, you actually want two free functions: one which frees the heads, and one which doesn't. Or perhaps you want one function - with the ability, per call, to say "don't free the heads this time".
This really is tricky, and I am still thinking about it. I got as far as considering some sort of per-free call "set of RDT arguments not to free". This set would usually be empty, but you could add one or more elements to such a set before a free call, to suppress freeing particular parameters this time. For example, to prevent an idtree's leaf(id) being freed, you could add idtree/leaf/id to the set. The free routines generated would then need to check for set membership before each free call. For example, the free_idtree for freeing a leaf(id) would be:
        case idtree_is_leaf:
		if( ! in_set( "idtree/leaf/id", excludeset )
		{
                	free_string( p->u.leaf_id );
		}
                break;
However, I don't want to have to ship a set implementation with datadec to implement this feature, so this remains an idea for now!
Onto Part 3

Ok, in the third and final part, I'll describe the internals of datadec itself, and explain how I built the tool.

d.white@imperial.ac.uk
Back to the first part of this article Onto the third part of this article Back to PSD Top
Written: Oct 2014

Welcome to Duncan White's Practical Software Development (PSD) Pages.

datadec: bringing Recursive (Inductive) Data Types to ANSI C. Part 2: Understanding The Module Generated: datatypes.c

datatypes.c: Includes, Recap and Constructors

datatypes.c: Shape testers and Deconstructors

datatypes.c: Print functions

datatypes.c: experimental Free functions

Onto Part 3

datadec: bringing Recursive (Inductive) Data Types to ANSI C.
Part 2: Understanding The Module Generated: datatypes.c