PSD: Duncan White's Practical Software Development Pages

Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.

See all my Practical Software Development (PSD) Pages

Using features from one Language in a language without them:
case study 2: bringing Recursive (Inductive) Data Types to ANSI C.
An excellent general principle is to transfer cool features from one language into another. There are many examples of this, for example Perl teaches us the importance of hashes (aka dictionaries or maps, a collection of key/value pairs indexed on the key) - so in a language without such a datatype (eg ANSI C!) it makes sense to build yourself a generic hash ADT, in which every key is an arbitrary string, and every value is a void * pointer, and then reuse it at every opportunity.

Similarly, in my recent PSD Article 7, I present a method of implementing simple single-inheritance OOP in C.
Here in this article, I'd like to show you another example. If you've ever used a Functional Programming language like Haskell, Miranda, ML or (for the old school) Hope, you'll know that you can define Recursive (or Inductive) Data Types, such as:
 intlist = nil or cons( int head, intlist tail );
Here, we define a new data type - a list of integers, an intlist - which has two different possible "shapes" (or variants): A list is either empty (nil) or non-empty (cons(head,tail)), when it comprises a head element (integer) and a tail (intlist). But do you see what we did there? The tail of an intlist - the type we're defining - is itself an intlist. i.e. the type definition is itself recursive.
Many years ago, when I first encountered Functional Programming (yes, with Hope, I'm that old:-)), I absolutely loved Recursive Data Types, and I started thinking as follows:

RDTs are among the coolest feature of functional programming languages (IMHO), along with easy to use polymorphic lists and tuples.

Perhaps RDTs could be detached from the rest of Functional Programming (i.e. the rather restrictive lack of global variables, for loops, mutable variables etc) and used elsewhere.

Could RDTs be copied into imperative programming languages? Yes!

But designing a new imperative programming language sounds hard:-)

Specifically, I realised that I'd love to have RDTs in ANSI C, one of my main programming languages - albeit a much lower-level language than any Functional Programming language.

If only there was a tool that read one or more RDT definitions like intlist and then automatically wrote an ANSI C module implementing them, wouldn't that be awesome?

Nowadays, there are a few languages (Rust and Dafny are the obvious ones that spring to my mind) that implement RDTs in an imperative (ish) language, but at the time I started thinking about this argument, it seemed pretty revolutionary.

However, when I looked around (pre-Google of course, pre-Web in fact) I couldn't find any such tool, and in fact noone seemed to have ever proposed such an idea.

I thought about it some more, still thought what an excellent idea it would be, and after taking a brilliant Compilers course, I understood the stages and tools I would need (lexer-generators, parser-generators, Abstract Syntax Trees, semantic checking and code-generation). So I had a decision to make: abandon my brilliant idea, or make the tool?

This was a non-trivial project, so I thought hard about it: I estimated at least a week's work. After some thought, I decided to build the tool! (Otherwise this article would be rather short:-)).

After about a fortnight one summer, (a weeks's work to build the basic tool, and another week to produce optimal output code), I had designed and built a tool called datadec (written in ANSI C) to do the job. Datadec is (of course) installed in all Linux computers where I work (Dept of Computing, Imperial College London) - after all, what's the point of being head SysAdmin if you can't install your own software:-)
You may like to download it from https://gitlab.doc.ic.ac.uk/dcw/c-datadec.git and follow along. There are compilation and installation instructions inside the tarball - in most cases make install is all you need.
Example
After installing datadec, you can use it as follows:

Download this tarball called datadec-example.tgz and extract it, creating a datadec-example directory. Inside that, you'll find several files (including a README).
You'll find a file called types.in containing the following:
 TYPE {
        intlist =  nil or cons( int head, intlist tail );
        illist  =  nil or cons( intlist head, illist tail );
        idtree  =  leaf( string id )
                or node( idtree left, idtree right );
 }
This defines three new data types: a list of integers (intlist), a list of intlists (illist) and a non-empty binary tree of strings/identifiers (idtree) - this is either a leaf with an identifier, or a node with a left and a right subtree.
To generate a C module called datatypes from types.in, invoke:
 datadec datatypes types.in
This generates datatypes.c and datatypes.h, a normal two-file implementation of a module in C which implements the three data types. Like any manually written module, you can read both files, write test programs against the interface, compile them and link against them in production code.
However, you should resist the temptation to modify these files, because that sacrifices an important ability: the flexibility to modify types.in when you realise that your types aren't quite right, and regenerate the module.
For example, suppose you later realise that an idtree leaf needs to store an intlist as well as the identifier (perhaps this is implementing a hash/map/dictionary mapping each id (key) to an intlist (the associated value). Change the idtree type defn to read:
        idtree  =  leaf( string id, intlist value )
                or node( idtree left, idtree right );
and rerun datadec. Now the automatically generated module has a idtree_leaf(id, value) constructor function taking 2 arguments, and similarly the corresponding deconstructor function takes an extra argument (we'll see them later).
Or perhaps you decided that a node should store an id as well as the left and right subtrees:
        idtree  =  leaf( string id )
                or node( idtree left, string id, idtree right );
Or perhaps you want to make both changes (ids have associated values, and nodes have ids and values too):
        idtree  =  leaf( string id, intlist value )
                or node( idtree left, string id, intlist value, idtree right );
This ability to alter your datatypes as you go seems pretty nimble - or even agile - to me.
Also in the directory, you'll find two unit test programs (testintlist.c and testidtree.c) and a Makefile to build everything (it even contains the datadec invocation we used above). Investigate the files, and when you're ready type make and then run the test programs.
Of course we'll investigate them in more detail shortly, but even now we can understand the tests that testintlist is making by reading it's output:
 ./testintlist 
 kind(nil) == nil: ok
 kind([100]) == cons: ok
 head([100]) == 100: ok
 kind(tail([100])) == nil: ok
 head([200,100]) == 200: ok
 head(tail([200,100])) == 100: ok
 sum 1000: ok
Here, head() means "take the first element of a non-empty list", and tail() means "take the rest of a non-empty-list". The output is intended both to be human-readable, and also to be automatically processable - eg. to see only failed tests
 ./testintlist | grep -v ': ok$'
The last test (sum 1000: ok) is a bit cryptic, reading the source code you'll see that it is summing the list:
 cons( 400, cons( 300, cons( 200, cons( 100, nil() ) ) ) )
via an implementation of the following pseudo-code:
 int sum = 0;
 foreach element in list
 {
         sum += element;
 }
Understanding The Generated Module - datatypes.h
If we look at datatypes.h, the interface of the automatically generated module built by datadec, we see several sections:
A boilerplate header:
 /*
  * Automatically Generated by DataDec
  */

 typedef char *string;

 typedef char BOOL;
 #define TRUE 1
 #define FALSE 0

 #define NEW(t) ((t)malloc(sizeof(struct t)))
Note: datadec predates ANSI C's #include <stdbool.h, one of these days I should tweak datadec's boilerplate to use it not define it's own BOOL type..
Second, for each data type, a "forward structure declaration" of the form struct TYPE; - this declares that a structure of that name exists, but not what's inside it - and a pointer type definition of the form typedef struct TYPE *TYPE - note that this is entirely legitimate, and a widely-used way of defining pointer data types in C, the compiler will not get confused between struct TYPE and TYPE:
 struct intlist;
 typedef struct intlist *intlist;

 struct illist;
 typedef struct illist *illist;

 struct idtree;
 typedef struct idtree *idtree;
Third, for each data type, an enumerated type called kind_of_TYPE with a value for each shape the type can have (named TYPE_is_SHAPE), and the true definition of the struct TYPE structure. These are placed after all the above type declarations to allow any type's parameters to refer to any other RDT (enabling mutual recursion etc, often tricky in single-pass ANSI C):
 /* ---- Type intlist ---- */
 typedef enum {
 	intlist_is_nil,
 	intlist_is_cons,
 } kind_of_intlist;
 
 struct intlist {
 	int	head;
 	intlist	tail;
 };
 
 /* ---- Type illist ---- */
 typedef enum {
 	illist_is_nil,
 	illist_is_cons,
 } kind_of_illist;
 
 struct illist {
 	intlist	head;
 	illist	tail;
 };
 
 /* ---- Type idtree ---- */
 typedef enum {
 	idtree_is_leaf,
 	idtree_is_node,
 } kind_of_idtree;
 
 struct idtree {
 	kind_of_idtree	tag;
 	union {
 		string	leaf_id;
 		struct {
 			idtree	left;
 			idtree	right;
 		} node;
 	} u;
 };
Note that, in the case of the list types, the structure only stores the cons() parameters (because, as we'll see shortly, it uses NULL to represent the shape nil()), whereas in the case of the tree both shapes (leaf and node) store data, so a tagged variant record is used: a structure with a tag field (to tell us which shape is stored) and a union to represent the shape's parameters. Note further that the union contains either an individual shape parameter (when the shape only has one parameter, as in leaf) or otherwise a nested structure containing a shape's several parameters (as in node).
You can see that I've tried very hard (inside datadec) to produce not just correct, but optimal, implementations of each type, as pretty as code I would have written myself! Getting that stuff right was what took the rest of the second week to implement!. Not many code-generation tools bother to produce well-written code (in particular, yacc/bison is well known to produce horrible - albeit correct! - code that will exercise C compilers!
Fourth, for all types, prototypes and macros for all functions needed:
 #define intlist_nil() ((intlist)NULL)
 extern intlist intlist_cons( int , intlist );
 extern kind_of_intlist intlist_kind( intlist );
 extern void get_intlist_cons( intlist , int * , intlist * );
 extern void print_intlist( FILE *, intlist );

 #define illist_nil() ((illist)NULL)
 extern illist illist_cons( intlist , illist );
 extern kind_of_illist illist_kind( illist );
 extern void get_illist_cons( illist , intlist * , illist * );
 extern void print_illist( FILE *, illist );

 extern idtree idtree_leaf( string );
 extern idtree idtree_node( idtree , idtree );
 extern kind_of_idtree idtree_kind( idtree );
 extern void get_idtree_leaf( idtree , string * );
 extern void get_idtree_node( idtree , idtree * , idtree * );
 extern void print_idtree( FILE *, idtree );
Focussing on one of those types - idtree for example, we can break down those prototypes into four categories:
For each shape, a constructor function called TYPE_SHAPE that takes the shape's parameters, malloc()s a new element of the appropriate type and shape, initializes it's fields, and returns it to us for use:
 extern idtree idtree_leaf( string );
 extern idtree idtree_node( idtree , idtree );
A single function to tell us which shape an element of our type has, returning the appropriate one of the enumerated values:
 extern kind_of_idtree idtree_kind( idtree );
(Now I look back on this, I wonder why I didn't replace "kind" with "shape", i.e. which shape is this idtree could have been idtree_shape(t) returning a shape_of_idtree result. 20-20 hindsight is a wonderful thing).
For each shape with parameters, a deconstructor function called get_TYPE_SHAPE which takes an element of the appropriate shape, and breaks it apart, giving us shallow copies of the internal parameters of that shape:
 extern void get_idtree_leaf( idtree , string * );
 extern void get_idtree_node( idtree , idtree * , idtree * );
Note that all parameters after the tree itself are pointers-to-the-corresponding-shape-parameters, because these functions will be called as in:
 if( idtree_kind( t ) == idtree_is_node )
 {
 	idtree l, r;
        get_idtree_node( t, &l, &r );
 }
Finally, based on the principle I laid out in my first PSD article that every ADT needs a display_as_str operation, a bonus function that prints out a human readable form of the element to an open file:
 extern void print_idtree( FILE *, idtree );
Note that - by default - no functions to free the element are produced. It's surprisingly hard to automatically generate these in a safe fashion, due to shallow vs deep copy pointer considerations (let alone what problems mutual recursion between types might cause). For many years, I put off doing anything about this, while I thought about the best way of implementing this. One could still manually write the free functions you needed, and in fact add them into the types.in datadec-input file in another section.
However, in Spring 2014, when preparing some C Tools Lectures, I decided that I must implement something. Although it's still experimental, if you datadec -f.. you'll get experimental free_TYPE() functions as well. I'll explain more about this in part 2. Incidentally, it was having finally implemented the automatic generation of free functions that made me decide to write this article!
Understanding The Generated Module - clients (testintlist.c and testidtree.c)
Now, leaving the implementation module datatypes.c to one side for a while (we'll look at it later), let's see how the module is used, based only on our knowledge of the interface (this is a good abstraction/information hiding principle anyway). So let's look at the test programs, testintlist.c and testidtree.c: Let's start with testintlist.c:
We start with some preamble, including standard headers we'll need, and our own datatypes.h, which defines the interface of our datatypes module:
/*
 * test automatically generated "intlist" data type
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "datatypes.h"
Second, we define a test helper (a test framework would provide this and many other functions), to evaluate a boolean (integer) condition, and print out a message suffixed with ok or fail based on the condition:
void test( int condition, char *string )
{
	if( condition )
	{
		printf( "%s: ok\n", string );
	} else
	{
		printf( "%s: fail\n", string );
	}
}
Next, we start main() by some basic tests: Using the constructors and kind (shape) checkers, we can check that an empty list reads as empty, and a 1-element list as a cons() node:
int main( void )
{
	intlist l = intlist_nil();
	test( intlist_kind( l ) == intlist_is_nil,
		"kind(nil) == nil" );

	l = intlist_cons( 100, l );
	test( intlist_kind( l ) == intlist_is_cons,
		"kind([100]) == cons" );
Note in passing that the messages (second argument to each test() invocation) give a human-readable description of what property we're testing.
Now that we know that L is a non-empty (cons) list, let's break the cons(h,t) node into it's separate head and tail, using a deconstructor, and check that the correct values are returned. Given that we just constructed the list, we know that the head should be 100 and the tail should be nil:
	intlist tail;
	int head;
	get_intlist_cons( l, &head, &tail );
	test( head == 100, "head([100]) == 100" );
	test( intlist_kind( tail ) == intlist_is_nil,
		"kind(tail([100])) == nil" );
Let's add a second element (200) to our list (so our list now contains 200 followed by 100 - or [200,100] in an obvious notation), and then test that we can (non-destructively) break our 2-element [200,100] list into it's pieces all the way to the nil():
	l = intlist_cons( 200, l );
	get_intlist_cons( l, &head, &tail );
	test( head == 200, "head([200,100]) == 200" );
	get_intlist_cons( tail, &head, &tail );
	test( head == 100, "head(tail([200,100])) == 100" );
	test( intlist_kind( tail ) == intlist_is_nil,
		"kind(tail([200,100])) == nil" );
Note that this is C, not a functional language, so it's ok to write:
	get_intlist_cons( tail, &head, &tail );
i.e. breaking the non-empty list "tail" into it's head (writing that into head) and it's tail (writing that back into tail). If that bothers you, write:
        intlist tail2;
	get_intlist_cons( tail, &head, &tail2 );
Then use tail2 instead of tail below. But there's no need (as Occam's Razor has it) to multiply entities unnecessarily. In Perl, of course, we'd make use of lists (tuples) and write:
	( my $head, $tail ) = get_intlist_cons( $tail );
Next, let's cons two more elements onto our list, giving [400,300,200,100]:
	l = intlist_cons( 300, l );
	l = intlist_cons( 400, l );
Finally, let's sum up all the elements on our list, traversing along it (still non-destructively), summing up the elements, testing that the sum is the correct value, and then exiting the program. Earlier we gave the pseudo-code:
        int sum = 0;
        foreach element in list
        {
                sum += element;
        }
Now we see the C code that implements that:
	int sum = 0;
	intlist p = l;
	while( intlist_kind(p) == intlist_is_cons )
	{
		int head;
		get_intlist_cons( p, &head, &p );
		sum += head;
	}
	test( sum == 1000, "sum 1000" );
Now you should feel that you understand the intlist interface - except print_intlist(). Our unit test program doesn't test print_intlist() because it's hard to check the answer for correctness (we could create a temporary file, print the list into it, close it and then diff the contents of the file, but that's a pain).
I have often considered making datadec implement a "print into a string buffer (not a file)" routine, of course there's then the problem of running off the end of the buffer (buffer overflow). One could generate code that safely checks that, or implement a dynamic resizable string library that supports appending data to a dynamic string (growing it when necessary), but it's a pain either way. If we had such a function, we could then test it via string comparison, and then use it to test many other functions.
For now, to test it, edit testintlist.c and uncomment the following code you'll find present but commented out:
	printf( "the final list is: ";
	print_intlist( stdout, l );
	printf( ": can't check\n" );
Recompile (via make) and rerun the program and you should see an extra line of output:
 the final list is: cons(400,cons(300,cons(200,cons(100,nil)))): can't check
For now, you'll just have to use your eyes to check that it's the right output.
Now let's go onto testidtree.c to look at how we use idtrees.
There's a preamble, which defines two slightly different test helpers: inteqtest(value, expected, label) and streqtest(value, expected, label) are integer and string equality tests that print ok/fail messages:
 /*
  * test automatically generated "idtree" data types
  */

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <assert.h>

 #include "datatypes.h"

 void inteqtest( int value, int expected, char *mesg )
 {
	if( value == expected )
	{
		printf( "%s==%d: ok\n", mesg, expected );
	} else
	{
		printf( "%s==%d, actual: %d: fail\n", mesg, expected, value );
	}
 }

 void streqtest( char *value, char *expected, char *mesg )
 {
	if( strcmp(value,expected) == 0 )
	{
		printf( "%s==%s: ok\n", mesg, expected );
	} else
	{
		printf( "%s==%s, actual: %s: fail\n", mesg, expected, value );
	}
 }
Looking now at main(), we see that we build two leaves, and then test that we can break them apart again:
  idtree t1 = idtree_leaf( "absolutely" );
  testleaf( t1, "absolutely", "ab" );
  idtree t2 = idtree_leaf( "fabulous" );
  testleaf( t2, "fabulous", "fab" );
Here, testleaf(t, expected, treename) is another helper which tests that t is a leaf with the expected id. treename is a symbolic name for the tree:
 void testleaf( idtree t, char *expected, char *treename )
 {
        char label[1024];
        sprintf( label, "isnode(%s)", treename );
        inteqtest( idtree_kind(t), idtree_is_leaf, label );
        string id;
        get_idtree_leaf( t, &id );
        sprintf( label, "getleaf(%s)", treename );
        streqtest( id, expected, label );
 }
Next, we construct a node from our two leaves, and test that we can break it apart correctly:
  idtree t = idtree_node( t1, t2 );
  inteqtest( idtree_kind(t), idtree_is_node, "isnode((ab,fab))" );

  idtree l, r;
  get_idtree_node( t, &l, &r );
  testleaf( l, "absolutely", "left((ab,fab))" );
  testleaf( r, "fabulous", "right((ab,fab))" );
Read on through the test program, and you'll see that it also tests that our tree has the right number of leaves:
	inteqtest( nleaves(t), 2, "nleaves((ab,fab))" );
It does this by defining a recursive tree-traversal routine (a translation of simple Haskell-style recursive rules, shown in comments):
 /* recursive idtree leaf counter
  * In Haskell, this'd be:
  *   nleaves(leaf(x))   = 1
  *   nleaves(node(l,r)) = nleaves(l) + nleaves(r)
  */
 int nleaves( idtree t )
 {
	/* have we found a leaf?  answer 1 if so */
	if( idtree_kind(t) == idtree_is_leaf )
	{
		return 1;
	}

	/* found node: break it apart into left and right trees */
	idtree l, r;
	get_idtree_node( t, &l, &r );

	/* sum their numbers of leaves */
	return nleaves(l) + nleaves(r);
 }
The program goes on, adding another leaf and then checking that there are now 3 leaves:
	t = idtree_node( t, t2 );
	inteqtest( idtree_kind(t), idtree_is_node,
		"isnode(((ab,fab),fab))" );
	inteqtest( nleaves(t), 3, "nleaves(((ab,fab),fab))" );
Note in passing that the leaf I just added is t2, one of the leaves already in the tree, so here I'm creating a shared pointer (ie. a shallow copy of t2 present in two different node()s). This is the sort of thing that will undoubtedly cause the free_idtree() function to double-free a leaf. I did say free functions were experimental - this is why! To avoid this, the leaf added should be totally new; not t2 reused.
Finally, we define a function concatleaves(tree,separator) to traverse the tree finding all the leaves, concatenating all the identifiers with a separator string between them, and uses this to test that the results of concatatening the leaves in the tree give the correct string:
	string s = concatleaves( t, "," );
	streqtest( s, "absolutely,fabulous,fabulous,",
		"concatleaves(((ab,fab),fab),',')" );
	free(s);
concatleaves(tree,separator) is rather like Perl's join() function - but for trees. Unfortunately, implementing it gets into exactly the same dynamic string problems we discussed above. I've chosen to implement it as a malloc(MAX) and "don't overflow MAX chars" sanity check in the helper functions that concatleaves() calls. Hence the free() call in the above.
Understanding The Generated Module - datatypes.c

In Part 2 of this article we'll look at the datatypes.c, the implementation module that datadec generated.

Building Datadec

In Part 3 of this article we'll look into how I actually designed and built datadec.

d.white@imperial.ac.uk
On to the second part of this article Back to PSD Top
Written: August 2014

Welcome to Duncan White's Practical Software Development (PSD) Pages.

Using features from one Language in a language without them: case study 2: bringing Recursive (Inductive) Data Types to ANSI C.

Example

Understanding The Generated Module - datatypes.h

Understanding The Generated Module - clients (testintlist.c and testidtree.c)

Understanding The Generated Module - datatypes.c

Building Datadec

Using features from one Language in a language without them:
case study 2: bringing Recursive (Inductive) Data Types to ANSI C.