Welcome to Duncan White's Practical Software Development (PSD) Pages.
I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.
datadec: bringing Recursive (Inductive) Data Types to ANSI C.
Part 3: The internals of datadec - and how it was built.
- In the first part of this article I discussed my tool datadec - which reads an input file specifying one or more Recursive Data Types - such as:
```
TYPE {
    intlist = nil or cons( int head, intlist tail );
    illist  = nil or cons( intlist head, illist tail );
    idtree  = leaf( string id ) or node( idtree left, idtree right );
}
```

and automatically writes a complete C module that implements the above types in a standard way. I looked into the interface of the module datadec generates, and showed examples of how we'd use it in practice.
- In the second part, I looked inside datatypes.c - the implementation module that datadec produced - to see how datadec implemented various forms of RDT, satisfying the interface discussed in the first part.
- Now, in the third and final part, I'll discuss the internals of datadec and how it actually works. Datadec is quite a complex piece of software, so I have two choices:
- Attempt to describe the current architecture and structure of datadec, discussing each module in turn.
- Show how we could build the core of datadec afresh in stages, thus understanding how I might have built it all those years ago.
The second option appeals more to me, so it's what I'll do. Note that this is a reconstruction of what I did, so I can't guarantee the historical accuracy of the stages.
The Input Language
- I haven't made the point explicitly up to now, but datadec's input language is what Kernighan and Pike (K&P) call a little language. They define this as a specialized notation for a narrow domain (page 216, The Practice of Programming), and suggest that little languages not only provide a good interface, but also help organize the program that implements them. We'll see very clearly how this is true of datadec.
In fact, datadec's input language is virtually two little languages intertwined: first, the outer data declaration language (containing a compulsory TYPE section with RDT definitions inside, and optional GLOBAL, BEGIN and EXPORT sections which wrap plain text), for instance:
```
EXPORT {
    typedef char *string;
    extern int global_variable;
}
GLOBAL {
    int global_variable = 17;
}
TYPE {
    RDT declarations
}
```

Second, the inner RDT declaration language inside the TYPE block, eg:

```
intlist = nil or cons( int head, intlist tail );
illist  = nil or cons( intlist head, illist tail );
idtree  = leaf( string id ) or node( idtree left, idtree right );
```

Overall Structure of datadec
- datadec is essentially a miniature compiler, which must start by parsing the input file and finish by emitting valid C code corresponding to the input type definitions.
- This suggests a basic lexer-parser-treebuilder-treewalker structure.
First Problem: Lexical Analysis of the Input
- My first task, then, was to build a lexical analyser (aka lexer) that can provide the parser with a stream of tokens from our input file. To build lexers and parsers, most people (including me, most of the time) would reach straight for Lex and Yacc (aka Flex and Bison these days). However, for whatever reason, all those years ago, I decided to write the lexer and parser by hand.
- To get started, let's ignore the outer language and focus first on the inner RDT declaration language - adding the outer language later. In addition, let's ignore the print hints (the numbers and string literals that can follow each shape) - I added these much later.
- When building a lexical analyser, the first step is to identify the tokens. The obvious tokens of the RDT language are:
```
'=', 'or', ';', '(', ')', ',', <an identifier>
```

(Here, an identifier is an alphanumeric name, eg. a type, shape or parameter name.) Let's formalise this as a simple non-recursive RDT (even though we can't use RDTs when building the tool that gives us the ability to use RDTs freely:-)):

```
token = eq | or | semicolon | openbracket | closebracket | comma | id( string name )
```

- With these tokens in mind, let's consider the stream of tokens that correspond to:
```
intlist = nil or cons( int head, intlist tail );
```

| Input   | Token         |
|---------|---------------|
| intlist | id('intlist') |
| =       | eq            |
| nil     | id('nil')     |
| or      | or            |
| cons    | id('cons')    |
| (       | openbracket   |
| int     | id('int')     |
| head    | id('head')    |
| ,       | comma         |
| intlist | id('intlist') |
| tail    | id('tail')    |
| )       | closebracket  |
| ;       | semicolon     |
- So, the first piece of actual programming to do was: build a lexical analyser which recognises the above tokens, plus an EOF token and an ERROR token, and a test program that reads input and reports the stream of tokens seen, as shown above. This is sufficiently simple as to be presented without much more discussion.
- We start with a helper file bool.h - yes, these days I'd use stdbool.h, but this was a while ago:

```c
typedef char BOOL;
#define FALSE 0
#define TRUE 1
#define streq(a,b) (strcmp((a),(b))==0)
```

- Next, we present the lexer's interface (lexer.h). Shorn of comments this is:
```c
typedef enum
{
    tEQ, tOR, tSEMI, tOPENBR, tCOMMA, tCLOSEBR, tID, tERROR, tEOF,
} TOKEN;

extern char *tokenname[];       /* mapping from TOKEN to string */
extern char lexidval[];         /* value of an identifier (tID) */
extern int lineno;
extern FILE *lexfile;

extern TOKEN nexttok( void );
extern void ungettok( void );
```

To use this, we set lexfile to a readable filehandle, then repeatedly call nexttok(). On each call nexttok() delivers the next token. Most tokens have no associated data (eg. when you see '=' it's enough to say it's an equals sign). However, tID (the identifier token) does have a piece of associated data - it's not enough to know that an identifier has been seen, you also need to know the name of the identifier seen. This is stored in lexidval.
When all the input is exhausted, nexttok() returns tEOF. If the lexer sees utterly unexpected input (eg. unused punctuation signs such as '+' or '*' or '&', or numbers or quoted strings) it consumes a single input character and returns tERROR.
On occasion during parsing, you check the next token, find that it's not one of the half a dozen special tokens you wanted to handle, and need to put that token back. Some classes of parsers only need the ability to put back a single token (parsers for LL(1) grammars), while others need to put back several tokens. Our parser will need only a single token of pushback (it's LL(1)); to do that, we call ungettok(). ungettok() will object strenuously (abort) if we ever call it twice in a row.
- Before we look at the implementation of the lexer (i.e. lexer.c), let's see lextest.c - a test program that exercises the lexer, thus showing how to use the lexer functions:
```c
#include <stdio.h>
#include <stdlib.h>
#include "bool.h"
#include "lexer.h"

int main( int argc, char **argv )
{
    if( argc == 1 )
    {
        lexfile = stdin;
    } else if( argc == 2 )
    {
        lexfile = fopen( argv[1], "r" );
        if( lexfile == NULL )
        {
            fprintf( stderr, "lextest: can't open '%s'\n", argv[1] );
            exit(1);
        }
    } else
    {
        fprintf( stderr, "Usage: lextest [filename]\n" );
        exit(1);
    }

    TOKEN t;
    while( (t=nexttok()) != tEOF )
    {
        printf( "token: %s", tokenname[t] );
        if( t == tID )
        {
            printf( "(%s)", lexidval );
        }
        putchar( '\n' );
    }
    exit(0);
    /*NOTREACHED*/
}
```

Note how it preserves the Unix tradition of acting as a filter, processing stdin by default, or the contents of a file named on the command line if given.
- Despite not having yet looked at lexer.c, if I run lextest with our intlist example via:
```
echo 'intlist = nil or cons( int head, intlist tail );' | ./lextest
```

It produces, as we would expect, the following output:

```
token: id(intlist)
token: eq
token: id(nil)
token: or
token: id(cons)
token: openbr
token: id(int)
token: id(head)
token: comma
token: id(intlist)
token: id(tail)
token: closebr
token: semi
```

- Now that we understand how to use the lexer, let's see how to implement lexer.c:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "bool.h"
#include "lexer.h"

#define MAXIDSIZE 100

char lexidval[ MAXIDSIZE ];
int lineno = 1;
FILE *lexfile;

char *tokenname[] =
{
    "eq", "or", "semi", "openbr", "comma", "closebr",
    "id", "ERROR", "EOF", NULL
};

static void skipwhitespace( void )
{
    int c;
    for(;;)
    {
        c = getc(lexfile);
        if( c == '/' || c == '#' )      /* comment to end of line */
        {
            while( (c=getc(lexfile)) != EOF && c != '\n' );
        }
        if( c == EOF ) break;
        if( c == '\n' ) lineno++;
        if( c != ' ' && c != '\t' && c != '\n' ) break;
    }
    ungetc( c, lexfile );
}

static BOOL havepushedtok = FALSE;
static TOKEN curtok;

void ungettok( void )
{
    if( havepushedtok )
    {
        fprintf( stderr, "ungettok: can't push 2 tokens\n" );
        exit(1);
    }
    havepushedtok = TRUE;
}

TOKEN nexttok( void )
{
    int c;
    int pos;

    if( havepushedtok )
    {
        havepushedtok = FALSE;
        return curtok;
    }
    skipwhitespace();
    c = getc(lexfile);
    switch( c )
    {
    case EOF: curtok = tEOF; break;
    case ';': curtok = tSEMI; break;
    case '(': curtok = tOPENBR; break;
    case ')': curtok = tCLOSEBR; break;
    case ',': curtok = tCOMMA; break;
    case '|': curtok = tOR; break;
    case '=': curtok = tEQ; break;
    default:
        curtok = tERROR;
        if( isalpha(c) )
        {
            curtok = tID;
            for( pos=0; isalpha(c)||isdigit(c)||c=='_'; )
            {
                if( pos < MAXIDSIZE-1 ) lexidval[pos++] = c;
                c = getc(lexfile);
            }
            ungetc( c, lexfile );
            while( pos < MAXIDSIZE )
            {
                lexidval[pos++] = '\0';
            }
            if( streq( lexidval, "OR" ) || streq( lexidval, "or" ) )
                curtok = tOR;
        }
    }
    return curtok;
}
```

It's pretty straightforward code; there are only a few points worth making:
- nexttok() and ungettok() handle token pushback using static global variables called havepushedtok and curtok: havepushedtok is a BOOL telling the lexer whether we have a pushed back token or not. curtok contains the last token returned by nexttok(), so represents the pushed back token (when there is one). If that token is a tID, then lexidval still stores the name of that identifier.
- Just as the parser often needs to read a token and put it back, the lexer often has to read one character ahead in the input in order to know that it's not part of the current token, so we often need to put that character back into the input using ungetc().
- When it needs a fresh token, nexttok() calls:
```c
skipwhitespace();
c = getc(lexfile);
```

skipwhitespace() skips white space, newlines and even optional comments-to-end-of-line. Why would we support comments in our datadec input? Comments are helpful to the user, and very cheap to allow, so why wouldn't we?
- Then a switch statement does the rest of the work: simple one-character tokens are handled by cases like:
```c
switch( c )
{
    ...
    case ')': curtok = tCLOSEBR; break;
```

- The default case handles everything else (an identifier or keyword):
```c
curtok = tERROR;
if( isalpha(c) )
{
    curtok = tID;
    [gather identifier string in lexidval]
    if( streq( lexidval, "OR" ) || streq( lexidval, "or" ) )
    {
        curtok = tOR;
    }
}
```

- I'll leave you to work out how the gather identifier code works. Would you expect to see ungetc() in the gather code? What happens if an overlong identifier is encountered? Should the lexer issue a warning if that happens?
- You can download this first bundle of source code, including a Makefile and a cutdown types.in file, from: 01minimal-lexer.tgz. Feel free to download it, build the code, and test it out. For example, give it input containing one or more invalid tokens, overlong identifiers etc.
Second Problem: Parsing the token stream
- Our second task is to implement a minimal parser on top of the lexer, i.e. acting as a consumer of the token stream that is returned by successive calls to nexttok(). Initially, our aim is to simply recognise whether the input is valid or not. Later on, we will add actions to build internal data structures to represent an Abstract Syntax Tree of the input.
The first step in implementing a parser is to describe the grammar in some version of Backus-Naur Form (BNF):
```
declns = decln* tEOF
decln  = tID tEQ shapes tSEMI
shapes = shape (tOR shape)*
shape  = tID [ tOPENBR params tCLOSEBR ]
params = param (tCOMMA param)*
param  = tID tID
```

Here, X Y means X followed by Y, X* means zero or more Xs, [Y] means Y is optional (zero or one copies of Y), and (X Y)* is simply a bracketing mechanism to allow zero or more X-followed-by-Y sequences.
You'll find this reads really intuitively: each rule corresponds to a simple English sentence:
- To successfully match declns (a list of declarations) you must match zero-or-more consecutive instances of decln (each one a single declaration of an RDT) followed by tEOF.
- To successfully match a single RDT decln you must match an identifier (tID), immediately followed by a tEQ, immediately followed by a successful match of the shapes rule, immediately followed by a tSEMI.
- To successfully match shapes (a list of shape definitions), you must first successfully match a shape, immediately followed by zero-or-more sequences of a tOR token and another successful match of a shape. In other words, we match a tOR-separated list of individual shapes.
- And so on for the rest of the rules.
- Note that the process of matching a grammar (starting with the first rule, in this case declns) against a piece of input (called a sentence of the grammar) can be entirely automated: this is how tools like Yacc and Bison can exist! But it's quite easy to build our own parser, especially for such a simple grammar!
Aside: Shortest Sentential Forms
- It's often instructive (when trying to understand a grammar) to ask yourself what are the shortest kinds of valid inputs (sentences) that match the rules? To generate the shortest sentence, you omit all optional and zero-or-more components. Then for longer inputs, you let optional or zero-or-more rules occur once - think of this as gradually releasing the compression of several tightly coiled springs.
- The shortest sentence matching declns clearly has zero decln matches:
```
declns = tEOF
```

So an empty file is the shortest (albeit damn silly!) sentence. In this case, datadec would generate an essentially empty ADT module. If we dislike this idea, we could disallow it by making our grammar require one-or-more decln matches. This is usually written as:

```
declns = decln+ tEOF
```

Alternatively, it can be written as one decln followed by zero-or-more declns:
```
declns = decln decln* tEOF
```

- The second shortest class of sentence occurs when exactly one decln appears, but with all optional/repeated elements omitted, ie. as if the grammar had been written:

```
declns = decln tEOF
decln  = tID tEQ shapes tSEMI
shapes = shape
shape  = tID
```

Substituting these rules together, we get:

```
declns = tID tEQ tID tSEMI tEOF
```

Thus, the second shortest type of sentence is of the form:

```
rdt1 = shape1;
```
- Either there is still one decln and one shape, but this time the shape has a single parameter:

```
rdt1 = shape1( type1 param1 );
```

(length 9 tokens including tEOF).

- Or there is still one decln, but two shapes - each with zero parameters:

```
rdt1 = shape1 or shape2;
```

(length 7 tokens, shorter than the first option).

- Or there could be two declns, each with one empty shape:

```
rdt1 = shape1; rdt2 = shape2;
```

(length 9 tokens, longer than the second option).

The third shortest class of sentence is thus of the form:

```
rdt1 = shape1 or shape2;
```

- We could continue to "relax the springs" to investigate slightly less short sentential forms, but I'm sure you get the point now.
Building a Recursive Descent Parser from our grammar
- For simple LL(1) grammars like ours, we can straightforwardly build a recursive descent parser, by turning each rule into a function which attempts to match exactly one occurrence of that rule, and returns a boolean indicating whether or not it succeeded. This relies upon being able to choose a unique parsing path on the basis of a single token of lookahead. Difficulties occur when a single token can be interpreted in more than one way - the dangling else is the classic example of this - google for it. However, our grammar is simple enough to avoid those problems.
- We also need to work out whether any rules can match zero tokens: in this case, none can. Rules matching zero tokens can complicate the parser, but we don't have to worry about that here!
- We start with an error() function, and a 'next token must be X' macro:
```c
static void error( char *s )
{
    fprintf( stderr, "%s at line %d\n", s, lineno );
}

#define MUSTBE(t,mesg) if( nexttok() != (t) ) {error(mesg); return FALSE;}
```

- Our first rule:
```
declns = decln* tEOF
```

can be translated into C as a simple while loop, based on the observation that every RDT declaration must start with an identifier (tID), followed by a MUSTBE(tEOF) test at the end:

```c
BOOL parse_declns( void )
{
    while( nexttok() == tID )   /* NB: a decln starts with a tID */
    {
        ungettok();
        if( ! parse_decln() ) return FALSE;
    }
    ungettok();
    MUSTBE( tEOF, "Spurious characters found beyond EOF" );
    return TRUE;
}
```

- Our second rule:
```
decln = tID tEQ shapes tSEMI
```

is a simple sequence of tokens and rules, and translates pretty obviously:

```c
BOOL parse_decln( void )
{
    MUSTBE( tID, "declaration expected" );
    MUSTBE( tEQ, "'=' expected" );
    if( ! parse_shapes() ) return FALSE;
    MUSTBE( tSEMI, "'or' or ';' expected" );
    return TRUE;
}
```

- Our third rule:
```
shapes = shape (tOR shape)*
```

(a non-empty tOR separated list of shapes) translates to a do..while loop to capture the non-empty aspect of the rule:

```c
BOOL parse_shapes( void )
{
    do
    {
        if( ! parse_shape() ) return FALSE;
    } while( nexttok() == tOR );
    ungettok();
    return TRUE;
}
```

- Our fourth rule:
```
shape = tID [ tOPENBR params tCLOSEBR ]
```

contains an optional part, conveniently marked by a tOPENBR. It translates to:

```c
BOOL parse_shape( void )
{
    MUSTBE( tID, "shape name expected" );
    if( nexttok() == tOPENBR )
    {
        if( ! parse_params() ) return FALSE;
        MUSTBE( tCLOSEBR, "',' or ')' expected" );
    } else
    {
        ungettok();
    }
    return TRUE;
}
```

- Our fifth rule:
```
params = param (tCOMMA param)*
```

is another non-empty token-separated list, just like shapes, so the structure of the parse function is identical:

```c
BOOL parse_params( void )
{
    do
    {
        if( ! parse_param() ) return FALSE;
    } while( nexttok() == tCOMMA );
    ungettok();
    return TRUE;
}
```

- Our final rule:
```
param = tID tID
```

is a very simple sequence of two tID tokens:

```c
BOOL parse_param( void )
{
    MUSTBE( tID, "Field type expected" );
    MUSTBE( tID, "Field name expected" );
    return TRUE;
}
```

Actually, that's such a simple sequence that it's very tempting to inline it into parse_params() and get:

```c
BOOL parse_params( void )
{
    do
    {
        MUSTBE( tID, "Field type expected" );
        MUSTBE( tID, "Field name expected" );
    } while( nexttok() == tCOMMA );
    ungettok();
    return TRUE;
}
```

(This is the code you'll find in parser.c.)

- This completes our parser - simple, isn't it? We add a parsetest test program which initialises the lexer (just as lextest did):
```c
if( argc == 1 )
{
    lexfile = stdin;
} else if( argc == 2 )
{
    lexfile = fopen( argv[1], "r" );
    if( lexfile == NULL )
    {
        fprintf( stderr, "parsetest: can't open '%s'\n", argv[1] );
        exit(1);
    }
} else
{
    fprintf( stderr, "Usage: parsetest [filename]\n" );
    exit(1);
}
```

and then attempts to match declns in the input, printing "success" or "failure" before exiting:

```c
BOOL ok = parse_declns();
printf( "parse: %s\n", ok?"success":"failure" );
exit(0);
```

- You can download this second bundle of source code from: 02minimal-parser.tgz. Feel free to download it, build the code, and test it out. Give it a few correct examples, for example:
```
echo "rdt = a or b or c or d or e or f;" | ./parsetest
echo "token = eq | ort | semi | openbr | closebr | comma | id(string name);" | ./parsetest
echo "rdt = nil or first( int x ) or second( int a, char b );" | ./parsetest
echo "tree = leaf(string id) or node( string l, string r );" | ./parsetest
```

(Note that in our example token RDT above, we can't use or as a shape name, because it's already a token in our lexer. Hence the ugly ort.)

Also try giving parsetest incorrect input - containing a missing or misplaced token etc, as in:
```
echo "rdt = a or b or c d or e or f;" | ./parsetest
echo "token = eq | ort | semi | openbr | closebr | comma | id string name);" | ./parsetest
echo "rdt = nil or first( int x ) or second( int a char b );" | ./parsetest
echo "tree = leaf(string id) or ( string l, string r );" | ./parsetest
echo "intlist = nil or cons( int head intlist tail );" | ./parsetest
echo "intlist = nil cons( int head, intlist tail );" | ./parsetest
```

See if you can find any situation where you're not happy with the accuracy of one of the error messages; improve it, recompile, and try again. While writing this article, despite datadec's parser having been in use for over 20 years by hundreds of users, I found and improved two such messages, so it's quite likely that further improvements are possible!

Third Problem: Building a Parse Tree
- A recognizer parser is useful, but better by far is a parser that builds an internal structure representing the parsed input; this is called a Parse Tree or an Abstract Syntax Tree (AST). This AST will make it very easy to produce datadec's final output - the C equivalent of our RDTs - by traversing or walking over the AST several times. Otherwise, we'd need to reopen our input file several times and reparse it each time. Walking an AST is much simpler.
- This means we'll need to construct a series of simple data structures that mirror the grammar. Looking back at our grammar, we'll need to store information about parameters, lists of parameters, shapes, lists of shapes, declarations and lists of declarations. Rather than define 6 data types, we'll use an old trick - a single item (a parameter, or a shape, or a declaration) can be stored as a one element list, ie. with the "next" or "tail" pointer set to NULL, or at least ignored. In pseudo-code, our 3 AST types are thus:
```
paramlist = list( tuple( name, type ) )
shapelist = list( tuple( shapename, params ) )
declnlist = list( tuple( rdtname, shapes ) )
```

- To implement these lists as standard linked lists, we write the following C type declarations, placing them in ast.h:
```c
struct paramlist;
typedef struct paramlist *paramlist;
typedef paramlist param;        /* with next ptr ignored */

struct paramlist
{
    paramlist next;
    char *type;
    char *name;
};

struct shapelist;
typedef struct shapelist *shapelist;
typedef shapelist shape;        /* with next ptr ignored */

struct shapelist
{
    shapelist next;
    char *name;
    paramlist params;
};

struct declnlist;
typedef struct declnlist *declnlist;
typedef declnlist decln;        /* with next ptr ignored */

struct declnlist
{
    declnlist next;
    char *name;
    shapelist shapes;
};
```
```c
extern paramlist build_paramlist( char *type, char *id, paramlist next );
extern shapelist build_shapelist( char *id, paramlist p, shapelist next );
extern declnlist build_declnlist( char *name, shapelist s, declnlist next );

extern void print_paramlist( FILE *, paramlist p );
extern void print_shapelist( FILE *, shapelist s );
extern void print_declnlist( FILE *, declnlist d );
```

- Next, we implement these 6 functions in ast.c. After a few includes, we have a couple of helper macros:
```c
#define NEW(t)          ((t) malloc( sizeof(struct t) ))
#define COPYOF(new,old) {new=malloc(1+strlen(old));if(new)strcpy(new,old);}
```

- Then the paramlist functions, builder and printer:
```c
paramlist build_paramlist( char *type, char *id, paramlist next )
{
    paramlist new = NEW(paramlist);
    COPYOF(new->type, type);
    COPYOF(new->name, id);
    new->next = next;
    return new;
}

void print_paramlist( FILE *out, paramlist p )
{
    while( p != NULL )
    {
        fprintf(out, "%s %s", p->type, p->name);
        p = p->next;
        if( p != NULL )
        {
            fputs( ", ", out );
        }
    }
}
```

- Next, the shapelist function pair:
```c
shapelist build_shapelist( char *id, paramlist p, shapelist next )
{
    shapelist new = NEW(shapelist);
    COPYOF(new->name, id);
    new->params = p;
    new->next = next;
    return new;
}

void print_shapelist( FILE *out, shapelist s )
{
    while( s != NULL )
    {
        fputs( s->name, out );
        if( s->params )
        {
            fputs( "( ", out );
            print_paramlist( out, s->params );
            fputs( " )", out );
        }
        putc( ' ', out );
        s = s->next;
        if( s != NULL )
        {
            fputs( "or ", out );
        }
    }
}
```

- Finally, the declnlist function pair:
```c
declnlist build_declnlist( char *name, shapelist s, declnlist next )
{
    declnlist new = NEW(declnlist);
    COPYOF(new->name, name);
    new->shapes = s;
    new->next = next;
    return new;
}

void print_declnlist( FILE *out, declnlist d )
{
    for( ; d; d = d->next )
    {
        fputs( d->name, out );
        fputs( " = ", out );
        print_shapelist( out, d->shapes );
        fputs( ";\n", out );    /* terminate each decln as in the input */
    }
}
```

Note that the printout routines attempt to reproduce the original little language input; this makes visual inspection easier than any other form I can think of, as we're already familiar with it.
- Now that we've implemented the ast module, we build a quick test program asttest.c which manually builds a few RDT declarations and their accompanying paramlists and shapes (note that each list is built in reverse order), and then prints them out:
```c
shapelist s = build_shapelist( "d", NULL, NULL );
s = build_shapelist( "c", NULL, s );
s = build_shapelist( "b", NULL, s );
s = build_shapelist( "a", NULL, s );
declnlist d = build_declnlist( "abcd", s, NULL );

paramlist p = build_paramlist( "string", "gval", NULL );
s = build_shapelist( "g", p, NULL );
s = build_shapelist( "f", NULL, s );
s = build_shapelist( "e", NULL, s );
d = build_declnlist( "efg", s, d );

p = build_paramlist( "intlist", "tail", NULL );
p = build_paramlist( "int", "head", p );
s = build_shapelist( "cons", p, NULL );
s = build_shapelist( "nil", NULL, s );
d = build_declnlist( "intlist", s, d );

print_declnlist( stdout, d );
```

- You can download this third bundle of source code from 03ast.tgz. Feel free to download it, build the code, and run it. You should see the output:
```
intlist = nil or cons( int head, intlist tail ) ;
efg = e or f or g( string gval ) ;
abcd = a or b or c or d ;
```

Now that we have the AST module, we want to add appropriate calls to the parser to automatically build an AST and return it when parsing completes successfully. We need to retain the idea that parsing an individual rule can succeed or fail, but add the concept that on success, a piece of AST is also constructed and returned.
- There is a minor technical difficulty - C does not support returning multiple values from a function. We'd like to write something like:
```c
BOOL ok;
shapelist s;
(ok, s) = parse_shapelist();
```

with the understanding that we only use s if ok is true. But C doesn't let us write that.

An alternative (more functional programming style) way of doing this would be to define a two-shape RDT, effectively a variant record:
```
fail_or_shapelist = fail or success( shapelist sl );
```

and make the function return a fail_or_shapelist, so it would be called as follows:

```
if parse_shapelist() is success(sl)
{
    // handle the shapelist sl
} else
{
    // handle failure.
}
```

Apart from observing that I'd love to program in that pseudo-code language - essentially a procedural version of a functional programming language like Haskell - that seems like overkill, so let's reject trying to implement it.
- So what can we do in ANSI C? Return the ok boolean, and simulate the additional return value(s) via what Pascal would call var-parameters, or Algol inout parameters. These are parameters that convey values in and out of the function rather than purely into it. In C, you implement var-parameters by passing a pointer to whatever additional value you want to send in and out, even if that value is already a pointer. In fact, we've already seen these pointer-parameters in action - back in the first part of this article, every datadec deconstructor function delivers the internal pieces of the shape being broken apart via pointer-parameters.
Thus, we call:
```c
BOOL ok;
shapelist s;
ok = parse_shapelist( &s );
```

and we define the function as:

```c
BOOL parse_shapelist( shapelist *sl );
```

In the body of parse_shapelist, we might set the var-parameter by:

```c
*sl = build_shapelist( pieces );
```

because this, by dereferencing sl and assigning to it, sets the caller's variable s to the value returned by build_shapelist. This, btw, is how var-parameters actually work in Pascal/Algol/whatever. C simply makes the magic extra level of pointers explicit, and therefore makes understanding this at the machine level easier.

Btw, I got this idea years ago from programming in Prolog - every predicate in Prolog either fails to match, or matches and binds its arguments to the (first set of) values that make it match. Stripping out Prolog's lovely backtracking engine, and making the boolean return value explicit, gives us:
```
ok = try_to_do_something( args, address-of-each-success-value );
```

as a method of implementing the pseudo-code:

```
(ok, args_if_ok) = try_to_do_something( args );
```

I like it a lot more than recent languages' overblown exception handling mechanisms, where you spend more time matching which of a dozen exceptions could be thrown!

Using this convention, each of our parse functions will end up taking a single pointer parameter and returning a boolean. So, our AST-building parser.h has the revised prototypes:
```c
extern BOOL parse_declns( declnlist *d );
extern BOOL parse_decln( declnlist *d );
extern BOOL parse_shapes( shapelist *s );
extern BOOL parse_shape( shapelist *s );
extern BOOL parse_params( paramlist *p );
```

Our AST-building parser.c needs a few more definitions at the top:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "bool.h"
#include "lexer.h"
#include "ast.h"
#include "parser.h"

static void error( char *s )
{
    fprintf( stderr, "%s at line %d\n", s, lineno );
}

#define COPYOF(new,old) {new=malloc(1+strlen(old));if(new)strcpy(new,old);}
#define MUSTBE(t,mesg)  if( nexttok() != (t) ) {error(mesg); return FALSE;}
```

Working from the bottom up, our AST-building version of parse_params() reads:
```c
BOOL parse_params( paramlist *p )
{
    for(;;)
    {
        char *name;
        char *type;

        MUSTBE( tID, "Field type expected" );
        COPYOF( type, lexidval );
        MUSTBE( tID, "Field name expected" );
        COPYOF( name, lexidval );
        *p = build_paramlist( type, name, NULL );
        if( nexttok() != tCOMMA ) break;
        p = &( (*p)->next );
    }
    ungettok();
    return TRUE;
}
```

It builds a correctly ordered list of parameters, appending each one to the end as it parses it. We do this by turning our earlier do..while loop into an N-and-a-half times loop, which exits in the middle. We do our MUSTBE() checks as before, and immediately after each tID token is found, we use our COPYOF() macro to make a malloc()ed copy of lexidval, before we call nexttok() again - which will overwrite lexidval if the next token is another tID. If we accidentally replaced one of the COPYOF() invocations (the first, say) with:

```c
type = lexidval;
```

the C compiler would not object. However, the AST would contain multiple pointers all to the same global lexidval variable. After a successful parse, lexidval would contain the name of the last identifier seen in the input, so every identifier in our AST would be shown as that final identifier!

Next, we build a 1-element parameter list containing our type and name string values and a NULL pointer, storing the result in *p. The first time through the loop, *p will of course set the caller's variable whose address was passed in. Where does p point on subsequent iterations? We'll come back to this.

Next, we break out of the loop unless the next token is a tCOMMA. This (or its negation) used to be the do..while exit condition. Finally, the:

```c
p = &( (*p)->next );
```

trick makes p point at the newly allocated parameter's next field, giving the next iteration the ability to alter that next field. Thus, within the loop, p always points at the pointer we need to alter to append the new element to the end of the list. This trick builds the paramlist in the correct order. If you're not sure how that works, single step through, drawing pointers and boxes, and you'll see how!

Btw, had we chosen to use Yacc to build our parser, this trick of appending at the end of the list tends not to work, for reasons beyond the scope of this article to explain (look up left recursion in Yacc). As a consequence, when using Yacc there are a depressing number of fixup actions needed to take a back-to-front list delivered by a lower-level list-parsing grammar rule, and then call a function to reverse the list before building it into the AST. This is a pain!
Moving from paramlists to shapes, our AST-building parse_shape() and parse_shapes() routines read:
BOOL parse_shape( shapelist *s ) { char *name; paramlist p = NULL; MUSTBE( tID, "shape name expected" ); COPYOF( name, lexidval ); if( nexttok() == tOPENBR ) { if( ! parse_params(&p) ) return FALSE; MUSTBE( tCLOSEBR, "',' or ')' expected" ); } else { ungettok(); } *s = build_shapelist( name, p, NULL ); return TRUE; } BOOL parse_shapes( shapelist *s ) { *s = NULL; for(;;) { if( ! parse_shape( s ) ) return FALSE; if( nexttok() != tOR ) break; s = &( (*s)->next ); } ungettok(); return TRUE; }

Note that parse_shapes() is another N-and-a-half times loop using the append trick.

Finally, our AST-building parse_decln() and parse_declns() routines are as follows:
BOOL parse_decln( declnlist *d ) { char *name; shapelist s; MUSTBE( tID, "declaration expected" ); COPYOF( name, lexidval ); MUSTBE( tEQ, "'=' expected" ); if( ! parse_shapes(&s) ) return FALSE; MUSTBE( tSEMI, "'or' or ';' expected" ); *d = build_declnlist( name, s, NULL ); return TRUE; } BOOL parse_declns( declnlist *d ) { *d = NULL; while( nexttok() == tID ) { ungettok(); if( ! parse_decln( d ) ) return FALSE; d = &( (*d)->next ); } ungettok(); MUSTBE( tEOF, "Spurious characters found beyond EOF" ); return TRUE; }

Just like parse_params() and parse_shapes(), parse_declns() uses the append trick.

Finally, we adapt our parsetest program as follows:
#include "ast.h" ... declnlist d; BOOL ok = parse_declns(&d); printf( "parse: %s\n", ok?"success":"failure" ); if( ok ) { printf( "\nparse tree:\n\n" ); print_declnlist( stdout, d ); }

You can download this 4th bundle of source code from 04ast-parser.tgz. Feel free to download it, build the code, and run it. Running

./parsetest types.in

should give the output:

parse: success parse tree: intlist = nil or cons( int head, intlist tail ) ; efg = e or f or g( string gval ) ; abcd = a or b or c or d ;

Fourth Problem: Building the Code Generator
The final part of our journey is to write the Code Generator, the part of our system that generates the C code corresponding to our RDTs. This will perform several traversals of the AST, printing various snippets of valid C code into two files (base.h and base.c, where base is the name of the C module we're generating).
As we saw in part 1, the full version of datadec performs a number of data structure optimizations when a particular RDT fits one or more special cases. When I developed datadec, approximately 50% of the total effort was spent on getting those optimizations correct; the details of how the optimizations interact with each other were really tricky. So, for our purposes here, we're going to ignore all the optimizations.
Let's start by working out how to produce the (relatively simple) header file. The obvious thing is to look at some actual code produced by datadec. Given types.in:
TYPE { intlist = nil or cons( int head, intlist tail ); efg = e or f or g( string gval ); abcd = a or b or c or d; }

Run datadec with optimizations turned off:

datadec -n output types.in

This generates the output.h file; let's look at it in sections. The first section is some simple introductory boilerplate material, literal text which can be printed out directly:
/* * Automatically Generated by DataDec */ typedef char *string; typedef char BOOL; #define TRUE 1 #define FALSE 0 #define NEW(t) ((t)malloc(sizeof(struct t)))

The second section is where we declare, for each RDT, a forward declaration of a structure type and a pointer to it:
struct intlist; typedef struct intlist *intlist; struct efg; typedef struct efg *efg; struct abcd; typedef struct abcd *abcd;

For a particular RDT named T, we produce:

struct T; typedef struct T *T;

The third section of output.h is where we fully define each RDT's kind enumeration and tagged variant data structure. For
intlist = nil or cons( int head, intlist tail )
we generate:

/* ---- Type intlist ---- */ typedef enum { intlist_is_nil, intlist_is_cons, } kind_of_intlist; struct intlist { kind_of_intlist tag; union { struct { int head; intlist tail; } cons; } u; };

For
efg = e or f or g( string gval )
we get:

/* ---- Type efg ---- */ typedef enum { efg_is_e, efg_is_f, efg_is_g, } kind_of_efg; struct efg { kind_of_efg tag; union { struct { string gval; } g; } u; };

Our final RDT, abcd = a or b or c or d
, generates:

/* ---- Type abcd ---- */ typedef enum { abcd_is_a, abcd_is_b, abcd_is_c, abcd_is_d, } kind_of_abcd; struct abcd { kind_of_abcd tag; union { } u; };

Note that each of the 3 structure declarations above shows a place where the optimizations could be applied:
- struct intlist has only one structure (cons) inside the union. Obviously there's no point having a union in that case.
- struct efg has only one structure (g) inside the union (as above), and furthermore, there's only one field (gval) inside that structure.
- struct abcd (representing a pure enumeration) has a completely empty union; not only is the union pointless, but the structure then contains nothing but a tag. In this extreme case, the pure enumeration could even be represented as a C enumeration, or an integer.
The final part of output.h is all the prototype declarations:
/* Prototypes for all types */ extern intlist intlist_nil( void ); extern intlist intlist_cons( int , intlist ); extern kind_of_intlist intlist_kind( intlist ); extern void get_intlist_cons( intlist , int * , intlist * ); extern void print_intlist( FILE *, intlist ); extern efg efg_e( void ); extern efg efg_f( void ); extern efg efg_g( string ); extern kind_of_efg efg_kind( efg ); extern void get_efg_g( efg , string * ); extern void print_efg( FILE *, efg ); extern abcd abcd_a( void ); extern abcd abcd_b( void ); extern abcd abcd_c( void ); extern abcd abcd_d( void ); extern kind_of_abcd abcd_kind( abcd ); extern void print_abcd( FILE *, abcd );

These comprise, for each RDT, one constructor for each shape:
extern intlist intlist_nil( void ); extern intlist intlist_cons( int , intlist );

Each one of these is of the form:

extern T T_S( PL );

(where T is the typename, S is the shapename, and PL is the comma-separated list of the parameter types of S). Next, we have one kind function, eg:
extern kind_of_intlist intlist_kind( intlist );

This is of the form:

extern kind_of_T T_kind( T );

Next, for each shape with any parameters, a deconstructor function:

extern void get_intlist_cons( intlist , int * , intlist * );

Of the form:

extern void get_T_S( T , T1 * , T2 *... );

(where T1, T2 etc are the types of S's parameters). Finally, a print function prototype:
extern void print_intlist( FILE *, intlist );

Of the form:

extern void print_T( FILE *, T );

Now that we understand the sort of output we want, let's start writing the code to achieve it. Let's call the code generator module cg.c. After various includes, cg.c kicks off with a general purpose line indentation system:
static int numtabs = 0; static FILE *outfile; #define indent() numtabs++ #define outdent() numtabs-- #define usefile(f) outfile = f, numtabs = 0 /*VARARGS*/ static void line( fmt, a, b, c, d ) char *fmt; long a, b, c, d; { int i; for( i=numtabs; i; i-- ) fputc( '\t', outfile ); fprintf( outfile, fmt, a, b, c, d ); fputc( '\n', outfile ); }

The line() function's parameter list is written in old-school, pre-ANSI K&R C, which is a simple way of switching off "wrong number of parameters" error messages. line() is called with a printf() style format string fmt and up to 4 additional parameters. This is the cheapest and easiest way I know to do this. The /*VARARGS*/ comment is a traditional K&R lint-style comment; I'm not sure whether recent C compilers care about it, but I left it there anyway.

Now that we have the indentation system, let's write code to create the .h file and emit the top declarations (calling ourselves MiniDataDec):
void make_declns( declnlist d, char *base ) { printf( "Making data declarations in %s.h, %s.c file coming soon\n", base, base ); h_declns( d, base ); /* c_declns( d, base ); */ } static void h_declns( declnlist d, char *base ) { char tempname[256]; FILE *hfile; sprintf( tempname, "%s.h", base ); hfile = fopen( tempname, "w" ); if( hfile == NULL ) { fprintf( stderr, "minidatadec: can't create '%s'\n", tempname ); exit(1); } usefile( hfile ); line( "/*" ); line( " * Automatically Generated by MiniDataDec" ); line( " */\n" ); line( "typedef char *string;\n" ); line( "typedef char BOOL;" ); line( "#define TRUE 1" ); line( "#define FALSE 0\n" ); line( "#define NEW(t) ((t)malloc(sizeof(struct t)))\n\n" ); data_decls( d ); protos( d ); fclose( hfile ); }

data_decls() is responsible for emitting the data declarations, and protos() (obviously) the prototype declarations. Starting with data_decls(), remember that, for each particular RDT (called T), we want to emit forward structure declarations and pointer declarations:
struct T; typedef struct T *T;

This code emits that output:

static void data_decls( declnlist decs ) { declnlist d; for( d=decs; d != NULL; d=d->next ) { line("struct %s;", d->name); line("typedef struct %s *%s;", d->name, d->name ); } fputc( '\n', outfile ); fputc( '\n', outfile );

The second half of data_decls() is responsible for generating the type declarations for each RDT:
for( d=decs; d != NULL; d=d->next ) { declare_type( d ); } }

declare_type( decln d ) is responsible for emitting the kind enumeration and the tagged variant data structure for a single RDT declaration. First, to emit a kind enumeration of the form:
/* ---- Type abcd ---- */ typedef enum { abcd_is_a, abcd_is_b, abcd_is_c, abcd_is_d, } kind_of_abcd;

We write:
static void declare_type( decln d ) { char *dname = d->name; line( "/* ---- Type %s ---- */\n", dname ); line( "typedef enum {" ); indent(); shapelist s; for( s = d->shapes; s != NULL; s=s->next ) { line( "%s_is_%s,", dname, s->name ); } outdent(); line( "} kind_of_%s;\n", dname );

Second, to emit the tagged variant data structure for a single RDT declaration, such as:
struct intlist { kind_of_intlist tag; union { struct { int head; intlist tail; } cons; } u; };

We continue declare_type() as follows - first emit the structure and the tag field:

line( "struct %s {", dname ); indent(); line( "kind_of_%s\ttag;", dname );

Then emit the union, containing, for each shape, an internal structure declaration with all that shape's parameters:

line( "union {" ); indent(); for( s = d->shapes; s != NULL; s=s->next ) { if( s->params == NULL ) continue; /* internal struct */ line( "struct {" ); indent(); paramlist p = s->params; for( ; p != NULL; p=p->next ) { line( "%s\t%s;", p->type, p->name ); } outdent(); line( "} %s;", s->name ); } outdent(); line( "} u;" ); outdent(); line( "};\n" ); }

Note that we have permitted ourselves one tiny optimization: we do not bother to emit an internal structure for a shape with no parameters.

Next, we move on to protos(), to emit the prototypes. The overall structure is yet another traversal over all declarations:
static void protos( declnlist d ) { line( "\n/* Prototypes for all types */\n" ); for( ; d != NULL; d = d->next ) { proto_types( d ); } }

Then, for one declaration, we produce a constructor prototype for each shape:

static void proto_types( decln d ) { char *dname = d->name; shapelist s; /* constructor function prototypes */ for( s = d->shapes; s != NULL; s = s->next ) { cons_fn_proto( dname, s ); }

Then a deconstructor kind prototype:

deconskind_proto( dname );

Then a deconstructor prototype for each non-empty shape:

for( s = d->shapes; s != NULL; s = s->next ) { if( s->params != NULL ) { decons_fn_proto( dname, s ); } }

Finally a print function prototype:

print_fn_proto( dname ); }

Looking at these in turn, to emit a single shape's constructor prototype, we write:
static void cons_fn_proto( char *dname, shape s ) { fprintf( outfile, "extern %s %s_%s( ", dname, dname, s->name ); if( s->params == NULL ) { fputs( "void ", outfile ); } else { BOOL first = TRUE; paramlist p; for( p = s->params; p != NULL ; p=p->next, first=FALSE ) { if( !first ) fputs( ", ", outfile ); fputs( p->type, outfile ); fputc( ' ', outfile ); } } fputs( ");\n", outfile ); }

The tricky part here is suppressing the extra comma: the first variable, and the "print a comma unless this is the first parameter" logic. For those familiar with Perl, this is where join(",", @args)
is so lovely.

Our next prototype generation function is the very simple deconstructor kind prototype:
static void deconskind_proto( char *dname ) { fprintf( outfile, "extern kind_of_%s %s_kind( %s );\n", dname, dname, dname ); }

Next, one non-empty shape's deconstructor prototype can be generated as follows:
static void decons_fn_proto( char *dname, shape s ) { if( s->params == NULL ) { fprintf( stderr, "decons_fn_proto: no fields in shape!!\n" ); exit(1); } fprintf( outfile, "extern void get_%s_%s( %s ", dname, s->name, dname ); paramlist p; for( p = s->params; p != NULL ; p=p->next ) { fprintf( outfile, ", %s * ", p->type ); } fputs( ");\n", outfile ); }

Our final prototype generation function generates the print function's prototype:
static void print_fn_proto( char *name ) { fprintf( outfile, "extern void print_%s( FILE *, %s );\n", name, name ); }

At this point, we have finished generating the entire header file, and hopefully you now see all the principles involved in code generation. It's just a series of traversals over the AST - as many as we find necessary; in our case, over declnlists, then over shapelists, then over paramlists - printing out snippets of valid C code, which form our output files.
We perform exactly the same process to produce the C file (implementation module). However, this part of the article is quite long enough already. You can download this 5th and final bundle of source code from 05codegen.tgz. This contains all the code we've built so far, and the full code generator module. Download it, extract the tarball into a directory, and then use
make cutdowncg
to build a version of the code generator which only builds the header file, for your inspection. Or use

make fullcg
to build the complete code generator (which builds the C file as well), linking the lexer, parser, AST and code generator together to get the finished minidatadec.

Then
make fullcg
carries on, using minidatadec to generate datatypes.[ch] from our types input file:

./minidatadec types.in parse: success parse tree: intlist = nil or cons( int head, intlist tail ) ; efg = e or f or g( string gval ) ; abcd = a or b or c or d ; datadec: Making data declarations in datatypes.[ch]

and then compiles it and our previous test program, then links them together to form testintlist:

gcc -g -UDEBUGGING -Wall -c -o testintlist.o testintlist.c gcc -g -UDEBUGGING -Wall -c -o datatypes.o datatypes.c gcc testintlist.o datatypes.o -o testintlist

The surest proof that our code generator is correct is that datatypes.c compiles, links against testintlist.c, and works correctly when you run it:

./testintlist kind(nil) == nil: ok kind([100]) == cons: ok head([100]) == 100: ok kind(tail([100])) == nil: ok head([200,100]) == 200: ok head(tail([200,100])) == 100: ok kind(tail([200,100])) == nil: ok sum 1000: ok the final list is: cons(400,cons(300,cons(200,cons(100,nil)))): can't check

Summary.
In this 3-part article, I've shown most of the process by which I designed and built the datadec tool, enabling C programmers to use functional programming style Recursive Data Types freely in their programming. I find this ability enormously helpful in C programming, and strongly believe that it boosts one's C productivity tremendously. I haven't shown the complete code generator, and I've completely omitted the optimizer (which notices when an RDT happens to allow some useful optimizations in the generated C code, eg using NULL to represent an empty shape), but what I have shown gives an excellent flavour of the design and implementation, allowing you to see (a reconstruction of) the code as it was written.
datadec is a substantial tool, requiring about a fortnight's work to build, test and debug. But it's paid that investment back tenfold over the years. It uses the following "advanced" techniques:
- Designing a "little language", in this case a basic RDT syntax plus print hints and an outer language (not covered here in these articles).
- Building a Recursive Descent Parser (and its helper, a Lexical Analyser) for that language.
- Building an Abstract Syntax Tree (AST) representing a valid RDT language program after a successful parse.
- Generating valid C code by tree walking - repeatedly traversing the AST and emitting snippets of valid C code, simply by printing them out with the correct indentation.
These articles may either be viewed as a simple description of what I did to design and build this tool, or more generally as an example of the sort of tool building you can, and should, aspire to do yourself. A good programmer uses tools others have built; a great programmer knows when it's worth building their own tools, and how to do so.
Remember that you can download the full version of datadec from http://csgsoft.doc.ic.ac.uk/datadec. There are compilation and installation instructions inside the tarball - in most cases
make install
is all you need.

If these articles, or datadec itself, are useful to you, I'd be delighted if you'd drop me a quick email to let me know:-) -- Duncan White, 29th Dec 2014.
d.white@imperial.ac.uk Written: Nov-Dec 2014