Practical Software Development: TEFEL: Tool-Enhanced Features for Existing Languages, Part 3::

Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.

See all my Practical Software Development (PSD) Pages

A New Approach to Building Code Generators: TEFEL Part 3

Recap of Parts 1 and 2:
In Part 1 of this article I showed how to start building a C with Pattern Matching (CPM) to C translator, using a new technique I call: Tool-Enhanced Features for Existing Languages (TEFEL). I did this by writing a simple line-by-line pre-processor that copies most lines through unchanged (hoping they're valid C), but translates Extension Directives into a corresponding chunk of plain C. Thus, C with directives comes in, and standard C goes out.
The CPM tool fits in an ecosystem with my existing Datadec tool (as described in Article 8 of this series), which allows you to write declarative recursive data type definitions like:
  TYPE {
        tree    =  leaf( string name )
                or node( tree left, tree right );
	intlist =  nil
		or cons( int head, intlist tail );
  }
and then generates an entire plain C module to implement those types, correctly and in a standardised way.
However, Datadec never had any special support for writing client-side code that uses Datadec-generated types. C with Pattern Matching attempts to fix that. In about 200 lines of Perl code, I managed to achieve a minimal, fully working translator, supporting a single Extension Directive (%when):
  %when tree t is node( tree l, tree r )
  {
    .. code using l and r
  }
In Part 2 of this article I significantly extended my CPM tool to support another 3 Extension Directives:
  %assert tree t is node( tree l, tree r )

  %while intlist l is cons( int head, - l )
  {
    ..
  }

  %foreach int head in intlist l
  {
    ..
  }
Part 2 ended up with a working 355-line Perl script called cpm-v4 which correctly translated CPM input containing any number of the directives above to plain C.
Linking CPM and Datadec?
At the end of part 2, I said: so far all versions of the CPM tool have been entirely stand-alone, independent of Datadec itself, but is there any advantage to somehow linking the two tools more closely?

Let's think about this now: Well, Datadec obviously knows:

which recursive data types exist,

for each recursive data type, which shapes exist, and

for each shape, how many parameters, of what types, there are.
That information would be quite useful to CPM too: so far we've assumed that the author of each CPM input file always gets it right! i.e: if CPM reads the line:
  %when aardvark a is pangolin( int z )
CPM has to assume that:

There is a recursive data type called aardvark

That aardvark has a shape called pangolin

That aardvark's pangolin shape has precisely one parameter, of type int.
If this is correct, as it has been in all .cpm examples so far, then the generated C code will compile. If not, the user will have to diagnose what's wrong with the automatically generated code that CPM wrote for them.
If, however, CPM could ask Datadec what shape signatures each type had, it can check these sorts of errors itself, and generate much more pleasant error messages, along the lines of:
  %when aardvark a is pangolin( int z )
        ^ No such type <<aardvark>>

  %when aardvark a is pangolin( int z )
                      ^ No shape <<pangolin>> in <<aardvark>>
There's a second advantage too - when I first presented the %when syntax, I required that the parameter types were stated explicitly, as in:
	%when tree t is node( tree l, tree r )
But if CPM can ask Datadec for all the the parameter type information, then it can infer the parameter types for us. So we can write the simpler CPM directive:
	%when tree t is node( l, r )
This reads more elegantly. Indeed, the syntax is practically back to our original whenshape concept, although we can't easily remove the type name tree from %when tree t. Once CPM knows that t is a tree, it can do the rest - check that node is a valid shape of tree, and that tree nodes have two parameters - and that they are both trees.
So, how might CPM extract information about the recursive data types from Datadec? If you read Datadec's manual page via:
  man datadec
You may notice Datadec's -m option, described as:
  datadec [-m] infile
  ....
  -m      Do  **NOT**  produce the normal module.  Instead, produce meta-
          data on stdout.  This lists, for each type and shape, the type-
          name,  the  shapename,  and a comma-separated list of the shape
          parameter types.
Let's try running that, with types.in as the input, ie:
  datadec -m types.in > types.md
types.md now contains:
  tree leaf string
  tree node tree,tree
  intlist nil 
  intlist cons int,intlist
Each line in types.md defines one shape of each recursive data type, as 3 space-separated fields:

The first field is the type name.

The second field is the shape name.

The final field is a comma-separated list of the types of that shape's parameters.

Yippee! This is precisely the information CPM needs!
CPM->C translator: Version 5
Ok, to start implementing CPM version 5, we'll need a function to read the metadata that datadec -m emits:
  #
  # my %shapetypes = read_datadec_metadata( $filename );
  #	Read the datadec-generated metadata file $filename, listing types,
  #	shapes and a list of the parameter types that shape takes.
  #	Build and return %shapetypes, a mapping from type_shape to the
  #	comma separated list of typenames.
  #
  fun read_datadec_metadata( $filename )
  {
	open( my $fh, '<', $filename ) ||
		die "can't open datadec-generated $filename\n";
	my %result;
	while( <$fh> )
	{
		chomp;
		s/\s+/_/;
		my( $ts, $paramtypes ) = split( /\s+/ );
		$result{$ts} = $paramtypes;
		#print "ts=$ts, types=$paramtypes\n";
	}
	close( $fh );
	return %result;
  }
Note that we sneakily join the type and the shape name together, by replacing the first whitespace sequence with an underscore ('_').
The main program needs to take an additional command line argument to specify the meta-data filename (eg types.md), and will then call the above function to read that file:
  my %shapetypes;	# mapping from type_shape -> list of parameter types
  ...
  die "Usage: cpm-v5 datadec-metadata-file filename\n" unless @ARGV == 2;
  my $mdfilename = shift;
  die "cpm: no datadec-generated metadata file $mdfilename, ".
      "run datadec -m to generate it\n" unless -f $mdfilename;

  %shapetypes = read_datadec_metadata( $mdfilename );
We also need to build an additional hash %istype: a set of all recursive data types that exist. This will let us check whether a particular named type exists. Remembering that read_datadec_metadata() inserted an '_' between the type name and the shape name, let's extract the part of each such string before the '_', and build a set of those:
  my %istype;		# set of datadec types that exist
  ...
  %istype = map { s/_.*$//; $_ => 1 } keys %shapetypes;
These two hashes now enable us, throughout the program, to answer three vital questions:
Is $t a valid recursive data type? - by writing
  if $istype{$t}
Is $s a valid shape of type $t? - by writing
  if defined $shapetypes{"${t}_${s}"}
What parameter types does shape $s of type $t have? - by writing
  my $types = $shapetypes{"${t}_${s}"};
This delivers a comma separated list eg $shapetypes{"tree_node"} is tree,tree.
Next, we want to add checks during parsing that every type, and every shape, that the user names in their .cpm files, exist. In parse_match() we matched the typename after the %when (or similar) command by:
	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";
Now we extend that slightly:
	my $typeline = $line;
	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";

	# now check that the type actually exists
	fatal( $typeline, "No such type <<$type>>" ) unless $istype{$type};
We take a copy of the line before removing the typename, storing it in $typeline, in order to position the "No such type" message right at the type name.
Similarly, a few lines further down parse_match(), we match the shape name after 'is' by:
	my $shapeline = $line;
	fatal( $line, "ID (constructor name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s*//;
	my $shape = $1;
	$sofar .= " $shape";

	# now check that the shape actually exists
	fatal( $shapeline, "No such <<$type>> shape <<$shape>>" )
		unless defined $shapetypes{"${type}_$shape"};
Ok, next, we can remove the parameter types from the .cpm files, and the code that parses them. Specifically, we want the shape argument list (that we chose not to check) to change:

From a comma-separated list of paramtype paramname pairs, where the paramtype could be '-'

To a comma-separated list of parameter names, each optionally preceded by a '-'
That doesn't require any code changes in parse_match(), because we didn't parse the internal structure of the argument list, but we should update the function comment that describes the syntax:
  # '%when|%assert|%while' TYPE(ID) VAR(ID) 'is' CONS(ID) ( '(' ARGLIST ')' )
  # where ARGLIST is a comma separated list of ['-'] paramnames,
  # where each parameter name is an ID.
Turning to parse_foreach(), we need to make numerous small changes, removing every reference to $paramtype (one element of the 5-tuple return value was $paramtype, so the return value changes to being a 4-tuple now). It is easiest just to show you the new version of parse_foreach(), in pieces with commentary:
The only significant change to the top section is that we return a 4-tuple now (losing $paramtype), and the grammar no longer mentions PARAMTYPE. The initial code is unchanged:
  #
  # my( $command, $type, $var, $paramvar ) = parse_foreach( $line );
  #	After checking that $line starts with a %foreach, parse
  #	the rest of the line.
  #	If it parses return (command, type, var, paramvar)
  #	otherwise die via fatal()
  #
  #	'%foreach' PARAMVAR 'in' TYPE VAR
  #
  fun parse_foreach( $line )
  {
	$line =~ s/^(%\S+)\s*//;
	my $command = $1;
	$command =~ s/shape$//;	# reduce %foreachshape etc.. to %foreach

	my $sofar = $command;
	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $paramvar = $1;
	$sofar .= " $paramvar";

	fatal( $line, "'in' expected after <<$sofar>>" )
		unless $line =~ s/^in\s+//;
	$sofar .= " in";
In the middle section, there are changes to implement type and shape checking:
	my $typeline = $line;
	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";

	# now check that the type $type exists
	fatal( $typeline, "No such type <<$type>>" ) unless $istype{$type};

	# now check that $type is a list, ie. has a "cons" shape
	fatal( $typeline, "No such <<$type>> shape << cons >>" )
		unless defined $shapetypes{"${type}_cons"};
We could go on to check that the cons shape has two parameters, but I didn't bother! Remember: tools don't have to be perfect.
The last part of the function has only one change - returning a 4-tuple:
	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)//;
	my $var = $1;

	fatal( $line, "end-of-line expected after <<$sofar>>" ) unless
		$line =~ s/^\s*$//;

	return( $command, $type, $var, $paramvar );
  }
Next, we must modify take_object_apart(), which is the code that does take the argument list apart. Again, it's simpler to show you the new version of the function, in pieces with commentary:
The only significant change to the top section is that, now that there are no typenames, we can aggressively remove all whitespace, and simplify later regex matches (we also tweak the comment slightly with respect to '-'):
  #
  # my $breakdown = take_object_apart( $type, $var, $shape, $arglist );
  #	Generate a single long line of C code that will take the object
  #	apart, into it's component arguments:
  #	- Declare variables for all arguments in $arglist unless they
  #	  have a leading '-'.
  #	- then call the get_{type}_{shape}() deconstructor with the
  #	  addresses of each of the argument variables.
  #	Return the take apart string..
  #
  fun take_object_apart( $type, $var, $shape, $arglist )
  {
	# remove any and all whitespace
	$arglist =~ s/\s+//g;

	# if the shape has no arguments, then we don't need to take it apart
	return "" unless $arglist;

	# ok, we have one or more arguments, comma separated:
	my @arg = split(/,/, $arglist );
Next, we retrieve the list of types for this shape's parameters, and check that the right number of parameters have been given in the argument list:
	# retrieve the shape types from %shapetypes
	my $types = $shapetypes{"${type}_$shape"};
	die "No such shape $shape of type $type\n" unless defined $types;
	my @type = split(/,/, $types );
	my $n = @type;
	my $has = @arg;

	die "type $type, shape $shape($types): should have $n parameters,".
	    " but has $has arguments ($arglist)\n"
	    	unless $n == $has;
Then, to build the $declns string, our old declaration building code is replaced completely with:
	my $declns = "";
	foreach my $i (0..$#type)
	{
		$declns .= "$type[$i] $arg[$i]; " unless $arg[$i] =~ /^-/;
	}
	my $argstr = join( ', ', map { s/^-//; "&$_" } @arg );
(Note that we have to deal with the optional '-' prefix in both the $declns loop and the $argstr map and join).
Then the function finishes as before:
	my $decons = "get_${type}_${shape}( $var, $argstr );";
	my $result = "$declns$decons\n";
	return $result;
}
Finally, in handle_foreach() we have to change the call to parse_foreach() as that returns one less value in the tuple now:
	my( $command, $type, $var, $paramvar ) = parse_foreach( $line );
The only use of $paramtype in the rest of handle_foreach() was synthezising the argument list, we used to write:
	my $takeapart = take_object_apart( $type, $var,
		"cons", "$paramtype $paramvar, - $var" );
and now we write:
	my $takeapart = take_object_apart( $type, $var,
		"cons", "$paramvar, -$var" );
The difference is subtle, we now synthesize an argument list with no parameter type, but that will be ok, because take_object_apart() will add the parameter type information itself using the metadata.

Phew! That's quite a lot of changes. Fortunately, you don't have to make all those changes one by one:

If you download and extract the cpm-v5.tgz tarball, you will find this includes the new version of CPM: cpm-v5 with all the above changes. Also, all the .cpm files have been modified to use the new simpler "no parameter type" syntax. The Makefile has also been modified to run datadec -m types.in > types.md and pass types.md as an additional (first) command line argument to all invocations of cpm-v5.
Having downloaded and extracted that, build it and run all the tests by:
  make
  make test
You should see it all work perfectly, just as before. But now check the .cpm files and you'll see the parameter types are no longer specified.
Our final cpm-v5 is 418 lines of Perl - up from 355 lines in cpm-v4. So reading the metadata from Datadec, allowing us to simplify all our .cpm files, and improve error checking, added just under 60 lines of code.
Ok, that's really all the modifications we're going to make to our CPM tool now!

Summary
First, an admission: I've not been quite truthful about one small detail: Datadec's -m option is an amazingly good fit for CPM, isn't it? Turns out that's not entirely a coincidence.
In fact, I added the -m option to Datadec precisely in order to give CPM the information it needed. The README in the directory where I originally developed CPM tells the full story:
Determining the list of types that each constructor with 1-or-more parameters takes is as easy as parsing the Datadec generated header file:
grep '^extern void get_' datatypes.h | sed 's/^.*get_//' |
	sed 's/( [^ ]* ,//' | sed 's/ );$//' |
	sed 's/ \*//g' | sed 's/ , /,/g'
(and grepping for #define's with '_'s in them finds all the constructors with no parameters).
That technique got me going, but later on I added a new Datadec "metadata mode" with -m flag, that produced the information I need on stdout, then I changed CPM to take an additional command line parameter - the filename containing the captured metadata and use that instead.
Ok. Summary time. I have now shown you a complete example of how I designed and built a simple TEFEL enhancement tool, CPM, a translator from C with Pattern Matching to plain C. It took me about 6 hours to write originally, although I had my first working version after only 2 hours, and is well under 500 lines of Perl code overall. That's a very small investment compared to the weeks or months trying to insert these new features into a C compiler.

I am now using the final version of my CPM tool (renamed as cpm and installed alongside Datadec itself) routinely in projects that use Datadec, and it has made small but significant improvements to my programming pleasure.

To give you a taste of what C with Pattern Matching can do in the wild, I had previously written an interpreter for a tiny subset of Haskell, using Lex, Yacc and Datadec. When interpreting a function call, we are running the current function's body in a parameter environment where that function's parameters have the current set of values.

If that function body happens to contain a function call (perhaps even a recursive function call to the same function), we must evaluate each expression in the argument list against the current parameter environment, build a new environment that maps each of the function's parameters to the resultant expression value, and only then allow our interpreter to evaluate the called function's body using the new environment - causing the interpreter to perform the recursive call.
CPM allowed me to write the following delightful code to create the new environment - arguments is a list of expressions, and fparams (the function's formal parameters) is a list of identifiers:
	environ newenv   = new_env();

	// iterate over fparams and aparams in lockstep
	int nargs = idlist_length( fparams );
	%foreach fp in idlist fparams
	{
		%assert exprlist arguments is cons(arg,-arguments)

		// evaluate expr <arg> against the current environment
		// and map <fp> to the result in the new environment
		int val = eval_expr( arg, currentenv );
		set_env( newenv, fp, val );
	}
Based on how useful it's already proved in a month or so since I built CPM, I confidently predict that CPM will save me hundreds of hours as I continue to build C-based projects that use Datadec - now with CPM alongside!

As to the TEFEL technique, I have used it another 3 times over the last 18 months (since thinking of it), building experimental Perl-based translation tools that add functions returning tuples, namespaces and modules to C. Other people, no doubt, may think of many other examples. I suspect that the TEFEL technique may be a reasonably general prototyping technique of allowing you to experiment with a new language feature, with a modest investment of time.
d.white@imperial.ac.uk
Back to PSD Top
Written: July 2018