Practical Software Development: TEFEL: Tool-Enhanced Features for Existing Languages, Part 2::

Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.

See all my Practical Software Development (PSD) Pages

A New Approach to Building Code Generators: TEFEL Part 2

Recap of Part 1:
In Part 1 of this article I showed how to start building a C with Pattern Matching (CPM) to C translator, using a new technique I call: Tool-Enhanced Features for Existing Languages (TEFEL). I did this by writing a simple line-by-line pre-processor that copies most lines through unchanged (hoping they're valid C), but translates Extension Directives into a corresponding chunk of plain C. Thus, C with directives comes in, and standard C goes out.
The CPM tool fits in an ecosystem with my existing Datadec tool (as described in Article 8 of this series), which allows you to write declarative recursive data type definitions like:
  TYPE {
        tree    =  leaf( string name )
                or node( tree left, tree right );
  }
and then generates an entire plain C module to implement those types, correctly and in a standardised way.
However, Datadec never had any special support for writing client-side code that uses datadec-generated types. I set out to implement a simple Extension to C that I call C with Pattern Matching (CPM). For example, in CPM you can write:
  int nleaves( tree t )
  {
	%when tree t is leaf(string name)
	{
		return 1;
	}
	%when tree t is node( tree l, tree r )
	{
		return nleaves(l) + nleaves(r);
	}
  }
After a series of short tussock hopping development stages, we ended up with a working 200-line Perl script which correctly translated the above CPM input to plain C.
CPM->C translator: Version 3
Ok, there are a few improvements and additions we can make to this tool: Let's start with that nasty warning we ignored when compiling nleaves.c at the end of Part 1:
  nleaves.c: In function 'nleaves':
  nleaves.c:30:1: warning: control reaches end of non-void function
Ah. Looking back at our generated C code, it was two if-statements, where each then part contained a return statement. Of course, we know that a tree only has two possible shapes, our two if-statements span all possible cases, so control cannot reach the end of the function - but the C compiler doesn't know that!
How could we fix this? C compilers (and the venerable lint tool) once respected the magic comment:
  /*NOTREACHED*/
to suppress such warnings. Sadly, modern C compilers like gcc ignore this, so we need a different technique.
We could manually add a final return ourselves:
	%when tree t is leaf(string name)
	{
		return 1;
	}
	%when tree t is node( tree l, tree r )
	{
		return nleaves(l) + nleaves(r);
	}
	return -1;
But this is rather inelegant, because after all we know it's dead code but the compiler doesn't.
Somehow we need to mark the last %when specially in some way, for example replacing %when .. is with %else .. must be, meaning that the final pattern must match. Using this hypothetical syntax we could instead write:
 int nleaves( tree t )
 {
	%when tree t is leaf(string name)
	{
		return 1;
	}
	%else tree t must be node( tree l, tree r )
	{
		return nleaves(l) + nleaves(r);
	}
 }
But somehow that doesn't appeal to me, and we'd have to think carefully about what code we generate in order to remove the warning.
Instead, let's choose a simpler directive and layout which will do the same job:
 int nleaves( tree t )
 {
	%when tree t is leaf(string name)
	{
		return 1;
	}
	%assert tree t is node( tree l, tree r )
	return nleaves(l) + nleaves(r);
 }
Here, we've replaced %else .. must be with %assert .. is, which reads much more nicely, and also removed the trailing block, making the final piece of code (that uses the results of the asserted patten march) unconditional.
What code will our new directive produce?
	// %assert tree t is node( tree l, tree r ):
        assert( tree_kind(t) == tree_is_node );
        tree l; tree r; get_tree_node( t, &l, &r );
	return nleaves(l) + nleaves(r);
If we can implement this successfully, the warning should go away due to the very structure of the generated code!
Ok, how do we implement this? Well, the format of an %assert line is identical to that of a %when line (except of course for the command). So if we start by renaming parse_when() as parse_match() and updating it's comment to say that it will parse a %when or a %assert pattern match line, we can then carefully re-read the code, and see what we would need to change to make it satisfy it's new specification. When we do this, we find that the code was already more general than the previous comment suggested, so already satisfies this modified spec. So no code changes are needed within the body of parse_match().

Next, if we clone handle_when() as handle_assert(), and change all references from when to assert, and remove the piece of code that reads the following line and checks that it's a '{', and print out "assert( $test );" instead of "if( $test )", that should generate the correct code.
The handle_assert() function you end up with looks like:
  #
  # handle_assert( $line, $indent, $ofh );
  #	Ok, $line starts with a %assert (still in the line), and we've already
  #	removed any leading indentation (in $indent).  Handle the %assert line,
  #	printing valid C output to $ofh.
  #
  fun handle_assert( $line, $indent, $ofh )
  {
	my( $command, $type, $var, $shape, $arglist ) = parse_match( $line );

	# produce the %assert comment line
	print $ofh "$indent// $line:\n";

	# produce the assert-line
	my $test = "${type}_kind($var) == ${type}_is_${shape}";
	print $ofh "${indent}assert( $test );\n";

	# then print out code to take the object apart
	my $takeapart = take_object_apart( $type, $var, $shape, $arglist );
	print $ofh "${indent}$takeapart" if $takeapart;
  }
Here we reuse all the parsing and take-apart infrastructure that we've already built.
Finally we locate the code in handle_line() that used to read:
	fatal( $line, "%when expected" ) unless $line =~ /^%when/;
	handle_when( $line, $indent, $ofh );
We modify it to say:
	if( $line =~ /^%when/ )
	{
		handle_when( $line, $indent, $ofh );
	} elsif( $line =~ /^%assert/ )
	{
		handle_assert( $line, $indent, $ofh );
	} else
	{
		fatal( $line, "%when or %assert expected" );
	}
These are the only changes we need to make, to support %assert.

If you download and extract the cpm-v3.tgz tarball, you will find cpm-v3 containing all the above modifications, and a version of nleaves.cpm that uses %assert, just as we showed above.
If you run:
  ./cpm-v3 nleaves.cpm
You will find that it generates the correct code in nleaves.c. Specifically:
        // %when tree t is leaf(string name):
        if( tree_kind(t) == tree_is_leaf )
        {
                string name; get_tree_leaf( t, &name );
                return 1;
        }
        // %assert tree t is node( tree l, tree r ):
        assert( tree_kind(t) == tree_is_node );
        tree l; tree r; get_tree_node( t, &l, &r );
        return nleaves(l) + nleaves(r);
Now compile it (and the rest of the code) by:
  make
  make test
The compilation warning has now disappeared, and testtree still works as expected. Success!
cpm-v3 is now about 240 lines of Perl - supporting %assert only added about 40 lines of code, and took about 15 minutes!
Other Recursive Data Types: Lists
So far we've been focussing on trees and counting leaves. Lists are a common special case of recursive data types. Suppose we want to work with lists of integers. We can add the following to types.in:
	intlist =  nil
		or cons( int head, intlist tail );
Then, if we run datadec again, it regenerates datatypes.[ch], now containing all the tree types and operations and all the intlist ones too. This, btw, is part of the beauty of datadec - add more recursive data types, or add additional shapes to existing data types, rerun datadec and now your module is regenerated, at zero cost to you, and now it implements the new stuff too!

If you download lists-and-trees.tgz, you'll see types.in already has the intlist type (as well as the tree type), and in addition you'll find a Makefile, another copy of our current version of CPM (cpm-v3), a little test program for our lists of integers, testintlist.c and a small module sumlist, comprising the usual pair of files: sumlist.h and sumlist.c.
In sumlist.c, after some includes, we see the sumlist function:
  // int sum = sumlist( intlist l );
  //	sum all the elements (the heads) in l. return the result.
  int sumlist( intlist l )
  {
	int sum = 0;
	while( intlist_kind(l) == intlist_is_cons )
	{
		int head;
		intlist tail;
		get_intlist_cons( l, &head, &tail );
		sum += head;
		l = tail;
	}
	return sum;
  }
Similarly, after some includes, testintlist.c reads:
  int main( void )
  {
	intlist l = intlist_nil();
	assert( intlist_kind( l ) == intlist_is_nil );

	l = intlist_cons( 100, l );
	assert( intlist_kind( l ) == intlist_is_cons );

	int head; intlist tail;
	get_intlist_cons( l, &head, &tail );
	assert( head == 100 );
	assert( intlist_kind( tail ) == intlist_is_nil );

	l = intlist_cons( 200, l );
	get_intlist_cons( l, &head, &tail );
	assert( head == 200 );
	printf( "head([200,100]) is %d\n", head );
	get_intlist_cons( tail, &head, &tail );
	assert( head == 100 );
	printf( "head(tail[200,100]) is %d\n", head );

	l = intlist_cons( 300, l );
	l = intlist_cons( 400, l );

	int sum = sumlist( l );
	printf( "sumlist([400,300,200,100] is %d\n", sum );
	assert( sum == 1000 );

	return(0);
  }
If you now compile all the code) by:
  make
  make test
It all builds successfully and then make test runs both the old testtree and our new testintlist. Check that you are convinced that you understand everything about testintlist and sumlist, before continuing.
Ok, now looking at testintlist.c above, we may observe several opportunities to use CPM to simplify it. Copy testintintlist.c to testintlist.cpm, and revise it to read:
        ...
	intlist l = intlist_nil();
	%assert intlist l is nil

	l = intlist_cons( 100, l );

	%assert intlist l is cons( int head, intlist tail )
	assert( head == 100 );
	%assert intlist tail is nil

	l = intlist_cons( 200, l );

	%assert intlist l is cons( int head, intlist tail )
	assert( head == 200 );
	printf( "head([200,100]) is %d\n", head );

	%assert intlist tail is cons( int head, intlist tail )
	assert( head == 100 );
	printf( "head(tail[200,100]) is %d\n", head );
	...
(You'll find this modified file in the directory, named testintlist1.cpm, you can simply copy that to testintlist.cpm).
Superficially, it looks like this should work. Edit the Makefile and add/uncomment testintlist.c to the BUILD list, and add/uncomment a rule testintlist.c: testintlist.cpm near the bottom.
Type make. Oh dear, after generating testintlist.c from testintlist.cpm, it tries to compile testintlist.c and gets several compilation errors, such as:
testintlist.c:29:6: error: redeclaration of 'head' with no linkage
  int head; intlist tail; get_intlist_cons( l, &head, &tail );
Can you see the problem? Each time we %assert that l is a cons( int head, intlist tail ), CPM has generated code that redeclares both those local variables, in the same block.
We could fix this (nastily) by editing testintlist.cpm and renaming the head and tail variables each time, as in:
        ...
	intlist l = intlist_nil();
	%assert intlist l is nil

	l = intlist_cons( 100, l );

	%assert intlist l is cons( int head, intlist tail )
	assert( head == 100 );
	assert( intlist_kind( tail ) == intlist_is_nil );

	l = intlist_cons( 200, l );

	%assert intlist l is cons( int head2, intlist tail2 )
	assert( head2 == 200 );
	printf( "head([200,100]) is %d\n", head2 );

	%assert intlist tail2 is cons( int head3, intlist tail3 )
	assert( head3 == 100 );
	printf( "head(tail[200,100]) is %d\n", head3 );
	...
(You'll find this in testintlist2.cpm, again copy it onto testintlist.cpm).
But see how error-prone this is - we need a more elegant solution: We need a way, in the CPM input file, of saying that we want to suppress the generation of some of the parameter variable declarations. The easiest way of supporting this is to make the special argument type '-' mean "do not declare this variable". Why not spend a few minutes re-reading the source code of cpm-v3, and see if you can figure out what will need changing to implement type='-'?
CPM->C translator: Version 4
Ok. let's start by implementing our "don't declare a parameter variable if the parameter type is '-'" rule.
Did you figure out how to do it? it's one trivial change to the take_object_apart() function. The line in question used to read:
	$declns .= "$arg; ";
and we change it to read:
	$declns .= "$arg; " unless $argtype eq '-';
This tiny change fixes the problem. (To be neat, we should also update various comments to reflect reality.)
We will also need to update testintlist.cpm to read:
        ...
	intlist l = intlist_nil();
	%assert intlist l is nil

	l = intlist_cons( 100, l );

	%assert intlist l is cons( int head, intlist tail )
	assert( head == 100 );
	%assert intlist tail is nil

	l = intlist_cons( 200, l );

	%assert intlist l is cons( - head, - tail )
	assert( head == 200 );
	printf( "head([200,100]) is %d\n", head );

	%assert intlist tail is cons( - head, - tail )
	assert( head == 100 );
	printf( "head(tail[200,100]) is %d\n", head );
	...
If you download and extract the cpm-v4.tgz tarball, you will find this includes the new version of CPM: cpm-v4-stage1 with all the above changes - including a Makefile that runs cpm-v4-stage1, and generates testintlist.c from testintlist.cpm..

To test that '-' behaves correctly, simply type make and see whether the code CPM generates in testintlist.c now reuses the existing variables, and whether it compiles, and most importantly whether it works? Good, it does!

Remembering that cpm-c3 comprised 240 lines, cpm-v4-stage1 has only added 4 lines of new code (and all of them comments!) giving 244 lines in total.
%while: Looping over Lists
Looking now at sumlist.c we notice a familiar pattern of a test followed by code to take the cons() node apart, but within a while rather than an if:
	int sum = 0;
	while( intlist_kind(l) == intlist_is_cons )
	{
		int head;
		intlist tail;
		get_intlist_cons( l, &head, &tail );
		sum += head;
		l = tail;
	}
	return sum;
This suggests a new %while directive to help us loop over lists. I propose the syntax:
	int sum = 0;
	%while intlist l is cons( int head, intlist tail )
	{
		sum += head;
		l = tail;
	}
	return sum;
Or better yet, to avoid that easy to forget l = tail inside the body of the while loop, let's use our new '-' feature to store the tail of l back into l:
	int sum = 0;
	%while intlist l is cons( int head, - l )
	{
		sum += head;
	}
	return sum;
Implementing this is pretty much as easy as replacing if with while in the code we generate. Clone handle_when() as handle_while(), change %when to %while within it, and replace if with while in the generated code that is printed out, and insert the obvious if it matches %while then call handle_while().. test in handle_line().

Look at cpm-v4-stage2 and you should find these changes. Now, copy sumlist.c onto sumlist.cpm, and change it's body as above. Then, edit the Makefile and uncomment out the obvious changes that should already be there to build sumlist.c from sumlist.cpm, and don't forget to change the version of CPM from cpm-v4-stage1 to cpm-v4-stage2.
Then:
  make clean all
should correctly translate sumlist.cpm (and the various other .cpm files) to plain C, compile and link.
You should find that running testintlists works exactly as it did before, i.e. that CPM generated the correct code.

cpm-v4-stage2 is now 275 lines long, up from 244. So implementing %while only added 31 lines of code.
Another List-specific directive: %foreach
Looking once again at our %while example:
	%while intlist l is cons( int head, - l )
%while is clearly useful, it does the job of iterating over the list, extracting each head element in turn (by always storing the list tail back into l), we would use this exact pattern for iterating over any list type. It seems a trifle inelegant. What we are really trying to achieve is "foreach head element in list", couldn't we make our directive say that more explicitly?
I propose the following additional directive:
	%foreach int head in intlist l
This has nearly all the same information as the %while loop, although it assumes that the non-null list constructor is called cons not anything else. But it's much less fiddly.
How would we implement that? For the first time we can't just call parse_match() because the syntactic structure is different. It's grammar is:
  '%foreach' paramtype paramvar 'in' listtype listvar
So it's a purely linear sequence of words, none optional. That's very straightforward - here's parse_foreach():
  #
  # my( $command, $type, $var, $paramtype, $paramvar ) = parse_foreach( $line );
  #	After checking that $line starts with a %foreach, parse
  #	the rest of the line.
  #	If it parses return (command, type, var, paramtype, paramvar)
  #	otherwise die via fatal()
  #
  #	'%foreach' PARAMTYPE(ID) PARAMVAR(ID) 'in' TYPE(ID) VAR(ID)
  #
  fun parse_foreach( $line )
  {
	$line =~ s/^(%\S+)\s*//;
	my $command = $1;
	$command =~ s/shape$//;	# reduce %foreachshape etc.. to %foreach

	my $sofar = $command;
	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $paramtype = $1;
	$sofar .= " $paramtype";

	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $paramvar = $1;
	$sofar .= " $paramvar";

	fatal( $line, "'in' expected after <<$sofar>>" )
		unless $line =~ s/^in\s+//;
	$sofar .= " in";

	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";

	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)//;
	my $var = $1;

	fatal( $line, "end-of-line expected after <<$sofar>>" ) unless
		$line =~ s/^\s*$//;

	return( $command, $type, $var, $paramtype, $paramvar );
  }
Then, write handle_foreach(), as follows:
  #
  # handle_foreach( $line, $indent, $ofh );
  #	Ok, $line starts with a %foreach (still in the line), and we've already
  #	removed any leading indentation (in $indent).  Handle the %foreach line
  #	and it's following '{', printing valid C output to $ofh.
  #
  fun handle_foreach( $line, $indent, $ofh )
  {
	my( $command, $type, $var, $paramtype, $paramvar ) =
		parse_foreach( $line );

	# produce the %foreach comment line
	print $ofh "$indent// $line:\n";

	# produce the while-line
	my $test = "${type}_kind($var) == ${type}_is_cons";
	print $ofh "${indent}while( $test )\n";

	# get the next line, check that it's a bare '{', and print it out
	my $line = nextline();
	fatal( $line, "$command: { expected at eof" ) unless defined $line;
	fatal( $line, "$command: bare { expected at same indentation, " )
		unless $line =~ /^$indent\s*\{\s*$/;
	print $ofh $line;

	# then print out code to take the object apart
	my $takeapart = take_object_apart( $type, $var, "cons",
		"$paramtype $paramvar, - $var" );
	print $ofh "${indent}\t$takeapart" if $takeapart;
  }
Note how we wish to reuse the mechanism of take_object_apart(), and have to explicitly pass shape cons to the call (despite no shapenames ever being mentioned), and explicitly synthesize an argument list that will cause the head element to be declared, and the tail of the list stored back into $var. That is, we build the argument list "$paramtype $paramvar, - $var".
Finally, update the end of handle_line():
	} elsif( $line =~ /^%foreach/ )
	{
		handle_foreach( $line, $indent, $ofh );
	} else
	{
		fatal( $line, "%when/%while/%assert/%foreach expected" );
	}
This gives us our final stage of cpm-v4. Running:
  ./cpm-v4 sumlist.cpm
Gives the encouraging message:
  debug: line is <<%foreach int element in intlist l>> at line 18
In addition, it produces sumlist.c which reads, correctly, as:
	// %foreach int element in intlist l:
	while( intlist_kind(l) == intlist_is_cons )
	{
		int element; get_intlist_cons( l, &element, &l );
		sum += element;
	}
Our final cpm-v4 is 355 lines of Perl. So implementing %foreach added 80 lines (more than most changes, because we had to implement a fresh line parser as well as a foreach handler).

Before that implementing %while added 45 lines, and implementing the '-' argument type added 4 lines of comments.

You'll be relieved to know that I can't think of any more Directives that we need to add to CPM. So are we finished?

Well, not quite: so far all versions of the CPM tool have been entirely stand-alone, independent of Datadec itself. Is there any advantage to somehow linking the two tools more closely?

So, when you're ready: let's investigate that in part 3 of this article.
d.white@imperial.ac.uk
Back to PSD Top
Written: July 2018