Practical Software Development: TEFEL: Tool-Enhanced Features for Existing Languages:

Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.


A New Approach to Building Code Generators: TEFEL - Tool-Enhanced Features for Existing Languages:

My previous article (no 10) discussed how to build a variety of small simple Code Generators (in the Pragmatic Programmer sense) that translate some very simple Little Language forms of input into valid program source code (mostly C). Such tools help automate as much as possible of your programming work. We often use a text manipulation language like Perl to build those Code Generators.
Call this Code Generator Tactic 1. It usually applies to Little Languages that are so simple that the input processing is trivial.

In addition, the final lecture in my C Programming Tools 2018 lecture series focussed heavily on using Yacc, Lex and my own Datadec (described in Article 8 of this series) together to build larger Code Generators - effectively mini-compilers with a common structure: a lexical analyser, a parser, an Abstract Syntax Tree module to represent parse trees, and actions in the parser to build these trees as we parse. By using Yacc, Lex and Datadec together, the tools write the majority of your code for you, leaving you to concentrate on Code Generation - essentially walking the ASTs your parser has built and emitting valid C code to a file as you walk. Of course, using tools like Yacc and Datadec that write C code, we are necessarily constrained to writing such Code Generators in C, rather than Perl.
Call this Code Generator Tactic 2. It usually applies to larger, more complicated, (Not so Little) Languages where a parser and lexer are needed to parse the input.

Tactic 3: Adding one Feature to a Large Language
More recently, I've been playing with a different approach to building Code Generators, let's call it Code Generator Tactic 3. This applies when we want to add a single well-defined feature to a large existing language like C.
Let's approach this, as I so often do, by means of an example: My tool Datadec (as described in Article 8 of this series) allows you to write a simple declarative set of Functional Programming's recursive data types like:
  TYPE {
        tree  =  leaf( string name )
              or node( tree left, tree right );
  }
and then generates an entire plain C module to implement those types, correctly and in a standardised way.
However, Datadec has never had any special support for writing client-side code that uses Datadec-generated types. Suppose you wanted to write some C code that uses the generated tree type above to, say, count how many leaves there are in a given tree. In a functional language like Haskell, you might write that as:
  nleaves(leaf(name)) = 1
  nleaves(node(l,r))  = nleaves(l) + nleaves(r)
but in C, against the datadec-generated module, you'd have to write the following solid - but rather less beautiful - code:
  int nleaves( tree t )
  {
	if( tree_kind(t) == tree_is_leaf )
	{
		string name; get_tree_leaf( t, &name );

		// leaf( name ): contains 1 leaf.
		return 1;
	} else
	{
		tree l, r; get_tree_node( t, &l, &r );

		// node( l, r ): process l and r trees.
		return nleaves(l) + nleaves(r);
	}
 }
In case you're not intimately familiar with taking Datadec-generated datatypes apart, this code proceeds as follows:

We first test whether our tree t is a leaf: If it is, we take the leaf apart:

string name; declares a local variable (string is a typedef'd char *).

get_tree_leaf( t, &name ) breaks the leaf apart, storing the char * string pointer held inside the leaf into name.

If t is not a leaf, then it's a node containing a left and a right subtree:

tree l, r; declares two obvious local variables.

get_tree_node( t, &l, &r ) breaks the node apart, storing the left and right subtrees held inside the node into our variables l and r.

Although this does the job, it's rather long-winded compared to the beauty of the Haskell version.
From time to time I've thought that some sort of shape pattern match would be lovely. I'd like to be able to write something like the following in an extended version of C:
  int nleaves( tree t )
  {
        whenshape t is leaf(name)
        {
                return 1;
        }
        whenshape t is node( l, r )
        {
                return nleaves(l) + nleaves(r);
        }
  }
This is much closer in spirit to the original Haskell code, much more beautiful and much clearer to read. Each whenshape is a pattern match, and if it succeeds, the following block is executed - with the named parameters being new, freshly defined variables of the appropriate types, each containing the corresponding piece of data from the shaped object.
But how might we implement this? As a first step, having defined the syntax of the new feature by example, we might define it as a partial grammar (referring to other grammatical entities, such as expressions and blocks):
when_statement
	:	'whenshape' expr 'is' ID [ '(' idlist ')' ] block
	...

idlist	:	idlist ',' ID
	|	ID
	...
Next, we can define its semantics via a precise description of how to translate our examples back to standard C:
The first whenshape example turns into the code that we'd otherwise have to write, exactly as we saw above in our first version of nleaves(), annotated with a useful comment:
        // whenshape t is leaf(name):
        if( tree_kind(t) == tree_is_leaf )
        {
                string name; get_tree_leaf( t, &name );
                return 1;
        }
Similarly, the second whenshape example turns into:
        // whenshape t is node( l, r ):
        if( tree_kind(t) == tree_is_node )
        {
                tree l; tree r; get_tree_node( t, &l, &r );
                return nleaves(l) + nleaves(r);
        }
Note that there are two pieces of implicit information that we will need to make use of in the code we wish to generate:

First, we need to know what type our test expression (t) is (in this case, a tree), so that we can check that it is a recursive data type - and hence that it is suitable to pattern match against.

Second, we need to know what shapes (constructors) the tree type has, to check that leaf and node are valid shapes, and also how many arguments, and of what types, each shape has. In Haskell, this is all done via its clever type inference algorithm, but of course we'd somehow have to do this ourselves. We'll return to this later.
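One way to picture these two pieces of information is as a lookup table from type names to shape names to argument types. The following Perl sketch is purely illustrative: the %shapes table and check_shape() are hypothetical names invented here, and the table is filled in by hand for the tree type rather than derived from Datadec's input, so it is not necessarily how any real tool would obtain it.

```perl
use strict;
use warnings;

# Hypothetical hand-written table: type name -> shape name -> argument types.
# It mirrors the tree type declared earlier: leaf(string) or node(tree, tree).
my %shapes = (
	tree => {
		leaf => [ 'string' ],
		node => [ 'tree', 'tree' ],
	},
);

#
# my $err = check_shape( $type, $shape, $nargs );
#	Return an error message if $type is not a known recursive data
#	type, if $shape is not one of its shapes, or if $nargs is the
#	wrong number of arguments; return undef if all is well.
#
sub check_shape
{
	my( $type, $shape, $nargs ) = @_;
	my $shapesof = $shapes{$type};
	return "$type is not a recursive data type" unless $shapesof;
	my $argtypes = $shapesof->{$shape};
	return "$type has no shape called $shape" unless $argtypes;
	my $want = @$argtypes;
	return "$shape takes $want argument(s), not $nargs"
		unless $nargs == $want;
	return undef;
}
```

With a table like this, check_shape( 'tree', 'node', 2 ) returns undef (no error), while asking for a shape the type doesn't have returns a message instead.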

That aside, how might we start implementing this? The obvious idea is to build it in Lex, Yacc and Datadec. But in this case we'd have to implement all of normal C as well as our new feature. We could get the source code of a complete C compiler and graft our new feature into it. But that sounds hard work for just adding one feature!

Or we could get a complete Yacc/Lex C grammar and extend that - adding our new feature. But we still need to build a complete AST for all of C, and then walk over it, re-generating all the normal C code unaltered - while turning our new feature back into standard C. That's still a lot of work! Could we avoid it?

Of course, if someone had already written a C-C translator, we could extend that. But what on earth is the point of a C to C translator on its own?
A Simple Alternative: Tool-Enhanced Features for Existing Languages (TEFEL)
This brings us back to our new third Code Generation tactic. A simple alternative implementation strategy occurred to me recently, which I call TEFEL: Tool-Enhanced Features for Existing Languages.

TEFEL will allow us to build whenshape into C in a few hours, rather than the weeks or months that building an enhanced C to C translator might take:

The idea is simple: graft the new feature into C by writing a simple line-by-line pre-processor that copies most lines through unchanged (assuming that they are valid C), but locates specially marked Extension Directives, turning each into a corresponding chunk of plain C. Thus, C with directives will come in, and standard C will go out.
So, let's get specific. Let's mark our extension directive lines with % at the start of a line (after optional whitespace), making them very easy to locate, and let's choose the following close approximation of our pattern matching hypothetical code, but now this is intended as precise, parsable input to a tool that we intend to build in the course of this article:
 int nleaves( tree t )
 {
	%when tree t is leaf(string name)
	{
		return 1;
	}
	%when tree t is node( tree l, tree r )
	{
		return nleaves(l) + nleaves(r);
	}
 }
Let's name this input syntax C with Pattern Matching, or CPM for short. Those with very long memories may think that CPM was some kind of early operating system, but whatever..

You will notice that in moving from our original whenshape syntax to implementable %when syntax, we have filled in the missing type information in the simplest possible way - by stating it explicitly:

%when tree t tells us that t is a tree,

leaf( string name ) and node( tree l, tree r) tell us which shape we want t to be matched against, and the number, types and desired variable names of the corresponding parameter variables.

Also, we'll replace the complexity of an arbitrary expression delivering a tree with a single variable name.

Of course this is a tradeoff: it's much less elegant than the original, but a lot easier to implement! We may be able to revisit this later, but we have to start somewhere!
CPM->C translator: Version 1
So, let's build the first version of our CPM->C translation tool, starting by reading every line of input, identifying all % lines, and copying all other lines to stdout unchanged. That's trivial to write in Perl, right?
Let's start with the main structure:
  die "Usage: cpm-v1 inputfile\n" unless @ARGV == 1;
  my $inputfilename = shift;

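  my $cfilename = $inputfilename;
  # insist on a .cpm name: otherwise $cfilename would be the input file
  # itself, and the unlink below would delete it
  $cfilename =~ s/\.cpm$/.c/
	or die "cpm: $inputfilename should end in .cpm\n";

  # no 'my' here: nextline() reads from the file-scoped $infh
  open( $infh, '<', $inputfilename ) || die "cpm: can't open $inputfilename\n";

  unlink( $cfilename );
  open( my $cfh, '>', $cfilename ) || die "cpm: can't create $cfilename\n";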

  while( defined( $_ = nextline() ) )
  {
	chomp;
	handle_line( $_, $cfh );
  }
  close( $infh );
  close( $cfh );
We'll need a line reader and line counter nextline():
  my $infh;      	# fd of current CPM file we're translating
  my $lineno = 0;	# current line no inside $infh
  my $currline;  	# current line (set by nextline())

  #
  # my $line = nextline();
  #	Read the next line from $infh, incrementing $lineno afterwards,
  #	and return it.
  #
  fun nextline()
  {
	my $line = <$infh>;
	$currline = $line;
	$lineno++;
	return $line;
  }
Aside: Note that here I'm using Perl's Function::Parameters module to give nice function declaration syntax. Install it via
  cpanm Function::Parameters
if it's not installed where you live.
Next, we need to write handle_line():
  #
  # handle_line( $line, $ofh );
  #	handle $line [and if necessary, any subsequent lines of input],
  #	print out whatever text is generated (or copied) to $ofh
  #
  fun handle_line( $line, $ofh )
  {
	unless( $line =~ /^\s*%/ )
	{
		print $ofh "$line\n";
		return;
	}
	$line =~ s/^(\s*)//;
	my $indent = $1;
	print "debug: after removing indent, line is <<$line>> ".
	      "at line $lineno\n";

	die "$indent$line\n$indent^ Error at line $lineno: %when expected\n"
		unless $line =~ /^%when/;
	print $ofh "$indent// $line\n";		# turn %when into comment
  }
Assembling these into a single Perl script, we get cpm-v1, approximately 80 lines long. You can get it by downloading this cpm-v1.tgz tarball. When you extract it, you will find a directory called cpm-v1, and inside that the Perl script cpm-v1, a more complete nleaves.cpm, and a deliberately erroneous version badnleaves.cpm, as follows:
  int nleaves( tree t )
  {
        %when tree t is leaf(string name)
        {
                return 1;
        }
        %when tree t is node( tree l, tree r )
        {
                %fluffy
                return nleaves(l) + nleaves(r);
        }
  }
Having extracted that tarball, inside the cpm-v1 directory, you can run:
  ./cpm-v1 badnleaves.cpm
to see some debugging output, and an error message when the bad % directive is encountered:
                %fluffy
                ^ Error at line 9: %when expected
When you run:
  ./cpm-v1 nleaves.cpm
it will generate an (incomplete) nleaves.c file. Note that in nleaves.c, the %when lines are already turned into comments, and all non-% lines have already been copied through unchanged.
I think this clearly shows that ignoring 95% of the input and parsing only marked lines, containing tightly controlled extension directives, is going to be a heck of a lot easier than building a complete CPM to C translator! It allows us to focus on only what we care about: our new feature.
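To see the whole of version 1's behaviour in one place, here is a condensed, self-contained sketch of the same idea. It is not the cpm-v1 script itself: translate_v1() is a name invented here, it works on an in-memory list of (already chomped) lines rather than on files, and it uses plain sub rather than fun so that it runs without Function::Parameters.

```perl
use strict;
use warnings;

#
# my @out = translate_v1( @lines );
#	Copy ordinary lines through unchanged, turn each %when line
#	into a C comment (keeping its indentation), and die on any
#	other % directive, reporting its 1-based line number.
#
sub translate_v1
{
	my( @lines ) = @_;
	my @out;
	my $lineno = 0;
	foreach my $line (@lines)
	{
		$lineno++;
		if( $line =~ /^(\s*)(%.*)$/ )
		{
			my( $indent, $directive ) = ( $1, $2 );
			die "Error at line $lineno: %when expected\n"
				unless $directive =~ /^%when/;
			push @out, "$indent// $directive";
		} else
		{
			push @out, $line;
		}
	}
	return @out;
}
```

Everything that isn't a % line takes the else branch, which is why so little code is needed: the tool never has to understand the C it is copying.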
CPM->C translator: Version 2
Ok, in our second version, we're actually going to need to parse the %when line, check that it's syntactically valid, and then work out what C code to produce.

The obvious question is how will we parse the line? We're working in Perl, so we can't use Yacc. However, Perl has an equivalent parser generator, the module Parse::RecDescent. This could very easily do the work, but sometimes it's more fun to do simple parsing ourselves via Perl regexps. So let's do that:-)
In particular, let's gradually remove prefixes from the current line as we match tokens. This means that our "whatsleft" string will always be a suffix of the current line, and gives us a simple way of positioning the error message beautifully (the offset is the difference in lengths of the original line and the part we have left). Let's write an error reporting routine first:
  #
  # fatal( $whatsleft, $msg );
  #       Given $whatsleft (a suffix of $currline) and a message $msg, print
  #       a standard-formatted fatal error, pointing to the correct place in
  #       the line (using length(currline) - length(whatsleft) as the basic
  #       source of indentation information), and die.
  #
  fun fatal( $whatsleft, $msg )
  {
        $currline  =~ s/\t/        /g;          # expand all tabs to spaces
        $whatsleft =~ s/\t/        /g;
        my $pos    = length($currline) - length($whatsleft) - 1;
        my $indent = ' ' x $pos;
        my $err    = "$currline$indent^ Error at line $lineno: $msg\n";
        die "\n$err\n";
  }
(Note that we have to expand leading tabs to spaces beforehand, which means that we have to assume hard tabs of some specific width - I've chosen 8 spaces).
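The length arithmetic is easy to check in isolation. The sketch below is a hypothetical variant of fatal(), called caret_report() here, that returns the two-line report as a string instead of dying. It chomps the line first, so the offset is simply the difference in lengths without the extra -1 that fatal() needs for the trailing newline, and it assumes 8-column tabs as above.

```perl
use strict;
use warnings;

#
# my $text = caret_report( $currline, $whatsleft, $msg );
#	Hypothetical helper: return $currline followed by a second line
#	carrying a '^' under the first character of $whatsleft, which
#	must be a suffix of $currline (ignoring the trailing newline).
#
sub caret_report
{
	my( $currline, $whatsleft, $msg ) = @_;
	chomp $currline;
	# expand every tab to 8 spaces in both strings, so that the
	# length arithmetic and the printed line agree with each other
	$currline  =~ s/\t/        /g;
	$whatsleft =~ s/\t/        /g;
	my $pos = length($currline) - length($whatsleft);
	return "$currline\n" . ( ' ' x $pos ) . "^ $msg\n";
}
```

For example, given the line "\t%when tree t si leaf\n" with "si leaf" still unparsed, the caret lands in column 21: 8 columns for the expanded tab plus the 13 characters of "%when tree t ".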
Next, let's offload the work of handling a %when line to a separate handle_when() function. Start by modifying handle_line() to call it:
	fatal( $line, "%when expected" ) unless $line =~ /^%when/;
	handle_when( $line, $indent, $ofh );
Then write handle_when()'s skeleton, filling it with code that calls a second new function parse_when() that will parse the %when line into its component pieces, then writing the line (as a comment) to $ofh:
  #
  # handle_when( $line, $indent, $ofh );
  #	Ok, $line starts with a %when (still in the line), and we've already
  #	removed any leading indentation (in $indent).  Handle the %when line
  #	and [eventually] its following '{', printing valid C output to $ofh.
  #
  fun handle_when( $line, $indent, $ofh )
  {
	my( $command, $type, $var, $shape, $arglist ) = parse_when( $line );
	print "debug: found $command type=$type, var=$var, shape=$shape, ".
		"arglist=$arglist\n";

	# produce the %when comment line
	print $ofh "$indent// $line:\n";
  }
Parsing the %when line
Right, now let's work out how to parse our %when line. Recall that the grammar of a %when line is:
  when    : '%when' type varname 'is' shape [ '(' arglist ')' ]

  arglist : arg
	  | arg ',' arglist

  arg     : type paramname
where type, varname and shape are all simple identifiers, and arglist is a comma separated list of arguments, where each argument is a typename followed by a parameter name (another simple identifier).
To start parsing this, let's start writing parse_when() (its interface is already set by the example call above), and implement the first stanza of regex-based parse code to extract and remove the %when command itself:
  #
  # my( $command, $type, $var, $shape, $arglist ) = parse_when( $line );
  #	After checking that $line starts with a %when, parse the rest of the
  #	line.  If it parses return (command, type, var, shape, arglist)
  #	otherwise die via fatal()
  #
  #	'%when' TYPE(ID) VAR(ID) 'is' SHAPE(ID) [ '(' ARGLIST ')' ]
  #	where ARGLIST is a comma separated list of typename paramname pairs,
  #	where the typename is usually an ID, but can be a '-'
  #
  fun parse_when( $line )
  {
	$line =~ s/^(%\S+)\s*//;
	my $command = $1;
	$command =~ s/shape$//;	# %when or %whenshape etc.. reduce to %when

	print "debug: parse_when: command=$command, line=$line\n";
	return ( $command, "", "", "", "" );
  }
If you'd like to follow along, I've prepared a series of intermediate stages in building cpm-v2. Download the cpm-v2.tgz tarball, extract it and cd inside the cpm-v2 directory.
cpm-v2-stage1 comprises this first stage of development. Run it via:
  ./cpm-v2-stage1 nleaves.cpm
and you'll see it display various debugging messages including:
  debug: parse_when: command=%when, line=tree t is node( tree l, tree r )
  debug: found %when type=, var=, shape=, arglist=
These show that parse_when() correctly split the %when command off from the rest of the line, returned just the command, and that handle_when() received the %when command, with the type, variable, shape and arglist all empty.
Next, we want to extract the next three tokens, or words, from the line: The first word is the typename, the second is the variable name, and the third is the plain word 'is'.
In Perl, the regex pattern \w+ matches a word (a non-empty maximal length sequence of alphanumeric characters or underscores), and we see that there must be some whitespace \s+ immediately after that.
So $line =~ s/^(\w+)\s+// will match a word and some following whitespace at the beginning of the line, remove both, and remember the word that was matched in $1. This allows us to replace the last 2 statements in parse_when() with:
	my $sofar = $command;
	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";

	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $var = $1;
	$sofar .= " $var";

	fatal( $line, "'is' expected after <<$sofar>>" )
		unless $line =~ s/^is\s+//;

	return( $command, $type, $var, "", "" );
Running this version (cpm-v2-stage2) on nleaves.cpm, via:
  ./cpm-v2-stage2 nleaves.cpm
You will see the encouraging:
  debug: found %when type=tree, var=t, shape=, arglist=
This shows that handle_when() has now received the typename (tree) and the variable name (t).
Ok, now we must tackle the shape and its optional argument list:
The next token (another word) is the shape (or constructor) name. That's easy:
	fatal( $line, "ID (constructor name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s*//;
	my $shape = $1;
	$sofar .= " $shape";
(Note that this time there is optional whitespace following the shape name, so the regex is \s* not \s+).
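The prefix-stripping idiom can be tried out on its own. The following sketch wraps it in a hypothetical helper, take_word() (not part of cpm-v2), which uses the more forgiving \s* form so that it copes both with words followed by whitespace and with a shape name followed directly by '('.

```perl
use strict;
use warnings;

#
# my( $word, $rest ) = take_word( $text );
#	Hypothetical helper: strip one leading word, plus any whitespace
#	after it, from $text.  Return the word and what's left, or an
#	empty list if $text doesn't start with a word.
#
sub take_word
{
	my( $text ) = @_;
	return () unless $text =~ s/^(\w+)\s*//;
	return ( $1, $text );
}

# peel three words off the front of what follows the %when command
my $rest = "tree t is node( tree l, tree r )";
my @words;
foreach (1..3)
{
	( my $word, $rest ) = take_word( $rest );
	push @words, $word;
}
print "words: @words\n";	# prints: words: tree t is
print "left:  $rest\n";		# prints: left:  node( tree l, tree r )
```

Because each call returns a shorter suffix of the original line, the caller always knows exactly how far the parse has got, which is what makes the caret positioning in fatal() possible.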
After the shape name, there are two possibilities. Either the line comprises nothing but whitespace, and we're done (no arguments):
	# that may be all..
	return( $command, $type, $var, $shape, "" ) if $line =~ /^\s*$/;
Or we need a bracketed argument list. Check for, and remove, both brackets:
	# or we need '(' arglist ')'
	fatal( $line, "'(' expected after <<$sofar>>" )
		unless $line =~ s/^\(\s*//;

	fatal( substr($line,-1,1), "')' expected at end of line" ) unless
		$line =~ s/\s*\)$//;
Now all we have left in $line is the argument list. Of course, we should probably check that the argument list has the correct syntax, but noticing that our parse_when() function has the goal of splitting the %when line down to command, typename, variable name, shapename and argument list (as a single string), we may observe that we've met our goal. (Remember: tools don't have to be perfect).
So let's simply declare success and return:
	# should have an arglist left now.  should really check it's
	# syntactically valid but let's not bother...
	return( $command, $type, $var, $shape, $line );
This gives us the finished version of parse_when($line):
  #
  # my( $command, $type, $var, $shape, $arglist ) = parse_when( $line );
  #	After checking that $line starts with a %when, parse the rest of the
  #	line.  If it parses return (command, type, var, shape, arglist)
  #	otherwise die via fatal()
  #
  #	'%when' TYPE(ID) VAR(ID) 'is' SHAPE(ID) [ '(' ARGLIST ')' ]
  #	where ARGLIST is a comma separated list of typename paramname pairs,
  #	where the typename is usually an ID, but can be a '-'
  #
  fun parse_when( $line )
  {
	$line =~ s/^(%\S+)\s*//;
	my $command = $1;
	$command =~ s/shape$//;	# %when or %whenshape etc.. reduce to %when
	my $sofar = $command;

	fatal( $line, "ID (type name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $type = $1;
	$sofar .= " $type";

	fatal( $line, "ID (var name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s+//;
	my $var = $1;
	$sofar .= " $var";

	fatal( $line, "'is' expected after <<$sofar>>" )
		unless $line =~ s/^is\s+//;
	$sofar .= " is";

	fatal( $line, "ID (constructor name) expected after <<$sofar>>" )
		unless $line =~ s/^(\w+)\s*//;
	my $shape = $1;
	$sofar .= " $shape";

	# that may be all..
	return( $command, $type, $var, $shape, "" ) if $line =~ /^\s*$/;

	# or we need '(' arglist ')'
	fatal( $line, "'(' expected after <<$sofar>>" )
		unless $line =~ s/^\(\s*//;

	fatal( substr($line,-1,1), "')' expected at end of line" ) unless
		$line =~ s/\s*\)$//;

	# should have an arglist left now.  should really check it's
	# syntactically valid but let's not bother...
	return( $command, $type, $var, $shape, $line );
  }
This gives us cpm-v2-stage3. Running:
  ./cpm-v2-stage3 nleaves.cpm
Gives the encouraging:
  debug: found %when type=tree, var=t, shape=node, arglist=tree l, tree r
We've now finished parsing the %when line.
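The argument list check that parse_when() skips would not take much code. Here is a minimal sketch of one way to do it, following the arglist grammar above and the comment that a typename may also be a bare '-'. The name valid_arglist() is an assumption made here, and it simply answers yes or no rather than reporting where the problem is.

```perl
use strict;
use warnings;

#
# my $ok = valid_arglist( $arglist );
#	Hypothetical checker: true iff $arglist is a non-empty, comma
#	separated list of "type name" pairs, where each name is an ID
#	and each type is an ID or a bare '-'.
#
sub valid_arglist
{
	my( $arglist ) = @_;
	$arglist =~ s/^\s+//;
	$arglist =~ s/\s+$//;
	# the -1 limit keeps trailing empty fields, so that a stray
	# trailing comma shows up as an (invalid) empty argument
	my @arg = split( /\s*,\s*/, $arglist, -1 );
	return 0 unless @arg;
	foreach my $arg (@arg)
	{
		return 0 unless $arg =~ /^(?:\w+|-)\s+\w+\z/;
	}
	return 1;
}
```

So "tree l, tree r" and "string name" pass, while "tree l, tree" (a type with no name) and "tree l,, tree r" (an empty argument) are rejected.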
Generating some code
Next, we need to start generating some code. To refresh our memory, handle_when() currently reads:
  #
  # handle_when( $line, $indent, $ofh );
  #	Ok, $line starts with a %when (still in the line), and we've already
  #	removed any leading indentation (in $indent).  Handle the %when line
  #	and [eventually] its following '{', printing valid C output to $ofh.
  #
  fun handle_when( $line, $indent, $ofh )
  {
	my( $command, $type, $var, $shape, $arglist ) = parse_when( $line );
	print "debug: found $command type=$type, var=$var, shape=$shape, ".
		"arglist=$arglist\n";

	# produce the %when comment line
	print $ofh "$indent// $line:\n";
  }
Now we modify this, removing [eventually] from the comment, commenting out the debug statement, and appending some text at the end, giving:
  #
  # handle_when( $line, $indent, $ofh );
  #	Ok, $line starts with a %when (still in the line), and we've already
  #	removed any leading indentation (in $indent).  Handle the %when line
  #	and its following '{', printing valid C output to $ofh.
  #
  fun handle_when( $line, $indent, $ofh )
  {
	my( $command, $type, $var, $shape, $arglist ) = parse_when( $line );
	#print "debug: found $command type=$type, var=$var, shape=$shape, ".
	#	"arglist=$arglist\n";

	# produce the %when comment line
	print $ofh "$indent// $line:\n";

	# produce the if-line
	my $test = "${type}_kind($var) == ${type}_is_${shape}";
	print $ofh "${indent}if( $test )\n";

	# get the next line, and check that it's a bare '{', print it out
	my $line = nextline();
	fatal( $line, "$command: { expected at eof" ) unless defined $line;
	fatal( $line, "$command: bare { expected at same indentation, " )
		unless $line =~ /^$indent\s*\{\s*$/;
	print $ofh $line;

	# then we will need to generate code to "take the object apart"
	# For now, a placeholder:
	my $takeapart = "// TAKE APART CODE GOES HERE\n";
	print $ofh "${indent}\t$takeapart" if $takeapart;
  }
This gives us cpm-v2-stage4. Running:
  ./cpm-v2-stage4 nleaves.cpm
Prints a bit of debugging, and terminates. If we look at the generated nleaves.c the body of nleaves now reads:
  int nleaves( tree t )
  {
        // %when tree t is leaf(string name):
        if( tree_kind(t) == tree_is_leaf )
        {
                // TAKE APART CODE GOES HERE
                return 1;
        }
        // %when tree t is node( tree l, tree r ):
        if( tree_kind(t) == tree_is_node )
        {
                // TAKE APART CODE GOES HERE
                return nleaves(l) + nleaves(r);
        }
  }
We're nearly there! The if-statements look perfect, and our placeholder comment is inserted at the correct position inside the then-block, and properly indented.
The missing take apart code that we need to take a tree leaf(string name) apart is:
  string name; get_tree_leaf( t, &name );
We would like to generate this code from the type name (tree), the constructor name (leaf) and the arglist (string name).
Similarly, the code to take a tree node( tree l, tree r ) apart is:
  tree l; tree r; get_tree_node( t, &l, &r );
We want to generate this code from the type name (tree), the shape name (node) and the arglist (tree l, tree r). So the code we wish to generate comprises a set of local variable declarations, followed by a call to break the node apart.
Ok, so that seems relatively easy to code. Let's invent a new function take_object_apart() to do this job. First we replace the placeholder code with a call:
	# then inject the takeapart line $takeapart
	my $takeapart = take_object_apart( $type, $var, $shape, $arglist );
	print $ofh "${indent}\t$takeapart" if $takeapart;
Then we define the skeleton of take_object_apart(), which will start splitting the $arglist apart, first splitting on commas (with optional whitespace before and after), giving the array of arguments, and then splitting each argument on whitespace into two pieces - the type and the name:
  #
  # my $breakdown = take_object_apart( $type, $var, $shape, $arglist );
  #	Generate a single long line of C code that will take the object
  #	apart, into its component arguments:
  #	- Declare variables for all the arguments in $arglist.
  #	- then call the get_{type}_{shape}() deconstructor with the
  #	  addresses of each of the argument variables.
  #	Return the take apart string..
  #
  fun take_object_apart( $type, $var, $shape, $arglist )
  {
	# if the shape has no arguments, then we don't need to take it apart
	return "" unless $arglist;

	# ok, we have one or more arguments, comma separated:
	my @arg = split(/\s*,\s*/, $arglist );
	foreach my $arg (@arg)
	{
		my( $argtype, $argname ) = split( /\s+/, $arg, 2 );
		print "debug: toa: type $type, shape=$shape, ".
		      "argtype=$argtype, argname=$argname\n";
	}
	return "// TAKE APART CODE GOES HERE";
  }
This gives us cpm-v2-stage5. Running:
  ./cpm-v2-stage5 nleaves.cpm
Gives us, among the debugging:
  debug: toa: type tree, shape=leaf, argtype=string, argname=name
  debug: toa: type tree, shape=node, argtype=tree, argname=l
  debug: toa: type tree, shape=node, argtype=tree, argname=r
This shows that we can correctly split the argument list into pieces.
Ok, to generate the "take the object apart" string, we need to build two things:

A list of all the argument names, so that we can form the get_ call with the address of each argument name.

A string containing all the argument variable declarations.

We build them both as follows:
	my $declns = "";
	my @argname;
	foreach my $arg (@arg)
	{
		my( $argtype, $argname ) = split( /\s+/, $arg, 2 );
		$declns .= "$arg; ";
		push @argname, $argname;
	}
Then we generate the final "take the object apart" code by:
	my $argstr = join( ', ', map { "&$_" } @argname );
	my $decons = "get_${type}_${shape}( $var, $argstr );";
	my $result = "$declns$decons\n";
	return $result;
Our finished version of take_object_apart() now reads:
#
# my $breakdown = take_object_apart( $type, $var, $shape, $arglist );
#	Generate a single long line of C code that will take the object
#	apart, into its component arguments:
#	- Declare variables for all the arguments in $arglist
#	- then call the get_{type}_{shape}() deconstructor with the
#	  addresses of each of the argument variables.
#	Return the take apart string..
#
fun take_object_apart( $type, $var, $shape, $arglist )
{
	# if the shape has no arguments, then we don't need to take the object apart
	return "" unless $arglist;

	# ok, we have one or more arguments, comma separated:
	my @arg = split(/\s*,\s*/, $arglist );
	my $declns = "";
	my @argname;
	foreach my $arg (@arg)
	{
		my( $argtype, $argname ) = split( /\s+/, $arg, 2 );
		$declns .= "$arg; ";
		push @argname, $argname;
	}
	my $argstr = join( ', ', map { "&$_" } @argname );
	my $decons = "get_${type}_${shape}( $var, $argstr );";
	my $result = "$declns$decons\n";
	return $result;
}
This gives us our first working CPM->C translator: cpm-v2. Running:
  ./cpm-v2 nleaves.cpm
Gives us nleaves.c that looks finished:
  int nleaves( tree t )
  {
        // %when tree t is leaf(string name):
        if( tree_kind(t) == tree_is_leaf )
        {
                string name; get_tree_leaf( t, &name );
                return 1;
        }
        // %when tree t is node( tree l, tree r ):
        if( tree_kind(t) == tree_is_node )
        {
                tree l; tree r; get_tree_node( t, &l, &r );
                return nleaves(l) + nleaves(r);
        }
  }
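As a cross-check on the translation rule, the mapping from one %when line to its C lines can also be written as a single regex match. This is a sketch of an alternative, not how cpm-v2 works: when_to_c() is a hypothetical name, it assumes the leading indentation has already been removed and that each argument is a well-formed "type name" pair, and by matching the whole line in one go it gives up the precise caret-positioned error messages that the prefix-stripping parser provides.

```perl
use strict;
use warnings;

#
# my @c = when_to_c( $line );
#	Hypothetical one-shot translator: turn one %when line (leading
#	indentation already removed) into the C comment line, the
#	if-line and the take-apart line (empty if the shape has no
#	arguments).  Dies if the line doesn't match.
#
sub when_to_c
{
	my( $line ) = @_;
	$line =~ /^%when\s+(\w+)\s+(\w+)\s+is\s+(\w+)\s*(?:\(\s*(.*?)\s*\))?\s*$/
		or die "malformed %when line: $line\n";
	my( $type, $var, $shape, $arglist ) = ( $1, $2, $3, $4 // "" );

	# each argument is "type name": declare it, then pass &name
	my @arg    = split( /\s*,\s*/, $arglist );
	my $declns = join( '', map { "$_; " } @arg );
	my $argstr = join( ', ', map { '&' . ( split /\s+/ )[1] } @arg );
	my $takeapart = @arg ?
		"${declns}get_${type}_${shape}( $var, $argstr );" : "";

	return ( "// $line:",
		 "if( ${type}_kind($var) == ${type}_is_${shape} )",
		 $takeapart );
}
```

Feeding it the two %when lines from nleaves.cpm gives the same comment, if-line and take-apart line as in the listing above.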
Does nleaves.c compile? Does it work?
Ok, given that we now have generated some valid looking C code, let's try to compile it! At this point, of course, you will need - for the first time in this article - to have already installed Datadec itself, as of course nleaves.c includes the Datadec-generated datatypes.h, which we'll need to have available at compile time.
If you haven't already got Datadec installed on your system (test via which datadec), download datadec's source code via:
  git clone https://gitlab.doc.ic.ac.uk/dcw/c-datadec.git
  cd c-datadec
There are compilation and installation instructions inside the repository - read the README and get acquainted. In most cases make install is all you need.
Now that you've got Datadec installed, come back to our cpm-v2 directory, and explore a bit more:
As well as the various cpm-v2* stages, and nleaves.cpm, you will also see a very short Datadec-input file types.in, a test program testtree.c, and a Makefile that will invoke datadec to generate datatypes.[ch] from types.in, invoke cpm-v2 to generate nleaves.c from nleaves.cpm, and then compile and link everything together.
Let's do it!
  make
Everything should compile and link. One warning is generated - we'll come back to that - but otherwise everything compiles.
Run the testtree program, either directly or via:
  make test
You should see it building a small tree one leaf at a time, and then counting how many leaves the tree has at each stage. Yes, it's invoking our CPM-generated nleaves() function, and it's correctly counting the leaves in the trees! So our cpm-v2 tool must have worked!
The final version of cpm-v2 is just over 200 lines of Perl code, and took a couple of hours to write, btw. Again: this is a relatively small amount of code, and a relatively small period of time, to produce what already looks like it could be a useful tool. Now, we could stop there. After all, we set out to build a tool to add %when to C, and we've succeeded in doing that!

But there are a few improvements and additions we can make to this tool. So, when you're ready: let's carry on in part 2 of this article.
d.white@imperial.ac.uk
Written: July 2018