Welcome to Duncan White's Practical Software Development (PSD) Pages.

I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.


A Study in Optimization: Pokemon Chains Part 2

In Part 1 of this article I showed how I approached solving, and then optimizing, a simple problem set by the Perl Weekly Challenge: writing a Perl program to find (one of) the longest chains of a specific set of Pokemon character names, where each name in the chain starts with the last letter of the previous name in the chain.

Having first shown how I solved the problem, I then showed how to repeatedly apply the Perl profiler - the CPAN module Devel::NYTProf, originally developed at the New York Times.

However, the law of diminishing returns led us to a point where we couldn't find anything else to optimize. So, have we reached the end of the profiling line?

There is one thing we have left essentially unchanged - the fundamentally recursive nature of findseq(). Perhaps 5 million function calls, each marshalling a current word number, a reference to a used-set array and a sequence array (an array of the word numbers already visited to get "here"), add an overhead that makes up a large part of the run time.

If we are to speed it up any more, perhaps we need another algorithm for generating all paths. Preferably one that is not recursive, or at least doesn't make 5 million calls, 22 levels deep.

A broad class of maze-solving, or path-finding, algorithms basically generate all paths of length 1, then generate all paths of length 2 (by taking the paths of length 1 and extending each by a single step), and so on, until no new paths are added. Such algorithms are called breadth-first search; they store all the partial paths in memory, unlike the recursive version, which stores them in the recursive call structure and the locals (parameters and variables) of all the recursive frames.
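
As a minimal sketch of the breadth-first idea (using a hypothetical next_steps() helper, not part of the real solution), the core loop looks like this:

  # all paths of length 1, one per starting point
  my @paths = map { [ $_ ] } @startpoints;
  for(;;)
  {
        my @longer;
        foreach my $path (@paths)
        {
                # next_steps() is a hypothetical helper delivering all
                # valid single-step extensions of $path
                foreach my $step (next_steps( $path ))
                {
                        push @longer, [ @$path, $step ];
                }
        }
  last unless @longer;  # no path could be extended: we're done
        @paths = @longer;       # every path is now one step longer
  }
  # @paths now holds all maximal-length paths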

So, using this approach, I started to design some data structures, using word numbers just as v5.pl did. Only %sw remains unchanged. This time, I decided to work backwards from stop words: words that can only form the end of a Pokemon name chain (because the letter such a word ends with does not start any of the Pokemon names we're using). I could equally well have worked forwards from start words.

  my %sw;      # hash from letter L to list of word nos of words STARTING with L

  my @stopword;# list of stop word nos (word nos of words with no words going
	       # "out" from them onto another word, ie. word numbers N where
	       # no other word starts with the last letter of word N)

  my %ew;      # hash from letter L to list of word nos of words ENDING with L

  my @inword;  # array from word no N to array of wordnos of words going "in"
	       # to word N, i.e. ending with the first letter of word N
	       # if there are no such words, then []
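
To make these concrete, here is what the structures would contain for a hypothetical 4-name list (not the real challenge data):

  # suppose @words = qw(pikachu umbreon natu uxie);    # word nos 0..3
  # then:
  # %sw = ( p => [0], u => [1,3], n => [2] );
  # %ew = ( u => [0,2], n => [1], e => [3] );
  # @stopword = ( 3 );          # uxie ends in 'e'; no word starts with 'e'
  # @inword = ( [], [0,2], [1], [0,2] );
  #     e.g. $inword[1] = [0,2] because pikachu (0) and natu (2) both
  #     end in 'u', the first letter of umbreon (word 1)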

These are all easy to construct, and you should by now be able to simply write out their initializations. First %sw and %ew, with %sw basically unchanged from v5.pl:

  # build %sw
  foreach my $wn (0..$#words)
  {
	my $word = $words[$wn];
	$word =~ /^(.)/;
	my $firstletter = $1;
	$sw{$firstletter} //= [];
	push @{$sw{$firstletter}}, $wn;
  }

  # build %ew
  foreach my $wn (0..$#words)
  {
	my $word = $words[$wn];
	$word =~ /(.)$/;
	my $lastletter = $1;
	$ew{$lastletter} //= [];
	push @{$ew{$lastletter}}, $wn;
  }

Now, build @stopword, using %sw:

  foreach my $wn (0..$#words)
  {
	my $word = $words[$wn];
	$word =~ /(.)$/;
	my $lastletter = $1;
	my $aref = $sw{$lastletter} // [];
	push @stopword, $wn if @$aref==0;
  }

Now, build @inword, using %ew:

  foreach my $wn (0..$#words)
  {
	my $word = $words[$wn];
	$word =~ /^(.)/;
	my $firstletter = $1;
	my $aref = $ew{$firstletter} // [];
	$inword[$wn]= $aref;
  }

Note that we no longer need %sw and %ew below this point: all the rest of the code uses only @stopword and @inword. Next, we write the main code, in terms of an ideal "lengthen all sequences" function we haven't yet written:

  my $N = 1; 	# length starts at 1 and is increased..
  my @seqs;	# all sequences of length N
  @seqs = map { [ $_ ] } @stopword;	# convert each stopword wordno to a seq 

  for(;;)
  {
	my $nseq = @seqs;
	say "Have $nseq sequences of length $N";
	my $ok = lengthen( \@seqs, $N );
  last unless $ok;
	$N++;
  }
  show_seqs( @seqs );
  exit( 0 );

show_seqs() is a simple helper that displays all the longest sequences found (and there can be quite a lot of them!):

  #
  # show_seqs( @seqs );
  #	Show the sequences (as words, not word nos)
  #
  fun show_seqs( @seqs )
  {
        foreach my $s (@seqs)
        {
                my $str = join( ',', map { $words[$_] } @$s );
                say $str;
        }
  }

That leaves only lengthen(). Just as we did when first solving this problem, we are solving our problem top-down, by defining a function that would help us and then implementing it later. First we define the function contract: what it's called, how we call it, what parameters we pass it, what it gives back, and what relationship there is between its inputs and outputs:

  #
  # my $ok = lengthen( $seqs, $N );
  #	Take $seqs, a reference to a list of sequences, where each sequence
  #	is of length N and ends in a stopword, and try to lengthen them all
  #	backwards, ie. prepending a word number to the start of each sequence.
  #	If this is possible, then @$seqs is altered to deliver the new, longer
  #	list of sequences (all of length N+1 now), and 1 is returned.
  #	Otherwise, if lengthening is not possible - if no sequence of length N
  #	can be extended by any unused word in a valid way, leave @$seqs alone,
  #	and return 0.
  #
Then we implement it:
  fun lengthen( $seqs, $N )
  {
	my @curr = @$seqs;	# current sequences
	my @new;		# new sequences

	foreach my $s (@curr)	# foreach current sequence
	{
		my %used = map { $_ => 1 } @$s;
		my $firstwno = $s->[0];
		my $list = $inword[$firstwno];	# list of word nos into s[0]
		foreach my $wno (@$list)
		{
			next if $used{$wno};	# no cycles need apply

			# make a single length N+1 sequence, cons(wno,oldseq)
			my @oneseq = @$s;
			unshift @oneseq, $wno;

			# it's a new sequence!
			push @new, \@oneseq;
		}
	}
	if( @new )
	{
		@$seqs = @new;
		return 1;
	}
	return 0;
  }
Notice that at this stage, we're once again writing naive, simple, non-optimized code. In particular, we're creating a used set every time we need it, even though in the recursive findseq() implementation that was one of the biggest time-wasters. If this is still the case now, we can optimize it again shortly!

Having written and tested all this code, I commented out the final call to show_seqs() (because displaying over 1000 answers takes a long time, and isn't really part of the problem we're trying to optimize), and added the following code to display just the first answer, via a 1-element array slice:

  # show just one of the longest sequences
  show_seqs( @seqs[0..0] );
giving v6.pl, which we profile as usual:
  ./pp v6
About 30 seconds later, open the generated profile report (in the v6/ directory) in your web browser.

You'll see that our naive iterative version takes 7.9 seconds (down from about 12 seconds for v5). Plus, of course, it's doing more work: because of its breadth-first nature, it delivers all the paths of the maximum length, not just one of them.

I think a significant factor must be that 5 million recursive calls to our main findseq() function have been reduced to 22 calls to lengthen(), and a few hundred Perl internal calls.

Of course, now that we have another working solution, let's look carefully at the line by line profiling information, looking for things to optimize:

Fifth profiling-led optimization: v7.pl

Looking at v6.pl's profile, the obvious thing to optimize is the 3.7 seconds spent constructing 1.3 million used sets from sequences. As we did once before, the obvious alternative to computing the used set each time is to construct it beforehand and pass it around. Previously (in the recursive version) we passed it as an additional parameter, but now we need to build the used-set equivalent of each sequence into the data structure.

In v7.pl we replace a single sequence with an SU - a sequence/usedset pair. Replace @seqs (the list of sequences) with:

  my @sus;      # all SUs for sequences of length N, each entry is a [ sequence, usedset ] pair

We construct a single usedset using a helper function:

  #
  # my @suset = suset( $wno );
  #	Form a SUset in which all word nos are unused, except $wno.
  #
  fun suset( $wno )
  {
	my @suset = (0) x scalar(@words);
	$suset[$wno] = 1;
	return @suset;
  }
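
Note that suset() returns a plain list, so each call site wraps the result in an anonymous array reference. For example (assuming the hypothetical 4-word list from earlier):

  my $usedref = [ suset(2) ];   # $usedref is [ 0, 0, 1, 0 ]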

We initialise @sus with:

  # convert each stopword wordno into a SU pair, building a list of them all
  @sus = map { [ [ $_ ], [ suset($_) ] ] } @stopword;

We alter show_seqs() to take @sus as follows:

  #
  # show_seqs( @sus );
  #	Show the sequences (as words, not word nos)
  #
  fun show_seqs( @sus )
  {
        foreach my $su (@sus)
        {
                my( $s, $u ) = @$su;
                my $str = join( ',', map { $words[$_] } @$s );
                say $str;
        }
  }

Then we modify lengthen() as follows:

  #
  # my $ok = lengthen( $sus, $N );
  #	Take $sus, a reference to a list of SUs, all of length N,
  #	where each SU is a [ sequence, usedset ] pair and each
  #	sequence ends in a stopword, and try to lengthen them all
  #	backwards, ie. prepending a word number to the start of each sequence.
  #	If this is possible, then @$sus is altered to deliver the new, longer
  #	SUlist (all of length N+1 now), and 1 is returned.  Otherwise, if
  #	lengthening is not possible, leave @$sus alone, and return 0.
  #
  fun lengthen( $sus, $N )
  {
	my @new;		# new list of SUs

	foreach my $su (@$sus)	# foreach current SU
	{
		my( $s, $used ) = @$su;
		my $firstwno = $s->[0];
		my $list = $inword[$firstwno];	# list of word nos into s[0]
		foreach my $wno (grep { ! $used->[$_] } @$list)
		{
			# make a single length N+1 sequence, cons(wno,oldseq)
			my @oneseq = @$s;
			unshift @oneseq, $wno;

			# make an altered used array, with one more used.
			my @newu = @$used;
			$newu[$wno] = 1;

			# it's a new SU!
			push @new, [ \@oneseq, \@newu ];
		}
	}
	if( @new )
	{
		@$sus = @new;
		return 1;
	}
	return 0;
  }

We profile our new v7.pl as usual via our helper script pp:

  ./pp v7
About 30 seconds later, open the generated profile report (in the v7/ directory) in your web browser.

You'll see that our second iterative version takes only 6.0 seconds (down from 8.0 seconds for v6). In particular, lengthen()'s run time has shrunk from 7.91 seconds to 5.97 seconds.

Sixth profiling-led optimization: v8.pl

Looking at v7.pl, I can't see much obvious to optimize, but back in v3.pl we saved some time by not computing the used set afresh each time, but by altering a single used set. In v7.pl, the heart of lengthen() is the body of the innermost for loop, which processes a single SU (a sequence,usedset pair, remember):
        # make a single length N+1 sequence, cons(wno,oldseq)
        my @oneseq = @$s;
        unshift @oneseq, $wno;

        # make an altered used array, with one more used.
        my @newu = @$used;
        $newu[$wno] = 1;

        # it's a new SU!
        push @new, [ \@oneseq, \@newu ];

As we did once before, the obvious alternative to copying and modifying the used set each time is to modify the existing used set and then modify it back. The first part that builds @oneseq is unaltered, but the second part becomes:

        # alter the used array, marking $wno used.
        $used->[$wno] = 1;

        # it's a new SU!
        push @new, [ \@oneseq, [ @$used ] ];

        # alter used back
        $used->[$wno] = 0;

We profile our new v8.pl as usual via our helper script pp:

  ./pp v8
About 30 seconds later, open the generated profile report (in the v8/ directory) in your web browser.

You'll see that our third iterative version takes only 5.8 seconds (down from 6.1 seconds for v7). This is only a small improvement, but is worth keeping.

Next alteration: refactoring - v9.pl

Looking at v8.pl's profile, I can't see much obvious to optimize inside lengthen(), but I decided to widen the envelope - to look hard at where lengthen() is called. That's the following for loop:
  my $N = 1;    # length starts at 1 and is increased..

  my @sus;      # all SUs for sequences of length N,
                # each entry is a [ seqarrayref, usedarrayref ] pair

  # convert each stopword wordno into a SU pair, building a list
  @sus = map { [ [ $_ ], [ suset($_) ] ] } @stopword;

  for(;;)
  {
        my $nseq = @sus;
        say "Have $nseq sequences of length $N";
        #show_seqs( @sus );
        my $ok = lengthen( \@sus, $N );
  last unless $ok;
        $N++;
  }
and decided to combine that for loop and lengthen() itself into a new function called findall(), and then (later) see if that opens up any new possibilities for optimization. This is a refactoring technique, not an optimization technique.

First, what's the call contract for findall()?

#
# my @sus = findall();
#	Find all sequences, starting with sequences of length 1 (stop words),
#	then working back, i.e. prepending words onto the front of existing
#	sequences.  Delivers the list of all maximal-length sequences, each
#	as a (sequence,usedset) pair.
#

Second, the body of findall() comes from naively combining the outer for loop and the body of lengthen(), moving some of lengthen()'s contract comment to the call site, and of course renaming the parameters that differed between the call and the original body. I also took the opportunity (now that the outer loop is much longer) to gather the initialization and incrementing of $N into the for loop header, making it much more obvious that $N starts at 1 and is incremented every time through the loop:

fun findall(  )
{
  my @sus;      # all SUs for sequences of length N,
                # each entry is a [ seqarrayref, usedarrayref ] pair

  # convert each stopword wordno into a SU pair, building a list
  @sus = map { [ [ $_ ], [ suset($_) ] ] } @stopword;

  for( my $N=1 ; ; $N++)
  {
        my $nseq = @sus;
        say "Have $nseq sequences of length $N";
        #show_seqs( @sus );

        # Take @sus, a list of SUs of length N and ending in a
        # stopword, and try to lengthen them all backwards, ie.
        # prepend a word number to the start of each sequence.

        # If this is possible, ie. if there is at least one extensible
        # SU, then change @sus to be the new, longer SUlist (all of
        # length N+1 now), and carry on looping.  Else break out.

        my @new;                # new list of SUs

        foreach my $su (@sus)   # foreach current SU
        {
                my( $s, $used ) = @$su;
                my $list = $inword[$s->[0]];  # list of word nos into s[0]
                foreach my $wno (grep { ! $used->[$_] } @$list)
                {
                        # make a single length N+1 sequence, cons(wno,oldseq)
                        my @oneseq = @$s;
                        unshift @oneseq, $wno;

                        # alter the used array, marking $wno used.
                        $used->[$wno] = 1;

                        # it's a new SU!
                        push @new, [ \@oneseq, [ @$used ] ];

                        # alter used back
                        $used->[$wno] = 0;
                }
        }
  last unless @new;
        @sus = @new;
  }
  return @sus;
}

We profile our new v9.pl as usual via our helper script pp, although of course we're not expecting a pure refactoring to speed things up:

  ./pp v9
About 30 seconds later, open the generated profile report (in the v9/ directory) in your web browser.

You'll see that the refactored version only takes 5.7 seconds (a tiny fraction faster than v8.pl did).

Seventh profiling-led optimization: v10.pl

Ok, now that we've got our refactored version with findall(), does this open up any possibilities for optimization? I observe from v9's profile that 1.5s was spent copying the @new array to @sus at the end of each successful lengthening pass. Let's look at the broad structure to see what we're doing with @sus and @new:

  my @sus = map { [ [ $_ ], [ suset($_) ] ] } @stopword;

  for( my $N=1 ; ; $N++)
  {
        ...
        my @new;                # new list of SUs

        foreach my $su (@sus)   # foreach current SU
        {
                ...
                # find a new longer SU and push it onto @new
                push @new, [ \@oneseq, [ @$used ] ];
        }
  last unless @new;
        @sus = @new;
  }

Could we somehow avoid that array copy at the end? The obvious way is to replace @sus and @new with array references $sus and $new, and then simply assign the array reference ($sus = $new) at the end. findall() becomes:

  fun findall(  )
  {
        my $sus;        # all SUs for sequences of length N,
                        # each entry is a [ seqarrayref, usedarrayref ] pair

        # convert each stopword wordno into a SU pair, building a list
        $sus = [ map { [ [ $_ ], [ suset($_) ] ] } @stopword ];

        for( my $N=1 ; ; $N++)
        {
                my $nseq = @$sus;
                say "Have $nseq sequences of length $N";
                #show_seqs( @$sus );

                # Take @$sus, a list of SUs of length N and ending in a
                # stopword, and try to lengthen them all backwards, ie.
                # prepend a word number to the start of each sequence.

                # If this is possible, ie. if there is at least one extensible
                # SU, then change $sus to be the new, longer SUlist (all of
                # length N+1 now), and carry on looping.  Else break out.

                my $new = [];           # new list of SUs

                foreach my $su (@$sus)  # foreach current SU
                {
                        my( $s, $used ) = @$su;
                        my $list = $inword[$s->[0]];  # word nos into s[0]
                        foreach my $wno (grep { ! $used->[$_] } @$list)
                        {
                                # make a single length N+1 sequence, cons(wno,oldseq)
                                my @oneseq = @$s;
                                unshift @oneseq, $wno;

                                # alter the used array, marking $wno used.
                                $used->[$wno] = 1;

                                # it's a new SU!
                                push @$new, [ \@oneseq, [ @$used ] ];
                                #say "debug: ", Dumper(\@oneseq) if $N==22;

                                # alter used back
                                $used->[$wno] = 0;
                        }
                }
        last unless @$new;
                $sus = $new;
        }
        return @$sus;
  }

We profile our new v10.pl as usual via our helper script pp:

  ./pp v10
About 30 seconds later, open the generated profile report (in the v10/ directory) in your web browser.

You'll see that this reference-based version takes only 4.9 seconds (down from 5.7 seconds). That's quite a lot faster!

Next alteration: minor refactoring - v11.pl

Looking at v10.pl's profile, I noticed that nearly all the knowledge about SU (sequence, usedset) pairs, and lists of SU pairs, is contained inside findall(). However, at the moment findall() returns the final SU list, even though the used sets are not used outside findall(). If we extract just the sequences from the final $sus list, and return an array of those sequences, then the SU pair knowledge is properly confined inside findall().

The contract for findall() becomes:

  #
  # my @seqs = findall();
  #	Find all sequences, starting with sequences of length 1 (stop words),
  #	then working back, i.e. prepending words onto the front of existing
  #	sequences.  Delivers the list of all maximal-length sequences.
  #

The final return statement at the end of findall() becomes:

  # now extract and return all the maximal length sequences
  return map { $_->[0] } @$sus;
The call to findall() becomes:

  my @seqs = findall();

  show_seqs( @seqs ) if $debug;

  # show just one of the longest sequences
  show_seqs( @seqs[0..0] );

show_seqs() must be altered to take a list of sequences rather than an SUlist. However, before we do that, we should keep a copy of the original SUlist-displaying function (renamed show_sus()) just in case we want to debug the code inside findall(). So those functions become:

  #
  # show_seqs( @seqs );
  #	Show the sequences (as words, not word nos)
  #
  fun show_seqs( @seqs )
  {
	foreach my $s (@seqs)
	{
		my $str = join( ',', map { $words[$_] } @$s );
		say $str;
	}
  }

  #
  # show_sus( @sus );
  #	Show the sequences (as words, not word nos) contained in SUlist @sus
  #
  fun show_sus( @sus )
  {
	foreach my $su (@sus)
	{
		my( $s, $u ) = @$su;
		my $str = join( ',', map { $words[$_] } @$s );
		say $str;
	}
  }

We profile our new v11.pl as usual via our helper script pp, although of course we're not expecting a pure refactoring to speed things up:

  ./pp v11
About 30 seconds later, open the generated profile report (in the v11/ directory) in your web browser.

You'll see that the refactored version takes 4.9 seconds (just like v10.pl did).

Next alteration: changing the SUlist structure - v12.pl

Next, I considered changing the SUlist structure ($sus and $new inside findall()). Currently an SUlist is an array of (Sequence, Used) pairs, where each Sequence is represented as a reference to an array of word numbers, and each Used set is represented as a reference to an array from word numbers to booleans.
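
In code terms, the v11 structure looks like this (a sketch, not literal code from v11.pl):

  # $sus = [ [ \@seq, \@used ], ... ];
  #     where @seq  = ( wordno, wordno, ... )  # the chain, first word first
  #     and   @used is an array of 0s and 1s   # indexed by word number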

First I tried changing the SUlist structure to a flat array (twice as long) of S,U,S,U,S,U etc, to see if that would save any time. I coded that, but found that it went slightly slower - this emphasizes that you often have to try a change to see whether it's an improvement or not. So I rejected that change.
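
For the record, the rejected flat-array variant would have looked something like this (a sketch of the idea, not the actual code):

  # hypothetical flat SUlist: sequences at even indices, usedsets at odd
  my $flat = [ [ 3 ],    [ 0, 0, 0, 1 ],       # first SU
               [ 2, 3 ], [ 0, 0, 1, 1 ] ];     # second SU
  for( my $i = 0 ; $i < @$flat ; $i += 2 )
  {
        my( $s, $used ) = @$flat[ $i, $i+1 ];  # one (sequence, usedset) pair
        # ... process $s and $used as before
  }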

I then thought hard about the S part of the SUlist structure. At present this is a sequence, represented as a reference to an array of the word numbers making up the sequence. But I noticed that the only element of the sequence array that we access specifically is element 0, i.e. the word number of the first word in the current sequence (given that we are building up word sequences backwards from stopwords).

So I wondered: what if I replaced the sequence array with a '-' separated string of word numbers, and stored the first word number separately? Our SUlist changes to a list of (Word, Sequence, Used) triples. The advantage is that we no longer need to clone the sequence array and then prepend a newly found word number; instead, we simply form the string "$wno-$seq".

I coded this up in v12.pl. The significant change is:

	my $sus;	# all SUs for sequences of length N, each entry is
			# a [ $wd, sequence, usedarrayref ] triple

	# convert each stopword wordno into a SU triple, building a list
	$sus = [ map { [ $_, $_, [ suset($_) ] ] } @stopword ];

	...
		foreach my $su (@$sus)	# foreach current SU
		{
			my( $sw, $s, $used ) = @$su;
			my $list = $inword[$sw];	# list of word nos into sw
			...
				# make a single length N+1 sequence with wno in front of s
				my $oneseq = "$wno-$s";
				...

				# it's a new SU!
				push @$new, [ $wno, $oneseq, [ @$used ] ];
				#say "debug: $oneseq" if $N==22;
			...
(of course, show_seqs() and show_sus() need corresponding minor changes).
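
For instance, show_sus() for v12's (Word, Sequence, Used) triples might look like this (my sketch, splitting the '-' separated sequence string back into word numbers):

  #
  # show_sus( @sus );
  #	Show the sequences (as words, not word nos) contained in SUlist @sus
  #
  fun show_sus( @sus )
  {
	foreach my $su (@sus)
	{
		my( $sw, $s, $u ) = @$su;
		say join( ',', map { $words[$_] } split( /-/, $s ) );
	}
  }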

We profile our new v12.pl as usual via our helper script pp:

  ./pp v12
About 30 seconds later, open the generated profile report (in the v12/ directory) in your web browser.

You'll see that this version is a bit faster, running in 4.6 seconds, down from 4.9 seconds.

So, have we reached the end of the profiling line? Yes - I think we're hitting the law of diminishing returns here.

Here's a summary of the run-times of each successive version:

  Version    Run-time (seconds)
     1          32.6
     2          28.8
     3          21.1
     4          14.8
     5          12.2
     6           7.9
     7           6.0
     8           5.8
     9           5.7
    10           4.9
    11           4.9
    12           4.6

What can we conclude from this? Successive rounds of profiling-led optimization have allowed us to make our word chain finder significantly faster - from 32.6 seconds down to 4.6, roughly 7 times faster. I often find that I can speed my code up by 10 times by this method, but of course it's not guaranteed.

In this case, as in many cases, the most significant speedup came from changing the fundamental algorithm (from recursive depth-first search to iterative breadth-first search). Iteration with appropriate data structures is normally much faster than deep recursion; in particular, millions of function calls are often quite expensive in themselves.
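
If you want to see this call overhead for yourself, here is a minimal, self-contained sketch (not part of our chain finder) using the core Benchmark module to compare the same work done with and without a function call per step:

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  sub step { my( $n ) = @_; return $n + 1 }

  # run each variant for at least 2 CPU seconds and compare rates
  cmpthese( -2, {
        with_calls => sub { my $n = 0; $n = step($n) for 1..10_000; },
        inline     => sub { my $n = 0; $n = $n + 1  for 1..10_000; },
  });

On a typical Perl, the inline variant runs several times faster than the with_calls variant, purely because of the per-call overhead.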

Note that our code still looks - for the most part - simple, clear and general. There's no reason to try to make it look horrible in order for it to run fast.


d.white@imperial.ac.uk
Written: Sept 2019, revised June 2021