Go to the HR project pages

On The Notion Of Interestingness in
Automated Mathematical Discovery

Get paper as PDF file

Get paper as postscript file

Simon Colton and Alan Bundy
Institute of Representation and Reasoning
Division of Informatics, University of Edinburgh
80 South Bridge, Edinburgh. EH1 1HN.
simonco@dai.ed.ac.uk, bundy@dai.ed.ac.uk

Abstract

We survey five automated discovery programs working in mathematics, by looking in detail at the discovery processes they illustrate, summarising the successes they've had and by focusing on how they estimate the interestingness of concepts and conjectures. We then extract some common notions about the interestingness of conjectures and concepts. We detail how empirical evidence is used to give plausibility to conjectures, and the different ways in which a concept or conjecture can be thought of as novel. We also detail how programs assess how surprising and complex a conjecture statement is, and the different ways in which the applicability of a concept or conjecture is used. Finally we note how a user can set tasks for the program to achieve and how this affects the calculation of interestingness.

Introduction

There has been some recent progress in surveying and extracting general principles of machine discovery in science, for example [langley:tcadosk], [valdes-perez:collaboration]. We aim to add to this by surveying five programs developed to perform discovery in mathematics. We restrict our discussion to programs whose main objective is to invent concept definitions and make conjectures in pure mathematics. This leaves out automated theorem provers (which discover proofs), and programs which discover mathematical results in other domains, such as the very important BACON programs, [langley:sci_disc]. To compare and contrast the discovery programs, we detail what the project aims were, how the program worked and what contributions the programs made to mathematics and the understanding of mathematical discovery. We pay particular attention to the measures employed to estimate how interesting a concept or conjecture is.

Deciding whether something is interesting or not is of central importance in automated mathematical discovery, as it helps determine both the search space and search strategy for finding and evaluating concepts and conjectures. Best-first searches using assessments of interestingness are often needed to effectively traverse large search spaces. When it becomes clearer what results are interesting, instead of just ignoring or discarding dull concepts and conjectures, the search space can be tailored to avoid some of them completely. Estimating interestingness is difficult because often it has to be done immediately after a concept or conjecture has been introduced, whereas the true interestingness of results and definitions in mathematics may only come to light much later. Therefore, a heuristic search has to be careful not to throw away anything which may turn out to be useful later on.

In the notions section, we identify six reasons why a concept or conjecture might be considered interesting. We detail how the programs use empirical evidence to cut down on the number of false conjectures made. We show how the novelty of a conjecture can be determined by whether it, or an isomorphic conjecture, has been seen before, or whether it follows as an obvious corollary to a previous conjecture, and we detail the different ways in which a concept can be thought of as novel. We note that being surprising is a desirable property of conjectures and concepts and we show how programs will avoid making conjectures which are just instances of tautologies, and how they can assess the surprisingness of a conjecture or concept. We define the applicability of concepts and conjectures to be the subset of models to which they bear some relevance, and show that this measure can be used in a variety of ways. We also detail how programs can assess complexity and tailor their search strategies to find the least complex concepts and conjectures first. Finally, we look at how a user can set a program a particular task to achieve and how interestingness can be measured with respect to that task.

By looking in detail at five discovery programs and extracting some common ways by which the interestingness of concepts and conjectures is assessed, we will be able to suggest possible ways for future programs to measure interestingness. We will also note that the novelty, intelligibility and plausibility measures set out in [valdes-perez:collaboration] by which humans can assess the output of discovery programs are similar to those by which the programs assess the interestingness of their results internally.

Machine Discovery Programs

The five programs we discuss in detail are the AM program which worked in elementary set and number theory, the GT program which worked in graph theory, the Graffiti program which is used in graph theory, the plane geometry system from Bagai et al, and the HR program which works mainly with finite algebras. For clarity, no other programs are discussed.

The AM Program

The AM program, written by Douglas Lenat, performed concept formation and conjecture making in elementary set theory and elementary number theory, as described in [lenat:phd] and [lenat:kbsiai]. Starting with 115 elementary concepts such as sets and bags, AM would re-invent set theory concepts like subsets and disjoint sets, and number theory concepts such as prime numbers and highly composite numbers (with more divisors than any smaller integer). AM would also spot some well known conjectures, such as the fundamental theorem of arithmetic and Goldbach's conjecture - that every even number is the sum of two primes.

Concepts were given a frame representation with 25 facets to each frame, and none, one or multiple entries for each facet. Some of the facets were: (i) a definition for the concept (ii) an algorithm for the concept (iii) examples of the concept (iv) which other concepts it was a generalisation/specialisation of, and (v) conjectures involving the concept. AM repeatedly performed the task at the top of an agenda ordered in terms of the interestingness of the tasks. Each task involved performing an action on a facet of a concept. Usually the action was to fill in the facet, for example, find some other concepts which are specialisations of the concept or find some conjectures about the concept, but the action could also be to check the facet, eg. check that a conjecture was empirically true.

To perform a task, AM would look through its database of 242 heuristics, choose those which were appropriate to the task and perform each of the sub-tasks suggested by the chosen heuristics. Some sub-tasks detailed how to perform the overall task at hand, but they were not limited to that. Some sub-tasks would put new tasks on the agenda (which was how the agenda was increased). Some of the new tasks were to invent new concepts and when these were added to the agenda, AM would immediately create the frame for the new concept. This was because knowledge present at the time of suggesting the new concept was needed to fill in some of the facets of the concept. AM only filled in information at this stage which took little computation, such as a definition and examples, and a task was put on the agenda to fill in each of the other facets of the newly formed concept.

Among the new concepts AM would suggest were: (i) specialisations, eg. a new function which was a previous one specialised to have equal inputs, (ii) generalisations (iii) extracted from the domain/range of a function, eg. those integers output by a function (iv) inverses of functions (v) compositions of two functions. Some tasks on the agenda were to find conjectures about a concept, including finding that (a) one concept was a specialisation of another (b) the domain/range of a concept was limited to a particular type of object or (c) no integers of a particular type existed.

Because there could be as many as 4000 tasks on the agenda at any one time, AM spent a lot of its time deciding which it should do first. Whenever a heuristic added a task to the agenda, it would supply reasons accompanied by numerical values why the action, concept or facet of the task was interesting. AM then employed a formula involving the number of reasons and a weighted sum of the numerical values to calculate an overall worth for the task. The weighted sum gave more emphasis to the reasons why the concept was interesting than the reasons why the facet or action were interesting. When a heuristic was working out how interesting a concept was, it would collate and use another set of heuristics for the task. The heuristics which could measure the interestingness of any concept were recorded as heuristics 9 to 20 in [lenat:kbsiai], and included:

[9] A concept is interesting if there are some interesting conjectures about it.

[13] A concept is dull if, after several attempts, only a couple of examples have been found.

[15] A concept is interesting if all the examples satisfy the rarely-satisfied predicate P.

[20] A concept is more interesting if it has been derived in more than one way.

(Note that these have been paraphrased from Lenat's originals). AM also had ways to assess the interestingness of concepts formed in a particular way, for example the interestingness of concepts formed by composing two previous concepts could be measured by heuristics 179 to 189, one of which was:

[180] A composition is interesting if has an interesting property not possessed by either or .

AM would also measure the interestingness of conjectures, so that it could correctly assess tasks relating to the conjectures facets of concepts. Heuristics 65 to 68 seem to be the only heuristics which do this, for example:

[66] Non-existence conjectures are interesting.

At any stage during a session, the user could interrupt AM and tell it that a particular concept was interesting. Lenat says in [lenat:kbsiai] that users could ``kick AM in one direction or another'', and ``the very best examples of AM in action were brought to full fruition only by a human developer''. Many of AM's heuristics were designed to keep the focus on such chosen concepts, by spreading around the interest the user had shown in them. For example, these heuristics keep the attention on concepts and conjectures related to interesting concepts:

[16] A concept is interesting if it is closely related to a very interesting concept.

[65] A conjecture about concept X is interesting if X is very interesting.

In fact, AM could make a little interestingness go a long way: of the 43 heuristics designed to assess the interestingness of a concept, 33 of them involve passing on interestingness derived elsewhere.

There has been much debate about the AM program. [hanna_ritchie:method] were particularly critical of the methods AM used and the accuracy of Lenat's description of AM. The main contribution of Lenat's work is an inspiration for how computers could do mathematics, ie. by creating concepts and conjectures of many different types and using heuristic methods such as analogy and symmetry to explore a domain.

The GT Program

Th GT program by Susan Epstein performed concept formation, conjecture making and theorem proving in graph theory, as described in [epstein:otdomt] and more fully in [epstein:ladossfmk]. Given just the concept of a graph, GT would re-invent graph properties, such as being acyclic, connected, a star or a tree, (as shown in figure 1 below).

Figure 1: Graph properties re-invented by GT

Also, given a set of user-defined concepts describing graph properties, GT would make conjectures such as:

  • A graph is a tree iff it is acyclic and connected.

GT successfully illustrated a possible mechanism for automated discovery in mathematics involving both deductive and inductive reasoning. This was possible because GT represented graph properties in a carefully thought out way developed in Epstein's PhD thesis, [epstein:phd], which allowed example generation, theorem proving and concept formation.

Each graph property was represented as a triple,\newline , consisting of a set of base cases, , a constructor, , and a set of constraints for the constructor, , which together detailed the recursive construction of graphs from the base cases. For example, to define the star property above, the base cases would be just the trivial graph (with one vertex, no edges) and the constructor
would add one vertex and an edge between the new vertex and an old vertex, subject to the single constraint that the old vertex must be on more edges than any other vertex. Epstein was able to prove that 42 classically interesting graph theory concepts, including cycles, Eulerian graphs and -coloured graphs, could be represented in this manner in a sound and complete way.

This representation could be used to generate examples of a concept (Epstein called this `doodling') by starting with the base cases and repeatedly applying the constructor, subject to the constraints. Deduction was possible by proving that one graph property subsumed another [see [epstein:ladossfmk]], or by showing that no graphs could have two particular properties. Concept formation was possible by: (a) specialising a previous concept by removing base cases, restricting the constructor, or strengthening the constraints, (b) generalising a previous concept by adding base cases, expanding the constructor, or by relaxing the constraints, or (c) merging properties A and B, for example creating a new graph property with A's base cases and constructor, but the constraints of both A and B, [subject to some conditions].

GT worked by repeatedly completing one of six types of project: (i) generate examples of graphs with certain properties, (ii) see if one property subsumed another (iii) see if two properties were equivalent, (iv) see if a merger between two properties would fail, (v) generalise a concept and (vi) specialise a concept. Each project was placed on an agenda following various rules:

  • If a property has few examples in the database, then immediately generate more examples for it by `doodling'.

  • Two properties, and , are better candidates for projects (ii) or (iii) above if the set of base cases for and are similar. Two sets are most similar if they are equal, less similar if one is a subset of the other and less similar still if they only have a non-trivial intersection.

  • Only perform specialisation or generalisation projects with a concept before conjecture-making projects if the user has flagged the concept as a `focus' (see later).

As an overview, if a conjecture project was at the top of the agenda, before trying to prove the conjecture, GT would first see if there was empirical evidence against the conjecture, using the generated examples of the graphs [note that a conjecture was suggested only using the base cases]. If the project was to check a merger conjecture, then the merge step would take place, and only if no graphs of the merged type could be produced would an attempt be made to prove the conjecture. If a generalisation or specialisation project was at the top of the agenda, it would be carried out and some effort expended to generate examples of the new concept.

Focus concepts could be specified by users if they were particularly interested in them, and, as well as restricting concept formation only to the focus concepts, GT would only make conjectures involving the focus concepts. GT rated certain newly formed concepts as uninteresting and discarded them. For example, if a concept was a generalisation to a focus concept, but no example graphs could be produced which were not examples of the focus concept, the new concept was discarded. Also, if only a few graphs could be generated with a newly formed property, the new concept was discarded. By identifying the routine of ordering which conjectures to look at first, attempting to make and prove the conjectures, and performing concept formation only with the most interesting concepts, Epstein's model of discovery was successfully implemented and produced theories containing different kinds of conjecture and their proofs and concepts and graphs not present at the start of the session.

The Graffiti Program

The Graffiti program, by Siemion Fajtlowicz, makes conjectures of a numerical nature, mainly in graph theory, as described in [fajtlowicz:ocog], and more recently in [larson:discovery]. Given a set of well known, interesting graph theory invariants, such as the diameter, independence number, rank or chromatic number, Graffiti uses a database of graphs to empirically check whether one sum of invariants is less than another sum of invariants. If a conjecture passes the empirical test and Fajtlowicz cannot prove it easily, it is recorded in the ``writing on the wall'', some of which is publicly available, [fajtlowicz:wow], and Fajtlowicz forwards it to interested graph theorists. These types of conjecture are of substantial interest to graph theorists because (a) they often provide a significant challenge to resolve and (b) calculating invariants is often computationally expensive, so any bounds on their values are useful. As an example, the 18th conjecture in the writing on the wall states that, for any graph, ,

chromatic_number() + radius()
max_degree() + frequency_of_max_degree()

The empirical check is time consuming, so Graffiti employs two techniques, called the beagle and dalmation heuristics, to discard certain trivial or weak conjectures before the empirical test:

The beagle heuristic discards many trivially obvious theorems, including those of the form Note that invariants which are a previous invariant with the addition of a constant are used to make stronger conjectures. The beagle heuristic uses a semantic tree of concepts to measure how close the left hand and right hand terms are in a conjecture, and rejects those where the sides are semantically very similar.

The dalmation heuristic checks that a conjecture says something more than those made by Graffiti previously. To use the dalmation test for a conjecture of the form , Graffiti first collates all conjectures it has ever made of the form . Then, to pass the dalmation test, there must be a graph, , in Graffiti's database which for all the , . This means that, for at least one graph, gives a stronger bound for than any invariant suggested by a previous conjecture, so the present conjecture does indeed say something new about Graffiti's graphs.

Another efficiency improving technique employed by Graffiti is to restrict the database of graphs to only those which have at one stage been identified as a counterexample to one of Graffiti's conjectures. A third efficiency technique is to remove by hand any previous conjectures which are subsumed by a new conjecture. For example, Fajtlowicz would move the old conjecture to a secondary database, if the conjecture was made. However, if the latter conjecture was subsequently disproved, the former conjecture would be restored.

As Fajtlowicz adds concepts to Graffiti's database, the writing on the wall reflects the new input, eg. conjectures 73 to 90 involve the coordinates of a graph. Fajtlowicz can also direct Graffiti's search by specifying a particular type of graph he is interested in. For example, conjectures 43 to 62 are about regular graphs. To enable this kind of direction, Fajtlowicz informs Graffiti of the classification of its graphs, into, say, regular and non-regular graphs. Then, if Graffiti bases its conjectures on only the empirical evidence supplied by the regular graphs, the conjectures will only be about those graphs. To stop Graffiti re-making all of its previous conjectures, the echo heuristic uses semantic information about which graph types are a subset of which others, and rejects conjectures about the chosen type of graph if there is a superset of graphs for which the conjecture is also true.

In terms of adding to mathematical knowledge, the Graffiti program has been extremely successful. Its conjectures have attracted the attention of scores of mathematicians, including many luminaries from the world of graph theory. There are over 60 graph theory papers which investigate Graffiti's conjectures. While Graffiti owes some of its success to the fact that the inequality conjectures it makes are of a difficult and important type, this should not detract from the simplicity and applicability of the methods and heuristics it uses.

Bagai et al's System

The program developed by Rajiv Bagai et al, described in [bagai:atgipg], worked in plane geometry and constructed idealised diagrams and proved conjectures that certain diagrams could not be drawn. Each concept was a situation in plane geometry involving points and lines and relations between the points and lines, such as a point being on a line or two lines being parallel. For example, a parallelogram and its diagonals, as in figure 2 below [taken from [bagai:atgipg]], could be described by stating that there were four ingredient points, and , six lines (one between each pair of distinct points) and two relations, namely that lines and were parallel and that lines and were parallel

Figure 2: A parallelogram and diagonals, and its representation

Starting with an empty situation, constructions like the parallelogram were made by adding new ingredient points and new relations to a previous situation. Each time a new relation was added, a conjecture was made that the resulting situation was inconsistent, ie. that it was not possible to draw an example of the situation. To prove the conjecture, the situation was turned into a collection of polynomials and inequalities which were passed to an efficient theorem prover, [chou:mtp]. If the theorem prover could not find complex solutions to the polynomial then the situation was indeed inconsistent. However, if the theorem prover did find complex solutions, this said nothing about the consistency of the situation, and the conjecture was discarded to preserve the soundness of the theories produced.

Many methods were employed to reduce the number of times the system used the theorem prover. Firstly, only consistent situations were built upon, as a situation which was an extension of an inconsistent situation would itself be inconsistent. By also restricting to only adding one relation at a time, if the situation produced was inconsistent, the additional relation must have caused the inconsistency. This enabled better presentation of the theorems, eg. if the condition that lines and were parallel was added to the parallelogram situation above, this would cause an inconsistent situation. As the inconsistency was caused by the new relation, instead of just stating that a parallelogram with parallel diagonals cannot be drawn, the system could say that:

Given a parallelogram, the diagonals cannot be parallel.

Another way to reduce the time spent using the theorem prover was to avoid proving the inconsistency of a situation which was isomorphic to a previous one. Two situations were isomorphic if a permutation of the ingredient points of the first produced the second. To get around this problem, whenever a situation was built, all of its isomorphic situations were also built, so that they could be recognised if re-constructed by a different route later on. Also, to cut down on the occurrences of later theorems which implied earlier theorems a breadth first search was used where a step could only be the addition of either a single ingredient point or a single new relation. This meant that the most general situations were constructed before the more specific ones and therefore the most general versions of theorems were produced first. Not only could the program re-discover well known results such as Euclid's 5th postulate, it also provides a very clear and concise theory for the automatic production of a subset of plane geometry concepts and a set of theorems about the non-existence of models for certain concepts.

The HR Program

The HR program by Colton et al, as described in [colton:ecai98] and more recently in [colton:ijcai99], was originally developed to perform concept formation in group theory. Starting with just a few definitions, HR can re-invent classically interesting concepts such as centres of groups, Abelian and Cyclic groups and orders of elements. HR works directly with the models of concepts (stored as data-tables), and constructs new concepts by taking the data-tables of old concepts and manipulating them using one of ten production rules to produce a new data-table. From information about how a concept was constructed, HR can generate a definition for the concept whenever one is needed.

HR encounters a combinatorial explosion because a single concept can often be transformed into around 20 new ones, and any pair of concepts can be combined into a third. A heuristic search is used which chooses the best concept to use in each concept formation step. HR has a variety of ways to measure concepts and a weighted sum of measures is taken to indicate an overall level of interestingness for the concept. The weights are set by the user and depend on what types of concepts they are looking for. One way to use HR is to supply a `gold standard' categorisation of the groups in the database, and ask HR to find a function, the output of which will categorise the groups correctly (groups with the same output are put in the same category). HR can then measure how close each concept gets to this categorisation, by evaluating the proportion of pairs of groups that a concept correctly categorises. This approach can be effective, for example, given the isomorphic classification of the groups up to order six, HR found this function which correctly classifies them:

If the user has no particular task in mind, they can ask HR to explore the domain. HR has certain measures which indicate desirable properties of a concept, and users can stress some of these if they wish. The parsimony measure of a concept is inversely proportional to the size of the data-table for the concept. The data in a table corresponding to a particular group can be used to describe that group, and so a small table is advantageous as this means more parsimonious descriptions. HR can also assess the novelty of a concept, which is inversely proportional to the number of times the categorisation produced by a concept has been seen already, (with more unusual categorisations being more interesting). Finally, HR can measure the complexity of a concept which is inversely proportional to the number of old concepts appearing in its construction path. This gives a rough indication of how complicated the definition of a concept will be, and more concise definitions are desirable.

HR can make conjectures by spotting that the data-table of a newly formed concept is exactly the same as a previous concept, and conjecturing that the concepts are equivalent. When this happens, definitions for each concept are generated and used to write the conjecture in a way acceptable to the OTTER theorem prover, [mccune:ottermanual], which HR asks to prove the conjecture. For example, when HR invents the concept of elements, , for which , it spots that the new data-table is the same as the one it has for the user-given concept of the identity element, and the following conjecture is generated:

This is broken into and , which are both passed to and easily proved by the theorem prover, OTTER. Before passing a conjecture to OTTER, HR uses some simple deductive techniques to check whether the conjecture follows easily from those already proved.

HR has a set of `sleeping concepts', such as the trivial group, and when a concept is conjectured to be the same as these, the conjecture is flagged so that the user can pay special attention to it (or choose to ignore it). If the interestingness of conjectures and proofs can be estimated, then the average interestingness of the theorems a concept appears in can be taken as a new measure for the interestingness of the concept itself. Conjectures are assessed in two ways. Firstly, the surprisingness of a conjecture measures how different the two (possibly) equivalent concepts are, by evaluating the proportion of concepts which appear in the construction path of one but not both of the concepts. This gives some indication of how different looking the definitions of the equivalent concepts are going to be. Secondly, if a conjecture is proved, OTTER will provide a proof length measure in its output, which gives some indication of the difficulty of the proof.

If the equivalence of two definitions is proved, HR uses this fact to re-assess the concepts involved, and keeps only the least complex definition for the concept. If OTTER cannot prove a conjecture, HR passes it to the MACE model generator, [mccune:macemanual], which is asked to find a single counterexample to the conjecture. If MACE is successful, the counterexample is added to HR's database and all previous concepts and measures are re-calculated, giving HR a better idea of the theory it is exploring. All future conjectures will be based on the additional data provided by the new group. HR's theory formation is general enough to apply to any finite algebra and HR can bootstrap the process, that is, it can start with just the axioms of the algebra, and end with a theory containing models, definitions, open conjectures, theorems and proofs. The concept formation is general enough to apply to different domains, including graph and number theory.

Indeed, HR's biggest success so far has come in number theory, where it invented the concept of refactorable numbers, the first examples of which are:

These are defined as those integers where the number of divisors is itself a divisor. This concept was novel because it was missing from and subsequently added to the online-encyclopedia of integer sequences, [sloane:oeois], which contains over 45,000 integer sequences. Refactorables are also interesting because there are many provable properties about them, for example, all odd refactorables are square numbers. Results about refactorables have been published in the mathematical literature, [colton:rfami].

Assessing the Interestingness of Conjectures and Concepts

We cannot discuss measures of interestingness without addressing how the measures are used. For example, one program might say that the concept of even primes is interesting because a conjecture can be made that 2 is the only one, whereas another program might say that they are dull because only one example of them is known. Here the same measure has been used (see the applicability section below), but different conclusions have been drawn. Therefore we have to clearly separate the measures for interestingness from their uses. One common use of interestingness is to improve the efficiency of the programs. To save time checking and proving conjectures, certain conjectures are discarded before even checking them empirically, and the reason to perform an empirical check is, of course, to cut down on the time spent trying to prove false conjectures.

Another common use of interestingness is to improve the appeal of the output for the user. It is not possible to avoid all uninteresting concepts or conjectures when constructing a theory and interestingness measures can be used to filter the output depending on the user's needs. Also, measures of interestingness can guide the search so that the program can make informed progress into the space and find interesting concepts that it might take a longer time to find with an exhaustive search. A more specific use of interestingness measures is to predict in advance how difficult a conjecture is going to be to prove, which, in all but some trivial circumstances is not easy to do. Finally, interestingness measures can be used to steer the concept formation towards a particular concept which performs a user-defined task. Having identified some uses for interestingness, we can detail certain general measures and look at how each one is used.

Empirical Plausibility of Conjectures

A conjecture is likely to be uninteresting if the empirical evidence a program has provides counterexamples. This does not mean that false conjectures in general are uninteresting, as the production of counterexamples is a worthwhile pursuit. However, if a counterexample is found in the data a program has, the conjecture cannot provide this worthwhile pursuit. Only the system developed by Bagai et al makes conjectures which have not been first verified by some empirical evidence. In this case, the efficiency and power of the theorem prover and the nature of the idealised geometrical domain make it unproductive to look for counterexamples. The AM program is the only one which doesn't immediately discard a conjecture proved false by empirical evidence. In this case, an attempt is made to alter the conjecture to make it fit the data. One way to do this is to exclude what AM calls `boundary' integers, so for example, the conjecture that `all primes are odd' becomes the conjecture that `all primes except 2 are odd'.

HR and Graffiti use all of their data at once. HR uses its data to spot (ie. suggest) a conjecture, so by the time a conjecture has been made, the empirical check has been completely performed. Similarly, once a conjecture has been suggested by other means to Graffiti, all the empirical evidence is used to discard it. Note however, that both of these programs keep the amount of empirical data down to a minimum because they only store models which have been generated as counterexamples to previous conjectures. GT employs a more efficient system because a conjecture is suggested by the small amount of empirical evidence in the set of base cases, and only those conjectures passing this test are checked against all the examples for the concepts. Similarly, AM will make a conjecture based on a little empirical evidence, then try to generate more models to disprove the conjecture.

Novelty

Because of the redundancy often inherent in searches for concepts and conjectures, it is important to be able to spot when a repetition has occurred, and each program either tailors its search to reduce repetitions, or can spot them when they occur. Here, the programs are measuring the novelty of a concept or conjecture statement, and rejecting those which have been seen already. The Graffiti program goes to the length of storing conjectures between sessions. Another issue of novelty in programs searching for conjectures is whether a theorem follows as an obvious corollary to a stronger theorem, in which case, the weaker result does not say anything particularly new. Graffiti works hard to show that a new conjecture says something more than all the previous ones, by checking that there is at least one graph for which the inequality in the conjecture is stronger than all the previous ones (the dalmation heuristic), and by checking that the conjecture is not implied by previous ones (the echo heuristic). Spotting the implication of one conjecture by another is also used in Graffiti to improve efficiency: if a later conjecture turns out to be stronger than a previous one, the earlier one is removed, hence saving Graffiti time later when it looks through its old conjectures.

The HR program deals with the implication of one conjecture by previous ones by extracting and attempting to prove all stronger conjectures than the one it is considering. For example, if interested in the conjecture , HR will first use some simple deductive techniques to see if it follows as a corollary to the results it has already proved. If not, it will ask OTTER to prove , and and if either turns out to be true, the weaker, original conjecture is discarded and the stronger conjecture is kept. The most efficient way to deal with one conjecture implying another is to tailor the search to produce the most general conjectures first, so there is less chance that later conjectures will imply the earlier ones. This technique is used in the system from Bagai et al, but they point out in [bagai:atgipg] that it is still possible to produce a later conjecture which implies an earlier one.

A concept can easily be shown to be novel with empirical evidence. For example, if a function produces some output for a given input that no other function produces, it must be novel. Given only a limited amount of data though, it is more difficult to tell that two concepts are indeed the same. Bagai et al's system (which has no data available at all) tackles this by generating all possible isomorphic concepts whenever a new concept is introduced, so that if an isomorphic concept to the new one is reached by another route, the system will spot this. HR's conjecture making abilities rely on the fact that a proof is often needed to tell that two concept definitions are in fact equivalent, and, like AM, HR assesses concepts as more interesting if they were derived in two or more different ways. If a concept isn't the same as one already in the theory, it is possible to assess how much it differs from the others, using certain properties of it. AM, for example, gave extra interestingness to newly formed concepts, ie. those with the special property of being recently invented. The HR program assesses the novelty of a concept in terms of the novelty of the categorisation of groups it gives. It is important to make the distinction between one concept having two properties (eg. two definitions), which is often interesting, and two concepts sharing the same property (eg. a categorisation), which often detracts from the interestingness of both.

Surprisingness

As portrayed in [fajtlowicz:wow], when asked what makes a good conjecture, the mathematician John Conway said without any hesitation: ``it should be outrageous''. This is good advice, and in some cases an assessment of how surprising a conjecture is can be automated. The least surprising conjectures are those which are just instances of tautologies. For example, given objects of any type, , and predicates of any nature, and , the conjecture

is always going to be true, and conjecture finding programs should avoid making these and similar conjectures.

To avoid making tautologies of a particular type, GT did not make subsumption conjectures if one of the concepts was a specialisation of the other. The HR program avoids certain tautologies by forbidding certain series of concept formation steps, eg. not allowing two negation steps in succession. Graffiti uses a semantic tree to measure how different invariants and are in the conjecture: , and the beagle heuristic discards many tautology conjectures which involve very similar concepts, such as . Whereas Graffiti uses its measure for surprisingness only to discard conjectures, when the HR program makes a a conjecture that two concept definitions are equivalent, it has semantic information about those concepts, so can tell how different they are, giving an indication of how surprising the conjecture is. HR uses the heuristic that concepts appearing in surprising conjectures are more interesting, which helps drive a best first search. While HR and Graffiti can estimate the surprisingness of conjectures, only AM measured the surprisingness of a concept: it gave extra interestingness to concepts which possessed an interesting property not possessed by its parents (see heuristic 180 in [lenat:kbsiai], for example).

Applicability

The applicability of a predicate can be defined as the proportion of models in a program's database which satisfy the predicate. Also, the applicability of function can be defined as the proportion of models in a program's database which are in the domain of the function. This measure is somewhat analogous to the empirical plausibility of conjectures, but can itself be extended to cover conjectures: the applicability of a conjecture can be defined as the proportion of models in a program's database which satisfy the conjecture's preconditions. These measures are used in a variety of ways as follows.

In AM and GT, if a newly formed concept had low applicability (ie. few examples), a task was put on the agenda to generate some more models that it applied to. If a concerted effort to generate examples still resulted in a low applicability, GT would discard the concept as uninteresting, and AM would give it a low interestingness score. In the special case where the applicability was zero, (ie. no examples were found), both GT and AM would make the conjecture that none exist. The HR program makes similar non-existence conjectures, and other conjectures about the applicability of a concept, for example, that it is restricted to the trivial group. In Bagai et al's system, the whole point was to prove that certain concepts have no models (ie. situations are inconsistent), which is equivalent to showing that the applicability of a concept is zero. So we see that concepts with little or no applicability are often thought of as dull, but the conjecture that this is true is interesting.

Furthermore, in GT, if a generalisation step produced a concept with no greater applicability than the one it generalised, or if a specialisation step produced a new concept with a greater applicability than the one it was supposed to specialise, the new concept was discarded. In Graffiti, if the user has specified an interest in a particular set of graphs (ie. those with a particular quality), then if a conjecture is output which is applicable to a superset of that set, it is discarded as being too general (the echo heuristic). Also, in HR, the parsimony measure prefers concepts with small data-tables, and hence small applicability. This gives an emphasis to specialisation procedures and can be useful in controlling the search. Finally, the AM program used `rarely satisfied predicates' - those with low applicability - to assess other concepts. For example, heuristic 15 from [lenat:kbsiai] gave more interestingness to functions whose output always satisfied one of the rarely satisfied predicates that AM had come across. We see that applicability is a common measure, but how it is used is always determined by the context of the discovery task being attempted.

Comprehensibility and Complexity

As programs are intended to produce output understandable by the user, more comprehensible concepts and conjectures are usually more interesting. The GT and Bagai et al programs constructed concepts incrementally so that the most comprehensible ones were introduced first. The HR program employs the complexity measure which prefers concepts with smaller construction paths (which roughly relates to how long their definition will be). HR's best first search is not guaranteed to produce the least complex concepts first, so, if two concepts are proved to be equivalent, the least complex definition of the two is kept. Also in HR, concepts can be used to describe the groups in the theory, and the parsimony measure prefers concepts giving shorter, more concise descriptions.

The comprehensibility of a concept gives one, albeit shallow, indication of the complexity of that concept, and usually less complex, more comprehensible, concepts are desirable. An alternative way to assess the complexity of a concept is to evaluate how much information there is about it. AM counts how many conjectures there are involving a concept and uses this as a measure of the interestingness of the concept. The HR program goes one stage further and assesses not only the conjectures but also the proofs of them, and uses this to measure the interestingness of the concepts involved in the conjectures.

The novelty and surprisingness measures often prune easy to prove conjectures, because tautologies, conjectures which follow as an obvious corollary to previous theorems and those which are unsurprising have a greater likelihood of being easy to prove. Removing these conjectures will possibly increase the average difficulty to prove the conjectures remaining. However, this is separate from the issue of how difficult a conjecture is to understand. Preferring conjectures about less complex concepts will increase the overall comprehensibility of the theory, and choosing a simple format for conjectures can also help. For example, as noted in [valdes-perez:collaboration], it is easy to understand what Graffiti's inequality conjectures are saying. Further, the program from Bagai et al presents its theorems not as unsatisfiability results, but as relations which cannot occur once a situation has been set up, which is a more understandable format.

Achieving Particular Tasks

There are ways by which the user can explicitly express interest in particular concepts or conjectures. This is clear in the AM program which was designed as an interactive program where the user can interrupt the session at any time and express an interest in a particular concept. AM's heuristics were set up to pay particular attention to this concept, and the limited number of concepts AM could produce in a session were heavily biased by the user's choice. Here we see that the user wants AM's concepts and conjectures to perform a particular task, namely to discuss the concept chosen by the user. This is taken a stage further in the GT and Graffiti programs, where the user can specify a `focus' concept, and only conjectures involving the chosen concept will be produced. In GT's case, this also meant that specialisations or generalisations of only the focus concept would be attempted.

In the Graffiti program, all the proved conjectures give a bound for an invariant (which may save computation time), so we see that the search space has been designed with a task in mind. More explicit tasks are set for other programs. By giving a `gold standard' classification of groups to HR, users are expressing an interest in concepts which achieve that categorisation. By giving concepts and conjectures particular tasks to achieve, a program can measure how close each comes to completing the tasks and use this to estimate interestingness, which will hopefully drive the best-first search towards something which achieves the task.

Conclusions

Assessing the interestingness of a concept or conjecture automatically is difficult because a program has to try to predict how much useful mathematics will result from an investigation of the concept or the attempts to prove the conjecture. Fermat's last theorem, for example, could easily have been relegated to the appendix of a number theory text if it had not been so difficult to prove. Also, as in discovery of any kind, it is often necessary to have expert knowledge to decide whether an invention has any far reaching implications or applications, as pointed out in [langley:tcadosk].

By comparing and contrasting five machine discovery programs, all of which are forced to guide their search towards more interesting output and make instant decisions about the possible interestingness of conjectures and concepts, we have extracted some possible ways a program can measure the interestingness of the concepts and conjectures it makes. The empirical plausibility of conjectures, novelty and comprehensibility measures are analogous to qualities described in [valdes-perez:collaboration] to assess the output from machine discovery programs. Therefore, we see that automated attempts to estimate interestingness often check the internal plausibility, novelty and intelligibility of a theory in the ways a human might estimate the quality of a theory externally.

The Interestingness of Conjectures

In summary, it is often a good idea to prescribe a definite task for a concept or conjecture to achieve, and design measures of interestingness around this. If this is not possible, and an estimate of the interestingness of a conjecture is going to be made, some of the following points could be taken into consideration:

  • The conjecture should be empirically true.

This can be achieved by only making conjectures backed up by all available data, suggesting a conjecture by some other means and then using all the data to discard the conjecture if necessary, or altering a conjecture so that any data which disproves it is no longer applicable.

  • The conjecture should be novel w.r.t. previous ones.

This can be achieved by understanding and checking how two conjectures can be equal or isomorphic in the domain of interest. Also, conjectures which are implied by previous, stronger conjectures, should either be avoided by tailoring the search space, or rejected when found.

  • The conjecture should be surprising in some way.

This can be achieved by avoiding or discarding well known tautologies, and by using information about the concepts discussed in the conjecture to estimate how unlikely the suggested relation between them is.

  • The conjecture should discuss some non-trivial models.

This can be achieved by discarding any conjectures where the only models satisfying the preconditions are a trivial, or uninteresting set. Note that in certain circumstances, this kind of highly specialised conjecture may actually be of interest to the user.

  • The conjecture should be understandable, but non-trivial to prove.

This can be achieved by assessing the number and diversity of concepts involved in a conjecture and removing overly complicated conjectures, or by fixing the search strategy to output the simplest conjectures first. Making conjectures in a well known format will help understandability. If the conjecture has been proved, the proof could be used to estimate how difficult the theorem was.

The Interestingness of Concepts

If an estimate of the interestingness of a concept is going to be made, some of the following points could be taken into consideration:

  • The concept should have models.

This can be achieved by checking available data for models which satisfy a predicate or are in the domain of a function. If no models exist, an effort should be made to generate some. It may be necessary to prove that no models exist, and discard the concept (but keep the theorem).

  • The concept should be novel w.r.t. previous ones.

To achieve this, the search should ensure that no two obviously isomorphic definitions can be made, and avoid paths which will ultimately lead to the same concepts. Then, if the models of one concept are the same as another, the concepts are possibly equivalent, which should lead to a proof of this fact. If a concept is indeed semantically different to all the others, then the novelty of various properties (such as the way it categorises models) could be assessed.

  • There should be some possibly true conjectures made about the concept.

By the qualification of possibly true conjectures, we note that while false conjectures about the nature of a concept are usually of no interest, an unsettled conjecture can often be more interesting than a proved theorem.

  • The concept should be understandable.

This can be achieved by designing the search to construct concepts with the simplest definitions first, and by keeping the simplest definition when it has been proved that two concepts are equivalent.

  • The concept should have a surprising property.

A surprising property may be something that isn't true of the parent concepts.

Building on the techniques for automated discovery that have been developed in artificial intelligence and cognitive science, and learning from the results of programs developed in mathematics, an effort can be made to write more programs which act as collaborators with working mathematicians. The production of intelligently suggested conjectures and concepts plays an integral and important part in developing a mathematical theory, and automating these processes is a worthy area for research. How programs estimate the interestingness of the concepts and conjectures they produce is central to building intelligent discovery programs, and we hope that the notions of interestingness derived here will be of some help.

Acknowledgements

We would like to thank Toby Walsh for his important comments on this work and on the HR project. We would like to thank Pat Langley and Ra\'ul Vald\'es-P\'erez for sending us survey papers on machine discovery programs. We would also like to thank Susan Epstein for her comments explaining aspects of the GT program, Jan \.Zytkow for his comments on the system by Bagai et al and Craig Larson for helping us understand the Graffiti program. This work is supported by EPSRC research grant GR/L 11724.

References

[bagai:atgipg]
R. Bagai, V. Shanbhogue, J. \.Zytkow, and S. Chou.
Automatic theorem generation in plane geometry.
In LNAI 689. Springer Verlag, 1993.

[colton:ecai98]
A. Bundy, S. Colton, and T. Walsh.
HR - automatic concept formation in finite algebras.
Technical Report 920, (Presented at the Machine Discovery Workshop at ECAI 98) Department of Artificial Intelligence, University of Edinburgh, 1998.

[chou:mtp]
S. Chou.
Mechanical Theorem Proving.
D. Reidel Publishing Company, Dordrecht, Netherlands, 1984.

[colton:rfami]
S. Colton.
Refactorable numbers - a machine invention.
Journal of Integer Sequences, 2, 1999.

[colton:ijcai99]
S. Colton and A. Bundy.
HR: Automatic concept formation in pure mathematics.
In Proceedings of IJCAI, 1999.

[lenat:kbsiai]
R. Davis and D. Lenat.
Knowledge-Based Systems in Artificial Intelligence.
McGraw-Hill Advanced Computer Science Series, 1982.

[epstein:phd]
S. Epstein.
Knowledge Representation in Mathematics: A Case Study in Graph Theory.
PhD thesis, Department of Computer Science, Rutgers University, 1983.

[epstein:otdomt]
S. Epstein.
On the discovery of mathematical theorems.
In IJCAI Proceedings, 1987.

[epstein:ladossfmk]
S. Epstein.
Learning and discovery: One system's search for mathematical knowledge.
Computational Intelligence, 4(1):\penalty0 42--53, 1988.

[fajtlowicz:ocog]
S. Fajtlowicz.
On conjectures of Graffiti.
Discrete Mathematics 72, 23:\penalty0 113--118, 1988.

[fajtlowicz:wow]
S. Fajtlowicz.
The writing on the wall.
Unpublished preprint, available from http://math.uh.edu/. clarson/, 1999.

[langley:tcadosk]
P. Langley.
The computer-aided discovery of scientific knowledge.
In Proceedings of the first international conference on discovery science, 1998.

[langley:sci_disc]
P. Langley, H. Simon, G. Bradshaw, and J. \.Zytkow.
Scientific Discovery - Computational Explorations of the Creative Processes.
MIT Press, 1987.

[larson:discovery]
C. Larson.
Intelligent machinery and discovery in mathematics.
Unpublished preprint, available from http://math.uh.edu/. clarson/, 1999.

[lenat:phd]
D. Lenat.
AM: An artificial intelligence approach to discovery in mathematics}.
PhD thesis, Stanford University, 1976.

[mccune:ottermanual]
W. McCune.
The OTTER user's guide.
Technical Report ANL/90/9, ANL, 1990.

[mccune:macemanual]
W. McCune.
A Davis-Putnam program and its application to finite first-order model search.
Technical Report ANL/MCS-TM-194, ANL, 1994.

[hanna_ritchie:method]
G. Ritchie and F. Hanna.
AM: A case study in methodology.
Artificial Intelligence, 23, 1984.

[sloane:oeois]
N. Sloane.
The Online Encyclopedia of Integer Sequences.
http://www.research.att.com/~njas /sequences, 1998.

[valdes-perez:collaboration]
R. Valdés-Pérez.
Principles of human computer collaboration for knowledge discovery in science.
Artificial Intelligence, 107(2), 1999.



© 2002 Simon Colton