|
Simon Colton Mathematical Reasoning Group Division of Informatics University of
Edinburgh Edinburgh EH1 1HN United Kingdom
Theory Formation Programs in Pure Mathematics
Broadly speaking, machine learning programs are asked to identify a
single concept given a set of examples and some background
knowledge. Mathematical theory formation programs, such as the AM
program, [lenat:kbsiai] and the HR program,
[colton:ijcai99], are also given a set of examples and some
background knowledge. However, they are not asked to find a single
concept, but rather to explore the domain and attempt to gain some
understanding of it. Because the domain is mathematics, there are a
range of ways by which the program can gain an understanding of the
domain, including inventing concepts, performing calculations, making
conjectures, proving theorems and finding counterexamples. The HR
system is able to perform all of these activities in domains such as
group theory, where it employs the Otter theorem prover,
[mccune:ottermanual], and MACE counterexample finder,
[mccune:macemanual], to prove and disprove theorems.
Whereas machine learning programs are in general goal directed, theory
formation programs perform forward chaining. To control this, measures
of interestingness for concepts are derived and employed so that an
effective best first search through the space of concepts can be
achieved. For example, measures based on the clarity of the definition
of a concept are common in theory formation programs and help increase
the yield of understandable concepts. Theory formation programs are
not limited to just mathematics. We have recently enabled HR to work
in ``train theory'', in which it invents concepts which involve
properties of trains as described in the original Michalski problem,
[michalski:trains_original]. For example, it invents the concept
of trains where each carriage has a different shape.
In general, the kinds of concepts which machine learning programs and
theory formation programs can invent in pure mathematics are very
similar, but theory formation programs have to avoid many
uninteresting concepts. A good example of this is that the Progol ILP
program, [muggleton:progol], will easily learn the integer
sequence:

This concept is present in HR's search
space, but HR will not invent the concept because the search strategy
prohibits it (because the sequence is not interesting).
Research Problem Statement
We wish to define some concrete qualitative and
quantitative measures for the performance of exploratory theory
formation programs.
In the next section we list three of the many subjective
ways to assess a scientific theory formation program. Following this,
we suggest two quantitative measures for the performance of a theory
formation program, based on applying the program to machine learning
tasks. Finally, we summarise with a brief assessment of the problem.
Subjective Measures
By saying that the sequence is not
interesting, we have been subjective. Ideas about the interestingness
of individual concepts can often be turned into general rules and even
concrete calculations which assess the interestingness of a
concept. For example, HR chooses to develop concepts with a variety of
examples, which means it avoids the above sequence because it only
contains one number. Each theory formation program has its own
measures for interestingness and we have surveyed five such programs
in [colton:interestingness], and attempted to derive some common
approaches to the assessment of interestingness.
Heuristic searches based on these measures of interestingness are
intended to increase the quality of the theories produced. Assessment
of the theories themselves is also usually subjective, and likewise
assessments of the programs which produce the theories. We list here
some of the subjective ways in which theory formation programs have
been assessed historically, and highlight some of the drawbacks of
these assessments.
- The number of `classically interesting' concepts or conjectures
it re-invents in a theory. For example, AM re-invented square numbers
and prime numbers, found Goldbach's conjecture and the prime
factorisation theorem, and gained great credit for doing
so. Certainly, a program which re-invents no classically interesting
concepts or conjectures should be assessed poorly. However, what
constitutes a classically interesting concept is highly subjective,
and in areas such as train theory, this assessment is not
applicable. Moreover, it can encourage the authors of programs to
continually fine tune their programs so that they eventually reach
particular concepts. Certainly, AM was criticised for, amongst other
things, the fact that some of its heuristics were very specialised and
often only used once in a session, [hanna_ritchie:method].
- The number of `genuinely interesting' new concepts or
conjectures the program invents. Unfortunately, it is usually
difficult to know in advance even whether a concept is new, never mind
interesting. Vald\'es-P\'erez classifies machine discovery programs as
either those developed to model theory formation or those developed to
assist experts, [valdes-perez:collaboration]. An example of the
latter is the Graffiti program, [fajtlowicz:ocog], which has
produced many very interesting conjectures in graph theory which some
of the best minds in the field have tackled. With programs which
assist experts, the user can often quickly identify the importance of
a concept. Programs which model theory formation play an important
role in artificial intelligence and cognitive science. However, their
authors are often not experts in the domain of interest and the
programs are not designed to actually add to the domain. This makes
assessment of the novelty and importance of the concepts produced
difficult. Lenat originally believed AM had invented a new type of
number: those with more divisors than any smaller number. However,
these turned out to be highly composite numbers defined much earlier
by Ramanujan. We had a similar experience with refactorable numbers,
[colton:rfami], but have since redeemed the situation,
[colton:aaai00].
- The overall quality of the theory. This approach involves
examining an example theory produced and grading the interestingness
of the concepts in the theory. An overall score is then assigned to
the theory, based on the average interestingness of the concepts. This
is again subjective, and it is difficult to tell which concepts are
genuinely interesting. This is a good measure to use while developing
a theory formation program: examine the output and alter the program
so that the yield of interesting concepts increases (based on the
subjective notions of interesting being used). If this can be done in
a general way, the program will benefit greatly from such an
analysis. For example, if a program was consistently producing
concepts with very complicated definitions, we could define a clarity
measure and encourage the production of concepts with greater
clarity. If we develop a program in this way, we will require
different notions of interestingness to assess the program, as it will
obviously score well using the criteria employed to assess the theory
while developing it.
Quantitative Measures
Machine learning programs perform in a reactive way. They are given a
set of examples, some background knowledge and a categorisation of the
examples. They are then asked to use the background information to
define a concept which achieves the categorisation. For example, we
could give the ILP program Progol, [muggleton:progol], the set of
integers up to 10 and background information which includes the
concept of divisors of integers. If we then supply this categorisation
of the integers: , Progol will learn the
concept of even numbers (ie. divisible by 2) because when we use this
concept to categorise the integers 1 to 10 into even numbers and
non-even numbers, the categorisation is above, as required.
If we now imagine a more pro-active approach to machine learning,
where the program is given a set of examples and some background
knowledge, and we know that after some given time limit, it is going
to be given a categorisation of the examples which it must learn a
concept for. In the interim, the program can do anything which will
enable it to provide an answer quickly when the categorisation is
finally presented. For example, we could give the program a set of
trains and background knowledge in terms of describing the trains,
eg. carriages, objects in carriages, wheels, etc. We then give it five
minutes to prepare, and return to ask it why a particular set of
trains are going East. If the program has used its time effectively,
it would reply immediately, for example: the trains going East have
exactly four carriages. If the program has been completely effective
with its preparation time, we will be able to give it any
categorisation of the trains, for example into trains going North,
South, East and West, and it will be able to supply a concept
immediately which explains their behaviour.
This imaginary pro-active experiment describes the function of a
theory formation program. This enables us to define a measure for the
success of the program. We could simply count the number of concepts
the program comes up with, but an inferior program might produce
thousands of concepts which categorise the trains in only a few
ways. Therefore, we are actually interested in the number of different
categorisations which are achieved by the set of concepts a theory
formation program comes up within a given time limit. If a program
comes up with 100 categorisations of, say 8 example trains, after a
given time limit, it is more equipped to carry out machine learning
tasks than a program which only manages to come up with 50
categorisations.
Note that the number of different ways to categorise a set of
objects is defined as the th Bell number, [bell:numbers]. The
Bell numbers are: For example, there are
15 ways to categorise the numbers 1 to 4:

We have performed some experiments with HR in number theory and found
that it can produce concepts categorising the integers 1 to 5 in all
52 ways. Similarly, we have achieved around 90\% of the 203
categorisations of the integers 1 to 6. Note that we report an
increase in the number of categorisations HR finds in a given time
limit when a cooperative agency, rather than a single copy of HR is
employed, [colton:aisb00].
A measure of the interestingness of concepts which we can project to
measure the entire theory, is the clarity of the concepts. If a
program has a clarity measure for individual concepts, then the
average clarity of a concept in the theory can be used as a measure
for the theory itself. Even if a clarity measure is not used in the
heuristic search, it should still be possible to derive some measure
of the complexity of the definitions of the concepts produced.
Assessment of the Research Problem
We have so far employed a variety of measures to assess the HR program
in the hope that the combination will prove sufficient [the `shotgun'
approach, [bundy:shotgun]]. Some of these measures are subjective
and some are more concrete. The research problem we state here is to
derive some additional qualitative and quantitative measures for
theory formation programs, specifically those which work in scientific
domains. In particular, we do not yet have an objective way to measure
the average quality of the concepts produced in a session. We hope to
adapt some of the ideas developed in [sommer:quality] and similar
papers to help assess the theories produced by exploratory theory
formation.
Such concrete measures will not only allow us to assess the
performance of a program, but to determine whether particular
heuristics improve its performance, which was an initial motivation
for this problem. Further, we hope to use such measures to compare and
contrast theory formation programs. Investigation of why programs
perform badly with respect to these measures will allow future
developers of theory formation programs to implement better systems.
There are many ways to exhibit an understanding of a domain. We have
presented one based on machine learning tasks: an ability to
immediately explain categorisations presented. For example, HR has
completely categorised the numbers 1 to 5, so we can ask it to provide
a property shared by any two of them which the others do not share,
and it will always supply an answer immediately. Another way to
exhibit an understanding of the domain is to present an intelligent
question. In mathematics, this entails proposing a conjecture which is
simple to state, difficult (but not seemingly impossible) to prove,
and certainly not obvious. Perhaps the first step towards defining
measures is therefore to determine the ways in which an understanding
of the domain can be communicated by a theory formation program. From
this, we may be able to measure how much it really understands the
domain it has explored.
Acknowledgements
This work is supported by EPSRC grant GR/M98012.
References
[bell:numbers]
Bell, E.
Exponential numbers.
American Mathematics Monthly 41:411--419, 1934.
[bundy:shotgun]
Bundy, A.
Personal communication, 1998.
[colton:ijcai99]
Colton, S.; Bundy, A.; and Walsh, T.
HR: Automatic concept formation in pure mathematics.
In Proceedings of the Sixteenth International Joint
Conference on Artificial Intelligence, 1999.
[colton:aisb00]
Colton, S.; Bundy, A.; and Walsh, T.
Agent based cooperative theory formation in pure mathematics.
In Proceedings of the AISB-00 Symposium on Creative & Cultural
Aspects and Applications of AI & Cognitive Science, 2000.
[colton:aaai00]
Colton, S.; Bundy, A.; and Walsh, T.
Automatic invention of integer sequences.
In Proceedings of the Seventeenth National Conference on
Artificial Intelligence, 2000.
[colton:interestingness]
Colton, S.; Bundy, A.; and Walsh, T.
On the notion of interestingness in automated mathematical discovery.
International Journal of Human Computer Studies, 2000.
[colton:rfami]
Colton, S.
Refactorable numbers - a machine invention.
Journal of Integer Sequences vol 2, 1999.
[lenat:kbsiai]
Davis, R., and Lenat, D.
Knowledge-Based Systems in Artificial Intelligence.
McGraw-Hill Advanced Computer Science Series, 1982.
[fajtlowicz:ocog]
Fajtlowicz, S.
On conjectures of Graffiti.
Discrete Mathematics 72 23:113--118, 1988.
[mccune:ottermanual]
McCune, W.
The OTTER user's guide.
Technical Report ANL/90/9, Argonne National Laboratories, 1990.
[mccune:macemanual]
McCune, W.
A Davis-Putnam program and its application to finite first-order
model search.
Technical Report ANL/MCS-TM-194, Argonne National Laboratories, 1994.
[michalski:trains_original]
Michalski, R., and Larson, J.
Inductive inference of VL decision rules.
In Proceedings of the Workshop in Pattern-Directed Inference
Systems (Published in SIGART Newsletter ACM, No. 63), 1977.
[muggleton:progol]
Muggleton, S.
Inverse entailment and Progol.
New Generation Computing 13:245--286,1995.
[hanna_ritchie:method]
Ritchie, G., and Hanna, F.
AM: A case study in methodology.
Artificial Intelligence 23, 1984.
[sommer:quality]
Sommer, E.
An approach to measuring theory quality.
In Proceedings of the 9th European Knowledge Acquisition
Workshop, 1996.
[valdes-perez:collaboration]
Vald\'es-P\'erez, R.
Principles of human computer collaboration for knowledge discovery in
science.
Artificial Intelligence 107(2), 1999.
© 2002 Simon Colton
|