|
Simon Colton, Stephen Cresswell and Alan Bundy
University of Edinburgh simonco@dai.ed.ac.uk, stephenc@dai.ed.ac.uk, bundy@dai.ed.ac.uk
Abstract
Introduction
Concept formation, broadly speaking, involves analysing what is already known and coming up with a new idea based on the original knowledge. How and why particular concepts are formed can be investigated by writing computer programs designed to automatically create new concepts based on a corpus of knowledge. Such programs are implementations of some theory about concept formation, and the practicalities of writing computer programs require the theory to be concrete. Computers allow a great deal of experimentation and the results obtained when running the programs can be used to suggest improvements to the underlying theory. A good way to test the programs is to perform concept formation in a well known domain and to compare the output obtained with the concepts which are classically found in the domain. Once concept formation programs have reached a certain standard, they can be used in a more exploratory fashion, assisting in the construction of theories by searching for new concepts in less well known domains. In these cases, the computer program will need a concrete and calculable notion of `interestingness' to help it decide which concepts to keep and which to discard. This notion can also be used in a more pro-active way by employing heuristics designed to increase the average level of interestingness. One such heuristic is to base new concepts on old concepts which have been shown to be interesting, hoping that interesting concepts lead to further interesting concepts. It is therefore clear that a concrete notion of interestingness should be central to any automated concept formation program. The most cited work in automated mathematical concept formation is Lenat's work on his AM [lenat:kbsiai] computer program which invented new definitions and made conjectures based on empirical evidence. This was an exploratory program designed to work in elementary set theory which actually delved into elementary number theory. A notion of interestingness was used to maintain an agenda of tasks to do next, calculated using values for how interesting old concepts were and values for how worthwhile a task involving those concepts might be. The calculations are complex and based on many heuristic notions of why a concept might be interesting, and there was no overall goal to the concept formation. Other attempts to formalise the notion of interestingness concentrated on reducing the search space so that exploration only occurred within a narrow band of concepts. With each of the newly formed concepts having a similar nature, it is easier to make a comparison of two concepts, and interestingness can be measured more easily. For instance, in the BACON programs [langley:sci_disc], written by Langley et al, the concepts formed were polynomial relations between measurements taken during physics experiments. A concept was interesting if the polynomial relation was observable in the data, and uninteresting if not. In this case, they were able to guarantee some level of interestingness by looking at the data first, spotting patterns and trends and forming the new relations in this data driven manner. Another example of narrowing the search space is Sims' IL program [sims:phd], which accepted specifications for an operator (for instance the multiplication operator on Conway numbers), and formed operators designed to fit the specifications. The number of specifications satisfied by the invented operator gives an indication of its importance, and the process can stop once one has been invented which satisfies all the specifications. Therefore, finding an effective measure of the interestingness of concepts is a two step process; (i) imposing a similar format on all the concepts produced, (ii) defining why one concept in this format is better than another, preferably in terms of numeric values, or any partial order which can be used to grade the concepts. To begin to impose a similar format between the concepts, we first restricted the mathematical domain in which the concepts were formed to finite group theory. This choice was due to our mathematical background and the fact that many other algebras, such as ring theory, Galois theory and infinite group theory depend on the concepts from finite group theory. We further restricted the type of concept formed to definitions, loosely speaking, the sentences which appear under the `definition' heading of finite group theory texts. As detailed later, further uniformity is achieved by thinking of each definition as a function mapping a group to some output. A major goal in the development of finite group theory is the classification of
finite groups. For instance, there is a classification of finite Abelian groups which can be found
in most elementary group theory texts. More remarkably, in 1980, a complete classification of finite simple
groups was achieved, which must surely be regarded as one of the major intellectual achievements
of all time. Taking the classification of finite groups as the reason to form concepts, we can judge how important a particular concept is by how much it improves a classification. A classification of an object describes the object, and we concentrate on three intrinsic properties of a description:
The work here develops calculations which can be performed to measure these aspects of a classification, and how much of an improvement a particular concept makes. This allows us to measure three values for each concept, its parsimony, its acuity, and its efficiency, which can be used to determine the importance of any concept in the theory. Note that, from this point, we abbreviate `finite group' to just `group'. Classifications
The first step to introducing a formal notion of interestingness is to add some uniformity to the concepts so that the similar structure makes it easier to compare two concepts. As previously stated, the concepts being formed here are definitions in group theory. The three most common ways of introducing new definitions in group theory books are as follows:
(A) By specialising the concept of group using a test on the group. For
instance:
(B) By specifying a calculation which can be performed on the group. For instance:
(C) By detailing a construction which is possible from a group. For instance: It is possible to think of formats (A) and (C) in terms of format (B). Firstly, the tests on groups can be written as boolean functions, which take a group as input, and output `true' or `false'. For example, we could write the Abelian property of groups given above as:
Secondly, to use constructions as a function on a group we can make a function which outputs all the possible examples of the construction for a given input group. Eg.
Then, as soon as a construction definition is given, we can make this second, functional, definition which can be used instead of the construction definition. Note that in practice there are many ways to turn the construction definition into a function accepting a group as input. For instance, we could make a function which tests whether a given group has any examples of a particular construction. Eg.
Hence all the sentences given in the definitions can be written as functions taking a single group as input, and outputting something based on the group. To add further uniformity to the format of the concepts, and to make later calculations possible, we impose the restriction that the output from the functions is a nested vector of atoms. This is true anyway of the functions made from definitions of type A and C above, and we only have to worry about those of type B, which are the functions occurring naturally in group theory. As group theory is built heavily on set theory, most of the outputs are elements, sets or groups, all of which can be written as nested vectors, and in practice this imposition is not too restrictive. We now have a starting point for a formalisation, and can continue by
noting that the output of a function can be used to describe any input
group. For instance, given a particular group,
The following are classifying functions in group theory:
(
and using
it is easy to calculate:
So, if we let
and this gives us:
Measures of Importance
Having defined a classification of a set of groups, we can now find some measurable properties
of it, and compare the classification given by a single function against that given by a set
of functions, to gauge the importance of the single function. It is therefore necessary to
identify what is desirable in a classification. To help with this, there are two special
classifications which represent the worst and best cases.
Firstly, there is the trivial classification, describing each group with the same
sentence, "G is a group". ie. The trivial classification is given by the single function:
Note that each of the three measures introduced are calculated for a classifying theory. To assess each individual classifying function, the classification given using only the single function is measured. Note also that each of the measures is normalised to give a value between 0 and 1, with 0 representing the worst case and 1 representing the best case. Parsimony of a Classification
If you wanted to tell a colleague something about a particular group, you would first
have to establish that they knew exactly which group you were talking about. This could
be achieved either by writing down the group table, or describing the group in
sufficient detail to ensure there is no ambiguity. The second method has the advantage
of being more parsimonious than the first, as the group table is a very cumbersome way
to represent the group. For instance, if the group under discussion was of order 20,
a group table would have 400 entries, and it is clearly more sensible to write this
down once in a reference book, but refer to the group by descriptions in day to day
use. To illustrate this point, group theorists would describe the groups One of the reasons to pursue a classification is to enable us to describe a complex object in a compact manner, with the description being used to represent the object, rather than a more cumbersome, explicit representation. We can then try to measure how parsimonious a particular classification is, and use this to assess the worth of the classifying theory, and the individual functions in that theory. Remembering firstly that the output of the functions will be used to describe the groups, and secondly that we restricted the outputs to be nested vectors only, if we flatten these vectors into a single list of atoms, the size of the list will give an indication of the parsimony. We can formalise this with the following definitions:
The
for
So,
and it is easy to check that:
Therefore, in terms of parsimony, these functions are perfect. They were
chosen for this reason, to reduce the space needed for the examples.
To see a less parsimonious example, look at the
Here, the output is a set, which can be written as a vector. It is not difficult to calculate:
Hence
From these, using
Acuity of a Classification
When representing a group with a description rather than its group table there are drawbacks, because, to increase the parsimony, the amount of information on display must be reduced. The first drawback is a loss in the power to discriminate between two groups. If we look at the trivial classification, we see that it describes every group in the same way, which is clearly a defect. It is desirable to have a classification which describes each group differently. We can approximate how well the descriptions differentiate between any two groups by looking at a set of groups, and judging how many have different descriptions. This leads to the following definitions:
, then as is a set, the description only appears
once in the set, making , and ]
Using
There are
we see that
So
we see that
Hence
and similar calculations show that
These figures indicate that the
Before leaving the subject of acuity, it is worth noting that the acuity measure does not address the problem of a function describing two equivalent groups differently. In group theory, two groups are regarded as the same if they are isomorphic, that is, there is a 1:1 map between the elements of the two groups which preserves the group operation. Isomorphic groups can have different group tables, and so it is possible for a classifying function to describe isomorphic groups differently. Given a set of isomorphic group tables, this deficiency can also be measured, but such calculations have been omitted for the sake of brevity. Calculations of acuity and those determining the problem caused by isomorphisms are both addressing the same problem: whether we can tell if two objects are truly different given only descriptions of them. This would appear to be a goal for classifications in any domain. Efficiency of a Classification
The third value we can calculate to assess the quality of a classification
measures the second drawback to having more parsimonious descriptions, and takes the
most explaining. The group table tells us We concentrate on the second method for working with group descriptions, ie. we try
to assess how difficult it is to recover a group table given only a description of it.
Of course, if we are given an explicit description of the group, that is, one from which
we can read Note that if we know the order of the group, a search for the group table of the group
will be finite, and, while losing generality slightly, things are made much easier if we
assume that the order of the group is a constraint which is always given in the description. Given the
assumption that the order of the group is known to be A much more plausible method is to fill in the group table one entry at a time, checking that the incomplete multiplication table satisfies the description. This means that the description of the group has to be interpreted as ways of ruling out incomplete multiplication tables. This interpretation is not always straightforward but is usually possible. The advantage to this method is that once an incomplete multiplication table has been rejected by the description, any larger table built by adding more elements to this table will also be rejected, and so there is no point trying those possibilities. This has the potential to drastically cut down the search space. Stated in this manner, we see that the problem of
finding the group table given a non-explicit description of the group is a Constraint
Satisfaction Problem (CSP). Having formulated the retrieval of the group table as a CSP, we have been able to use the large body of knowledge in this area to define a concrete calculation which estimates how difficult the search will be when constructing the group table given only a description of the group. This estimate is an adaption of the cost to find all solutions measure introduced in [willhogg:exploit], and draws heavily on that paper, specifically sections 3.3, 3.3.2 and 6.1.4. We urge consultation of this work to put the following definitions in context.
To find a true value for the efficiency of a classifying theory, it is necessary to find the minimised no-goods for each group given by each classifying function, and also those given by the group axioms. It is then necessary to determine the extent of the overlap between the sets of no-goods, and account for this in the calculation. This is too much work for this introductory example, so we shall concentrate on the easier problem of finding the efficiency of the IsAbelian and NSIE functions.
If we first look at the efficiency of the
ie.
Hence, a no-good will occur if the table has two entries filled in, one for
For instance, this imt(3) must be no-good for
It is clear that
In constructing a no-good of this type, we can first choose any position in the group table above
the diagonal, and there are
possible minimised no-goods with 2 entries filled in which violate the `IsAbelian(
Using the Maple Computer Algebra System, [maple:manual]
the above values for
Now,
The reason for this very poor score is that during a search for a
non-Abelian group which fills in elements of the group table one at a time,
if we reject an incomplete table on the grounds that it is not
non-Abelian, we must have filled in all the non-diagonal entries. Hence
most of the search will have been carried out before the constraining power of the
description can be used, so the `IsAbelian(
With this last scaled cost value, we can finally calculate:
An analysis of the minimised no-goods for the NSIE classifying function revealed that,
for a group,
[Note that if
Using the above approximation for
and so
Examining these results, we see that both the choice of classifying function used
to describe a group and the output from that function determine how efficient a
search for the group table will be.
For example, if a group is described as `IsAbelian( Conclusions and Further Work
With the aim of automating the formation of definitions in finite group theory, we set out to derive concrete calculations to measure the interestingness, or importance, of a definition, as this is a central requirement of any concept formation program. Noting that a major driving force in group theory is the classification of groups and that classifications provide descriptions of groups, we wrote the group theory definitions as descriptions of groups. We then identified the parsimony, acuity and efficiency properties of descriptions, which can be evaluated to measure how good the descriptions are. Intuitively, these properties are desirable of descriptions in any domain: Given an object, and asked to describe it, you would hope that your description was (a) concise and to the point, (b) enough to tell the difference between the object and a similar one and (c) some help towards constructing the object. For example, if you were asked to describe a person, you might perhaps say they were tall, or equally you might say that they lived at number 17. While both of these are concise, the first would be of use if you had to pick the person out of an identity parade, but the second would be of use if you had to track the person down. In an arbitrary domain it is unlikely that concrete measures could be specified
to approximate the worth of descriptions with respect to these three properties.
However, due to the formal nature of mathematics and by imposing certain restrictions,
we have been able to derive calculations and produce some values for example
definitions. Although the example calculations are fairly poor approximations, due to the
small sample of groups used, it is possible to compare the classifications given by the
Remembering that these values have 0 as the worst case and 1 as the best case, we see that the more parsimonious descriptions have a lower acuity and efficiency than the less parsimonious descriptions. Hence the figures calculated highlight the drawbacks of working with descriptions of objects rather than the objects themselves. Now, as we can measure how good a particular function is, it is possible to write heuristic methods which produce new functions, with the heuristics designed to increase the average parsimony, acuity or efficiency of the functions. For instance, if the acuity of the classification given by all the functions in the classifying theory is lower than we require, the program could use a production rule to base a new concept on an old one which has a high acuity itself. The heuristic employed here is that interesting concepts lead to further interesting concepts. Such a heuristic could be similarly employed if the parsimony or efficiency of the classification was below par, and more analytical heuristics, based on the failings of the classification for particular groups, are planned. The writing and testing of the heuristic production rules as a computer program Acknowledgements
We would like to thank Ian Gent and Toby Walsh for their comments on constraint satisfaction problems. This project has been funded by EPSRC research grant GR/L 11724. References
[colton:phd_prop]
[lenat:kbsiai]
[gorenstein:fsg]
[humphreys:group_theory]
[langley:sci_disc]
[sims:phd]
[tsang:focs]
[maple:manual]
[willhogg:exploit] © 2002 Simon Colton |