Lecture 10

The Michalski train problem was invented by Ryszard Michalski around 20 years ago. It is simply stated: find a concept which explains why the five trains on the left are travelling eastbound and the ones on the right are travelling westbound. The solution must involve the concepts on view: size, number, position, contents of carriages, etc. It is left as an exercise to find the solution... This is a standard problem which has been used to test and demonstrate many machine learning techniques. In particular, Inductive Logic Programming implementations (which we discuss later) use this example for demonstrative purposes.
However, it is worth remembering that in machine learning, we often have to deal with data taken from the real world, and real world data contains errors. Errors come in various flavours, including: (i) incorrect categorisation of examples, such as saying that a platypus is a bird (it does lay eggs) when it is actually a mammal; (ii) wrong values for background information about the examples, such as low quality scanning of handwritten letters leading to a pixel which was not written on being scanned as black instead of white; (iii) missing data, for instance, examples for which we do not know the value of all the background concepts; and (iv) repeated data for examples, possibly containing inconsistencies.
Certain learning algorithms are fairly robust with respect to errors in the data, while others perform badly in the presence of noisy data. For instance, the FINDS learning algorithm we discuss later is not very robust to noisy data. Hence it is important to assess the amount of error you expect in your data before you choose the machine learning techniques to try.
Writing a machine learning algorithm comprises three things: deciding how to represent the solutions, how to search the space of solutions for a set of solutions which perform well, and how to choose from this set of best solutions.
The solution(s) to machine learning tasks are often called hypotheses, because each can be expressed as the hypothesis that the observed positives and negatives for a categorisation are explained by the concept learned as the solution. The hypotheses have to be represented in some representation scheme, and, as usual with AI tasks, this will have a big effect on many aspects of the learning methods. We will look at a number of ways to represent solutions and the associated methods for agents to learn using them.
It is important to bear in mind that a solution to a machine learning problem will be judged in worth along (at least) these three axes: (i) accuracy: as discussed below, we use statistical techniques to determine how accurate solutions are, in terms of the likelihood that they will correctly predict the category of new (unseen) examples (ii) comprehensibility: in some cases, it is highly desirable to be able to understand the meaning of the hypotheses (iii) utility: there may be other criteria for the solution which override the accuracy and comprehensibility, e.g., in biological domains, when drugs are predicted by machine learning techniques, it is imperative that the drugs can actually be synthesised.
Each learning task will be better suited by one or more representation schemes. For example, to some extent, it is not important exactly how a learned hypothesis predicts stock market movements, as long as it is accurate. Hence, in this case, it is perfectly acceptable to use so-called black box representations and techniques, such as neural nets. These methods often yield high predictive accuracy, but provide hypotheses which are difficult to understand (they are black boxes, and one cannot look inside). In scientific domains, training a neural network to perform prediction tasks may well be more effective in terms of predictive accuracy than using another approach. However, the other approach may yield an answer which is more understandable and from which more science will flow.
For some methods, such as training neural networks, it's not all that useful to think of the method as searching, as it is really just performing calculations. In other techniques, such as Inductive Logic Programming, however, search certainly takes place, and we can think about the specifications of a search problem as we have done in game playing and automated reasoning. One important consideration in machine learning is whether the algorithm will search for more general or more specific solutions first. More general solutions might be advantageous because the user may be able to instantiate variables as they see fit, and general solutions may offer a range of possibilities for the task at hand. However, more specific solutions, which specify more precisely a property of the target concept, might also be advantageous.
Certain learning techniques learn a range of hypotheses as solutions to the problem at hand. These hypotheses usually range over two axes: their generality and their predictive accuracy over the set of examples supplied. We have mentioned that machine learning algorithms are applied to predicting the categorisation of unseen examples, and this is also how learning techniques are evaluated and compared. Hence, a machine learning algorithm must choose a single hypothesis, so that it can use this hypothesis to predict the category of an unseen example.
The overriding force in machine learning assessment is the predictive accuracy of learned hypotheses over unseen examples. The best bet for predictive accuracy over unseen examples is to choose the hypothesis which achieves the best accuracy over the seen examples, unless it overfits the data (as explained later). Hence, the set of hypotheses to choose from is usually narrowed down straight away to those which achieve the best accuracy when used to predict the categorisation of the examples given to the learning process. Within this set, there are various possibilities for choosing the candidate to use for the prediction.
Often, Occam's Razor is called into effect: the simplicity of the hypotheses is evaluated and the simplest one is chosen. However, it is worth noting that there are other reasons why a particular hypothesis may be more useful than another, as discussed further below.
To start our exploration of machine learning techniques, we shall look at a very simple method which searches through hypotheses from the most specific to the most general. This is the FINDS (find specific) method described by Tom Mitchell (the author of the standard machine learning text). To use this method, we need to choose a structure for the solutions to the machine learning task. Once we have chosen a structure, our learning agent first finds all the possible hypotheses which are as specific as possible. How we measure how specific a hypothesis is depends on the representation scheme used. In first order logic, for instance, a more general hypothesis will have more variables in the logic sentence describing it than a less general one.
The FINDS method takes the first positive example P_{1} and finds all the most specific hypotheses which are true of that example. It then takes each hypothesis H and sees whether H is true of the second positive example P_{2}. If so, that is fine. If not, then more general hypotheses, H' are searched for so that each H' is true of both P_{1} and P_{2}. Each H' is then added to the list of possible hypotheses. To construct more general hypotheses, each old hypothesis is taken and the least general generalisations are constructed. These are such that there is no more specific hypothesis which is also true of the two positives. Note that, if we are using a logic representation of the hypotheses, then all that is required to find the least general generalisation is to keep changing ground terms into variables until we arrive at a hypothesis which is true of P_{2}. Because the more specific hypothesis that we used to generalise from was true of P_{1}, then the generalised hypothesis must also be true of P_{1}.
Once this process has been exhausted for the second positive, the FINDS method takes the enlarged set of hypotheses and does the same generalisation routine using the third positive example. This continues until all the positives have been used. Of course, it is then necessary to start the whole process with a different first positive.
Only once it has found all the possible hypotheses, ranging from the most specific to the most general, does the FINDS method check how good the hypotheses are at the particular learning task. For each hypothesis, it checks how many examples are correctly categorised as positive or negative, and the hypotheses learned by this method are those which achieve highest predictive accuracy on the examples given to it. Note that because this method looks for the least general generalisations, it is guaranteed to find the most specific solutions to the problem.
This method is best demonstrated by example. Suppose we have a bioinformatics application in the area of predictive toxicology, as described in the box below.
Drug companies make their money out of developing drugs which can cure, vaccinate against and alleviate the suffering from certain illnesses. For each disease, they have many leads: drugs which may turn out to be useful against the disease. Unfortunately, after some development, it becomes obvious (sometimes as late as human trials) that some of the leads are toxic to humans. Of course, this usually means that the drug will be abandoned and the money spent on developing the drug will have been wasted. Therefore, it is highly advantageous for drug companies to determine, at any stage of development, whether a given new drug will turn out to be toxic: they want to predict toxicology. Because they have examples of drugs in a family similar to one which is under investigation, they can use the drugs which turned out to be toxic as positives, and those which didn't as negatives, and this problem can be stated as a machine learning problem. This is one of the places where Artificial Intelligence overlaps with biology/chemistry, and forms part of the rapidly growing area of bioinformatics.
Suppose further that we have been given 7 drugs, 4 of which are known to be toxic and 3 of which are known to be nontoxic, as drawn below:
[Figure: the seven drug molecules, with the four positives (P1 to P4) drawn on the left and the three negatives (N1 to N3) on the right.]
The chemists think that the toxicity might be caused by a substructure of the molecules consisting of three atoms joined by two bonds, for example: c-c-h or c-c-n. Structure, rather than the actual chemicals involved, sometimes plays a more important part in the activity of drugs, so the chemists also suggest that we look for generalisations, i.e., substructures where some of the chemicals are not known, for example: c-?-n or c-?-c.
Hence, to solve this problem, we can use a FINDS method where the solutions are simply triples of letters < A, B, C >, where A, B and C are taken from the set of chemical letters {c, h, n, o, ?}. We include the ? so that we can find more general solutions. For instance, the solution < c, ?, n > means that the agent has learned that the toxic chemicals have a substructure consisting of a carbon atom bonded to something which is in turn bonded to a nitrogen atom, and that the nontoxic chemicals do not. This isn't, of course, a good solution, because it is true of only 2 out of 4 positives and is also true of 1 of the 3 negatives.
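Checking whether such a triple hypothesis is true of a given three-atom substructure is just a matter of treating ? as a wildcard. The sketch below is illustrative only (the `matches` helper is our own, not part of the notes):

```python
def matches(hypothesis, triple):
    """True if the hypothesis triple is true of the given atom triple,
    where '?' in the hypothesis matches any atom."""
    return all(h == '?' or h == t for h, t in zip(hypothesis, triple))

# < c, ?, n > is true of a carbon-oxygen-nitrogen substructure...
print(matches(('c', '?', 'n'), ('c', 'o', 'n')))  # True
# ...but not of a hydrogen-carbon-nitrogen one.
print(matches(('c', '?', 'n'), ('h', 'c', 'n')))  # False
```

A hypothesis is then true of a molecule if it matches at least one of the molecule's atom triples.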
To design our search strategy, we start with the simple fact that any concept learned will be true of at least one positive (toxic) drug. If we look at P1, then there are only two triples of atoms in the molecule (if we do not allow a triple to be written backwards):
< h, c, n > and < c, n, o >
We now see whether these substructures are also found in P2, and generalise them if not. Firstly, the structure < h, c, n > is not a substructure of P2, so we will need to generalise it. To do this, we should introduce one variable only, in such a way that the generalised structure is found in P2. By generalising only one variable, we will find only the least general generalisations. In this case, only the following generalised substructure is true of P2:
< h, c, ? >
If we now look at < c, n, o > then it is also not found in P2 but it can be generalised to:
< c, ?, o >
which is true of P2. Our set of candidate hypotheses now contains these four: < h, c, n >, < c, n, o >, < h, c, ? > and < c, ?, o >. We now turn to P3 to generalise these further, which gives us these nine possible hypotheses:
< h, c, n >, < ?, c, n >, < h, c, ? >, < h, ?, ? >, < ?, c, ? >, < c, n, o >, < c, ?, o >, < c, ?, ? > and < ?, ?, o >
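The generalisation step used above, which replaces exactly one ground atom with the wildcard, can be sketched as follows (an illustrative sketch, not code from the notes):

```python
def least_generalisations(hypothesis):
    """All hypotheses obtained by turning exactly one ground atom into
    the wildcard '?' - the candidate least general generalisations."""
    results = []
    for i, atom in enumerate(hypothesis):
        if atom != '?':
            generalised = list(hypothesis)
            generalised[i] = '?'
            results.append(tuple(generalised))
    return results

print(least_generalisations(('h', 'c', 'n')))
# [('?', 'c', 'n'), ('h', '?', 'n'), ('h', 'c', '?')]
```

Of the three candidates produced for < h, c, n >, only < h, c, ? > is a substructure of P2, so only it is kept.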
Using P4 to generalise these, we do not get any more possible hypotheses. We now need to check the accuracy of these hypotheses and choose the best. The table below scores the hypotheses in terms of their predictive accuracy over the given examples:
Hypothesis        Positives true for    Negatives true for    Accuracy
1. < h, c, n >    P1                    N2                    3/7 = 43%
2. < c, n, o >    P1                    (none)                4/7 = 57%
3. < h, c, ? >    P1, P2, P3            N1, N2                4/7 = 57%
4. < c, ?, o >    P1, P2, P3            (none)                6/7 = 86%
5. < ?, c, n >    P1, P3, P4            N1, N2                4/7 = 57%
6. < h, ?, ? >    P1, P2, P3            N1, N2, N3            3/7 = 43%
7. < ?, c, ? >    P1, P2, P3, P4        N1, N2, N3            4/7 = 57%
8. < c, ?, ? >    P1, P2, P3, P4        N1, N2, N3            4/7 = 57%
9. < ?, ?, o >    P1, P2, P3            N1, N3                4/7 = 57%
Hence, the best hypothesis learned by this method is number 4. This hypothesis states that the toxic substances have a submolecule consisting of a carbon atom joined to some other atom, which is in turn joined to an oxygen atom. This correctly predicts the toxicity of 6 out of 7 of the given examples, so it scores 86% for predictive accuracy over the given examples.
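The accuracy figures can be reproduced from the coverage recorded in the table: an example is correctly categorised if it is a positive the hypothesis is true of, or a negative the hypothesis is not true of. A small sketch, using the coverage sets from the table:

```python
positives = {'P1', 'P2', 'P3', 'P4'}
negatives = {'N1', 'N2', 'N3'}

def accuracy(pos_covered, neg_covered):
    """Fraction of the seven examples correctly categorised: positives
    the hypothesis is true of, plus negatives it is not true of."""
    correct = len(pos_covered) + len(negatives - neg_covered)
    return correct / (len(positives) + len(negatives))

# Hypothesis 4, < c, ?, o >: true of P1, P2, P3 and of no negatives.
print(round(accuracy({'P1', 'P2', 'P3'}, set()), 2))  # 0.86
# Hypothesis 1, < h, c, n >: true of P1 and, wrongly, of N2.
print(round(accuracy({'P1'}, {'N2'}), 2))             # 0.43
```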
Note that, to finish the FINDS method, the whole procedure would have to be repeated using P2 as the first positive, and generalising using P1, P3 and P4. Once the possible solutions had been collected using this, then the procedure would be repeated again using P3 as the first positive, and so on.
Disclaimer: Please note that this is an entirely fabricated example. The chemists among you will have no doubt noticed that some of the drugs are not even valid chemicals and the existence of the learned substructure has nothing to do with toxicity. For a real example of machine learning being used in predictive toxicology, see the results from Inductive Logic Programming on mutagenesis data HERE.
Many machine learning problems have binary categorisations, where the question is to learn a way of sorting unseen examples into one of only two categories, known as the positive and negative categories. [Note that this is not to be confused with supplying positive and negative examples]. Suppose an agent has learned a method to perform a binary categorisation, and suppose further that it is given an example which it categorises as positive using its learned method. In this case, if the example should have been categorised as negative, then we say this is a false positive: the learned method has falsely categorised the example as positive. Similarly, if the method categorises an example as negative, but this is incorrect, this is a false negative.
In some cases, having a false positive may not be as disastrous as having a false negative, or vice versa. For instance, machine learning techniques are used to diagnose whether patients have a particular illness, given their symptoms as background information. Here, it may be the case that the doctors don't mind false positives as much as false negatives, because a false negative means that someone with the disease has been incorrectly diagnosed as not having the disease, which is perhaps more worrying than a false positive. Of course, if the medicine used to treat patients had severe side-effects (or was very expensive), then it is possible that the doctors may prefer false negatives to false positives.
To calculate the predictive accuracy of a particular hypothesis over the set of examples supplied, we simply have to calculate the percentage of examples which are correctly classified as either positive or negative. Suppose we are given 100 positives and 110 negatives and our learning agent learns a hypothesis which correctly categorises 95 positives and 98 negatives. We can therefore calculate that, given any of the examples, positive or negative, the hypothesis has a 92% chance of correctly categorising it. This is because:
(95 + 98)/(100 + 110) = 0.919 (to 3 d.p.)
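In code, the calculation is simply:

```python
correct = 95 + 98    # correctly categorised positives plus negatives
total = 100 + 110    # all examples supplied
print(round(correct / total, 3))  # 0.919
```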
It is very important to remember, however, that this gives us only a weak indication of how likely the hypothesis is to correctly categorise an example it has not seen before. To see this point, think how easy it would be to program an agent to find a hypothesis which correctly classifies all the examples for a particular learning problem. Suppose it was given positives P1, P2, ..., Pn; then a "good" hypothesis it could learn would be something like:
A is positive if A is P1 or A is P2 or ... or A is Pn
A is negative otherwise.
This would score 100% in terms of predictive accuracy over the set of examples given. Imagine, however, how badly this hypothesis would perform when used to predict whether a new example was positive or negative: it would always predict that the new example is negative.
A standard machine learning technique is to separate the set of examples into a training set and a test set. The training set is used in order to produce hypotheses, and the test set (which is never seen during the hypothesis forming stage) is used to test the accuracy of the hypothesis in predicting the categorisation of unseen examples. In this way, we can have more confidence that the learned hypothesis will be of use to us when we have a genuinely new example for which we do not actually know the categorisation.
There are various ways in which to separate the data into training and test sets, and established ways by which to use the two sets to assess the effectiveness of a machine learning technique. In particular, we use n-fold cross validation to test the predictive accuracy of machine learning methods over unseen examples. To do this, we randomly partition the set of examples into n equal-sized subsets. A partition of a set is a collection of pairwise disjoint subsets whose union is the whole set; that is, no element of one subset is an element of another, and every element of the set appears in exactly one subset.
For each set in the partition, we hold back that set as the test set, and use the examples in the other n-1 subsets to train our learning agent. Once the learning agent has learned a hypothesis to explain the categorisation into positives and negatives over the training set, we determine the percentage of examples in the test set which are correctly categorised by the hypothesis. Each set is held back in turn, and the predictive accuracy of the learned hypothesis over that test set is recorded. To produce a final calculation of the n-fold cross validation predictive accuracy, an average over all the percentages is taken.
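The procedure can be sketched as follows, assuming the caller supplies a `learn` function (training set to hypothesis) and an `accuracy` function (hypothesis and test set to a score); both names are ours, for illustration:

```python
import random

def n_fold_cross_validation(examples, n, learn, accuracy):
    """Estimate predictive accuracy by n-fold cross validation:
    partition the examples into n subsets at random, hold each one
    back in turn as the test set, train on the rest, and average
    the test-set accuracies."""
    examples = examples[:]
    random.shuffle(examples)                    # random partition
    folds = [examples[i::n] for i in range(n)]  # n roughly equal subsets
    scores = []
    for i in range(n):
        test_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i
                        for e in fold]
        hypothesis = learn(training_set)
        scores.append(accuracy(hypothesis, test_set))
    return sum(scores) / n                      # average over the n folds
```

For ten-fold cross validation over a list of labelled examples, one would call `n_fold_cross_validation(examples, 10, learn, accuracy)`.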
For learning methods which produce multiple competing hypotheses for a learning task, in order to perform the cross validation, the method must be forced to choose a single hypothesis after each learning session. As mentioned above, this could be in terms of generality, or in terms of Occam's razor  if two hypotheses have the same predictive accuracy over the training set, then Occam suggests we choose the least complicated one. In cases where everything is equal between hypotheses, a learning method may have to resort to randomly choosing one of a set.
Note that we are assessing the method for learning hypotheses with n-fold cross validation, rather than particular hypotheses learned by the method (because with each test set held back, the method may learn a different hypothesis). The cross-validation measure therefore gives us an estimate of the likelihood that, given all possible data to train on, and given a genuinely unknown example, the method will learn a hypothesis which will correctly categorise the new example.
n-fold cross validation is a useful method when data is limited to, say, a few hundred examples. Often, 10-fold cross validation is used. When the data is even more limited (to fewer than around 30 examples), leave-one-out cross validation is used: for each test, a single example is held back, and the learned hypothesis is tested to see whether it correctly categorises that example. For larger datasets, cross validation may be unnecessary, and a hold-back method may be employed. In this case, a certain number of examples are held back as the test set, and the learned hypothesis is tested on them. The hold-back set is usually chosen randomly.
One very simple learning method is to look at the training examples, and see which class is larger, positives or negatives, and to construct the hypothesis that an example is always categorised as being a member of the larger class. This trivial method is called majority class categorisation and is a yardstick against which we can test machine learning results: if a method cannot produce hypotheses better than the default categorisation, then it really isn't helping much. We say that a machine learning method is overfitting a particular problem if it produces a hypothesis H, and there is another hypothesis which scores worse than H on the training data, but better than H on the test data. Overfitting is clearly not desirable, and is a problem with all machine learning techniques. Overfitting is often referred to as the problem of memorising answers rather than generalising concepts from them.
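The majority class yardstick is trivial to implement; a sketch, assuming examples are (name, label) pairs labelled '+' or '-':

```python
def majority_class(training_examples):
    """Learn the trivial majority-class hypothesis: always predict
    whichever category is larger in the training set."""
    pos = sum(1 for _, label in training_examples if label == '+')
    neg = len(training_examples) - pos
    return '+' if pos >= neg else '-'

train = [('e1', '+'), ('e2', '+'), ('e3', '-')]
print(majority_class(train))  # '+' : every unseen example is predicted positive
```

Any learning method worth using should beat this baseline's predictive accuracy on the test set.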
Machine learning and statistics have much overlap. In particular, some AI techniques (in particular neural networks) can be seen as statistical learning techniques. Also, machine learning draws heavily on statistics in the evaluation of techniques given notions about the data being used. Moreover, certain machine learning techniques, for instance ILP, draw from the statistical theory of probability distributions. And there are many statistical methods which perform prediction tasks such as those undertaken by learning algorithms. 
The remaining lectures on machine learning can be described in terms of the representation over which the learning will take place.
Lecture 11: Decision trees.
Lecture 12: Neural Networks.
Lecture 13: Logic Programs.
Some other representations which are very popular in AI are Bayes nets, Hidden Markov Models and Support Vector Machines. Unfortunately, we don't have time to cover them in this course.