Imperial College Computational Bioinformatics Laboratory (CBL)
This page contains:-
Scientific Knowledge Discovery using ILP
The pharmaceutical industry is increasingly overwhelmed by large-volume-data. This is generated both internally as a side-effect of screening tests and combinatorial chemistry, as well as externally from sources such as the human genome project. On the other hand the industry is predominantly knowledge-driven. For instance, knowledge is required within computational chemistry for pharmacophore identification, as well as for determining biological function using sequence analysis.
From a computer science point of view, the knowledge requirements within the industry give higher emphasis to ``knowing that'' (declarative or descriptive knowledge) rather than ``knowing how'' (procedural or prescriptive knowledge). Mathematical logic has always been the preferred representation for declarative knowledge and thus knowledge discovery techniques are required which generate logical formulae from data. Inductive Logic Programming (ILP) [8,1] provides such an approach.
ILP algorithms take examples E of a concept (such as a protein family) together with background knowledge B (such as a definition of molecular dynamics) and construct a hypothesis H which explains E in terms of B. For example, in protein fold domains, E might consist of descriptions of molecules separated into positive and negative examples of a particular fold (overall protein shape). This is exemplified in the diagram below for the fold ``4-helical-up-and-down-bundle''.
A possible hypothesis H describing this class of proteins is:
fold('Four-helical up-and-down bundle',P) :- helix(P,H1), length(H1,hi), position(P,H1,Pos), interval(1 =< Pos =< 3), adjacent(P,H1,H2), helix(P,H2).
The hypothesis is a definite clause consisting of a head (fold(..,..)) and a body (the conjunction length(..), .. helix(..)). In this case ``fold'' is the predicate involved in the examples and hypothesis, while ``length'', ``position'', etc. are defined by the background knowledge. A logic program is simply a set of such definite clauses. Each of E, B and H are logic programs.
In the context of knowledge discovery a distinct advantage of ILP over black box techniques, such as neural networks, is that a hypotheses such as that shown above can, in a straightforward manner, be automatically translated into the following piece of English text.
The protein P has fold class ``Four-helical up-and-down bundle'' if it contains a long helix H1 at a secondary structure position between 1 and 3, and H1 is followed by a second helix H2.Such explicit hypotheses can be used within the familiar human scientific discovery cycle of debate, criticism and refutation.
Discovery of biological functionBiological functions are regulated by the docking of small molecules (ligands) with sites on large molecules (proteins). Drugs, such as beta-blockers, mimic natural small molecules, such as adrenaline. Effectiveness of drugs depends on the correct shape and charge distribution of ligands. Thus beta-blockers block the binding of adrenaline, and so stop over-stimulation of heart muscle in patients prone to heart attacks.
Results on scientific discovery applications of ILP are separated below between those related to small molecules (such as ligands) and those related to proteins.
Structure-activity predictionThe majority of pharmaceutical R&D is based on finding slightly improved variants of patented active drugs. This involves laboratories of chemists synthesising and testing hundreds of compounds almost at random. The average cost of developing a single new drug is around $300 million. In  it was shown that ILP system Golem  was capable of constructing rules which accurately predict the activity of untried drugs. Rules were constructed from examples of drugs with known medicinal activity. The accuracy of the rules was found to be slightly higher than traditional statistical methods. More importantly the easily understandable rules provided insights which were directly comparable to the relevant literature concerning the binding site of dihydrofolate reductase.
MutagenesisIn , ILP system Progol  was used to predict the mutagenicity of chemical compounds taken from a previous study in which linear regression had been applied. Progol's predictive accuracy was equivalent to regression on the main set of 188 compounds and significantly higher (85.7% as opposed to 66.7%) on 44 compounds which had been discarded by the previous authors as unpredictable using regression. Progol's single clause solution for the 44 compounds was judged by the domain experts to be a new structural alert for mutagenesis.
PharmacophoresIn a series of ``blind tests'' in collaboration with the pharmaceutical company Pfizer UK, Progol was shown  capable of re-discovering a 3D description of the binding sites (or pharmacophores) of ACE inhibitors (a hypertension drug) and an HIV-protease inhibitor (an anti-AIDS drug).
Last year Progol was entered into a world-wide carcinogenicity prediction competition run by the National Toxicology Program (NTP) in the USA. Progol was trained on around 300 available compounds, and made use of its earlier rules relating to mutagenicity. In the first round of the competition Progol produced the highest predictive accuracy of any automatic system entered .
Protein secondary structure prediction.In  Golem was applied to one of the hardest open problems in molecular biology. The problem is as follows: given a sequence of amino acid residues, predict the placement of the main three dimensional sub-structures of the protein. The problem is of great interest to pharmaceutical companies involved with drug design. For this reason, over the last 20 years many attempts have been made to apply methods ranging from statistical regression to decision tree and neural net learning to this problem. Published accuracy results for the general prediction problem have ranged between 50 and 60%, very close to majority-class prediction rates. In our investigation we found the ability to make use of background knowledge from molecular biology, together with the ability to describe structural relations boosted the predictivity for a restricted sub-problem to around 80% on an independently chosen test set.
Discovery of fold descriptionsProtein shape is usually described at various levels of abstraction. At the lower levels each family of proteins contains members with high sequence similarity. At the most abstract level folds describe proteins which have similar overall shape but are very different at the sequence level. The lack of understanding of shape determination has made protein fold prediction particularly hard. However, it is intriguing that although there are around 300 known folds, around half of all known proteins are members of the 20 most populated folds. In  Progol was applied to discover rules governing these 20 most populated protein folds. Average in class cross-validated prediction was around 70% and many of the rules were judged to be good characterisations of the fold classes by Michael Sternberg, a world-class protein prediction expert at the Imperial Cancer Research Fund in London.
In his statement of the importance of this line of research to the Royal Society  Sternberg emphasised the aspect of joint human-computer collaboration in scientific discoveries. Science is an activity of human societies. It is our belief that computer-based scientific discovery must support strong integration into existing the social environment of human scientific communities. The discovered knowledge must add to and build on existing science. We believe that the ability to incorporate background knowledge and re-use learned knowledge together with the comprehensibility of the hypotheses, have marked out ILP as a particularly effective approach for scientific knowledge discovery.