Future Search Engines.

Natural Language Querying...Naturally!

-By Edwin Chang


@ Introduction :

Computers never get jokes. In fact, computers, powerful as they may be, can cope with very few of the functions and subtleties of language that come naturally to us in our daily lives. The first practical application of natural language systems was developed nearly two decades ago, for translating one language into another. Since then, numerous other applications have surfaced, and today interest centres mainly on natural language modules that sit on top of other applications, offering a more user-friendly interface between the user and the machine. The application most likely to benefit from a natural language 'front-end' is the database. Databases typically hold huge amounts of data, which must be stored in complex ways to ensure fast and accurate information retrieval from these monoliths.

@ Current State of Affairs :

The idea of natural language access to databases is neither a new idea nor a fundamentally different approach to information retrieval. Even with over a decade's research and development in natural language processing, it remains a formidable task to enable access to information through simple dialogues. For example, it would not seem too far-fetched for a user to ask simple questions such as "Who was the 1st President of the United States?" or "What elements does water consist of?" and get a simple, direct answer. These questions, though neither very complex in structure nor terribly ambiguous in nature, cannot be handled effectively by the current search techniques of the World-Wide Web. More intelligent search engines will break these queries into boolean keyword searches such as 'President' AND 'United States' or 'elements' AND 'water', while others require the user to formulate the queries in a more formal language. Even so, keyword searches are known to be highly inaccurate and low on hits. Additionally, the user must sift through many answers, which are pointers to articles that must then be read, rather than direct answers to the questions.

Existing search techniques fall into two main categories: manual hyperlink traversal and keyword searches of automatically generated databases. Manual searches provide more precise data retrieval but are limited to a small range of nodes, since users tend to get lost in large search spaces. Keyword searches, on the other hand, involve a pre-compiled database which contains an entry for each word, which in turn contains pointers to all documents containing that word. Users can enter a word or a combination of words to locate relevant documents. The drawbacks, however, are as mentioned before: low hit rates and imprecise information retrieval.
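The pre-compiled database described above is what is now usually called an inverted index. A minimal sketch, with invented documents, might look like this:

```python
# A toy inverted index: one entry per word, each pointing to all
# documents containing that word. Documents here are illustrative.
documents = {
    1: "the first president of the united states",
    2: "water consists of the elements hydrogen and oxygen",
    3: "the president addressed the united nations",
}

# Pre-compile the index from the document collection.
index = {}
for doc_id, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search_all(*keywords):
    """Boolean AND search: documents containing every keyword."""
    sets = [index.get(word, set()) for word in keywords]
    return set.intersection(*sets) if sets else set()

print(search_all("president", "united"))  # documents 1 and 3
```

Note that the query matches document 3 as well as document 1: the index knows only that both keywords occur somewhere in the text, which is precisely the imprecision described above.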

@ The Vision Unfolds :

Probably the most distinct difference between a conventional database and one built for natural language access is its indexing criteria, or emphasis. Unlike a conventional database, the symbolic approach to natural language processing treats words as nodes in a semantic network. Emphasis is placed on the semantics, or meanings, of words rather than on their literal text. An obvious place to start when constructing a natural language processor (NLP) is to build a large dictionary, and much of the recent advance in NLP systems is attributed to advances in the design of large and concise dictionaries. The most common approach to representing word meanings is by how the words are used; this approach is known as case-based reasoning. We often hear a particular word used in many different situations and contexts. Take the word 'on', for example: it can be used in "place on top of", "to switch on", "on legal grounds", "on his way", and so on. Each occurrence of the word is treated as an independent case. Similarities between cases can be analysed and the cases later grouped into categories. Such reasoning therefore involves large databases; for 'on' alone there could be up to 1000 cases. The strength of case-based reasoning is its flexibility: since the database is based on semantics, an imprecise query may still retrieve information which is close to the answer but does not exactly match the query.
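The grouping of cases into categories can be sketched very simply. In this toy version, each recorded use of 'on' is a case tagged with context features, and cases sharing the same features fall into one category; the features themselves are invented for illustration, not drawn from any real lexicon:

```python
# Each use of the word "on" is an independent case: a phrase plus a
# set of context features (invented here for illustration).
cases = [
    ("place on top of", {"spatial"}),
    ("sitting on the shelf", {"spatial"}),
    ("to switch on", {"activation"}),
    ("turn on the light", {"activation"}),
    ("on legal grounds", {"basis"}),
    ("on his way", {"motion"}),
]

# Group cases with identical feature sets into categories.
categories = {}
for phrase, features in cases:
    categories.setdefault(frozenset(features), []).append(phrase)

for features, phrases in categories.items():
    print(sorted(features), "->", phrases)
```

A real case base would compare partial feature overlap rather than exact matches, which is what lets an imprecise query land "close" to an answer.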

Another approach to natural language compatibility is to make incremental improvements to keyword searching. This can be achieved by adding a thesaurus to the database. Queries to such a database are based on boolean search and rely on keywords drawn from the thesaurus. A thesaurus is a dictionary of synonyms and hence allows retrieval of information that is implicitly, as well as explicitly, referenced by the query. A query on 'cars', for instance, would also return results for terms such as 'vehicles' and 'automobiles'. This yields a significant improvement in precision and recall over current search methods.
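Thesaurus-based retrieval amounts to expanding the query with synonyms before the boolean search runs. A minimal sketch, with an illustrative three-entry thesaurus rather than a real lexicon:

```python
# A toy thesaurus: each term maps to the synonyms it implicitly
# references. Entries are illustrative only.
thesaurus = {
    "cars": {"vehicles", "automobiles"},
    "vehicles": {"cars", "automobiles"},
    "automobiles": {"cars", "vehicles"},
}

def expand(query_terms):
    """Return the query terms plus every listed synonym."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return expanded

print(expand({"cars"}))  # includes 'vehicles' and 'automobiles'
```

The expanded term set is then handed to the ordinary keyword search, so documents mentioning 'automobiles' now satisfy a query on 'cars'.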

Further improvements can be gained through syntax analysis. In a natural language, relationships among words are established by its grammar, or syntax. By analysing the syntax of a given query, its structured representation can be compared with patterns stored in the database, giving a more thorough search result. Although thesaurus-based systems recognise implicit references of keywords, they record no relationships whatsoever between those keywords. For example, the query "Does Tom like to drive?" would return a whole host of unrelated statements, since no restriction on the relationship between the keywords 'Tom' and 'drive' is specified - statements such as "Tom hates driving", "Tom went for a drive", "Tom's driving lessons", and so on. With syntax analysis, relationships between keywords can be captured, so that a relation such as "having a chance to" is implicitly linked to concepts such as "being able to", "doing", and so forth. It is clear, therefore, that syntactic analysis is crucial to achieving a higher level of precision and recall in database access through natural language.
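The gain from syntax analysis can be illustrated by matching on relations instead of bare keywords. In this toy sketch, both the query and the stored statements are reduced to hand-built (subject, relation, object) triples standing in for the output of a parser, so "Does Tom like to drive?" no longer matches every sentence that merely mentions 'Tom' and 'drive':

```python
# Stored statements reduced to (subject, relation, object) triples.
# A real system would derive these with a syntactic parser; they
# are hand-built here for illustration.
statements = [
    ("Tom", "likes", "driving"),
    ("Tom", "hates", "driving"),
    ("Tom", "took", "driving lessons"),
]

# Relations that implicitly reference one another, in the spirit
# of the thesaurus approach (illustrative entries).
related = {"likes": {"likes", "enjoys", "loves"}}

def match(subject, relation, obj):
    """Return statements with the same subject, a compatible
    relation, and the same object as the query."""
    wanted = related.get(relation, {relation})
    return [s for s in statements
            if s[0] == subject and s[1] in wanted and s[2] == obj]

print(match("Tom", "likes", "driving"))  # only the first statement
```

A pure keyword search on 'Tom' and 'drive' would have returned all three statements; the relation constraint filters out the two that do not answer the question.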

@ Shape Of Things To Come :

Despite their appeal and functionality, natural language systems have yet to become the norm, not because they are unpopular but because of the enormous complexities and complications involved in constructing such a system. Current systems are so far based on partially assisted searches, in which the user is still required to do some manual browsing, even if only within a local area. For the moment, artificial intelligence groups are researching more compact ways of storing data, more concise representations of word meanings and various alternative approaches to NLP. However, much work still needs to be done before natural language dialogue between man and machine becomes a reality. Until the day C-3POs and Datas invade our workplaces, our modest databases will just have to be accessed by more traditional means.

@ Further Reading :

  1. Beck, Howard W., Mobini, Amir M. and Kadambari, Viswanath.
    "A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries."
    University of Florida.
  2. Wallace, Mark (1984).
    Communicating with Databases in Natural Language.
    Ellis Horwood Limited. Chapters 1, 2, 3 and 5.
  3. Jing, Y. and Croft, W.B. (1994).
    "An Association Thesaurus for Information Retrieval."
    CIIR Information Retrieval Publications, IR-47; UMass Technical Report 94-17.
  4. Croft, W.B. (1995).
    "What Do People Want From Information Retrieval?"
    CIIR Information Retrieval Publications, IR-74; in D-Lib Magazine, November 1995.
  5. Ponte, J. and Croft, W.B. (1996).
    "Useg: A Retargetable Word Segmentation Procedure for Information Retrieval."
    Symposium on Document Analysis and Information Retrieval '96 (SDAIR).
    CIIR Information Retrieval Publications, IR-75; UMass Technical Report TR96-2.