Text Summarization

With the coming of the information revolution, electronic documents have become a principal medium of business and academic information. Thousands upon thousands of electronic documents are produced and made available on the internet each day. To use these online documents effectively, it is crucial to be able to extract their gist. A text summarization system would thus be immensely useful in serving this need.

In order to generate a summary, we have to identify the most important pieces of information in the document, omitting irrelevant information and minimizing detail, and assemble them into a compact, coherent report. This, however, is easier said than done, as it involves some of the central problems of natural language processing. Producing a domain-independent system would require work in natural language understanding, semantic representation, discourse models, world knowledge, and natural language generation. Successes in domain-independent systems are few and limited to identifying key passages and sentences of a document. More successful systems have been produced for limited-domain applications such as report generation for weather, financial and medical databases.

Domain Specific Approaches

Information Extraction

Information extraction systems analyze unrestricted text in order to extract information specific to a particular domain. Such a system does not attempt to understand all of the text in all input documents, but analyzes the portions of documents that contain relevant information. The relevance of the information is determined by pre-defined domain guidelines, which must specify, as accurately as possible, exactly what types of information the system is expected to find. One way to think of information extraction is in terms of database construction. Here unstructured text documents are converted into classified database entries which are then used to fill a "template". A summary report can then be generated using pieces of canned text around the "template".
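The template-filling idea above can be sketched in a few lines of Python. The domain here (a toy weather-report domain), the slot names and the regex patterns are all illustrative assumptions; a real system would use much richer domain guidelines than simple patterns.

```python
import re

# Hypothetical domain guidelines for a toy weather-report domain,
# expressed as one regex per template slot.
TEMPLATE_PATTERNS = {
    "location": re.compile(r"in ([A-Z][a-z]+)"),
    "temperature": re.compile(r"(\d+)\s*degrees"),
    "condition": re.compile(r"\b(rain|snow|sunshine|fog)\b"),
}

def extract_template(text):
    """Scan unrestricted text and fill each template slot (or None)."""
    template = {}
    for slot, pattern in TEMPLATE_PATTERNS.items():
        match = pattern.search(text)
        template[slot] = match.group(1) if match else None
    return template

def generate_summary(template):
    """Assemble canned text around the filled template slots."""
    return "Weather in {location}: {condition}, {temperature} degrees.".format(**template)

doc = "Forecasters expect heavy rain in Boston today, with highs near 52 degrees."
filled = extract_template(doc)
print(generate_summary(filled))
# -> Weather in Boston: rain, 52 degrees.
```

Note that everything not matching a slot pattern is simply ignored, which mirrors the point above: the system analyzes only the portions of the document that carry domain-relevant information.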

Examples of Existing Information Extraction Systems

A number of Information Extraction Summarization Systems have been developed for specific fields. Even though most of these systems are not currently used on the internet, the potential is great, and deploying such systems on the internet would be relatively simple.

Monitoring Technical/Scientific Literature
Systems have been designed to monitor technical articles in the field of microelectronic chip fabrication. These systems could easily be retargeted to other fields by simply modifying the domain guidelines, and extended for use on the internet, where many such articles can be found.

Extracting from Medical databases
Systems have been designed to analyze and summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results and therapeutic treatments. These systems could help health care providers with quality assurance studies. Central databases accessible from different medical centres could be set up to facilitate the transfer of patients, as well as for emergencies in which medical records are required immediately. Such systems could again be extended to analyze other kinds of databases, e.g. financial statement databases.

Case-Based Approach

Here the input document is matched against a corpus of relevant and irrelevant texts. Instead of requiring an explicit set of domain guidelines from a user, the system simply exploits a "training corpus" of representative texts that a user or domain expert has manually classified as either relevant or irrelevant. These predefined representative texts are matched against the document, using statistical techniques to determine which texts are relevant to the document's domain. Texts that contain only general information are unlikely to be highly correlated with the domain, because similar cases will be found among the irrelevant as well as the relevant texts in the training corpus. Texts that are too specific are also unlikely to correlate, because there will be very few matching cases. Thus, using this statistical technique, only representative texts that contain key domain-specific information will be extracted. These matched relevant texts can then be used to generate a summary of the document.

Basically, the Case-based approach can be thought of as an extension of the basic information extraction system. The problem with the information extraction system is that it retains virtually all the information that is relevant to the domain, without any discrimination between important information on the one hand and details and general information on the other. By including a statistical test in the Case-based approach, we can gauge the importance and relevance of the information being extracted for the summary.
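The statistical matching described above can be illustrated with a minimal sketch. The similarity measure (word overlap) and the scoring rule (average similarity to relevant cases minus average similarity to irrelevant cases) are simplifying assumptions, as are the example corpora; real systems use more sophisticated statistics.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as a multiset of lowercased words."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Dice-style word-overlap similarity between two bags of words."""
    shared = sum(min(a[w], b[w]) for w in a)
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

def relevance_score(sentence, relevant_corpus, irrelevant_corpus):
    """Score a candidate text against the classified training corpus.

    General text matches both corpora and scores near zero; overly
    specific text matches neither; text correlated mainly with the
    relevant cases scores high."""
    s = bag_of_words(sentence)
    rel = sum(similarity(s, bag_of_words(t)) for t in relevant_corpus) / len(relevant_corpus)
    irr = sum(similarity(s, bag_of_words(t)) for t in irrelevant_corpus) / len(irrelevant_corpus)
    return rel - irr

# Tiny hypothetical training corpus for a chip-fabrication domain.
relevant = ["the chip fabrication line uses a new etching process",
            "etching yields improved on the fabrication line"]
irrelevant = ["the company reported quarterly earnings",
              "shares rose after the earnings report"]

print(relevance_score("a new etching process improved yields", relevant, irrelevant))
```

A candidate sentence about etching scores positively here, while a sentence about earnings would score at or below zero, which is exactly the discrimination the statistical test is meant to add.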

Domain Independent Approaches

As pointed out previously, domain-independent text summarization is much more difficult than a domain-specific task. A good summary should include the most relevant information but omit details and irrelevant information. However, different pieces of information will be relevant to different people, depending on their individual interests and needs, i.e. the domain. Thus not many successful, fully operational systems have been developed. In the following, we will look at some of the most common approaches used, as well as some experimental systems that have been produced.

Document Abridgement

Here a summary is produced by deleting irrelevant text from the document, retaining only its key passages and sentences. A typical system consists of two components, the Reader and the Extractor. The Reader reads in the input text and converts it into an internal representation, recording word occurrences and calculating word weights. The Extractor then determines which sentences to include in the summary by analyzing the word and sentence weightings, and generates the summary from the internal representation.
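The Reader/Extractor pipeline above can be sketched as a simple frequency-based abridger. The stopword list, the scoring rule (sum of word weights per sentence) and the sample text are illustrative assumptions; production systems weight words and sentences far more carefully.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it", "that"}

def read(text):
    """Reader: convert input text into an internal representation --
    a sentence list plus word weights from occurrence counts."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return sentences, Counter(words)

def extract(sentences, weights, n=1):
    """Extractor: weight each sentence by the sum of its word weights
    and keep the top n, emitted in original document order."""
    def score(sentence):
        return sum(weights[w] for w in re.findall(r"[a-z]+", sentence.lower()))
    ranked = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in ranked]

text = ("Chip yields fell last quarter. Engineers traced the drop in yields "
        "to a faulty etching step. Yields recovered after the etching step "
        "was replaced. The cafeteria also got a new menu.")
sentences, weights = read(text)
print(" ".join(extract(sentences, weights, n=2)))
```

The sentence about the cafeteria shares few high-weight words with the rest of the document, so it is the first to be abridged away, while the sentences dense in repeated content words survive.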

An example of an experimental system using these methods is the Automatic News Extraction System (ANES) developed by Lisa Rau. The system aims to produce summaries of news from many different sources and has achieved relatively good results despite the constraint that it must be publication-independent. If developed further, this capability would prove extremely useful for categorising and locating information on the internet, by providing summaries for the wide variety of documents available on the net.

Another example of domain-independent text summarization available on the net is the NetSumm web page summarization tool, which is able to highlight the key points of articles as well as abridge documents.