Automatic summarization is the process by which a computer system reduces a stored text document to a shorter text document while retaining the most important points of the original. Automatic summarization can be a useful tool for providing information about the contents of a document at a glance, without requiring a user to review the document in depth.
Various techniques exist for performing automatic summarization, including linguistic and non-linguistic techniques. In recent years, text processing algorithms have advanced to the point where a computer system can parse natural-language sentences and determine their structure. Techniques that incorporate these technologies are typically called linguistic techniques. Linguistic techniques commonly involve identifying the parts of speech that appear in a document, such as nouns, verbs, and adjectives. Linguistic techniques also can use a priori information about the relative frequency of words in a given language. By using such techniques, it is possible to provide, for example, a list of words that are unusual in a document. However, linguistic document summarization techniques have a number of downsides. For example, when identifying the words that occur most often in a document, many of those words are highly common “stop words,” such as the word “the” in English, that do not add meaning. Removing stop words requires maintaining a cumbersome blacklist.
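By way of illustration only, the following minimal Python sketch shows how a priori word frequencies and a stop-word blacklist might be combined to list the unusual words in a document. The names STOP_WORDS, BACKGROUND_FREQ, DEFAULT_FREQ, and unusual_words are hypothetical, and the frequency values are invented for the example; they are not drawn from any real corpus or from any particular summarization product.

```python
import re
from collections import Counter

# Hypothetical stop-word blacklist; a practical list would hold hundreds of entries.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "on", "to"}

# Hypothetical a priori relative frequencies for the language (invented values).
BACKGROUND_FREQ = {"cat": 0.001, "sat": 0.0008, "mat": 0.0002, "quantum": 0.00001}
DEFAULT_FREQ = 0.00001  # assumed frequency for words missing from the table


def unusual_words(text, top_n=3):
    """Rank non-stop words by document frequency relative to background frequency."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    if not words:
        return []
    counts = Counter(words)
    total = len(words)

    def ratio(word):
        # How many times more frequent the word is here than the language-wide rate.
        return (counts[word] / total) / BACKGROUND_FREQ.get(word, DEFAULT_FREQ)

    # Highest ratio first; ties are broken alphabetically for a stable result.
    return sorted(counts, key=lambda w: (-ratio(w), w))[:top_n]
```

Even in this small sketch, the result depends on a blacklist and a frequency table that must be maintained separately for each language, which is the burden described above.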
Linguistic document summarization can be further subdivided into supervised and unsupervised techniques. Supervised techniques use a set of sample documents to train, or teach rules to, the summarization engine prior to first use. Unsupervised techniques require no such prior learning. In both cases, complex algorithms can be used to identify significant sentences and then weight those sentences accordingly. This complexity can make supervised or unsupervised linguistic document summarization inappropriate in situations where computing power is limited.
It is possible to provide a document summary without linguistic analysis. For example, several common email clients, including Google Gmail and Microsoft Outlook, provide a short “snippet” when displaying an email. The snippet typically consists of the first few characters or sentences of the email. This approach provides information to the user without requiring extensive computation. However, because the snippet is drawn only from the first few sentences, this approach typically fails to provide information about the contents of the document or email as a whole.
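The snippet approach can be sketched in a few lines of Python. This is an illustrative assumption about how such a snippet might be produced, not the actual implementation used by any email client; the function name snippet and the parameter max_chars are hypothetical.

```python
def snippet(body, max_chars=80):
    """Return the leading text of a message, collapsed onto a single line."""
    # Collapse newlines and repeated spaces so the snippet fits on one line.
    flat = " ".join(body.split())
    if len(flat) <= max_chars:
        return flat
    cut = flat[:max_chars]
    # Prefer to break at the last space so a word is not split in half.
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut) + "..."
```

For example, snippet("Hello   world\nthis is a test", max_chars=13) yields "Hello world...". The computation is trivial, but everything after the cut, here the words "this is a test", is never shown to the user.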
There is, therefore, a need for a document summarization system that, for example, overcomes the drawbacks above.