The volume of information content in our modern world is exploding at a staggering pace. In many situations, it is impossible to read, process, summarize, and extract useful meaning fast enough. This is true for text, images, sound, and video.
Information content is stored for two main reasons. First, content is stored for human consumption. For example, web pages, documents, music, and email are stored mainly to be redisplayed for their respective human audiences. Second, content is stored for retrieval and processing by machine. Retrieval and processing by machine requires structured information. Structured information has its data items tagged or encoded with values sufficiently uniform for machine handling. For example, XML (extensible markup language) is a kind of structured information format that is amenable to machine processing and transfer. However, most information is unstructured, and cannot readily be processed by machines. This presents a problem.
For example, consider one form of information that touches many people: email. A denizen of our increasingly interconnected world may send and receive a considerable volume of email. A tiny fraction of this email is structured with tags such as the sender, recipient(s), subject, and receipt time. The structured part of email makes it possible to sort incoming email into categories using search criteria based on the sender, recipients, subject line, etc. However, little is done with the email body. The body and the attachment part of the email remain largely unstructured. As plain text, the body is searchable by wildcards or by regular expressions, while an attachment may be scanned for known viruses. While useful, much more could potentially be done to benefit the user. Examples include sorting email by the contents of the body; identifying emails that contain information of future reference value; identifying emails that have only transient value; identifying emails that require a follow-up action; and identifying emails that convey information that should be extracted, logged, and followed up. Most email users may benefit if the body could be meaningfully tagged so that a machine could perform such identifications.
Just as the speed and volume of email pose a daunting challenge to individuals, similar problems afflict modern organizations in the handling of unstructured content. Companies forge agreements, establish commitments, and fulfill obligations. Legal relationships with customers, suppliers, and partners are represented in contracts. The contracts specify business terms, such as effective date, ending date, prices, quantities, and payment terms. Some contain limitation of liability clauses, while others do not. By being able to automatically scan contracts for the presence of specific types of clauses, companies may gain greater control over, and visibility into, supplier obligations and partner commitments. However, the vast majority of contracts are stored as unstructured content, and in many cases companies continue to store contracts in paper form. With the ability to elicit meaningful interpretations from a contract, exploiting machine-readable contracts as a source of business virtuosity may become a competitive advantage.
Another example is a full-text repository of issued patents, which contains tags for fields such as the inventor name, assignee name, and patent number. These portions of the content are searchable because they are structured information. The claims and description of the patent application are available in electronic form; however, they are typically searchable only by Boolean search criteria. The bulk of a patent description is available as unstructured text. Someone searching must form Boolean expressions related to the presence or absence of chosen words. What is lacking is an effective approach to find explanations and claims that are “similar” to other descriptions. This presents a problem.
One technique for coping with unstructured textual content is text data mining. One approach uses latent semantic analysis. This involves filtering the content to remove a list of “stop” words, such as “of,” “the,” etc. The remaining words are then deemed to be important keywords, and their relative frequencies of appearance within a document are tallied. The numerical values map the document into a vector space, and a dimensionality reduction technique (for example, singular value decomposition (SVD)) identifies the documents that are most similar. Another approach is based on n-grams of characters and words for computing the dissimilarity or, equivalently, the similarity of a target text against a reference text.
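The frequency-based steps described above can be illustrated with a minimal sketch. It assumes a deliberately tiny, hypothetical stop-word list and three made-up sample sentences. It compares documents by the cosine of their keyword-frequency vectors and by the overlap of their character trigrams. The dimensionality reduction step (for example, SVD) is omitted for brevity, so this is not a full latent semantic analysis. The function names are illustrative only and do not come from any particular library.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_frequencies(text):
    """Drop stop words and return each remaining word's relative frequency."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def char_ngrams(text, n=3):
    """Set of character n-grams after collapsing runs of whitespace."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of n-gram sets; 1 minus this is a dissimilarity."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 0.0


# Made-up sample documents, for illustration only.
docs = [
    "The supplier shall deliver the goods on the effective date.",
    "Goods are delivered by the supplier on the effective date.",
    "The patent claims a method of matching images of faces.",
]
vectors = [keyword_frequencies(d) for d in docs]

print(cosine_similarity(vectors[0], vectors[1]))  # shares several keywords
print(cosine_similarity(vectors[0], vectors[2]))  # shares no keywords
print(ngram_similarity(docs[0], docs[1]))
```

In this sketch the first two sentences share several keywords and so score higher than the unrelated pair. Both measures, however, only reflect which words or character sequences appear. Neither recovers a specific value, such as the effective date itself.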
In another approach, used in the realm of plagiarism detection, student program code is analyzed for copied material. Similarly, student term papers are analyzed with respect to subject matter, word choice, and sentence structure to identify the possibility that a given paper has partially or substantially plagiarized its content from a number of known online sources of pre-written term papers.
Other algorithms are used for matching images of faces. Still others compute the similarity of an email to a spam profile.
The field of stylometry assigns numerical measures to attributes of one or more source texts, and performs analyses in order to gain a better understanding of such texts. For example, one application is to determine authorship of disputed works. Researchers have used principal components analysis (PCA) to analyze word frequencies of disputed texts to shed light on authorship.
Frequency-based approaches may not be able to analyze content at a level of detail sufficient to extract term values from textual content. This presents a problem.