Many business and legal practitioners must review paper copies of documents to find important information. Many of those paper-based documents have an electronic counterpart, but some do not. In many cases, tens or hundreds of similar paper documents must be reviewed. One common type of document set contains sections of text that are repeated from one document to the next, with important information buried in this boilerplate text. People have difficulty recognizing both boilerplate and important text: the process is tedious, time-consuming, and error-prone. Reviewers also often need to gain an understanding of the types of issues mentioned in each document. Technology to support the full range of required functionality does not currently exist.
Current approaches to these problems fall into two major categories, both of which make simplifying assumptions. First, to process paper documents into readable text, Optical Character Recognition (OCR) software is typically used. However, the accuracy of existing OCR software suffers on the types of documents typical for the application environments we have studied. In these cases, documents have been faxed, copied, mutilated, or written on. On these documents, the word-level accuracy of state-of-the-art OCR software can be 20% or lower. This low accuracy level makes the document unreadable when displayed as recognized text.
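By way of illustration only, word-level accuracy can be estimated by aligning the words recognized by OCR against a hand-corrected reference transcription and counting the reference words that survive. The following Python sketch assumes that definition, which is one common convention and is not specified in this disclosure. The function name `word_accuracy` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, recognized: str) -> float:
    """Fraction of reference words that also appear, in order, in the OCR output.

    Assumption: accuracy = aligned matching words / reference words.
    Other definitions (for example, 1 - word error rate) are also in use.
    """
    ref_words = reference.split()
    rec_words = recognized.split()
    if not ref_words:
        return 1.0 if not rec_words else 0.0
    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not discarded in longer documents.
    matcher = SequenceMatcher(None, ref_words, rec_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hypothetical example: three of six words are misrecognized.
reference = "the lessee shall pay rent monthly"
recognized = "tbe lessee sha11 pay rcnt monthly"
print(word_accuracy(reference, recognized))  # prints 0.5
```

Even at the 50% level of this small example, half of the displayed words would be wrong, which suggests why output at the accuracy levels described above is difficult to read as plain recognized text.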
The second approach to address these problems is to use text processing, change tracking, document management, search, indexing, and summarization tools. There are several deficiencies in these tools. Some of them work only with electronically produced documents, while the example embodiments described herein address both paper and electronic documents. Others support only a single file format. Text analysis tools cannot read images, and even applying them to the result of OCR would reduce their accuracy and usefulness dramatically. Finally, tools that find differences between text segments in documents usually limit the extent of their search (e.g., they do not search in pages far away from the current page) when looking for matching segments of text. They also do not typically support the recognition of repeated text as needed, or the comparison of tabular and multi-dimensional information.
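For illustration, a search that does not limit its extent can be sketched as an exhaustive comparison of every segment of one document against every segment of the other, regardless of page distance. The sketch below is a minimal Python example under stated assumptions: segments are plain strings, similarity is the word-level ratio from the standard `difflib` module, and the name `best_matches` and the threshold of 0.8 are hypothetical choices. It is not the method of any particular existing tool or of the embodiments described herein.

```python
from difflib import SequenceMatcher


def best_matches(segments_a, segments_b, threshold=0.8):
    """For each segment in A, find its most similar segment anywhere in B.

    Returns a list of (index_in_a, index_in_b, similarity) tuples for
    pairs whose similarity meets the (assumed) threshold.
    """
    matches = []
    for i, seg_a in enumerate(segments_a):
        best_j, best_ratio = None, 0.0
        # Every segment of B is a candidate, however far it is from position i.
        for j, seg_b in enumerate(segments_b):
            ratio = SequenceMatcher(
                None, seg_a.split(), seg_b.split(), autojunk=False
            ).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_ratio >= threshold:
            matches.append((i, best_j, best_ratio))
    return matches


# Hypothetical documents in which the matching text has moved far apart.
doc_a = [
    "payment is due on the first day of each month",
    "the tenant may not sublet",
]
doc_b = [
    "unrelated opening text here",
    "another unrelated paragraph",
    "the tenant may not sublet",
    "payment is due on the fifth day of each month",
]
for i, j, ratio in best_matches(doc_a, doc_b):
    print(i, j, round(ratio, 2))
```

The exhaustive comparison costs time proportional to the product of the two segment counts, which is one plausible reason existing tools restrict the search window.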
There has been much related work in computational linguistics and related fields applying statistical and machine learning techniques to natural language processing tasks. Some of this work is reported in Manning, C. et al., “Foundations of Statistical Natural Language Processing,” The MIT Press (1999), the disclosure of which is hereby incorporated herein by reference in its entirety. Many approaches from machine learning involve building or training some sort of classifier to help make decisions about documents and the words or sentences they contain. Classifiers are statistical or symbolic models for dividing items (also called examples) into classes (also called labels), and are standard tools in artificial intelligence and machine learning.
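As a concrete illustration of a statistical classifier, the following Python sketch trains a small multinomial naive Bayes model with add-one smoothing to assign sentences (the examples) to classes (the labels). Naive Bayes is chosen here only because it is compact. The class name, the labels "boilerplate" and "important", and the training sentences are all hypothetical and are not taken from this disclosure or from the cited reference.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Minimal multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.label_counts = Counter()            # training examples per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            label_total = sum(self.word_counts[label].values())
            for word in words:
                numerator = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(numerator / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: (example, label) pairs.
clf = NaiveBayesClassifier()
clf.train([
    ("this agreement shall be governed by the laws of the state", "boilerplate"),
    ("all notices shall be in writing", "boilerplate"),
    ("tenant shall pay rent of 2500 dollars per month", "important"),
    ("the lease term ends on june 30", "important"),
])
print(clf.classify("monthly rent of 3000 dollars"))  # prints important
print(clf.classify("governed by the laws"))          # prints boilerplate
```

A practical classifier would be trained on far more examples and would typically use richer features than single words. The sketch is intended only to show the division of items into classes.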
To address the deficiencies discussed above, it would be desirable to provide a system and method for comparing and viewing electronic and paper-based text documents that is both accurate and efficient, that supports multiple file formats including scanned paper documents, that searches for similar text liberally within two documents, and that aids the user in analyzing each respective text document.
It should be noted that the figures are not drawn to scale and that elements of similar structures or functions are generally represented by like reference numerals for illustrative purposes throughout the figures. It also should be noted that the figures are only intended to facilitate the description of the preferred embodiments of the present disclosure. The figures do not illustrate every aspect of the disclosed embodiments and do not limit the scope of the disclosure.