Information and data processing is required in virtually all areas of human endeavor. The sheer volume of documents that an organization must contend with has become increasingly problematic. The ability to rapidly obtain relevant documents from the huge store of available documents has become a key to an organization's success. Electronic document handling is the foundation for the future of information processing, with millions upon millions of papers being converted into electronic images every day.
Electronic documents such as images, emails, reports, and web pages are generated at a tremendous rate. Indexing and classifying these electronic documents into manageable databases has become a mandatory task; without it, efficient and accurate retrieval of information and data from such documents is impossible. As is well known, the cost of classifying documents manually is extremely high. As the number of documents being digitally captured and distributed in electronic format increases, there is a growing need for techniques and systems to quickly classify digitally captured documents.
At one time, document classification was done manually. An operator would visually scan and sort documents by document type. This process was tedious, time-consuming, and expensive. As computers have become ubiquitous, the quantity of new documents, including on-line publications, has increased greatly, and the number of electronic document databases has grown almost as quickly. At this volume, the old, manual methods of classifying documents are simply no longer practical. Similarly, the conversion of information in paper documents into keyed data is an inefficient process that often involves data entry operators transcribing directly from the original documents.
A great deal of effort in the area of document handling and analysis has been devoted to document management systems and document recognition. Specifically, the areas of page decomposition and optical character recognition (OCR) are well developed in the art. Page decomposition involves automatically recognizing the organization of an electronic document. This usually includes determining the size, location, and organization of distinct portions of an electronic document. For example, a particular page of an electronic document may include data of various types, including paragraphs of text, graphics, and spreadsheet data. Page decomposition software would typically be able to automatically determine the size and location of each particular portion, as well as the type of data found in each portion. Certain page decomposition software goes further than merely determining the type of data found in each portion, and will also determine format information within each portion. For example, the font, font size, and justification may be determined for a block containing text.
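By way of illustration only, the output of a page decomposition step of the kind described above might be represented as in the following Python sketch. The names `TextFormat`, `Block`, `PageLayout`, and `blocks_of_kind`, the pixel units, and the sample values are hypothetical and chosen for this example; they are not taken from any particular page decomposition product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextFormat:
    """Format information that may be determined for a block containing text."""
    font: str
    font_size_pt: float
    justification: str  # assumed values: "left", "center", "right", "full"


@dataclass
class Block:
    """One distinct portion of a page: its type, location, and size."""
    kind: str  # assumed values: "text", "graphic", "spreadsheet"
    x: int     # left edge, in pixels (assumed unit)
    y: int     # top edge, in pixels (assumed unit)
    width: int
    height: int
    text_format: Optional[TextFormat] = None  # present only for text blocks


@dataclass
class PageLayout:
    """The organization of a single page as a collection of blocks."""
    page_width: int
    page_height: int
    blocks: List[Block] = field(default_factory=list)

    def blocks_of_kind(self, kind: str) -> List[Block]:
        """Return every block on the page whose data type matches `kind`."""
        return [b for b in self.blocks if b.kind == kind]


# A hypothetical page holding a paragraph of text, a graphic, and spreadsheet data.
layout = PageLayout(
    page_width=2550,
    page_height=3300,
    blocks=[
        Block("text", 150, 200, 2250, 600,
              TextFormat(font="Times", font_size_pt=12.0, justification="full")),
        Block("graphic", 150, 900, 1000, 800),
        Block("spreadsheet", 150, 1800, 2250, 900),
    ],
)

text_blocks = layout.blocks_of_kind("text")
```

Here `text_blocks` holds the single text block together with its font, font size, and justification, while the graphic and spreadsheet blocks carry only their type, location, and size.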
As may be appreciated, OCR involves converting a digital image of textual information into a form that can be processed as textual information. Since electronically captured documents are often simply optically scanned digital images of paper documents, page decomposition and OCR are often used together to gather information about the digital image and sometimes to create an electronic document that is easy to edit and manipulate with commonly available word processing and document publishing software. In addition, the textual information collected from the image through OCR is often used to allow documents to be searched based on their textual content.
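As a minimal sketch of how OCR output can support searching by textual content, the following Python example builds a simple inverted index over text that is assumed to have already been produced by an OCR step. The OCR itself is not performed here; the function names `build_index` and `search`, the document identifiers, and the sample text are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tokenize(text: str):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_text_by_doc: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of document ids whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_text_by_doc.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output for three scanned documents.
ocr_text = {
    "doc1": "Invoice number 1001 for office supplies",
    "doc2": "Purchase order for office furniture",
    "doc3": "Invoice number 1002 for consulting services",
}

index = build_index(ocr_text)
invoice_docs = search(index, "invoice number")
office_docs = search(index, "Office")
```

In this sketch a query matches a document only if every query word appears in that document's recognized text, so recognition errors in the OCR step would directly reduce what can be found.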
In today's information society, individuals often require information and data relating to them that have been acquired by others, making the freedom and ability to obtain such information a necessity. Organizations, both commercial and governmental, are required to provide such information upon request. However, documents so provided cannot contain confidential and/or otherwise secret information and data. As such, redaction is required before certain documents can be sent out. As may be appreciated, the redaction process is extremely costly, both in time and money, if performed manually. Finally, data capturing (data entry and coding) has become very important. In many situations, data must be captured and populated into databases for data mining, searching, and processing. When performed manually, these tasks are extremely costly.
There have also been a number of systems proposed that deal with classifying and extracting data from multiple document types. There are also systems available for automatically recognizing a candidate form as an instance of a specific form contained within a forms database based on the structure of lines on the form. These systems rely, however, on the fixed structure and scale of the documents involved.
Additionally, expert systems have been proposed using machine learning techniques to classify and extract data from diverse electronic documents. One such expert system proposed is described in U.S. Pat. No. 6,044,375, entitled “Automatic Extraction of Metadata Using a Neural Network.” Since machine learning techniques generally require a training phase that demands a large amount of computational power, such classification systems operate more efficiently if the document type of a new document is known.
From the foregoing it will be apparent that there is still a need for an improved system and process for document recognition that is capable of understanding the contents of electronic documents.