The preferred embodiment relates to a method and a system for acquiring data from machine-readable documents, the data being assigned to a database. Individual data are extracted from the document as automatically as possible and are entered into corresponding database fields. The method and system relate in particular to the acquisition of data in the case in which data cannot be extracted with the necessary degree of reliability for one or more particular database fields of a document.
Methods and systems for acquiring data from machine-readable documents are known. In the standard situation, the systems have a scanner with which documents are optically scanned. The data files produced in this way are machine-readable documents, and as a rule contain text elements. The text elements are converted into coded text with the aid of an OCR device. As a rule, predetermined forms or templates are assigned to the data files, so that on the basis of the forms, data files containing particular items of information from the text can be determined in a targeted manner. These items of information are stored for example in a database.
Methods and systems of this sort are used for example in large firms in order to read invoices. The data extracted in this way can be communicated automatically to an accounting software program.
Such a system is described in U.S. Pat. No. 4,933,979. This system has a scanner for the optical scanning of forms. In this system, a large number of types of forms can be defined, each type of form or template being defined by a plurality of parameters, in particular geometrically defined areas in which texts or images are to be contained. The form types can also be defined by additional characteristics, such as for example the type of script contained in the texts (letters, numbers, symbols, katakana, kanji, handwriting). After a form has been scanned, a template is assigned to the scanned form using a form type distinguishing device. Correspondingly, the data contained in the text field are read and extracted using an OCR device. If no suitable template exists, it is necessary to create one.
From WO 98/47098, another system is known for the automatic acquisition of data from machine-readable documents. Here, a scanner is used to optically scan forms. Subsequently, a line map of the form is created automatically. Here, on the one hand all lines are acquired, and all graphic elements are converted into a line structure. Other elements, such as for example text sections, are filtered out. All vertical lines form the basis for creating a vertical key, and all horizontal lines form the basis for creating a horizontal key. Subsequently, it is determined whether a template already exists having a corresponding vertical and horizontal key. If this is the case, the data are read out using a corresponding template. If this is not the case, then on the basis of the scanned-in form a template is created and stored using a self-learning mode.
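The lookup-or-learn decision described above can be illustrated with a minimal Python sketch. It is not the implementation of WO 98/47098: the key construction (quantized line positions), the tolerance value, and the names make_key and TemplateStore are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, ...]


def make_key(positions: List[float], tolerance: int = 5) -> Key:
    """Build a key from line positions (hypothetical scheme).

    Positions are quantized so that small scanning offsets map to the
    same key. Positions close to a quantization boundary may still end
    up in different buckets; a real system would need a more robust
    comparison.
    """
    return tuple(sorted({round(p / tolerance) for p in positions}))


class TemplateStore:
    """Maps (vertical key, horizontal key) pairs to template names."""

    def __init__(self) -> None:
        self._templates: Dict[Tuple[Key, Key], str] = {}

    def find_or_learn(
        self,
        vertical_xs: List[float],
        horizontal_ys: List[float],
        new_name: str,
    ) -> Tuple[str, bool]:
        """Return (template name, learned flag).

        If a template with matching keys exists, it is returned.
        Otherwise a new template is stored under new_name, which stands
        in for the self-learning mode.
        """
        key = (make_key(vertical_xs), make_key(horizontal_ys))
        if key in self._templates:
            return self._templates[key], False
        self._templates[key] = new_name
        return new_name, True


store = TemplateStore()
first = store.find_or_learn([100, 300], [50, 200], "invoice_a")
second = store.find_or_learn([101, 299], [51, 199], "unused_name")
third = store.find_or_learn([100, 400], [50, 200], "invoice_b")
```

In this sketch the second scan differs from the first by one pixel per line and is therefore matched to the existing template, while the third scan has a different line structure and triggers the creation of a new template.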
In the book Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto, Addison-Wesley, ISBN 0-201-39829-X, the basic principles of databases and of storing information so that it can be found rapidly in databases are explained. Thus, in Chapter 8.2, a method using inverted data files, also designated an inverted index, is described. In this method, a dictionary containing all the words of the text that is to be examined is first created from that text. Each word in the dictionary is assigned one or more numbers that indicate the locations at which the word occurs in the text. Such inverted data files enable a more rapid automatic analysis of a text that is to be searched. In Chapter 8.6.1, a string matching method is described in which two strings are compared and a cost measure is calculated that decreases as the similarity of the strings increases. If the two strings are identical, the magnitude of the cost measure is zero. The more the strings differ, the greater is the magnitude of the cost measure. The cost measure is thus an expression of the similarity of the two strings. This and similar methods are also known under the names approximate string matching, Levenshtein method, elastic matching, and Viterbi algorithm. These methods are part of the field of dynamic programming.
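Both techniques can be illustrated with a short Python sketch. It is a simplified illustration and not the formulation given in the cited book: the tokenization by regular expression, the use of word positions as location numbers, and the function names build_inverted_index and edit_distance are assumptions. The cost measure shown is the Levenshtein distance with unit costs for insertion, deletion, and substitution.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(text: str) -> Dict[str, List[int]]:
    """Map each word to the word positions at which it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(position)
    return dict(index)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (dynamic programming).

    The result is 0 for identical strings and grows with the number of
    single-character insertions, deletions, and substitutions needed to
    turn a into b.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


index = build_inverted_index("the invoice and the total")
same = edit_distance("invoice", "invoice")
different = edit_distance("kitten", "sitting")
```

Here index["the"] holds the positions 0 and 3, the cost for the identical strings is 0, and the cost for "kitten" and "sitting" is 3.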
In the not-yet-published patent application DE 103 42 594.2, a method and a system for acquiring data from a plurality of machine-readable documents are described in which, from a document that is to be processed—the read document—data are extracted by reading them out at positions in the read document that are determined by fields entered in a master document.
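The position-based reading described above can be sketched in Python as follows. This is an illustration under stated assumptions and not the method of DE 103 42 594.2: it assumes that OCR delivers words with bounding boxes, that the master document stores each field as a rectangle, and that a word belongs to a field when its center lies inside the rectangle. The names Word, Field, and extract are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """An OCR word with its bounding box in the read document."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Field:
    """A field entered in the master document (assumed rectangular)."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, word: Word) -> bool:
        """True if the center of the word lies inside the field."""
        cx = word.x + word.width / 2
        cy = word.y + word.height / 2
        return (self.x <= cx <= self.x + self.width
                and self.y <= cy <= self.y + self.height)


def extract(master_fields: List[Field], words: List[Word]) -> Dict[str, str]:
    """Read out the text found at each master field position."""
    result: Dict[str, str] = {}
    for field in master_fields:
        hits = [w for w in words if field.contains(w)]
        hits.sort(key=lambda w: (w.y, w.x))  # top to bottom, left to right
        result[field.name] = " ".join(w.text for w in hits)
    return result


words = [
    Word("Invoice", 10, 10, 60, 12),
    Word("No.", 75, 10, 25, 12),
    Word("4711", 110, 10, 40, 12),
    Word("Total", 10, 100, 40, 12),
    Word("99.50", 110, 100, 45, 12),
]
fields = [
    Field("invoice_number", 100, 5, 80, 20),
    Field("total", 100, 95, 80, 20),
    Field("date", 300, 5, 50, 20),
]
data = extract(fields, words)
```

In this sketch a field that yields an empty string, such as "date" above, would be a candidate for the error handling in which the read document is displayed to the user.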
If an error occurs during the reading out of the read documents, the read document is displayed on a display screen and the data can be read out simply by marking corresponding fields in the read document. Here, if required, additional master documents are automatically produced on the basis of the marked read documents, or existing master documents are correspondingly corrected. This system is easy enough to use that no special computer or software knowledge is necessary.
A method that supports an operator in the generation of electronic templates for a form recognition system is known from U.S. Pat. No. 5,317,646. For this, a form not provided with data (what is known as a master form) is shown on a screen, and the user can identify the data fields with a pointer device. The coordinates that bound the corresponding region are automatically detected after a single point within this region has been selected by the operator. Templates for the automatic form recognition can be created simply and quickly with this method.
In Casey R. G. et al., “Intelligent Forms Processing”, IBM Systems Journal, vol. 29 (1990), no. 3, pages 435 through 450, a form recognition method is described in which a scanned-in form is analyzed by means of image processing techniques and is compared with stored template forms. In the event that no correlation with a template form is found, a new template form must be generated via input on a computer. In the generation of a template, the scanned form is shown on the screen and the boundary lines of the input fields are marked with a pointer device.
A two-stage method in which form templates are first input and documents are then automatically read out using the input form templates is known from US 2002/141660 A1. Form templates to be input are scanned, and the operator indicates input fields with a cursor. The position and size of the input fields are stored. The operator can also determine the data type associated with each data field. In the automatic reading of forms, the forms are scanned in and automatically read out using the data fields contained in the stored form documents. In the event that an error occurs in the readout, the operator can correct the errors via the keyboard.
U.S. Pat. No. 6,028,970 concerns a method and a system for automatic text recognition (OCR). The system comprises an error correction module (“error correction logic module”). This error correction module is applied to clearly detectable data errors in order to correct them. These corrections are executed automatically. Not only are errors in individual letters detected in this way, but errors in context are also analyzed and correspondingly corrected. An error that cannot be automatically corrected can be communicated to the operator by means of an error message. The operator can then assess and, if applicable, correct the text generated by means of the text recognition.
It is an object to create a method and a system for acquiring data from machine-readable documents in which the inputting of the data is significantly simplified in comparison with the known methods in cases in which data cannot be automatically extracted.
In a method for acquiring data from a machine-readable document for assignment to fields of a database, individual data are extracted substantially automatically from the document and entered into the corresponding database fields. If data cannot be extracted from the document with a desired degree of reliability for one or more particular database fields, then the following steps are executed: displaying the document on the display screen; displaying on the display screen the one or more database fields for which the data cannot be extracted with the desired degree of reliability; and executing a proposal routine with which string sections in the vicinity of a pointer movable by a user on the display screen are selected, marked, and proposed for extraction.
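One conceivable form of such a proposal routine is sketched below in Python. It is an illustration only and not the claimed routine: it assumes that the string sections are available with bounding boxes, that "vicinity" means the smallest distance between the pointer and a bounding box, and that a maximum distance limits the proposals. The names StringSection, distance_to, and propose, as well as the default of 30 screen units, are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class StringSection:
    """A string section of the displayed document with its bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float

    def distance_to(self, px: float, py: float) -> float:
        """Distance from the pointer to the box (0 if the pointer is inside)."""
        dx = max(self.x - px, 0.0, px - (self.x + self.width))
        dy = max(self.y - py, 0.0, py - (self.y + self.height))
        return hypot(dx, dy)


def propose(
    sections: List[StringSection],
    px: float,
    py: float,
    max_distance: float = 30.0,
) -> Optional[StringSection]:
    """Select the string section nearest to the pointer, if close enough.

    The returned section would be marked on the screen and proposed for
    extraction into the database field currently being filled in.
    """
    nearest = min(sections, key=lambda s: s.distance_to(px, py), default=None)
    if nearest is None or nearest.distance_to(px, py) > max_distance:
        return None
    return nearest


sections = [
    StringSection("Total", 10, 100, 40, 12),
    StringSection("99.50", 110, 100, 45, 12),
]
inside = propose(sections, 120, 105)    # pointer inside the "99.50" box
near = propose(sections, 60, 106)       # 10 units to the right of "Total"
far_away = propose(sections, 500, 500)  # nothing within the maximum distance
```

In this sketch the routine would be called on each pointer movement, so that the marked proposal follows the pointer and the user only has to confirm it.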