The preferred embodiment concerns a method and system for collection of data from a plurality of machine-readable documents.
Such methods and systems are known. The systems typically comprise a scanner with which originals are optically scanned. The files thereby generated are machine-readable documents and, as a rule, contain text elements. The text elements are converted into encoded text with the aid of an OCR device. As a rule, predetermined forms or templates are associated with the files so that specific, targeted information can be determined from the text-containing files using the forms. This information is, for example, stored in a databank.
Such methods and systems are used, for example, in large companies in order to read bills. The data so extracted can be transferred automatically to business management software.
Such a system is described, for example, in U.S. Pat. No. 4,933,979. This system comprises a scanner for optical scanning of forms. A plurality of form types can be defined in this system, whereby each form type or template is established via a plurality of parameters, in particular geometrically defined regions in which text or images should be contained. The form types or templates can also be defined by further properties, for example the type of writing contained in the texts (alphabet, numbers, symbols, katakana, kanji, handwriting). After a form is scanned, a template is associated with the scanned form by means of a form type differentiation device. The data contained in the text fields are then read and extracted by means of an OCR device. In the event that no suitable template is present, one must be created. This is complicated and requires personnel who are specially trained for this system and who have at least basic knowledge of computer and software technology.
A further system for automatic collection of data from machine-readable documents arises from WO 98/47098. Here, forms are optically scanned by means of a scanner. A line chart of the form is subsequently generated automatically. For this purpose, all lines and all graphical elements are converted into a line structure, while other elements, such as text segments, are filtered out. All vertical lines form the foundation for creation of a vertical key, and all horizontal lines form the foundation for creation of a horizontal key. It is subsequently determined whether a template with a corresponding vertical and horizontal key is already present. If this is the case, the data are read out with the corresponding template. If this is not the case, a template is created from the scanned form and stored by means of a self-learning mode. The user can manually support the creation of the template. Here as well, the user should possess good knowledge of the system, in particular of its software structure, so that templates suitable for operation are created.
The foundations of databanks and of fast retrieval of information stored in databanks are explained in the book Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto, Addison-Wesley Publishing, ISBN 0-201-39829-X. A method using inverted files (also designated as an inverted index) is described there in chapter 8.2. In this method, a dictionary of all words contained in a text to be examined is initially created. Each word of the dictionary is associated with one or more numbers that specify at which points the word occurs in the text. Such inverted files allow a faster, automatic analysis of a text to be searched. A string matching method is described in chapter 8.6.1 with which two strings are compared and a cost measure is calculated that decreases as the similarity of the strings increases. When the two strings are identical, the value of the cost measure is zero. The cost measure is thus an expression of the similarity of the two strings. This and similar methods are also known under the designations approximate string matching, Levenshtein method, elastic matching and Viterbi algorithm. These methods belong to the field of dynamic programming.
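By way of illustration only, the two techniques named above can be sketched as follows. The sketch is not taken from the cited book; the function names `build_inverted_index` and `edit_distance` are hypothetical, whitespace tokenization is assumed for simplicity, and unit costs for insertion, deletion and substitution are assumed for the cost measure.

```python
from collections import defaultdict


def build_inverted_index(text):
    """Map each word of the text to the word positions at which it occurs."""
    index = defaultdict(list)
    # Assumption: words are separated by whitespace and compared case-insensitively.
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)


def edit_distance(a, b):
    """Levenshtein cost measure computed by dynamic programming.

    The result is 0 for identical strings and grows as the strings
    become less similar (unit cost per insertion, deletion, substitution).
    """
    # previous[j] holds the cost of converting the processed prefix of a into b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]
```

In such a sketch, the inverted index allows the positions of a word to be looked up without rescanning the text, and the cost measure allows a word distorted by OCR errors to be matched against an expected word when the cost is small.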
A method for extraction of data fields from scanned documents arises from CASEY R. G. et al.: “Intelligent Forms Processing”, IBM Systems Journal, IBM Corp., USA, Volume 29, No. 3, January 1990, pages 435 to 450, XP000265375, ISSN: 0018-8670. This method is characterized in that background lines and the like can be extracted. Before forms can be processed with this method, a model must be generated for each form type. Such a model of a form type comprises form patterns and a description of each field that is contained in the form. A form pattern is a set of features that are used to differentiate one form type from another. The field descriptions comprise the location of the field in the form. Different methods by which the forms can be detected are disclosed here. In the event that a form is detected, information is also generated that specifies to what extent the positions of the form model and the detected form coincide, so that corresponding deviations can be corrected.
A system for detection of forms arises from Patent Abstract of Japan Volume 1997, No. 07, 31st Jul. 1997 (JP 9 062758 A), in which system forms that are not completely detected are directly stored in an image file. These undetectable forms, stored as image files, can then be manually processed “en bloc”.
U.S. Pat. No. 5,140,650 A discloses a method and a system for optical character recognition (OCR device) in which a blank master form is scanned first and the corresponding digital image is stored. This scanned image is used to generate a template so that corresponding forms can later be read and extracted automatically.
A device for automatic reading of data from forms arises from U.S. Pat. No. 4,933,979 that comprises a scanning device for optical scanning of the forms and output of image data, as well as a storage device for storage of information. A reader is also provided with which the regions of the forms are read out from the image data dependent on form information of a model form. The information of the model form is generated by scanning a model form, whereby the digital image is shown on a screen on which a user can establish the read conditions for each read region. This registration process of the form information is executed for each form type that is to be read later.
A template recognition system that supports the operator in the creation of electronic templates arises from U.S. Pat. No. 5,317,646. The method enables the operator to view on a screen what is known as a master form or blank form, which comprises framed or semi-framed regions that represent fields. The operator can then select an individual point within such a framed or semi-framed region by means of a pointer device, and the coordinates representing the framed region are automatically determined from the single point selected by the operator.
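The idea of deriving the coordinates of a framed region from a single selected point can be illustrated with the following minimal sketch. It is not the method of the cited patent but an assumed simplification: the form is taken to be a binary pixel grid in which 1 marks a frame line, and the hypothetical function `region_from_point` walks outward from the selected point in four directions until a frame line or the image border is reached.

```python
def region_from_point(image, row, col):
    """Return (top, left, bottom, right) of the free area around a point.

    Assumption: image is a list of rows of 0/1 values, where 1 marks a
    frame line pixel. A missing frame side (semi-framed region) simply
    extends the result to the image border in this sketch.
    """
    if image[row][col]:
        raise ValueError("selected point lies on a frame line")
    height, width = len(image), len(image[0])

    top = row
    while top > 0 and not image[top - 1][col]:
        top -= 1
    bottom = row
    while bottom < height - 1 and not image[bottom + 1][col]:
        bottom += 1
    left = col
    while left > 0 and not image[row][left - 1]:
        left -= 1
    right = col
    while right < width - 1 and not image[row][right + 1]:
        right += 1
    return top, left, bottom, right
```

The returned coordinates describe the interior of the region and could, under these assumptions, serve as a read region of a template.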