The current invention is generally related to a method of and a system for extracting elements from a document, and more particularly related to adaptively modifying model data for extracting the elements.
In prior attempts at extracting elements from documents, a number of models have been considered. For example, Japanese Patent Laid Publication Hei 8-287189 disclosed that a user prepares a set of extraction rules for extracting certain format information as well as a set of document formats. Based upon the extraction rules, certain predetermined elements containing the desired information are extracted from an input document image. However, since the extraction rules are specific to a particular document type, a new set of extraction rules must be implemented for each new document type. Similarly, for any modification to an existing document type, the existing extraction rules must also be modified.
Another exemplary attempt includes Japanese Patent Laid Publication Hei 5-159101, which discloses a graphical representation scheme. An input document is broken down into elements, and the relative positional relationships among these elements are represented by a graphically linked model. Because of the reliance on the relative positional relationships among the elements, when a single element is mismatched, the remaining elements are also likely to be mismatched.
In the above-described prior attempts at matching, it is assumed that for a given input document type, there is no significant variation in each element. In practice, however, even when the same document type is used, the font may differ among input documents. One reason may be that the input documents have been printed by more than one printer. Such variations may cause errors in extracting specified elements from input documents of a given type. In this regard, Japanese Patent Application Hei 10-145781 has disclosed a method of and a system for updating a characteristic value of an element in an extraction model when a variable amount between the extracted characteristic value and the model characteristic value exceeds a predetermined threshold value. However, this simplistic approach turns out to be too susceptible to noise in the input documents. In order to substantially reduce errors in extracting elements, the criteria or extraction rules need to be adaptive enough to accommodate both variations and noise.
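The noise susceptibility of a threshold-triggered update can be illustrated with a minimal sketch. The update rule of the cited application is not detailed here, so the sketch assumes that the model value is simply replaced by the extracted value once the difference exceeds the threshold. The function names and the font-size values are hypothetical.

```python
def threshold_update(model_value, extracted_value, threshold):
    """Return the new model value under an assumed replace-on-threshold rule."""
    # Assumption: the "variable amount" is the absolute difference between
    # the extracted characteristic value and the model characteristic value.
    if abs(extracted_value - model_value) > threshold:
        return extracted_value
    return model_value


def trace_updates(initial_value, extracted_values, threshold):
    """Apply the rule to a sequence of documents and record the model value."""
    model_value = initial_value
    trace = []
    for extracted in extracted_values:
        model_value = threshold_update(model_value, extracted, threshold)
        trace.append(model_value)
    return trace
```

With an initial font size of 10.0, a threshold of 2.0, and extracted sizes of 10.1, 9.9, 14.0, and 10.0, the recorded model values are 10.0, 10.0, 14.0, and 10.0. The single noisy reading of 14.0 overwrites the model immediately, which is the kind of behavior the paragraph above describes as too susceptible to noise.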
In order to solve the above and other problems, according to a first aspect of the current invention, a method of extracting one or more elements from a document using model data, the model data including at least a template, includes: a) determining the template for a predetermined document type, the template having a set of predetermined characteristics for each of the elements; b) inputting at least one input document; c) extracting the elements having the predetermined characteristics from the input document according to the model data; d) storing the extracted characteristics of the elements in the model data; e) determining a distance value between the stored characteristics and the corresponding characteristics in the model data; f) determining a variable amount based upon the distance value for each of the elements; and g) modifying the model data based upon the variable amount.
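Steps a) through g) can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: characteristics are assumed to be numeric tuples, the distance value is assumed to be Euclidean, the stored characteristics are averaged before comparison, and the variable amount is assumed to be a fixed fraction (`rate`) of the distance. The names `ElementModel`, `process_document`, `update_model`, and `rate` are hypothetical.

```python
from dataclasses import dataclass, field
from math import dist
from statistics import fmean


@dataclass
class ElementModel:
    """Model data for one element (hypothetical layout)."""
    template: tuple                               # step a) predetermined characteristics
    stored: list = field(default_factory=list)    # step d) extracted characteristics


def process_document(model, candidates):
    """Steps b) to d): extract each element from one input document and store it."""
    extracted = {}
    for name, entry in model.items():
        # Assumed matching rule: the candidate nearest to the template wins.
        best = min(candidates, key=lambda c: dist(c, entry.template))
        entry.stored.append(best)
        extracted[name] = best
    return extracted


def update_model(model, rate=0.5):
    """Steps e) to g): distance value, variable amount, model modification."""
    for entry in model.values():
        if not entry.stored:
            continue
        # Averaging the stored samples is an assumed way to dampen noise.
        mean = tuple(fmean(axis) for axis in zip(*entry.stored))
        distance = dist(mean, entry.template)     # step e)
        if distance == 0:
            continue
        variable_amount = rate * distance         # step f), assumed linear in distance
        # Step g): move the template toward the mean by the variable amount.
        entry.template = tuple(
            t + variable_amount * (m - t) / distance
            for t, m in zip(entry.template, mean)
        )
        entry.stored.clear()
```

For example, with a template of (10.0, 20.0) for a "title" element and candidates (12.0, 20.0) and (50.0, 60.0), `process_document` selects (12.0, 20.0), and `update_model` with `rate=0.5` moves the template to (11.0, 20.0). Deriving the variable amount from an average over several stored samples, rather than from a single document, is one possible way to accommodate variation without reacting to every outlier.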
According to a second aspect of the current invention, a system for extracting one or more elements from a document using model data includes: a model generation unit for generating model data for a predetermined document type, the model data including at least a template, the template having a set of predetermined characteristics for each of the elements; a document input unit for inputting at least one input document; an element extraction unit connected to the model generation unit and the document input unit for extracting the elements having the predetermined characteristics from the input document according to the model data; a characteristics storage unit connected to the element extraction unit for storing the extracted characteristics of the elements in the model data; a learning process unit connected to the characteristics storage unit for determining a distance value between the stored characteristics and the corresponding characteristics in the model data and for determining a variable amount based upon the distance value; and a model updating unit connected to the learning process unit and the model generation unit for modifying the model data based upon the variable amount.
These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and forming a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying descriptive matter, in which there is illustrated and described a preferred embodiment of the invention.