The current invention is generally related to a user-defined search template for extracting information from documents, and more particularly related to a method and a system for generating a document template based upon a user selection of divided document image areas.
It is generally difficult to extract textual information from non-formatted documents primarily containing alphanumeric characters. Since such documents do not have exactly identical formats, to extract certain information such as a title from similarly formatted documents, at least a relative location in each document has to be determined. In order to efficiently extract textual information of the same type from sufficiently similar documents, the text image must first be divided into areas, and a relevant area must then be identified. In other words, sufficiently similar documents each have corresponding text image areas where the desired information is to be found. To specify the text image areas, predetermined information is generally stored as search templates or extraction criteria.
Prior art attempts at specifying the search templates or the extraction criteria include structural data and extraction rules. The structural data generally account for absolute or relative positional relations of textual image areas, while the extraction rules determine a priority when more than one textual image area is found for a search key. For example, Japanese Patent Publication Hei 5-159101 discloses an elaborate scheme for constructing and maintaining an extraction criteria database for extracting textual elements or textual image areas. The database is organized to maintain logical as well as positional relationships among the textual elements and textual image areas in a specific type of document.
Other prior art attempts include Japanese Patent Publication Hei 8-287189, which discloses an extraction rule for prioritizing candidates to determine a desired textual image area. For example, candidates for a text image area containing a title are prioritized by determining whether any of the candidate areas is centered with respect to the width of the page. The above-described extraction rule is predetermined and stored in an extraction dictionary.
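The centering-based prioritization described above can be illustrated with a minimal sketch. The cited publication's actual rule is not reproduced here; the `Candidate` type, its coordinate fields, and the scoring by distance between midpoints are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate text image area (hypothetical type; coordinates in page units)."""
    name: str
    left: float
    right: float


def centering_offset(candidate, page_width):
    """Distance between the area's horizontal midpoint and the page's midpoint."""
    return abs((candidate.left + candidate.right) / 2 - page_width / 2)


def prioritize_by_centering(candidates, page_width):
    """Order candidates so that the most nearly centered area comes first."""
    return sorted(candidates, key=lambda c: centering_offset(c, page_width))


# Illustrative data only: three areas on a page 600 units wide.
candidates = [
    Candidate("header", left=0, right=200),
    Candidate("title", left=150, right=450),
    Candidate("date", left=400, right=600),
]
ranked = prioritize_by_centering(candidates, page_width=600)
print([c.name for c in ranked])
```

In this sketch the rule is fixed in code, which mirrors the predetermined nature of the extraction dictionary discussed above.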
The above exemplary prior art disclosures rely on a predetermined dictionary concept for storing the search templates or extraction rules in a system. This information is generally document-specific. In other words, for a new type of document, the specific information must be generated and inputted. Since the above prior art systems are not designed to allow an end user to generate and input new search templates or extraction rules, a more efficient system and process are desired for adding the search templates or extraction rules for a new document.
In order to solve the above and other problems, according to a first aspect of the current invention, a method of generating a search template for retrieving information from documents includes: inputting a first document; dividing the first document into areas, the areas including a text area containing text and an image area containing an image; displaying the areas to an end user; selecting at least one of the areas based upon a user-defined input, the user-defined input including a label for the selected area; automatically determining a predetermined set of characteristics of the selected area; and storing the user-defined input and the characteristics as a part of the search template.
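The steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the area division is assumed to have already been performed, the user selection is modeled as a mapping from area index to label, and the set of characteristics (area kind plus position and size relative to the page) is hypothetical, since the text above does not enumerate the predetermined set.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One divided area of the first document (hypothetical representation)."""
    kind: str  # "text" or "image"
    x: int
    y: int
    width: int
    height: int


def determine_characteristics(area, page_width, page_height):
    """Automatically derive characteristics of a selected area.

    The chosen set (kind, relative position, relative size) is an assumption.
    """
    return {
        "kind": area.kind,
        "rel_x": area.x / page_width,
        "rel_y": area.y / page_height,
        "rel_width": area.width / page_width,
        "rel_height": area.height / page_height,
    }


def generate_template(areas, selections, page_width, page_height):
    """Store each user-defined label together with the area's characteristics.

    `selections` maps an area index to the label supplied by the end user.
    """
    template = []
    for index, label in selections.items():
        entry = {"label": label}
        entry.update(determine_characteristics(areas[index], page_width, page_height))
        template.append(entry)
    return template


# Illustrative data only: a page divided into two text areas and one image area.
areas = [
    Area("text", x=100, y=50, width=400, height=40),
    Area("image", x=50, y=150, width=200, height=200),
    Area("text", x=50, y=400, width=500, height=300),
]
# The end user selects the first area and labels it "title".
template = generate_template(areas, {0: "title"}, page_width=600, page_height=800)
print(template)
```

Relative coordinates are used here so that the stored characteristics do not depend on the absolute page size; this is a design choice of the sketch rather than a requirement stated in the text.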
According to a second aspect of the current invention, a system for generating a search template for retrieving information from documents includes: an input unit for inputting a first document; an area dividing unit connected to the input unit for dividing the first document into areas, the areas including a text area containing text and an image area containing an image; a user selection unit connected to the area dividing unit for displaying the areas to an end user and for selecting at least one of the areas based upon a user-defined input; a characteristic extraction unit connected to the user selection unit for automatically extracting a predetermined set of characteristics for the selected areas; and a storage unit connected to the user selection unit and the characteristic extraction unit for storing the predetermined characteristics and the user-defined input as a part of the search template.
According to a third aspect of the current invention, a recording medium contains a computer program for generating a search template for retrieving information from documents, the computer program including the steps of: inputting a first document; dividing the first document into areas, the areas including a text area containing text and an image area containing an image; displaying the areas to an end user; selecting at least one of the areas based upon a user-defined input; providing a predetermined set of user-defined input for the selected areas, the user-defined input including a label for the selected areas; storing the predetermined set of the user-defined input and the characteristics as a part of the search template; inputting a second document which is sufficiently similar to the first document after the storing step; and retrieving information from the second document based upon the specified user-defined input and the characteristics.
According to a fourth aspect of the current invention, a recording medium contains a computer program for generating a search template for retrieving information from documents, the computer program including the steps of: inputting a first document; dividing the first document into areas, the areas including a text area containing text and an image area containing an image; displaying the areas to an end user; selecting at least one of the areas based upon a user-defined input; providing a predetermined set of user-defined input for the selected areas, the user-defined input including a label for the selected areas; storing the predetermined set of the user-defined input and the characteristics as a part of the search template; inputting a second document which is sufficiently similar to the first document after the storing step; and retrieving information from the second document based upon the specified user-defined input and the characteristics.
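The final retrieving step, in which a stored search template is applied to a second, sufficiently similar document, might look like the sketch below. The matching criterion is an assumption: for each template entry, the area of the same kind whose relative position is nearest to the stored position is chosen. The dictionary keys (`label`, `kind`, `rel_x`, `rel_y`, `content`) are hypothetical names introduced for illustration.

```python
def retrieve(template, areas):
    """Return a mapping from each template label to the matched area's content.

    Matching by nearest relative position among areas of the same kind is an
    assumed criterion, not one stated in the text above.
    """
    results = {}
    for entry in template:
        same_kind = [a for a in areas if a["kind"] == entry["kind"]]
        if not same_kind:
            continue  # no area of this kind in the second document
        best = min(
            same_kind,
            key=lambda a: (a["rel_x"] - entry["rel_x"]) ** 2
            + (a["rel_y"] - entry["rel_y"]) ** 2,
        )
        results[entry["label"]] = best["content"]
    return results


# Illustrative data only: a stored template with one labeled area.
template = [{"label": "title", "kind": "text", "rel_x": 0.17, "rel_y": 0.06}]

# Divided areas of a second document whose layout differs slightly.
second_document_areas = [
    {"kind": "text", "rel_x": 0.15, "rel_y": 0.07, "content": "Quarterly Report"},
    {"kind": "image", "rel_x": 0.10, "rel_y": 0.20, "content": "<logo>"},
    {"kind": "text", "rel_x": 0.10, "rel_y": 0.50, "content": "Body text"},
]
extracted = retrieve(template, second_document_areas)
print(extracted)
```

Because the match is made on relative rather than exact position, the sketch tolerates the small layout differences expected between sufficiently similar documents.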
These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and forming a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying descriptive matter, in which there is illustrated and described a preferred embodiment of the invention.