1. Field of the Invention
The present invention relates generally to data archiving systems and more particularly to a method of automatically extracting metadata from documents for use in the data archiving systems.
2. Description of the Related Art
Metadata is data about data. In the case of documents, metadata includes pieces of information about each document such as "author," "title," "date of publication," and "type of document." As document databases become larger, it becomes necessary to extract and organize metadata so that the desired documents can be quickly and easily found within the database. There has been a continuing need for a way to automatically, quickly, and accurately extract metadata from documents as they are entered into data archiving systems. This need has been particularly acute when either the metadata or the document types, or both, are user-defined.
At one time, metadata extraction was done manually. An operator would visually scan and mentally process the document to obtain the metadata. The metadata would then be manually entered into a database, such as a card catalogue in a library. This process was tedious, time consuming, and expensive. As computers have become more commonplace, the quantity of new documents, including on-line publications, has increased greatly, and the number of electronic document databases has grown almost as quickly. The old, manual methods of metadata extraction are simply no longer practical.
Computerized "keyword" searching has replaced much of the old manual metadata entry. In "keyword" searching, the entire textual portion of every document in a database is converted into computer-readable text using optical character recognition (OCR) techniques that are known in the art. Every word in every document is then catalogued in a keyword database that indicates which words appear in a particular document and how many times each word appears in that document. This allows users to select certain "keywords" that they believe will appear in the documents they are looking for. The keyword database allows a computer to quickly identify all documents containing the keyword and to sort the identified documents by the number of times the keyword appears in each document. Variations of the "keyword" search include automatically searching for plurals of keywords and searching for Boolean combinations of keywords.
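The keyword database described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the implementation used by any particular system: the function names `build_keyword_index` and `keyword_search`, the sample documents, and the simple tokenizing rule are all hypothetical, and the text is assumed to have already been produced by OCR.

```python
import re
from collections import Counter, defaultdict


def build_keyword_index(documents):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for word, n in counts.items():
            index[word][doc_id] = n
    return index


def keyword_search(index, keyword):
    """Return ids of documents containing the keyword, most occurrences first."""
    postings = index.get(keyword.lower(), {})
    # Ties are broken by document id so the order is deterministic.
    return sorted(postings, key=lambda d: (-postings[d], d))


# Hypothetical OCR output for three documents.
documents = {
    "doc1": "The pitcher threw a curveball. Another curveball followed.",
    "doc2": "A curveball is hard to hit.",
    "doc3": "The batter hit a home run.",
}
index = build_keyword_index(documents)
print(keyword_search(index, "curveball"))
```

Here "doc1" is listed ahead of "doc2" because the keyword appears in it twice rather than once.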
"Natural language" searching followed "keyword" searching. "Natural language" searching allows users to enter a search query as a normal question. For example, a child trying to learn to pitch a baseball might search for helpful references by entering the query, "How do you throw a curveball?" The computer would then automatically delete terms known to be common, leaving only the search terms. In this case the search terms would be "throw" and "curveball." The computer would then automatically broaden the set of search terms with plurals and synonyms of the original search terms. In the above example, the word "pitch" might be added to the list of search terms.
As in "keyword" searching, a keyword database is then searched. Relevant documents are picked and sorted based on factors such as how many of the search terms appear in a particular document, how often the search terms appear in a particular document, and how close together the search terms may be to one another within the document.
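The "natural language" steps described above (deleting common terms, broadening with plurals and synonyms, then ranking) can be sketched as follows. This is a simplified illustration under assumptions: the stop-word list, the synonym table, the naive "add an s" plural rule, and the names `query_terms` and `rank` are all hypothetical. The ranking uses only the first two factors mentioned above (how many search terms appear, and how often); proximity of the terms is omitted for brevity.

```python
import re
from collections import Counter

# Assumed list of common words to delete from the query.
STOP_WORDS = {"how", "do", "you", "a", "the", "to", "is"}
# Assumed synonym table; a real system would use a much larger thesaurus.
SYNONYMS = {"throw": {"pitch"}}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def query_terms(question):
    """Drop common words, then add naive plurals and known synonyms."""
    terms = {w for w in tokenize(question) if w not in STOP_WORDS}
    expanded = set(terms)
    for term in terms:
        expanded.add(term + "s")                 # naive plural rule
        expanded |= SYNONYMS.get(term, set())
    return expanded


def rank(documents, question):
    """Order documents by distinct terms matched, then by total occurrences."""
    terms = query_terms(question)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        distinct = sum(1 for t in terms if counts[t] > 0)
        total = sum(counts[t] for t in terms)
        if distinct:
            scored.append((distinct, total, doc_id))
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]


# Hypothetical document texts.
documents = {
    "d1": "To throw a curveball, grip the seams and pitch with a snap.",
    "d2": "Curveballs are difficult to hit.",
    "d3": "The batter hit a home run.",
}
question = "How do you throw a curveball?"
print(sorted(query_terms(question)))
print(rank(documents, question))
```

In this sketch "d1" ranks first because it matches three of the expanded terms, "d2" matches only the plural "curveballs," and "d3" matches nothing and is not returned.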
While "keyword" and "natural language" searches have helped users find the documents they are looking for, they are not particularly helpful when a user is attempting to glean a particular type of metadata, for example "authors whose last names begin with the letter Z", from all, or a particular subset, of the documents within a database. Thus it is still desirable to be able to classify metadata by type.
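To illustrate why classifying metadata by type is useful, the following minimal Python sketch answers the example query above against typed metadata records. The record layout, the sample data, the function name `authors_with_last_initial`, and the assumption that the last name is the final word of the author field are all hypothetical.

```python
# Hypothetical typed metadata records; the field names are assumptions.
records = [
    {"doc_id": "d1", "author": "Maria Zhang", "title": "Pitching Basics"},
    {"doc_id": "d2", "author": "John Smith", "title": "Zen and the Art of the Curveball"},
    {"doc_id": "d3", "author": "Lee Zimmer", "title": "Batting Drills"},
]


def last_name(author):
    # Assumption: the last name is the final whitespace-separated word.
    return author.split()[-1]


def authors_with_last_initial(records, letter):
    """Return authors whose last name starts with the letter, sorted by last name."""
    matches = {
        r["author"]
        for r in records
        if last_name(r["author"]).upper().startswith(letter.upper())
    }
    return sorted(matches, key=last_name)


print(authors_with_last_initial(records, "Z"))
```

A keyword search for words beginning with "Z" would also return document "d2" because of the word "Zen" in its title; the typed query does not, because it looks only at the "author" field.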
Because manual entry of the information is often not practical, as discussed above, several schemes have been used to automate the process. First, the manual burden has been shifted to those submitting the data for the database rather than those receiving the data. Those submitting may be required to fill in on-line or paper forms listing the requested metadata. The metadata listed on the on-line forms can be entered directly into the metadata database. The metadata listed on paper forms can be scanned and an OCR operation can be performed on the textual portions. Since each item of metadata is presumed to be in a defined location on the form, the metadata can be automatically gathered and entered into the appropriate locations in the database.
In the case of classes of documents having a standardized format, such as patents, pre-set locations on the documents are known to contain certain types of metadata. For example, on a United States patent, the patent number and date are both found in the upper right-hand corner of the first page. In the case of documents having standardized formats, automatic entry of the metadata into a database is accomplished by performing an OCR operation on the particular portions of the document known to contain the desired metadata. Until the advent of the automatic document classifier, the usefulness of this system was limited by the need to manually classify each document.
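Location-based extraction from standardized documents can be sketched in Python as shown below. This is only an illustration under assumptions: the OCR step is taken as already done and is represented by a list of recognized words with page positions, the zone coordinates in `ZONES` are invented for the example and are not measured from any real patent layout, and the patent number and date in the sample data are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge as a fraction of page width (0.0 to 1.0)
    y: float  # top edge as a fraction of page height (0.0 to 1.0)


# Hypothetical zone templates: document class -> field -> (x0, y0, x1, y1).
ZONES = {
    "us_patent": {
        "patent_number": (0.6, 0.00, 1.0, 0.05),
        "date": (0.6, 0.05, 1.0, 0.10),
    },
}


def extract_by_zone(words, doc_class):
    """Collect the recognized words that fall inside each field's zone."""
    result = {}
    for field, (x0, y0, x1, y1) in ZONES[doc_class].items():
        inside = [w for w in words if x0 <= w.x < x1 and y0 <= w.y < y1]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
        result[field] = " ".join(w.text for w in inside)
    return result


# Made-up OCR output for the first page of a patent-like document.
words = [
    Word("United", 0.05, 0.02), Word("States", 0.15, 0.02), Word("Patent", 0.25, 0.02),
    Word("1,234,567", 0.75, 0.02),
    Word("Jan.", 0.70, 0.06), Word("1,", 0.76, 0.06), Word("2000", 0.80, 0.06),
]
print(extract_by_zone(words, "us_patent"))
```

The lookup `ZONES[doc_class]` is where the document class matters: the zones can only be applied once the document has been classified, whether manually or by an automatic classifier.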
Automatic document classifiers are now known in the art. For example, a document classifier is disclosed in U.S. Pat. No. 5,675,710, entitled, "Method and Apparatus for Training a Text Classifier." Automatic entry of metadata from assorted types of standardized documents can now be achieved fairly reliably and inexpensively.
Also known in the art are entire document database systems that utilize many of the aforementioned techniques in combination. One such system is described in U.S. Pat. No. 5,628,003 entitled, "Document Storage and Retrieval System for Storing and Retrieving Document Image and Full Text Data."
From the foregoing it will be apparent that there is still a need for a method to automatically extract metadata from non-standard documents. There is also a need to automatically extract metadata where the location of the metadata sought is not well defined within the document. Further, there is a need to automatically extract user-defined metadata from user-defined classes of documents.