1. Field of the Invention
The present invention relates generally to a system for managing and searching a large corpus of documents, and more particularly, to a system for sorting sets of documents with user-specified layout components of the documents recorded in the large corpus of documents.
2. Description of Related Art
Searching for a document in a large heterogeneous corpus of documents stored in an electronic database is often difficult because of the sheer size of the corpus (e.g., 750,000 documents). Many of the documents that make up the corpus cannot be identified by simply performing text-based searches. In some instances, documents in the corpus may, for example, be scanned images of hardcopy documents, or images derived using PDF (Portable Document Format) or PostScript®. In other instances, simply searching the text of documents may not narrow a search sufficiently to locate a particular document in the corpus.
Techniques for searching the text of a document in a large corpus of documents exist. U.S. Pat. No. 5,442,778 discloses a scatter-gather browsing method, which is a cluster-based method for browsing a large corpus of documents. This system addresses the extreme case in which there is no specific query, but rather a need to get an idea of what exists in a large corpus of documents. Scatter-gather relies on document clustering to present to a user descriptions of large document groups. Document clustering is based on the general assumption that mutually similar documents tend to be relevant to the same queries. Based on the descriptions of the document groups, the user selects one or more of the document groups for further study. These selected groups are gathered together to form a sub-collection. This process repeats and bottoms out when individual documents are viewed.
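As a rough illustration of the scatter-gather loop described above, the following Python sketch clusters a toy corpus, produces a short description of each group, and gathers the selected groups into a sub-collection that could be scattered again. The tokenization, the Jaccard similarity measure, the greedy clustering rule, the threshold, and all function names are assumptions made for illustration only; they are not taken from U.S. Pat. No. 5,442,778.

```python
from collections import Counter

def tokens(doc):
    """Lowercased word set of a document (illustrative tokenization)."""
    return set(doc.lower().split())

def jaccard(a, b):
    """Overlap between two word sets, in the range [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def scatter(docs, threshold=0.2):
    """Greedy clustering (an assumed stand-in for the real method): place each
    document in the first group whose seed document is similar enough,
    otherwise start a new group."""
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

def describe(group, k=3):
    """Short description of a group: its k most frequent words."""
    counts = Counter(word for doc in group for word in doc.lower().split())
    return [word for word, _ in counts.most_common(k)]

def gather(groups, chosen):
    """Merge the user-selected groups into one sub-collection."""
    return [doc for index in chosen for doc in groups[index]]

corpus = [
    "invoice total amount due",
    "invoice amount paid total",
    "meeting agenda monday notes",
    "meeting notes agenda friday",
]
groups = scatter(corpus)                      # scatter: cluster the corpus
summaries = [describe(g) for g in groups]     # descriptions shown to the user
sub_collection = gather(groups, chosen=[0])   # gather: user picks group 0
```

Calling scatter(sub_collection) again would repeat the cycle on the smaller collection, until the groups are small enough for individual documents to be viewed.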
Also, techniques exist that analyze the machine readable text of a document for identifying the genre of documents. The genre of text relates to a type of text or type of document. An example of a method for identifying the genre of machine readable text is disclosed in European Patent Application EP889417A2, entitled "Text Genre Identification". Initially, machine readable text is analyzed to formulate a cue vector. The cue vector represents occurrences in the text of a set of non-structural, surface cues, which are easily computable. A genre of the text is then determined by weighing the elements making up the cue vector.
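The idea of forming a cue vector and weighing its elements can be sketched as follows. This is a minimal illustration only: the three surface cues, the genre names, and the numeric weights are hypothetical and are not drawn from EP889417A2, which this section does not describe in that level of detail.

```python
import re

def cue_vector(text):
    """Compute a small vector of surface cues (hypothetical choice of cues)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    pronouns = sum(w.lower() in {"i", "we", "you"} for w in words)
    return [
        text.count("?") / n_words,            # question-mark density
        pronouns / n_words,                   # first/second-person pronoun rate
        len(words) / max(len(sentences), 1),  # mean sentence length in words
    ]

# Hypothetical weights: one weight per cue for each genre, plus a bias term.
GENRE_WEIGHTS = {
    "letter": ([0.0, 10.0, -0.05], 0.0),
    "report": ([0.0, -10.0, 0.05], 0.0),
}

def classify(text):
    """Weigh the cue vector per genre and return the highest-scoring genre."""
    cues = cue_vector(text)
    scores = {
        genre: bias + sum(w * c for w, c in zip(weights, cues))
        for genre, (weights, bias) in GENRE_WEIGHTS.items()
    }
    return max(scores, key=scores.get)

letter_text = "I hope you are well. We will call you soon."
report_text = ("The quarterly revenue increased substantially across all "
               "regional divisions during the reporting period.")
```

In practice the weights would be learned from labeled examples rather than set by hand; the hand-set values here only make the weighing step concrete.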
Besides text found in a document, often the layout of a particular document contains a significant amount of information that can be used to identify a document stored in a large corpus of documents. Using the layout structure of documents to search a large corpus of documents is particularly advantageous when documents in the corpus have not been tagged with a high level definition. Hardcopy documents which are scanned are recorded as bitmap images that have no structural definition that is immediately perceivable by a computer. A bitmap image generally consists of a sequence of image data or pixels. To become searchable, the structure of a bitmap image is analyzed to identify its layout structure.
By examining different work practices, it has been found that a work process (i.e., manner of working) can be supported with a system that is capable of searching and retrieving documents in a corpus by their type or genre (i.e., functional category). Whereas some genres of documents are general in the sense that they recur across different organizations and work processes, other genres of documents are idiosyncratic to a particular organization, task, or even user. For example, a business letter and a memo are examples of a general genre. A set of documents with an individual's private stamp in the upper right corner of each document is an example of a genre that is idiosyncratic to a particular user. It has also been found that many different genres of documents have a predefined form or a standard set of components that depict a unique spatial arrangement. For example, business letters are divided into a main body, author and recipient addresses, and signature. Unlike specific text-based identifiers, which are used to identify the genre of a document, the layout structure of documents can apply across different classes of documents.
A number of different techniques have been developed for analyzing the layout structure of a bitmap image. Generally, page layout analysis has been divided into two broad categories: geometric layout analysis and logical structure analysis. Geometric layout analysis extracts whatever structure can be inferred without reference to models of particular kinds of pages (e.g., letter, memo, title page, table, etc.). Logical structure analysis classifies a given page within a repertoire of known layouts, and assigns functional interpretations to components of the page based on this classification. Geometric analysis is generally preliminary to logical structure analysis. (For further background on image layout analysis see U.S. Pat. No. 6,009,196, entitled "Method For Classifying Non-Running Text In An Image" and its references).
The present invention concerns a method and apparatus for defining user-specified layout structures of documents (i.e., the visual appearance) to facilitate the search and retrieval of a document stored in a multi-genre database of documents. This method of searching documents focuses a search according to the manner in which the layout structure of a document is defined. Unlike many techniques for searching the text within a document, searching documents according to their layout structure is based on the appearance and not the textual content found in a document. The general premise for searching documents based on their layout structure is that the layout structure of a text document often reflects its genre. For example, business letters are in many ways more visually similar to one another than they are to magazine articles. Thus, a user who is searching for a particular document and knows its class is able to more effectively narrow the group of documents being searched.
One problem addressed by this invention is how to best manage a large corpus of scanned documents. Many document search and retrieval systems rely entirely on the results of applying OCR (Optical Character Recognition) to every scanned document image. Generally, OCR techniques involve segmenting an image into individual characters which are then decoded and matched to characters in a library. Typically, such OCR techniques require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing. In operation, OCR techniques distinguish each bitmap of a character from its neighbor, analyze its appearance, and distinguish it from other characters in a predetermined set of characters.
A disadvantage of OCR techniques is that they are often an insufficient means for capturing information in scanned documents because the quality of OCR results may be unacceptably poor. For example, the OCR results for a scanned document may be poor in quality because the original document was a heavily used original, a facsimile of an original, or a copy of an original. In each of these examples, the scanned results of an original document may provide insufficient information for an OCR program to accurately identify the text within the scanned image. In some instances, some scanned documents may be handwritten in whole or in part, thereby making those portions of the original document unintelligible to an OCR program.
Another disadvantage of OCR techniques is that the layout or formatting of the document is typically not preserved by an OCR program. As recognized by Blomberg et al. in "Reflections on a Work-Oriented Design Project" (published in PDC'94: Proceedings of the Participatory Design Conference, pp. 99-109, on Oct. 27-28, 1994), users searching for a particular document in a large corpus of documents tend to rely on clues about the form and structure of the documents. Such clues, which could be gained from either the original bitmap image or reduced scale images (i.e., thumbnails), tend to be lost in ASCII text renderings of images. Thus, the layout or formatting of a document, which is usually not captured or preserved when a scanned image is reduced to text using an OCR program, is crucial information that can be used for identifying that document in a large corpus of documents. Improved OCR programs such as TextBridge®, which is produced by Xerox ScanSoft, Inc., are capable of converting scanned images into formatted documents (e.g., HTML (hypertext markup language)) with tables and pictures as opposed to a simple ASCII text document (more information can be found on the Internet at http://www.xerox.com/xis/textbridge/).
An alternative technique for identifying information contained in electronic documents without having to decode a document using OCR techniques is disclosed in U.S. Pat. No. 5,491,760 and its references. This alternative technique segments an undecoded document image into word image units without decoding the document image or referencing decoded image data. Once segmented, word image units are evaluated in accordance with morphological image properties of the word image units, such as word shape. (These morphological image properties do not take into account the structure of a document. That is, the word image units do not take into account where the shape appeared in a document.) Those word image units which are identified as semantically significant are used to create an ancillary document image of content which is reflective of the subject matter in the original document. Besides image summarization, segmenting a document into word image units has many other applications which are disclosed in related U.S. Pat. Nos. 5,539,841; 5,321,770; 5,325,444; 5,390,259; 5,384,863; and 5,369,714. For instance, U.S. Pat. No. 5,539,841 discloses a method for identifying when similar tokens (e.g., character, symbol, glyph, string of components) are present in an image section; U.S. Pat. No. 5,325,444 discloses a method for determining the frequency of words in a document; and U.S. Pat. No. 5,369,714 discloses a method for determining the frequency of phrases found in a document.
Another alternative to performing OCR analysis on bitmap images is the use of systems that perform content-based searches on bitmap images. An example of such a system is IBM's Query by Image Content (QBIC) system. The QBIC system is disclosed in articles by Niblack et al., entitled "The QBIC project: querying images by content using color, texture and shape," in SPIE Proc. Storage and Retrieval for Image and Video Databases, 1993, and by Ashley et al., entitled "Automatic and semiautomatic methods for image annotation and retrieval in QBIC," in SPIE Proc. Storage and Retrieval for Image and Video Databases, pages 24-35, 1995. A demo of a QBIC search engine is available on the Internet at "http://wwwqbic.almaden.ibm.com/-qbic/qbic.html". Using the QBIC™ system, bitmap images in a large database of images can be queried by image properties such as color percentages, color layouts, and textures. The image-based queries offered by the QBIC system can be combined with text or keywords for more focused searching.
Another system for performing content-based queries is being developed as part of the UC Berkeley Digital Library Project. Unlike the QBIC system, which relies on low-level image properties to perform searches, the Berkeley system groups properties and relationships of low-level regions to define high-level objects. The premise of the Berkeley system is that high-level objects can be defined by meaningful arrangements of color and texture. Aspects of the Berkeley system are disclosed in the following articles and their references: Chad Carson et al., "Region-Based Image Querying," CVPR '97 Workshop on Content-Based Access of Image and Video Libraries; Serge Belongie et al., "Recognition of Images in Large Databases Using a Learning Framework," UC Berkeley CS Tech Report 97-939; and Chad Carson et al., "Storage and Retrieval of Feature Data for a Very Large Online Image Collection," IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Dec. 1996, Vol. 19 No. 4.
In addition to using OCR programs or the like to decipher the content of scanned documents, it is also common to record document metadata (i.e., document information) at the time a hardcopy document is scanned. This document metadata, which is searchable as text, may include the subject of the document, the author of the document, keywords found in the document, the title of the document, and the genre or type of document. A disadvantage of using document metadata to identify documents is that the set of genres specified for a particular corpus of documents is not static. Instead, the number of different genres of documents in a corpus can vary as the corpus grows. A further disadvantage of document metadata is that it is time consuming for a user to input into a system. As a result, a system for managing and searching scanned documents should be robust enough to provide a mechanism for defining categories and sub-categories of document formats as new documents are added to the corpus.
Another method for locating documents in a large corpus of documents is by searching and reviewing human-supplied summaries. In the absence of human-supplied summaries, systems can be used that automatically generate document summaries. One advantage of using summaries in document search and retrieval systems is that they reduce the amount of visual information that a user must examine in the course of searching for a particular document. By being presented on a display or the like with summaries of documents instead of the entire documents, a user is better able to evaluate a larger number of documents in a given amount of time.
Most systems that automatically summarize the contents of documents create summaries by analyzing the ASCII text that makes up the documents. One approach locates a subset of sentences that are indicative of document content. For example, U.S. Pat. No. 5,778,397, assigned to the same assignee as the present invention, discloses a method for generating feature probabilities that allow later generation of document extracts. Alternatively, U.S. Pat. No. 5,491,760 discloses a method for summarizing a document without decoding the textual contents of a bitmap image. The summarization technique disclosed in the '760 Patent uses automatic or interactive morphological image recognition techniques to produce document summaries.
Accordingly, it would be desirable to provide a system for managing and searching a large corpus of scanned documents in which not only are text identified using an OCR program and inputted document metadata searchable, but the visual representations of scanned documents can also be identified. Such a system would advantageously search, summarize, sort, and transmit documents using information that defines the structure and format of a document. It would also be desirable in such a system to provide an interface for a user to flexibly specify the genre of a document by the particular layout format of documents. One reason this is desirable is that genres of documents tend to change and emerge over the course of using and adding documents to a corpus. Consequently, an ideal system would give users the flexibility to specify either a new genre or a specific class of genre that is of interest to a single user or group of users.
In accordance with the invention there is provided a system, and method and article of manufacture therefor, for sorting document images stored in a memory. The document images are sorted by segmenting each document image recorded in the memory into a set of layout objects. Each layout object in the set of layout objects of each document is one of a plurality of layout object types, and each of the plurality of layout object types identifies a structural element of a document image. A feature of a document is selected from a set of features, where each of the features in the set of features identifies a selected group of layout objects in certain of the sets of layout objects recorded in the memory. A set of image segments is assembled in the memory. Each image segment in the set of image segments identifies those layout objects of a document image stored in the memory that form the selected feature. The assembled image segments are sorted into clusters in the memory, where each cluster defines a grouping of image segments that have similar layout objects forming the selected feature.
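The sequence of steps summarized above (layout objects, feature selection, assembly of image segments, and sorting into clusters) can be pictured with the following Python sketch. It is illustrative only and rests on several assumptions that do not come from this description: segmentation of the bitmap is taken as already done, so each document is given as a list of layout objects with normalized bounding boxes; the "header" feature and its 20% cutoff are hypothetical; and similarity is reduced to an exact match on a coarse signature, which is a simplification standing in for whatever similarity measure an actual embodiment would use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayoutObject:
    kind: str   # layout object type, e.g. "text-block" or "graphic" (illustrative)
    x: float    # left edge, normalized to [0, 1]
    y: float    # top edge, normalized to [0, 1]
    w: float    # width, normalized
    h: float    # height, normalized

def header_feature(objects):
    """Hypothetical feature: layout objects lying wholly in the top 20% of the page."""
    return [o for o in objects if o.y + o.h <= 0.2]

def assemble_segments(documents, feature):
    """Map each document id to its image segment: the layout objects that
    form the selected feature. Documents lacking the feature are skipped."""
    segments = {}
    for doc_id, objects in documents.items():
        selected = feature(objects)
        if selected:
            segments[doc_id] = selected
    return segments

def signature(segment):
    """Coarse descriptor of a segment: sorted (type, horizontal cell) pairs,
    where the page width is divided into four cells."""
    return tuple(sorted((o.kind, min(int((o.x + o.w / 2) * 4), 3)) for o in segment))

def sort_into_clusters(segments):
    """Group document ids whose segments share the same coarse signature."""
    clusters = {}
    for doc_id, segment in segments.items():
        clusters.setdefault(signature(segment), []).append(doc_id)
    return list(clusters.values())

# Toy corpus with pre-segmented layout objects (assumed input).
documents = {
    "letter-a": [LayoutObject("text-block", 0.60, 0.05, 0.35, 0.10),
                 LayoutObject("text-block", 0.10, 0.30, 0.80, 0.50)],
    "letter-b": [LayoutObject("text-block", 0.62, 0.04, 0.30, 0.12),
                 LayoutObject("text-block", 0.10, 0.35, 0.80, 0.50)],
    "memo-c":   [LayoutObject("graphic", 0.05, 0.02, 0.20, 0.10),
                 LayoutObject("text-block", 0.10, 0.30, 0.80, 0.60)],
    "plain-d":  [LayoutObject("text-block", 0.10, 0.40, 0.80, 0.50)],
}
segments = assemble_segments(documents, header_feature)
clusters = sort_into_clusters(segments)
```

In this toy corpus the two letters, whose header text blocks sit at the upper right, fall into one cluster, the memo with a graphic at the upper left forms its own cluster, and the document without any header object contributes no image segment.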