Various systems are used for the mass storage and retrieval of the contents of documents, including systems such as those disclosed in my earlier U.S. Pat. Nos. 4,273,440; 4,553,261; and 4,276,065. While these systems are indeed quite usable and effective, they generally require considerable human intervention. Other systems involve storage techniques which do not use the available technology to its best advantage and which have serious disadvantages as to speed of operation and efficiency. In this context, the term "mass storage" is currently used to mean storage of very large quantities of data, on the order of, e.g., multiple megabytes, gigabytes or terabytes.
Storage media such as optical disks are suitable for such storage although light (holographic), magnetic and other media can be used.
Generally speaking, prior large-quantity storage systems employ one of the following approaches:
A. The content of each document is scanned by some form of optical device involving character recognition (generically, OCR) so that all or major parts of each document are converted into code (ASCII or the like), which code is then stored. Systems of this type allow full-text code searches to be conducted for words which appear in the documents. An advantage of this type of system is that indexing is not absolutely required because the full text of each document can be searched, allowing a document dealing with a specific topic or naming a specific person to be located without having to be concerned with whether the topic or person was named in the index. Such a system has the disadvantages that input tends to be rather slow because of the conversion time required, and that input also requires human supervision and editing, usually by a person who is trained at least enough to understand the content of the documents for error-checking purposes. Searching has also been slow if no index is established and, for that reason, indexing is often done. Also, in many OCR conversion systems, non-word images (graphs, drawings, pictorial representations) must be handled in some way which differs from the techniques for handling text. Furthermore, such systems have no provision for displaying to the user a list of relevant search words, should the user have need for such assistance.
B. The content of each document is scanned for the purpose of processing the images of the document content into a form which can be stored as images, i.e., without any attempt to recognize or convert the content into ASCII or other code. This type of system has the obvious advantage that graphical images and text are handled together in the same way. Also, the content can be displayed in the same form as the original document, allowing one to display and refer to a reasonably faithful reproduction of the original at any time. In addition, rather rapid processing of documents and storage of the contents is possible because no OCR conversion is needed and it is not necessary for a person to check that conversion was proper. One disadvantage of such a system is that some indexing technique must be used. While it would be theoretically possible to conduct a pattern search to locate a specific word "match" in the stored images of a large number of documents, success is not likely unless the "searched for" word is presented in a font or typeface very similar to that used in the original document. Since such systems have had no way of identifying which font might have been used in the original document, a pattern search has a low probability of success and could not be relied upon. Creating an index has traditionally been a rather time-consuming, labor-intensive task. Also, image storage systems (i.e., storing by using bit-mapping or line art or using Bezier models) typically require much more memory than storing the equivalent text in code, perhaps 25 times as much.
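The trade-off noted under approach A, between searching the full text of every document and consulting a previously built index, can be illustrated with a minimal sketch. The sample documents and the function names `full_text_search` and `build_index` are hypothetical and are not drawn from any of the systems discussed; the index shown is an ordinary inverted index, used here only as an example of indexing in general.

```python
from collections import defaultdict

# Hypothetical OCR output: document id -> recognized text (illustrative only).
documents = {
    1: "invoice from acme corporation for optical disks",
    2: "letter to john smith regarding the acme invoice",
    3: "memo on holographic storage media",
}

def full_text_search(word):
    """Scan every document; no index is needed, but the work grows with total text size."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if word in text.split())

def build_index(docs):
    """Map each word to the set of documents containing it (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = build_index(documents)
print(full_text_search("acme"))   # scans all three documents
print(sorted(index["acme"]))      # single lookup in the prepared index
```

Both calls report the same documents. The difference is where the work is done: the scan repeats it at every query, whereas the index moves it to input time.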
Various image data banks have come into existence but acceptance at this time is very slow mainly due to input and retrieval problems. Because of the above difficulties, mass storage systems mainly have been restricted to archive or library uses wherein retrieval speed is of relatively little significance or wherein the necessary human involvement for extensive indexing can be cost justified. There are, however, other contexts in which mass storage could be employed as a component of a larger and different document handling system if the above disadvantages could be overcome.
In my copending applications Ser. No. 07/536,769 filed Jun. 12, 1990 and Ser. No. 07/547,190 filed Jul. 3, 1990, I have described techniques by which the input processing is accelerated and improved and in which the selection of search words is automated, i.e., performed with little human intervention. These systems allow the input of documents at a very high average rate of speed, requiring only about two seconds per document. Also, these systems provide techniques by which retrieval is facilitated because of the choice of search words usable to locate and identify stored documents at the time of retrieval. In part, the advantages of the systems are due to the use of font tables which allow searching for documents using pattern matching with search words constructed in the fonts or typefaces which are the same as those used in the original documents.
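Why a font table matters for pattern matching against stored images can be shown with a toy sketch. Everything here is assumed for illustration: the three-row glyph bitmaps, the font names `block` and `thin`, and the functions `render` and `find` are hypothetical, and the exact-match comparison is a simplification rather than the technique of the copending applications.

```python
# Hypothetical toy "font tables": each maps a character to a 3-row bitmap.
FONT_TABLES = {
    "block": {
        "a": ["###", "# #", "###"],
        "b": ["#  ", "###", "###"],
        "c": ["###", "#  ", "###"],
    },
    "thin": {
        "a": [" # ", "###", "# #"],
        "b": ["#  ", "## ", "## "],
        "c": [" ##", "#  ", " ##"],
    },
}

def render(word, font):
    """Build a bitmap (list of row strings) for a word, one blank column between glyphs."""
    glyphs = [FONT_TABLES[font][ch] for ch in word]
    return [" ".join(glyph[row] for glyph in glyphs) for row in range(3)]

def find(page, pattern):
    """Return the column offsets at which the pattern matches the page bitmap exactly."""
    width = len(pattern[0])
    hits = []
    for col in range(len(page[0]) - width + 1):
        if all(page[row][col:col + width] == pattern[row]
               for row in range(len(pattern))):
            hits.append(col)
    return hits

# A stored "document image" whose text was printed in the "block" font.
page = render("abcab", "block")

print(find(page, render("ca", "block")))  # [8]: search word built in the document's font
print(find(page, render("ca", "thin")))   # []: same word, different font, no match
```

The same search word is found only when it is constructed in the font of the stored image, which is the dependency that knowing the original font is meant to resolve.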
The above systems can employ some conversion of the words in the documents into code, such as ASCII, but an important aspect of them is that the documents are stored in image form regardless of how much is converted into code. In most cases, the amount to be converted is at least partly a matter of choice of the organization using the system. Nevertheless, the amount of conversion done has previously determined the amount of human assistance or intervention required, because the major part of the human intervention is for the purpose of editing the converted text, i.e., making sure the conversion is correct or filling in characters which the conversion system (software or hardware) cannot recognize. Since human intervention must in most cases be performed by someone who is reasonably well trained and able to understand the context and supply proper added information, it is expensive. It adds to the total cost of operating the system and to the individual cost of each document entered into image storage, a major cost item considering the millions of documents handled per company.