Various systems are used for the mass storage and retrieval of the contents of documents, including systems such as those disclosed in my earlier U.S. Pat. Nos. 4,273,440; 4,553,261; and 4,276,065. While these systems are indeed quite usable and effective, they generally require considerable human intervention. Other systems involve storage techniques which do not use the available technology to its best advantage and which have serious disadvantages as to speed of operation and efficiency. In this context, the term "mass storage" is used to mean storage of very large quantities of data on the order of, e.g., multiple megabytes, gigabytes or terabytes. Storage media such as optical disks are suitable for such storage, although other media can be used.
Generally speaking, prior large-quantity storage systems employ one of the following approaches:
A. The content of each document is scanned by some form of optical device involving character recognition (generically, OCR) so that all or major parts of each document are converted into code (ASCII or the like), which code is then stored. Systems of this type allow full-text code searches to be conducted for words which appear in the documents. An advantage of this type of system is that indexing is not absolutely required because the full text of each document can be searched, allowing a document dealing with a specific topic or naming a specific person to be located without concern for whether the topic or person was named in an index. Such a system has the disadvantage that input tends to be rather slow because of the conversion time required. Input also requires human supervision and editing, usually by a person who is trained at least enough to understand the content of the documents for error-checking purposes. Searching has also been slow if no index is established and, for that reason, indexing is often done. Also, in many OCR conversion systems, non-word images (graphs, drawings, pictorial representations) must be handled in some way which differs from the techniques used for text. Furthermore, such systems have no provision for displaying to the user a list of relevant search words, should the user have need for such assistance.
B. The content of each document is scanned for the purpose of reducing the images of the document content to a form which can be stored as images, i.e., without any attempt to recognize or convert the content into ASCII or other code. This type of system has the obvious advantage that graphical images and text are handled together in the same way. Also, the content can be displayed in the same form as the original document, allowing one to display and refer to a reasonably faithful reproduction of the original at any time. In addition, rather rapid processing of documents and storage of the contents is possible because no OCR conversion is needed and it is not necessary for a person to check that conversion was proper. A disadvantage of such a system is that some indexing technique must be used. While it would be theoretically possible to conduct a pattern search to locate a specific word "match" in the stored images of a large number of documents, success is not likely unless the "searched for" word is presented in a font or typeface very similar to that used in the original document. Since such systems have had no way of identifying which font might have been used in the original document, a pattern search has a low probability of success and could not be relied upon. Creating an index has traditionally been a rather time-consuming, labor-intensive task. Also, image storage systems (i.e., storing by using bit-mapping or line art or using Bezier models) typically require much more memory than storing the equivalent text in code, perhaps 25 times as much.
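The trade-offs described in approaches A and B can be illustrated with a minimal sketch. It is not taken from any of the systems discussed; the sample documents, the function names (tokenize, full_text_scan, build_index, indexed_search, image_storage_bytes) and the simple word-splitting rule are all assumptions made for illustration. The sketch contrasts a full-text search with no index, which reads every document on every query, with a search through an index built once at input time, and it applies the "perhaps 25 times" figure as an assumed ratio to estimate image storage.

```python
import re
from collections import defaultdict

# Hypothetical documents, assumed to be already converted to character code
# (approach A). The contents are invented for this sketch.
documents = {
    1: "Invoice from Acme Corporation for optical disks",
    2: "Letter to John Smith regarding the storage contract",
    3: "Memo: optical scanner maintenance schedule",
}


def tokenize(text):
    """Split text into lowercase words (a simplification assumed here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_scan(docs, word):
    """Search with no index: every document is read on every query."""
    word = word.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if word in tokenize(text))


def build_index(docs):
    """Build an inverted index (word -> set of document ids) once, at input."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def indexed_search(index, word):
    """Search through the index: a single lookup per query."""
    return sorted(index.get(word.lower(), set()))


def image_storage_bytes(text, ratio=25):
    """Estimate bytes needed to store the same text as an image.

    The ratio of 25 is the rough figure quoted in the text, used here
    as an assumption rather than a measured value.
    """
    return len(text.encode("ascii")) * ratio


index = build_index(documents)
print(full_text_scan(documents, "optical"))   # [1, 3]
print(indexed_search(index, "optical"))       # [1, 3]
print(image_storage_bytes(documents[1]), "bytes as image (estimated)")
```

Both searches return the same documents; the difference is that the scan does work proportional to the whole collection on each query, whereas the index moves that work to input time. The keys of such an index also form a vocabulary that could, in principle, be shown to a user as a list of candidate search words.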
Various image data banks have come into existence, but acceptance to date has been very slow, mainly because of input and retrieval problems. Because of the above difficulties, mass storage systems have mainly been restricted to archive or library uses wherein retrieval speed is of relatively little significance or wherein the necessary human involvement for extensive indexing can be cost-justified. There are, however, other contexts in which mass storage could be employed as a component of a larger and different document handling system if the above disadvantages could be overcome.