Over the past several decades, a large percentage of documents have been created and stored in digital formats. However, during this same time period and earlier, massive volumes of information have been recorded and stored only on physical documents. Such physical documents may include items that were produced using a computer or word processor and were printed, with the associated electronic files no longer available. Such physical documents may also include documents produced using a typewriter, with no associated electronic file ever created. Still further, massive amounts of handwritten records, spanning centuries, may exist.
While many of these documents are decades, if not centuries, old, each may contain information that would be beneficial to have available in an electronic and searchable format. One possible example may include previous population, birth, and death records. Such information may be particularly useful for a genealogist attempting to reconstruct a family tree with members throughout the country or world. In so doing, it may be useful to have access to immigration records, census records, birth certificates, death certificates, and/or any other document that may accurately provide information relating to family structures. Assisting genealogical studies is just one of the nearly limitless examples of the benefits of digitizing physical documents into an electronic, searchable format.
While digitizing documents previously unavailable in an electronic format may have distinct advantages, several obstacles exist. For example, consider FIG. 1. FIG. 1 illustrates a population schedule 100 from the 1930 Census of the United States. As illustrated, this is one page representing partial population information for Allegheny County of Pennsylvania. Considering that the 1930 Census was the 15th census of the United States, and each census has been charged with documenting every person in the country, the volumes of data existing in censuses in the United States alone are enormous.
While computer software and hardware arrangements capable of scanning and digitizing some text appearing on physical documents (often referred to as optical character recognition (“OCR”)) exist, they may have several drawbacks. In many instances, they may not be able to produce, with sufficient accuracy, digitized text representing the text on the physical document. This may be due to one or more different problems. For example, the typing or penmanship may be fully or partially illegible, such as a name 110 in FIG. 1. Corrections or cross-outs, such as correction 120, may exist. Further, scanning errors or document imperfections may exist, such as anomaly 130. Such problems affect the ability of a machine to accurately decipher printed or handwritten text (which may not be decipherable by OCR at all) and may prevent the automatic digitization of records, thereby requiring a person to manually review, decipher, and input the correct characters associated with the problem text. Considering the volumes of data and the possibility of frequent problem text appearing on documents, the resources (especially in terms of a human workforce) required to produce accurate digitized data may be enormous, and the process may be costly and time-consuming.
The following invention serves to remedy these and other problems.