With ever expanding use of the Internet, digital content services are becoming increasingly popular. Companies digitize books, official records, and other printed documents, and make them available to subscribing customers. Digitized records are often easier than traditional physical documents to review, search and analyze for various purposes, such as research. Thus, it has become desirable to digitize many historical records to facilitate research.
The most efficient method of digitizing printed records is to electronically scan them and use optical character recognition (OCR) to convert the scanned text to computer readable text. However, historical records are often difficult to use when scanned because of unique formatting of the original document, and also because of graphics and other material not relevant to the likely purpose of use/research of the digitized document. Often some judgment needs to be exercised as to how formatting should be accommodated and as to what data is relevant, leading to a person having to manually review each record page (either before or after using OCR), or alternatively, manually entering data from the record at a keyboard (rather than using OCR).
One example of the difficulties in digitizing records is illustrated by historical “city directories.” These directories were published by many different publishers across the United States from the late 1800's to the mid-1900's, and include listings by name of every resident (or nearly every resident/head of household) in a given city. Such directories thus provide a historical snapshot of people and their respective addresses in that city at the time of publication and, collectively, are a valuable tool for tracking people across the United States during the time periods covered by those city directories. However, city directories often include other, less useful information (unrelated to the names of residents) that makes it difficult to use standard OCR methodologies to efficiently capture and use information. If a city directory is simply digitized (using OCR methodologies), the useful information (e.g., names) may be intermingled with less useful information, and the format of data in the digitized directory may make the resulting data difficult for a user to access and search.
To illustrate the foregoing, reference is made to FIG. 1, which illustrates one page from a city directory for Los Angeles, Calif., published in 1891. As seen, the directory page 100 includes a listing 110 of names (ordered alphabetically by last name), each name appearing on a line, with some lines wrapping or continuing to the next line (without a person's name appearing on the wrapping line). The listing 110 includes information associated with each name, such as occupation and address (which associated information may also be useful to a researcher attempting to locate individuals by name).
The page 100 also includes information that would normally not be useful to a researcher or user (i.e., a user looking for individuals by name), such as advertising text 112 at the top of the page, advertising text 114 along the side, advertising text 116 at the bottom of the page and a header portion 120 with page number and directory identification. While not shown in FIG. 1, a typical city directory might include other information that would also not be useful, such as indexes, listings ordered by street or address (rather than names), pictorial or graphical advertising, and informational text concerning the city.
As discussed above, for purposes of digitizing the information on the directory page 100, it would be desirable to exclude the information that would not be useful to a user or researcher.
In addition, it would be useful (for purposes of access and retrieval) to have information on each person in the directory arranged as a single line or entry of computer readable text, ordered alphabetically by the last name of a person, and with each such line having any other useful information associated with the person. For example, as seen in FIG. 1, certain lines are indented (such as those designated by example as 130, 132 and 134), and thus are each a continuation or wrapping line of an immediately preceding line. It would be desirable for any such wrapping line to be combined or merged with its preceding line into a single line or entry of computer readable text.
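The merging of wrapping lines described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any referenced application: it assumes the OCR output preserves indentation as leading whitespace, and the function name `merge_wrapped_lines` and the sample listings are hypothetical.

```python
def merge_wrapped_lines(lines):
    """Merge each indented (wrapping) line into the line that precedes it.

    Assumption: a line that begins with whitespace is a continuation of the
    immediately preceding line, as with the indented lines in FIG. 1.
    """
    entries = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        is_continuation = raw[0].isspace()
        if is_continuation and entries:
            # Append the wrapped text to the previous entry.
            entries[-1] = entries[-1] + " " + raw.strip()
        else:
            # A non-indented line starts a new entry.
            entries.append(raw.strip())
    return entries


# Hypothetical listings, invented for illustration only.
sample = [
    "Abbott John, carpenter, r 12 Main",
    "Adams Mary E, dressmaker Smith & Co,",
    "    r 45 Hill",
    "Allen Wm, clerk, r 7 Olive",
]
merged = merge_wrapped_lines(sample)
```

Here the indented third line is folded into the "Adams" entry, so each person occupies a single line. The sketch does not handle words hyphenated across a line break or dittoed last names.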
Aforementioned U.S. application Ser. No. 13/242,736 discloses the processing of scanned data from a printed directory, with irrelevant information removed and with wrapping lines (and lines with dittoed last names) reconstructed, so that the resulting file or digitized document has a listing for each person in the directory as a single line or entry of computer readable text and has last names appropriately inserted (e.g., when missing due to dittos). For example, FIG. 2 illustrates listings from the same directory page seen in FIG. 1, but with the scanned data now arranged so that the data for each person appears in a single line, using the systems and methods described in U.S. application Ser. No. 13/242,736.
While such an arrangement makes the digitized document easier to use, it would be desirable to extract certain data from the lines and put them into searchable data fields to make that data more easily searched and accessed, for example by the use of standard database queries.
With such extraction, the data from many thousands of city directories could be combined, e.g., into a single database, so that a user trying to locate or track a person could enter one or more search terms and retrieve information, such as name, address and other information associated with the person, found in any of the directories. However, because the listings are taken from many directories with many different formats (and different ordering of information on each line), extracting certain information (for example, a name or an address) is difficult without human review and analysis.
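To make the goal of searchable data fields concrete, the following sketch shows how already-extracted fields from several directories might be combined in a single database and searched with a standard query. It does not address the extraction problem itself. The table name `listing`, its columns, the directory labels and all rows are hypothetical and invented for illustration; Python's standard `sqlite3` module is used only as a convenient example of a database.

```python
import sqlite3

# In-memory database standing in for a combined store of many directories.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE listing (
           directory  TEXT,     -- hypothetical directory label
           year       INTEGER,  -- year of publication
           last_name  TEXT,
           first_name TEXT,
           occupation TEXT,
           address    TEXT
       )"""
)

# Hypothetical rows, as if extracted from two different directories.
rows = [
    ("Example Directory A", 1891, "Adams", "Mary E", "dressmaker", "45 Hill"),
    ("Example Directory B", 1895, "Adams", "Mary E", "dressmaker", "210 Oak"),
    ("Example Directory A", 1891, "Allen", "Wm", "clerk", "7 Olive"),
]
conn.executemany("INSERT INTO listing VALUES (?, ?, ?, ?, ?, ?)", rows)

# Track one person across directories by name, ordered by publication year.
matches = conn.execute(
    "SELECT year, directory, address FROM listing "
    "WHERE last_name = ? AND first_name = ? ORDER BY year",
    ("Adams", "Mary E"),
).fetchall()
```

With the fields separated in this way, a single query returns every address recorded for the person across the combined directories, which is not practical when each listing is stored only as an undivided line of text.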