The Internet has evolved into a communication medium capable of delivering virtually any type of media in electronic form. One medium that is becoming increasingly digitized is the written word. Books, magazines, articles, and other publications are now stored as digital files that can easily be downloaded and viewed on electronic devices. Consumers no longer need to haul around paper copies of their favorite books. Instead, they can peruse online libraries containing a vast quantity of digital publications.
A digital form of a particular publication, however, is often difficult to locate. One method of creating an electronic version of a printed publication is to scan it with a modern scanning device. During scanning, an image of one or more printed pages is captured from the document and stored in a data file. Optical Character Recognition (OCR) software can then be applied to the data file to discern textual data from the scanned image(s). Traditional OCR software takes a digital image (such as a scanned book page) and determines whether the image contains any image regions, table regions, or text regions. For each text region, the OCR software determines the bounding box of each line in the region, the bounding box of each word in each line, and the most probable underlying meaning of each word. All of this information is stored in a file format proprietary to the OCR software. For example, OCR software developed by Abbyy® may store the above information as a unique extensible markup language (XML) file, named Abbyyxml. Another OCR-software company, Nuance, may store the information as a Nuance Xdoc file.
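The region, line, and word hierarchy described above can be pictured as a small nested data structure. The following Python sketch is illustrative only: the class names (BoundingBox, Word, Line, Region), the field names, and the confidence value are assumptions chosen for this example, and they do not reproduce the Abbyyxml format, the Nuance Xdoc format, or any other vendor's format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """Axis-aligned rectangle in pixel coordinates (hypothetical layout)."""
    left: int
    top: int
    right: int
    bottom: int

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest rectangle that encloses both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Word:
    box: BoundingBox
    text: str          # most probable reading chosen by the OCR engine
    confidence: float  # assumed score in [0, 1]; not every engine reports one


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    @property
    def box(self) -> BoundingBox:
        # The line box is taken here as the union of its word boxes.
        # Assumes the line holds at least one word.
        result = self.words[0].box
        for word in self.words[1:]:
            result = result.union(word.box)
        return result

    @property
    def text(self) -> str:
        return " ".join(word.text for word in self.words)


@dataclass
class Region:
    kind: str  # "text", "image", or "table"
    box: BoundingBox
    lines: List[Line] = field(default_factory=list)  # empty for non-text regions


# Example: one text region holding a single two-word line.
example_line = Line([
    Word(BoundingBox(10, 20, 60, 40), "Optical", 0.98),
    Word(BoundingBox(65, 18, 140, 41), "Character", 0.95),
])
example_region = Region("text", example_line.box, [example_line])
```

In this sketch the line box is derived from the word boxes rather than stored separately, which keeps the two consistent. An actual OCR engine may instead record each line box independently.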