Virtual machines isolate an operating system from the computer platform that is used to execute the operating system. Operating systems running inside virtual machines can be executed by different computer platforms.
Mass digitization demands a new digitization paradigm that mobilizes the general public to help with large-scale digitization efforts. One such effort is known as Project Gutenberg (http://www.pgdp.net/c/). While the bulk of the data is digitized automatically (using servers, computers, scanners and the like) by applying Optical Character Recognition (OCR) techniques, the output of the OCR is not error free. Thus, the main task in this effort is OCR validation and correction. The goal is to make this process productive and attractive to volunteers.
The so-called “carpet” OCR verification method includes generating a “carpet” of character images that were classified by the OCR as associated with the same character. Assuming that most OCR classifications are correct, an erroneous character image will be easily noticeable in the “carpet”. For example, if the OCR erroneously classifies a “P” as an “A,” the operator will see an image of a P in a “carpet” full of A's. This type of discrepancy is very easy for the human operator to spot and mark on the screen. The image of the field that was read erroneously by the OCR is then displayed so that the operator (or another operator) can type in the correct character.
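The grouping step behind the “carpet” method can be illustrated with a minimal Python sketch. It is not taken from any particular OCR product: the names CharImage, build_carpets and apply_marks, and the image identifiers, are hypothetical and chosen only for illustration. The sketch assumes the OCR output is available as a list of (image identifier, assigned character) pairs.

```python
from collections import defaultdict
from typing import NamedTuple


class CharImage(NamedTuple):
    # Hypothetical record: one cropped character image plus the OCR's guess.
    image_id: str
    ocr_label: str


def build_carpets(results):
    """Group character images by the character the OCR assigned to them."""
    carpets = defaultdict(list)
    for item in results:
        carpets[item.ocr_label].append(item.image_id)
    return dict(carpets)


def apply_marks(carpets, label, marked_ids):
    """Split one carpet into accepted images and images the operator marked."""
    marked = set(marked_ids)
    accepted = [i for i in carpets[label] if i not in marked]
    flagged = [i for i in carpets[label] if i in marked]
    return accepted, flagged


# Illustrative data: "img3" is really a P that the OCR classified as an A.
results = [
    CharImage("img1", "A"),
    CharImage("img2", "A"),
    CharImage("img3", "A"),
    CharImage("img4", "B"),
]
carpets = build_carpets(results)
accepted, flagged = apply_marks(carpets, "A", ["img3"])
```

In this sketch the flagged images would then be routed to an operator who types in the correct character, while the accepted images keep the label assigned by the OCR.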
Reading a “carpet” is not very interesting, so members of the general public are unlikely to be attracted to perform substantial verification work. Accordingly, the method is poorly suited to the massive volunteer efforts needed in library digitization.
Another OCR verification technique, which involves validating texts within their original context, is also not appealing. Not only does it require custom applications, but understanding the text within its original context is a difficult task in itself: (i) the actual content of different texts may interest only a select group of experts, while large-scale OCR verification needs to be done by laymen; (ii) the vocabulary can include words that are unfamiliar to the person who performs the OCR verification (e.g., a third grader verifying the OCR results of a Shakespearean play).
When dealing with archaic texts, even more problems arise: (a) language evolves over the years, so words and meanings change; (b) spelling, even of familiar words, changes over the years.
Verifying text within its original context is therefore both intrusive and difficult, which significantly lowers productivity and participation.
There is a growing need to provide an efficient OCR verification method, system and computer program product.