1. Field of the Invention
This invention pertains in general to generating metrics which assess the quality of automatically generated digital text.
2. Description of the Related Art
Techniques that automatically generate digital or computer-readable text are widely used to organize and standardize large bodies of information. Optical character recognition (OCR) is a technology used to transform document images into computer-readable text. Large-scale projects such as the generation of libraries of digital text aim to scan images of books and generate digital text from the books using optical character recognition. Machine translation is another technology commonly used to generate large amounts of digital text by converting digital text in one language to another language. Machine translation and optical character recognition are often used in combination to standardize information into a common computer-readable format.
Because machine translation and optical character recognition are automated techniques, the digital text they generate may be of variable quality due to errors. In some instances, these errors arise from the conversion of data, such as characters or images, that is not associated with the language of the document text. For instance, a machine translation system may attempt to translate an acronym or a piece of program code from English to French. Likewise, an optical character recognition program may attempt to recognize characters in a picture embedded in a page of text. The conversion of this data generates poor quality or “garbage” text.
Conventional approaches to recognizing garbage text include the use of a dictionary to determine whether automatically generated digital text is valid within a language. However, the use of dictionaries is limited because no dictionary can include all possible words in a language. For instance, dictionary creation is complicated by the large number of possible words in languages such as German and Finnish, in which individual words are compounded to create other words. Further, new words are adopted and used in different language systems on a continuous basis. The inclusion of technical terminology and colloquial language such as internet slang in text further complicates the use of dictionaries to determine whether automatically generated digital text corresponds to a language.
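The limitation described above can be illustrated with a minimal sketch. The function name, the punctuation handling, and the toy word list are assumptions made only for illustration and do not reflect any particular conventional system. The sketch scores text by the fraction of its tokens found in a dictionary.

```python
def fraction_in_dictionary(text, dictionary):
    """Return the share of whitespace-separated tokens found in the dictionary."""
    # Strip common surrounding punctuation and lowercase each token.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens)


# Hypothetical toy word list; a real dictionary would be far larger.
TOY_DICTIONARY = {"the", "ship", "left", "harbor", "donau", "dampf", "schiff"}

valid_score = fraction_in_dictionary("The ship left the harbor.", TOY_DICTIONARY)
# A compound built entirely from dictionary words is still not an entry itself.
compound_score = fraction_in_dictionary("Donaudampfschiff", TOY_DICTIONARY)
garbage_score = fraction_in_dictionary("xq#7 ~~lll", TOY_DICTIONARY)
```

In this sketch the legitimate compound and the garbage string receive the same score, so a pure dictionary lookup cannot tell them apart.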
Due to these complications, alternatives to dictionaries have been developed that use sequential models such as Markov models or n-gram models to compensate for unknown or unrecognized language. These models calculate the conditional probability of a word occurring given the sequence of words that precedes it in the text, thereby considering the context of the word. However, these models suffer from the same shortcoming as dictionaries: the conditional probability of a word that the model has not previously encountered is unknown or zero.
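The zero-probability shortcoming can be seen in a small bigram example. This is a sketch under stated assumptions: it uses an unsmoothed maximum-likelihood estimate, a six-word toy corpus, and hypothetical function names, none of which are taken from a specific prior system.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Count contexts and bigrams for an unsmoothed maximum-likelihood model."""
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts


def conditional_probability(model, previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    context_counts, bigram_counts = model
    if context_counts[previous] == 0:
        return 0.0  # The context itself was never observed.
    return bigram_counts[(previous, word)] / context_counts[previous]


# Hypothetical toy training corpus.
corpus = "the cat sat on the mat".split()
model = train_bigram_model(corpus)

p_seen = conditional_probability(model, "the", "cat")
# "blog" never appears in the corpus, so its estimate collapses to zero.
p_unseen = conditional_probability(model, "the", "blog")
```

Here a valid word that is simply absent from the training text is assigned the same probability as an arbitrary garbage string.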
Other methods of assessing the quality of digital text generated using optical character recognition are based on criteria such as character morphology and pixel intensity. Because these methods do not assess the quality of the text with respect to a language, they do not accurately identify the presence of garbage text. Accordingly, better methods of assessing the quality of automatically generated digital text are needed.