Script identification is a useful preprocessing step in automatic document recognition. Most optical character recognition (OCR) devices are trained to recognize a limited set of scripts. If an OCR device were presented with a document that includes text printed in a script that the device was not trained to recognize, the device would not be able to process the document correctly. There is therefore a need for a method of identifying each script in which a document is printed, so that an OCR device trained in all of those scripts can be identified and used to process the document.
A. L. Spitz, in an article entitled “Determination of the Script and Language Content of Document Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 3, 1997, discloses a method of identifying the script of a document of unknown script by using the topographical features (e.g., concavity) of components to determine whether the document is of an Asian script or a European script. A further set of features is then used to determine the language represented by the script of the document.
U.S. Pat. No. 5,062,143, entitled “TRIGRAM-BASED METHOD OF LANGUAGE IDENTIFICATION,” discloses a method of identifying language by comparing the trigrams of a sample of an unknown language against a sample of trigrams from various languages. The method of the present invention is more involved than the trigram method of U.S. Pat. No. 5,062,143. U.S. Pat. No. 5,062,143 is hereby incorporated by reference into the specification of the present invention.
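For illustration only, the general idea of comparing the trigrams of an unknown sample against trigrams from known languages can be sketched as follows. This is a minimal sketch and not the method claimed in U.S. Pat. No. 5,062,143: the function names, the overlap-based similarity measure, and the tiny reference samples are all assumptions made for this example.

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in whitespace-normalized, lowercased text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def similarity(sample, reference):
    """Fraction of the sample's trigram occurrences that also appear in the reference.

    This overlap measure is an assumption chosen for simplicity.
    """
    total = sum(sample.values())
    if total == 0:
        return 0.0
    matched = sum(count for gram, count in sample.items() if gram in reference)
    return matched / total


def identify_language(text, references):
    """Return the key of the reference profile most similar to the text."""
    sample = trigram_profile(text)
    return max(references, key=lambda lang: similarity(sample, references[lang]))


# Hypothetical reference samples; a practical system would use far larger corpora.
references = {
    "english": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "german": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
guess = identify_language("the dog and the fox", references)
```

With reference samples this small the result is only indicative; the sketch is meant to show the shape of a trigram comparison, not its accuracy.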
U.S. Pat. No. 5,442,715, entitled “METHOD AND APPARATUS FOR CURSIVE SCRIPT RECOGNITION,” discloses a method of identifying cursive script by segmenting a cursive word and using a scanning window to recognize individual characters of the word. The method of the present invention does not use a scanning window to identify individual characters of a word. U.S. Pat. No. 5,442,715 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,444,797, entitled “METHOD AND APPARATUS FOR AUTOMATIC CHARACTER SCRIPT DETERMINATION,” discloses a method of identifying script by generating connected components from the pixels of the image in question, placing a bounding box around each identified connected component, determining the centroid of each bounding box, determining the features within each bounding box, and comparing those features to the features of known script types to determine the script type of the image in question. The method of the present invention does not generate connected components, employ bounding boxes, determine centroids, or determine features, as does the method of U.S. Pat. No. 5,444,797. U.S. Pat. No. 5,444,797 is hereby incorporated by reference into the specification of the present invention.
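For illustration only, the first three steps of such a prior-art approach (generating connected components, placing a bounding box around each, and determining the centroid of each box) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of U.S. Pat. No. 5,444,797: the image is assumed to be a small binary grid, 4-connectivity is assumed, and all function names are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) foreground pixels.

    Assumes a rectangular binary image and 4-connectivity.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # Breadth-first flood fill from this unvisited foreground pixel.
                seen[r][c] = True
                queue = deque([(r, c)])
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of the smallest box containing the pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def box_centroid(box):
    """Return the (row, col) center of a bounding box."""
    top, left, bottom, right = box
    return ((top + bottom) / 2, (left + right) / 2)


# Hypothetical 3x4 binary image containing two separate components.
image = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
components = connected_components(image)
boxes = [bounding_box(p) for p in components]
centroids = [box_centroid(b) for b in boxes]
```

Feature extraction and comparison against known script types are omitted, because the description above does not specify which features are used.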
U.S. Pat. No. 5,844,991, entitled “SCRIPT IDENTIFICATION FROM IMAGES USING CLUSTER-BASED TEMPLATES,” discloses a method of identifying the script of a document of unknown script by comparing the components of the document against templates of various scripts. The templates for a particular script are derived from a whole page of training text in that script. Each template is a cluster of similar connected components found on the page of training text. Each template is assigned a reliability score. Components in a document of unknown script are then compared to the templates of each script, and the script is given points for each successful comparison. The script that gets the most points is declared the script of the document in question. The present invention does not use a template comprised of a cluster of similar connected components. U.S. Pat. No. 5,844,991 is hereby incorporated by reference into the specification of the present invention.
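For illustration only, the point-based comparison described for U.S. Pat. No. 5,844,991 can be sketched as follows. This is a minimal sketch under stated assumptions, not that patent's implementation: components and templates are assumed to be short numeric feature vectors, a comparison is assumed to succeed when every feature agrees within a tolerance, and each success is assumed to add the template's reliability score to the script's total. How the reliability score actually enters the tally is not stated above, and all names and values here are hypothetical.

```python
def matches(component, template, tolerance):
    """True if the two feature vectors differ by at most `tolerance` in every dimension."""
    return all(abs(a - b) <= tolerance for a, b in zip(component, template))


def score_scripts(components, templates_by_script, tolerance=0.1):
    """Tally points per script.

    Assumption: a matching template contributes its reliability score, and
    each component is counted at most once per script.
    """
    scores = {}
    for script, templates in templates_by_script.items():
        total = 0.0
        for component in components:
            for features, reliability in templates:
                if matches(component, features, tolerance):
                    total += reliability
                    break  # stop at the first matching template for this component
        scores[script] = total
    return scores


def identify_script(components, templates_by_script, tolerance=0.1):
    """Return the script with the highest point total."""
    scores = score_scripts(components, templates_by_script, tolerance)
    return max(scores, key=scores.get)


# Hypothetical templates: (feature vector, reliability score) pairs per script.
templates_by_script = {
    "script_a": [((0.2, 0.8), 1.0), ((0.5, 0.5), 0.5)],
    "script_b": [((0.9, 0.1), 1.0)],
}
# Hypothetical feature vectors for components of a document of unknown script.
components = [(0.22, 0.79), (0.5, 0.45), (0.88, 0.12)]

scores = score_scripts(components, templates_by_script)
best = identify_script(components, templates_by_script)
```

In this toy input, two components match templates of the hypothetical "script_a" and one matches "script_b", so "script_a" receives the most points.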
U.S. Pat. No. 5,933,525, entitled “LANGUAGE-INDEPENDENT AND SEGMENTATION-FREE OPTICAL CHARACTER RECOGNITION SYSTEM AND METHOD,” discloses the utility of considering an image of a document as a collection of lines of text, considering a line of text as a sequence of vectors, and using a Hidden Markov Model to analyze the document. The present invention does not use a Hidden Markov Model. U.S. Pat. No. 5,933,525 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,991,714, entitled “METHOD OF IDENTIFYING DATA TYPE AND LOCATING IN A FILE,” discloses a method of identifying the data type of a file of unknown data type by performing an n-gram analysis on each unit of text in the file of unknown data type, where the units of text must be known. The method of the present invention does not require that the units of text be known beforehand and does not perform an n-gram analysis at the text unit level. U.S. Pat. No. 5,991,714 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,005,986, entitled “METHOD OF IDENTIFYING THE SCRIPT OF A DOCUMENT IRRESPECTIVE OF ORIENTATION,” discloses a method of identifying the script of a document of unknown script by identifying connected pixels and computing mean positions, extents, and moment values of the pixels. The method of the present invention does not identify connected pixels and does not compute mean positions, extents, or moment values of pixels. U.S. Pat. No. 6,005,986 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,157,905, entitled “IDENTIFYING LANGUAGE AND CHARACTER SET OF DATA REPRESENTING TEXT,” discloses a method of identifying the language of a document of unknown language by using sets of full characters that are tailored to emphasize full character values that tend to be more distinguishable from one language to another. The method of the present invention is not performed at the full character level. U.S. Pat. No. 6,157,905 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,246,976, entitled “APPARATUS, METHOD AND STORAGE MEDIUM FOR IDENTIFYING A COMBINATION OF A LANGUAGE AND ITS CHARACTER CODE SYSTEM,” discloses a method of identifying the language of a document of unknown language by using probabilities of occurrence of full character codes. The method of the present invention is not performed at the full character code level. U.S. Pat. No. 6,246,976 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,272,456, entitled “SYSTEM AND METHOD FOR IDENTIFYING THE LANGUAGE OF WRITTEN TEXT HAVING A PLURALITY OF DIFFERENT LENGTH N-GRAM PROFILES,” discloses a method of identifying the language of a document of unknown language by performing an n-gram analysis at the letter level. The method of the present invention is not performed at the letter level. U.S. Pat. No. 6,272,456 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,327,386, entitled “KEY CHARACTER EXTRACTION AND LEXICON REDUCTION FOR CURSIVE TEXT RECOGNITION,” discloses a method of determining whether or not a document in question includes cursive text by identifying key characters and using a neural network to estimate the upper and lower bounds of the number of key characters in a line of cursive text. The method of the present invention does not identify key characters and does not use a neural network. U.S. Pat. No. 6,327,386 is hereby incorporated by reference into the specification of the present invention.