Currently, numerous pictures, such as those posted on the taobao.com website, include a large amount of prohibited text. In order to recognize prohibited text, optical character recognition (OCR) for natural scene images is utilized to detect and locate text. This detection step filters out non-textual items and presents candidate textual items to a recognition apparatus, which enhances the accuracy of character recognition.
Natural scene OCR technology has been a hot topic in both industrial and academic research. The features and algorithm structures utilized by OCR technology vary with the target language. Currently, international OCR technology mainly targets the English language. However, compared to English characters, Chinese characters are more complex and there are more types of Chinese characters. In addition, the component radicals of a Chinese character can render a single character as a discontinuous region. For these reasons, Chinese characters have been found more difficult to recognize.
Currently, there are three major types of OCR techniques for recognizing text regions of Chinese characters in natural scenes. The first type utilizes experience-based thresholds to classify candidate regions. The second type extracts empirical features of Chinese text lines from a large number of samples marked up in different application scenes, and utilizes a support vector machine (SVM) or the like to classify. The third type relies on an even larger number of marked-up positive and negative samples, and utilizes classifiers trained with a convolutional neural network (CNN) to classify.
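For illustration only, the first type of technique may be sketched as a set of hand-tuned rules applied to each candidate region. The sketch below is a minimal example under stated assumptions: the `Region` fields, the feature choices, and every threshold value are hypothetical and are not taken from any particular existing system.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical features of one candidate region (all names are assumptions)."""
    width: int              # bounding-box width in pixels
    height: int             # bounding-box height in pixels
    fill_ratio: float       # fraction of foreground pixels inside the bounding box
    stroke_width_cv: float  # coefficient of variation of the stroke width


# Hypothetical thresholds chosen by experience, not learned from data.
MIN_HEIGHT = 8
ASPECT_RANGE = (0.5, 30.0)
FILL_RANGE = (0.1, 0.9)
MAX_STROKE_WIDTH_CV = 0.6


def is_text_candidate(region: Region) -> bool:
    """Return True if the region passes every experience-based threshold."""
    if region.height < MIN_HEIGHT:
        return False  # too small to contain a legible character
    aspect = region.width / region.height
    if not (ASPECT_RANGE[0] <= aspect <= ASPECT_RANGE[1]):
        return False  # too narrow or too elongated for a text line
    if not (FILL_RANGE[0] <= region.fill_ratio <= FILL_RANGE[1]):
        return False  # nearly empty or nearly solid regions are unlikely to be text
    # Strokes within one text line tend to have a similar width.
    return region.stroke_width_cv <= MAX_STROKE_WIDTH_CV


candidates = [
    Region(width=120, height=24, fill_ratio=0.35, stroke_width_cv=0.30),
    Region(width=4, height=4, fill_ratio=0.50, stroke_width_cv=0.20),
    Region(width=100, height=100, fill_ratio=0.98, stroke_width_cv=0.20),
]
kept = [r for r in candidates if is_text_candidate(r)]
```

Because each threshold in such a scheme is fixed by hand, a region that falls slightly outside the tuned ranges is rejected regardless of its other features.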
Among existing OCR techniques for recognizing Chinese text regions, the experience threshold based classification approach is the simplest, with the determination features mostly coming from character features obtained through single character detection and extraction. However, the accuracy and robustness of such algorithms are relatively low, and they easily over-fit. The second classification approach is presently the mainstream scheme. The third approach is seldom used in practice because the CNN approach tends to consume an excessive amount of computational resources, affecting the overall efficiency of the algorithm. For both the second approach and the third approach, a large number of samples need to be marked up, consuming considerable effort and cost. Further, because the classification results depend on feature extraction and sample selection, different application requirements call for new batches of business-dependent data to be marked up, i.e., new samples need to be created. In other words, existing marked-up samples have low applicability. In addition, Chinese characters have many fonts and styles, including traditional, simplified, and handwritten forms, among others. Consequently, Chinese text lines have an extremely rich variety, which increases the difficulty of recognizing Chinese text regions.
Therefore, there exists a need to provide a method of Chinese OCR text region recognition with high degrees of applicability, simplicity and effectiveness.