Recognition of digital images is desirable in various contexts. For example, web crawlers may need to recognize digital images. Web crawlers (herein sometimes referred to as just “crawlers”) collect data that search engines use in performing their searches. A web crawler typically downloads web pages (using a computer) and then extracts and/or recognizes data that is included or hidden within the web pages. The web crawler can arrange the extracted/recognized data into a format that the search engine can accept and access when building an index. As such, the web crawler's task of extracting and/or recognizing data from a web page is important.
For example, in the context of a vertical search engine (i.e., a search engine that focuses on a particular segment of content, as opposed to all types of content) for businesses, the associated web crawler can extract and/or recognize such information as: “store name,” “address,” “phone number,” “map coordinates,” and “reviews.” Phone numbers (e.g., including company, business, or personal phone numbers) are very important pieces of information in the vertical search because accurate phone numbers can greatly contribute to the quality and usefulness of the information collected for a particular business. In this example, even if the extraction and/or recognition of all information for the business other than the phone information is complete, the data collected for the business is still deficient because it lacks accurate phone number information. Another example of a series of numbers that is desired to be recognized is an ID number that is present in a scanned image (e.g., a scanned image of an ID card).
Continuing with the phone information example, a phone number is generally a string of numerical characters. Sometimes, on a web page, a phone number is presented in the form of an image. This is so that the phone number does not interfere with a user's reading. Also, since the phone number itself is relatively short and takes up only a very small area on a web page, transmitting the phone number as a picture does not unduly increase network overhead. If information that a web crawler needs to obtain, such as phone numbers, is stored as a digital image, then it is useful for the web crawler to include optical character recognition (OCR) functionality to recognize such (e.g., numerical) information. OCR is a form of mechanical and/or electronic translation of scanned images of text into machine-intelligible text.
OCR is one form of computer pattern recognition, and the recognition of only digits (the numerical characters “0” through “9”) is a particular branch of OCR. Typically, the available technologies employ a digit recognition technique that is a differentiation technique. Such a differentiation technique usually implements the following steps: receive model images for the individual digits “0”-“9”; separately differentiate the image to be recognized from each of the individual model images to find the number of pixels that differ between the image to be recognized and that model image; and determine the digit that corresponds to the model image with the smallest number of differing pixels to be the numerical character in the image to be recognized. This method has a good level of recognition accuracy for images that are not geometrically distorted. However, if the image noise is severe (e.g., the image is noisy even after applying a noise removal technique), the quality of the digit recognition that uses the differentiation technique may be degraded. Also, the differentiation technique may not be as helpful in processing images of digits with geometrical deformations (e.g., images that are warped or are zoomed in).
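The differentiation steps described above can be illustrated with a minimal sketch. The 3x5 binary model images, the `TEMPLATES` table, and the function names `pixel_difference` and `recognize_digit` are hypothetical and chosen only for illustration; an actual system would presumably use larger model images and would first segment and normalize each character.

```python
# Hypothetical 3x5 binary model images for the digits "0"-"9".
# '#' is a foreground pixel and '.' is a background pixel.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": ("..#", "..#", "..#", "..#", "..#"),
    "2": ("###", "..#", "###", "#..", "###"),
    "3": ("###", "..#", "###", "..#", "###"),
    "4": ("#.#", "#.#", "###", "..#", "..#"),
    "5": ("###", "#..", "###", "..#", "###"),
    "6": ("###", "#..", "###", "#.#", "###"),
    "7": ("###", "..#", "..#", "..#", "..#"),
    "8": ("###", "#.#", "###", "#.#", "###"),
    "9": ("###", "#.#", "###", "..#", "###"),
}


def pixel_difference(image, model):
    """Count the pixels that differ between two equally sized binary images."""
    if len(image) != len(model) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image, model)
    ):
        raise ValueError("image and model must have the same dimensions")
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(image, model)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize_digit(image):
    """Return the digit whose model image differs from `image` in the fewest pixels."""
    return min(TEMPLATES, key=lambda digit: pixel_difference(image, TEMPLATES[digit]))


if __name__ == "__main__":
    # A "1" with one stray foreground pixel in the bottom-left corner.
    noisy_one = ("..#", "..#", "..#", "..#", "#.#")
    print(recognize_digit(noisy_one))
```

Because the comparison is made pixel by pixel at fixed positions, a glyph that is shifted, scaled, or warped no longer lines up with its model image, which is consistent with the weakness noted above for geometrically deformed images.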
Another typical method of image recognition includes the use of neural networks. An image recognition technique that includes the use of neural networks typically includes the following general steps: feature extraction is performed on the image to be recognized and/or a description is made of the features to be recognized, some human-recognized samples are selected to serve as objects for machine learning, and the machine learning technique outputs an image model (e.g., recognition rules or patterns). By applying this machine-learned model to the images to be recognized, the numerical characters in those images can be obtained. The recognition rate of individual numerical characters in the neural network technique is higher than in the differentiation technique (e.g., it can be as high as 96% to 98% for individual character recognition using the neural network technique). However, utilizing the neural network technique in digital image recognition also involves certain issues, because a string of digits is recognized correctly only if every character in it is recognized correctly. For example, in practice, China-based landline phone numbers usually consist of at least 8 digits, and China-based cell phone numbers are generally made up of even more digits. Assuming a 96% to 98% accuracy for individual characters, the accuracy of image recognition would be approximately 72.1% to 85.1% for an 8-digit phone number (such as for a China-based landline phone), approximately 63.8% to 80.1% for an 11-digit phone number (such as for a China-based cell phone), and approximately 61.3% to 78.5% for a 12-digit phone number (such as a landline phone number that includes the area code). In practice, when image recognition accuracy is not high, the recognition results provided by a crawler are very likely to be of poor quality (e.g., they include many incorrectly recognized characters).
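The whole-number accuracy figures above follow from raising the per-character accuracy to the power of the number of digits. The following minimal sketch reproduces that arithmetic, assuming that recognition errors on different characters are independent (an assumption that the figures imply but that is not stated explicitly); the function name `whole_number_accuracy` is hypothetical.

```python
def whole_number_accuracy(per_char_accuracy, num_digits):
    """Probability that every digit of a number is recognized correctly.

    Assumes each character is recognized independently with the same accuracy.
    """
    if not 0.0 <= per_char_accuracy <= 1.0:
        raise ValueError("per_char_accuracy must be between 0 and 1")
    if num_digits < 0:
        raise ValueError("num_digits must be non-negative")
    return per_char_accuracy ** num_digits


if __name__ == "__main__":
    # Digit counts taken from the examples in the text: 8, 11, and 12 digits.
    for num_digits in (8, 11, 12):
        low = whole_number_accuracy(0.96, num_digits)
        high = whole_number_accuracy(0.98, num_digits)
        print(f"{num_digits} digits: {low:.1%} to {high:.1%}")
```

Even a per-character accuracy of 98% leaves roughly one in five 11-digit numbers with at least one wrong digit under this assumption.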
To achieve better accuracy of image recognition, the neural network model(s) of the recognition program can be repeatedly tweaked and improved. However, repeatedly tweaking the image recognition model can be inefficient.