The present invention relates generally to text identification. More particularly, the invention relates to improved methods and apparatus for locating and recognizing text labels appearing in digital images.
Many retailers employ catalogs to provide information about their products. A typical catalog, such as one used by an auto parts retailer, comprises numerous pages, each page having a schematic or diagram with images of several products. Each image of a product is adjacent to a label which serves as an index pointing the shopper to a detailed description of the product including whatever information the retailer wishes to include, such as part number, description, function, shelf location or price. The customer's attention is drawn to the desired item by the image, and the label allows the customer to easily locate additional useful information.
As electronic data storage and processing have become more widely used in retailing, more retailers are storing catalogs electronically so that customers can obtain access to catalogs via the World Wide Web or via kiosks located in a retail store. However, a digitized image of a page from a paper catalog does not provide the customer with an immediate way to retrieve the indexed information. The customer must type the label or use some other means of data entry, and this is not as convenient as simply pointing to the label in order to retrieve the indexed information. It would be advantageous if the catalog labels could be implemented as hot buttons or hypertext links so that the customer could simply click on a hot button or link in order to retrieve the information associated with it.
In many cases it is difficult to rewrite catalog pages to include hot buttons or links because of the magnitude of the task. Many retailers have thousands of already existing catalog pages. Auto parts retailers in particular have large numbers of pages which do not need to be changed because the pages refer to replacement auto parts used in older model cars, and each of these cars is able to use the same replacement parts so long as the cars exist and the parts continue to be available. If a 1965 Ford Mustang, for example, requires a new radiator hose, that hose will need to have the same specifications whether it is purchased in 1995, 2000 or 2005. The description of such a hose appearing on a catalog page will therefore not need to change. Auto parts retailers, therefore, along with many other retailers, have a large base of catalog pages which do not need to be updated in the ordinary course of business. It would therefore represent a significant extra expense to review these thousands of pages manually in order to add hot buttons or links. It is possible to use optical character recognition (OCR) on labels in order to convert them to text, but typical catalog pages contain a mixture of pictures and text, so that simply attempting to perform OCR on an entire page would waste processing capacity on non-text components. Moreover, not all text on a catalog page is necessarily a label. Performing OCR on text which is not a label wastes processing time, and assuming that any text on a page is a label results in improper designation of text as labels, requiring that each improper designation be found and corrected.
There exists, therefore, a need in the art for a system for analyzing graphic images of catalog pages to identify labels for designation as hot buttons or links, which can distinguish between text and non-text components and which can further distinguish between text components which are labels and text components which are not labels.
A system according to the present invention receives graphic images produced by scanning of catalog pages. The system analyzes each image using connected component analysis in order to identify each component which should be considered as a unit, such as drawings, lines of text and the like. Each component is a collection of connected foreground pixels. Foreground pixels are typically black, or another color darker than the background. Once all the components in the image are identified, each component is analyzed to determine if it is a text or a non-text component. Text components are identified by their size, aspect ratio, density and other features. Once text components are identified, each text component is examined in order to determine if it is the right length for a label. Labels tend to be relatively short, typically consisting of one, two or three digits. Text components which are significantly longer than this length are unlikely to be labels and are removed from consideration. After all non-label text components are removed from consideration and all labels identified, the location of each label is determined and optical character recognition is performed on the labels.
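The sequence of steps described above can be illustrated with a minimal sketch. It is not the implementation of the invention: the function names, the 8-connected flood fill, the threshold values and the rule for grouping neighbouring components into runs are all assumptions chosen for illustration. The sketch further assumes that the image is a binary grid in which 1 marks a foreground pixel, and that each character forms its own component, so that the length of a text component is approximated by the number of character-sized components in a horizontal run. The final optical character recognition step is omitted.

```python
from collections import deque


def connected_components(img):
    """Return each 8-connected group of foreground pixels as a list of (row, col)."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps


def features(pixels):
    """Bounding box, aspect ratio (width/height) and fill density of a component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    top, left = min(ys), min(xs)
    height, width = max(ys) - top + 1, max(xs) - left + 1
    return {"top": top, "left": left, "height": height, "width": width,
            "aspect": width / height, "density": len(pixels) / (width * height)}


def is_text_like(f, min_h=3, max_h=12, max_aspect=1.5, min_density=0.3):
    """Hypothetical thresholds on size, aspect ratio and density."""
    return (min_h <= f["height"] <= max_h
            and f["aspect"] <= max_aspect
            and f["density"] >= min_density)


def group_into_runs(chars, max_gap=2):
    """Join character boxes that overlap vertically and sit close together horizontally."""
    runs = []
    for f in sorted(chars, key=lambda f: f["left"]):
        for run in runs:
            last = run[-1]
            gap = f["left"] - (last["left"] + last["width"])
            overlap = (min(f["top"] + f["height"], last["top"] + last["height"])
                       - max(f["top"], last["top"]))
            if gap <= max_gap and overlap > 0:
                run.append(f)
                break
        else:
            runs.append([f])
    return runs


def find_labels(img, max_chars=3):
    """Return (top, left, height, width) for each short text run, i.e. each label candidate."""
    chars = [f for f in map(features, connected_components(img)) if is_text_like(f)]
    boxes = []
    for run in group_into_runs(chars):
        if len(run) > max_chars:
            continue  # too long to be a label; removed from consideration
        top = min(f["top"] for f in run)
        left = min(f["left"] for f in run)
        bottom = max(f["top"] + f["height"] for f in run)
        right = max(f["left"] + f["width"] for f in run)
        boxes.append((top, left, bottom - top, right - left))
    return boxes
```

Each box returned by `find_labels` gives the location of a candidate label, and only those regions would then be passed to an OCR engine. In practice the thresholds would have to be tuned to the scanning resolution and the typeface of the catalog pages.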
A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following Detailed Description and the accompanying drawings.