Text detection and localization in natural scene images serves as a crucial component for content-based information retrieval, as textual information often provides important clues for understanding the high-level semantics of multimedia content. Despite the tremendous effort devoted to solving this problem, text localization remains challenging. The difficulties mainly lie in the diversity of text patterns and the complexity of scenes in natural images. For instance, texts in images often vary dramatically in font, size, and shape, and can be distorted easily by illumination or occlusion. Furthermore, text-like background objects, such as bricks, windows and leaves, often lead to many false alarms in text detection.
Commonly used text detection methods include texture-based methods and component-based methods. In a texture-based method, an image is scanned at different scales using windows of different shapes. Text and non-text regions are then classified based on descriptors extracted from each window. However, text lines in images vary widely in layout (e.g., rotation, perspective distortion, aspect ratio), and generic descriptors cannot capture this variation well.
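To make the scanning step of a texture-based method concrete, the following is a minimal sketch of a multi-scale window scan. It is an illustration under stated assumptions, not the procedure of any particular detector: the function name `sliding_windows`, the base window shapes, the scale set, and the stride fraction are all hypothetical choices, and the descriptor extraction and text/non-text classifier that would consume each window are deliberately left out.

```python
from typing import Iterator, Sequence, Tuple

# A window is (x, y, width, height) in pixel coordinates.
Window = Tuple[int, int, int, int]


def sliding_windows(
    image_w: int,
    image_h: int,
    base_shapes: Sequence[Tuple[int, int]],
    scales: Sequence[float],
    stride_frac: float = 0.5,
) -> Iterator[Window]:
    """Enumerate windows of several shapes at several scales.

    All parameters are illustrative assumptions; a real detector would
    tune shapes, scales, and stride to its descriptor and data.
    """
    for scale in scales:
        for base_w, base_h in base_shapes:
            w = int(round(base_w * scale))
            h = int(round(base_h * scale))
            # Skip degenerate windows and windows larger than the image.
            if w < 1 or h < 1 or w > image_w or h > image_h:
                continue
            # The stride is a fraction of the window size, at least one pixel.
            step_x = max(1, int(w * stride_frac))
            step_y = max(1, int(h * stride_frac))
            for y in range(0, image_h - h + 1, step_y):
                for x in range(0, image_w - w + 1, step_x):
                    yield (x, y, w, h)
```

Each yielded window would then be passed to a descriptor and a classifier. Because the window shapes are fixed in advance, a rotated or perspective-distorted text line rarely aligns with any of them, which is the limitation noted above.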
In contrast to texture-based methods, a component-based method first discards the majority of background pixels using low-level filters. Component candidates are then constructed from the remaining pixels using a set of properties such as consistency of stroke width and color homogeneity. However, low-level filtering is sensitive to image noise and distortions, which can lead to incorrect component grouping.
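The component-based pipeline can be sketched in the same spirit. The example below is a simplified illustration, not the method of any cited work: it assumes a grayscale image stored as a list of rows, uses a plain intensity threshold as a stand-in for the low-level filter, groups the surviving pixels into 4-connected components, and keeps only components whose intensity variance is small, as a crude proxy for color homogeneity. Stroke-width consistency is omitted, and the names `connected_components` and `homogeneous_components` and both thresholds are hypothetical.

```python
from collections import deque
from statistics import pvariance
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]  # (row, column)


def connected_components(mask: Sequence[Sequence[bool]]) -> List[List[Pixel]]:
    """Group truthy cells of `mask` into 4-connected components (BFS)."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components: List[List[Pixel]] = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            component: List[Pixel] = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def homogeneous_components(
    gray: Sequence[Sequence[int]], threshold: int, max_variance: float
) -> List[List[Pixel]]:
    """Keep dark components whose intensities are nearly uniform.

    `threshold` stands in for the low-level filter and `max_variance`
    for a color-homogeneity test; both are assumed values.
    """
    # Low-level filter: discard every pixel at or above the threshold.
    mask = [[value < threshold for value in row] for row in gray]
    kept = []
    for component in connected_components(mask):
        values = [gray[y][x] for y, x in component]
        if pvariance(values) <= max_variance:
            kept.append(component)
    return kept
```

Even in this toy form the weakness described above is visible: a single noisy pixel that crosses the threshold can merge two components or split one, so errors in the filtering step propagate directly into the grouping.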