Detecting and recognizing text lines in natural images is an important component in content-based retrieval, and has many real-world applications such as mobile search, text translation, and image tagging. For example, a user may capture an image of a product using his or her smartphone. A search application executing on the smartphone may use the image to identify a text line across a label of the product, recognize text from the text line, use the text to conduct an online search for information about the product, and display search results to the user.
Compared to extracting text from well-captured document images, spotting text in natural scenes is far more challenging due to the wide diversity of both text appearance and surrounding backgrounds. For example, text lines in natural images can appear in any orientation and can vary dramatically in font, size, and color across images. Images captured by hand-held devices also suffer from non-uniform illumination, occlusion, and blur. Moreover, text-like background objects such as windows, bricks, and fences may confuse a text detector.
Previous text detection techniques roughly follow the same bottom-up paradigm of two main steps. In a first step, pixel-level features are extracted to group pixels into text components. In a second step, sufficiently similar nearby text components are grouped into text lines. These techniques focus heavily on the first step, exploring various low-level image features along with rule-based and learning-based methods for forming and filtering components.
The second step, text line extraction, has been less explored territory. For this step, previous techniques often use heuristic, greedy methods that concatenate similar neighboring text components into chain structures representing text line candidates. Most commonly, a text component is set as a seed and compared to a neighboring text component. If the two are sufficiently similar, they are chained together to start a text line. The process repeats for successive neighboring components until a large enough dissimilarity is encountered, at which point the chain is broken to mark the end of the text line. However, such methods can be inaccurate and can fail due to, for example, local component detection errors. In theory, accurate component detection in the first step should yield good text line extraction in the second step. In practice, however, component detection typically includes errors, and simple text line construction methods like these do not handle errors in the input data well, resulting in inaccurate output.
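The seed-and-chain heuristic described above can be illustrated with a minimal sketch. Here the component representation (bounding boxes as dictionaries) and the similarity test (a height ratio and a horizontal gap limit) are assumptions chosen for illustration, not the rule used by any specific prior technique.

```python
def similar(a, b, size_ratio=0.7, max_gap_factor=2.0):
    """Heuristic test: two neighboring components belong to the same line
    if their heights are comparable and the horizontal gap is small."""
    ratio = min(a["h"], b["h"]) / max(a["h"], b["h"])
    gap = b["x"] - (a["x"] + a["w"])
    return ratio >= size_ratio and gap <= max_gap_factor * a["h"]

def chain_text_lines(components):
    """Greedily chain components (sorted left to right) into line candidates.

    A chain grows while consecutive components are similar; the first
    sufficiently dissimilar component breaks the chain and seeds a new one.
    """
    components = sorted(components, key=lambda c: c["x"])
    lines, chain = [], []
    for comp in components:
        if chain and not similar(chain[-1], comp):
            lines.append(chain)  # dissimilarity marks the end of a line
            chain = []
        chain.append(comp)
    if chain:
        lines.append(chain)
    return lines
```

Because each decision looks only at the current component pair, a single mis-detected component (e.g. one with a wrong bounding box height) breaks the chain and splits one true text line into two candidates, which is the local-error fragility noted above.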
Previous techniques have also used a graph-based approach to text line extraction. In this approach, a graph is generated from the detected components, and graph segmentation methods are used to cut the graph to produce text lines. However, this approach is also sensitive to errors in the component detection step, since those errors directly affect the vertices and edges of the constructed graph.
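The graph-based alternative can be sketched in the same spirit: components become vertices, sufficiently similar pairs become edges, and each connected component of the resulting graph is one text line candidate. The pairing predicate passed in and the union-find grouping are illustrative assumptions, not the segmentation method of any particular prior work.

```python
from itertools import combinations

def graph_text_lines(components, similar):
    """Group components into line candidates via connected components
    of a similarity graph, using a union-find forest."""
    n = len(components)
    parent = list(range(n))

    def find(i):
        # Find the root of i's tree, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (a union) for every sufficiently similar pair of vertices.
    for i, j in combinations(range(n), 2):
        if similar(components[i], components[j]):
            parent[find(i)] = find(j)

    # Collect vertices by root: each group is one text line candidate.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(components[i])
    return list(groups.values())
```

The sketch makes the sensitivity concrete: a spurious detected component adds a vertex (and possibly edges bridging two true lines), while a missed component removes the vertex that connected a line's halves, so detection errors propagate directly into the graph's structure.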