Example embodiments described herein relate to text detection; in particular, to methods and apparatus for detecting text in a natural image and generating word-level text bounding boxes. Reading text “in the wild” has attracted increasing attention in the computer vision community. It has numerous potential applications in image retrieval, industrial automation, robot navigation, and scene understanding, among other areas. Nevertheless, it remains a challenging problem. The main difficulty of such text interpretation processes lies in the vast diversity in text scale, orientation, illumination, and font present in real-world environments, which often have highly complicated backgrounds.
Previous methods for text detection in such environments have been dominated by bottom-up approaches, which often contain multiple sequential steps, including character or text component detection, followed by character classification or filtering, text line construction, and word splitting. The character detection and filtering steps play a key role in such bottom-up approaches. Previous methods typically identify character or text component candidates using connected-component-based approaches (e.g., stroke width or extremal region) or sliding window methods. However, both groups of methods commonly suffer from two main limitations that significantly reduce their efficiency and performance. First, such text detection methods are built on the identification of individual characters or components, making it difficult to exploit regional context information. This often results in low recall, because ambiguous characters are easily discarded. It also reduces precision, because a large number of false detections are generated. Second, the multiple sequential steps make the system highly complicated, and errors easily accumulate through the later steps.