1. Field of the Invention
The present invention relates to a business document processing apparatus and, for example, to a technique for removing an underline that touches a character string in a business document.
2. Background Art
In recent years, organizations have increasingly been scanning the enormous volumes of paper business documents they store, applying optical character recognition (OCR) to them, and managing the resulting document data in a document management system, in order to improve search performance, store paper documents securely, and share knowledge.
Although current OCR recognizes character strings in a noise-free document with high accuracy, the characters often cannot be recognized correctly when an underline is drawn touching the character string. In OCR, characters are cut out one by one, and each cut-out character is then classified. When an underline is attached to the characters, however, cutting out the characters often fails, or the underline is recognized as part of a character and a wrong classification results. Such misrecognition means that the character information of that part is not acquired, and it also hinders searching because meaningless character information remains as noise. Underlined character strings in a business document are often essential for identifying the document, such as the title of the document, the name of a business partner, and various management numbers. If this information cannot be recognized correctly, a search cannot narrow down the documents, and a burdensome check of all registered document data becomes necessary. Therefore, when OCR is applied, the characters of a character string need to be recognized with high accuracy even if an underline touches the character string.
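The failure of character cut-out described above can be illustrated with a minimal sketch. It assumes a simple vertical projection profile as the cut-out method, which is an illustrative simplification and not necessarily what any particular OCR engine does. The function name `segment_columns` and the toy images `plain` and `underlined` are hypothetical.

```python
def segment_columns(image):
    """Split a binary image into runs of columns that contain ink.

    `image` is a list of equal-length strings; '#' is ink, '.' is background.
    Returns a list of (start, end) column indices, with end exclusive.
    """
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] == "#" for row in image) for x in range(width)]

    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink columns begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # a blank column ends the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


# Three toy glyphs separated by blank columns (illustrative data only).
plain = [
    "##..#..##",
    "#...#..#.",
    "##..#..##",
    ".........",
]
# The same glyphs with an underline drawn directly beneath them.
underlined = plain[:3] + ["#########"]

print(segment_columns(plain))
print(segment_columns(underlined))
```

In this sketch the glyphs in `plain` are separated by blank columns and are cut out individually, whereas the touching underline in `underlined` fills every column, so the whole string is returned as a single segment.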
Methods of extracting and removing an underline from an underlined character string in a document have been proposed as a means of improving the recognition accuracy of OCR when a character string is underlined. For example, Yoshihiro Shima and three others, “One Method of Underline Extraction from Business Form Image”, FIT 2002 (Forum on Information Technology), I-85, pp. 169-170, 2002.09 proposes a technique for removing an underline from a character string in a business form image. Zhen-Long Bai, Qiang Huo, “Underline Detection and Removal in a Document Image Using Multiple Strategies”, 17th International Conference on Pattern Recognition (ICPR '04), Volume 2, pp. 578-581, 2004 proposes a technique that also removes an underline touching a character string.
However, the technique of Yoshihiro Shima and three others, “One Method of Underline Extraction from Business Form Image”, FIT 2002 (Forum on Information Technology), I-85, pp. 169-170, 2002.09 is designed for the case in which the underline and the character string do not touch. Therefore, the underline cannot be removed when the character string and the underline touch. The technique of Zhen-Long Bai, Qiang Huo, “Underline Detection and Removal in a Document Image Using Multiple Strategies”, 17th International Conference on Pattern Recognition (ICPR '04), Volume 2, pp. 578-581, 2004 is designed for the case in which the document includes only characters and underlines. Therefore, when the technique is applied to a document that often includes charts, such as a business document, it may have the adverse effect of removing the ruled lines that constitute the charts.
The present invention has been made in view of the foregoing circumstances and provides a technique that can remove an underline even when a business document includes a chart and even when the underline touches a character string.