1. Field of the Invention
This disclosure relates to a method for extracting text information and, more particularly, to searching documents and drawings for text characters mixed with graphics.
2. Description of the Related Art
Effective and automated creation of hypermedia has recently received immense focus due to the high demand for hypermedia applications generated by the huge and growing popularity of the World Wide Web (WWW). Unfortunately, the creation of hypermedia to date continues to be a laborious, manually intensive job, in particular the task of referencing content in drawing images to other media. In a majority of cases, hypermedia authors have to locate anchored information units (AIUs) or hotspots (areas or keywords of particular significance), which are then appropriately hyperlinked to relevant information. In an electronic document, the user can retrieve associated detailed information by clicking on these hotspots with a mouse, whereupon the system interprets the associated hyperlinks and fetches the corresponding information.
Extraction of AIUs is of enormous importance for the generation of hypermedia documents. However, extracting AIUs from raster images is nontrivial. This is particularly true of scanned-in images of engineering documents, which primarily consist of line drawings of mechanical parts with small runs of text indicating the part or group number. In scanned-in mechanical drawings, machine parts are labeled by text strings that point to the relevant parts. One way to create an index for the machine parts would be to point to the associated text. The areas of interest to the hypermedia author and to the end user are therefore those text strings that identify the part numbers or other related document information. Extracting them is also important for making drawings more content referable in electronic documents.
What makes this problem challenging is the indistinguishability of text from polylines which constitute the underlying line drawings. This also partially explains the paucity of reliable products that can undertake the above-mentioned task. While developing a general fit-for-all algorithm that would work for all kinds of line-drawing images is almost impossible, solutions can be achieved by making use of underlying structures of the concerned documents.
Currently, most available methods cannot be used reliably for drawing images. They fall primarily into two categories: (a) raster-to-vector converters and (b) traditional OCR methods mainly used for text documents. Due to the similarity between text and polylines, extraction of text from line drawings is a very difficult task. While raster-to-vector converters treat the whole image as consisting of line drawings only, OCR software packages presume the whole image is text. In the first case, text is converted to line drawings; in the second case, the software attempts to interpret line drawings as text. While the first category of products is clearly irrelevant within the present context, the second category leaves the user the task of culling the relevant material from all the "junk" produced by misreading line drawings as text.
Several prior art software packages fall within this category. While they can accomplish some preprocessing, such as image despeckling and enhancement, proper interpretation of text within line drawing images requires the user to manually outline the text regions, which is tedious and time consuming.
Therefore, a need exists for an automated method to locate keywords in engineering drawings and documents to create proper AIUs for cross-referencing in hypermedia documents. Most of the suggested prior art methods do not optimally use the underlying geometry and domain-specific knowledge to achieve the task of text separation. It is desirable for such a method to make use of the geometry and length of the text strings to be identified in order to localize them, and then to analyze the localized regions using OCR software to extract the exact text content. Further, the method must be amenable to user manipulation and input to adapt to the variability of the class of documents under consideration. A user-friendly interface should also allow corrections at different stages of the procedure.
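By way of illustration only, the idea of using the geometry and length of text strings to localize them may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the method of this disclosure: it assumes a binarized raster image, treats small connected components as candidate characters, and keeps only horizontal runs of such components whose character count falls within an expected range. The function names (connected_components, candidate_text_regions) and all threshold values are hypothetical and would need tuning for a given class of drawings.

```python
from collections import deque

def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected blobs."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def candidate_text_regions(img, max_char_h=5, max_char_w=5,
                           max_gap=2, min_chars=2, max_chars=6):
    """Localize likely text strings; all thresholds are assumed values."""
    # Geometry filter: long polylines exceed the character-size limits.
    chars = [b for b in connected_components(img)
             if b[2] - b[0] + 1 <= max_char_h and b[3] - b[1] + 1 <= max_char_w]
    chars.sort(key=lambda b: b[1])  # left to right
    groups = []
    for b in chars:
        for g in groups:
            last = g[-1]
            overlaps_vertically = b[0] <= last[2] and last[0] <= b[2]
            if overlaps_vertically and b[1] - last[3] - 1 <= max_gap:
                g.append(b)
                break
        else:
            groups.append([b])
    # Length filter: keep runs with a plausible number of characters.
    return [(min(b[0] for b in g), min(b[1] for b in g),
             max(b[2] for b in g), max(b[3] for b in g))
            for g in groups if min_chars <= len(g) <= max_chars]

# Toy image: two character-sized blobs, one isolated speckle, one long line.
ROWS = [
    "............",
    ".##.##....#.",
    ".##.##......",
    "............",
    "############",
    "............",
]
img = [[ch == "#" for ch in row] for row in ROWS]
print(candidate_text_regions(img))  # prints [(1, 1, 2, 5)]
```

In this toy example the long horizontal line is rejected by the size filter, the isolated speckle is rejected by the length filter, and the remaining bounding box is the kind of localized region that would then be passed to OCR software.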