Increasing emphasis is being placed on the realization that computer-based systems are now capable of providing automated analysis and interpretation of paper-based documents. The move from paper-based documentation towards computerized storage and retrieval systems has been prompted by the many advantages to be gained from the electronic document environment. A clear advantage is the efficiency of storage, transmission and retrieval, because paper-based information of almost any kind may be more efficiently processed in a computerized form. The document update and revision capability provided by a computerized form may be the most significant gain over the paper-based medium. For a system to enable image revision and update, it is necessary to automate the data capture process rather than re-create the data in digital form. Accordingly, it is necessary to generate a description of the graphical elements in the document (rather than a bit-map) in order to allow easy editing and to decrease the storage requirements and processing time.
Automatic word processing units are increasingly being employed in office use for producing, modifying, and storing written documents in an economical and time-saving manner. Such units have the capability of undertaking error corrections, insertion of new text passages, combining two or more texts having different origins, and random reproduction and electronic storage of data corresponding to the text passages. The advantages of such automatic word processing units in comparison to conventional typewriters are the flexibility and time savings in the production of written documents, and the higher efficiency resulting therefrom. A particularly time-consuming step associated with the use of automatic word processing units is the transfer of information already existing on paper into the automatic word processing unit for storage and/or further processing.
Manual transfer by keyboard of large amounts of text is extremely time-consuming and, accordingly, various methods and devices have been developed for automatically transferring the information contained in texts into the word processing unit. One such device is a digitizer or the like for reading a drawing or the like to input data to a computer, in which the reading indicator for the coordinates is manually moved to feature points, such as end points and inflection or bending points of lines, so that the read coordinates are stored in the computer. In this case, the identification of the lines is also manually input into the computer by another means. The discrimination of lines in accordance with this method relies on human pattern recognition ability, and this fact prevents the digitizing operation from being fully automated.
A problem in the automatic transfer of existing information contained in text passages into a word processing unit is that the master on which such text passages occur may also contain graphics and/or image areas. It is a problem in the art to automatically identify, classify and store these different types of information areas on a master in order to achieve an optimum coding of the data representing these different master areas as well as to permit separate manipulation of the data representing those areas within the word processor.
In the past, removing or cutting regions of text or sentences in a digitized document required the aid of an interactive software package along with a "mouse" to allow an operator to specify window locations (rectangular areas) around the selected text that is to be cut. It is, however, more natural and easier for an operator to select text regions by hand drawing boundaries on a paper document and to let the computer automatically extract the marked regions in the digital domain after the document has been scanned. This form of text extraction requires an algorithm to identify the hand drawn components and locate their spatial coordinates in a digital document. The boundary coordinates of the identified hand drawn curves are then used to separate the internal text material from the external text outside the hand drawn boundaries.
Prior art practices have been unable to identify the hand drawn curves when the document contains characters of different sizes and styles. An area (or size) threshold fails to extract small hand drawn components when they are smaller than some of the characters. Because hand drawn curves are unconstrained symbols which can be any shape, it is impossible to use pattern recognition techniques (statistical or structural analysis) for the recognition of a hand drawn symbol without prior information concerning the hand drawn symbol.
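The failure of an area threshold can be illustrated with a minimal sketch. The component names, the box sizes, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from any prior art system. Each connected component is represented by its pixel set, and its size is measured by bounding-box area.

```python
def bounding_box_area(pixels):
    """Area of the smallest axis-aligned box enclosing the (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)


def outline(height, width):
    """Hypothetical stand-in for a component: the border pixels of a box."""
    return {(r, c) for r in range(height) for c in range(width)
            if r in (0, height - 1) or c in (0, width - 1)}


def larger_than(components, threshold):
    """Names of the components whose bounding-box area exceeds the threshold."""
    return sorted(name for name, px in components.items()
                  if bounding_box_area(px) > threshold)


# Assumed sizes: a small body character, a large headline character,
# and a small hand drawn loop that falls between the two.
components = {
    "body_character": outline(5, 4),           # bounding-box area 20
    "headline_character": outline(20, 15),     # bounding-box area 300
    "small_hand_drawn_loop": outline(10, 10),  # bounding-box area 100
}

# A threshold low enough to keep the loop also keeps the headline character.
print(larger_than(components, 50))
# A threshold high enough to reject every character also rejects the loop.
print(larger_than(components, 300))
```

Under these assumed sizes, no single threshold selects the hand drawn loop alone, which is the situation described above for documents with mixed character sizes.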
The present invention overcomes the problems mentioned above by using a threshold-free technique to identify a hand drawn closed curve of any size or shape in a document. The boundary coordinates of the identified curves are then used to locate the text inside the curves. By blanking out the external text material, the output image will contain only the selected text.
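The blanking step can be sketched as follows, assuming the closed curve has already been identified and its boundary pixels are available. This is an illustrative sketch only and not the identification technique itself: the function name `blank_outside_curve`, the list-of-lists image representation, and the use of a flood fill from the image border are all assumptions made here.

```python
from collections import deque


def blank_outside_curve(image, curve):
    """Keep only the ink strictly inside a closed curve.

    image: 2D list of 0/1 values (1 = ink), assumed rectangular.
    curve: set of (row, col) boundary pixels of an already identified
           closed curve, assumed 8-connected and not touching the border.
    """
    rows, cols = len(image), len(image[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Seed the fill with every border pixel that is not part of the curve.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and (r, c) not in curve:
                outside[r][c] = True
                queue.append((r, c))

    # 4-connected flood fill: it cannot leak through an 8-connected curve.
    # Text ink does not block the fill; only curve pixels do.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not outside[nr][nc] and (nr, nc) not in curve):
                outside[nr][nc] = True
                queue.append((nr, nc))

    # Blank the exterior and the curve itself; copy the interior unchanged.
    return [[image[r][c] if not outside[r][c] and (r, c) not in curve else 0
             for c in range(cols)] for r in range(rows)]


# Usage: a 7x7 page with a square "hand drawn" loop around one ink pixel.
curve = {(r, c) for r in range(1, 6) for c in range(1, 6)
         if r in (1, 5) or c in (1, 5)}
page = [[0] * 7 for _ in range(7)]
for r, c in curve:
    page[r][c] = 1      # the loop itself is ink on the scanned page
page[3][3] = 1          # text inside the loop
page[0][6] = 1          # text outside the loop
page[6][0] = 1          # text outside the loop
result = blank_outside_curve(page, curve)
```

In this usage example, only the ink pixel enclosed by the loop survives in `result`; the loop and the two external pixels are blanked.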