This invention relates to automated language identification, and, more particularly, to automated language identification from images of printed documents. This invention was made with government support under Contract No. W-7405-ENG-36 awarded by the U.S. Department of Energy. The government has certain rights in the invention.
There is an increasing need for the automated retrieval of information from electronic documents. Of the vast quantity of electronic documents that constitute the world resource to be tapped by automated retrieval tools, a significant number have been scanned directly from hard copy of textual information into electronic images. For example, in order to save space, many libraries are reducing the physical volume of their holdings by scanning paper copies of books and journals into electronic form. Although the main future use of these scanned documents, within the library itself, may be by a person sitting at a display terminal and reading the documents, it is important to recognize that these images also represent an important data resource for applications of automated information retrieval technology. These images are not character-oriented databases, which form the usual input to automated document information retrieval, but simply patterns of light and dark (i.e., images) stored in electronic format. As such, they present unique problems to be solved in automated information retrieval applications.
In an international environment, a crucial first step in this type of image processing is to apply pattern recognition techniques to identify, from the image alone, the language used to produce the original document. Once the language is known, the document can be routed to an appropriate human translator, or to a computer for further processing. Such further processing might include the application of conventional optical character recognition techniques (which require prior knowledge of the source alphabet) to extract the individual characters that make up the document.
One important problem in this area is the analysis of writing systems and the identification of a language from the writing system. Some writing systems, however, employ connected alphabets, e.g., those used to write Hindi and Arabic, in which adjacent characters are joined to one another. It would therefore be desirable to perform writing analysis without separating a writing into individual characters or words.
Accordingly, it is an object of the present invention to perform an analysis of a document stored in an image format.
It is another object of the present invention to enable the analysis of connected writing systems.
Yet another object of the present invention is to obtain image features that enable a writing to be analyzed to identify the language in which it is written.
Additional objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.