1. Field of the Invention
The present invention relates to a translation method and a translation apparatus for reading a document written in a certain language by means of an optical character reader (OCR) and translating it into another language.
The term "morpheme" used in this specification refers to a certain unit composed of a series of plural characters or sounds in one language, forming at least one meaningful word.
2. Description of the Prior Art
An optical character reader (OCR), which optically reads an original document printed in type or the like, recognizes characters from the read image of the original, and delivers information such as character codes, has been conventionally employed. The output of the OCR is, for example, fed into a computer, subjected to various processing, and delivered to display devices for display. An OCR may also be used for inputting data into a translation apparatus for translating a certain language into other languages.
FIG. 1 shows a character recognition processing procedure in an OCR, and FIG. 2 shows a translation processing procedure in a translation apparatus using an OCR as the input means. In FIG. 1, the OCR first, at step a1, stores the image data of the original image into a memory called an internal image buffer. At the next step a2, the image data stored in the internal image buffer is segmented into character units by character segmentation processing.
At step a3, features are extracted from the segmented characters. The feature data includes, for example, the number of black pixels composing the segmented character.
The original image at step a1 is read by using a reading means including a charge-coupled device (CCD). The original image is then binary-coded into white pixels and black pixels, one value for each of the plural pixels of the reading means. The binary data is stored in the internal image buffer as the image data. The number of black pixels mentioned above therefore refers, for example, to the number of black pixels in each unit division when a rectangular region on the original plane allotted for one character is divided into predetermined unit divisions.
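The feature extraction of step a3 may be sketched as follows. This is a minimal illustration only: the function name zone_black_counts, the representation of the binary image as nested lists, and the equal-sized grid of unit divisions are assumptions made here and are not taken from the OCR described above.

```python
from typing import List

# Binary-coded image of one segmented character: 1 = black pixel, 0 = white pixel.
Bitmap = List[List[int]]


def zone_black_counts(bitmap: Bitmap, zone_rows: int, zone_cols: int) -> List[int]:
    """Count the black pixels in each unit division of a character rectangle.

    The rectangle is divided into zone_rows x zone_cols equal divisions,
    and the counts are returned in row-major order.
    """
    height = len(bitmap)
    width = len(bitmap[0])
    if height % zone_rows or width % zone_cols:
        raise ValueError("bitmap size must be a multiple of the zone grid")
    cell_h = height // zone_rows
    cell_w = width // zone_cols

    counts: List[int] = []
    for zr in range(zone_rows):
        for zc in range(zone_cols):
            total = 0
            for r in range(zr * cell_h, (zr + 1) * cell_h):
                for c in range(zc * cell_w, (zc + 1) * cell_w):
                    total += bitmap[r][c]
            counts.append(total)
    return counts
```

The resulting list of counts plays the role of the feature data for one character; a real OCR would typically combine it with further features.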
The feature data extracted at step a3 is compared with reference information stored in a memory device called a standard character pattern dictionary, which is provided in the OCR for storing feature data of plural characters. This character recognition processing, called matching, is executed at step a4. Each segmented character is thereby recognized as one of the reference characters whose feature data is stored in the standard character pattern dictionary.
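The matching of step a4 may be sketched as a nearest-reference search. The distance measure used here (sum of absolute differences between feature values) and the names match_character and pattern_dictionary are assumptions chosen for illustration; the matching criterion of the OCR itself is not specified above.

```python
from typing import Dict, List, Optional


def match_character(features: List[int],
                    pattern_dictionary: Dict[str, List[int]]) -> str:
    """Return the reference character whose stored features are closest.

    pattern_dictionary maps each reference character to its feature data.
    Closeness is measured by the sum of absolute differences (an assumed metric).
    """
    best_char: Optional[str] = None
    best_distance: Optional[int] = None
    for char, reference in pattern_dictionary.items():
        if len(reference) != len(features):
            raise ValueError("feature vectors must have the same length")
        distance = sum(abs(a - b) for a, b in zip(features, reference))
        if best_distance is None or distance < best_distance:
            best_char, best_distance = char, distance
    if best_char is None:
        raise ValueError("pattern dictionary is empty")
    return best_char
```

Because the closest reference is always returned, a noisy or damaged character still yields some result, which is one reason a later correction step is useful.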
At step a5, each group of plural characters recognized by the character recognition processing is compared with the contents of a lexical dictionary provided in the OCR for storing plural words or the like. Characters considered to have been misrecognized at step a4 are corrected. This process is referred to as linguistic processing. In this way, the recognition rate of the OCR is enhanced.
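The linguistic processing of step a5 may be sketched as a comparison of each recognized character group with the words of a lexical dictionary. The correction rule used here (replace the group with a dictionary word of the same length that differs in at most max_mismatches characters) and the names correct_word and lexicon are assumptions made for illustration, not the rule of any particular OCR.

```python
from typing import Set


def correct_word(recognized: str, lexicon: Set[str], max_mismatches: int = 1) -> str:
    """Correct a recognized character group against a lexical dictionary.

    If the group is already a dictionary word it is returned unchanged.
    Otherwise the same-length dictionary word with the fewest differing
    characters is returned, provided it differs in at most max_mismatches
    positions; if no such word exists the group is left as recognized.
    """
    if recognized in lexicon:
        return recognized
    best_word = recognized
    best_mismatches = max_mismatches + 1
    for word in sorted(lexicon):  # sorted for a deterministic result on ties
        if len(word) != len(recognized):
            continue
        mismatches = sum(1 for a, b in zip(word, recognized) if a != b)
        if mismatches < best_mismatches:
            best_word, best_mismatches = word, mismatches
    return best_word
```

For example, with a lexicon containing "language", the group "1anguage" (digit one misread for the letter l) would be corrected, while a group with no close dictionary word is passed through unchanged.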
Afterwards, at step a6, a document file obtained by converting the recognized characters individually into character codes is delivered.
Referring to FIG. 2, the document file delivered from the OCR in the preceding procedure is fed into a translation apparatus at step b1.
The translation apparatus, at step b2, recognizes each group of plural continuous character codes as one word, and classifies them by the parts of speech, on the basis of the character codes composing the input document file.
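The word recognition of step b2 may be sketched as a dictionary-driven segmentation of the continuous character string. The greedy longest-match strategy, the "unknown" label, and the names segment and lexicon are assumptions made for illustration; the translation apparatus described above may recognize words differently.

```python
from typing import Dict, List, Tuple


def segment(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Split a continuous character string into (word, part of speech) pairs.

    lexicon maps each known word to its part of speech. At every position
    the longest dictionary word starting there is taken (greedy longest
    match); a character that starts no known word is emitted alone and
    labelled "unknown".
    """
    max_len = max((len(word) for word in lexicon), default=1)
    result: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                result.append((candidate, lexicon[candidate]))
                i += length
                break
        else:
            # No dictionary word starts at this position.
            result.append((text[i], "unknown"))
            i += 1
    return result
```

The parts of speech attached here are the information on which the syntactic analysis of the later steps relies.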
At step b3, referring to the lexical dictionary provided in the translation apparatus, the meaning of the recognized word is extracted.
At step b4, syntactic analysis is effected by analyzing the modifying/modified relationship between the words composing a sentence. This process is conducted on the basis of the parts of speech of the words obtained at step b2.
At step b5, from the plural meanings corresponding to each word obtained at step b3, an adequate meaning is selected corresponding to syntactic analysis. This process is called semantic analysis processing.
At step b6, translated words of one sentence are generated, and a translated document in characters or the like corresponding to the generated translated words is delivered at step b7.
In the OCR, in order to enhance the recognition rate, a lexical dictionary is provided as mentioned above, and a morphological analysis (linguistic processing) of the kind also performed in the translation apparatus is carried out. However, since the document file delivered from the OCR is a continuous string of character codes divided into character units, the translation apparatus can complete translation processing only after performing the same morphological analysis again. That is, the same morphological analysis is performed twice, once during character recognition and once during translation. Thus, the translation apparatus wastes time repeating the morphological analysis already executed in the OCR.