“E-book” or “electronic book” is a general term for the combination of an e-reader and the digital content stored in it. The current trend is toward replacing conventional paper books with e-books. According to the Oxford Dictionary, “an e-book is the electronic edition of a printed book, which can be read from a personal computer or a hand-held device”. The hardware reading interface is generally called an “e-reader”. Personal computers and certain mobile phones may also be used as e-readers.
An e-book can be read on different types of e-readers (also called information loaders). Accordingly, producing a digital file that can be read on various e-readers is a critical challenge for digital publishing.
Most existing books are printed on paper with ink and therefore cannot be read electronically by e-readers. One general solution is to scan the paper books into image files that can be loaded and displayed on an e-reader. However, a scanned image file cannot be automatically line-fed (that is, have its lines re-broken) according to the view dimensions of the reading interface. Consequently, a user who loads a scanned image file directly cannot read the entire page without frequently dragging the vertical and/or horizontal scrollbars or changing the view dimensions. The result is a time-consuming and sometimes frustrating reading experience.
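To make the scrolling burden concrete, the following minimal sketch counts how many distinct view positions are needed to cover one scanned page shown at its native resolution. The function name `viewports_needed` and all pixel dimensions are hypothetical values chosen for illustration; they do not come from any particular e-reader.

```python
import math


def viewports_needed(page_w: int, page_h: int, view_w: int, view_h: int) -> int:
    """Count the non-overlapping view positions needed to cover a page image.

    All arguments are pixel dimensions. The page is assumed to be shown
    unscaled, so the user must pan across it to see every part.
    """
    columns = math.ceil(page_w / view_w)  # horizontal pan positions
    rows = math.ceil(page_h / view_h)     # vertical pan positions
    return columns * rows


# Hypothetical example: a 1700 x 2200 pixel scan on a 600 x 800 pixel view.
positions = viewports_needed(1700, 2200, 600, 800)  # 3 columns * 3 rows = 9
```

Under these assumed numbers, a single page requires nine view positions, which illustrates why reading an unscaled scan involves repeated scrolling in both directions.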
Electronic transcription using OCR (Optical Character Recognition) is a potential solution to the aforementioned problem. OCR electronically transcribes an image file into an editable digital text file that can be loaded and line-fed according to the view dimensions of the reading interface.
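The benefit of having text rather than an image can be sketched as follows: once the characters are available, the lines can be re-broken for any view width. This is a minimal illustration using Python's standard `textwrap` module. The function name `reflow`, the sample sentence, and the character-based view widths are assumptions for the example; a real e-reader would measure widths in pixels and account for font metrics.

```python
import textwrap


def reflow(text: str, view_width_chars: int) -> list[str]:
    """Break text into lines no longer than the given width in characters."""
    return textwrap.wrap(text, width=view_width_chars)


# Hypothetical transcribed sentence, re-broken for two assumed view widths.
sample = (
    "OCR transcribes an image file into editable text "
    "that can be wrapped to fit the view."
)
narrow = reflow(sample, 20)  # e.g. a small phone screen
wide = reflow(sample, 40)    # e.g. a larger tablet screen

for line in narrow:
    print(line)
```

The same text yields more, shorter lines on the narrow view and fewer, longer lines on the wide view, with no horizontal scrolling in either case.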
A basic requirement of a workable OCR system is the ability to correctly recognize the text embedded in an image file. Because the English language has a limited number of letters and punctuation marks, a good OCR system for English texts can recognize almost 100% of the text in scanned files. However, the performance of OCR systems for ideographic languages or block languages such as traditional Chinese, simplified Chinese, Japanese, and Korean is much less satisfactory because of the large set of characters that an OCR system is required to handle. To achieve a readable level of correctness, the transcribed text resulting from OCR of Chinese texts must be proofread and corrected manually. The significant overhead of this proofreading step makes OCR an unacceptable solution for converting the enormous quantity of existing paper books in ideographic languages into electronically readable e-books. We further remark that if the text image file is only for reading and not for editing, as is the case in most e-book applications, electronic recognition of characters may be altogether unnecessary.