1. Technical Field
The present application relates generally to a system and method for managing textual archives and, more particularly, to a system and method for indexing and searching textual archives using semantic units of words, such as syllables and morphemes.
2. Description of Related Art
There are vast library archives that store manuscripts, manuals, and written and typed texts in various languages. Moreover, various techniques known to those skilled in the art are employed for indexing textual data stored in such archives, as well as for searching for target text. For example, techniques similar to those that are used for automatic indexing of audio data can be applied to automatic indexing of handwriting data in general, wherein a word is used as the basic unit for indexing and searching. Typically, with these conventional audio methods, audio data is transcribed (via automatic speech recognition or manually), time stamped and indexed via words. Similar indexing methods are employed for textual manuscripts, wherein the textual manuscripts are processed by AHR (automatic handwriting recognition) or OCR (optical character recognition) systems to produce electronic text in some format (e.g., ASCII). This decoding process generates an index from electronic textual files into stored textual data that can be used for search and data retrieval in textual archives.
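The word-based indexing described above can be pictured with a minimal sketch. It assumes the AHR or OCR stage has already produced plain electronic text; the function names `build_word_index` and `search`, the document identifiers, and the sample texts are all hypothetical and serve only as illustration, not as the method of any particular conventional system.

```python
from collections import defaultdict


def build_word_index(documents):
    """Map each word to the (document id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # The word is the basic unit: split the recognized text on white space.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


def search(index, word):
    """Return every recorded location of a word, or an empty list if it is unknown."""
    return index.get(word.lower(), [])


# Hypothetical recognizer output for two stored manuscripts.
documents = {
    "manuscript_1": "the archive stores old manuscripts",
    "manuscript_2": "search the archive by word",
}
index = build_word_index(documents)
hits = search(index, "archive")
```

A query for a word that never reached the index, such as `search(index, "syllable")`, returns an empty list, which is the behavior that makes unknown words a problem for word-based retrieval.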
In general, there are various problems associated with word-based indexing and searching systems. In word-based systems, a vocabulary and a language model based on known words must be generated before searching can begin. Typically, there are unknown words (words that are not accounted for in the language model and vocabulary), which renders the searching mechanism inefficient, since it can only work with known words that have a good language model score. Furthermore, a searching mechanism that is based on indexing of letters would be very inaccurate, since it is often highly ambiguous which letter an individual written symbol represents in an unknown word. Indeed, people often write inaccurately, and some written symbols resemble different characters.
The disadvantages associated with word-based systems are even more apparent with languages in which the unit “word” in a text is ambiguous (such as Chinese) or with languages that have a very large number of word forms (such as Slavic languages). Indeed, for such languages, traditional word-based methods for indexing data are not readily applicable due to certain features of these languages. For example, in most Asian languages such as Chinese, Japanese, Korean, Thai and Vietnamese, character strings do not have “marks”, such as blank spaces, that clearly indicate the ends of words (in contrast to English and most European languages, where word boundaries appear in the printed text or computer text file as “white spaces”). In addition, Slavic languages operate with vocabularies consisting of several million words, which makes it difficult to build a hash table for word-indexing purposes in connection with such enormous vocabularies.
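The word-boundary problem can be seen in a short sketch. It assumes a naive tokenizer that relies only on white space; the function name `whitespace_words` and both sample sentences are illustrative assumptions, not taken from any particular system.

```python
def whitespace_words(text):
    """Split text into 'words' using white space as the only boundary mark."""
    return text.split()


# English marks word boundaries with spaces, so splitting recovers the words.
english = "word boundaries are marked by spaces"
english_tokens = whitespace_words(english)

# A Chinese sentence (roughly "we look for manuscripts in the library")
# contains no spaces, so the whole string comes back as a single token.
chinese = "我们在图书馆查找手稿"
chinese_tokens = whitespace_words(chinese)

print(len(english_tokens), len(chinese_tokens))
```

Because the entire Chinese string is returned as one token, a word-keyed index built this way would contain a single entry for the whole sentence and could not answer a query for any word inside it.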
Another disadvantage associated with word-based indexing and searching systems is that certain languages have different fonts and styles for writing characters. These fonts and styles are time dependent. For example, Chinese characters (ideographs or pictographs, never hieroglyphs) have developed over about 7000 years. The handwriting styles were fixed around the 13th century. There are several versions, but for handwriting recognition there is essentially a single style, which is the standard script style (KaiShu). A similar style is the informal script (XingShu), which is more difficult to recognize. The fonts were fixed in the 10th century, after the invention of printing. Such fonts are still used today and are called “Song Style”, after the Song dynasty (900–1200 A.D.).
Similarly, the spelling of Russian words was radically changed after the Russian Revolution in 1917. For example, one Russian character that occurred often at the end of words was eliminated from the vocabulary.
Another difficulty in indexing Chinese textual data, for example, is that several methods are being developed for inputting Chinese characters through a keyboard. A first method uses Pinyin, a system for writing characters phonetically with the Roman alphabet. A second method recognizes Chinese characters, so long as the strokes of each character are written in a certain order. Currently, the Chinese IBM WorkPad™ supports simplified Chinese, which has fewer characters than traditional Chinese (7000 versus 10,000) and fewer strokes in some characters. Finally, a hybrid method allows the first character to be entered phonetically and the following one by strokes.
For all of the reasons described above, it is difficult to apply automatic handwriting recognition or OCR to textual data for indexing purposes when word-based recognition of handwritten (or typed) input is used. Accordingly, a need exists in the art for a system and method for indexing and searching textual archives that is independent of fonts and styles and that is operable with unknown words. Such a system would be particularly useful with Asian and Slavic languages, wherein word-based indexing and searching techniques are inefficient for managing textual archives.