A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever, most especially as they relate to the disclosed computer programs, listings, or descriptions.
1. Field of the Invention
The present invention relates to the field of computer systems using techniques for text storage and retrieval. More particularly, it relates to techniques for the storage and retrieval of non-English language texts for use with systems which are designed around the English alphabet.
2. Background
Document retrieval systems, or automated text processor systems, are a major application in many computer systems today. These are systems in which various kinds of text are stored in some form of computer memory or storage space, and can be efficiently accessed by the user and rapidly retrieved from that memory or storage.
One type of such storage space which is being used by many computer systems today is the Compact Disc-Read Only Memory (CD-ROM). CD-ROMs are disk files which can contain millions of characters of data on a single disk.
One type of software system used to read such CD-ROMs and retrieve data from that type of data base is the system called AnswerBook™ developed by Sun Microsystems, Inc. (AnswerBook is a trademark of Sun Microsystems, Inc.). AnswerBook, like other such systems, provides the ability to do full-text searching (sometimes called "content-based retrieval") of over 16,000 pages of documentation. Such searching allows the user to enter a word or phrase or sentence (that is, a string of characters) and ask the text retrieval system to search the text stored for any instances of the word or character string entered, and to rapidly display those instances.
When documents are stored on CD-ROMs or other storage devices, they are typically stored as text characters which are encoded in the American Standard Code for Information Interchange (ASCII) 8-bit format. Since the English alphabet only contains 26 characters, and since that number plus the usual punctuation and special characters total less than 256 different characters, the binary representation of those characters will fit in one 8-bit byte of computer data (2^8 = 256).
Some non-English languages have many more than 256 characters; for example, the Japanese language requires a character set of over 8,000 characters. Since this number cannot be accommodated within the range of an 8-bit (1 byte) number, multiple-byte characters must be used to describe the Japanese character set for computers. As a result, since most automated text processing systems cannot display or access a multi-byte text file, existing automated text processing systems and available CD-ROM text data bases have not been usable for Japanese or other languages with more than 256 characters in their character sets.
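The byte-size argument above can be illustrated with a minimal sketch. It assumes Python's built-in `euc_jp` codec as a stand-in for a multiple-byte Japanese encoding; the helper names `encoded_length` and `fits_in_ascii` are hypothetical and are not part of any system described here.

```python
def encoded_length(ch: str, encoding: str = "euc_jp") -> int:
    """Return the number of bytes needed to store one character."""
    return len(ch.encode(encoding))


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# One 8-bit byte can distinguish at most 2**8 = 256 different characters.
SINGLE_BYTE_CAPACITY = 2 ** 8

if __name__ == "__main__":
    print(SINGLE_BYTE_CAPACITY)      # capacity of one byte
    print(encoded_length("A"))       # an English letter needs one byte
    print(encoded_length("日"))      # a kanji needs more than one byte
    print(fits_in_ascii("CD-ROM"))   # English text is representable in ASCII
    print(fits_in_ascii("日本語"))   # Japanese text is not
```

A program that assumes one byte per character would see each kanji here as two unrelated byte values, which is the mismatch described above.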
While much progress is being made to standardize code sets for the languages of the world and to develop computer applications to use these universal code sets, it is generally necessary to rewrite the English language based applications programs themselves in order to use the programs with another language such as Japanese. The cost of such rewrites in many instances is prohibitive. The present invention makes use of an alternative scheme for using existing search and retrieval programs without the necessity of rewriting them by a novel technique of converting the non-English character code sets into ASCII format.
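As a purely hypothetical illustration of the general idea of converting multiple-byte character codes into ASCII (this is not the conversion scheme of the present invention, which is not detailed in this section), the sketch below rewrites each non-ASCII character as the letter `x` followed by the hexadecimal digits of its EUC bytes. The function name `to_ascii`, the `x` marker, and the use of Python's `euc_jp` codec are all assumptions made for the example.

```python
def to_ascii(text: str) -> str:
    """Rewrite text so that it contains only ASCII characters.

    ASCII characters pass through unchanged. Every other character becomes
    a token of the form 'x' + uppercase hex of its EUC-JP bytes. The 'x'
    marker keeps tokens aligned on character boundaries. This is an
    illustrative assumption only; it does not guard against literal ASCII
    text that happens to look like a token.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("x" + ch.encode("euc_jp").hex().upper())
    return "".join(out)


if __name__ == "__main__":
    document = to_ascii("日本語 CD-ROM")
    query = to_ascii("本")
    # An unmodified ASCII substring search can now locate the query.
    print(document)
    print(document.find(query))
```

Because both the stored text and the query pass through the same conversion, an existing ASCII-only search routine can compare them without being rewritten.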
The International Standards Organization (ISO) has adopted various standard coding schemes to handle different languages and revises these from time to time to add more languages. As early as April 1984, Xerox Corporation published its own character code standard which included code assignments for Greek, Cyrillic, and Japanese characters in addition to the Latin character set defined by ISO 646. (For more information on the early Xerox standard see "Xerox Network Systems Architecture, General Information Manual" XNSG 068504, April 1984, pp. 57-61.) Subsequently, various Japanese Industrial Standard (JIS) code sets were defined, and similar standard codes called the "Extended Unix Code (EUC)", which conformed to ISO Standard 2022, were defined by AT&T. More recently, the Open Software Foundation, Unix International and Unix Systems Laboratories Pacific have agreed to support the Extended Unix Code (EUC) for the Japanese language, enhancing prospects for portability and interoperability of computer applications. This common definition (EUC) includes support for Japanese standard code sets established in 1990, JIS X0212 Supplemental Kanji, JIS X0208 Kanji, and JIS X0201 One-byte Kana. These are described in the Standard publications titled the "Code of the Japanese Graphic Character Set for Information Exchange". (UNIX® is a registered trademark of UNIX System Laboratories, Inc.)
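The way a Japanese EUC byte stream mixes these code sets can be sketched as follows. The sketch assumes the commonly documented EUC-JP layout, in which a lead byte of 0x8E introduces a JIS X0201 kana character (two bytes in total), a lead byte of 0x8F introduces a JIS X0212 character (three bytes in total), and a lead byte in the range 0xA1 to 0xFE begins a two-byte JIS X0208 character. The function name `euc_jp_code_sets` is hypothetical.

```python
from typing import List, Tuple


def euc_jp_code_sets(data: bytes) -> List[Tuple[str, bytes]]:
    """Split an EUC-JP byte string into (code set name, raw bytes) pairs."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            name, width = "ASCII", 1
        elif lead == 0x8E:
            name, width = "JIS X0201 kana", 2
        elif lead == 0x8F:
            name, width = "JIS X0212", 3
        elif 0xA1 <= lead <= 0xFE:
            name, width = "JIS X0208", 2
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError(f"truncated character at offset {i}")
        result.append((name, data[i:i + width]))
        i += width
    return result


if __name__ == "__main__":
    # 'A', one JIS X0208 kanji, one one-byte kana, one JIS X0212 character.
    sample = b"A\xc6\xfc\x8e\xb1\x8f\xb0\xa1"
    for name, raw in euc_jp_code_sets(sample):
        print(name, raw.hex())
```

The varying character widths (one, two, or three bytes) are what an application written for single-byte input cannot interpret.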
While these standard code definitions for Japanese characters, which define the characters in terms of multiple bytes of coded information, are a problem for applications designed to handle single byte coded input information, the written Japanese language poses many other complex issues for a text parser. Two significant problems which must be resolved in order to permit full-text searching of Japanese language text are 1) the problem of how to separate words in the text (there is no delimiter for words such as white spaces in English text); and 2) the problem of compound nouns that must be broken up for improved searching. The present invention also provides a way to handle these problems within the context of an ASCII based text processing system. No methods for solving such problems to permit the use of existing English based text retrieval applications with non-English complex languages are known in the prior art.
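The two parsing problems named above can be made concrete with a short sketch. It shows that whitespace splitting, which works for English, returns a Japanese sentence as a single token, and then applies a crude heuristic that breaks the text wherever the script changes (kanji, hiragana, katakana). This heuristic is an assumption chosen only for illustration; it is not the method of the present invention, and the names `script_of` and `split_by_script` are hypothetical.

```python
from itertools import groupby
from typing import List


def script_of(ch: str) -> str:
    """Classify a character by Unicode block (a rough approximation)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(text: str) -> List[str]:
    """Break text into runs of consecutive characters of the same script."""
    return ["".join(run) for _, run in groupby(text, key=script_of)]


if __name__ == "__main__":
    english = "full text search"
    japanese = "東京都に住んでいます"

    print(english.split())            # whitespace separates English words
    print(japanese.split())           # no delimiter: one undivided token
    print(split_by_script(japanese))  # script changes give rough boundaries
```

Even with this heuristic, the run of kanji "東京都" remains a single unit although it is a compound that a search for "東京" alone should match, which is the second problem described above.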