The present application is related to the following commonly-owned U.S. patent application(s), the disclosures of which are hereby incorporated by reference in their entirety, including any incorporations-by-reference, appendices, or attachments thereof, for all purposes:
Ser. No. 09/614,465, filed on <the same day as the present application>, and entitled SYSTEM AND METHODS FOR DETERMINING SEMANTIC SIMILARITY OF SENTENCES;
Ser. No. 60/212484, filed on <the same day as the present application>, and entitled SYSTEM AND METHODS FOR ACCEPTING USER INPUT IN A DISTRIBUTED ENVIRONMENT IN A SCALABLE MANNER; and
Ser. No. 09/614,050, filed on <the same day as the present application>, and entitled SYSTEM AND METHODS FOR FACILITATING MANUAL ENTRY OF, AND USE OF, IDEOGRAPHIC TEXT IN A COMPUTER.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to information retrieval. More especially, the present invention relates to document retrieval using the Chinese language, or like languages. Even more especially, the present invention relates to document retrieval involving remote servers on a communication network, for example, World Wide Web sites on the Internet.
The World Wide Web (the Web) is a mine of information. Unfortunately, it is frequently not easy to find needed information on the Web. The problem is not that the Web lacks the needed information. Rather, the problem is that the Web has too much information that is not needed. Various online search engines attempt to help users find just the information that is most needed, based on queries supplied by the user. Most of these search engines require their users to learn and use particular query syntaxes, for example syntaxes that require keywords combined by boolean operators. Learning and mastering such syntaxes is inconvenient for the users. More recently, some search engines have begun to allow users to enter English-language queries in the form of natural-language sentences. Nevertheless, there is still much room for improvement.
In particular, although some search engines now allow users to enter queries in the form of natural-language sentences, there is still a need to improve such systems so that they process queries to produce only the most relevant documents. Further, there is a need for systems and methods that allow users to search for documents using natural-language sentences that include words of the Chinese language, or similar languages. Still further, such improved systems and methods should be efficient and suitable for large-scale, real-time use on the Internet or on other communication networks. The present invention satisfies these and other needs.
A system and associated methods identify documents relevant to an inputted natural-language user query. According to one aspect of the invention, relevant documents are identified by: selecting a set of keywords from the user query; determining at least one word, not necessarily found in the user query, that is semantically similar to a keyword of the set of keywords; using the set of keywords and the at least one word to determine a subset of word sets from a database of pre-stored word sets, wherein the pre-stored word sets are each pre-associated with at least one document; determining a plurality of word sets, from the subset of word sets, that is most semantically similar to the user query; and identifying documents that have been pre-associated with the plurality of word sets as being relevant to the natural-language user query.
According to another aspect of the invention, a system identifies relevant documents. The system includes means for selecting a set of keywords from the user query; means for determining at least one word, not necessarily found in the user query, that is semantically similar to a keyword of the set of keywords; means for using the set of keywords and the at least one word to determine a subset of word sets from a database of pre-stored word sets, wherein the pre-stored word sets are each pre-associated with at least one document; means for determining a plurality of word sets, from the subset of word sets, that is most semantically similar to the user query; and means for identifying documents that have been pre-associated with the plurality of word sets as being relevant to the natural-language user query.
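The sequence of steps summarized above may be easier to follow as a small sketch. The following Python fragment is illustrative only and is not the disclosed implementation. The stop-word list, the table of similar words, the sample word sets, the document identifiers, and all function names are hypothetical. Jaccard overlap is used merely as a stand-in for a semantic similarity measure, and the comparison is made against the expanded keyword set as a simplification. Whitespace tokenization is assumed for brevity; a query in Chinese or a like language would first require word segmentation.

```python
# Illustrative sketch only; every name and data value here is hypothetical.

# Hypothetical stop words removed during keyword selection.
STOPWORDS = {"how", "do", "i", "a", "the", "to", "of", "in", "is", "what"}

# Hypothetical table mapping a keyword to semantically similar words.
SIMILAR = {
    "buy": {"purchase"},
    "laptop": {"notebook"},
}

# Hypothetical database: each pre-stored word set is pre-associated
# with at least one document identifier.
WORD_SETS = [
    (frozenset({"purchase", "notebook", "online"}), ["doc_a"]),
    (frozenset({"repair", "laptop"}), ["doc_b"]),
    (frozenset({"weather", "forecast"}), ["doc_c"]),
]


def select_keywords(query):
    """Select keywords by lowercasing, stripping punctuation, dropping stop words."""
    words = {w.strip("?.,!") for w in query.lower().split()}
    return words - STOPWORDS - {""}


def expand(keywords):
    """Add words that are similar to a keyword but need not appear in the query."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= SIMILAR.get(keyword, set())
    return expanded


def jaccard(a, b):
    """Set overlap, used here only as a stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_documents(query, top_k=2):
    """Return documents pre-associated with the top_k most similar word sets."""
    terms = expand(select_keywords(query))
    # Subset of the database: word sets sharing at least one term with the query.
    candidates = [(ws, docs) for ws, docs in WORD_SETS if ws & terms]
    # Rank the subset by similarity and keep the best top_k word sets.
    ranked = sorted(candidates, key=lambda c: jaccard(c[0], terms), reverse=True)
    result = []
    for _, docs in ranked[:top_k]:
        for doc in docs:
            if doc not in result:
                result.append(doc)
    return result
```

In this sketch, calling find_documents("How do I buy a laptop?") selects the keywords "buy" and "laptop", expands them with "purchase" and "notebook", narrows the database to the two word sets that share a term with the expanded set, and returns their documents in order of similarity. Filtering before ranking keeps the more expensive similarity comparison confined to a small subset, which is in keeping with the stated goal of large-scale, real-time use.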