A search engine is designed to search for information on the World Wide Web. It typically collects information from the Internet using dedicated computer programs that follow defined policies. A search engine also provides a retrieval service to users: it organizes and processes the information it collects and displays the processed information to users.
Web search engines typically work by storing information about many web pages. These pages are retrieved by information capture systems referred to as Web crawlers (sometimes also known as spiders). A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Most Web crawlers create a copy of each visited page for later processing by a search engine, which indexes the downloaded pages to provide fast searches. In general, a Web crawler starts with a list of URLs to visit, referred to as the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, referred to as the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. Web pages are captured in this crawling process along with their hyperlinks; the captured pages are called web page snapshots. Because hyperlinks are widely used on the Internet, most web pages can, in theory, be collected starting from certain web pages.
When the captured web pages are processed, keywords are extracted and indexes are established in order to provide search services. Then, when a user enters a query into a search engine (typically as keywords), the search engine examines its index and provides a listing of the best-matching Web page URLs according to its criteria, usually with a short summary containing the document title and sometimes part of the text. The index is built from the information stored with the data and the method by which the information is indexed. The usefulness of a search engine depends on the relevance of the result set it returns. While many pages may include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others.
Most search engines employ methods to rank the results so that the “best” results appear first. How a search engine decides which pages are the best matches, and in what order the results should be shown, varies widely from one search engine to another.
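The crawl-and-index process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the in-memory `PAGES` table stands in for the Web, the `example` URLs are hypothetical, the breadth-first visiting order is only one possible crawl policy, and whitespace splitting stands in for keyword extraction. Ranking is omitted, and results are simply sorted by URL.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
PAGES = {
    "http://a.example/": ("search engines index web pages",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("web crawlers follow hyperlinks",
                          ["http://c.example/"]),
    "http://c.example/": ("an index maps keywords to pages",
                          ["http://a.example/"]),
    "http://d.example/": ("no page links here", []),
}

def crawl(seeds, fetch):
    """Visit URLs breadth-first from the seeds; return {url: text} snapshots."""
    frontier = deque(seeds)   # crawl frontier: URLs still to visit
    seen = set(seeds)         # avoids queueing the same URL twice
    snapshots = {}
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:      # unknown URL: nothing to capture
            continue
        text, links = page
        snapshots[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return snapshots

def build_index(snapshots):
    """Build an inverted index: keyword -> set of URLs containing it."""
    index = {}
    for url, text in snapshots.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Return the URLs that contain every keyword in the query."""
    postings = [index.get(word, set()) for word in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

snapshots = crawl(["http://a.example/"], PAGES.get)
index = build_index(snapshots)
```

In this sketch, `http://d.example/` is never captured because no crawled page links to it, which reflects the point that coverage depends on the hyperlinks reachable from the seeds.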
In particular, for a search engine capable of searching Chinese text, a Chinese character partitioning operation is needed during the indexing and querying processes. The conventional Chinese partitioning method is a monadic partition method, in which each Chinese character in a sentence is taken as a single unit. For example, a monadic partition of the phrase “中国股市” (“China country stock market,” also translated as “Chinese stock market”) yields a result set of four single characters: “中” (“Chinese”), “国” (“country”), “股” (“stock”), and “市” (“market”). Here, the appearance probability of the character “市” (“market”) on a single search engine server that indexes 6 million documents is as high as 93%. Therefore, if a monadic partition method is used, the lookup of “市” (“market”) will consume a large portion of the search engine server's resources during the “中国股市” (“Chinese stock market”) query. To avoid this situation, a list of high-frequency characters is pre-stored in the search engine, and high-frequency characters are filtered out before the query is conducted. Such high-frequency characters are referred to as “filter characters.” Using the same example, a query of “中国股市” (“Chinese stock market”) is simplified to a query of “中国股” (“Chinese stock”) so that the high-frequency character “市” (“market”) is skipped.
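The monadic partition and filter-character steps can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the string 中国股市 is assumed here as the Chinese rendering of “Chinese stock market,” and the filter list contains just the one character from the example rather than a real pre-stored list.

```python
# Hypothetical pre-stored list of high-frequency "filter characters".
FILTER_CHARACTERS = {"市"}  # "market"

def monadic_partition(text):
    """Split text into single-character units, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

def filter_query(units, filter_characters=FILTER_CHARACTERS):
    """Drop high-frequency characters before the query is conducted."""
    return [unit for unit in units if unit not in filter_characters]

units = monadic_partition("中国股市")   # four single-character units
query_units = filter_query(units)       # the unit for "market" is removed
```

After filtering, only the three units corresponding to “Chinese stock” remain, so the index lookup for the high-frequency character is never performed.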
However, because the conventional monadic partition method omits high-frequency characters during indexing and querying, the result set may not be accurate. Again using “中国股市” (“Chinese stock market”) as an example, because “市” (“market”) is omitted from the query, the result set may contain a large number of documents with phrases such as “中国股民” (“Chinese stock investors”) and “中国股票” (“Chinese stock shares”), which contain “中国股” (“Chinese stock”) but do not accurately match the user's query. Therefore, more accurate and more efficient indexing and querying techniques are needed.
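The loss of accuracy can be reproduced with a small sketch. It assumes a hypothetical four-document corpus, a positional single-character index, and the Chinese strings 中国股市, 中国股民, and 中国股票 as renderings of the three phrases discussed; none of these details come from a specific search engine implementation.

```python
# Hypothetical corpus; the document IDs and texts are illustrative only.
DOCUMENTS = {
    1: "中国股市",  # Chinese stock market
    2: "中国股民",  # Chinese stock investors
    3: "中国股票",  # Chinese stock shares
    4: "美国股市",  # a different phrase that also ends in "stock market"
}

def build_index(documents, filter_characters):
    """Map each non-filtered character to {doc_id: [positions]}."""
    index = {}
    for doc_id, text in documents.items():
        for pos, ch in enumerate(text):
            if ch not in filter_characters:
                index.setdefault(ch, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, query, filter_characters):
    """Return IDs of documents in which the non-filtered query characters
    occur at the same relative offsets as in the query."""
    units = [(off, ch) for off, ch in enumerate(query)
             if ch not in filter_characters]
    if not units:
        return []
    first_off, first_ch = units[0]
    results = []
    for doc_id, positions in index.get(first_ch, {}).items():
        for pos in positions:
            start = pos - first_off
            if all(start + off in index.get(ch, {}).get(doc_id, [])
                   for off, ch in units):
                results.append(doc_id)
                break
    return sorted(results)

filters = {"市"}
filtered = phrase_query(build_index(DOCUMENTS, filters), "中国股市", filters)
exact = phrase_query(build_index(DOCUMENTS, set()), "中国股市", set())
```

With the filter character omitted, the query matches documents 1, 2, and 3, although only document 1 contains the full phrase; with no filtering, only document 1 matches, at the cost of looking up the high-frequency character.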