A search engine is a computer program that helps a user to locate information. To locate information on a particular topic, a user can submit to a search engine one or more search query terms related to the topic. In response, the search engine executes the search query and generates information about the results of the search. The information about the results of the search usually contains a list of the resources that satisfy the search query.
While search engines may be applied in a variety of contexts, search engines are especially useful for locating resources that are accessible through the Internet. Resources may include files whose content is composed in a markup language such as Hypertext Markup Language (HTML). Such files are typically called pages. Using a web browser, pages may be retrieved by selecting HTML links that contain the Uniform Resource Locators (URLs) of the pages.
Pages may contain words from different languages. For example, one page might contain words from the German language, and another page might contain words from the Korean language. Some words might be compound words. A compound word is a word that contains two or more component words that are independent words in their own right. For example, one English compound word is “firehouse.” The separate words “fire” and “house” both have independent meanings when standing alone. The words “fire” and “house” are component words within the compound word “firehouse.”
In some languages, such as German and Korean, it is common to connect two or more words together to form a compound word, even though the compound word might not have any meaning other than the meaning of its constituent component words. For example, when used together, the German words “kind” and “buch” become the single compound word “kinderbücher.”
One searching for the English words “fire” or “house” is probably not interested in seeing pages that contain the compound word “firehouse.” Similarly, one searching for the English words “grass” or “hop” is probably not interested in seeing pages that contain the compound word “grasshopper.” In English, the meanings of the compound words “firehouse” and “grasshopper” are only loosely related to their component words.
In some languages, such as German and Korean, component words within a compound word are more likely to retain their individual meaning despite being within the compound word. Thus, someone in Germany searching for “buch” would likely be interested in seeing search results that contain the compound word “kinderbücher.” Because many German language pages relating to “buch” are likely to contain “buch” only as a component word within a compound word, ignoring pages that contain “buch” only as a component word within a compound word may cause many highly relevant pages to be missed.
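The difference between whole-word matching and component-aware matching can be illustrated with a minimal Python sketch. The `COMPONENTS` table, the `tokens` and `matches` functions, and the sample page text are hypothetical names and data introduced only for illustration; a practical system would need a far larger dictionary or an automatic way of splitting compound words.

```python
import re

# Hypothetical hand-written table: compound word -> base forms of its components.
COMPONENTS = {"kinderbücher": ["kind", "buch"]}

def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def matches(page_text, term, component_aware=False):
    """Return True if the page contains the term as a whole word, or,
    when component_aware is set, as a listed component of a compound word."""
    term = term.lower()
    for tok in tokens(page_text):
        if tok == term:
            return True
        if component_aware and term in COMPONENTS.get(tok, []):
            return True
    return False

page = "Neue Kinderbücher im Herbst"
whole_word_hit = matches(page, "buch")                       # False: page is missed
component_hit = matches(page, "buch", component_aware=True)  # True: page is found
```

Under these assumptions, a whole-word search for "buch" misses the sample page, while the component-aware search finds it.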
Unfortunately, many search engines do miss highly relevant pages because they ignore pages in this way. If such pages are not ignored, other complications arise. For example, it is desirable for a search engine to “highlight” instances of a search term found in a page or summary description of a page. Highlighting a word means visibly distinguishing that word from other words. For example, a highlighted word may be displayed in a bold, italicized, underlined, or differently colored font. Highlighting an instance of a search word draws a searcher's attention to the search word so that the searcher can quickly ascertain the context of the search word within a page. However, unless a search is for an entire compound word, compound words typically are not highlighted at all.
The useful effects of highlighting would be significantly reduced if very long compound words were highlighted in their entirety even though only a component word thereof was relevant to a search. For example, highlighting would be less useful if the entire compound word “großlangenfeldjahreswagen” were highlighted when a user searched only for the word “jahre.” In some languages, very long compound words are quite common.
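To make the contrast concrete, the following Python sketch highlights only the matching portion of a long compound word rather than the whole word. The function name `highlight_component` and the use of `<b>` tags as the highlighting markup are assumptions made for illustration, and plain substring search is a deliberately naive stand-in for real compound analysis.

```python
def highlight_component(word, component, start_tag="<b>", end_tag="</b>"):
    """Wrap the first occurrence of `component` inside `word` in highlight tags.

    Assumes lowercasing does not change the length of the word, which holds
    for the German example used here.
    """
    start = word.lower().find(component.lower())
    if start == -1:
        return word  # component not present: leave the word unhighlighted
    end = start + len(component)
    return word[:start] + start_tag + word[start:end] + end_tag + word[end:]

result = highlight_component("großlangenfeldjahreswagen", "jahre")
# result == "großlangenfeld<b>jahre</b>swagen"
```

Only the five letters of interest are marked, so the searcher's attention is drawn to the relevant portion of the compound word.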
To complicate matters further, in some languages, such as German, component words are not appended together in original form when forming a compound word. In German, some component words take an alternative form when connected together to form a compound word. An alternative form of a component word differs from the form that the component word takes when standing alone. For example, the compound word resulting from the connection of the component words “kind” and “buch” is “kinderbücher” rather than “kindbuch.” In this case, the alternative form of “kind” is “kinder,” and the alternative form of “buch” is “bücher.” Sometimes, the alternative form of a component word differs so much from the original form of the component word that the alternative form of the component word does not contain the original form of the component word. This alteration makes proper highlighting more difficult.
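One way to picture the difficulty is a lookup from a component word to its alternative forms, sketched below in Python. The `ALTERNATIVE_FORMS` table holds only the two examples given above, and the names `component_span` and `highlight` are hypothetical; this is an illustrative sketch under those assumptions, not a description of any particular search engine.

```python
# Hypothetical table: original form of a component word -> its alternative forms.
ALTERNATIVE_FORMS = {"kind": ["kinder"], "buch": ["bücher"]}

def component_span(compound, term):
    """Return (start, end) of the term, in any known form, within the compound."""
    forms = [term] + ALTERNATIVE_FORMS.get(term.lower(), [])
    lowered = compound.lower()
    # Try longer forms first so that "kinder" is preferred over "kind".
    for form in sorted(forms, key=len, reverse=True):
        start = lowered.find(form.lower())
        if start != -1:
            return start, start + len(form)
    return None

def highlight(compound, term, start_tag="<b>", end_tag="</b>"):
    """Highlight the portion of the compound that corresponds to the term."""
    span = component_span(compound, term)
    if span is None:
        return compound
    start, end = span
    return compound[:start] + start_tag + compound[start:end] + end_tag + compound[end:]

by_buch = highlight("kinderbücher", "buch")  # "kinder<b>bücher</b>"
by_kind = highlight("kinderbücher", "kind")  # "<b>kinder</b>bücher"
```

A plain substring search for "buch" finds nothing in "kinderbücher," which is why the sketch consults the table of alternative forms before searching.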
Based on the foregoing, it is clearly desirable to provide a technique for displaying a compound word in a way that implements useful highlighting when only a portion of the compound word is of interest to a searcher. It is further desirable that the technique provides a way of dealing with alternative forms of component words.