In many instances, a search engine is used to search for information. In general, a search engine is a special program (e.g., computer-executable instructions) designed to help find files (e.g., web pages, images, text, etc.) stored on a computer, for example, a public server or one's own personal computer. A typical search engine allows a user to invoke a query for files that satisfy particular criteria, for example, files that contain a given word or phrase in a title or body. Web search engines generally work by storing information about a large number of web pages retrieved from the World Wide Web (WWW) by a web crawler, an automated web browser that follows essentially every link it locates. The contents of each web page are then analyzed to determine how the page should be indexed; for example, words can be extracted from the titles, headings, or special fields called meta-tags. Data about web pages is stored in an index database for use in later queries. Some search engines store (or cache) all or part of a source page as well as information about the web pages. When a user invokes a query through the web search engine by providing keywords, the web search engine looks up the index and provides a listing of web pages that best match the criteria, usually with a short summary containing the document's title and/or parts of the text.
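By way of illustration only, the indexing and lookup steps described above can be sketched as a small inverted index. The sketch assumes the pages have already been retrieved by a crawler; the names tokenize, build_index, search, and PAGES, as well as the example URLs, are hypothetical and are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose title or body contains it.

    `pages` is a dict of url -> {"title": str, "body": str}, standing in
    for content that a crawler would have fetched (an assumption here).
    """
    index = defaultdict(set)
    for url, page in pages.items():
        for word in tokenize(page["title"] + " " + page["body"]):
            index[word].add(url)
    return index


def search(index, pages, query):
    """Return (url, title) for every page containing all query words."""
    terms = tokenize(query)
    if not terms:
        return []
    # Intersect the URL sets of each term; a missing term yields no matches.
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return [(url, pages[url]["title"]) for url in sorted(matches)]


# Hypothetical pages used only to exercise the sketch.
PAGES = {
    "http://example.com/a": {
        "title": "Search engines",
        "body": "A web crawler follows links.",
    },
    "http://example.com/b": {
        "title": "Cooking",
        "body": "Recipes for the web.",
    },
}

INDEX = build_index(PAGES)
RESULTS = search(INDEX, PAGES, "web crawler")
```

In this sketch the listing is returned in URL order and shows only the title as the summary; ordering results by relevance is a separate step.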
In general, the usefulness of a search engine depends on the relevance of the results it presents to a user and the presentation of such results. While there can be numerous web pages that include a particular word or phrase, some web pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide a “best” result first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another.
There has been much focus on tier one markets, such as the United States and France, in terms of searching as a web service. While this is justifiable from an immediate business point of view, companies that try to enter new markets must offer search engines of competitive quality in the native languages of those markets. One major aspect of serving high quality query results is the ability to rank web documents effectively by surfacing the documents that are relevant from a user standpoint. For scalability and performance reasons, many current web document ranking approaches use machine learning techniques to learn the mapping between query-document pairs and the degree of relevance as judged by users. Yet these data-driven approaches require large amounts of training data for satisfactory performance. For popular or more widespread languages, there are typically enough resources and justification to collect and maintain high quality training data. However, less popular or less widely used languages do not have sufficient amounts of training data for such approaches to provide satisfactory search results.
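As a rough illustration of learning a mapping from query-document pairs to judged relevance, the following sketch trains a small logistic regression model on a handful of labeled pairs and uses it to order candidate documents. The choice of model, the two word-overlap features, the training settings, and the toy judgments in TRAIN are all assumptions made for this example; the text above does not prescribe any of them. With so few judgments the model is only a toy, which is the data-scarcity problem noted above for less widely used languages.

```python
import math


def features(query, title, body):
    """Hypothetical features: fraction of query words found in the title
    and in the body, plus a constant bias term."""
    q = set(query.lower().split())
    t = set(title.lower().split())
    b = set(body.lower().split())
    n = max(len(q), 1)  # avoid division by zero for an empty query
    return [len(q & t) / n, len(q & b) / n, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, x):
    """Predicted probability that the pair is relevant."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic regression by stochastic gradient ascent.

    `examples` is a list of (feature_vector, label) pairs, where label is
    1 if a judge marked the document relevant to the query, else 0.
    """
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            p = score(weights, x)
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights


def rank(weights, query, docs):
    """Return document titles ordered from most to least relevant."""
    scored = [(score(weights, features(query, t, b)), t) for t, b in docs]
    return [t for _, t in sorted(scored, key=lambda pair: -pair[0])]


# Toy relevance judgments (invented for illustration only).
TRAIN = [
    (features("cheap flights", "cheap flights to paris",
              "book cheap flights online"), 1),
    (features("cheap flights", "gardening tips", "how to grow roses"), 0),
    (features("python tutorial", "python tutorial for beginners",
              "learn python step by step"), 1),
    (features("python tutorial", "snake facts",
              "the python is a large snake"), 0),
]

WEIGHTS = train(TRAIN)
RANKED = rank(WEIGHTS, "weather forecast", [
    ("stock prices", "weather affects crops"),
    ("weather forecast today", "the forecast for today"),
])
```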