Web-search-based ad services and search engines have become important tools for providing information to users. One factor in attracting users and advertisers is providing relevant information and ads for a given search query. Search relevance may be determined by a ranking function that ranks the resulting documents according to their similarity to the input query.
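To make the idea of a ranking function concrete, the following is a minimal sketch that scores documents by their cosine similarity to the query over raw term counts. The function names `cosine_similarity` and `rank`, the whitespace tokenization, and the sample strings are illustrative assumptions, not part of any particular search engine.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least similar.

    Assumption: lowercase whitespace tokenization stands in for a real
    tokenizer, and raw term counts stand in for a real weighting scheme.
    """
    query_vec = Counter(query.lower().split())
    scored = [
        (cosine_similarity(query_vec, Counter(doc.lower().split())), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["cheap flights to paris", "paris hotel deals", "used car prices"]
    for score, doc in rank("cheap paris flights", docs):
        print(f"{score:.3f}  {doc}")
```

A production ranking function would combine many more signals than term overlap; the sketch only shows the basic shape of scoring each document against the query and sorting by score.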
Information retrieval (IR) researchers have studied search relevance for various search engines and tools. Representative methods include Boolean, vector space, probabilistic, and language models. Earlier search engines and tools were mainly based on such IR algorithms, and they incorporate the concept of the ranking function to varying degrees. Many factors may affect the ranking function for search relevance, including page content, title, anchor, URL, spam, and page freshness. It is extremely difficult to manually tune ranking function parameters to accommodate these factors for large-scale data sets, such as those common in many applications, including World Wide Web (“Web”) applications and speech and image processing. Machine-learning algorithms have therefore been applied to learn complex ranking functions from such large-scale data sets.
Early algorithms for ranking function learning include polynomial-based regression, genetic programming, RankSVM, and classification-based SVM. However, these algorithms were evaluated only on small-scale data sets because of their high computational cost. In fact, these traditional machine-learning algorithms operate slowly when processing large-scale data sets. Users often wait many hours, days, or even weeks to get results from these data sets. This slow computation may be due, in part, to a typical personal computer (PC) being unable to efficiently exploit the full parallelism available in machine-learning algorithms.
Instruction-level parallelism techniques improve the processing time somewhat. In particular, distributed implementations with process-level parallelism are faster than many PC central processing units (CPUs), which execute instructions in a sequential manner. However, distributed implementations occupy many machines. Additionally, for some algorithms, distributed computing yields poor speed improvement per processor added because of communication cost. A graphics processing unit (GPU)-based accelerator can accelerate only a limited spectrum of machine-learning algorithms because its special hardware structure is optimized for graphics applications. Thus, memory access bandwidth, communication cost, flexibility, and granularity of parallelism remain bottlenecks for these solutions.
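The diminishing return per added processor can be illustrated with a toy cost model. The sketch below assumes that the compute work divides evenly across workers and that communication cost grows linearly with the number of additional workers. Both are simplifying assumptions made here for illustration, and the function names and the numbers used are hypothetical.

```python
def parallel_time(compute_seconds: float,
                  comm_seconds_per_worker: float,
                  workers: int) -> float:
    """Estimated run time under an assumed toy model:
    evenly divided compute plus communication that grows linearly
    with each worker beyond the first."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return compute_seconds / workers + comm_seconds_per_worker * (workers - 1)


def speedup(compute_seconds: float,
            comm_seconds_per_worker: float,
            workers: int) -> float:
    """Single-worker time divided by the time with `workers` workers."""
    baseline = parallel_time(compute_seconds, comm_seconds_per_worker, 1)
    return baseline / parallel_time(compute_seconds, comm_seconds_per_worker, workers)


def efficiency(compute_seconds: float,
               comm_seconds_per_worker: float,
               workers: int) -> float:
    """Speedup per worker; 1.0 would be ideal linear scaling."""
    return speedup(compute_seconds, comm_seconds_per_worker, workers) / workers


if __name__ == "__main__":
    # Hypothetical job: 100 s of compute, 1 s of communication per extra worker.
    for p in (1, 2, 5, 10, 20):
        print(p, round(speedup(100.0, 1.0, p), 2), round(efficiency(100.0, 1.0, p), 2))
```

Under these assumed numbers, going from 10 to 20 workers lowers the speedup rather than raising it, because the communication term outgrows the savings in compute time.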