The invention is related to the art of data search. It is described in reference to World Wide Web and Internet searching. However, those of ordinary skill in the art will understand that the described embodiments can readily be adapted to other database or data search tasks.
A great deal of work is being done to improve database and Web searching. For example, Ayse Goker and Daqing He, in Analyzing Web Search Logs to Determine Session Boundaries for User-Oriented Learning, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 319–322, August 2000, incorporated herein by reference in its entirety, define a search session to be a meaningful unit of activities, with the intention of using it as input for a learning technique. Sessions are determined by a length of time from the first search query. Goker reports that a session boundary of 11–15 minutes compares well with human judgment. This is a simple model, and it does not allow for determining which events in the time window correspond to Web searching. Additionally, Goker analyzed logs from search engines only.
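By way of illustration only, the time-window session model described above might be sketched as follows. The function name, the (timestamp, query) log format, and the 15-minute default are assumptions made for this example and are not taken from the cited paper.

```python
from datetime import datetime, timedelta


def split_sessions(events, window_minutes=15):
    """Group (timestamp, query) log events into time-window sessions.

    A session opens at a query and absorbs every later event that falls
    within `window_minutes` of that first query (assumed log format).
    """
    window = timedelta(minutes=window_minutes)
    sessions = []
    for stamp, query in sorted(events):
        # sessions[-1][0][0] is the timestamp of the current session's first query.
        if sessions and stamp - sessions[-1][0][0] <= window:
            sessions[-1].append((stamp, query))
        else:
            sessions.append([(stamp, query)])
    return sessions


# Hypothetical log entries used only to exercise the sketch.
log = [
    (datetime(2000, 8, 1, 9, 0), "jazz festivals"),
    (datetime(2000, 8, 1, 9, 5), "jazz festivals europe"),
    (datetime(2000, 8, 1, 9, 40), "train timetable"),
]
sessions = split_sessions(log)
```

Such a sketch also shows the limitation noted above: every event inside the window joins the session, whether or not it relates to the search.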
Johan Bollen, in Group User Models for Personalized Hyperlink Recommendation, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 39–50, August 2000, incorporated herein by reference in its entirety, presents a method to reconstruct user searching using the Web server log entries of the Los Alamos Research Library corresponding to access to the digital library of journal articles. The resulting retrieval paths are a group user model. The group user model is used to construct relationships between journals using a V×V matrix, where V is the set of hypertext pages. In this library of journal articles, a journal article is represented by a URL (Uniform Resource Locator). This approach will not scale well, because the matrix grows with the square of the number of pages, and would be overwhelmed when V is the set of publicly-accessed URLs.
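The scaling concern can be made concrete with a rough sketch. The cited paper's actual weighting scheme is not reproduced here; the example simply assumes that relationships are counted between consecutive pages on a retrieval path, and all names are hypothetical.

```python
from collections import defaultdict


def transition_counts(paths):
    """Count page-to-page steps along retrieval paths (sparse V x V matrix)."""
    counts = defaultdict(int)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    return dict(counts)


def dense_matrix_cells(num_pages):
    """Number of cells a dense V x V matrix needs for `num_pages` pages."""
    return num_pages * num_pages


# Hypothetical retrieval paths through journal-article URLs.
paths = [["/j/a", "/j/b", "/j/c"], ["/j/a", "/j/b"]]
counts = transition_counts(paths)
cells_for_million_pages = dense_matrix_cells(1_000_000)
```

Under these assumptions, one million pages already imply 10^12 matrix cells, which illustrates why the approach is difficult to apply to the set of publicly-accessed URLs.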
Many techniques exist for automatically determining the category of a document based on its content (e.g., Yiming Yang and Xin Liu, in A Re-Examination of Text Categorization Methods, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, Calif.), pages 42–49, ACM, August 1999 and its references, all of which are incorporated herein by reference in their entirety) and the in- and out-links of the document. For example, see Jeffrey Dean and Monika R. Henzinger in Finding Related Web Pages in the World Wide Web, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389–401, Elsevier Science, May 1999, incorporated herein by reference in its entirety, and Dharmendra S. Modha and W. Scott Spangler, in Clustering Hypertext with Applications to Web Searching, Proceedings of the ACM Hypertext 2000 Conference (San Antonio, Tex.), May 2000, incorporated herein by reference in its entirety. Giuseppe Attardi, Antonio Gulli, and Fabrizio Sebastiani, in Theseus: Categorization by Context, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389–401, Elsevier Science, May 1999, incorporated herein by reference in its entirety, use the context surrounding a link in an HTML document to extract information for categorizing the document referred to by the link. Oren Zamir and Oren Etzioni, in Web Document Clustering: A Feasibility Demonstration, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) (Melbourne, Australia), pages 46–54, ACM, August 1998, incorporated herein by reference in its entirety, use the snippets of text returned by search engines to quickly group the results based on phrases shared between documents.
Tsuyoshi Murata, in Discovery of Web Communities Based on the Co-Occurrence of References, Proceedings of the Third International Conference on Discovery Science (DS'2000) (Kyoto, Japan), December 2000, incorporated herein by reference in its entirety, computes clusters of URLs returned by a search engine by entering the URLs themselves as secondary queries.
Clusters of similar Web pages can be developed using the approach presented by Dean and Henzinger, which finds pages similar to a specified one by using connectivity information on the Web. The Context Classification Engine catalogs documents with one or more categories from a controlled set. For example, see Classifying Content with Ultraseek Server CCE by Walter Underwood of Inktomi Search Software CCE, Foster City, Calif., incorporated herein by reference in its entirety. The categories can be arranged in either a hierarchical or enumerative classification scheme. Finally, DynaCat, by Wanda Pratt, Marti A. Hearst, and Lawrence M. Fagan in A Knowledge-Based Approach to Organizing Retrieved Documents, Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99); Proceedings of the 11th Conference on Innovative Applications of Artificial Intelligence (Orlando, Fla.), pages 80–85, AAAI/MIT Press, July 1999, incorporated herein by reference in its entirety, dynamically categorizes search results into a hierarchical organization using a model of the domain terminology.
Another approach to document categorization is “content ignorant.” For example, Doug Beeferman and Adam Berger in Agglomerative Clustering of a Search Engine Query Log, Proceedings of the 2000 Conference on Knowledge Discovery and Data Mining (Boston, Mass.), pages 407–416, August 2000, incorporated herein by reference in its entirety, use click-through data to discover disjoint sets of similar queries and disjoint sets of similar URLs. Their algorithm represents each query and URL as a node in a graph and creates edges representing the user action of selecting a specified URL in response to a given query. Nodes are then merged in an iterative fashion until some termination condition is reached. This algorithm forces a hard clustering of queries and URLs. It works on large sets of data in batch mode, and does not include prior labeled data from existing content hierarchies. By focusing on click-through statistics, these authors see only an abbreviated portion of a user's activities while searching. The paper also proposes only one way of improving Web search: offering users alternative queries taken from the disjoint sets of queries built by the algorithm.
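A simplified sketch of such iterative merging on a query/URL graph is given below for illustration. The similarity measure (overlap of neighbor sets), the alternation between the query side and the URL side, and the stopping threshold are assumptions made for this example, as are all function names; the cited paper should be consulted for the actual algorithm.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_best_pair(neighbors, other_side, threshold):
    """Merge the most similar pair of clusters on one side of the graph.

    `neighbors` maps each cluster (a frozenset of members) to the set of
    clusters it is linked to on the other side. Returns False when no pair
    scores above `threshold`.
    """
    best, best_score = None, threshold
    for a, b in combinations(list(neighbors), 2):
        score = jaccard(neighbors[a], neighbors[b])
        if score > best_score:
            best, best_score = (a, b), score
    if best is None:
        return False
    a, b = best
    merged = a | b
    linked = neighbors.pop(a) | neighbors.pop(b)
    neighbors[merged] = linked
    # Re-point the other side's edges at the merged cluster.
    for node in linked:
        other_side[node] = (other_side[node] - {a, b}) | {merged}
    return True


def cluster_clickthroughs(clicks, threshold=0.0):
    """Hard-cluster queries and URLs from (query, clicked_url) pairs."""
    queries, urls = {}, {}
    for query, url in clicks:
        q, u = frozenset([query]), frozenset([url])
        queries.setdefault(q, set()).add(u)
        urls.setdefault(u, set()).add(q)
    progress = True
    while progress:  # stop when neither side has a pair worth merging
        merged_q = merge_best_pair(queries, urls, threshold)
        merged_u = merge_best_pair(urls, queries, threshold)
        progress = merged_q or merged_u
    return set(queries), set(urls)


# Hypothetical click-through records.
clicks = [
    ("cheap flights", "travel.example"),
    ("airline tickets", "travel.example"),
    ("airline tickets", "fares.example"),
    ("python tutorial", "docs.example"),
]
query_clusters, url_clusters = cluster_clickthroughs(clicks)
```

Each query and each URL ends up in exactly one cluster, which is the hard clustering noted above.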
Approaches to hierarchical classification such as that discussed by Ke Wang, Senqiang Zhou, and Shiang Chen Liew in Building Hierarchical Classifiers Using Class Proximity, Proceedings of the Twenty-fifth International Conference on Very Large Databases (Edinburgh, Scotland, UK), pages 363–374, September 1999, incorporated herein by reference in its entirety, when applied to our data, would allow only one URL to be associated with each query.
Most recent work in Web searching has been to improve the search engine ranking algorithms. PageRank, by Sergey Brin and Lawrence Page, in The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference (WWW7) (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, the WISE system by Budi Yuwono and Dik Lun Lee, in WISE: A World Wide Web Resource Database System, IEEE Transactions on Knowledge and Data Engineering, 8(4):548–554, August 1996, incorporated herein by reference in its entirety, the work of Budi Yuwono and Dik L. Lee, in Server Ranking for Distributed Text Retrieval Systems on the Internet, Proceedings of the 5th International Conference on Database Systems for Advanced Applications (DASFAA '97) (Melbourne, Australia), pages 41–49, April 1997, incorporated herein by reference in its entirety, and NECI's metasearch engine, by Steve Lawrence and C. Lee Giles, in Inquirus, the NECI Meta Search Engine, Proceedings of the Seventh International World Wide Web Conference (WWW7) (Brisbane, Australia), pages 95–105, Elsevier Science, April 1998, incorporated herein by reference in its entirety, are examples of such work. Direct Hit (www.directhit.com) claims to track which Web sites a searcher selects from the list provided by a search engine and how much time she spends on those sites, and to take into account the position of each selected site relative to other sites on the list provided. Thus, for future queries, the most popular and relevant sites are notated in the search engine results.
WebWatcher attempts to serve as a tour guide to Web neighborhoods, see Webwatcher: A Learning Apprentice for the World Wide Web by Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell in Proceedings of the 1995 AAAI Spring Symposium on Information Gathering From Heterogeneous, Distributed Environments (Palo Alto, Calif.), pages 6–12, March 1995, incorporated herein by reference in its entirety, and Webwatcher: A Tour Guide for the World Wide Web by Thorsten Joachims, Dayne Freitag, and Tom M. Mitchell in Proceedings of 15th International Joint Conference on Artificial Intelligence (IJCAI97) (Nagoya, Japan), pages 770–777, Morgan Kaufmann, August 1997, incorporated herein by reference in its entirety. Users invoke WebWatcher by following a link to the WebWatcher server, then continue browsing as WebWatcher accompanies them, providing advice along the way on which link to follow next based on a stated user goal. WebWatcher gains expertise by analyzing user actions, statements of interest, and the set of pages visited by users. Their studies suggested that WebWatcher could achieve close to the human level of performance on the problem of predicting which link a user will follow given a page and a statement of interest.
Marko Balabanovic and Yoav Shoham in Fab: Content-Based, Collaborative Recommendation, Communications of the ACM, 40(3):66–72, March 1997, incorporated herein by reference in its entirety, discuss Fab, a Web recommendation system; this system is not designed to assist in Web searching, and it requires users to rate Web pages. WebGlimpse, described by Udi Manber, Mike Smith, and Burra Gopal in WebGlimpse: Combining Browsing and Searching, Proceedings of the 1997 USENIX Annual Technical Conference (Anaheim, Calif.), pages 195–206, January 1997, incorporated herein by reference in its entirety, restricts Web searches to a neighborhood of similar pages, perhaps searching with additional keywords in the neighborhood. It saves one from building site-specific search engines.
Clever, described by Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg in Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Proceedings of the Seventh International World Wide Web Conference (WWW7) (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, and D. Gibson, J. Kleinberg, and P. Raghavan in Inferring Web Communities from Link Topologies, Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space—Structure in Hypermedia Systems (Pittsburgh, Pa.), pages 225–234, June 1998, incorporated herein by reference in its entirety, builds on the HITS (Hypertext-Induced Topic Search) algorithm, which seeks to find authoritative sources of information on the Web, together with sites (hubs) featuring good compilations of such authoritative sources. The original HITS algorithm first uses a standard text search engine to gather a “root set” of pages matching the query subject. Next, it adds to the pool all pages pointing to or pointed to by the root set. Thereafter, it uses only the links between these pages to distill the best authorities and hubs. The key insight is that these links capture the annotative power (and effort) of millions of individuals independently building Web pages. Clever additionally uses the content of the Web pages. SALSA described by R. Lempel and S. Moran in The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, Proceedings of the Ninth International World Wide Web Conference (WWW9) (Amsterdam, Netherlands), May 2000, incorporated herein by reference in its entirety, presents another method to find hubs and authorities.
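The HITS steps summarized above (root set, expansion by in- and out-links, and link-only scoring) can be sketched as follows. This is a minimal illustration under stated assumptions: the text search that produces the root set is not modeled and the root set is simply passed in, links are given as (source, destination) pairs, and the function names and iteration count are hypothetical.

```python
import math


def expand_root_set(root, links):
    """Add every page that points to, or is pointed to by, a root page."""
    pool = set(root)
    for src, dst in links:
        if src in root or dst in root:
            pool.update((src, dst))
    return pool


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(pool, links, iterations=20):
    """Score hubs and authorities using only links between pooled pages."""
    edges = [(s, d) for s, d in links if s in pool and d in pool]
    hubs = {page: 1.0 for page in pool}
    auths = {page: 1.0 for page in pool}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        auths = {page: 0.0 for page in pool}
        for s, d in edges:
            auths[d] += hubs[s]
        auths = normalize(auths)
        # Hub score: sum of authority scores of pages the page links to.
        hubs = {page: 0.0 for page in pool}
        for s, d in edges:
            hubs[s] += auths[d]
        hubs = normalize(hubs)
    return hubs, auths


# Hypothetical link graph and root set.
links = [
    ("hub1", "auth1"),
    ("hub1", "auth2"),
    ("hub2", "auth1"),
    ("other", "elsewhere"),
]
pool = expand_root_set({"auth1", "auth2"}, links)
hubs, auths = hits(pool, links)
```

In this toy graph the page linked from both hubs receives the highest authority score, and the page linking to both authorities receives the highest hub score.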
Paul P. Maglio and Rob Barrett, in How to Build Modeling Agents to Support Web Searchers, Proceedings of the Sixth International Conference on User Modeling (UM97) (Sardinia, Italy), Springer Wien, N.Y., June 1997, incorporated herein by reference in its entirety, studied how people search for information on the Web. They formalized the concept of waypoints, key nodes that lead users to their searching goal. To support the searching behavior they observed, Maglio and Barrett constructed a Web agent to help identify the waypoint based on a user's searching history. Unfortunately, it is not clear how to extend the waypoint URL so that other users can profit from it.
All of this work is motivated, at least in part, by a general need to improve database and Internet searching. However, a large part of the motivation to improve Web searching is brought about by the advent of mobile computing and communication devices and services. For example, cell phone and personal digital assistant (PDA) users are demanding Internet connectivity. One of the fundamental design challenges of today's mobile devices is the constraint imposed by their small displays. For example, PDAs may have a display space of 160×160 pixels, while a cellular phone can be limited to only five lines of 14 characters each. Differences in display real estate and access to peripherals like keyboards and mice can alter the user experience with much of the content available on the Web. These display limitations, as well as bandwidth limitations related to constraints of mobile communication, are accommodated through special connectivity services.
Considering the interface constraints in the mobile environment, one can easily see how important proper selection of content becomes in mobile Web searching applications. Without the benefit of refining content selection, delivery, and distribution, a user may be inundated with search results, and may be unable to manipulate the content in a manner satisfactory to the task, context, or application at hand. As such, it would be desirable to have an improved search system for general Internet and database applications, but also for tailoring search results for display on a limited browser screen.
Of the available methods to improve search results, there are several techniques that are commonly used:
Improved ranking algorithms. Current search engines crawl the Web and build indexes on the keywords that they deem important. The keywords are used to identify which URLs should be displayed. A great deal of work has been done to improve the ranking of the URLs. For example, see the work of Brin and Page mentioned above.
Meta-search engines. A meta-search engine queries a group of popular engines, hoping that the combined results will be more useful than the results from any one engine. For example, MetaCrawler collates results, eliminates duplication, and displays the results with aggregate scores (see The MetaCrawler Architecture for Resource Aggregation on the Web, IEEE Expert, 12(12):8–14, January/February 1997, by Erik Selberg and Oren Etzioni, incorporated herein by reference in its entirety).
Dedicated search engines. There exist a number of search engines specializing in particular topics.
Specialized directories. Yahoo, About, LookSmart, and DMOZ organize pages into topic directories. These special hierarchies are maintained by one or more editors, and hence their coverage is somewhat limited and their quality can vary. These directory structures are also referred to as resource lists or catalogs.
Bookmarks. Individuals often keep a set of bookmarks of frequently visited pages and share their bookmark files with others interested in the same topics, e.g. www.backflip.com.
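As an illustration of the meta-search technique listed above, the collation, de-duplication, and aggregate scoring of results from several engines might be sketched as follows. The reciprocal-rank scoring rule, the function name, and the example URLs are assumptions for this sketch and do not describe MetaCrawler's actual scoring.

```python
from collections import defaultdict


def merge_results(engine_results):
    """Collate ranked URL lists from several engines into one scored list.

    `engine_results` maps an engine name to its ranked list of URLs. Each
    engine contributes 1/rank per URL (assumed rule); a URL returned by
    several engines collapses into one entry whose scores are summed.
    """
    totals = defaultdict(float)
    for urls in engine_results.values():
        seen = set()
        for rank, url in enumerate(urls, start=1):
            if url in seen:  # ignore repeats within a single engine's list
                continue
            seen.add(url)
            totals[url] += 1.0 / rank
    # Highest aggregate score first; ties broken alphabetically by URL.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical engine outputs.
merged = merge_results({
    "engine_a": ["a.example", "b.example", "c.example"],
    "engine_b": ["b.example", "d.example"],
})
merged_urls = [url for url, _ in merged]
```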
With reference to the last two techniques, members of a community (office, work group, or social organization) often think about, and research, the same set of topics. When searching for information on the Web, if others from one's community have recently performed the same searches, it would be helpful to know what they found; search results could then feed into a shared pool of knowledge. To be practically useful, this pool needs to be maintained without requiring direct input from the members of the community.
However, gathering such a pool is only useful if queries are repeated. An examination of 17 months of proxy server logs at Bell Labs showed that 20% of the queries sent to search engines had been made before. Based on this promising number, SearchLight was built. SearchLight, a system disclosed in U.S. patent application Ser. No. 09/428,031, filed Oct. 27, 1999, entitled Method for Improving Web Searching Performance by Using Community-Based Filtering by Shriver and Small, which is incorporated herein by reference in its entirety, transparently constructs a database of search engine queries and a subset of the URLs visited in response to those queries. Then, when a user views the results of a query from a search engine, SearchLight augments the results with URLs from the database. Experimental results indicate that, among all the cases in which a search involves a query contained in the SearchLight database, the desired URL is among those in the SearchLight display 64% of the time.
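The general idea of a community query database can be sketched as follows. This is not the SearchLight implementation; the class name, the query normalization, the visit-count ordering, and the display limit are all assumptions introduced for illustration.

```python
from collections import defaultdict


class CommunityQueryStore:
    """Hypothetical store mapping past queries to URLs visited afterward."""

    def __init__(self):
        self._visits = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _normalize(query):
        # Assumed normalization: ignore case and extra whitespace.
        return " ".join(query.lower().split())

    def record(self, query, url):
        """Note that a community member visited `url` after issuing `query`."""
        self._visits[self._normalize(query)][url] += 1

    def suggest(self, query, limit=3):
        """Return up to `limit` previously visited URLs for a repeated query."""
        known = self._visits.get(self._normalize(query), {})
        return sorted(known, key=lambda url: (-known[url], url))[:limit]


# Hypothetical usage with placeholder URLs.
store = CommunityQueryStore()
store.record("Bell Labs history", "https://example.org/history")
store.record("bell labs  history", "https://example.org/history")
store.record("Bell Labs history", "https://example.org/museum")
suggestions = store.suggest("BELL LABS HISTORY")
```

A caller would display such suggestions alongside the search engine's own results; for a query never seen before, the sketch returns an empty list.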
Unfortunately, if the SearchLight database is large, it will have many of the same problems experienced by other search engines: too many results to display, with the ordering of the results being the only technique to help the user.
There is a desire to provide a scalable method to improve or augment available data searching techniques.