For almost as long as computers have existed, their designers and users have sought improvements to the user interface. Especially as computing power has increased, a greater portion of the available processing capacity has been devoted to improved interface design. Recent examples have been Microsoft Windows variants and Internet web browsers. Graphic interfaces provide significant flexibility to present data using various paradigms, and modern examples support use of data objects and applets. Traditional human-computer interfaces have emphasized uniformity and consistency; thus, experienced users had a shortened learning curve for use of software and systems, while novice users often required extensive instruction before profitable use of a system. More recently, intuitive, adaptable and adaptive software interfaces have been proposed, which potentially allow faster adoption of the system by new users but which require continued attention by experienced users due to the possibility of interface transformation.
While many computer applications are used both on personal computers and networked systems, the field of information retrieval and database access for casual users has garnered considerable interest. The Internet presents a vast, relatively unstructured repository for information, leading to a need for Internet search engines and access portals based on Internet navigation. At this time, the Internet is gaining popularity because of its “universal” access, low access and information distribution costs, and suitability for conducting commercial transactions. However, this popularity, in conjunction with the non-standardized methods of presenting data and the fantastic growth rate, has made locating desired information and navigating through the vast space difficult. Thus, improvements in human consumer interfaces for relatively unstructured data sets are desirable, wherein subjective improvements and wholesale adoption of new paradigms may both be valuable, including improved methods for searching and navigating the Internet.
Generally speaking, search engines for the World Wide Web (WWW, or simply “Web”) aid users in locating resources among the presently estimated one billion addressable sites on the Web. Search engines for the Web generally employ a type of computer software called a “spider” to scan the Web and build a proprietary database that covers a subset of the resources available on the Web. Major known commercial search engines include such names as Yahoo, Excite, and Infoseek. Also known in the field are “metasearch engines,” such as Dogpile and Metasearch, which compile and summarize the results of other search engines without generally themselves controlling an underlying database or using their own spider. All the search engines and metasearch engines, which are servers, operate with the aid of a browser, which is a client, and deliver to the client a dynamically generated web page that includes a list of hyperlinked uniform resource locators (URLs) through which the web browser can directly access the referenced documents.
A Uniform Resource Identifier (RFC 1630) is the name for the standard generic object in the World Wide Web. Internet space is inhabited by many points of content. A URI (Uniform Resource Identifier) is the way to identify any of those points of content, whether it be a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the Web page address, which is a particular form or subset of URI called a Uniform Resource Locator (URL). A URI typically describes: the mechanism used to access the resource; the specific computer that the resource is housed in; and the specific name of the resource (a file name) on the computer. Another kind of URI is the Uniform Resource Name (URN). A URN is a form of URI that has “institutional persistence,” which means that its exact location may change from time to time, but some agency will be able to find it.
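By way of illustration only, the three parts of a typical URI named above (access mechanism, host computer, and resource name) can be separated with the Python standard library. The function name describe_url and the example address are assumptions introduced here for the sketch and do not appear in the source.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the three parts named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "mechanism": parts.scheme,   # how the resource is accessed, e.g. "http"
        "computer": parts.hostname,  # the host that houses the resource
        "resource": parts.path,      # the name of the resource on that host
    }


# Example address chosen for illustration; it is not taken from the source.
print(describe_url("http://www.example.com/docs/index.html"))
```

A URN, by contrast, names a resource without fixing its location, so it would not decompose into a host and path in this way.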
The structure of the World Wide Web includes multiple servers at distinct nodes of the Internet, each of which hosts a web server which transmits a web page in hypertext markup language (HTML) or extensible markup language (XML) (or a similar scheme) using the hypertext transfer protocol (HTTP). Each web page may include embedded hypertext linkages, which direct the client browser to other web pages, which may be hosted within any server on the network. A domain name server translates a domain name, which ends in a top-level domain (TLD), into an Internet protocol (IP) address, which identifies the appropriate server. Thus, Internet web resources, which are typically the aforementioned web pages, are typically referenced with a URL, which provides the domain name or IP address of the server, as well as a hierarchical address for defining a resource of the server, e.g., a directory path on a server system.
A hypermedia collection may be represented by a directed graph having nodes that represent resources and arcs that represent embedded links between resources. Typically, a user interface, such as a browser, is utilized to access hyperlinked information resources. The user interface displays information “pages” or segments and provides a mechanism by which the user may follow the embedded hyperlinks. Many user interfaces allow selection of hyperlinked information via a pointing device, such as a mouse. Once a hyperlink is selected, the system retrieves the corresponding information resource. As hyperlinked information networks become more ubiquitous, they continue to grow in complexity and magnitude, often containing hundreds of thousands of hyperlinked resources. Hyperlinked networks may be centralized, i.e., existing within a single computer or application, or distributed, existing over many computers separated by thousands of kilometers. These networks are typically dynamic and evolve over time in two dimensions. First, the information content of some resources may change over time, so that following the same link at different times may lead to a resource with slightly different, or entirely different, information. Second, the very structure of the networked information resources may change over time, the typical change being the addition of documents and links. The dynamic nature of these networks has significant ramifications in the design and implementation of modern information retrieval systems.
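The directed-graph representation described above may be sketched, purely as an example, with an adjacency mapping in which each key is a resource (node) and each value lists the targets of its embedded links (arcs). The resource names and the function reachable are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical collection: resource -> resources its embedded links point to.
links = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": [],
    "story1": ["news"],
}


def reachable(graph: dict, start: str) -> list:
    """Return the resources reachable from `start` by following links,
    in breadth-first order (the order a user might discover them)."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(reachable(links, "home"))

# The structure is dynamic: adding a document and a link changes the result.
links["story2"] = []
links["news"].append("story2")
print(reachable(links, "home"))
```

Because both the content and the link structure may change between traversals, a retrieval system built on such a graph would presumably need to re-scan it periodically rather than rely on a single snapshot.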
One approach to assisting users in locating information of interest within a collection is to add structure to the collection. For example, information is often sorted and classified so that a large portion of the collection need not be searched. However, this type of structure often requires some familiarity with the classification system, to avoid elimination of relevant resources by improperly limiting the search to a particular classification or group of classifications.
Another approach used to locate information of interest to a user is to couple resources through cross-referencing. Conventional cross-referencing of publications using citations provides the user enough information to retrieve a related publication, such as the author, title of publication, date of publication, and the like. However, the retrieval process is often time-consuming and cumbersome. A more convenient, automated method of cross-referencing related documents utilizes hypertext or hyperlinks. Hyperlink systems allow authors or editors to embed links within their resources to other portions of those resources or to related resources in one or more collections that may be locally accessed, or remotely accessed via a network. Users of hypermedia systems can then browse through the resources by following the various links embedded by the authors or editors. These systems greatly simplify the task of locating and retrieving the documents when compared to a traditional citation, since the hyperlink is usually transparent to the user. Once a hyperlink is selected, the system utilizes it to retrieve the associated resource and present it to the user, typically in a matter of seconds. The retrieved resource may contain additional hyperlinks to other related information that can be retrieved in a similar manner.
It is well known to provide search engines for text records which are distributed over a number of record sets. For example, the Internet presently exists as literally millions of web servers and tens of millions or more of distinct web page uniform resource locators (URLs). A growing trend is to provide web servers as appliances or control devices, and thus without “content” of general interest. On the other hand, the traditional hypertext transfer protocol (HTTP) servers, or “web servers”, include text records of interest to a variety of potential users. Also, by tradition, the web pages, and particularly those with human readable text, are indexed by Internet search engines, thereby making this vast library available to the public.
Recently, the number and variety of Internet web pages have continued to grow at a high rate, resulting in a potentially large number of records that meet any reasonably broad or important search criteria. Likewise, even with this large number of records available, it can be difficult to locate certain types of information for which records do in fact exist, because of the limits of natural language parsers or Boolean text searches, as well as the generic ranking algorithms employed.
The proliferation of resources on the Web presents a major challenge to search engines, all of which employ proprietary tools to sift the enormous document load to find materials and sites relevant to a user's needs. Generally speaking, the procedure followed in making a search is as follows. The user enters a string of words onto a character-based “edit line” and then strikes the “enter” key on the keyboard or selects a search button using a pointing device. The string of words may be fashioned by the user into a Boolean logical sentence, employing the words “AND,” “OR,” and “NOT,” but more typically the user enters a set of words in so-called “natural language” that lacks logical connectors, and software called a “parser” takes the user's natural language query and estimates which logical connections exist among the words. Such parsers have improved markedly in recent years through employment of techniques of artificial intelligence and semantic analysis. Having parsed the phrase, the search engine then searches its database, derived from a spider that has previously scanned the Web, for materials relevant to the query. This process entails a latency period while the user waits for the search engine to return results. The search engine then returns to the user, it is hoped, references to relevant web pages or documents, identified by their URLs or by a hypertext linkage to title information, as a set of hits, often parceled out at the rate of ten per request. If further hits are desired, there is a wait while a request for further hits is processed, and this typically entails another, fresh search and another latency period, wherein the search engine is instructed to return ten hits starting at the next, previously undisplayed, record. Often, each returned hypertext markup language (HTML) page is accompanied by advertising information, which subsidizes the cost of the search engine and search process.
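The general procedure described above (a database built in advance, a Boolean query, and hits returned ten at a time) may be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular search engine: the function names build_index, search and page, the whitespace tokenization, and the example documents are all hypothetical, and the natural language parsing step is omitted.

```python
from collections import defaultdict

PAGE_SIZE = 10  # the text describes hits parceled out ten per request


def build_index(docs: dict) -> dict:
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict, required, excluded=()) -> list:
    """Boolean query: every `required` word AND NOT any `excluded` word."""
    if not required:
        return []
    hits = set.intersection(*(index.get(w.lower(), set()) for w in required))
    for w in excluded:
        hits -= index.get(w.lower(), set())
    return sorted(hits)


def page(hits: list, start: int) -> list:
    """Return up to PAGE_SIZE hits beginning at offset `start`."""
    return hits[start:start + PAGE_SIZE]


# Hypothetical documents standing in for the database gathered by a spider.
docs = {
    "http://example.com/a": "baseball bat and glove",
    "http://example.com/b": "the bat is a flying mammal",
    "http://example.com/c": "baseball scores",
}
index = build_index(docs)
print(page(search(index, ["bat"], excluded=["baseball"]), 0))
```

In this sketch a request for further hits simply calls page with a larger offset; the text indicates that deployed engines typically repeat the search instead, which is the source of the additional latency.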
This advertising information is often called a “banner ad”, and may be targeted to the particular user based on an identification of the user by a login procedure, an Internet cookie, or based on a prior search strategy. Other times, the banner ads are static or simply cycle between a few options.
A well-recognized problem with existing search engines is the tendency to return hits for a query that are so numerous, sometimes in the hundreds, thousands, or even millions, that it is impractical for the user to wade through them and find relevant results. Many users, probably the majority, would say that the existing technology returns far too much “garbage” in relation to pertinent results. This has led to the desire among many users for an improved search engine, and in particular an improved Internet search engine.
In response to the garbage problem, search engines have sought to develop unique proprietary approaches to gauging the relevance of results in relation to a user's query. Such technologies employ algorithms for limiting the records returned in the selection process (the search) and/or for sorting selected results from the database according to a rank or weighting, which may be predetermined or computed on the fly. The known techniques include counting the frequency or proximity of keywords, measuring the frequency of user visits to a site or the persistence of users on that site, using human librarians to estimate the value of a site and to quantify or rank it, measuring the extent to which the site is linked to other sites through ties called “hyperlinks” (see, Google_com and Clever_com), measuring how much economic investment is going into a site (Thunderstone_com), taking polls of users, or even ranking relevance in certain cases according to advertisers' willingness to bid the highest price for good position within ranked lists. As a result of relevance testing procedures, many search engines return hits in presumed rank order of relevance, and some place a percentage next to each hit which is said to represent the probability that the hit is relevant to the query, with the hits arranged in descending percentage order.
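Two of the known techniques listed above, counting keyword frequency and measuring how many other resources link to a site, may be combined into a single score as in the following sketch. The additive formula, the link_weight parameter and all function names are assumptions made for illustration; the source does not disclose how any named engine actually computes its ranking or its displayed percentages.

```python
def keyword_frequency(text: str, query_words) -> int:
    """Count how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(w.lower()) for w in query_words)


def inbound_links(links: dict, url: str) -> int:
    """Count the other resources whose embedded links point to `url`."""
    return sum(url in targets for source, targets in links.items() if source != url)


def rank(docs: dict, links: dict, query_words, link_weight: float = 1.0) -> list:
    """Return (url, score) pairs in descending score order.

    Score = keyword frequency + link_weight * inbound link count; this
    formula is an assumed example, not a disclosed algorithm.
    """
    scored = [
        (url, keyword_frequency(text, query_words) + link_weight * inbound_links(links, url))
        for url, text in docs.items()
    ]
    # Sort by score descending, breaking ties by URL for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical documents and link structure.
docs = {"a": "bat bat ball", "b": "bat cave", "c": "ball game"}
links = {"a": ["b"], "b": [], "c": ["b"]}
for url, score in rank(docs, links, ["bat"]):
    print(url, score)
```

Setting link_weight to zero reduces the sketch to pure keyword counting, which shows how strongly the choice of weighting can reorder the same set of hits.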
However, despite the apparent sophistication of many of the relevance testing techniques employed, the results typically fall short of the promise. Thus, there remains a need for a search engine for uncontrolled databases that provides the user with results that accurately correspond to the information sought.
Advertisers are generally willing to pay more to deliver an impression (e.g., a banner ad or other type of advertisement) to users who are especially sensitive to advertisements for their products or are seeking to purchase products corresponding to those sold by the advertisers, and the economic model often provides greater compensation in the event of a “click through”, which is a positive action taken by the user to interact with the ad to receive further information.
This principle, of course, operates correspondingly in traditional media. For example, a bicycle manufacturer is generally willing to pay more per subscriber to place advertisements in a magazine having content directed to bicycle buffs than in a general interest publication. However, this principle has not operated very extensively in the search engine marketplace, partly because there is little differentiation among the known characteristics of the users of particular search engines, and partly because, even after a search inquiry is submitted, there may be little basis on which to judge what the user's intention or interest really is, owing to the generality or ambiguity of the user's request, so that even after a search request is processed, it may be impossible to estimate the salient economic, demographic, purchasing or interest characteristics of the user in the context of a particular search. In fact, some “cookie” based mechanisms provide long-term persistence of presumed characteristics even when these might be determined to be clearly erroneous. Thus, the existing techniques tend to exaggerate short-term, ignorance-based or antithetical interests of the user, since these represent the available data set. For example, if a child seeks to research the evils of cigar smoking for a school class project, a search engine might classify the user as a person interested in cigar smoking and cigar paraphernalia, which is clearly not the case. Further, the demographics of a cigar aficionado might tempt an advertiser of distilled liquors to solicit this person as a potential client. The presumed interest in cigars and liquor might then result in adult-oriented materials being presented. Clearly, the simple presumptions that are behind this parade of horribles may often result in erroneous conclusions.
Another inherent problem with the present technology of search engines is that the user, to make a request for information, must use words from natural language, and such words are inherently ambiguous. For example, suppose a user enters the word “bat” as a search query to a search engine, which searches the database generated by its associated spider and produces a set of ranked results according to its relevance algorithms. The word bat, however, has several possible meanings. The user could mean a “baseball bat”, or the mammalian bat, or maybe even a third or fourth meaning. Because the technology of existing search engines cannot generally distinguish among various users' intentions, typically such engines will return results for all possible meanings, resulting in many irrelevant or even ludicrous or offensive results.
Yet another problem with existing search engine technologies relates to organizing the results of a search for future use. Internet browsers, which are presently complex software applications that remain operative during the course of a search, can be used to store a particular URL for future use, but the lists of URLs created in this way tend to become very long and are difficult to organize. Therefore, if a user cannot remember a pertinent URL (many of which are long or obscure), the user may be forced to search again for resources that the user might wish were ready at hand for another use. On the other hand, in some instances, it may be more efficient to conduct a new search rather than recalling a saved search.
Although a few search engines for the mass market exist that charge a fee for use, this model has not been popular or successful. Instead, most search engines offer free access, subject to the user tolerating background advertising or pitches for electronic commerce sales or paid links to sites that offer goods and services, including the aforementioned banner ads. These advertisements are typically paid for by sponsors on a per impression basis (each time a user opens the page on which the banner ad appears) or on a “click-through basis” (normally a higher charge, because the user has decided to select the ad and “open it up” by activating an underlying hyperlink). In addition, most search engines seek “partners” with whom they mutually share hyperlinks to each other's sites. Finally, the search engines may seek to offer shopping services or merchandise opportunities, and the engines may offer these either globally to all users, or on a context sensitive basis responsive to a user's particular search.
Therefore, the art requires improved searching strategies and tools to provide increased efficiency in locating a user's desired content, while preventing dilution of the best records with those that are redundant, off-topic or irrelevant, or directed to a different audience.
The art also requires an improved user interface for accessing advanced search functionality from massive database engines.