1. Field of the Invention
This invention is in the broad field of information technology, and pertains more particularly to providing information to customers shopping in brick-and-mortar retail establishments, the information pertaining to products and services for sale.
2. Description of Related Art
As information proliferates at an ever-increasing pace, one of the greatest needs in information technology is for ways to find needed information. This need is served in one important aspect by search engines and associated systems that enable users to find information, such as in web pages in the Internet network. Search systems and search engines are a particular focus in embodiments of the present invention.
A goal of most search engines is to make it possible for users to easily find and/or access relevant data on the world wide web (WWW). Relevance is always of great importance, and is perhaps best judged by the person looking for the information.
A key subsystem of most known search engines is a system for crawling the Web and collecting information, known in the art as a Web crawler. Without regularly crawling the Web to update the collected information, a search engine will rapidly become outdated and irrelevant. Further, Web crawling subsystems need to be efficient and to operate on a relatively large scale. Ideally such search engines should operate without disrupting the Web itself or the sites (pages) that are crawled. Many innovations in this area are sought, including methods for checking pages for updates (for example, by soliciting involvement from content owners in notifying the search engine enterprises of relevant changes), methods for caching data and parallelizing the process of crawling, and more. Typically the result of the Web crawling is a database of Web content that may span more than 10 billion Web pages, all or part of the content of which may be collected and archived by the search engine.
Pages collected by a crawler subsystem are analyzed in a variety of ways well known in the art to create an index of page identifiers and links to the pages. Such a search index serves much the same purpose as the index of a book; for any term or terms entered as search criteria, a list of pages, with links to those pages, is returned. More broadly, a goal of the Web search index is to return a list of pages when a user enters a search query such as, for example, “dramatic innovations”. Typically pages returned are pages in which the terms are simply present, although it might be preferable to also return pages that may not contain the search terms, but may nevertheless be relevant to the needs of the person who enters the search query. For instance, in response to a search query stated as “dramatic innovations”, the search engine might return links to the history of the Wright Brothers' airplane innovation, even though the history may not comprise the specific term. Relevance is of great importance. A Web crawler is a means to an end in search. An index built from information garnered by a crawler is one of the core elements of a search system.
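The book-index analogy above can be illustrated with a minimal sketch of a term-to-page index. The sketch assumes plain whitespace tokenization and matches only pages in which every query term is literally present; the names build_index and search, and the sample page identifiers and text, are hypothetical and do not describe any particular search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased term to the set of page identifiers containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the pages in which every term of the query is present."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical crawled content, for illustration only.
sample_pages = {
    "page-a": "dramatic innovations in aviation",
    "page-b": "history of the Wright Brothers airplane",
    "page-c": "dramatic theater reviews",
}
sample_index = build_index(sample_pages)
matches = search(sample_index, "dramatic innovations")
```

In this sketch "page-b" is not returned because it does not contain the query terms, which is the limitation of simple term matching noted above.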
An index, however, is of little use unless users can use it to search the Web, so a user interface is needed. In such an interface, typically operated from an application known in the art as a browser, the user enters a search query and typically presses Enter. The query is sent, via the Internet network, to the enterprise hosting the search service, of which several major enterprises are well-known. The search engine then uses the present index (the index may change over time as Web crawling progresses) to make a list of Web pages that match the search query. Again, a key challenge is to provide that the most relevant results for this particular user are displayed at or near the top of the list.
The known need for relevance has been a very important motivator in developing a page ranking algorithm. A page ranking algorithm (or node ranking algorithm) is a ranking subsystem, which determines the order of display of the search results. The criticality of this function is that a person searching is going to look at the top-listed pages, rather than digging down to buried information, especially if it is clear that there is a ranking system meant to present more relevant pages nearer the top. Additionally, if the relevance determinations are considered authoritative by many users, the tendency to only look at highly-ranked search results becomes more pronounced, making the impact of the relevance scores very large.
One of the most effective page ranking algorithms in the art at the time of filing the present application is the PageRank algorithm of Google™, Incorporated. The effectiveness of the PageRank algorithm is related in the current art, at least in part, to a structural graph and a matrix computation. The structural graph is a representation of the structure of linkages between pages in the form of a “graph”, as is well known in the art of graph theory. It is well known that, although there are additions and variations, the PageRank system basically works by giving indexed pages a score that is calculated by adding up the number of links that point to the page to be ranked from other pages, and weighting this score based on similar scores calculated for the linking pages. That is, if there are five pages that link to a page to be ranked, but no other page links to the five pages, then the PageRank for that page will be much lower than for a page that has five in-links that each come from highly ranked linking pages (these in turn are highly ranked because many pages link to them, and so on). It is clear that the calculation for page ranking involves relatively complex mathematics, since the score of one page is determined by the scores of linking pages, whose scores are in turn determined by the scores of their linking pages, whose scores are determined by the scores of their linking pages, and so on at least to some pre-determined depth.
From this description it becomes clear why a graph is needed—in current art it is necessary to understand the structure of linkages that connect Web pages in order to perform the calculation, which is based on these links.
In a somewhat abstract sense one may visualize the WWW as a vast array of dots (points, or nodes), each of which represents a Web page connected in the Internet network. To represent nearly all of the existing pages at any one point in time would need perhaps 10^10 points. Each of the pages is, of course, a collection of code, typically in HTML format (or one of its well-known extensions such as DHTML, Cascading Style Sheets, etc.), that defines page content, which may be presented by the page through a user's computer typically using a web browser, which may include text, graphics, audible music and voice, video, and more. Another component of almost any page in the Web is at least one link for initiating a transfer to a different page, or in some cases more recently, initiating a transfer of code and data to a user's computer for some purpose, without requiring transition to a different page.
FIG. 1 is a very simple illustration of the one-dot-for-a-page view of the WWW introduced above. Only five page-representative dots are shown, as sufficient for the purpose, these being pages 101 through 105. A link for the present purpose may be considered the well-known navigational element in the display of a web page for which the cursor typically turns into a hand with a mouseover, and for which clicking-on asserts an address (such as a Uniform Resource Locator, or URL), which takes the user to another Web page. The link area in a display can be an icon, text, or even an animated figure.
In FIG. 1 the links are shown as arrows. Note that page 105 has links to all of pages 101 through 104, none of which link back to page 105. Pages 101 through 104 each have one link to another one of the pages. It is helpful to consider that, although a link is a link, there is a difference in links from the view of the page itself. From the viewpoint of the page, a link may be an out-link (an outgoing link to another page) or an in-link to the instant page from another page. Consider, for example, page 103, which has two in-links, one each from pages 102 and 105, and one out-link to page 104. Consider also that not all links to or from these five pages may be shown, because a very limited subset of pages is illustrated. Page 105, for example, may have several in-links from pages not shown. For the purpose of a state-of-the-art page ranking system, it is the in-links that are typically most important.
In the current art, according to all of the information known to the inventor, the PageRank algorithm and all other search ranking systems are based on the static link structure of the World Wide Web, as briefly described above. The random page graph shown, with the links shown, however, is not a good mathematical model for the purpose. For better computation efficiency a better model (graph) is shown in FIG. 2. The inventor terms this graph a Structural Web Graph (SWG). It should be understood as well, at the outset, that an SWG may only ever show a subset of the WWW structure, and the size and structure of the WWW is in constant flux. In this SWG concept each Web page in the WWW (or a subset) is still a point, but the pages are not illustrated in random space, but in rows and columns. So in the SWG of FIG. 2 there are five rows, each identified by the page association, and also five columns, each also identified by the same page association. By using the same five pages as in FIG. 1, a six-by-six matrix results, considering the five pages and the necessity of having an origin to the matrix. If the matrix were defined for essentially all Web pages, it would be as big as 10^10 rows and 10^10 columns.
In FIG. 2 the rows and columns are shown with identifiers for the pages associated with each row and column. In a workable, mathematical definition to be machine-manipulated, the rows and columns would simply be identified in a data convention; the matrix might never be displayed.
The matrix as shown in FIG. 2 creates a row-column intersection for each page represented with every other page represented in the matrix. This is a basis of its utility. There is also an intersection for each page with itself, which has no utility for the present purpose, and these intersections have been marked in FIG. 2 by an X.
Now consider, as an example of the utility of the SWG, which is well-known in the art, the following illustration. The intersection of the row for page 104 with the column for page 102, which is labeled in FIG. 2 as element 201, presents an opportunity to represent a particular relationship between pages 104 and 102, which may be shown in a number of ways, one of which is simply a value placed at the intersection. In this case the value, by convention, is to represent whether there is an in-link from 102 to 104. Since there is not, the value is zero.
It should be recognized that the convention of labeling an intersection with a value based on the existence of a link from the page represented by the column to the page represented by the row is arbitrary. One could as easily have chosen a convention in which the element 201 would represent a link from page 104 to page 102, and the element would still be set to zero. Under the convention adopted here the value is zero because the path from page 102 to page 104 is indirect; there is no link from 102 to 104 in FIG. 1. A primary function of the SWG utilized in most search engines in the art is to capture the plurality of link relationships between pages in a computationally useful way. In-links are the most useful, since they represent the choices of web page designers to link from the pages they are designing to other web pages. It will be appreciated that pages that are heavily linked to are likely to be more relevant, whereas pages with many out-links may or may not be relevant. Because the designers of pages control the content of their own pages, they are free to add more out-links, and so would be able to easily inflate the relevance scores of their pages if out-links were counted. A web crawler may garner this information by crawling each web page and noting the links from that page to other pages; in the case of element 201 of FIG. 2, the crawler when reaching page 102 would have noted no link to page 104 and thus marked a zero in element 201, as shown in FIG. 2.
Crawling FIG. 1 provides information that page 104 is linked (has an in-link) from page 103, but not from page 102. Therefore the value at 201 is zero, but the value at the intersection of the row for 104 and the column for page 103 is 1. By the same process of crawling FIG. 1, the values at all of the other intersections are determined, and have been indicated in FIG. 2.
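The construction of the SWG from crawled links may be sketched as follows, using the convention described above: the row is the page linked to, the column is the linking page, and a value of 1 marks an in-link. The links from page 105 and the links 102 to 103 and 103 to 104 are stated in the description of FIG. 1. The links 101 to 102 and 104 to 101 are assumptions, chosen only to be consistent with the statement that each of pages 101 through 104 has exactly one out-link. The names PAGES, LINKS and build_swg are hypothetical.

```python
PAGES = [101, 102, 103, 104, 105]

# (source page, destination page) pairs, as a crawler might record them.
LINKS = [
    (105, 101), (105, 102), (105, 103), (105, 104),  # stated for FIG. 1
    (102, 103), (103, 104),                          # stated for FIG. 1
    (101, 102), (104, 101),                          # ASSUMED, not stated in the text
]


def build_swg(pages, links):
    """Return a matrix whose entry [row][col] is 1 when the column page links to the row page."""
    position = {page: i for i, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for source, destination in links:
        if source != destination:  # self-intersections carry no information here
            matrix[position[destination]][position[source]] = 1
    return matrix


swg = build_swg(PAGES, LINKS)

# Element 201: row for page 104, column for page 102.
element_201 = swg[PAGES.index(104)][PAGES.index(102)]
```

A dense list of lists is used only because five pages are involved; a matrix on the order of 10^10 rows and columns would in practice need a sparse representation.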
In this particular example, the values are one or zero, which may be convenient for computer simulation and manipulation. Of course other values may be assigned, and in the real world values may be weighted by a number of other considerations, not just whether there is an in-link from the secondary to the primary page. For example, it is common in the art to normalize the values of the Structural Web Graph so that the sum of all of the values in the Structural Web Graph is equal to one, making each value equal to a probability that a random web surfer might make a particular transition from one page to the next (and, continuing this convention, the sum of the values of a column represents the probability that a random web surfer will, after a long session, find herself on the page represented by the column).
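The normalization described above may be sketched as follows. The function normalize_whole divides every value by the total so that all values sum to one, as stated in the text. The function normalize_columns is shown only for comparison: it scales each column to sum to one, a different convention often used in random-surfer formulations and not one stated in the text. The matrix values are reconstructed from the description of FIG. 1 and partly assumed (the in-links of pages 101 and 102 from pages other than 105 are not stated), and all names are hypothetical.

```python
# Rows and columns are ordered 101, 102, 103, 104, 105.
# Entry [row][col] is 1 when the column page links to the row page.
SWG = [
    [0, 0, 0, 1, 1],  # page 101 (in-link from 104 is ASSUMED)
    [1, 0, 0, 0, 1],  # page 102 (in-link from 101 is ASSUMED)
    [0, 1, 0, 0, 1],  # page 103
    [0, 0, 1, 0, 1],  # page 104
    [0, 0, 0, 0, 0],  # page 105
]


def normalize_whole(matrix):
    """Scale the matrix so that the sum of all of its values equals one."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[value / total for value in row] for row in matrix]


def normalize_columns(matrix):
    """Scale each column to sum to one; all-zero columns are left as zeros."""
    size = len(matrix)
    column_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [
        [matrix[r][c] / column_sums[c] if column_sums[c] else 0.0 for c in range(size)]
        for r in range(size)
    ]


whole = normalize_whole(SWG)
by_column = normalize_columns(SWG)
```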
A page ranking algorithm, which may take many forms, might, in a primitive form, just consider the SWG once to rank a page. The value at each intersection may be one or zero, but there is a possibility of a 1 for a primary page at each intersection for another page. For page 104 the sum of values at intersections across the row is two. So page 104 may be given a rank value of two, since two pages (103 and 105) link into page 104. The rank value for page 105 would be the sum for the row for page 105, or zero, since no pages link in to page 105. In FIG. 2 the sum for every row but 105 is two, so the pages other than 105 may have equal rank, or there may be a tie-breaker in the algorithm. In a real-world case there are many, many more intersections to consider, and one page may be seen to be linked to from dozens or hundreds of other pages.
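The primitive row-sum ranking just described may be sketched as follows. The matrix values are reconstructed from the description of FIG. 1 and partly assumed, as noted in the comments. The tie-breaker (lower page identifier first) is purely an assumption for illustration, since the text leaves the tie-breaking rule open; the function names are hypothetical.

```python
PAGES = [101, 102, 103, 104, 105]

# Entry [row][col] is 1 when the column page links to the row page.
SWG = [
    [0, 0, 0, 1, 1],  # page 101 (in-link from 104 is ASSUMED)
    [1, 0, 0, 0, 1],  # page 102 (in-link from 101 is ASSUMED)
    [0, 1, 0, 0, 1],  # page 103
    [0, 0, 1, 0, 1],  # page 104
    [0, 0, 0, 0, 0],  # page 105
]


def in_link_counts(pages, matrix):
    """Return the row sum for each page, i.e. the number of pages linking in to it."""
    return {page: sum(row) for page, row in zip(pages, matrix)}


def rank_by_in_links(pages, matrix):
    """Order pages by descending in-link count; ties go to the lower page identifier (assumed rule)."""
    counts = in_link_counts(pages, matrix)
    return sorted(pages, key=lambda page: (-counts[page], page))


counts = in_link_counts(PAGES, SWG)
ranking = rank_by_in_links(PAGES, SWG)
```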
In a more sophisticated situation, the page ranking algorithm may first consider the row sum for a page, and then look at the in-links for each of the secondary pages at the positive intersections; that is, an answer to the question: How many pages link in to each page that links directly to the page being ranked, which may be extended to how many (and which ones) link to each page that links to the instant page. Now the value for ranking becomes more realistic and granular, but is still limited to the structural links designed into the pages of the Web. This approach is the basis of the well-known PageRank algorithm pioneered by Google™; the heuristic that drove this step was that links represented authorities, and the relative in-link density of a given authority provides a good indication of the importance of that authority. So at least a nominal relevancy was indicated.
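The recursive idea above, in which the score of a page depends on the scores of the pages linking to it, is commonly computed by repeated passes over the link structure. The following is a simplified sketch of such an iteration, not the method of any particular search engine. The damping value of 0.85, the fixed iteration count, the even spreading of rank from pages without out-links, and the two links marked ASSUMED are all assumptions introduced for illustration; the function name is hypothetical.

```python
PAGES = [101, 102, 103, 104, 105]

# (source page, destination page) pairs for the five pages of FIG. 1.
LINKS = [
    (105, 101), (105, 102), (105, 103), (105, 104),
    (102, 103), (103, 104),
    (101, 102), (104, 101),  # ASSUMED, not stated in the text
]


def iterative_rank(pages, links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its out-links, weighted by damping."""
    count = len(pages)
    out_links = {page: [] for page in pages}
    for source, destination in links:
        out_links[source].append(destination)

    rank = {page: 1.0 / count for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with a small baseline share.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for source in pages:
            # A page with no out-links spreads its score evenly over all pages.
            targets = out_links[source] or pages
            share = damping * rank[source] / len(targets)
            for destination in targets:
                new_rank[destination] += share
        rank = new_rank
    return rank


scores = iterative_rank(PAGES, LINKS)
```

In this sketch page 105, which has no in-links among the five pages, receives only the baseline share and therefore the lowest score, which agrees with the row-sum ranking; the other four pages tie because each receives one link from page 105 and one from a neighbor in the assumed ring.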
In summary, a search engine in the present art comprises a few key elements, such as a Web crawler to discover and gather information about Web pages, an index of Web pages composed of information garnered by the crawler, a search function that determines which of the pages in the index to present to a viewer, based at least in part on the search query entered by the browsing person, a Structural Web Graph based also on the information retrieved by the crawler, and a PageRank algorithm that uses the Structural Web Graph and values assigned in the graph to give each page a unique PageRank score, for ordering the displayed return of the pages. U.S. Pat. No. 6,285,999 issued to Lawrence Page describes and claims such a PageRank system. U.S. Pat. No. 6,285,999 is incorporated by reference in the present application.
Bearing in mind many of the difficulties attendant to search technology, many of which are described above, it is clear that provision of correct and expedient search criteria by individuals seeking information from networked collections is a serious difficulty, and returning information ranked for relevancy is also a distinct challenge for conventional search systems, such as those provided by Mozilla™, Google™ and Yahoo™. Having considered all of these difficulties the inventor believes that what is clearly needed is an intermediary system and methods that will provide greatly enhanced search capability for individuals in dealing with more conventional search services.