1. Field of Invention
The present invention relates to a method, device, and computer program product to aid in searching and navigating among files. The method allows for the building up of a web of links connecting the documents, and is suitable at least for cases where such a web is not pre-existing, for example, the files of a single user or of a small group sharing documents.
2. Discussion of the Background Art
The current situation for any user of a personal computer is frustrating. A user is accustomed to finding, essentially instantly, almost anything of interest that exists on the Web, on any topic, plucked out from a set of Web documents that currently exceeds 8 billion in number, with the results ranked so well that usually the few highest-ranked hits give the user what is asked for. Also, it is easy for this user, having found a good hit, to follow hyperlinks from that hit and so discover related documents.
Now the same user has perhaps thousands or millions of files on his/her PC. This user also needs to search and navigate through these files. The reason is, of course, that the number of files makes it impossible to remember what they all are, where they are in the hierarchical file system, and what they contain. Hence the user needs help: a) in finding specific files, and b) in finding files related to a topic or theme. This is of course precisely the kind of help one gets, in the case of the Web, from current Web search engines. The frustrated user then asks, “Why is it so much harder to find things on my own PC?”
The present invention is aimed at meeting just this need. That is, this invention offers methods for searching and navigating among personal files. It is also suited to supporting the same functions for files that are shared by groups.
The current state of the art in technologies to aid search and navigation over personal files is rather limited. As noted above, at present there is a clear gap between the need of users to search through ever-growing amounts of personal content, and the capabilities of present technology to meet this need. Recently, many different firms have recognized this gap, and are working hard to fill it—since a large, unmet need represents a large business opportunity. Hence in discussing the current state of the art, we will include both the limited technological solutions that may be bought and used today, and also those that are announced or hinted at in the public media. The point is that the field is in a state of rapid growth and change.
The idea for desktop search—meaning a search appliance to run locally on a user's own PC—has existed for some time now. One of the first Internet search engines, AltaVista, gave away free software for personal PC search in 1998, called AltaVista Discovery. Here we see an early recognition of a fact which is now understood by many: the sheer number of digital documents that even a single user must relate to has grown so large that the old, hierarchical method of organizing and navigating among files is hopelessly inadequate.
Microsoft has been aware of the problem facing PC users searching for information in computer files for more than a decade. Microsoft's vision of a unified data store in its Windows operating system (Cairo, with OFS—Object File System; ideas date back to at least 1990) has been the source of many public announcements. These announcements have continued up to now, and are revised often. (After several postponements, the current announced launch date for the next version of Windows, code-named Longhorn, is 2006.) The solution offered by Microsoft is to replace the basic plumbing of its Windows operating system with technology borrowed from its SQL Server database software. Currently, documents, Web pages, e-mail files, spreadsheets and other information are stored in separate, mostly incompatible software. The new technology, code-named WinFS, promises to unify storage in a single database built into Windows that's more easily searchable, more reliable, and accessible across corporate networks and the Internet.
In October 2004, Google released a beta version of its Google Desktop Search engine. In contrast to the Microsoft ‘total-overhaul’ approach, Google Desktop Search consists of a relatively small and easily downloaded set of software modules, which scan and index the contents of a user's PC. The index is then used to support fast searches. Documents which are indexed include text files, Word files, PowerPoint files, Excel files, Outlook mail files, and browsed Web documents.
Subsequently (in December 2004), Microsoft released a beta version of its Microsoft Toolbar Suite, which includes both desktop search and Web search. Microsoft had previously purchased the Lookout desktop search technology; Lookout (as evidenced by its name) focused on searching through Outlook files.
Also in December 2004, Ask Jeeves introduced a beta version of a downloadable desktop search engine. This engine likely integrates technology acquired from the firm Tukaroo, which was bought by Ask Jeeves. In the same month, Yahoo announced that it would release a test version in early 2005. Yahoo has purchased a large number of earlier technologies, most notably Overture—which had itself purchased several engines, including AllTheWeb. Yahoo is developing its desktop search engine in cooperation with X1.
There are many other firms offering desktop search products. The brief summary above is certain to be rapidly outdated; hence we do not try here for completeness. An overview of desktop search firms and products may be found at goebelgroup's website.
An important question is, “What technology do these new players use?” Little information is disclosed in the publicly available announcements by these firms, and it is very hard to find any details about the actual search technology that is used. The vast majority of these firms seem clearly to offer keyword-based search, using indexing over various file types; and many offer both desktop and enterprise search. However, we have not found any firm which clearly bases its ranking of search results on link analysis. In fact, it is not clear whether any of the above firms use links at all, either for ranking or for navigation.
A technology that does apparently make some use of links is that of the Autonomy Corporation. Autonomy has recently launched IDOL Enterprise Desktop Search. Autonomy technology includes symmetric “similarity links” between documents. The similarity measure is sophisticated, using probabilistic measures of concept similarity. Also, the concept analysis is used in the searching process, replacing the reliance purely on keywords. However, there is no sign of the use of one-way hyperlinks such as proposed in the present invention, and no evidence of the use of link analysis. In fact, Autonomy explicitly rejects the use of any kind of page ranking technique. That is, as noted in a press release: “Instead of page ranking, an approach which has been proven to be ineffective in the link free enterprise, Automatic Query Guidance uses conceptual clustering . . . .”
Thus, as discovered by the present inventors, in order to be able to build good searching, ranking, and navigation tools for a wide variety of documents, it is preferable to have a proper link structure on the local file system that can be exploited in a link analysis. The kind of link structure that is present on the World Wide Web represents the way people relate to information far better than does the traditional hierarchical file system, with each document forced into a single place in a hierarchical tree. If such a link structure already had been present on today's PCs, a link-analysis based search-and-ranking device for local hard disks would probably already exist.
None of the solutions proposed to date builds the necessary link infrastructure to enable link analysis-based ranking for search and navigation among files of a single user or a small group. The present invention remedies this by proposing a way of generating a local link structure.
As explained in more detail below, hyperlinks can provide two types of information: they can indicate a similarity between two files (symmetric), and/or they can imply a recommendation that a viewer starting at file A may find file B interesting (one-way or asymmetric). Also, links can be used for two purposes: they can help in searching (via ranking), and in navigation.
Current technologies for non-WWW document systems either lack hyperlinks entirely—thus missing both the ranking and the navigation benefits—or they use only similarity (e.g., Autonomy). In the latter case, the option of exploiting human judgment to provide recommendations about files, and about relationships between files, is lacking. Without such recommendations, both search (ranking) and navigation will suffer in quality.
Link analysis has played a crucial role in the enormous success of the Google Web search engine. Before Google, the main approaches to ranking of hits from a search used one or more of: text relevance, “link popularity”, and human judgment (Yahoo). Text relevance is always important, but not sufficient in itself to give good ranking results. Link popularity is characterized by counting the links pointing to a page. Link popularity is the crudest form of link analysis, and is too easily fooled by fake links. Finally, human judgment, though always useful, is too slow and costly for distributed document systems with many documents and a high turnover rate.
Google was the first Web search engine known to the inventors to make use of nontrivial link analysis by way of the well-known PageRank algorithm. An advantage of PageRank—along with other forms of nontrivial link analysis, such as those cited in U.S. patent application Ser. No. 10/687,602 and U.S. patent application Ser. No. 10/918,713—is that PageRank makes use of a collective form of human judgment. That is, most of the huge number of links, connecting billions of Web pages, are laid down by millions of humans (Web page designers). Hence nontrivial link analysis is a clever way to harness the free labor of these millions of humans, extracting their collective judgment, in order to find the best Web pages.
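By way of illustration only, the difference between link popularity and nontrivial link analysis may be sketched as follows. The sketch assumes a textbook power-iteration form of PageRank with a damping factor of 0.85; the function names, the damping value, and the representation of links as (source, target) pairs are assumptions made for this example and are not taken from any of the cited systems.

```python
from collections import defaultdict


def link_popularity(links):
    """Count the links pointing to each page (the 'link popularity' measure)."""
    counts = defaultdict(int)
    for src, dst in links:
        counts[src] += 0  # ensure pages with no inbound links still appear
        counts[dst] += 1
    return dict(counts)


def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over (source, target) pairs."""
    pages = sorted({page for link in links for page in link})
    out = defaultdict(list)
    for src, dst in links:
        out[src].append(dst)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not out[page])
        new = {page: (1.0 - damping) / n + damping * dangling / n
               for page in pages}
        for src in pages:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank
```

With the links A to C, B to C, and C to D, link popularity ranks C above D because C has two inbound links and D only one, whereas the PageRank sketch ranks D above C because D's single inbound link comes from a page that is itself well recommended.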
For the most part, when a Web designer lays down a link from his own page A to another page B, it means that (in the Web designer's opinion) a reader interested in page A is likely also to be interested in page B. That is, such a link may be interpreted as implying some mixture of two things: (i) that page B is similar to page A; and/or (ii) that page B is likely to be interesting to someone interested in page A.
In short: link analysis is valuable because links convey two things: similarity and recommendation.
While these approaches have been applied to networked environments, consumers are faced with the dilemma of how to deal with thousands or millions of files located on their personal computer.
What is desired, as recognized by the present inventors, are tools to develop a Personal Web of links, enabling a user to rank hits from a keyword search, and to navigate through these files. The term “Personal Web” refers to the network of linkages between documents that is built up by the current invention. The Personal Web includes the combination of: (i) undirected, weighted links, based on similarity; (ii) directed, weighted links, which may or may not be anchored to text on the pointing or pointed-to document, and which represent recommendation; and (iii) weights (importance scores) assigned to the documents themselves, again representing recommendation.
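Purely as a non-limiting illustration, the three components of the Personal Web listed above may be held in a data structure along the following lines. The class names, field names, and default values are hypothetical and chosen only for this sketch; the description above does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Recommendation:
    """Directed, weighted link: 'source' recommends 'target'."""
    source: str
    target: str
    weight: float = 1.0
    anchor_text: Optional[str] = None  # None when the link is not anchored to text


@dataclass
class PersonalWeb:
    # (iii) importance score per document
    scores: Dict[str, float] = field(default_factory=dict)
    # (i) undirected similarity links, keyed by the unordered pair of documents
    similarity: Dict[FrozenSet[str], float] = field(default_factory=dict)
    # (ii) directed recommendation links
    recommendations: List[Recommendation] = field(default_factory=list)

    def add_file(self, name, score=0.1):
        """Register a document, keeping any score it already has."""
        self.scores.setdefault(name, score)

    def add_similarity(self, a, b, weight):
        """Record a symmetric similarity link between a and b."""
        self.add_file(a)
        self.add_file(b)
        self.similarity[frozenset((a, b))] = weight

    def similarity_of(self, a, b):
        """Look up similarity; the order of a and b does not matter."""
        return self.similarity.get(frozenset((a, b)), 0.0)

    def recommend(self, source, target, weight=1.0, anchor_text=None):
        """Record a one-way recommendation from source to target."""
        self.add_file(source)
        self.add_file(target)
        self.recommendations.append(
            Recommendation(source, target, weight, anchor_text))

    def recommended_from(self, name):
        """Documents that 'name' points to, for use in navigation."""
        return [r.target for r in self.recommendations if r.source == name]
```

Keying similarity links by an unordered pair makes the symmetry explicit, while recommendation links keep their direction, so that a link from one document to another does not imply a link back.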
Ranking and navigating will always be important functions in the world of large masses of information. The Personal Web supports both of these functions in a unique and effective way—by incorporating the two crucial aspects of similarity and recommendation—as discussed in some detail next.
First we address similarity. The present invention uses machine algorithms to evaluate similarity between documents or files. As noted above, at least one other approach (that of Autonomy) uses similarity analysis between documents to aid the user in finding and navigating between these documents. The measure of similarity used in the present invention is different from that of Autonomy. Another difference is the use of weighted similarity links, which are generated by the previously described similarity analysis, as a component in the total link analysis approach, which in turn supports the ranking of hits from a search. Also, the similarity links play an important role in aiding navigation.
Next we come to recommendation. Recommendation is often best done by humans. However, the case of a single user evaluating his/her own files is rather different from the case of evaluating files on the Web. On the Web, millions of users contribute to recommendations among billions of Web pages. In this situation, each user only makes recommendations for a relatively small number of other documents. In the one-user case, it is often not realistic or practical for a user to go through many thousands of pre-existing files, and attempt to lay down links pointing to other related and/or interesting files. That is, one cannot simply create “a Web on the desktop” by attempting to make a personal Web just like the World Wide Web, because the burden of labor on the single user is too great.
Another difference from the WWW is also relevant. That is, the single user is often in fact the only person who is qualified to evaluate the quality or interest of his/her own files—no one else can do this, and no machine can do this. The user has read—or at least has some knowledge of—all of these files. In contrast, on the WWW, there is no way that any one person can evaluate all pages on the Web.
Summing up these two differences: on the Web, many individuals do the job of reading; and many individuals do the job of recommending/evaluating, via hyperlinks. In the single-user case, one individual can be expected to do (albeit of course imperfectly) the job of reading the files; and yet this one individual is not expected to be willing to do the labor of laying down links from each file to others. This mismatch between the resources of the recommender(s) and the number of documents to be reviewed/recommended has so far prevented any systematic application of hyperlinks to document systems other than the World Wide Web.
To address this mismatch, the present invention includes a hybrid form of recommendation. This hybrid provides to the user the option of laying a hyperlink from any file to any other. However, this hybrid also provides another mechanism for recommendation: each file will be given a “file quality score” or FQS. Each file will have a default value, which is rather low on the scale of possible FQSs. This value may be modified automatically, based on measures such as recency and/or frequency of use of a document. Also, the user can increase (or decrease) this FQS at will, whenever it is convenient, for example after opening/reading the file. The FQS is the least labor-intensive possible method for incorporating recommendations into a system of documents. The present invention adds even greater flexibility by including also the possibility of user-chosen hyperlinks. It is in this sense that one embodiment of the recommendation system is hybrid: it includes both weights on the nodes of the graph (the documents, with their FQSs), and directed links between the nodes (thus recommending the pointed-to document from the pointing document).
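As a hedged illustration of how such a hybrid might operate, the following sketch assigns each file a low default FQS, adjusts it from recency and frequency of use, allows a manual adjustment by the user, and then combines the FQSs with directed, weighted links. The 0-to-1 scale, the default of 0.1, the adjustment formula, and the use of the FQS as the restart distribution of a PageRank-style iteration are all assumptions made for this example; the description above does not specify them.

```python
DEFAULT_FQS = 0.1  # assumed: a low default on an assumed 0..1 scale


def clamp(value):
    """Keep a score within the assumed 0..1 scale."""
    return max(0.0, min(1.0, value))


def auto_fqs(days_since_use, times_opened, base=DEFAULT_FQS):
    """Raise the default score for recent and frequent use (hypothetical formula)."""
    recency = 1.0 / (1.0 + days_since_use)           # 1.0 if used today
    frequency = times_opened / (times_opened + 10.0)  # approaches 1.0 with use
    return clamp(base + 0.3 * recency + 0.3 * frequency)


def user_adjust(fqs, delta):
    """Let the user raise or lower a score at will."""
    return clamp(fqs + delta)


def hybrid_rank(fqs, links, damping=0.85, iterations=50):
    """Combine node weights (FQS) with directed, weighted links.

    fqs: dict mapping file -> FQS; links: (source, target, weight) triples
    whose endpoints are all keys of fqs. The FQS acts as the restart
    distribution of a PageRank-style iteration (an assumption of this sketch).
    """
    files = list(fqs)
    total = sum(fqs.values())
    prior = {f: (fqs[f] / total if total > 0 else 1.0 / len(files))
             for f in files}
    out = {f: [] for f in files}
    for src, dst, weight in links:
        out[src].append((dst, weight))
    rank = dict(prior)
    for _ in range(iterations):
        # Files without outgoing links return their rank to the FQS prior.
        dangling = sum(rank[f] for f in files if not out[f])
        new = {f: (1.0 - damping + damping * dangling) * prior[f]
               for f in files}
        for src in files:
            out_weight = sum(w for _, w in out[src])
            for dst, w in out[src]:
                new[dst] += damping * rank[src] * w / out_weight
        rank = new
    return rank
```

In this sketch a file can rise in the ranking either because the user (or automatic usage tracking) has raised its FQS, or because other files point to it, which reflects the two recommendation mechanisms described above.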