Continued growth in the sheer volume of personal digital content, together with a shift to multi-device personal computing environments, is inevitably leading to the development of Personal Content Databases (herein referred to as “PCDBs”), which will make it easier for users to find, use, and replicate large, heterogeneous repositories of personal content. An email repository is an example of a PCDB in today's terms in that users receive messages and content in heterogeneous forms. For instance, the text included in the body of an email message may be formatted according to a variety of formats and styles, may include pictures, audio or video user interface (UI) controls, hyperlinks to other content, and importantly, just about any kind of content can be attached to an email message as a separate, but associated, object. To name a few, attachments may be images (such as .jpeg files, .gif files, etc.), video (MPEG files, RealPlayer format, QuickTime format, Macromedia Flash objects, etc.), audio (.mp3 files, .wma files, etc.), contact cards (e.g., v-cards), calendar objects (Sch+ files), word processing documents (Word, WordPerfect, .pdf files), graphics files (Paint files, Visio files, etc.) and computer code (object files and source code). In essence, any object that can be created in a computing system can be shared via email, and thus, one can appreciate that an email repository may serve as an example of the generalized notion of a PCDB.
In this regard, end users are facing at least two trends that are driving the development of these new types of “very large database”: the proliferation of data and the proliferation of devices. With respect to the proliferation of data, as mentioned above, end users are facing an explosion of email, office documents, IM transcripts, photos, video content, music, and so on, and thus people need to manage an ever increasing number of digital items. In many respects, while the number of bytes representing the content can be a separate issue, the problem identified here is that the number of items is exploding, creating overwhelming manageability overhead. Traditionally, hierarchically organized sets of folders have been the primary means of managing these items; however, folders do not scale well, and for increasing numbers of users, this problem is reaching crisis proportions. As the folder tree structure becomes massive, there are too many branches to consider, and far too many leaves to uncover. In essence, folders merely save the problem for a different day in that folders by themselves add to overhead and, over time, the folders may no longer have particular relevance to the user in the manner in which they were originally organized. A folder only helps if the user remembers the folder, what is generally inside it, and where to find it.
Compounding the problem is the proliferation of devices. Given multiple desktops (home, office, etc.), PDAs, smart phones, the Internet, and even in-dash car computers, the increasing volume of personal content described above is necessarily being distributed over multiple devices. Currently, movement of personal data among these devices is painful, if possible at all, and users face a hodgepodge of software and services for storing the volumes of data that result. Email, for example, is sometimes stored in specialized, local files (e.g., in personal information store, or .pst, files), sometimes on servers, and sometimes replicated on both. Some office documents are stored in the local file system, but a surprisingly large number of them are stored as attachments in one's email repository. Photos are often stored in the file system, possibly indexed by specialized software running beside the file system, and also possibly replicated to a Web server. Contact information, like email, might be stored in a specialized, local file and also synchronized out to a PDA and a phone. These various storage schemes do not interoperate, are all folder based, and are difficult to manage. Over time, movement of personal data among devices needs to become seamless if users are going to be able to fully utilize their digital content, and accordingly, new ways for searching for and retrieving desired content from PCDBs efficiently and effectively are desired.
To the extent that this hodgepodge of storage systems will be replaced by a single PCDB, all of the user's personal data can be encompassed: email, documents, photos, even Web pages visited by the user, from wherever generated or found or from whichever device it is retrieved. Associative retrieval, rather than folders, will be used as the primary means of organizing. The PCDB will transparently move content among a user's multiple devices, and the PCDBs of multiple users will share content with each other based on policies set by the user. PCDBs will initially be small by VLDB standards (say, tens to small hundreds of gigabytes), but current trends suggest that they will grow to terabytes.
As an illustration of PCDB principles, email is the largest, fastest-growing, and most dynamic collection of documents managed by most users, and as described above, an email store is a microcosmic representation of a PCDB. Also, email is becoming the primary gateway for bringing content into a personal environment, especially in a business setting. As an initial step in the building of robust, secure, and efficient PCDBs, therefore, it would be desirable to address current problems associated with the proliferation and retrieval of email. Searching and retrieving relevant content from a large scale email database becomes quite difficult and time consuming, and over time, as any high volume user of email recognizes, as more email is received and stored, the problem worsens. Accordingly, it would be desirable to provide a query execution model that addresses the need to search and retrieve the ever proliferating quantity of content that users receive via email.
In this regard, thanks to the success of Web search, users today can quickly understand applications that incorporate search as a user interface (UI) metaphor. If a service, such as a Web page, represents underlying content, for instance, the user can quickly appreciate that entering search terms in a UI control displayed on the Web site will retrieve content that is possibly relevant to those terms. However, with respect to email and the UI metaphor, the goals of Web and personal search tend to be quite different, and thus current UI controls and underlying algorithms for Web search are not suited to the problem of personal search. In this regard, scalable personal search is a difficult problem, and for different reasons than Web search.
To explain briefly, when considering only the search corpus, personal search seems much easier, since the Web is vast, distributed, and global, whereas the desktop is local and finite. From a pure scale perspective, the Web is the harder problem. Personal search, however, presents significant challenges that do not manifest with respect to Web search, including challenges with respect to: the activity associated with, or goal(s) of, the search; the computing environment; the interface; and search dynamics.
First, it is easier to discover information than to recover an exact match based on incomplete information. The simple query “Aaron Burr,” for instance, will yield thousands of documents about him on the Web. For the most part, information on the Internet wants to be found; it is intentionally, proactively, even aggressively optimized for search engine results given knowledge of the underlying search algorithms. But recovery of personal information requires higher precision. There is typically only one right answer, one message or document (or version of the document!) for which the user is looking, and typically, what little metadata exists and is captured at the time an email message enters the store is not optimized for search and retrieval. Making matters worse, people typically adopt a steep discount function on time. This means users will not invest the time to organize up front (e.g., adding good associative metadata to the content), nor should they, with the tsunami of digital information they face. Instead, they invest it on the back end, with the expectation of a quick recovery process. Further, users know they once had the information, and so the process of looking for things can quickly feel redundant, frustrating and interminably time consuming.
When considering the computing environment, Web search engines are built from thousands to tens of thousands of dedicated machines. These machines are assigned specific tasks—some crawl, some index, some respond to queries. All the resources of a machine are dedicated to its one task. With personal machines, on the other hand, resources such as computing cycles, RAM, and I/O transactions are expected to be dedicated primarily to the user's foreground activity. When this expectation is violated, users quickly get impatient. Thus, resources for indexing and disk structure maintenance must be borrowed from this primary use. In addition, Web search engines typically house their machines in dedicated host facilities with backup servers, restoration services, and redundant power supplies. Operating systems, memory configurations and hardware configurations are all finely tuned to be application-specific. The desktop is another world entirely—it's downright hostile. File scanners of various types can lock files for long periods of time, preventing even reads from occurring. Virus detectors and “garbage collectors” feel free to delete files they deem dangerous or redundant. And of course, there are users, who feel free to remove files and even entire directories they (mistakenly) deem to be unnecessary.
Additionally, the typical interface to Web search engines supports a single task: executing queries. PCDB interfaces, on the other hand, are embedded in applications that support multiple tasks. In email, for example, finding messages is one of many tasks; users also want to view messages (and, at times, avoid reading messages), create them, and even relate them to their on-going projects. Search can support many of these tasks, but only if the UI is redesigned around the search paradigm (rather than being relegated to a mere “fast find” dialog box).
With respect to dynamics, for the purposes of an individual query, content on the Web is static. Naturally, it changes over time, but the lifetime of a Web query is far shorter than the update cycle of the index. Personal content, on the other hand, is dynamic, in two directions. First, new information is constantly being added. Emails come in and go out at a dizzying pace. New documents are created, sent, and received as attachments, and moreover, all sorts of content can be downloaded from the Web. Second, the information itself is dynamic over time. Emails change state as they are read, annotated, altered, sent, and filed. Plus, capturing different versions of documents is essential to the flow of business. Business contracts, negotiations and agreements all have multiple versions, and retrieving the correct version can have broad and deep financial implications. In a PCDB, the lifetime of a query far exceeds the interval between these changes. As a simple example in the context of email, when looking at the Inbox (an example of a view on a PCDB) in a search-based email client, one is looking at the output of a query: as new messages enter the system, this output needs to be updated accordingly. When keeping track of many views over the PCDB simultaneously, one can see that the problem compounds and becomes daunting.
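The notion of a view as the output of a long-lived query can be illustrated with a minimal sketch. The example below is an assumption made purely for illustration and is not a mechanism described in this document: the names `Message`, `View`, and `MessageStore` are hypothetical, and each view is modeled simply as a predicate plus the set of matching message identifiers, re-evaluated only for the one message that was added or changed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class Message:
    """Hypothetical, simplified email record."""
    msg_id: int
    sender: str
    subject: str
    read: bool = False


@dataclass
class View:
    """A standing query: a predicate plus the ids currently matching it."""
    predicate: Callable[[Message], bool]
    members: Set[int] = field(default_factory=set)

    @property
    def count(self) -> int:
        return len(self.members)


class MessageStore:
    """Toy store that keeps every registered view current on each change."""

    def __init__(self) -> None:
        self.messages: Dict[int, Message] = {}
        self.views: Dict[str, View] = {}

    def register_view(self, name: str, predicate: Callable[[Message], bool]) -> View:
        # Evaluate the query once over existing content, then maintain it.
        view = View(predicate)
        view.members = {m.msg_id for m in self.messages.values() if predicate(m)}
        self.views[name] = view
        return view

    def upsert(self, msg: Message) -> None:
        # Re-test only the added or changed message against each view.
        self.messages[msg.msg_id] = msg
        for view in self.views.values():
            if view.predicate(msg):
                view.members.add(msg.msg_id)
            else:
                view.members.discard(msg.msg_id)

    def mark_read(self, msg_id: int) -> None:
        # A state change may move a message into or out of a view.
        msg = self.messages[msg_id]
        msg.read = True
        self.upsert(msg)


store = MessageStore()
inbox = store.register_view("inbox", lambda m: True)
unread = store.register_view("unread", lambda m: not m.read)

store.upsert(Message(1, "alice@example.com", "Contract v2"))
store.upsert(Message(2, "bob@example.com", "Lunch"))
store.mark_read(1)

print(inbox.count, unread.count)
```

In this sketch every change is tested against every registered view, so the cost of an update grows with the number of simultaneous views, which is one way to see why tracking many views at once becomes difficult at scale.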
It would thus be desirable to provide a query execution model that addresses the above-described characteristics of a personal search of a PCDB, such as an email store. It would be further desirable to provide a mechanism for returning query results from a PCDB to a user interface of a device, either as a count or a view of the results. It would be further desirable to provide a mechanism that updates the query results (as displayed in the UI as a count or a view) efficiently and automatically as the underlying content reflected by the search changes, with the ability to scale to many simultaneous queries. It would be still further desirable to provide a simple and efficient mechanism for providing fast, updated message counts for saved searches.