Continued growth in the sheer volume of personal digital content, together with a shift to multi-device personal computing environments, is inevitably leading to the development of Personal Content Databases (herein referred to as “PCDBs”), which make it easier for users to find, use and replicate large, heterogeneous repositories of personal content. An email repository is an example of a PCDB in today's terms in that users receive large volumes of messages and content in heterogeneous forms. For instance, the text included in the body of an email message may be formatted according to a variety of formats and styles. An email may include pictures, audio or video, user interface (UI) controls, and hyperlinks to other content. Moreover, just about any kind of content can be attached to an email message as a separate, but associated, object. To name a few, attachments may be images (such as .jpeg files, .gif files, etc.), video (.mpeg files, RealPlayer format, QuickTime format, Macromedia Flash objects, etc.), audio (.mp3 files, .wma files, etc.), contact cards (e.g., v-cards), calendar objects (Sch+ objects), word processing documents (Word, WordPerfect, .pdf files), graphics files (Paint files, Visio files, etc.) and computer code (object files and source code). In essence, any object that can be created in a computing system can be shared via email, and thus, a user can appreciate that an email repository may serve as an example of the generalized notion of a PCDB. Defining the objectives and requirements for interacting with a PCDB is thus a useful step towards providing a system that, at a minimum, meets those objectives and requirements.
In this regard, end users are facing at least two main trends that are driving the development of these new types of “very large database(s)”: the proliferation of data and the proliferation of devices. With respect to the proliferation of data, as mentioned above, end users are facing an explosion of email, office documents, IM transcripts, photos, video content, music, and so on, and thus people need to manage an ever increasing number of digital items. While the number of bytes representing the content can be a separate issue, the problem identified here is that the number of items is exploding, creating overwhelming manageability and organizational overhead. Traditionally, hierarchically organized sets of folders have been the primary means of managing these items; however, folders do not scale well, and for increasing numbers of users, this problem is reaching crisis proportions. To name just a few problems with folder structures, as the folder tree structure(s) become massive, there are too many branches to consider, and far too many leaves to uncover. In essence, folders merely save the problem for a different day because folders, by themselves, add to overhead and, over time, the folders may no longer have the same contextual relevance originally contemplated by the user. A folder only helps if the user remembers the folder, what is generally inside it, and where to find it. Such folder memory is lost when the number of folders exceeds the average memory capabilities of the human mind.
Compounding the problem is the proliferation of devices. Given multiple desktops (home, office, etc.), PDAs, smart phones, the Internet, and even in-dash car computers, the increasing volume of personal content described above is necessarily being distributed over multiple devices. Currently, movement of personal data among these devices is painful, if possible at all, and users face a hodge podge of software and services for storing the volumes of data that result. Email, for example, is sometimes stored in specialized, local files (e.g., in personal information store, or .pst, files), sometimes on servers, and sometimes replicated on both. Some office documents are stored in the local file system, but a surprisingly large number of them are stored as attachments in one's email repository. Photos are often stored in the file system, possibly indexed by specialized software running beside the file system, and also possibly replicated to a Web server. Contact information, like email, might be stored in a specialized, local file and also synchronized out to a PDA and a phone. These various storage schemes do not interoperate, are all folder based, and are difficult to manage. Over time, interacting with content across device location(s) needs to become seamless if users are going to be able to fully utilize their digital content. Accordingly, new ways for searching for and retrieving desired content from PCDBs efficiently and in a scalable manner are desired.
To the extent that this hodge podge of storage systems will be replaced by a single PCDB, all of the user's personal data can be encompassed: email, documents, photos, even Web pages visited by the user, from wherever generated or found or from whichever device the data is retrieved. A hope is that associative retrieval, rather than folders, will be used as the primary means of organizing. Another hope is that the PCDB will transparently move content among a user's multiple devices, and the PCDBs of multiple users will share content with each other based on policies set by the user. While PCDBs will initially be small by VLDB standards—say, tens to small hundreds of gigabytes—current trends suggest that they will grow to terabytes, and thus another hope is that the computing systems and methods built around PCDBs will scale appropriately.
As an illustration of PCDB principles, email is the largest, fastest-growing, and most dynamic collection of documents managed by most users, and as described above, an email store is a microcosmic representation of a PCDB. Also, partly because exchanging content among devices is comparatively difficult, email is becoming the primary gateway for bringing content into a personal environment, especially in a business setting. As an initial step in the building of robust, secure, and efficient PCDBs, therefore, it would be desirable to address current problems associated with the proliferation and retrieval of email. Searching for and retrieving relevant content from a large-scale email database is difficult and time consuming, and, as any high-volume user of email recognizes, the problem worsens as more email is received and stored. Accordingly, it would be desirable to provide a query execution model that addresses the need to search and retrieve the ever proliferating quantity of content that users receive via email.
In this regard, thanks to the success of Web search, users today can quickly understand applications that incorporate search as a user interface (UI) metaphor. If a service, such as a Web page, represents underlying content, for instance, the user can quickly appreciate that entering search terms in a UI control displayed on the Web site will retrieve content that is possibly relevant to those terms. However, with respect to email, the goals of Web search and personal search tend to be quite different, and thus current UI controls and underlying algorithms for Web search are not suited to the problem of personal search. Scalable personal search is thus a difficult problem, and for different reasons than those related to the Web.
To explain briefly: when considering only the search corpus, personal search seems much easier, since the Web is vast, distributed and global whereas the desktop is local and finite. From a pure scale perspective, the Web is the harder problem. Personal search, however, presents significant challenges that do not arise in Web search, including challenges with respect to the activity associated with or goal(s) of the search, the computing environment, the interface, and search dynamics.
First, it is easier to discover information than to recover an exact match based on incomplete information. The simple query “Aaron Burr,” for instance, will yield thousands of documents about him on the Web. For the most part, information on the Internet wants to be found; it is intentionally, proactively, even aggressively, optimized for search engine results given knowledge of the underlying search algorithms. But recovery of personal information requires higher precision. There is typically only one right answer, one message or document (or version of the document!) for which the user is looking, and typically, what little metadata exists and is captured at the time an email message enters the store is not optimized for search and retrieval. Making matters worse, people typically adopt a steep discount function on time. This means users will not invest the time to organize up front (e.g., adding good associative metadata to the content), nor should they, given the tsunami of digital information they face. Instead, they invest it on the back end, with the expectation of a quick recovery process. Further, users know they once had the information, and so the process of looking for things can quickly feel redundant, frustrating and interminably time consuming.
When considering the computing environment, Web search engines are built from thousands to tens of thousands of dedicated machines. These machines are assigned specific tasks: some crawl, some index, some respond to queries. All the resources of a machine are dedicated to its one assigned task. With personal machines, on the other hand, resources such as computing cycles, RAM, and I/O transactions are expected to be dedicated primarily to the user's foreground activity. When this expectation is violated, users quickly become impatient. Thus, with PCDBs, resources for indexing and disk structure maintenance must be borrowed from this primary use. In addition, Web search engines typically house their machines in dedicated host facilities with backup servers, restoration services, and redundant power supplies, whereas with personal devices, operating systems, memory configurations and hardware configurations tend to be finely tuned for a specific set of applications, and tuned differently from one device to the next. The desktop is another world entirely: it is downright hostile. File scanners of various types can lock files for long periods of time, preventing even reads from occurring. Virus detectors and “garbage collectors” feel free to delete or otherwise “quarantine” files that are deemed dangerous or redundant. And of course, there are end users who are free to remove files and even entire directories they (mistakenly) deem to be unnecessary.
Additionally, the typical interface to Web search engines supports a single task: executing queries. PCDB interfaces, on the other hand, are embedded in applications that support multiple tasks. In email, for example, finding messages is one of many tasks; users also want to view messages (and, at times, avoid reading messages), create them, and even relate them to their on-going projects. Search can support many of these tasks, but only if the UI is redesigned around the search paradigm (rather than being relegated to a mere “fast find” dialog box).
With respect to dynamics, for the purposes of an individual query, content on the Web is static. Naturally, it changes over time, but the lifetime of a Web query is far shorter than the update cycle of the index. Personal content, on the other hand, is dynamic in two directions. First, new information is constantly being added. Emails come in and go out at a dizzying pace. New documents are created, sent, and received as attachments, and all sorts of content can be downloaded from the Web. Second, the information itself is dynamic over time. Emails change state as they are read, annotated, altered, sent, and filed. In addition, capturing different versions of documents is essential to the flow of business. Business contracts, negotiations and agreements all have multiple versions, and retrieving the correct version can have broad and deep financial implications. In a PCDB, the lifetime of a query far exceeds the interval between such changes. As a simple example in the context of email, when looking at the Inbox (an example of a view on a PCDB) in a search-based email client, one is looking at the output of a query: as new messages enter the system, this output needs to be updated accordingly. When keeping track of many views over the PCDB simultaneously, one can see that the problem compounds and becomes daunting.
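The idea of a view as the output of a long-lived query can be illustrated with a minimal sketch. Everything named here (`Message`, `LiveView`, `Store`, `upsert`) is hypothetical and chosen only for illustration; the text above does not prescribe any particular mechanism. The sketch assumes that each change to the store is pushed to every registered view, which re-evaluates its predicate for the changed message only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Message:
    msg_id: int
    folder: str
    subject: str
    read: bool = False


@dataclass
class LiveView:
    """A standing query: a predicate plus the ids that currently match it."""
    predicate: Callable[[Message], bool]
    results: Set[int] = field(default_factory=set)

    def on_change(self, msg: Message) -> None:
        # Re-evaluate only the changed message, not the whole store.
        if self.predicate(msg):
            self.results.add(msg.msg_id)
        else:
            self.results.discard(msg.msg_id)


class Store:
    """Hypothetical in-memory store that keeps registered views current."""

    def __init__(self) -> None:
        self.messages: Dict[int, Message] = {}
        self.views: List[LiveView] = []

    def register(self, view: LiveView) -> LiveView:
        # Evaluate once over existing content, then keep the view updated.
        for msg in self.messages.values():
            view.on_change(msg)
        self.views.append(view)
        return view

    def upsert(self, msg: Message) -> None:
        # Covers both new arrivals and state changes (e.g., marked as read).
        self.messages[msg.msg_id] = msg
        for view in self.views:
            view.on_change(msg)


store = Store()
inbox = store.register(LiveView(lambda m: m.folder == "inbox"))
unread = store.register(LiveView(lambda m: m.folder == "inbox" and not m.read))

store.upsert(Message(1, "inbox", "Contract v1"))
store.upsert(Message(2, "inbox", "Contract v2"))
store.upsert(Message(1, "inbox", "Contract v1", read=True))  # state change
```

In this sketch the cost of each change grows with the number of registered views, which is one simple way to see why tracking many views simultaneously compounds the problem.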
In sum, the notion of a PCDB and associated software will evolve as a way to interact with content on many “personal” computing devices, including desktop and laptop computers as well as handheld devices. Relative to server computers, personal devices have less RAM, fewer disks, and otherwise fewer resources. More importantly, personal devices are a shared (vs. dedicated) environment: the PCDB and associated application logic will run alongside word processors, Web browsers, media players and other applications. When these other applications are in the foreground (i.e., when they are being actively used), the user expects them to operate unencumbered by the PCDB's background activities. Thus, a PCDB must find idle cycles to perform its background activities, it must be able to defer its background activities until there are idle cycles, and it must be able to suspend or abort background activities if they are started in an idle period but are not finished when the machine becomes busy again.
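The three requirements stated above (run in idle cycles, defer until idle, suspend or abort when the machine becomes busy) can be sketched with cooperative tasks. This is only an illustration under stated assumptions: `IdleScheduler`, `index_documents`, and the `is_idle` callback are hypothetical names, the idle signal is simulated rather than read from the operating system, and background work is assumed to be divisible into small steps.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterator, List


def index_documents(docs: List[str], index: Dict[str, List[int]]) -> Iterator[None]:
    """Background task written as a generator: one document per step."""
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index.setdefault(term, []).append(doc_id)
        yield  # suspension point: the scheduler may stop here


class IdleScheduler:
    """Runs queued background tasks only while is_idle() reports True."""

    def __init__(self, is_idle: Callable[[], bool]) -> None:
        self.is_idle = is_idle
        self.pending: Deque[Iterator[None]] = deque()

    def submit(self, task: Iterator[None]) -> None:
        # Work is deferred: nothing runs until run_while_idle() is called.
        self.pending.append(task)

    def run_while_idle(self) -> int:
        """Advance tasks step by step; return the number of steps completed."""
        steps = 0
        while self.pending and self.is_idle():
            try:
                next(self.pending[0])
                steps += 1
            except StopIteration:
                self.pending.popleft()  # task finished
        return steps  # unfinished tasks stay queued (suspended)

    def abort_all(self) -> None:
        # Abandon unfinished work instead of resuming it later.
        while self.pending:
            self.pending.popleft().close()


# Simulated idle signal: a counter of idle checks that still succeed.
budget = {"idle_checks": 2}

def fake_idle() -> bool:
    budget["idle_checks"] -= 1
    return budget["idle_checks"] >= 0


index: Dict[str, List[int]] = {}
scheduler = IdleScheduler(fake_idle)
scheduler.submit(index_documents(
    ["budget report", "budget meeting", "meeting notes"], index))

steps_first = scheduler.run_while_idle()   # machine becomes busy partway
terms_after_first = sorted(index)
budget["idle_checks"] = 10                 # machine is idle again
steps_second = scheduler.run_while_idle()  # resumes where it left off
```

Writing the task as a generator keeps the suspension points explicit; a real system would additionally need a genuine idle signal and a policy for choosing between resuming and aborting.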
It would thus be desirable to provide a query processing and document indexing model that addresses the above-described characteristics of a personal search of a PCDB, such as an email store. It would be further desirable to retrieve content from a PCDB based on a query in a fast, scalable, robust and efficient manner. It would be further desirable to implement posting list and term expansion systems and methods that are suitable for implementation in connection with the above-described characteristics of personal devices.