The tremendous amounts of information now available even to casual computer users, particularly over large computer networks such as the Internet, have engendered numerous efforts to ease the burden of locating, filtering, and organizing such information. These include classification and prioritization systems for e-mail (see, e.g., Maes, Commun. of ACM 37(7):30-40 (1994); Cohen, "Learning Rules that Classify E-mail," AAAI Spring Symposium on Machine Learning in Information Access, March 1996), systems for filtering news downloaded from the Internet (see, e.g., Lang, "NewsWeeder: Learning to Filter Netnews," Machine Learning: Proc. of 12th Int'l Conf. (1995)), and schemes for organizing user-specific information such as notes, files, diaries, and calendars (see, e.g., Jones, Int'l J. of Man-Machine Studies 25 at 191-228 (1986); Lamming et al., "Forget-me-not: Intimate Computing in Support of Human Memory," Proc. FRIEND21, '94 Int'l Symp. on Next Generation Human Interface (1994)).
Systems designed for information retrieval generally function in response to explicit user-provided queries. They do not, however, assist the user in formulating a query, nor can they assist users unable or unwilling to pose one. The Remembrance Agent ("RA"), described in Rhodes et al., Proc. of 1st Int'l Conf. on Practical Application of Intelligent Agents and Multi-Agent Technology at 487-495 (1996), is a computer program that watches what a user is typing in a word processor (specifically the Emacs UNIX-based text editor) and continuously displays a list of documents that might be relevant to the document currently being written or read. For example, if a journalist is writing a newspaper article about a presidential campaign, the RA might suggest notes from a recent interview, an earlier article about the campaign, and a piece of e-mail from her editor suggesting revisions to a previous draft of the article.
The utility of the RA stems from the fact that currently available desktop computers are fast and powerful, so that most processing time is spent waiting for the user to hit the next keystroke, read the next page, or load the next packet off the network. The RA utilizes otherwise-wasted CPU cycles to perform continuous searches for information of possible interest to the user based on current context, providing a continuous, associative form of recall. Rather than distracting from the user's primary task, the RA serves to augment or enhance it.
The RA works in two stages. First, the user's collection of text documents is indexed into a database saved in a vector format. These form the reservoir of documents from which later suggestions of relevance are drawn; that is, stored documents will later be "suggested" as being relevant to a document currently being edited or read. The stored documents can be any sort of text document (notes, Usenet entries, webpages, e-mail, etc.). This indexing is usually performed automatically every night, and the index files are stored in a database. After the database is created, the second stage of the RA is run from Emacs, periodically taking a sample of text from the working buffer. The RA finds documents "similar" to the current sample according to word similarities; that is, the more times a word in the current sample is duplicated in a candidate database document, the greater will be the assumed relevance of that database document. The RA displays one-line summaries of the best few documents at the bottom of the Emacs window. These summary lines contain a line number, a relevance ranking (from 0.0=not relevant to 1.0=extremely relevant), and header information to identify the document. The list is updated at a rate selectable by the user (generally every few seconds), and the system is configured such that the entirety of a suggested document can be brought up by the user pressing the "Control-C" key combination and the line number to display.
Briefly, the concept behind the indexing scheme used in RA is that any given document may be represented by a multidimensional vector, each dimension or entry of which corresponds to a single word and is equal in magnitude to the number of times that word appears in the document. The number of dimensions is equal to the number of allowed or indexed words. The advantages gained by this representation are relatively speedy disk retrieval, and an easily computed quantity indicating similarity between two documents: the dot product of their (normalized) vectors.
The RA creates vectors in three steps:
1. Removal of common words (called stop words), identified in a list of stop words.
2. Stemming of words (changing "jumped" and "jumps" to "jump," for example). This is preferably accomplished using the Porter stemming algorithm, a standard method in the text-retrieval field.
3. Vectorization of the remaining text into a "document vector" (or "docvec"). Conceptually, a docvec is a multidimensional vector each entry of which indicates the number of times each word appears in the document.
For example, suppose a document contains only the words: "These remembrance agents are good agents."
Step 1: Remove stop words
This converts the text to "Remembrance agents good agents"
Step 2: Stem words
This converts the text to "remembrance agent good agent"
Step 3: Make the document vector
This produces the vector:
000 . . . 121 . . . 000
Each position in the vector corresponds to an allowed word. The zeroes represent all allowed words not actually appearing in the text. The non-zero numerals indicate the number of times the corresponding word appears, e.g., a 1 for the words "good" and "remembr," and a 2 for the word "agent"; thus, the numbers indicate the document "weight" for the word in question.
Step 4: Normalize the vector
Document vectors are normalized (i.e., divided by the magnitude of the vector). The vector magnitude is given by the square root of the sum of the squared weights. (In fact, the normalization step takes place in the context of other computations, as described more fully below.) Normalization facilitates meaningful comparison between the words in a query and the words in a document in terms of their relative importance; for example, a word mentioned a few times in a short document carries greater significance than the same word mentioned a few more times in a very long document.
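The four steps above can be illustrated with a minimal Python sketch. It is not the RA's own code: the stop-word list and the function names are hypothetical, and `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, chosen only so that the example sentence reduces to the same stems ("remembr," "agent," "good") given in the text. The vector is held sparsely, as a mapping from stem to count, rather than with one position per allowed word.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not enumerate its list.
STOP_WORDS = {"these", "are", "the", "a", "an", "of"}


def toy_stem(word):
    """Crude suffix stripper used here in place of the Porter stemmer (an assumption)."""
    for suffix in ("ance", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_docvec(text):
    """Steps 1-3: drop stop words, stem, and count occurrences of each stem."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]   # step 1
    stems = [toy_stem(w) for w in kept]                # step 2
    return Counter(stems)                              # step 3 (sparse docvec)


def normalize(vec):
    """Step 4: divide each weight by the vector magnitude."""
    magnitude = math.sqrt(sum(c * c for c in vec.values()))
    if magnitude == 0:
        return {}
    return {word: c / magnitude for word, c in vec.items()}


docvec = make_docvec("These remembrance agents are good agents.")
unit_vec = normalize(docvec)
```

Here `docvec` holds a count of 1 for "remembr" and "good" and 2 for "agent," and `unit_vec` divides each count by the magnitude, the square root of 6.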
In a more recent implementation of the RA, a fifth step is added to improve the quality of matching beyond that attainable based solely on term frequency. In this fifth step, vectors are weighted by the inverse of the document frequency of the term, based on the assumption that words occurring frequently in a document should carry more weight than words occurring frequently in the entire indexed corpus (which are less distinguishing). More rigorously, the similarity between two word vectors is found by multiplying the document term weight (DTW) for each term by the query term weight (QTW) for that term, and summing these products: ##EQU1##
The document term weight is computed on a document-by-document basis for each indexed word in the document vector. Because it does not change until new documents are added to the corpus, these computations may take place only when the corpus is indexed and re-indexed. The summation in the denominator covers all words in the document vector (i.e., all indexed words) that also appear in the current document for which DTW is computed (since a summation term is zero otherwise); this facilitates normalization. The term frequency tf refers to the number of times a particular term appears in the current document; N is the total number of documents in the corpus; and n is the number of documents in which the term appears. The summation is taken over each indexed word (the first through the ith) in the document. The DTW of a term within a document, then, reflects the number of times it appears within the document reduced in proportion to its frequency of appearance throughout all documents.
The QTW is computed for each word (the first through the ith) in the query vector. In this case, tf refers to the number of times the word appears in the query vector, and max tf refers to the largest term frequency for the query vector. If the document term weight is greater than the query term weight, then the former is lowered to match the query term weight (in order to prevent short documents from being favored).
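The following Python sketch shows one plausible reading of the term weights described above. The exact expressions are not reproduced in this text, so the forms used here are assumptions: the document term weight is taken as tf multiplied by log(N/n) and divided by the square root of the sum of the squared products, and the query term weight is taken as (tf / max tf) multiplied by log(N/n). The function names and the `df` mapping (term to n, the number of documents containing the term) are hypothetical.

```python
import math


def document_term_weights(tf, df, n_docs):
    """Assumed DTW: tf * log(N/n), normalized over the terms in this document.

    tf maps each term to its count in the document; df maps each term to n,
    the number of documents containing it; n_docs is N.
    """
    raw = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}


def query_term_weights(tf, df, n_docs):
    """Assumed QTW: (tf / max tf) * log(N/n); terms absent from the corpus are skipped."""
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {
        t: (c / max_tf) * math.log(n_docs / df[t])
        for t, c in tf.items()
        if t in df
    }


# A term appearing in every document (n == N) receives zero weight.
dtw = document_term_weights({"agent": 2, "good": 1}, {"agent": 1, "good": 10}, 10)
qtw = query_term_weights({"agent": 1, "good": 2}, {"agent": 1, "good": 10}, 10)
```

Under these assumptions, "good" appears in all ten documents and so contributes nothing to either weight, while "agent," which appears in only one, carries the full document weight.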
The RA, running within Emacs, takes a sample of text every few seconds from the current document being edited. This text sample is converted into a vector (called a "query vector") by the four-step process set forth above. After computing the query vector, the RA computes the dot product of the query vector with every indexed document. This dot product represents the "relevance" of the indexed document to the current sample text, relevance being measured in terms of word matches. One-line summaries of the top few most relevant documents are listed in the suggestions list appearing at the bottom of the Emacs window (the exact number displayed is customizable by the user).
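The brute-force comparison just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the RA's implementation: vectors are sparse mappings from stem to weight, and the names `cosine`, `suggest`, and `top_n` are hypothetical.

```python
import math


def cosine(a, b):
    """Dot product of two sparse vectors divided by the product of their magnitudes."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    mag_a = math.sqrt(sum(w * w for w in a.values()))
    mag_b = math.sqrt(sum(w * w for w in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)


def suggest(query_vec, indexed, top_n=3):
    """Score every indexed document and return the top few as (name, relevance)."""
    scored = [(cosine(query_vec, vec), name) for name, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, round(score, 2)) for score, name in scored[:top_n] if score > 0]


indexed = {
    "interview": {"campaign": 2, "candid": 1},
    "recipe": {"egg": 3},
}
suggestions = suggest({"campaign": 1}, indexed)
```

Every indexed document is scored on each pass, so the cost of this approach grows with the size of the corpus.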
Documents to which sampled text is compared need not be entire files. Instead, for example, files can be divided into several "virtual documents" as specified in a template file. Thus, an e-mail archive might be organized into multiple virtual documents each corresponding to a piece of e-mail in the archive. Alternatively, one can index a file into multiple "windows" each corresponding to a portion of the file, such that, for example, each virtual document is only 50 or so lines long, with each window overlapping its neighbors by 25 lines. (More specifically, in this representation, window one includes lines 0-50 of the original document, window two includes lines 25-75, etc.) This format makes it possible to suggest only sections of a long document, and to jump to that particular section when the entirety of the document is brought up for viewing.
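The windowing scheme can be sketched in Python as shown below. The function name and parameters are hypothetical, and the sketch uses half-open ranges (lines 0 through 49, 25 through 74, and so on), which is an assumption about how the boundaries are meant to be read.

```python
def make_windows(lines, size=50, overlap=25):
    """Split a list of lines into overlapping windows.

    Returns (start_line, window_lines) pairs; the start line is what a
    window-offset entry would record so a viewer can jump to that section.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    if not lines:
        return []
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, lines[start:start + size]))
        if start + size >= len(lines):
            break  # this window already reaches the end of the file
        start += step
    return windows


lines = [f"line {i}" for i in range(100)]
windows = make_windows(lines)
starts = [start for start, _ in windows]
```

For a 100-line file this yields windows starting at lines 0, 25, and 50, each 50 lines long.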
Experience with the RA has shown that actually performing a dot product with each indexed document is prohibitively slow for large databases. In preferred implementations, therefore, document vectors are not stored; instead, word vectors are stored. The "wordvec" file contains each word appearing in the entire indexed corpus of documents followed by a list of each document that contains that particular word. The documents are represented by an integer value (generally 4 bytes) encoding both the document number and the number of times that word appears in that particular document. The wordvec file format is as follows:
(int)       (width*uns int)   (int)          (uns int)   (uns int)         (uns int)
NUM_WORDS,  WORDCODE-1,       NUM_DOCS=N1,   DOC-1,      DOC-2, . . . ,    DOC-N1,
            WORDCODE-2,       NUM_DOCS=N2,   DOC-1,      DOC-2, . . . ,    DOC-N2, etc.
The headings indicate the type of data each variable represents (integer, unsigned integer). The first entry in the wordvec file, NUM_WORDS, is the number of words appearing in the entire file. Each word in the wordvec is represented by a unique numerical code, the "width" indicating the number of integers in the code (the RA uses two integers per code). The NUM_DOCS field indicates the number of documents containing the word specified by the associated wordcode. The word-count variables DOC-1, DOC-2, . . . , DOC-N1 each correspond to a document containing the word, and reflect the number of occurrences of the word divided by the total number of words in the document.
A word offset file contains the file offsets for each word in the wordvec file, and is used to overcome the difficulties that would attend attempting to locate a particular wordcode in the wordvec file. Because each wordcode in the wordvec file can be associated with an arbitrary number of documents, locating a particular wordcode would require searching wordcode by wordcode, jumping between wordcodes separated by the arbitrary numbers of intervening word-count variables. To avoid this, a "wordvec offset" file is used to specify the location of each wordcode in the wordvec file.
(width*uns int)   (long)
WORDCODE-1,       OFFSET-1,
WORDCODE-2,       OFFSET-2, etc.
Since each entry has a fixed length, it is possible to perform rapid binary searches on the wordvec offset file to locate any desired wordcode.
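A binary search over fixed-length offset records can be sketched in Python as follows. The byte layout is an assumption made for illustration: each record is taken to hold a wordcode of two unsigned 32-bit integers (width of two) followed by one signed 64-bit offset, little-endian, with records sorted by wordcode. The function names are hypothetical.

```python
import struct

# Assumed record layout: two unsigned 32-bit ints (wordcode) + one 64-bit offset.
RECORD = struct.Struct("<IIq")


def build_offset_table(entries):
    """Pack ((code_hi, code_lo), offset) pairs into bytes, sorted by wordcode."""
    return b"".join(
        RECORD.pack(code[0], code[1], offset) for code, offset in sorted(entries)
    )


def find_offset(table, wordcode):
    """Binary-search the packed table; return the offset, or None if absent."""
    low, high = 0, len(table) // RECORD.size
    while low < high:
        mid = (low + high) // 2
        code_hi, code_lo, offset = RECORD.unpack_from(table, mid * RECORD.size)
        if (code_hi, code_lo) == wordcode:
            return offset
        if (code_hi, code_lo) < wordcode:
            low = mid + 1
        else:
            high = mid
    return None


table = build_offset_table([((0, 7), 120), ((0, 3), 4), ((1, 0), 900)])
```

Because every record occupies the same number of bytes, the position of the middle record can be computed directly, and a lookup needs a number of reads that grows only logarithmically with the number of words.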
Accordingly, for each word in the query vector, the RA first looks up the word in the word offset file, and from that the word's entry is looked up in the wordvec file. An array of document similarities is used to maintain a running tally of documents and their similarities, in terms of numbers of word matches, to the query vector. The array is sorted by similarity, with the most similar documents at the top of the list. Similarity is computed for each word in the query vector by taking the product of the query-vector entry and the weight of each document in the corresponding wordvec file. To normalize this product, it is then divided by the query-vector magnitude (computed in the same manner as the document magnitude) and also by the document magnitude. The final value is added to the current running-total similarity for that document, and the process repeated for the next word in the query. In summary, the query vector is analyzed wordcode by wordcode, with the similarities array indicating the relevance to the query of each document.
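The word-by-word accumulation described above can be sketched in Python. In-memory dictionaries stand in for the wordvec and offset files, and the names `rank`, `wordvec`, and `doc_mag` are hypothetical; this is a sketch of the running-tally idea, not the RA's file-based code.

```python
import math
from collections import defaultdict


def rank(query_vec, wordvec, doc_mag):
    """Accumulate per-document similarity one query word at a time.

    query_vec maps term -> query weight; wordvec maps term -> list of
    (doc_id, document weight); doc_mag maps doc_id -> document magnitude.
    """
    query_mag = math.sqrt(sum(w * w for w in query_vec.values()))
    if query_mag == 0:
        return []
    similarities = defaultdict(float)  # running tally per document
    for term, query_weight in query_vec.items():
        for doc_id, doc_weight in wordvec.get(term, []):
            product = query_weight * doc_weight
            similarities[doc_id] += product / (query_mag * doc_mag[doc_id])
    # Most similar documents first.
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


wordvec = {"agent": [(0, 2.0), (1, 1.0)], "good": [(0, 1.0)], "egg": [(2, 1.0)]}
doc_mag = {0: math.sqrt(5.0), 1: 1.0, 2: 1.0}
ranking = rank({"agent": 1.0, "good": 1.0}, wordvec, doc_mag)
```

Document 2 contains none of the query words, so it is never touched and does not appear in `ranking`.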
When computing the similarity of a query to an indexed document, it is preferred to employ a "chopping" approach that prevents an indexed word in a document from having a higher weight than the word has in the query vector. If the weight of the word in the indexed document is higher than its weight in the query vector, the document weight gets "chopped" back to the query's value. This approach avoids situations where, for example, a query containing the word "spam" as just a single unimportant word gets overwhelmingly matched to one-word documents (which have the highest possible weight) or documents like "spam spam spam spam eggs bacon spam. . ." Storing word vectors rather than document vectors is slower on indexing, and the index files take more space, but retrieval is much faster because only documents containing words in the query are even examined.
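The chopping rule can be sketched in a few lines of Python. The function name and the example weights are hypothetical and chosen only to show the effect; normalization by the vector magnitudes is omitted for brevity.

```python
def chopped_similarity(query_weights, doc_weights):
    """Sum of query weight times document weight, with each document
    weight capped ("chopped") at the query's weight for that word."""
    total = 0.0
    for term, query_weight in query_weights.items():
        doc_weight = min(doc_weights.get(term, 0.0), query_weight)
        total += query_weight * doc_weight
    return total


query = {"spam": 0.1, "meeting": 0.7, "budget": 0.7}
one_word_doc = {"spam": 1.0}
on_topic_doc = {"meeting": 0.6, "budget": 0.6, "note": 0.5}

spam_score = chopped_similarity(query, one_word_doc)
topic_score = chopped_similarity(query, on_topic_doc)
```

Without chopping, the one-word document would contribute 0.1 x 1.0 = 0.1 for "spam"; with chopping its weight is capped at 0.1, so the contribution falls to 0.01, while the on-topic document's score is unchanged because none of its weights exceed the query's.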
The other files created on indexing are a location file (doc_locs) containing a mapping between document number and filename for that document, a titles file containing the information for the one-line summary (titles), offset files for doc_locs and titles (dl_offs and t_offs) to do quick lookups, and a window-offset file specifying where to jump in a file for a particular portion of a windowed document.
While the RA offers substantial capabilities for automated, "observational" retrieval, the cues it utilizes to identify possibly relevant documents are limited to word similarities. This is adequate for many computational tasks and clearly suits the traditional desktop environment of everyday computing: if the user is engaged in a word-related computational task, word-based cues represent a natural basis for relevance determinations. In other words, the current information reliably indicates the relevance of similar information. More broadly, however, human memory does not operate in a vacuum of query-response pairs. Instead, the context as well as the content of a remembered episode or task frequently embodies information bearing on its relevance to later experience; the context may include, for example, the physical location of an event, who was there, what was happening at the same time, and what happened immediately before and after.
As computer components grow smaller and less expensive, so-called "wearable" computers that accompany the user at all times become more feasible. Users will perform an ever-increasing range of computational tasks away from the desktop and in the changing environmental context of everyday life. Consequently, that changing context will become more relevant for recall purposes. Even now, inexpensive laptop computers allow users to monitor their physical locations via global-positioning systems ("GPSs") or infrared ("IR") beacons, and to access various kinds of environmental sensors or electronic identification badges. Since information is created in a particular context, the attributes of that context may prove as significant as the information itself in determining relevance to a future context.
Contextual "meta-information" is not limited to physical surroundings. Even in traditional desktop environments, where for practical purposes the physical context remains constant, meta-information such as the date, the time of day, the day of the week, or the general subject can provide cues bearing on the relevance of information (regardless of, or more typically in addition to, the content of the information itself). Word-based searching and retrieval systems such as the RA are incapable of capturing these meta-informational cues.