The present invention is related to techniques for organizing data, and more particularly to techniques for organizing a collection of documents into hierarchical groups of documents to facilitate review of the documents.
The widespread use of computers has led to an explosion in the amount of electronic textual data being created. Such electronic textual data comes in a variety of different forms, including emails, office documents (such as word processing documents, presentations, and spreadsheets), instant messaging logs, call center agent notes, newsgroup messages, Web pages, and on-line newspapers, among others. The existence of such large corpora of textual data has accentuated the need for automated techniques for the rapid analysis, organization, review, and mining of these corpora.
Due to the large number of electronic documents that may be available, the review and analysis of these documents is a very daunting task. This is especially evident in the legal area. For example, traditional ways of conducting legal discovery have been overwhelmed by the sheer volume of discoverable information that is available in electronic form. Law firms and their clients are under increasing pressure to efficiently review documents to identify documents to be produced for large litigation cases as well as to analyze complex document sets that have been produced to identify and extract critical fact patterns pertinent to the litigation.
As an example, in a typical litigation, one party may serve a discovery request on some other party requesting the other party to produce any and all documents related to a particular subject matter. The party receiving the discovery request has the daunting challenge of reviewing a potentially large set of electronic documents (including emails, attachments, typical office documents, CAD drawings, patent filings, instant messaging logs, etc.) and paper documents (such as faxes, notebooks, and printed reports) to identify all relevant documents that need to be produced, while being sensitive to issues such as privilege and confidentiality. Once the discovery request has been fulfilled, the requesting party then has the equally daunting challenge of analyzing the resulting document corpus to help prepare for the litigation itself. The analysis task is further complicated by the fact that the result of discovery may, and often does, include large numbers of irrelevant documents (for example, junk email, non-confidential documents that are not about the subject matter of interest, and so forth).
Traditionally, the process of legal discovery and analysis has been a laborious manual process carried out with printed hard-copy versions of documents. Even with the availability of data in electronic form, electronic documents are manually organized by subject matter. Such a manual process may work for a relatively small collection of documents but is not feasible for a large collection of documents as it takes too long and is too expensive. Further, with manual organization, the consistency of the organization cannot be ensured. Accordingly, as the size of the document collection grows, automated techniques for organizing documents are desired.
In an effort to automatically organize documents, law firms have increasingly used search engines to automatically identify groups of related documents. The idea is to characterize the subject matter of interest with a search query. All documents that match the search query are then assumed to be about the same subject matter of interest. However, there are several problems with this approach. Most users find it difficult to craft an appropriate query, since a query may be too specific or too general. If it is too specific, relevant documents may be missed by the search. On the other hand, if it is too general, many irrelevant documents may be identified by the search. As a result, a user is often left with the uncomfortable feeling that relevant documents are being missed by the search, while being forced to wade through irrelevant documents that happened to match the query. Additionally, it is often not known what query needs to be constructed since the range of subjects in a document collection may not be known a priori. Further, because words in the English language often have multiple meanings, searches may return ambiguous results.
More recently, “fuzzy” searches have been proposed as an alternative to traditional search engines. Fuzzy searches perform searches using terms specified in a search query and also based upon other terms that are determined (using statistical analysis) to co-occur with the search query terms. While “fuzzification” of search queries addresses some of the limitations of traditional searches, it also has significant shortcomings. “Fuzzy” searches may not be appropriate in all situations, such as where exact matching of search terms is desired. “Fuzzy” searches also do not effectively deal with the issue that words often have different meanings or senses, and hence co-occur with different words in different documents. “Fuzzy” searches do not provide a mechanism that allows users to identify these different senses or to select the appropriate sense. Further, as with traditional searches, the subject matter to be searched for may not be known a priori.
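By way of illustration only, the co-occurrence-based query expansion described above may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular fuzzy search product: the sample documents in `DOCS`, the function names, the whitespace tokenizer, and the document-level co-occurrence threshold `min_count` are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample corpus; "bank" appears in two different senses.
DOCS = [
    "bank loan interest",
    "bank loan mortgage",
    "river bank erosion",
    "river flood erosion",
    "mortgage loan rate",
]

def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_cooccurrence(docs):
    """Count, for each pair of distinct terms, the number of documents containing both."""
    pair_counts = Counter()
    for doc in docs:
        terms = sorted(set(tokenize(doc)))
        for a, b in combinations(terms, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expand_query(query, pair_counts, min_count=2):
    """Add every term that co-occurs with a query term in at least min_count documents."""
    query_terms = set(tokenize(query))
    expanded = set(query_terms)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            if a in query_terms:
                expanded.add(b)
            if b in query_terms:
                expanded.add(a)
    return expanded

def fuzzy_search(query, docs, pair_counts, min_count=2):
    """Return indices of documents containing any term of the expanded query."""
    terms = expand_query(query, pair_counts, min_count)
    return [i for i, doc in enumerate(docs) if terms & set(tokenize(doc))]

if __name__ == "__main__":
    counts = build_cooccurrence(DOCS)
    print(sorted(expand_query("bank", counts)))
    print(fuzzy_search("bank", DOCS, counts))
```

In this sketch, a query for "bank" is expanded with "loan", so the fifth document is matched even though it does not contain the query term. The third document, which uses "bank" in the sense of a river bank, is still returned alongside the financial documents, and nothing in the expansion step lets the user pick one sense over the other.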
The problem of reviewing and analyzing a large corpus of documents is not unique to the legal area; it arises in a number of other areas where there is a need to analyze a large amount of data. For example, intelligence analysts need to rapidly review and analyze news reports and agent reports from all around the world to identify threats; customer service managers need to analyze call center agent notes to rapidly identify emerging problems and trends; and companies need to analyze news reports, analyst reports, and newsgroup messages to gauge the strength of, and threats to, their brand or the effectiveness of a marketing campaign.