Enterprises such as corporations, institutions, agencies, and other entities have massive amounts of data that they need to manage. While some of an enterprise's critical data is normalized, structured, and stored in relational databases, most enterprise data (generally thought to be around 80%) is unstructured. With conventional computing systems, effective management of, and efficient access to, such unstructured data are problematic.
Indexing is a well-known technique used to increase the efficiency with which data can be searched. An index is a list of terms and pointers associated with a collection of data. An example of such an index 100 is shown in FIG. 1. Index 100 comprises a plurality of index entries 102, with each index entry 102 comprising a term 104 (see the “term” column in the table) and one or more pointers 106 (see the “pointer(s)” column in the table). The terms 104 in an index can be words, phrases, or other information associated with the data. In many situations, these terms are user-specified. Each pointer 106 in an index corresponds to the term 104 for that entry 102 and identifies where that term can be found in the data. With unstructured data, the data collection often comprises a plurality of documents. Examples of documents include items such as word processing files, spreadsheet files, emails, images, Adobe Acrobat files, web pages, books, pages of books, etc.
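As a rough illustration of the term-and-pointer structure described above, the following Python sketch builds a small index over two in-memory documents. It is not the layout of index 100 in FIG. 1: the function name `build_index`, the sample documents, and the choice of representing each pointer as a (document identifier, word offset) pair are all assumptions made for this example.

```python
import re
from collections import defaultdict


def build_index(documents, terms=None):
    """Map each term to a list of (doc_id, word_offset) pointers.

    `documents` is a dict of doc_id -> text (hypothetical input format).
    If `terms` is given, only those user-specified terms are indexed;
    otherwise every word found in the documents becomes a term.
    """
    wanted = {t.lower() for t in terms} if terms is not None else None
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Split the text into lowercase words; the offset is the word position.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for offset, word in enumerate(words):
            if wanted is None or word in wanted:
                index[word].append((doc_id, offset))
    return dict(index)


# Hypothetical sample data, for illustration only.
documents = {
    "doc1": "The quarterly report covers revenue and costs.",
    "doc2": "Revenue grew; the report was emailed on Friday.",
}

index = build_index(documents, terms=["report", "revenue"])
for term, pointers in sorted(index.items()):
    print(term, pointers)
```

Searching for a term then reduces to a single dictionary lookup (for example, `index["report"]`) rather than a scan of every document, which is the efficiency gain that indexing is meant to provide. Building the index still requires reading every word of every document once.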
However, the inventors believe that conventional indexing techniques require a tremendous amount of time to generate an effective index. Because indexing is computationally intensive, even relatively small data sets can take days to index effectively with conventional indexing techniques deployed in software on central processors such as general-purpose processors (GPPs). Because of the sheer volume of data that enterprises encounter on a daily basis, it is simply not practical for an enterprise to index all of the data in its possession (and to which it has access) using these conventional indexing techniques. Instead, enterprises are forced to make a priori decisions as to which data will be subjected to indexing; this is particularly true for unstructured data, which comprises the bulk of most enterprises' data. As a result, enterprises are left without an effective means for efficiently managing and searching much of their data.