1. Technical Field
The present invention relates to a method for massively parallel multi-core text indexing.
2. Description of the Related Art
Text-based search remains an important technique for retrieving data, including images, movies, sound recordings, and so forth. Current distributed information retrieval (IR) systems are expected to maximize search throughput while keeping response times acceptably low. Current distributed IR systems typically generate a document-partitioned index where each index slice is generated by a single node (which could be a single-core or multi-core processor). A distributed search is performed, and the search results are then merged to generate the final top-X (e.g., 50 or 100) documents for a query. Since disk-based accesses are involved in indexing, the indexing speed is limited by memory size and disk access times. Optimization is primarily focused on disk-based storage and distributed access of index and text. However, recent trends, including the need for real-time indexing and searching of massive amounts of data, along with the advent of massively parallel (multi-core) architectures and storage class memories, motivate exploration of performance optimizations for in-memory text indexing and search.
Given the size and continuing growth of the Web, the time needed to build an index is a significant factor in providing effective search output. Also, the fact that the content available on the Web changes extremely rapidly necessitates either efficient incremental index updates or a rebuilding approach. Typically, for efficiency and simplicity, the index rebuilding approach is taken, which makes it necessary to have a low indexing time over huge volumes of data.
Rapid advances in computer architecture and system software in recent years have produced massively parallel systems. In the near future, one can expect to see massively parallel multi-core systems (e.g., 1K-2K cores) with storage class memories. On such systems, the complete index and text for about a couple of hours of data can be stored in memory. Thus, the index data structures need to be re-designed to attain high indexing rates. The indexing and search algorithms also need to be re-designed to execute efficiently on these systems.
Indexing a large number of real-time streams with a high data rate on the order of 1-2 Gigabytes/second (GB/s) is a challenging problem. Such streams are encountered in backbone network routers, sensor networks, and other domains. Such indexing would involve sustained aggregate indexing rates of around 50-100 GB/s or more.
Current multi-core architectures cannot sustain these high aggregate indexing rates. Current text indexing and search implementations are optimized for disk-based index and text storage. Hence, such implementations will not perform well with high-capacity storage class memories that can store both index and text.
Current conventional text indexing approaches are all somewhat similar. Hence, an overview of conventional text indexing approaches is provided as follows. In a conventional text indexing approach such as, for example, Lucene, an index covers a set of documents. A document is a sequence of fields. A field is a sequence of terms. A term is a text string. A Lucene index includes one or more segments. Each segment covers a set of documents. Each segment is a fully independent index.
Hence, the Lucene index is organized in segments. Each segment includes information about terms and their frequencies, the documents in which they occur, and the positions at which they occur in the documents. The positions and frequencies are stored sequentially and accessed by adding an offset to a base position. The terms are also kept sorted for fast binary search during query evaluation. The whole organization is designed to minimize the number of disk accesses, as the disk seek time is orders of magnitude larger than compute or memory access time. The index also stores skip-lists for documents to enable fast access of documents (in O(log n) time, where n denotes the number of documents).
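The segment organization described above can be illustrated with a minimal in-memory sketch. This is not Lucene's actual on-disk format or API; the names build_segment and lookup, the whitespace tokenization, and the tuple layout are assumptions made purely for illustration. The sketch keeps the terms sorted and locates a term by binary search, returning per-document frequencies and positions.

```python
from bisect import bisect_left
from collections import defaultdict


def build_segment(docs):
    """Build a toy segment from {doc_id: text} (hypothetical helper).

    Returns (terms, postings), where `terms` is a sorted list of unique
    terms and postings[i] holds (doc_id, frequency, positions) for terms[i].
    """
    index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id in sorted(docs):
        # Assumption: lowercase + whitespace split stands in for an analyzer.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    terms = sorted(index)  # kept sorted so lookups can use binary search
    postings = [
        [(doc_id, len(positions), positions)
         for doc_id, positions in index[term].items()]
        for term in terms
    ]
    return terms, postings


def lookup(segment, term):
    """Binary-search the sorted term list; return the postings or []."""
    terms, postings = segment
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []


segment = build_segment({0: "to be or not to be", 1: "to index"})
print(lookup(segment, "to"))
```

In this sketch everything lives in Python lists, so the disk-oriented layout (base position plus offset, skip-lists) is not modeled; only the sorted-term lookup and the per-term frequency and position data are shown.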
During the merging of segment indexes, a merge-sort is performed to keep the terms sorted in the merged index, and the document IDs are updated to reflect the new set of documents. The control structure for the final merged index is updated at each step to reflect the new merged index. With respect to the Lucene index, there are two key steps involved in merging segments into a next-level segment. The first step involves a k-way merge sort of the sorted term lists in the "k" segments to be merged. The second step involves reading the document and postings data per unique term and then copying the postings data to the destination segment. As the number of merges increases, this data is repeatedly read and copied over to successive merged segments. This process, along with the sorting of terms, makes the merge process inefficient.
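The two merge steps described above can be sketched as follows. This is an illustrative sketch only, not Lucene's merge implementation: the function name merge_segments and the segment layout (terms, postings, num_docs) are hypothetical, and document IDs are assumed to be local to each segment and renumbered by adding a per-segment base. Note that every posting is read and copied into the destination, which is the cost the paragraph above identifies when merges are repeated.

```python
import heapq
from itertools import groupby


def merge_segments(segments):
    """Merge k toy segments into one.

    Each segment is (terms, postings, num_docs): `terms` is sorted and
    unique, postings[i] is a list of (local_doc_id, positions) for terms[i].
    """
    # Document IDs are remapped by a per-segment base offset.
    bases, total = [], 0
    for _, _, num_docs in segments:
        bases.append(total)
        total += num_docs

    # Step 1: k-way merge of the sorted term lists.
    streams = [
        [(term, seg_no, i) for i, term in enumerate(terms)]
        for seg_no, (terms, _, _) in enumerate(segments)
    ]
    merged_terms, merged_postings = [], []
    for term, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        # Step 2: read each source posting list and copy it to the destination.
        out = []
        for _, seg_no, i in group:
            for doc_id, positions in segments[seg_no][1][i]:
                out.append((doc_id + bases[seg_no], positions))
        merged_terms.append(term)
        merged_postings.append(out)
    return merged_terms, merged_postings, total


seg_a = (["apple", "cat"], [[(0, [1])], [(1, [0, 3])]], 2)
seg_b = (["bat", "cat"], [[(0, [2])], [(0, [5])]], 1)
print(merge_segments([seg_a, seg_b]))
```

Merging the result with further segments would pass the already-copied postings through the same loop again, which mirrors the repeated read-and-copy behavior described above.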
Hence, the current indexing algorithm in Lucene, as well as similar prior art indexing algorithms, is not scalable. That is, if we double the number of processors for constructing one merged index (for the same data size), indexing may not necessarily get a speedup close to two. This is because the index-merge process quickly becomes the bottleneck.
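One way to make this scalability limit concrete is Amdahl's law, which is not named in the original text and is used here only as an illustrative model. Assume, hypothetically, that a fraction s of the total indexing time is spent in an index-merge phase that does not parallelize, and that the remainder parallelizes perfectly over p processors. The value s = 0.3 below is an arbitrary example, not a measurement.

```latex
S(p) = \frac{1}{s + \frac{1 - s}{p}}, \qquad
S(2)\Big|_{s = 0.3} = \frac{1}{0.3 + 0.35} \approx 1.54, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{s} \approx 3.33
```

Under these assumptions, doubling the processor count yields a speedup of about 1.54 rather than two, and no number of processors can push the speedup beyond about 3.33.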
With respect to conventional text indexing approaches such as the Lucene index, index updates are serialized, multiple index searches can proceed concurrently, and simultaneous index update and search are supported.
However, conventional text indexing approaches are not designed to be strongly scalable (while maintaining the same search performance) on massively parallel architectures, including future-generation multi-core architectures. The high cost of merging index segments prevents the current conventional text indexing approaches from scaling to larger numbers of processors.
Document-based text search (e.g., Lucene) is typically designed for low-memory workstations with large disks. Disadvantageously, such document-based text search involves a lot of overhead in order to merge indices residing on disks. A further disadvantage is that parameters for document-based text search need to be tuned for multi-core architectures (e.g., repetitive merging of the same index). Yet another disadvantage is that data structures designed for low-memory architectures are inefficient on large-memory multi-core architectures.
Hence, in view of the preceding, there is a need to re-design the existing indexing data structures and enable an indexing algorithm to be strongly scalable on massively parallel architectures while maintaining the same search performance as compared to indexing performed on non-massively parallel architectures.