1. Technical Field
The present invention relates generally to data stream applications and, more particularly, to a system and method for indexing data streams.
2. Description of the Related Art
Data stream applications are becoming increasingly popular. Many data stream applications use various linear optimization queries to retrieve the top-K tuples that maximize or minimize the linearly weighted sums of certain attribute values.
For example, in environmental epidemiological applications, linear models that incorporate, e.g., remotely sensed images, weather information, and demographic information are used to predict the outbreak of certain environmental epidemic diseases such as, e.g., Hantavirus Pulmonary Syndrome. In oil/gas exploration applications, linear models that incorporate, e.g., drill sensor measurements and seismic information are used to guide the drilling direction. In financial applications, linear models that incorporate, e.g., personal credit history, income level, and employment history are used to evaluate individual credit risks for loan approvals.
In all the above applications, data continuously streams in (e.g., from satellites and sensors) at a rapid rate. Users frequently pose linear optimization queries and want answers back as soon as possible. Moreover, different individuals may pose queries that have divergent weights and K's. This is because, e.g., the “optimal” weights may vary from one location to another (in oil/gas exploration), the weights may be adjusted as the model is continually trained with historical data collected more recently (in environmental epidemiology and finance), and different users may have differing preferences.
Chang et al., in “The Onion Technique: Indexing for Linear Optimization Queries”, SIGMOD Conf. 2000, pp. 391-402 (hereinafter the “Onion Technique Article”), the disclosure of which is incorporated by reference herein, proposed using an onion index to speed up the evaluation of linear optimization queries against a large database relation. An onion index organizes all tuples in the database relation into one or more convex layers, where each convex layer is a convex hull. For each i≧1, the (i+1)th convex layer is included within the ith convex layer. For any linear optimization query, to find the top-K tuples, typically no more than all the vertices of the first K outer convex layers in the onion index are searched.
However, due to the extremely high cost of computing precise convex hulls, both the creation and the maintenance of the onion index are rather expensive. Moreover, an onion index keeps track of all tuples in a relation and, thus, requires a lot of storage space. In a data streaming environment, tuples keep arriving rapidly while available memory is limited. As a result, it is very difficult to maintain a precise onion index for a data stream, let alone to use the precise onion index to provide exact answers to linear optimization queries against the stream.
A description will now be given of the traditional onion index, as disclosed in the above-referenced “Onion Technique Article”, for linear optimization queries against a large database relation.
Suppose each tuple includes n≧1 numerical feature attributes and m≧0 other non-feature attributes. A top-K linear optimization query asks for the top-K tuples that maximize the following linear equation:
$$\max_{\text{top } K} \left\{ \sum_{i=1}^{n} w_i a_i^j \right\},$$
where $(a_1^j, a_2^j, \ldots, a_n^j)$ is the feature attribute vector of the jth tuple and $(w_1, w_2, \ldots, w_n)$ is the weighting vector of the query. Some $w_i$'s may be zero. Here,
$$v_j = \sum_{i=1}^{n} w_i a_i^j$$
is called the linear combination value of the jth tuple. It is to be noted that a linear optimization query may alternatively ask for the K minimal linear combination values. In this case, we can turn such a query into a maximization query by switching the signs of the weights. For purposes of brevity and illustration, maximization queries are primarily described hereinafter.
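By way of illustration only, the linear combination value and the sign-switching conversion of a minimization query may be sketched in Python as follows. The function names `linear_combination`, `top_k_max`, and `top_k_min`, as well as the sample tuples and weights, are hypothetical and are not taken from the Onion Technique Article. The sketch scans every tuple and uses no index.

```python
import heapq

def linear_combination(weights, features):
    """Compute v_j = sum_i w_i * a_i^j for one feature attribute vector."""
    return sum(w * a for w, a in zip(weights, features))

def top_k_max(tuples, weights, k):
    """Return the k tuples with the largest linear combination values."""
    return heapq.nlargest(k, tuples, key=lambda t: linear_combination(weights, t))

def top_k_min(tuples, weights, k):
    """Answer a minimization query as a maximization query with negated weights."""
    negated = [-w for w in weights]
    return top_k_max(tuples, negated, k)

# Hypothetical sample data: five tuples with n = 2 feature attributes each.
tuples = [(1, 5), (4, 4), (6, 1), (2, 2), (3, 3)]
weights = (1, 2)

print(top_k_max(tuples, weights, 2))  # the two largest linear combination values
print(top_k_min(tuples, weights, 2))  # the two smallest linear combination values
```

Such an exhaustive scan touches every tuple for every query, which is the cost that an index such as the onion index is intended to reduce.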
A set of tuples S can be mapped to a set of points in an n-dimensional space according to their feature attribute vectors. For a top-K linear optimization query, the top-K tuples are those K tuples with the largest projection values along the query direction.
Linear programming theory has the following theorem, designated herein as Theorem 1.
Theorem 1: Given a linear maximization criterion and a set of tuples S, the maximum linear combination value is achieved at one or more vertices of the convex hull of S.
Utilizing this property, the onion index in the above-referenced “Onion Technique Article” organizes all tuples into one or more convex layers. The first convex layer L1 is the convex hull of all tuples in S. The vertices of L1 form a set S1⊂S. For each i>1, the ith convex layer Li is the convex hull of all tuples in
$S - \bigcup_{j=1}^{i-1} S_j$. The vertices of Li form a set
$S_i \subseteq S - \bigcup_{j=1}^{i-1} S_j$. It is easy to see that for each i≧1, Li+1 is contained within Li. FIG. 1 illustrates an exemplary onion index 100 in two-dimensional space, in accordance with the prior art. The exemplary onion index 100 shown in FIG. 1 includes a first convex layer 110, a second convex layer 120, and a third convex layer 130.
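The layer-by-layer construction described above may be sketched, for the two-dimensional case only, as follows. This is a minimal illustration under stated assumptions and is not the procedure of the Onion Technique Article: the hull routine is the well-known monotone chain method, the names `cross`, `convex_hull_vertices`, and `onion_layers` are hypothetical, and the sample points are invented. Points lying on a hull edge without being vertices are left for the next layer.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull_vertices(points):
    """Return the vertices of the 2-D convex hull (monotone chain method)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop while the last two points and p do not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]

def onion_layers(points):
    """Peel convex layers: S_i is the vertex set of the hull of S minus S_1..S_(i-1)."""
    remaining = set(points)
    layers = []
    while remaining:
        layer = convex_hull_vertices(remaining)
        layers.append(layer)
        remaining -= set(layer)
    return layers

# Hypothetical sample: two nested squares around a single center point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
for i, layer in enumerate(onion_layers(points), start=1):
    print(f"L{i}: {layer}")
```

For the sample points, the sketch yields three layers, namely the outer square, the inner square, and the center point, each contained within the preceding one.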
From Theorem 1, we know that the maximum linear combination value at each Li (i≧1) is at least as large as all linear combination values from Li's inner layers. Also, there may be multiple tuples on Li whose linear combination values are larger than the maximum linear combination value of Li+1. As a result, we have the following property, designated herein as Property 1.
Property 1: For any linear optimization query, suppose all tuples are sorted in descending order of their linear combination values (vj). The tuple that is ranked kth in the sorted list is called the kth largest tuple. Then the largest tuple is on L1. The second largest tuple is on either L1 or L2. In general, for any i≧1, the ith largest tuple is on one of the first i outer convex layers.
Given a top-K linear optimization query, the search procedure of the onion index starts from L1 and searches the convex layers one by one. On each convex layer, all its vertices are checked. Based on Property 1, the search procedure can find the top-K tuples by searching no more than the first K outer convex layers.
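The search bound stated above may be sketched as follows, assuming that the convex layers have already been computed and are supplied as a list ordered from the outermost layer inward. The function name `search_onion` and the sample layers are hypothetical. The sketch simply examines every vertex of the first K layers and does not attempt any earlier termination.

```python
import heapq

def search_onion(layers, weights, k):
    """Return the top-k tuples by examining only the first k convex layers.

    `layers` is assumed to be ordered from outermost (L1) to innermost,
    each layer being a list of feature attribute vectors.
    """
    def value(t):
        return sum(w * a for w, a in zip(weights, t))

    candidates = []
    for layer in layers[:k]:      # Property 1: the first k layers suffice
        candidates.extend(layer)  # every vertex of the layer is checked
    return heapq.nlargest(k, candidates, key=value)

# Hypothetical precomputed layers in two-dimensional space.
layers = [
    [(0, 0), (4, 0), (4, 4), (0, 4)],  # L1
    [(1, 1), (3, 1), (3, 3), (1, 3)],  # L2
    [(2, 2)],                          # L3
]

print(search_onion(layers, (1, 1), 2))
print(search_onion(layers, (2, 1), 3))
```

In the first query of the example, the largest tuple is on L1 and the second largest tuple is on L2, consistent with Property 1.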
During a tuple insertion or deletion, one or more convex layers may need to be reconstructed in order to maintain the onion index. The detailed onion index maintenance procedure is disclosed in the above-referenced “Onion Technique Article”. Both the creation and the maintenance of the onion index require computing convex hulls. This is fairly expensive, as given N points in an n-dimensional space, the worst-case computational complexity of constructing the convex hull is $O(N \ln N + N^{\lfloor n/2 \rfloor})$.
It is to be noted that in some data stream applications, the linear optimization queries are known in advance and the entire history of the stream is considered. In this case, for each linear optimization query, an in-memory materialized view can be maintained to continuously keep track of the top-K tuples. However, if there are many such linear optimization queries, it may not be feasible and/or otherwise possible to keep all these materialized views in memory and/or to maintain them in real time.
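One possible form of such an in-memory materialized view, for a single query known in advance, may be sketched as follows. The class name `TopKView`, its methods, and the sample stream are hypothetical, and the use of a bounded min-heap is an assumption made for illustration rather than a technique taken from the source. Because the entire history of the stream is considered, the sketch handles insertions only.

```python
import heapq

class TopKView:
    """Continuously tracks the top-k tuples of a stream for one fixed query."""

    def __init__(self, weights, k):
        self.weights = weights
        self.k = k
        self._heap = []  # min-heap of (linear combination value, tuple)

    def insert(self, t):
        """Process one arriving tuple."""
        v = sum(w * a for w, a in zip(self.weights, t))
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (v, t))
        elif v > self._heap[0][0]:
            # The new tuple displaces the current k-th largest tuple.
            heapq.heapreplace(self._heap, (v, t))

    def top_k(self):
        """Return the tracked tuples in descending order of value."""
        return [t for _, t in sorted(self._heap, reverse=True)]

# Hypothetical stream of tuples with two feature attributes each.
view = TopKView(weights=(1, 2), k=2)
for t in [(1, 5), (4, 4), (6, 1), (2, 2), (3, 3)]:
    view.insert(t)
print(view.top_k())
```

Each such view holds K tuples per query, so the total memory and per-tuple update work grow with the number of registered queries.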
It is to be further noted that in a data streaming environment, tuples may continuously arrive rapidly and the available memory is typically limited. To meet the real-time requirement of data streams, everything is preferably done in memory. Moreover, the processing should not incur a lot of computation or storage overhead. However, the original onion index keeps track of all tuples and, thus, requires a lot of storage space. Also, as noted above, maintaining the original onion index is computationally expensive, making it difficult to meet the real-time requirement of data streams. Therefore, the original onion index, as introduced in the above-referenced “Onion Technique Article”, does not work for data streams.