Increasingly, an abundance of business intelligence data is gathered from the Internet and other information sources. Much of this data takes the form of information describing an action or occurrence (i.e., an event) that is typically generated by a user or a computer. Event data, including but not limited to data that may be associated with or derived from events, is often stored for later access, identification, manipulation, or use. In many cases, event data is stored in the form of records within one or more datastores, data sets or database files (e.g., in the form of tables). Data sets storing event data typically require significant amounts of storage space that may be spread across a plurality of networked storage devices.
Event data that is gathered from one or more information sources may be related or share common properties despite being stored in different data sets or residing at different network or storage locations. In order to access, identify, manipulate, or use commercially useful information, businesses typically build queries or provide instructions for extracting event data based upon the related or shared common properties of the event data. Commonly referred to as “data mining,” this process typically involves searching through numerous data sets that include one or more fields (i.e., primary key fields) that uniquely identify event data sharing common or related properties. Event data matching a certain query may then be extracted from the numerous data sets.
Data mining is typically a processor-intensive activity. Even in distributed processing systems, where multiple computers may be linked in a network to perform the same work, processing of queries that span large and/or numerous data sets often requires a significant number of CPU machine cycles. Particularly where queries request event data from a plurality of databases, the processing overhead for merging and analyzing event data records across the plurality of databases may be enormous.
In many circumstances, query results may be required in a timely manner (e.g., within microseconds), or may be required to be produced in a way that reduces utilization of one or more processors. In response to these and other requirements, many queries may make use of data that is pre-sorted. Pre-sorting data set information typically makes searching more efficient by organizing a collection of data into a sequenced order that may permit faster extraction of the data on the basis of the sequenced order. Despite some efficiency that may be gained by pre-sorting a data set, queries requesting event data from a plurality of data sets do not necessarily exhibit the same efficiency if the query directs a search of more than one sorted, yet un-merged, data set. Such queries may exhibit a high number of input/output operations or in-memory tree/scan operations that may degrade the performance of the query operations. Thus, there exists a need for methods and systems to efficiently merge event data that may comprise a plurality of data sets.
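By way of illustration only, the difference between searching several sorted but un-merged data sets and searching a single merged data set may be sketched as follows. The sketch is not a description of any particular method or system; the record layout (key, source, payload), the sample data sets, and the function names are hypothetical and chosen solely for the example. It assumes each data set is already sorted on a shared key field and uses the standard Python `heapq.merge` and `bisect` facilities.

```python
import heapq
from bisect import bisect_left, bisect_right
from operator import itemgetter

# Hypothetical event records of the form (key, source, payload).
# Each data set is assumed to be pre-sorted on the key field.
clicks = [(101, "clicks", "ad_7"), (205, "clicks", "ad_2"), (309, "clicks", "ad_7")]
views = [(101, "views", "page_a"), (150, "views", "page_b"), (309, "views", "page_c")]
purchases = [(205, "purchases", "sku_9"), (309, "purchases", "sku_4")]


def find_in_sorted(records, keys, key):
    """Return all records whose key equals `key`, using binary search on `keys`."""
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return records[lo:hi]


def lookup_unmerged(key, data_sets):
    """Search every sorted data set separately: one binary search pass per set."""
    hits = []
    for records in data_sets:
        keys = [r[0] for r in records]  # key column, rebuilt here for brevity
        hits.extend(find_in_sorted(records, keys, key))
    return hits


# Merge the sorted data sets once into a single sequence ordered by key.
merged = list(heapq.merge(clicks, views, purchases, key=itemgetter(0)))
merged_keys = [r[0] for r in merged]

# After merging, a query touches one sorted sequence instead of three.
from_merged = find_in_sorted(merged, merged_keys, 309)
from_unmerged = lookup_unmerged(309, [clicks, views, purchases])
print(from_merged)
```

In this sketch, both lookups return the same records, but the un-merged lookup repeats its search once per data set, whereas the merged lookup performs a single search after a one-time merge.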