This invention relates to the field of computer systems. More particularly, a system, method, and apparatus are provided for organizing large quantities of multi-dimensional data in support of count-distinct queries executed against the data.
A count-distinct query executed against a set of multi-dimensional data returns a count of the number of unique values for one or more specified dimensions. For example, an illustrative collection of data might encompass all ten-digit telephone numbers in use across the United States, and include dimensions such as area code, prefix (i.e., the three digits that follow the area code), a geographic area (if any) in which the number is situated, etc. Illustrative count-distinct queries might therefore be executed against this data to find the number of distinct area codes in the U.S., the number of unique prefixes within one or more area codes, etc.
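By way of illustration only, the telephone-number example above may be sketched as follows. The record layout, the sample values, and the helper name `count_distinct` are assumptions introduced for this sketch and do not describe any particular implementation.

```python
# Hypothetical records: (area_code, prefix, line_number).
# The values below are made up for illustration.
phone_numbers = [
    ("415", "555", "0101"),
    ("415", "555", "0102"),
    ("415", "777", "0199"),
    ("212", "555", "0101"),
]


def count_distinct(records, key):
    """Return the number of unique values that `key` extracts from `records`."""
    return len({key(record) for record in records})


# Number of distinct area codes across all records.
distinct_area_codes = count_distinct(phone_numbers, key=lambda r: r[0])

# Number of unique prefixes within a single area code ("415").
in_415 = [r for r in phone_numbers if r[0] == "415"]
distinct_prefixes_415 = count_distinct(in_415, key=lambda r: r[1])

print(distinct_area_codes)    # 2
print(distinct_prefixes_415)  # 2
```

In this sketch a set holds each value already encountered, so repeated values contribute only once to the returned count.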
Count-distinct queries can become time-intensive and resource-intensive when the data grows very large. For example, consider a collection of data encompassing all electronic mail messages dispatched within a day, a week, or some other time period. An illustrative count-distinct query may attempt to identify how many unique subject lines were found within e-mail messages sent to or from a particular domain, or among messages of a particular size, etc. Such a query must not only identify all relevant data records or elements, such as all messages to or from the target domain, but also eliminate duplicates, so that once a unique subject line is identified, all other relevant messages having the same subject line are ignored.
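The two steps described above, identifying the relevant messages and then eliminating duplicate subject lines, may be sketched naively as follows. The message fields (`sender`, `recipient`, `subject`), the sample data, and the function name `count_distinct_subjects` are hypothetical and serve only to illustrate the work such a query performs.

```python
# Hypothetical e-mail records; the field names and values are assumptions.
messages = [
    {"sender": "a@example.com", "recipient": "b@other.org", "subject": "Hello"},
    {"sender": "c@other.org", "recipient": "d@example.com", "subject": "Hello"},
    {"sender": "e@example.com", "recipient": "f@example.com", "subject": "Status report"},
    {"sender": "g@other.org", "recipient": "h@other.org", "subject": "Unrelated"},
]


def count_distinct_subjects(messages, domain):
    """Count unique subject lines among messages sent to or from `domain`."""
    suffix = "@" + domain
    seen_subjects = set()
    for message in messages:
        # Step 1: keep only messages to or from the target domain.
        relevant = (message["sender"].endswith(suffix)
                    or message["recipient"].endswith(suffix))
        if relevant:
            # Step 2: the set ignores a subject line it has already seen.
            seen_subjects.add(message["subject"])
    return len(seen_subjects)


print(count_distinct_subjects(messages, "example.com"))  # 2
```

This naive approach scans every record and keeps every distinct subject line in memory, which suggests why the cost of such a query grows with the size of the data.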
In today's computing environment, computing systems hosting messaging services, retailers, news sources, social networking sites, and/or other services process very large amounts of data. Count-distinct queries within these systems may take significant amounts of time (e.g., many minutes or hours), depending on the amount of data being queried. In any system in which these types of queries must be executed on a regular or frequent basis, the time it takes to receive a query's results may negatively affect system operations.