The following relates to techniques for gathering data from a variety of sources and using the gathered data to populate relational database tables of a data warehouse.
A data warehouse is a database designed to support decision making in an organization. It is normally batch updated on a periodic basis and can contain enormous amounts of data. For example, large retail organizations can have 100 GB or more of transaction history in a data warehouse.
The data in a data warehouse is typically historical and static and may also contain numerous summaries. It is structured to support a variety of analyses, including elaborate queries on large amounts of data that can require extensive searching.
The data warehouse often represents data as a xe2x80x9ccubexe2x80x9d of three, four, or more dimensions. For example, a business might be modeled as a cube having three dimensions, corresponding real-world business distinctions such as, for example, Product, Time, and Market. Any point within the cube is at the intersection of the coordinates defined by the edges of the cube, and can be viewed as corresponding to a metric or measurement that is valid for the combination of dimension values that define the point. For example, such metrics might include xe2x80x9cunits sold,xe2x80x9d xe2x80x9cprice,xe2x80x9d, etc. Each point might indicate the price and units sold of a particular product, at a particular time or time period, in a given market.
It is common to implement this data model from within a relational database. A relational database consists of many interrelating tables. FIG. 1 shows an example of such tables. Some of these tables are dimension tables 21. A dimension table is a collection of information describing a business construct. For example, in the FIG. 1 model designed to represent web usage, there is a xe2x80x9cDomainxe2x80x9d dimension table. The Domain dimension table includes information in the form of strings that describe each target domain, such as the site the domain belongs to and the country code for the domain. Other dimension tables contain information describing concepts such as xe2x80x9cTime,xe2x80x9d xe2x80x9cReferring Domain,xe2x80x9d and many others. Note that dimensions are usually parameters relating to the organization of measured data, and do not indicate the measured data itself.
The tables of FIG. 1 also include one or more fact tables 22. A fact table includes the actual numeric metrics, such as a count of page views, that a user might be interested in viewing. In addition, there are defined relationships between the dimension and fact tables. Specifically, the fact table has a plurality of keys, referred to as primary key IDs, that correspond to similar primary key IDs in the dimension tables. This allows the individual records of the fact table to be indexed or matched up to specific dimensional valuesxe2x80x94given a set of dimensional values, corresponding metrics can be located.
In the example above, a user wishes to view data from the page views fact table. The Domain dimension table allows the user to choose a single domain, and then see only the data from the page views fact table that corresponds to that target domain. Similarly, the time dimension allows us to choose a single day, and view only the data from the page views fact table that corresponds to the chosen target domain and the chosen date. Choosing the dimensions across which a user wants data to be summarized is sometimes referred to as slicing the data. A definition of the relationship between tables in a data warehouse is called a schema.
Most metrics are aggregates, as they summarize data across criteria provided by one or more dimension tables in the data warehouse. In the example above, the count of page views is aggregated across a specific target domain (from the Domain table) and a specific day (from the Time table). This particular metric provides a count of a given value; other metrics might provide a sum, average, or other summary.
Still other metrics are calculated, rather than aggregated. For example, a data warehouse might provide metrics such as Peak Hour Page Views, which provides the hour during which the most page views are received. This metric is not derived by summarizing a value across dimensions; instead it is calculated by comparing a value across dimensions, and selecting the top value. Other calculated metrics might provide the bottom value, the top or bottom N values, the top or bottom percentage, etc.
Data modeling such as this is described in Kimball, Ralph, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (Wiley 1996).
Once the tables of a data warehouse are populated with actual data, the warehouse becomes very useful. However, the process of populating the data warehouse can become quite difficult because of the enormous amounts of data invovled.
Consider, as an example, the task of populating a web usage data warehouse in a company such as Microsoft Corporation that currently maintains nearly thirty web sites. Many of these sites are administered by different divisions within the company, in different parts of the world. Furthermore, each site might have a number of individual servers. For example, company such as Microsoft might maintain more than 500 servers, which might use different types of server software. Together, this number of servers might generate over 1.5 billion log records, each representing a page hit. For data warehousing purposes, it is desired to combine data logged by each of these servers and use it to populate a data warehouse.
This is normally accomplished using xe2x80x9cExtract, Transform, and Loadxe2x80x9d (ETL) methodology. Extraction refers to actually obtaining the data from individual data sources such as servers. This process in itself can be difficult because of the enormous size of the data.
Transformation indicates processing the data to put it into a more useful form or format: removing redundancies, adding business logic, filtering irrelevant data, integrating additional information, etc.
Loading refers to the process of loading the data into the tables of a relational database.
Although data warehousing techniques have been used in a number of different environments, it is not believed that data warehousing has been successfully implemented in a system such as that mentioned above, in which such a huge amount of data needs to be extracted and transformed from such a large number of different servers, and then used to populate a single data warehouse. The invention described herein addresses this deficiency of the prior art.
Described below is a data collection system and a method for populating a data warehouse. The system uses a pipeline approach, in which individual log files are retrieved from individual servers and processed in sequence through various operations. These operations include a pre-parsing operation that parses each log file to create a file representing dimensional data, and another file representing metric data. Other transformation logic is also applied during this stage.
These files are then subject to a subsequent parsing operation that parses the pre-processed files into tables corresponding to tables of the data warehouse. These tables are eventually merged and loaded directly into the relational database of the data warehouse, after removing duplicates, sorting, and counting aggregates. Using these techniques, the majority of processing is performed by programs other than relational database programs, greatly increasing the speed of processing.
Significant efficiency is gained by pushing a pre-processing component to each server on a daily basis. Using this component, each server filters its own data before sending it to the data collection system.
In addition, the application of transformation logic results in dimension strings being replaced by keys. Rather than referring to a database for such keys when applying the transformation logic, the system calculates each key as a hash of the string. This eliminates the performance bottleneck that otherwise results from reverse lookups in a database.