A data warehouse is a database designed to support decision-making in an organization. A typical data warehouse is batch updated on a periodic basis and contains an enormous amount of data. For example, large retail organizations may store one hundred gigabytes or more of transaction history in a data warehouse. The data in a data warehouse is typically historical and static and may also contain numerous summaries. It is structured to support a variety of analyses, including elaborate queries on large amounts of data that can require extensive searching.
The data warehouse often represents data as a “cube” of three, four, or more dimensions. For example, a business may be modeled as a cube having three dimensions, corresponding to real-world business distinctions such as Product, Time, and Market. Any point within the cube is at the intersection of the coordinates defined by the edges of the cube, and is viewed as corresponding to a metric or measurement that is valid for the combination of dimension values that define the point. For example, such metrics might include “units sold,” “price,” etc. Each point may indicate the price and units sold of a particular product, at a particular time or time period, in a given market.
Some systems implement this data model from within a relational database. A relational database has many interrelating tables. As known in the art, each table has a two dimensional structure of values with records and fields. A table can have a combination of one or more fields called the primary key. This means that for each record, the values in the fields of the primary key serve to identify the record. These values in fields of the primary key are known as primary key identifier (PKID). A given PKID should be unique in a table; that is, no two records should have the same PKID.
Tables in a relational database are related by means of foreign keys. A foreign key is a combination of one or more fields. Each foreign key relates to a primary key of another table. A record in a table with a foreign key relates to a record in a table with a primary key if the fields in the foreign key have the same values as the fields in the primary key.
Those skilled in the art are also familiar with dimension tables. A dimension table is a collection of information describing a business construct. For example, in a model designed to represent web usage, there is a “Domain” dimension table including information in the form of strings that describe each target domain, such as the site the domain belongs to and the country code for the domain. Other dimension tables contain information describing concepts such as “Time,” “Referring Domain,” and many others. Note that dimensions are usually parameters relating to the organization of measured data, and do not indicate the measured data itself.
Other tables include fact tables which contain the actual numeric metrics, such as a count of page views, that a user might be interested in viewing. In addition, there are defined relationships between the dimension and fact tables. Specifically, the fact table has a plurality of foreign keys which relate to primary keys in the dimension tables. This allows the individual records of the fact table to be indexed or matched up to specific dimensional values. That is, given a set of dimensional values, corresponding metrics can be located. In the example above, a user wishes to view data from the page views fact table. The Domain dimension table allows the user to choose a single domain, and then see only the data from the page views fact table that corresponds to that target domain. Similarly, the time dimension allows the user to choose a single day and view only the data from the page views fact table that corresponds to the chosen target domain and the chosen date. Choosing the dimensions across which a user wants data to be summarized is sometimes referred to as slicing the data. A definition of the relationship between tables in a data warehouse is called a schema.
Most metrics are aggregates that summarize data across criteria provided by one or more dimension tables in the data warehouse. In the example above, the count of page views is aggregated across a specific target domain (from the Domain table) and a specific day (from the Time table). This particular metric provides a count of a given value. Other metrics might provide a sum, average, or other summary. Still other metrics are calculated, rather than aggregated. For example, a data warehouse might provide metrics such as Peak Hour Page Views, which provides the hour during which the most page views are received. This metric is not derived by summarizing a value across dimensions; instead, it is calculated by comparing a value across dimensions and selecting the top value. Other calculated metrics might provide the bottom value, the top or bottom N values, the top or bottom percentage, etc.
Those skilled in the art are familiar with data modeling such as this (see Kimball, Ralph, The Data Warehouse Lifecycle Toolkit, Wiley 1998).
After the tables of a data warehouse have been populated with actual data, the warehouse becomes very useful. However, the process of populating the data warehouse can become quite difficult because of the enormous amounts of data involved. Consider, as an example, the task of populating a web usage data warehouse in a company that maintains numerous web sites administered by different divisions within the company in different parts of the world. Furthermore, each site may have a number of individual servers. For example, the company may maintain more than five hundred servers, which might use different types of server software. Together, the servers may generate over 1.5 billion log records, each representing a page hit. For data warehousing purposes, it is desired to combine data logged by each of these servers and use it to populate a data warehouse.
Some prior art systems use “Extract, Transform, and Load” (ETL) methodology. Extraction refers to actually obtaining the data from individual data sources such as servers. Unfortunately, this process in itself can be particularly difficult when dealing with the enormous size of the data in a web usage data warehouse or other large database. Transformation indicates processing the data to put it into a more useful form or format. Loading refers to the process of loading the data into the tables of a relational database. These existing systems provide summaries of user information. However, there is a need for retaining user level detail data in addition to the summaries. For example, there is a need to provide monthly views of data that is collected daily. Such collection results in very large amounts of data (e.g., seventy-five terabytes per month). Because the existing systems load all the data in one or more databases across many computing devices, servicing a user query for data requires scanning all sets of data in all the databases. Such systems typically employ massively parallel or symmetric parallel systems with hardware at a cost of several million dollars. There is a need for a system using a single database in which the data is correlated prior to loading into the database.
To effectively analyze and data mine detailed user information for hundreds of millions of users (e.g., tens of terabytes of data), the user information must be kept up-to-date and reduced in volume to something that an online analytical processing (OLAP) server can handle. The high cardinality user detail data may be too large to load into the database directly. There is a need for extracting a huge amount of data from a large number of different servers and transforming the extracted data to populate a single data warehouse. Further, there is a need for cross-referencing (e.g., per user) all the different types of data (e.g., newsletters, member directories, web logs).
For these reasons, a system for collecting and maintaining detailed user information is desired to address one or more of these and other disadvantages.