1. Field of the Invention
The invention disclosed herein relates generally to database applications, and more particularly to a system and method for constructing generic analytical database applications through the automated creation of metadata to establish an application structure controlling the availability and operability of individual applications.
2. Description of the Background
Market analysis is a key tool in customer acquisition and retention. Performing a detailed market analysis of both potential product or service purchasers and existing purchasers, and from such analysis obtaining an understanding of customer behavior, product success, marketing success, and other business factors enables a supplier to make critical decisions regarding how they conduct business in order to maximize both new customer acquisition and the retention of existing customers. However, performing such a detailed analysis in order to obtain an accurate depiction of customer behavior can be a daunting task.
Many product and service providers maintain records of particular customer data, including demographic information, purchasing characteristics, etc., in addition to maintaining detailed records of products, pricing, and other business information. The collections of such data in electronic form can become enormous for large entities and attempting to digest such information into a logical structure from which one might be able to deduce customer trends and characteristics can be a highly technologically challenging task.
In order to analyze such large volumes of information to in turn make reasoned deductions about customer characteristics, product sales, and other business-related information, automated data mining tools have been implemented which attempt to identify patterns in data which might not be self-evident in a review of the data by a human analyst. Such data mining tools traditionally seek to extract patterns that exist in large volumes of data in order to forecast future expectations, whether of customer actions, future profit, inventory needs, scheduling requirements, or any other trend or characteristic that a planner may wish to forecast.
Unfortunately, however, the implementation of a successful data mining strategy often requires persons having highly technical expertise to develop accurate data mining profiles that might model such future expectations. Before a user may know what questions may be asked, the user must know what data is available to answer his questions. Likewise, in order to ask a particular question for which the data might be able to provide an answer, the user must know how to structure the question, i.e., have knowledge of the query language required to ask the question. The fact that database fields, customers, and analytical database applications all have distinct labels for the same conceptual piece of data further exemplifies the need for expert-level knowledge of the data structure. Given such labeling discrepancy, while a particular non-expert user may know in plain English the question he wishes to have answered, he lacks the expert knowledge of how that data is arranged or how the applications might extract and manipulate that data, and thus whether and how the data might provide that answer. While the creation of metadata describing how the data is arranged may to a limited extent alleviate some of the problems arising from such labeling discrepancy, the creation of metadata itself can also be a daunting task.
The development of a successful data mining strategy traditionally involves six distinct phases, as follows: (i) obtaining an understanding of the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives; (ii) initial data collection and activities to become familiar with the data, identify data quality problems, discover first insights into the data, and detect interesting subsets to form hypotheses for hidden information; (iii) constructing the final dataset to be used by the predictive modeling tools from the initial raw data, including categorizing continuous variables, eliminating outlying data values, joining related data sources, and extracting a sample of the result; (iv) selecting and applying various modeling techniques, and calibrating their parameters to optimal values, which often involves returning to the data preparation phase; (v) evaluating the model to be certain it properly achieves the business objectives and meets accuracy requirements; and (vi) deployment of the model, in which analysts consider the application of the model to new inputs, consider the application that will eventually receive the scored prospects, and consider the operational issues surrounding the ongoing execution of the model. Such a process often involves up to six months of expert time to develop a single model, with the business understanding, data understanding, and data preparation phases consuming the largest portion of this effort.
Obviously, it would be highly advantageous to provide a means by which the business understanding, data understanding, and data preparation phases could be implemented in less time so as to provide timely analysis, and without the need for such expert technical intervention.
Moreover, in order to provide adaptive applications that provide the greatest possible utility to users, it is necessary that the applications themselves be able to adapt to the data that is available, such that all application functions that are supported by the available data are readily available to the user, while application functions that are not supported are not available to the user.
It is, therefore, an object of the present invention to provide a method and system for constructing generic analytical database applications which avoids the disadvantages of the prior art.
In accordance with the above objects, a method and system for constructing generic analytical database applications is provided which allows users to access application functions which are enabled depending upon the data stored in databases, as reflected by metadata.
The metadata connects user terminology, application terminology, and database terminology in a common framework. The system of the instant invention allows the construction of analytical database applications having broad adaptability in terms of the database layout and business functions supported. The applications operate against an abstract business model, with all components (i.e., xe2x80x9crolesxe2x80x9d) optional. Application functions presented to the user depend on the configuration of roles which are present in the metadata. Databases are described in the metadata at an entity relationship level, and are annotated with semantic business roles, as well as with a wide variety of user preferences for naming and presentation. The end-user sees neither database terminology (SQL names) nor application terminology (roles). Rather, they see things in relevant business terms, and they only see applications and functions which are viable given the metadata. If there is a database in place, much of the metadata can be guessed automatically (e.g., table associations or roles). This means introduction of this technology provides a fast start, followed by incremental refinement which surfaces new application functions and more familiar user terminology.
The application environment of the instant invention seeks to provide a suite of applications which may be implemented across as many database scenarios as possible, and thus creates generic applications which use rich metadata to adapt to an existing data environment. The application environment of the instant invention provides a variety of information and services to the individual database applications, such as information on the user (e.g., user preferences, user privileges, etc.), and information on the data visible to that user (i.e., metadata describing the information both in human terms and machine terms). The applications, in turn, provide a variety of information and services to the application environment, including application name, description of the currently-specified task, and an indication of whether the user and data characteristics are sufficient to run the application.
By interacting with data on metadata terms instead of database terms, applications can present data using end-user terms, rather than being forced to resort to database names. Common metadata means applications use common terminology to refer to business elements. As the metadata changes, the terminology changes for all applications simultaneously.
The application environment of the instant invention adapts to business data as it stands. If an application is irrelevant to a particular business (the underlying information to support it is not present), it simply doesn""t appear as an option in the user interface. If, in fact, an application is of interest, the application environment of the instant invention can xe2x80x9cask itxe2x80x9d why it""s not available, and it can articulate the missing pieces it needs. Once these pieces are provided, the application appears.
Each application in the suite of applications made available to any particular user has a unique name. Moreover, within each application is a set of application-specific user pages, each of which in turn has a unique name within that application. The individual application pages are defined in XML. More particularly, the HTML content for each application page is specified, and application-specific XML elements describing substitutions and conditional HTML generation may likewise be embedded in each application page. Such XML instructions are of a form that can be detected in an application-independent way, and presented to the application itself for resolution. Thus, the application may adapt the user interface, presenting select application pages to users depending upon particular criteria which is set forth in the XML code embedded in each page. For example, some XML instructions controlling the generation of particular HTML on a page presented to the user may include: (a) xe2x80x9csubstitute the user""s name herexe2x80x9d; (b) xe2x80x9ccreate an input field here, initialized to the name of the current tablexe2x80x9d; (c) xe2x80x9cif the user had administrative privilege, generate the following HTML; otherwise, omit itxe2x80x9d; (d) xe2x80x9ccreate a hyperlink here, exposing the name of the current table to the user, and linking to the table summary application if clickedxe2x80x9d; (e) xe2x80x9cgenerate a column summary table here, with the column headings I supply.xe2x80x9d
Thus, application pages may be externalized in metadata. This construction provides the user interface independence from the application code itself, enabling for example the templates for a given application page to be modified for particular users without modifying the application itself. This means that it is possible, for example, to have multiple users interacting with the same databases in different languages, each appearing to have theirs as the xe2x80x9cnative tonguexe2x80x9d. (Note that presentation down to the data value level is controlled through metadata, so data values can be presented and described in the user""s native tongue).
Another aspect of the system and method of the instant invention is the process of generating such metadata that is used by the application environment, i.e., discovering data structures in databases which structures may be recorded, indexed, and accessed to provide a semantic framework of the data. Business data is typically represented in relational form. In the absence of any further information, an experienced database analyst can generally make use of a database by looking at names of fields, guessing relationships between tables, surmising business roles of various attributes, and testing those ideas by doing SQL queries against the data. The system and method of the instant invention automates this discovery process, yielding a more systematic and comprehensive set of observations, data distributions, and xe2x80x9cinformed guessesxe2x80x9d about the semantics of the database. These can be presented in a form that is meaningful to someone who would not have been expert enough to generate the information manually. The automated data structure discovery process of the instant invention preferably includes the generation of data distributions for all the columns, row counts and column layouts of all tables, relationships (joins, associations) between the tables, an entity-relationship diagram of the data, and roles of the data items in the business (e.g. customer, sales, product).
Once the informed guesses are manually refined (by an interaction with a DBA which can now be very specific), this semantic framework can be updated automatically as the database changes.
The automated data structure discovery process of the instant invention may be described with reference to four key methods and the systems for their implementation, namely, (1) data categorization; (2) generation of abstract business semantic roles associated with database objects (e.g., tables, columns, or column categories); (3) identification of links between data tables (including join paths and associations); and (4) generation of an easy-to-use graphical and textual description of the metadata which may inform persons lacking technical database expertise of key aspects of the semantics of the data, such that non-technical users may readily adapt the metadata as necessary without having intimate knowledge of individual database structures.
1. Data Categorization
The first above-mentioned method, data categorization, involves apportioning data into organized subcomponents in order to ease the application of an analytical task. Often times, categorized data is simpler to use and more meaningful than raw data in implementing an analytical task. For example, presenting the distribution of gross profits as a function of sales price (thousands of sales prices, meaning thousands of result values . . . with gross profits for $9.95 and $9.96 reported separately) is almost meaningless, but presenting the same information by sales price category (a dozen carefully-chosen price ranges, with gross profits for each) could reveal useful insights about which sorts of sales are generating the most profits. Unfortunately, manual value categorization is slow, expensive, inconsistent, unresponsive to changing data, tedious, and subject to human error. This often means it""s not done , and that therefore end users must interact directly with raw data.
It is useful, a s a reference for database users, to display counts by category for the whole database. This provides a simple visual crosscheck that can surface unexpected behavior in general, and can help spot data quality problems in particular. Columns which are rarely populated, or which generally contain some default value, are easily spotted. Unexpected spikes in the distribution might identify something needing further investigation. Multimodal distributions may reveal cases where one data value is representing different information in different source systems.
Categorized data also facilitates data mining. Some data mining algorithms or data mining tools require categorized data; for these, the data mining analyst must do the data categorization up front. This can be a time-consuming and error-prone process. Some estimate this to be 80% of the data mining task.
Categorization provides another benefit for data mining. Continuous variables in databases often have outliers, i.e., extreme values which are often errors or other anomalies which may be present in a very small number of cases, but by virtue of their values have significant influence on the algorithms, xe2x80x9cdistractingxe2x80x9d them from seeing the patterns in the broader databases. Using categorized data instead of raw data, these problems are minimized.
The method and system of the instant invention automatically establishes a categorization for each column in the subject databases, mapping the full range of column values into some number of category values. This categorization is then preserved as metadata.
This categorization uses the existing data to find an encoding which is suitable for use by a human analyst as well as by an automated data mining algorithm. This is done by starting with a general categorization scheme which would be meaningful to an end-user, based on the underlying data type and actual data. For example, if the data is numeric, the method of the instant invention may consider uniform value ranges, with boundaries which are round numbers. Then, among all such possible categorizations, the method chooses the one which conveys the maximum information about the underlying data, using information entropy. Since the method of the instant invention automates data categorization, applications can rely on categorization being available for all database columns xe2x80x9con day onexe2x80x9d. The categories created by the automated categorization of the instant invention, therefore, provide applications and tools with a good starting point for presenting statistics relating to a database attribute.
2. Semantic Role Generation
The second above-mentioned method, semantic role generation, involves the association of particular business roles with database objects, such as tables, columns, or column categories, such roles representing an abstract, generic use of that object within the business. Once generated, these roles are captured in metadata and then used by applications, which thereby acquire a rudimentary xe2x80x9cunderstandingxe2x80x9d of the business data.
By way of example, semantic roles relating to a database table may include xe2x80x9ccustomerxe2x80x9d (a person or organization which buys products), xe2x80x9cproductxe2x80x9d (an item or service sold to a customer), xe2x80x9csalexe2x80x9d (an event associating a customer, product, and price on a date), xe2x80x9cbasketxe2x80x9d (collection of related sales to a customer), xe2x80x9cmarketing messagexe2x80x9d (brochure, web page, telemarketing campaign, etc.), xe2x80x9cmarketing contactxe2x80x9d (an event associating a message with a customer on a date), etc.
Roles are generic rather than industry specific. They are handles for applications, not for end-users. Thus, for example, an application for product affinity analysis might use semantic role metadata to find a database table playing the xe2x80x9cbasketxe2x80x9d role, then use other metadata to find the associated xe2x80x9csalexe2x80x9d and xe2x80x9cproductxe2x80x9d data. Although the terminology suggests a retail sales database, it may be that the xe2x80x9cbasketxe2x80x9d is an airplane ticket (a collection of flights bought by one customer), or a set of optional features for a car, or some other collection. In each case the application would identify xe2x80x9cproductsxe2x80x9d which tend to be bought together, i.e., information which may be useful for packaging, pricing, or promotion.
Roles can also be defined for database columns, such as sale price, sale date, product cost, contact date, etc.
Roles can likewise be defined for column values (or categories), such as bad credit risk, customer voluntarily terminated service, etc.
One database object may play several roles. For example, for a dating service, the xe2x80x9ccustomerxe2x80x9d and xe2x80x9cproductxe2x80x9d role may be played by the same table. Likewise, one role may occur multiple times. For example, an auto manufacturer may represent cars and optional features for cars in separate tables, even though both are products.
All roles are also optional. For example, a retailer operating on a cash basis may not have customer data.
Here are some further examples that might be helpful in understanding semantic roles, and how they are used by applications. A profitability application may look at the metadata, and find one column with the xe2x80x9csales pricexe2x80x9d role; it may also discover a xe2x80x9cproduct costxe2x80x9d column, perhaps in the sales transaction or the product table . . . no matter; it has what it needs to ascertain profitability of each sale. The application, seeing that sale date was present (possibly in the sale table, or maybe in the basket table) would now be able to offer the user the ability to view gross profit and margin trends over time.
Likewise, a cross-selling application might look for the Customer role in the metadata, and find that there are two (a table identifying individual buyers, and a table identifying households). In this case the application must resolve the ambiguity, perhaps by simply presenting the decision to the end user . . . xe2x80x9cDo you want to cross-sell to Households or Individuals?xe2x80x9d
Consider a marketing analysis application, which looks at sales data to identify and report significant product sales increases associated with specific marketing programs. It might require tables with the xe2x80x9cmarketing contactxe2x80x9d and xe2x80x9csalexe2x80x9d roles to be available before it can operate. One company might use this application to analyze web page effectiveness (each web page is a marketing message, each page hit is a contact). One might use it to gauge effectiveness of brochures (each brochure is a message, each time one is sent is a contact). A third may use the application for both, comparing web-based and direct mail promotions.
In sum, semantic roles can be used by applications to provide a level of data independence. These applications can determine if they have enough information to run: for example, a simple revenue reporting application might look for the sale price and sale date roles to determine if its revenue reporting services are available. If not, the application can be omitted from user interfaces. Applications which cannot run can clearly articulate what they need to run.
Applications can use roles to adapt their queries to fit the data model, instead of forcing the user to adapt his data model to fit the application. For example, a profitability analysis application might generate a simple query if price and cost reside in the sales transaction data; but if not (say, if the cost exists in the product table), the function can still be supported by using join path metadata to bring the cost and price information together. This enables the business to decide the logical place to capture cost information.
3. Identification of Links Between Data Tables
The third above-mentioned method involves the identification of links between multiple tables, which links may optionally also specify a join path and/or an association. Query tools are notorious for presenting end-users with a bewildering array of query specification capabilities, many of which lead to queries which are well-formed from an SQL perspective, but which are semantically incorrect. The method of the instant invention uses metadata to guide the user towards semantically correct queries, and suggests ways of presenting user options in a more logical fashion. By automatically detecting join paths, the method of the instant invention removes a tedious step from metadata creation.
The ability to perform data mining against multi-table data is important. Most real-world data mining exercises involve finding customers with certain in behavior, where that behavior appears elsewhere in the database. By creating a broad framework for data mining that goes beyond the flat-file model, we can start to think about considerably more flexible data mining tools, wielded by people who no longer need to deal with databases at an SQL level.
A join path defines how two tables may be relationally joined. There are some number of column pairs (one column in each table) that can be used to link the tables in a relational join operation. Thus we might have a join path between the xe2x80x9ccustxe2x80x9d and xe2x80x9csalexe2x80x9d table, with the xe2x80x9cidxe2x80x9d column in xe2x80x9ccustxe2x80x9d being joined to the xe2x80x9ccustidxe2x80x9d column in xe2x80x9csalexe2x80x9d. Typically, join paths involve a single pair of equijoin column, but many may be present (composite keys).
The metadata also characterizes the association multiplicity (in UML or ER terms). Thus, one sale might link to one and only one customer; each customer may link to any number of sales (possibly zero). The multiplicities are associated with each end of the link. This is a xe2x80x9cmany to onexe2x80x9d association.
Both the join paths and the multiplicities are automatically discovered and captured in the metadata using the method of the instant invention, and may subsequently be reviewed and refined by a database expert. This metadata (join paths, multiplicities) provides a type of semantic information that is useful for both humans and applications. For example, if xe2x80x9crevenuexe2x80x9d is defined against a sales table as the sum of the extended price, then a conventional SQL-based database tool might offer an ability to break down revenue by other attributes in the sales table, for example by date, product, or store. This would be a simple single-table SQL operation. However, each of these attributes is typically too detailed, comprising hundreds or thousands of values, to be of direct use as a breakdown criterion.
Given join paths and multiplicities, a xe2x80x9csmartxe2x80x9d application can go down many-to-one join paths to offer a broader set of breakdown criteria for an end-user to choose from. For example, revenue can be broken down by customer attributes, since there is a many-to-one relationship between sales and customers; similarly, revenue can be broken down by product department since there""s a many-to-one relationship between sales and product.
However, you generally can""t follow associations the other way: if the customer table had a column telling how much he spent with the company last year, then breaking down total customer spend by customer attributes would make sense; but breaking it down by sales attributes would not. xe2x80x9cTotal spend by sale datexe2x80x9d would result in valid SQL and would return an answer, but not at all what it sounds like: if customer 107326 bought 7 things on a particular sale date, then the reported xe2x80x9ctotal spendxe2x80x9d for that sale date would be incremented by 7 times 107326""s spend last year.
Since the metadata contains both join paths and multiplicities, it contains sufficient information to identify semantically-correct questions, as well as sufficient information to generate the corresponding SQL for the queries.
Similar rules make it possible to broaden the applicability of data mining to include multi-table data. To the best of the inventor""s knowledge, all commercial data mining tools operate against single tables: if an analyst wants to predict which kind of customers spend more on clothing, then both the output (e.g., clothing revenue) and the inputs (e.g., geographic region, age, occupation, sex, etc.) must all exist in one table or flat file. It is apparent, however, that if customers have a postal region, and the analyst has a postal region table (containing, e.g., a neighborhood classification: rural, suburban, etc.; or distance to the nearest store), then it is sensible to include these characteristics as inputs, even though they are in another table, since each customer has one neighborhood classification. Columns of any table which stands in a one-to-many relationship with Customer (directly, or through a chain of 1:M relationships) are semantically valid inputs for a Customer model.
Similarly, we need not limit ourselves to the Customer table when considering outputs to a propensity model. Using the same sort of logic we did for queries, we can consider aggregate statistics for any table standing in a many-to-one relationship with Customer.
4. Generation of Metadata Description
The fourth above-mentioned method concerns the production of an HTML document mapping the metadata such that it may by easily digested and utilized by a wide variety of end users, including database professionals, data analysts, and business end users, through a standard web browser.
Metadata is designed to be navigated by applications, not humans, and is thus ordinarily expressed in XML such that any analysis or manipulation of such metadata is ordinarily capable of being carried out only by persons having technical expertise in the particular data and database structures being analyzed. Thus, a major problem preventing the effective use of metadata is the fact that the xe2x80x9cresultxe2x80x9d is often inaccessible to the people who need it. The metadata is conventionally viewed through tools that presuppose the user knows the basic idea of how the metadata is organized. Since the metadata is very much organized around database structures, the tools tend to assume that this is the right setting for presenting the metadata, which is fine for persons having the required database technical expertise, but for the other 99.9% of the universe, this presents a problem.
This means that metadata, once created, is difficult to get right (get reviewed by the right people), since they don""t have the tools and they are not technical metadata experts. Further, if a user needs to find out about, for example, the kind of geographical information the company has about customers, they don""t go to the metadata for the same sorts of reasons.
By casting the metadata into HTML, the method of the instant invention makes it accessible to a wide audience, and eliminates the need for exotic tools (and exotic tool usage skills). Also, given that the metadata generated by the system and method of the instant invention goes beyond conventional database metadata (being more biased towards applications and end-users than towards databases), the document generated is, in fact, palatable to that broader audience.
The richness and interplay of the various types of metadata (going beyond tables and columns) allows the method of the instant invention to present information which is both useful for navigating the metadata and useful in understanding the underlying business information.
In addition to the above-described methods and systems for automating the discovery of a data structure and its recordation in the form of metadata, it is also important to note that once generated, the metadata must be maintained in order to reflect changes and updates in the data. For example, new tables and columns are likely to appear in subject databases over time. As underlying data changes, data value distributions change as well. With use, descriptive metadata (end-user labels, pop-up short descriptions, longer descriptive paragraphs) get out of date as new business targets and initiatives replace old ones.
It is quite normal for an organization to make a significant investment in metadata, only to find a year or two later that xe2x80x9cnatural forcesxe2x80x9d have eroded the relevance of the metadata to a point where it is an impediment rather than a valuable asset. However, many of the above changes can be automatically reflected in the metadata, and in other cases the changes can be reflected through an automatically-assisted process. In order for the metadata to reflect such changes, the system and method of the instant invention provides the following functionalities.
First, when a user is initially creating metadata, he specifies which data objects are of interest. For example, he might say xe2x80x9cI am interested in all tables in this database.xe2x80x9d In addition to capturing the actual data objects (in this case, the tables), the method and system of the instant invention also captures the way in which they were specified (xe2x80x9call tables in this databasexe2x80x9d). Later, if new data objects have appeared, the method and system of the instant invention can automatically include them in the metadata.
Second, the method and system of the instant invention captures, value distribution in a manner which allows data value metadata to be recreated from current data, and automatically recognizes when previous assumptions (e.g. data type) are invalid, requiring human intervention.
Third, whenever descriptive metadata is presented in the normal course of application execution, the method and system of the instant invention presents authorized users with the ability to change the description on the spot. That is, if the description does not fully convey what is needed (in the context where it is needed), it can be corrected on the spot.
Finally, the metadata evolves as it""s used. It can be mapped into HTML for review at any time, in which case all the current knowledge (database, description, roles, relationships, etc.) are placed in one, current, document which is suitable for use by end-users.