A large number of human endeavours use quantitative data methods to track and control activities. As a result, a large number of computer programs are available today for analyzing data. However, these programs are designed to work with clean (e.g., well-structured data).
Generally, clean data can be defined as data that is fully consistent with any implied relationships within the data. An implied relationship arises due to real-world relationships between objects represented by the data. For example, if one table contains course enrollment records containing a student number, and another table contains student records containing a student number and student name, then there is an implied join relationship between the tables, because the student numbers in both tables represent the same real-world objects. If the data is clean, then every student number found in the enrollment table will also appear in the student table, and each student number will only appear once in the student table. If these criteria are not met, then the data is considered to be “dirty”; however the implied relationship nevertheless still exists, because it is based on a real-world relationship that still exists. The difficulties with dirty data are first to identify the implied relationships when the data is not clean, and then to correctly address the inconsistencies in the data.
The existing methods only allow analysis of specific data from some specific source systems; others allow analysis of ad-hoc data sources. However, the existing programs need the data to be clean and well structured before it can be used or analyzed. This is disadvantageous since inconsistent “dirty” data tends to be the majority of cases, particularly with ad-hoc data sources. Some analysis programs allow the user to make limited corrections to the data, but these capabilities are often limited in scope, and in general it is very difficult to keep track of such corrections and ensure that they remain in place as the user works with the data. Therefore manually cleaning and restructuring the data is still often needed prior to performing any analysis or manipulation on the data.
The most common tools used for performing this manual cleaning and structuring are spreadsheet programs, and sometimes relational database programs for more sophisticated users. Although these are not data analysis tools per se, they are often used to prepare the data for analysis by allowing manual restructuring and manipulation of the data by a user. In many cases, the actual data analysis is then performed in spreadsheet programs also using its manual manipulation and charting capabilities.
As will be understood by a person skilled in the art, manual cleaning and restructuring of data is disadvantageous as it requires users to be knowledgeable of the desired result and clean data. As well, when handling large amounts of dirty data, this process is tedious and unreliable.
Further, the difficulty with these existing methods is that spreadsheet programs and database programs are extremely low-level tools designed for flexibility, not simplicity. They are designed to give the users control of all details and aspects. In a spreadsheet program, the user is thus responsible for ensuring every value is correct in every cell. Relationships between different parts of a spreadsheet are established on a cell-by-cell basis. For example, in order to establish a high level relationship between two data sources (such as, relating a customer number to the corresponding customer name), the user must perform operations on multiple individual cells using formulas in order to establish the relationship. This process is time consuming, non-intuitive and error-prone.
Further, although database programs allow certain operations from one or more tables to other tables, these are static operations. That is, static operations refer to the fact that the operation inputs one fixed set of data and outputs another fixed set. If a change is made in the original input data, this change is not reflected in the output data. If the input data contains certain errors, a database program will generally disallow certain operations until the errors are corrected. Users need to manually prepare the data by executing operations sequentially, starting from the initial input data and using the output of each operation as input to one or more other operations. As well the operations are performed by the user manually defining the links between each of the cells in the tables and this process needs to be redone once the order of the cells and/or location is changed.
As a result, data manipulation in spreadsheet and database programs involves significant numbers of sequential manual steps to be performed. Often users must go back and repeat steps if they need to correct errors made along the way. To redo the manipulations on a new or revised set of data, all steps must be performed again. The skill level required to perform this data manipulation and correction successfully is quite high.
The current state of the art in data manipulation software consists of four broad categories: high end statistical analysis, data mining and data visualization programs; business intelligence systems, generally integrated with a company's operational and marketing systems; specialty data manipulation software designed for the information needs of a specific vertical industry, often integrated with operational systems for the same industry; software used to analyze “ad-hoc” data with an arbitrary structure and content provided that the user performs manual manipulation on the data prior to any analysis to obtain clean data. The first category of software is generally designed for a small subset of users in specialist occupations, who have specific data analysis requirements that go beyond the needs of most users. The second and third categories of software generally require a company's operational data to be pre-structured and integrated with the data manipulation software and data warehouses. In general, the manipulation capabilities provided by these systems must be largely anticipated in advance. These systems also generally require significant involvement of technical personnel to set up and maintain. The fourth category of software allows users to perform analysis with data, and where the analysis needs cannot be anticipated in advance. However, the data often needs to be transformed, corrected or otherwise manipulated by the user before it can be usefully analyzed in these software packages.
This manual manipulation generally requires significant expertise both in data modeling and with the spreadsheet and/or database programs in use.
An increasingly large amount of corporate data resides in ad-hoc sources such as spreadsheets and desktop databases. Significant technical skills are generally required to extract useful information from this type of data. Spreadsheet programs and desktop database programs are “low-level” tools. They are designed to give the users control of all details and aspects. For example, the user is responsible for ensuring every value is correct in every cell in a spreadsheet, and relationships between different parts of the spreadsheet are established on a cell-by-cell basis. Thus, in order to establish a high level relationship between two data sources (for example, relating a customer number to the corresponding customer name), the user must perform operations on multiple individual cells using formulas in order to establish the relationship. This process is time consuming, non-intuitive and error-prone. Additionally, as mentioned earlier, the process needs to be repeated if the data including its order in the table changes.
Thus, users generally spend the majority of their time performing repetitive operations. This arises due to a number of factors. First, as previously noted, the tools involved are very low-level, so users often must perform multiple low-level operations to effectively perform a single high-level operation. Secondly, the tools involved are not dynamic, meaning that the tools often cannot be set up to dynamically perform all the necessary processing from beginning to end; instead the user generally performs a series of steps, with the output of each step being then used as input to the next step. If any errors are made at any step and discovered later, the user must repeat all the steps subsequent to the error; often, the intermediate results are no longer available at this point and the user must start again at the very beginning. Such errors happen frequently even with experienced users, due to the complexity of the process. Finally, the same analysis must be performed again on a regular basis using new data. In these situations the user must re-execute each step with the new data.
For example, a spreadsheet can join data only by manual manipulation, that is only by entering into a cell a specific formula and copying the formula to all other rows, and then repeating the process with a slightly different formula for each column to be joined. On the other hand, a desktop database is not dynamic (such operations must be performed sequentially, and if the input changes, the results of the join do not fully reflect the changes); additionally, a desktop database handles errors in the data very poorly, generally disallowing certain operations until the user has manually found and corrected each error.
Thus, spreadsheets and desktop databases used to analyze and report on data, require performing the same steps repeatedly in a very time-consuming and error-prone manner. Further, users typically spend much more of their time manipulating data, as opposed to actually analyzing the information.
A solution to one or more of these issues is therefore desirable.