Since the first relational database model has appeared, data mining and knowledge management were considered to be the biggest challenges in data processing. Data mining may be addressed by data drilling (data retrieval in a sophisticated way), which refers to any of various operations and transformations on tabular, relational, and multidimensional data. The term is commonly (according to WIKIPEDIA) associated with specialized software designed specifically for data analysis. Efficiency of data drilling becomes a crucial problem nowadays, when database systems are so complex that it may be almost impossible to be aware of a full logical database structure.
A typical database may contain hundreds and more commonly even thousands of data structures (e.g. tables). Such data structures usually relate to other structures in multiple ways by means of so-called relations. If users wish to execute database queries on such complex databases, they need to design a special, tailored query, which may be usually complex and require a long time to process by a computing machine. If the obtained result of such query does not reflect exactly the aim of the intention of the query designer, the queries may have to be performed again, which may be very costly in terms of time and required trial and error cycles.
Inherently, data drilling may be tightly bound to using complex JOIN type database queries.
JOIN type queries are commonly (according to WIKIPEDIA) defined in the prior art as follows. An SQL join clause combines records from two or more tables in a database. It creates a set that can be saved as a table or used as it is. The JOIN clause may be a means for combining fields from two tables by using values common to each. ANSI standard SQL specifies four types of JOIN: INNER, OUTER, LEFT, and RIGHT. As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.
A programmer writes a JOIN predicate to identify the records for joining. If the evaluated predicate may be true, the combined record may be then produced in the expected format, a record set or a temporary tables mechanism that provides a way to organize data derived from queries into new tables, which can themselves be queried repeatedly.
Present relational databases are often normalized in order to eliminate duplication of information when objects may have one-to-many, one-to-one or many-to-many relationships. The aims of normalization are: to store only the minimal amount of information, to remove redundancies in the data, to remove anomalies, and to restructure data in order to provide more efficient storage. For example, in a database storing human resources data, a single Department may be associated with many different Employees. Joining two tables effectively using many-to-many connection one must create another table which combines information from both tables. This may be at some expense in terms of the time it takes to compute the join.
Much of the work in database-systems has been aimed at efficient implementation of joins, because relational systems commonly call for joins, yet face difficulties in optimizing their efficient execution.
Three fundamental algorithms for performing a JOIN type operations exist: nested loop join that may be a simple algorithm that joins two sets by using two nested loops; sort-merge join that may be an algorithm to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time; and hash join that may be an algorithm performed by hashing one data set into memory based on join columns and reading the other one and probing the hash table for matches.
The problem with the known databases being split into numerous tables forming a logical hierarchy, may be that in order to obtain, as a result, logically related data, a query must traverse different tables in order to obtain the requested information.
Additionally, executing database queries to a normalized database typically requires time consuming retrieval of stored data from multiple normalized tables. The normalized database therefore needs to locate and retrieve the requested tables and then join the information from the tables in order to respond to the initial data request.
Further, the drawbacks of the known query optimization in normalized databases include for example increased memory and CPU requirements and difficulties in formulating complex queries.
The JOIN type queries reduce database performance by slowing down the processing and by placing stress on computer hardware. Normalized databases need more CPU, memory, and I/O to process transactions and queries than unnormalized and denormalized databases. In existing databases, the JOIN type queries incur significant overhead.
Therefore it may be advantageous to improve data drilling processes and make such processes independent of a hierarchy of a database, because up to now such data drilling processes have been hardcoded in end-user software, such as Business Intelligence (BI) database software, Enterprise Resource Planning (ERP) database software and Customer Relationship Management (CRM) database software, and any new request of a client that desires some new way of data drilling, presently requires end-user's software update and extensive work on the side of database software provider.
Creating a Business Intelligence or Data Discovering processes requires analysis and accurate determination of the beginning, the middle and the end of the analysis process (within limited dimensions). This applies to query building and its constraints.
In the known systems, building a query requires knowledge of the structures of each relational database system parts. Further, it requires Long time processing of large data sets with heterogeneous size of in-between data structures (JOIN operations). Lastly, any new request results in a modification of already developed systems and the need to create new filters requires the knowledge of the query language.
A pre-grant publication number US 20040139102 A1, entitled “Parameterized database drill-through,” ('102) discloses a database interface for generating drill-through paths for various data cubes, derived from different report generating applications may be. The interface comprises: (a) means for accepting a request from a user for data; (b) means for translating the request into a drill-through path selected from a plurality of possible drill-through paths between a source and a target; (c) means for applying one or more parameters to the selected drill-rough path to produce a valid drill-through path and to transfer the requested data over the valid drill-through path to an application; and (d) display means for displaying the requested data to the user.
The '102 publication relates to data drilling but requires upfront definition of paths used by the respective queries and execution of such queries in full for a given path comprising numerous data structures along the start to end path. Implementation of the solution of '102 in another database requires a fundamental update of software and redefinition of the model as well as paths of respective queries and in the end generation of new queries and new options within the software operated by end users.
In the '102 solution, in case there may be some error at any place of the path, the end result will be erroneous and the user will have no knowledge of what caused the error and where the error has happened. Typically, complex queries will have to be re-executed in order to debug the query or the database. The error in a complex cross-querying situation will only be apparent, in prior art solutions, when the full processing path has been executed.
Taking into account the foregoing prior art, there may be a need to design and implement an efficient database hierarchy-independent data drilling process. In particular, such optimization shall be aimed at increasing data retrieval performance and shall alleviate the need of an end-to-end design of a data drilling JOIN type query.
Up to now, such problems have been addressed with a use of more powerful hardware, such as database servers having higher processing performance and more memory, instead of solutions related to design of the databases.