‘Big data’ is a broad term for data sets so large or complex that specially adapted data processing approaches may be necessary in order to return a processing result within an acceptable timeframe. For example, databases comprising millions of transactional data records may need to be processed for identifying correlations and other patterns within the data, for dynamically retrieving specific subsets of the data records and for other tasks.
One important aspect of processing large amounts of data, e.g. transactional data, is the assignment of scores. For example, a score can relate to a technical parameter such as the total time required by a machine to perform a particular task, the total amount of resources, e.g. energy, chemicals or any other kind of material, consumed for manufacturing a particular good, the price of items ordered by a plurality of customers and the like. Often, said assignment is highly complex, because a large plurality of conditions may have to be checked for dynamically determining which kind of score is to be assigned to a particular data record. For example, the score assignment may be performed for finally calculating an aggregate score from the totality of assigned score values, whereby the question of how many and which scores are assigned to a particular data record depends on many different criteria.
For example, the predicted total time a manufacturing line requires for manufacturing a particular good may depend on the type of material and components used for producing said good, on known delivery times of the various suppliers, on the workload of individual machines of the production line which may also be used for producing other types of goods, on a configurable mode of operation of various machines in the production line and the like.
According to another example, the final price assigned to a particular good may depend on customs and taxes, on the customer having ordered the good (there may exist granted discounts), on the chosen way of transportation (by air mail or ship, express or standard delivery time, domestic or international transportation), on the material the good is made of, the size of the good and many other factors.
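The condition-dependent assignment described in the preceding examples can be illustrated by a minimal sketch. It assumes, purely for illustration, that each score is expressed as a rule consisting of a condition and a score value, and that the aggregate score is the sum of all scores whose conditions hold. The names `Order`, `ScoreRule`, `RULES`, `assign_scores` and `aggregate_score`, as well as all rates and amounts, are hypothetical and do not stem from any particular system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Order:
    """Hypothetical data record for an ordered good."""
    base_price: float
    customer: str
    transport: str        # e.g. "air" or "ship"
    international: bool


@dataclass
class ScoreRule:
    """A score that is assigned only if its condition holds for a record."""
    name: str
    applies: Callable[[Order], bool]
    score: Callable[[Order], float]


# Illustrative rules only; rates and amounts are assumptions.
RULES: List[ScoreRule] = [
    ScoreRule("base", lambda o: True, lambda o: o.base_price),
    ScoreRule("customs", lambda o: o.international, lambda o: 0.10 * o.base_price),
    ScoreRule("air_freight", lambda o: o.transport == "air", lambda o: 25.0),
    ScoreRule("granted_discount", lambda o: o.customer == "ACME",
              lambda o: -0.05 * o.base_price),
]


def assign_scores(order: Order, rules: List[ScoreRule] = RULES) -> List[Tuple[str, float]]:
    """Return the (name, value) pairs of all scores whose condition is met."""
    return [(r.name, r.score(order)) for r in rules if r.applies(order)]


def aggregate_score(order: Order, rules: List[ScoreRule] = RULES) -> float:
    """Sum all assigned score values into one aggregate score."""
    return sum(value for _, value in assign_scores(order, rules))
```

In this sketch, how many and which scores a record receives depends entirely on the record's attributes: a domestic order shipped by sea to an unlisted customer receives only the base score, whereas an international air shipment to the hypothetical customer "ACME" receives all four.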
Management of the plurality of data records representing, for example, machines, production lines, laboratory devices or goods and services on the one hand and scoring data on the other hand has often been difficult in that a direct assignment of said two types of data, e.g. within a single table, is not possible, because the assignment depends on a plurality of criteria (the type of machine, the type of good, the query time, the chosen transport means, etc.), which may be provided dynamically and may vary in an unforeseeable manner.
Some approaches for dynamically assigning score values to data records taking account of the plurality of complex conditions are based on retrieving the data records and the scores from a database and then letting an application program perform a complex data processing workflow in which the conditions are evaluated and the scores are finally assigned to the data records. However, the retrieval and processing of large amounts of data by an application program often results in tremendous data traffic between the database server and the computer hosting the application program. Moreover, data processing in application programs is often slow, as the processing routines implemented in higher programming languages are not as speed optimized as the routines of a database management system (DBMS).
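The application-side workflow described above can be sketched as follows. The sketch assumes a hypothetical schema with an `orders` table and a `scores` table held in an in-memory SQLite database; the table names, column names and values are illustrative assumptions only. It shows that every data record and every score has to be transferred to the application program before any condition can be evaluated.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, base_price REAL, transport TEXT);
        CREATE TABLE scores (transport TEXT, surcharge REAL);
        INSERT INTO orders VALUES (1, 100.0, 'air'), (2, 50.0, 'ship'), (3, 80.0, 'air');
        INSERT INTO scores VALUES ('air', 25.0), ('ship', 5.0);
        """
    )
    return conn


def total_in_application(conn: sqlite3.Connection):
    """Fetch all records and scores, then assign the scores in the application."""
    # Every order row and every score row crosses the database boundary.
    orders = conn.execute("SELECT id, base_price, transport FROM orders").fetchall()
    surcharge = dict(conn.execute("SELECT transport, surcharge FROM scores").fetchall())
    rows_transferred = len(orders) + len(surcharge)

    # The condition (here: the chosen transport means) is evaluated per record
    # in the application program rather than by the database.
    total = sum(price + surcharge[transport] for _, price, transport in orders)
    return total, rows_transferred
```

In this sketch the number of transferred rows grows with the size of both tables, which corresponds to the data traffic between the database server and the application program noted above.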
Other approaches for dynamically assigning score values to data records are purely implemented in a database. However, due to the limited set of operations supported by a DBMS, it is often not possible to implement complex assignment strategies which depend on a plurality of different conditions within a DBMS. At least, it is often not possible to provide an efficient implementation of such an assignment as the complexity of the assignment process often exceeds the capabilities of the query planner of the DBMS.
Hybrid approaches for dynamically assigning score values to data records try to provide a compromise by delegating some assignment tasks to the DBMS and others to the application program in order to reduce the complexity of the score assignment that still has to be performed in the DBMS. However, said approaches often cause significant data traffic between the database server and the computer hosting the application program, as multiple, often iterative database queries have to be submitted to the DBMS and the respective result sets have to be received and processed by the application program. Moreover, such systems are hard to maintain, as the score assignment logic is scattered between the DBMS and the application program.
Hongjun Lu et al.: “Decision Tables: Scalable Classification Exploring DBMS Capabilities”, Proceedings of the 26th International Conference on Very Large Data Bases, 10 Sep. 2000, pages 373-384, XP055280742, Cairo, Egypt, ISBN: 978-1-55860-715-6, describes an approach to building efficient, scalable classifiers in the form of decision tables by exploring capabilities of modern relational database management systems. In chapter 3.2.4, the pruning of the decision table is mentioned.