These days much data is generated and stored in digital form. Since the 1980s the world's capacity to digitally store information has increased by over twenty percent per year. In 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data were created every day. Some of this data is publicly available; other parts are in-company data.
The term ‘big data’ is often used in this context for a collection of data so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
This data is often heterogeneous, with many interconnections and dependencies (relations) and/or correlations. Large collections of data relations contain valuable information, but these relations need to be ordered and structured before the actual patterns present in the data can be easily disclosed. It is desirable to leverage the valuable and often unknown information contained in this data, for example to allow data analysis where none currently takes place. However, this requires assessing millions of data points within an acceptable period of time.
Much of this data is stored in large databases, sometimes referred to as data warehouses. Such databases can store thousands of columns of data entries. The total number of data entries in such a database can run into the millions or even billions.
The database can, for instance, store in-company data such as client data. Such client data can be distributed over columns storing all kinds of information. Some groups of columns can relate to personal data such as first names, last names, social security numbers, phone numbers, email addresses, IP addresses, street addresses, postal codes, city names, state names, country names, etc. Other groups of columns can relate to financial information such as bank account numbers, credit card numbers, etc. Yet other groups of columns can relate to products offered by a company, for example financial products such as mortgage types, savings account types, loan types and credit types, and to the clients making use of such products. Yet other groups of columns can relate to insurance products such as car insurance types, health insurance types, life insurance types, home insurance types and liability insurance types, and to the clients making use of such products. The database can also contain additional columns. For instance, in relation to car insurance types the database can also include columns relating to car makes, car types, gasoline consumption, CO2 emissions, car weight, etc.
The sheer amount of data stored in the database can make assessing interrelations between separate columns of data virtually impossible, or at least very complex and time consuming. As a result, a first department within a company, e.g. a financial department, may be unaware of data stored by a second department, e.g. an insurance department. From a business perspective it would be highly desirable for separate departments to be able to benefit from data stored by other departments.
From a marketing perspective it can also be desirable to be able to combine and/or compare databases of different companies, e.g. those of a bank or insurance company and those of a telecom provider.
In view of the above, it is desirable to have a data analysis system that assists in assessing relations between columns of data in a database.