Data profiling is an important tool in, for example, systems in which a user must inspect or interact with extremely large data sets that must be analyzed computationally to reveal, among other things, patterns, trends, and associations. These extremely large data sets are often contained in a number of computer-readable files that may be collocated or may be spread across numerous locations.
One way to simplify human interaction with such extremely large data sets is to generate a profile of the data contained within all the objects of a data set. In some cases, the data to be profiled may be contained in an extremely large number of small files, each containing a small amount of data represented as one or more data objects, or the data may be contained in a smaller number of extremely large files, each containing an extremely large number of data objects. The individual computer-readable files themselves may be in any format, such as text files, stored in a file system within a computer operating system, or they may be stored in a distributed file system (DFS) such as the Apache Hadoop Distributed File System (HDFS) or the Google File System (GFS).
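The kind of per-object profile described above can be illustrated with a minimal sketch. The code below is not any particular system's implementation; it assumes, for illustration, that the data objects are stored one per line as JSON in a set of files, and accumulates per-field counts, null counts, and minimum/maximum values without loading any file whole into memory.

```python
import glob
import json

def profile_files(pattern):
    """Accumulate a simple per-field profile (count, nulls, min, max)
    over data objects stored one-per-line as JSON across many files."""
    profile = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                obj = json.loads(line)
                for field, value in obj.items():
                    stats = profile.setdefault(
                        field, {"count": 0, "nulls": 0, "min": None, "max": None})
                    stats["count"] += 1
                    if value is None:
                        stats["nulls"] += 1
                        continue
                    if stats["min"] is None or value < stats["min"]:
                        stats["min"] = value
                    if stats["max"] is None or value > stats["max"]:
                        stats["max"] = value
    return profile
```

A profile like this summarizes an arbitrarily large collection of files in a structure small enough for a human to inspect directly.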
Extremely large data sets such as these are often referred to under the generalized term, “big data.” Big data describes data sets that are so large or complex that traditional data processing applications are inadequate for handling, accessing and manipulating the data. Big data sets suffer from challenges relating to the analysis, capture, curation, searching, sharing, storage, transfer, visualization, querying, and updating of the data contained therein. Compounding these issues is that big data sets are typically growing rapidly, often because they are generated and gathered by cheap and numerous information collection devices, for example mobile devices, aerial remote sensing devices, software logs, cameras, microphones, radio frequency identification (RFID) readers, wireless sensors, and consumer devices that contain one or more of the above. Other sources of “big data” are social media exchanges and web-based transaction facilities. The emerging concept of the Internet of Things (IoT) will only add to the number of devices that collect data and the distributed nature of data storage. To derive meaning from such a “big data” system, one typically needs considerable processing power as well as appropriate analytical tools.
Beyond the size of modern big data sets, challenges are compounded by the fact that very little real-world data is conveniently available as structured data represented in a canonical relational format or model. First, while a computer can more readily handle structured data in an efficient manner, humans typically deal with non-structured data, for example text strings, web pages, email, or ad-hoc spreadsheets, because humans rarely interact with information in a strict relational model in day-to-day life. Second, much of the data created today is stored in disparate files or locations by various systems designed to operate efficiently when executing a particular task, without concern for conforming to canonical or accepted standard data formats (which may require additional programming or cause inefficiencies in the relevant system, for example in SCADA systems). Also, data describing similar things may be sensed by two separate systems, which then store that data in two entirely different formats in two entirely different file systems.
Generally, data may fall into the categories of structured data, semi-structured data, and unstructured data. Structured data is data with a high level of organization, and such information can be seamlessly included in a relational database or other forms of tables. The canonical structured data is data stored in one or more relational tables, which are defined according to a well-defined schema. Structured data is also readily searchable by simple search engine algorithms or other search operations. Semi-structured data and unstructured data, on the other hand, are not easily collected or searched.
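The three categories above can be illustrated concretely; the record contents below are hypothetical examples, chosen only to show the distinction:

```python
# Structured: every record conforms to a fixed, well-defined schema,
# like a row in a relational table with columns (id, name, temp_c).
structured_row = (42, "sensor-7", 21.5)

# Semi-structured: self-describing objects whose fields vary from one
# object to the next, so no single fixed schema fits every record.
semi_structured = [
    {"id": 42, "name": "sensor-7", "temp_c": 21.5},
    {"id": 43, "readings": [20.1, 20.4], "unit": "C"},  # different shape
]

# Unstructured: free text with no declared fields at all.
unstructured = "Sensor 7 read about 21.5 degrees this morning."
```

A simple search operation can match the structured row against its known columns, whereas the semi-structured objects must be inspected field by field and the free text must be parsed before anything can be matched at all.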
Known systems for profiling large data sets require that semi-structured and unstructured data first be collected and stored in a relational database before the data can be analyzed or profiled. There are a number of known methods for mapping semi-structured data stored in a particular configuration into relational databases. Typically these fall into one of two categories: using a fixed mapping method to store semi-structured data in relational databases, or requiring a user to supply a mapping schema. Therefore, the lack of structure in such data makes compiling it a time- and energy-consuming undertaking: all the data must be stored in a persistent table format that can then be subjected to analysis, a mapping schema must be generated, or both.
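The user-supplied mapping approach can be sketched as follows. The schema format, field names, and record contents here are hypothetical; the pattern shown is simply a user-provided mapping that flattens each semi-structured object into a fixed relational row:

```python
# A hypothetical user-supplied mapping schema: each relational column
# name is mapped to a path into the semi-structured object.
SCHEMA = {
    "device_id": ["device", "id"],
    "reading": ["measurement", "value"],
    "unit": ["measurement", "unit"],
}

def to_row(obj, schema):
    """Map one semi-structured object to a fixed relational row.
    Fields the schema does not mention are dropped, and a path that
    no longer matches the object's shape yields None (NULL)."""
    row = []
    for column, path in schema.items():
        value = obj
        for key in path:
            if not isinstance(value, dict) or key not in value:
                value = None  # the schema does not match this object
                break
            value = value[key]
        row.append(value)
    return tuple(row)

record = {"device": {"id": 7}, "measurement": {"value": 21.5, "unit": "C"}}
```

Note that such a mapping is rigid: when the shape of incoming objects changes, the schema silently produces NULLs or discards fields rather than adapting.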
First, a schema must be created and supplied to a computer system, which relies on the schema to map the unstructured data to a structured format; second, all of the data must be converted to a structured format irrespective of what data a user desires to profile. Also, these mapping schemas do not adapt to changes in the data, which may occur over time as the software and hardware generating the data change or are upgraded, or if there is simply an error in the data giving rise to outlier data elements. Each of these hurdles becomes increasingly time-consuming as the data sets to be analyzed grow larger. While what is considered “big data” differs from user to user based on the computational abilities available or a user's needs, it will be appreciated that relational database management systems and available data statistics and visualization packages are insufficient for handling modern big data sets, which may have extremely large numbers of data objects.