Service providers (e.g., wireless, cellular, etc.) and device manufacturers are continually challenged to deliver value and convenience to consumers by, for example, providing compelling network services. These services generate vast amounts of data (structured and binary) which need to be managed, stored, searched, analyzed, etc. Over the last decade, internet services have accumulated data in the range of exabytes (10^18 bytes). Although most of this data is unstructured in nature, it must be stored, searched and analyzed appropriately before any real-time information can be drawn from it for providing services to the users.
In order to apply analytics (e.g., statistical analysis) to the data and gain insight into it, the data needs to be fed into an analytics engine through various ingestion schemes. The data is typically received in an unstructured format at the time it is ingested. It then needs to be cleansed, structured and validated into a format that is conducive to analysis. In order to cleanse the data and make it available for analytics, the data goes through a pipeline of disparate systems. Considerable time and resources are spent on providing such a pipeline for each data source that is brought into the system; this is the most time-consuming and labor-intensive part of getting the data ready for analysis. Typically, developers write various custom map-reduce programs to cleanse the data. However, if the data could be expressed in terms of standard data models and cleansing processes, it would be possible to create a standard pipeline and greatly streamline the Extract, Transform, Load (ETL) process, which is easily the biggest obstacle and most time-consuming area of analytics.
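The cleanse-structure-validate stage described above can be sketched in a few lines. The following Python sketch is illustrative only: the field names, delimiter, and validation rules are assumptions introduced for the example, not part of any standard data model referenced here.

```python
import re

# Hypothetical "standard data model": every cleansed record exposes these fields.
STANDARD_FIELDS = ("user_id", "timestamp", "event")

def cleanse(raw_line):
    """Map one raw, semi-structured line (e.g. "123|2021-01-01T08:30:00|login")
    into the standard model, or return None if the line cannot be salvaged."""
    parts = [p.strip() for p in raw_line.split("|")]
    if len(parts) != len(STANDARD_FIELDS):
        return None  # malformed record: dropped at the cleansing stage
    return dict(zip(STANDARD_FIELDS, parts))

def validate(record):
    """Check a cleansed record against simple, illustrative rules."""
    return (
        record is not None
        and record["user_id"].isdigit()
        and re.match(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$", record["timestamp"]) is not None
        and record["event"] != ""
    )

def pipeline(raw_lines):
    """Run each line through cleanse -> validate, keeping only valid records."""
    cleansed = (cleanse(line) for line in raw_lines)
    return [r for r in cleansed if validate(r)]

raw = [
    "123|2021-01-01T08:30:00|login",
    "garbage line with no delimiters",   # rejected: wrong shape
    "abc|2021-01-01T08:31:00|click",     # rejected: non-numeric user_id
]
good = pipeline(raw)
print(len(good))             # 1 valid record survives
print(good[0]["event"])      # login
```

Because every stage consumes and emits the same standard record shape, the same `pipeline` can be reused across data sources, replacing the per-source custom map-reduce programs mentioned above.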