1. Field of the Invention
The embodiments of the invention generally relate to obtaining unstructured, domain-specific modality data and transforming it into a structured form that enables further analysis, in a failure-resistant manner that compensates for multiple error scenarios.
2. Description of the Related Art
The vast amount of continually growing content on the Internet has fostered many approaches to harness the information contained therein. Advanced data mining and text analytics techniques have been developed to perform knowledge gathering and information discovery using Web data.
Data analysis is the process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.
Text analytics describes a set of linguistic, lexical, pattern recognition, extraction, tagging/structuring, visualization, and predictive techniques. The term also describes processes that apply these techniques, whether independently or in conjunction with query and analysis of fielded, numerical and categorical data, to solve business problems. These techniques and processes discover and present knowledge—facts, business rules, and relationships—that is otherwise locked in textual form, impenetrable to automated processing. Typical applications scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. Current approaches to text analytics use natural language processing techniques that focus on specialized domains.
Data gathered from free and public sources on the Web is frequently integrated with enterprise and proprietary data to perform sophisticated analytics. This phenomenon in turn has led to data analytics technology being in high demand as people try to extract as much value as possible from their most valuable resource: the information around them, whether in their organizations or freely and publicly available. Financial institutions are recognizing that they need to leverage public data and internal information in order to differentiate themselves from their competitors and provide value to their customers and employees. The retail industry is leveraging external consumer data to improve its distribution networks and hone its marketing efforts. The focus of all analytics efforts is to extract interesting and often hidden “nuggets” from within the data. In order to do so, however, all of these efforts have to spend a vast amount of time, effort, and resources on data acquisition, ingestion, and integration.
Thus, the focus of these efforts tends to tilt away from data analysis and towards data ingestion. Another phenomenon worth observing is that the number of online sources with valuable data that use “broken English” is far larger than the number of sources using proper English. As analytics approaches attempt to bring structure to unstructured and semi-structured content, they first have to process broken English. In other words, analytics projects now also have to figure out how to parse and understand broken English, as well as how to consistently and reliably extract useful information from these data sources.
Traditional data analytics projects usually focus on the information question they would like to answer and often fail when confronted with inconsistent data sources, networking problems, and machine failures. Companies like Nielsen and even IBM have a multitude of data analytics efforts, which rely on some sort of mechanism for ingest. These ingest mechanisms work well in typical enterprise environments, where failure (of both data and system) is the exception rather than the rule, as well as in instances where the ingested content has rich structure (a schema) around it. However, as analytics projects move toward unstructured and semi-structured content and away from the (relatively) regulated enterprise environments, more rigorous approaches are required. What is needed is an underlying notion of embracing failure in data ingestion when a system is confronted with inconsistent data sources, networking problems, and machine failures.