This disclosure relates generally to analytic processing in a data processing system and more specifically to assisted predictive analytic processing using examples in the data processing system.
Current predictive analytic tools and offerings are typically too complex to be readily consumable by general users who are not data and predictive analytics experts. A high skill requirement to use the predictive analytic tools and offerings creates a major barrier to general adoption of predictive analytics technologies across major industry domains, despite a critical need for this category of tool.
Typically, to create a predictive analytics query against a particular data stream, a user is required to obtain data from multiple, distributed data sources and load the obtained data into a predictive analytic platform. The user is then required to identify entities in the data to serve as input to the analysis and to identify a target of the analysis. The user is further required to identify the most fitting predictive analytic model, for example, a classification model, a segmentation model or another model that best fits the study the user intends to perform.
Typically, these are not tasks general users are capable of performing. Without in-depth knowledge of data schemas, analytic models and data types (for example, continuous, ordinal or nominal data types), predictive analytic tooling is typically out of reach for general users such as doctors, journalists or stock brokers, who need this type of technology to assist them in making decisions in their everyday jobs.
In another example, when provided with typical user-defined queries expressed in natural language, conventional predictive analytics query systems return unpredictable results or results which are perceived as irrelevant. The erroneous results arise because ambiguous or ill-formed queries are presented to the system. Input that is made user-friendly in an attempt to accommodate users is therefore typically not useful from a system perspective.
Natural language systems that query a variety of structured information by leveraging semantics and ontology exist in both research and industry, and include systems that query a number of source data formats, including program source code, biological information, and databases. Natural language based query interfaces to particular databases also exist, but typically with limited commercial success.
A common challenge to all of these methods and systems, however, is the difficulty of accurately parsing and understanding true natural language as provided by an unskilled user. Semantically and syntactically understanding arbitrary natural language remains an open research problem. As a result, many natural language based query systems suffer from precision challenges, for example, generating queries that do not match the intent of the user. Report or model authors therefore often revert to writing structured query language (SQL) queries or to building models manually using lower level computer languages, for example, SQL, or using more sophisticated user interfaces.
Enabling end users to express queries in a form of natural language typically hides the technical details of constructing queries from the users. A user expresses queries in a free form style. However, a technical restriction in using this type of free form natural language query is a lack of precision and an inherent ambiguity in expressing the intent of the user, which typically renders the system impractical and accordingly unusable.
There are tools enabling users to run predictive models by exposing statistical model details and database schema structure. While the tools are typically very flexible in enabling users to select from a number of predictive analytics models using the database schema and in enabling selection of the nature of the element to predict, the tools typically cannot be utilized by people who lack detailed knowledge of analytics models and databases. Therefore, use of current tools presents a high barrier to adoption.
In another example, a method controls a data mining operation by specifying the goal of the data mining in natural language. Specifically, the method finds correlations between words in a query and database column names or column descriptions by using link-analysis techniques such as a Bayes network. Using a probability assigned by a link-analysis algorithm, a user is presented with a list of candidate columns most likely to be the dependent variable. The user then reviews the candidates and makes refinements. The list of candidates, combined with the user refinements, is used to construct a list of independent variables. Once the dependent and independent variables are identified, a data mining problem definition is created which can be executed by a data mining application.
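By way of illustration only, the candidate-ranking and refinement steps of such a method can be sketched as follows. The sketch is not the method itself: it substitutes a simple word-overlap score for the probability assigned by a link-analysis algorithm, and the function names, column names, column descriptions and sample query are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_candidate_columns(query, columns):
    """Rank columns by how strongly their words overlap with the query.

    `columns` maps a column name to a free-text description (assumed metadata).
    The score is a plain word-overlap ratio, standing in for a link-analysis
    probability. Returns (column, score) pairs with score > 0, best first.
    """
    query_words = tokenize(query)
    scored = []
    for name, description in columns.items():
        column_words = tokenize(name) | tokenize(description)
        if not column_words:
            continue
        score = len(query_words & column_words) / len(column_words)
        if score > 0:
            scored.append((name, score))
    # Sort by descending score, breaking ties alphabetically by column name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def build_problem_definition(candidates, chosen_target):
    """Apply the user's refinement: the chosen column becomes the dependent
    variable and the remaining candidates become independent variables."""
    names = [name for name, _ in candidates]
    if chosen_target not in names:
        raise ValueError("chosen target is not among the candidates")
    return {
        "dependent": chosen_target,
        "independent": [n for n in names if n != chosen_target],
    }


# Hypothetical schema metadata and natural language goal.
columns = {
    "churn_flag": "whether the customer cancelled the subscription",
    "monthly_fee": "monthly subscription fee",
    "region": "sales region",
}
query = "predict which customer will cancel the subscription based on monthly fee"

candidates = rank_candidate_columns(query, columns)
# The user reviews the ranked list and picks the intended target.
definition = build_problem_definition(candidates, chosen_target="churn_flag")
print(candidates)
print(definition)
```

In this sample, the highest-ranked candidate (`monthly_fee`) is not the column the user intends to predict (`churn_flag`), so the user's review step is what yields the correct dependent variable.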
However, the data mining example has severe restrictions. Using probabilistic link-analysis techniques (for example, a Bayes network) to identify dependent and independent variables means that incorrect variables can be identified, which requires further user intervention. The proposed technique also relies on a set of subject-specific vocabularies (SSVs) that are derived from a data source; however, metadata may not always be available from a data source and is typically not intended by a database administrator for use for this purpose. A further limitation is the lack of a mechanism to select the type of predictive model (for example, an association, a classification or a segmentation model) most relevant to the intent of the user.