The present invention relates to natural language analytics, and more specifically, to the classification of data in a dataset.
It can be difficult to accurately convert tabular data into a useful query model. Expert modeling of the data is typically required, and the analytic tools needed to perform the conversion demand training and expertise that are not common among business users.
There are many challenges in creating tools that perform this conversion automatically. If the automatic modeling does not reflect the data or the knowledge it represents, then the queries it produces are of little use in answering the user's questions. If the user's question cannot be parsed and understood by the system, then the system cannot accurately produce a query to answer that question. Natural language parsing has been studied in computer science for over 50 years, and accurate parsing is still considered to be in its infancy.
In traditional analytics systems, there is a modeling phase in which an experienced modeler specifically exposes the rows (if they have meaning in the data) as an element in the model, often by adding a derived attribute. However, this work must be done by a person who understands the data being modeled, and it takes time. Systems like Watson Analytics do away with the modeling step, or at least make it optional, to improve time to value for the user.
Other systems have taken a more pragmatic approach to both the modeling and natural language challenges by doing away with the modeling step, or at least making it optional, in order to improve time to value for the user. Under some such systems, the natural language parsing consists of matching words in the question to elements in the model or to analysis types, and ignoring the other words in the sentence. The modeling is also very lightweight, producing a single table (usually within a columnar database) which matches the user's original data, but with extra metadata describing what the system thinks the various columns represent. This classification is performed for each column, since each column has a label that is easy to look up in a classification system, as well as a set of data values that are usually representative of the concept the column represents. As such, the columns themselves become the query elements that can be matched to the user's questions in order to produce answers.
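By way of illustration only, the lightweight approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the lexicon `CONCEPT_LEXICON`, the concept names, and the functions `classify_columns` and `match_question` are hypothetical and stand in for a real classification system and parser.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicon standing in for a real classification system.
CONCEPT_LEXICON = {
    "revenue": "measure:currency",
    "sales": "measure:currency",
    "region": "dimension:geography",
    "country": "dimension:geography",
    "date": "dimension:time",
    "year": "dimension:time",
}


@dataclass
class ColumnMeta:
    label: str    # original column label from the user's data
    concept: str  # what the system thinks the column represents


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify_columns(table):
    """Attach a concept to each column of a {label: values} table."""
    metadata = []
    for label, values in table.items():
        # First try to look up the words of the label in the lexicon.
        concept = next(
            (CONCEPT_LEXICON[t] for t in tokenize(label) if t in CONCEPT_LEXICON),
            None,
        )
        if concept is None:
            # Fall back to the data values (assumed heuristic).
            numeric = bool(values) and all(
                isinstance(v, (int, float)) for v in values
            )
            concept = "measure:number" if numeric else "dimension:category"
        metadata.append(ColumnMeta(label, concept))
    return metadata


def match_question(question, metadata):
    """Return columns whose label shares a word with the question.

    All other words in the question are ignored.
    """
    words = set(tokenize(question))
    return [m for m in metadata if words & set(tokenize(m.label))]


table = {
    "Region": ["East", "West"],
    "Sales Revenue": [100.0, 250.5],
    "Notes": ["a", "b"],
}
metadata = classify_columns(table)
matched = match_question("What is the total revenue by region?", metadata)
```

In this sketch, only the words "revenue" and "region" contribute to the match; the remaining words of the question are discarded. The rows of the table are never exposed as query elements, because only the columns are classified.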