The present invention relates generally to natural language understanding, and more particularly to training data used in Natural Language Classifiers (NLCs).
NLCs find utility in various fields by providing software applications with the capability to semantically and contextually understand and interpret natural language, enabling the applications to perform various tasks using that understanding and interpretation. NLCs use machine learning (ML) algorithms to process received texts, including words or characters of a natural language, and to determine and return the matching classes or categories to which the received texts most likely belong. NLCs learn from “example data” during training in order to correctly return information in response to “new data” during use.
NLCs can be used in providing customer support. For example, an NLC can be used in predictively routing received questions from customers or users, to appropriate customer support persons or departments for answers. By incorporating Speech to Text functionality into software applications that use NLCs, voiced questions can also be predictively routed. Further, NLCs can be used in matching questions to answers or topics, in categorizing issues by severity, and so on. Various NLCs have been developed for use in a wide variety of software applications, services, and products, such as in Watson™ by IBM®, in Alexa® by Amazon®, and in Cortana® by Microsoft®.
The process of establishing an NLC for use typically includes: preparing training data, which may require identifying class labels, collecting representative texts, and matching classes to texts; training the NLC, which may require uploading the prepared training data to the NLC by way of an Application Programming Interface (API) for processing by ML algorithms of the NLC; querying or testing the trained NLC, which may require sending texts to the trained NLC by way of the API, and in return, receiving results including matching classes or categories to which the sent texts may most belong; evaluating the results; updating the initially prepared training data based on the evaluated results; and retraining the NLC using the updated training data, as necessary.
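The prepare, train, query, evaluate, update, and retrain steps described above can be illustrated with a minimal sketch. The sketch assumes a toy word-overlap classifier held in memory rather than a hosted NLC reached through an API. The function names (`train`, `classify`, `evaluate`), the class labels, and the sample texts are hypothetical and chosen only for illustration. Feeding misclassified test texts back into the training data is a simplification of the updating step.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return text.lower().replace("?", " ").replace(".", " ").split()


def train(training_data):
    """Build per-class word counts from (text, class_label) pairs."""
    model = defaultdict(Counter)
    for text, label in training_data:
        model[label].update(tokenize(text))
    return model


def classify(model, text):
    """Return the class whose training words overlap most with the text."""
    tokens = tokenize(text)
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    # On a tie, max() keeps the class that was seen first during training.
    return max(scores, key=scores.get)


def evaluate(model, test_data):
    """Return the (text, expected_label) pairs that the model misclassifies."""
    return [(text, label) for text, label in test_data if classify(model, text) != label]


# Prepare training data: class labels matched to representative texts (hypothetical).
training_data = [
    ("How do I reset my password", "account"),
    ("I cannot log in to my account", "account"),
    ("When will my order arrive", "shipping"),
    ("Where is my package", "shipping"),
]
# Texts with expected classes, used to query and evaluate the trained model.
test_data = [
    ("I forgot my password", "account"),
    ("Track my delivery", "shipping"),
]

model = train(training_data)            # train
errors = evaluate(model, test_data)     # query and evaluate
training_data.extend(errors)            # update the training data (simplified)
model = train(training_data)            # retrain
remaining = evaluate(model, test_data)  # re-evaluate after retraining
print("misclassified before update:", errors)
print("misclassified after retraining:", remaining)
```

In this sketch the second test text initially ties between the two classes and is resolved in favor of the first class seen, which is what makes the update and retraining step necessary.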
A method of effectively applying an understanding or interpretation of an expressed instance of natural language, such as in the form of texts, to perform a task includes making a determination as to semantics and intention of the expressed instance, and then classifying the expressed instance into one or more classes based on the determination. The performed task can include, for example, automatic text summarization, sentiment analysis, topic extraction, relationship extraction, and the like.
During use, an NLC can receive texts and determine to which of one or more classes the received texts most likely belong. The texts can be representative of a question or query, and the classes can be representative of groups or types of corresponding answers. In an example, a class can be formed of a group or type of answers corresponding to a group or type of questions. In the example, the NLC can determine which of one or more groups of answers most likely includes a relevant answer with respect to a received question, based on characteristics of the received question. The NLC can operate according to a model generated from prepared training data uploaded to the NLC during training. The training data can be formed of a corpus, such as a text corpus or the like. The corpus can be formed of texts, feature vectors, sets of numbers, or the like. In the example, the texts of the corpus can include groups of related answers, as well as individual questions that each include one or more designations that attempt to specify the group of related answers to which the question most likely belongs.
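As a hedged illustration of how a received question might be matched to groups of answers, the following sketch ranks answer groups by a confidence value. No particular algorithm is specified above, so the sketch assumes a simple word-frequency model with add-one smoothing and equal class priors (a naive-Bayes-style scorer). The corpus, the group names, and the function names `build_model` and `rank_classes` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase each word and strip basic punctuation."""
    return [w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")]


def build_model(corpus):
    """Build per-group word counts from (question_text, answer_group) pairs."""
    model = {}
    for text, group in corpus:
        model.setdefault(group, Counter()).update(tokenize(text))
    return model


def rank_classes(model, question):
    """Return (answer_group, confidence) pairs, highest confidence first."""
    vocabulary = {word for counts in model.values() for word in counts}
    tokens = tokenize(question)
    log_scores = {}
    for group, counts in model.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen words from zeroing out a group.
        log_scores[group] = sum(
            math.log((counts[t] + 1) / (total + len(vocabulary))) for t in tokens
        )
    # Normalize the log scores into confidences that sum to 1.
    peak = max(log_scores.values())
    weights = {g: math.exp(s - peak) for g, s in log_scores.items()}
    norm = sum(weights.values())
    return sorted(
        ((g, w / norm) for g, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical corpus: each question is designated with a group of related answers.
corpus = [
    ("How do I reset my password?", "account_answers"),
    ("How can I change my email address?", "account_answers"),
    ("When will my order ship?", "shipping_answers"),
    ("How long does delivery take?", "shipping_answers"),
]

ranking = rank_classes(build_model(corpus), "How do I change my password?")
for group, confidence in ranking:
    print(f"{group}: {confidence:.3f}")
```

The first entry of the returned ranking corresponds to the group of answers to which the question most likely belongs under this assumed model.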
By appropriately training an NLC for use in a target business area, it is possible to provide, for example, an automated system forming a virtual customer service agent configured to perform tasks in the target business area, such as answering questions to provide customer support, or the like. The quality of the provided customer support, or the like, may depend on the quality and interpretation precision of the training data used in training the NLC.
The process of preparing training data to establish an NLC for use in a target business area may include identifying suitable class labels and collecting sample texts, with respect to the target business area. In preparing the training data, a subject matter expert of the target business area may consider or conceive various sample texts to be classified with respect to various classes. The various sample texts and classes may include, for example, those relating to expected end-users, a target audience, or the like.
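One way such prepared training data might be laid out is sketched below, assuming a simple two-column layout of sample text followed by class label. The layout, the class labels, the sample texts, and the `minimum` threshold are assumptions made for illustration and are not requirements of any particular NLC. The sketch also flags class labels for which too few sample texts have been collected.

```python
import csv
import io
from collections import Counter

# Hypothetical class labels identified for a target business area.
class_labels = {"billing", "shipping", "returns"}

# Hypothetical sample texts conceived by a subject matter expert.
sample_rows = [
    ("Why was I charged twice?", "billing"),
    ("How do I update my credit card?", "billing"),
    ("Where is my package?", "shipping"),
]


def write_training_csv(rows):
    """Serialize (text, class_label) rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def coverage_report(rows, labels, minimum=2):
    """Return each label that has fewer than `minimum` sample texts, with its count."""
    counts = Counter(label for _, label in rows)
    return {label: counts[label] for label in sorted(labels) if counts[label] < minimum}


csv_text = write_training_csv(sample_rows)
gaps = coverage_report(sample_rows, class_labels)
print(csv_text)
print("classes needing more sample texts:", gaps)
```

A report of this kind could prompt the collection of further sample texts for under-represented classes before the training data is used.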
U.S. Pat. Nos. 9,342,588, 9,390,378, and 8,234,179 each describe various methods of developing and refining training data used in training NLCs, and are incorporated herein by reference. The non-patent literature “Automatic Training Data Cleaning for Text Classification,” by Hassan H. Malik et al. (ICDMW '11 Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, Pgs. 442-449, Dec. 11, 2011), describes another training data development and refinement method, and is also incorporated herein by reference.