Machine learning tools (MLT) can be used to identify or predict patterns. For example, an MLT can learn to predict that a particular word in written text is a person's name or a location name. As another example, an MLT can learn to predict the location of a particular record in a given set of data. More particularly, the MLT might learn to predict the location of a company name record in a job listing. Machine learning tools can learn to predict many other types of patterns.
Training data is often used to provide patterns from which the MLT learns to predict the existence of patterns in other data (“input data”). The patterns in the training data may comprise “inputs” that are mapped to “designated results.” An input may be any element in the training data. A designated result may be a label associated with the input. Typically, a human provides the designated results. For example, a human labels words (“inputs”) in the training data to indicate that a particular word is a “named entity” such as a person's name, location name, or some other named entity. Based on the inputs and designated results, the MLT develops a model that can be applied to predict results for input data that has no designated results. As a particular example, the MLT learns to extract named entities from input data. As another example, the MLT learns to determine or predict where a particular type of record, such as a company name field, is located in the input data.
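As a minimal illustrative sketch of the mapping from inputs to designated results, the following Python example assumes a hypothetical word-lookup "model" that is far simpler than any practical MLT. The label names, the sample data, and the functions `train` and `predict` are invented for illustration only and are not drawn from any particular tool.

```python
from collections import Counter, defaultdict

# Hypothetical training data: each input (a word) is paired with a
# designated result (a label) supplied by a human annotator.
TRAINING_DATA = [
    ("Alice", "PERSON"),
    ("visited", "O"),
    ("Paris", "LOCATION"),
    ("and", "O"),
    ("Bob", "PERSON"),
    ("visited", "O"),
    ("London", "LOCATION"),
]


def train(examples):
    """Build a toy model: each seen word maps to its most frequent label."""
    counts = defaultdict(Counter)
    for word, label in examples:
        counts[word.lower()][label] += 1
    return {word: labels.most_common(1)[0][0] for word, labels in counts.items()}


def predict(model, words, default="O"):
    """Label input data that has no designated results.

    Words never seen in training fall back to the default label.
    """
    return [(word, model.get(word.lower(), default)) for word in words]


model = train(TRAINING_DATA)
predictions = predict(model, ["Bob", "visited", "Paris", "yesterday"])
for word, label in predictions:
    print(word, label)
```

In this sketch, a word that never appears in the training data, such as "yesterday", simply receives the default label, because the toy model has no pattern from which to predict anything else.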
Because the training data provides patterns to teach the MLT, the accuracy of the model generated by the MLT is affected by the nature of the training data. If the training data includes more patterns or better patterns, the MLT is able to generate a more accurate model. Because the training data is typically manually generated, creating training data can be costly. Moreover, there is often a problem obtaining enough training data for the MLT to generate an accurate model. In particular, for many languages there is a lack of adequate training data. As a specific example, there is a lack of adequate training data for the Chinese language. However, the problem of providing a sufficient amount and quality of training data for the MLT applies to all languages.
Thus, there is a need for generating accurate models using an MLT based on a limited amount of training data.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.