This patent application relates to the problem of classifying text. Classifying text may include, for example, deciding whether a given text describes a product for sale, deciding whether text describes a television monitor, deciding whether text is spam or non-spam, or placing text into a taxonomy, i.e., a classification system of members based on similarity of structure or origin.
Systems and/or methods that can classify free text are often called classifiers. While very valuable, these systems and/or methods are typically expensive to construct and maintain. Such systems and methods may include constructions of large rule bases (sets of detailed rules for structuring data), human expert systems and machine-learning systems. The expense of these systems often comes from designing the rules and/or supplying sufficient data for the machine-learning system.
Machine learning is the field that studies methods of producing classifiers directly from marked example data. That is, for a given classification task (such as deciding what type of product a text is describing, or extracting the weight of the product from the product description), machine learning can produce, from a series of marked examples, a working program that reliably performs the task. Marked examples may include pairs of information where the first half of the pair is the information that is available for the classification task (typically features such as the free text, price and other easily extracted data) and the second half of the pair is the desired outcome (assignment of the example to a category or the desired extracted data field). Typically, production and maintenance (either making software edits by hand or editing with software tools) are the most expensive steps in machine learning.
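By way of illustration only, the notion of marked examples as pairs, and of producing a classifier from them, may be sketched as follows. This is a minimal sketch and not the method of this application: the example texts, the category names ("monitor", "kitchen"), the function names, and the choice of a naive Bayes word-count model with add-one smoothing are all assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical marked examples: each pair is
# (information available for the task, desired outcome).
MARKED_EXAMPLES = [
    ("24 inch television monitor with hdmi input", "monitor"),
    ("flat screen television monitor 1080p", "monitor"),
    ("stainless steel kitchen knife set", "kitchen"),
    ("non-stick frying pan for the kitchen", "kitchen"),
]


def tokenize(text):
    """Split free text into lowercase whitespace-separated words."""
    return text.lower().split()


def train(examples):
    """Count categories and per-category words from the marked examples."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest naive Bayes log score."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the category
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(MARKED_EXAMPLES)
print(classify(model, "television monitor with stand"))  # prints "monitor"
```

In this sketch the classifier is derived entirely from the marked pairs; changing its behavior means adding or correcting examples rather than editing rules by hand.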
It would be preferable to take advantage of the effort used to produce and/or maintain such a classifier in one language (for example English) to produce a similar system in another language (for example French).