This disclosure relates to machine learning techniques as applied to text samples, and more particularly, to a text classification model that may be used in order to automatically classify an unknown text sample into one of a plurality of different categories.
In various contexts, it may be desirable to classify text samples into different categories. Such classification may typically be performed by human beings, who may read a text sample, evaluate its content, and then determine if there is a particular category that should be associated with the text sample. On a large scale, human text review may be costly. For example, if a human reviewer is paid $10 an hour and is capable of reviewing 100 text samples in an hour, the average cost for reviewing a single text sample would be $0.10. In a situation with thousands, hundreds of thousands, or even millions of text samples that require classification, human review may be an expensive proposition.
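The cost arithmetic in the preceding example can be sketched as follows. This is a minimal illustration only: the function names `cost_per_sample` and `total_review_cost` are hypothetical and do not appear in this disclosure, and the default wage and throughput are simply the example figures given above.

```python
from fractions import Fraction

# Hypothetical helpers for illustration; the names are assumptions,
# not part of the disclosure.

def cost_per_sample(hourly_wage, samples_per_hour):
    """Average human-review cost (in dollars) for a single text sample."""
    # Fraction keeps the result exact (e.g., 10/100 is exactly 1/10).
    return Fraction(hourly_wage) / samples_per_hour

def total_review_cost(num_samples, hourly_wage=10, samples_per_hour=100):
    """Total human-review cost (in dollars) for num_samples text samples."""
    return num_samples * cost_per_sample(hourly_wage, samples_per_hour)

if __name__ == "__main__":
    # Sample volumes mentioned in the text: thousands up to millions.
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} samples: ${float(total_review_cost(n)):,.2f}")
```

Under these assumed figures, the cost grows linearly with the number of samples, so one million text samples would cost $100,000 to review manually.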
In contrast, automated machine classification may attempt to classify text samples without human review. In some cases, however, such techniques may suffer from inaccuracies (e.g., false positives or false negatives), or from an inability to definitively classify certain content. Accordingly, even when automated classification is used, some text samples may still require human review in order to be classified, and other text samples may be misclassified. Such a classification process may be inefficient, particularly when dealing with large numbers of text samples.
Text sample classification may also be used in the review of user-generated content (UGC). For example, an owner of a website that publishes UGC might or might not want certain UGC to be published, depending on various criteria. Thus, using existing text classification techniques, classifying UGC may be inefficient and/or costly.
This specification includes references to “one embodiment” or “an embodiment.” The appearances of the phrases “in one embodiment” or “in an embodiment” do not necessarily refer to the same embodiment. Particular features, structures, or characteristics may be combined in any suitable manner consistent with this disclosure.
Various units, circuits, or other components may be described or claimed herein as “configured to” perform a task or tasks. In such contexts, “configured to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task even when the specified unit/circuit/component is not currently operational (e.g., is not powered on). The units/circuits/components used with the “configured to” language include hardware (for example, circuits, memory storing program instructions executable to implement the operation(s), etc.). Reciting that a unit/circuit/component is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. §112(f) for that unit/circuit/component. Additionally, “configured to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue.
As used herein, terms such as “first,” “second,” etc. are used as labels for nouns that they precede, and, unless otherwise noted, do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) for those nouns. For example, a “first” text sample and a “second” text sample can be used to refer to any two text samples; the first text sample is not necessarily generated prior to the second text sample.
Still further, the terms “based on” and “based upon” are used herein to describe one or more factors that affect a determination. These terms do not foreclose additional factors that may affect a determination. That is, a determination may be solely based on the factor(s) stated or may be based on one or more factors in addition to the factor(s) stated. Consider the phrase “determining A based on B.” While B may be a factor that affects the determination of A, this phrase does not foreclose the determination of A from also being based on C. In other instances, however, A may be determined based solely on B.