Document classification can be generally described as classifying, or categorizing, documents into multiple classes, or categories. An example of document classification is aspect-based sentiment analysis, in which each document can reflect one or more aspects, and each aspect can be categorized to a sentiment (e.g., negative, positive). For example, a restaurant review can be provided as a document (e.g., text provided in one or more sentences) that reflects one or more aspects of a restaurant (e.g., food, staff, ambience), and each aspect can be categorized with a sentiment (e.g., food→positive, staff→negative, ambience→positive).
Document classification can be performed using a machine-learning process, in which documents form a corpus of text that is used to train a machine-learning model. To perform such document classification, each document is processed to provide a respective document representation. An example approach for providing document representations includes the bag-of-words (BOW) model. Using the BOW model, each document (e.g., sentence) is represented as a vector, where each word is a feature of the vector. In some examples, weighting (e.g., binary, term frequency-inverse document frequency (TF-IDF)) can be applied to the respective features of the vector.
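By way of non-limiting illustration, the BOW representation described above can be sketched as follows. The sketch is an assumption-laden example rather than a definitive implementation: the function names (tokenize, build_vocabulary, inverse_document_frequency, bow_vector), the whitespace tokenization, the example corpus, and the particular TF-IDF formula (raw count multiplied by log(N/df)) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, split on whitespace, strip basic punctuation.
    tokens = (w.strip(".,!?;:").lower() for w in text.split())
    return [t for t in tokens if t]


def build_vocabulary(documents):
    # Assign each distinct word in the corpus a fixed vector index.
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def inverse_document_frequency(documents, vocabulary):
    # One common IDF variant (an assumption): log(N / document frequency).
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(tokenize(doc)))
    n = len(documents)
    return {w: math.log(n / document_frequency[w]) for w in vocabulary}


def bow_vector(document, vocabulary, weighting="tf", idf=None):
    # Each vocabulary word is one feature; words outside the vocabulary are ignored.
    counts = Counter(w for w in tokenize(document) if w in vocabulary)
    vector = [0.0] * len(vocabulary)
    for word, count in counts.items():
        index = vocabulary[word]
        if weighting == "binary":
            vector[index] = 1.0
        elif weighting == "tf":
            vector[index] = float(count)
        elif weighting == "tfidf":
            if idf is None:
                raise ValueError("tfidf weighting requires an idf mapping")
            vector[index] = count * idf[word]
        else:
            raise ValueError(f"unknown weighting: {weighting}")
    return vector


# Hypothetical example corpus of restaurant-review sentences.
corpus = [
    "The food was great",
    "The staff was rude",
    "Great food, great ambience",
]
vocabulary = build_vocabulary(corpus)
idf = inverse_document_frequency(corpus, vocabulary)

tf_vector = bow_vector(corpus[2], vocabulary, weighting="tf")
binary_vector = bow_vector(corpus[2], vocabulary, weighting="binary")
tfidf_vector = bow_vector(corpus[2], vocabulary, weighting="tfidf", idf=idf)
```

In this sketch, each vector has one entry per word in the vocabulary, so the third example sentence yields a seven-element vector in which only the entries for "ambience", "food", and "great" are non-zero.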
Such traditional approaches, however, have certain disadvantages. For example, new words that are not included in the underlying training data cannot be efficiently accounted for. As another example, resulting vectors can be relatively large, because each vector includes a feature for each word. Consequently, a significant amount of computing resources (e.g., processors, memory) is required to determine and store the vectors.
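By way of non-limiting illustration, both disadvantages can be observed in a minimal sketch. The helper names (tokenize, to_vector), the training sentences, and the new review sentence are hypothetical and are assumed only for this example.

```python
def tokenize(text):
    # Assumption: lowercase, split on whitespace, strip basic punctuation.
    tokens = (w.strip(".,!?;:").lower() for w in text.split())
    return [t for t in tokens if t]


def to_vector(document, vocabulary):
    # Count known words; collect words that have no feature in the vector.
    vector = [0] * len(vocabulary)
    dropped = []
    for word in tokenize(document):
        if word in vocabulary:
            vector[vocabulary[word]] += 1
        else:
            dropped.append(word)
    return vector, dropped


# Hypothetical training data; the vocabulary is fixed once it is built.
training_docs = ["The food was great", "The staff was rude"]
words = sorted({w for doc in training_docs for w in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}

# A new document containing words absent from the training data.
vector, dropped = to_vector("The dessert was delicious", vocabulary)

# Vector length equals vocabulary size, regardless of document length.
vector_length = len(vector)
non_zero_features = sum(1 for value in vector if value)
```

Here, "dessert" and "delicious" have no corresponding feature and are lost from the representation, while the vector still carries one entry per vocabulary word, most of which are zero. With a realistic corpus, the vocabulary, and therefore each vector, can grow to many thousands of entries.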