A variety of natural language processing algorithms or other machine learning classifiers can be improved by incorporating one or more encoders representing word vectors. Word vectors are vectors that individually correspond to respective words in a corpus of words (e.g., the set of words present in a particular literary work, or a set of literary works) and that are embedded in a semantically encoded multidimensional vector space. Words with similar meanings or other semantic content or associations (e.g., “strong” and “forceful,” or “slick” and “slippery”) have corresponding word vectors that are located near each other in the vector space. On the other hand, words with unrelated meanings or other semantic content or associations (e.g., “France” and “cone,” or “gerbil” and “hypotenuse”) have corresponding word vectors that are located farther apart within the semantically encoded vector space than the word vectors of pairs of words that are more similar to each other. An encoder can produce a plurality of word vectors corresponding to respective different words that are present in text of interest.
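The nearness of word vectors in such a space is commonly measured by cosine similarity. A minimal sketch of this idea follows, using hypothetical hand-picked four-dimensional vectors chosen purely for illustration; a trained encoder (e.g., word2vec or GloVe) would instead learn vectors of a hundred or more dimensions from a corpus.

```python
import numpy as np

# Hypothetical toy word vectors (illustrative values only; a real encoder
# would learn these from a corpus of text).
word_vectors = {
    "strong":     np.array([0.90, 0.80, 0.10, 0.00]),
    "forceful":   np.array([0.85, 0.75, 0.15, 0.05]),
    "gerbil":     np.array([0.00, 0.10, 0.90, 0.20]),
    "hypotenuse": np.array([0.10, 0.00, 0.20, 0.95]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 for nearby
    (semantically similar) word vectors, lower for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically similar words lie close together in the vector space...
similar = cosine_similarity(word_vectors["strong"], word_vectors["forceful"])

# ...while unrelated words lie farther apart.
unrelated = cosine_similarity(word_vectors["gerbil"], word_vectors["hypotenuse"])
```

With these toy values, `similar` is close to 1.0 while `unrelated` is substantially lower, mirroring the near/far relationship described above.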
These word vectors can then be used to determine whether strings of text are similar to each other or to perform some other classification or processing related to the strings of text (e.g., combining the word vectors associated with the words present in the strings of text and determining whether the combinations are similar). Because the word vectors are all of the same size, words of varying length, and text strings of varying length and/or number of words, can be compared more easily and/or applied to the input of a classifier (e.g., an artificial neural network). The concept of word vectors can be extended into paragraph vectors, which represent, in the same semantic space as the word vectors, the context and/or overall semantic content of phrases, sentences, paragraphs, or other multi-word samples of text.
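One simple way to combine the word vectors of a string into a single fixed-size vector is to average them, after which strings of different lengths can be compared directly. The sketch below uses this averaging approach with hypothetical toy vectors (illustrative values only); learned paragraph vectors, as described above, are a more sophisticated alternative to averaging.

```python
import numpy as np

# Hypothetical toy word vectors (illustrative values only).
word_vectors = {
    "strong":   np.array([0.90, 0.80, 0.10, 0.00]),
    "forceful": np.array([0.85, 0.75, 0.15, 0.05]),
    "grip":     np.array([0.50, 0.50, 0.50, 0.50]),
    "slippery": np.array([0.00, 0.10, 0.90, 0.20]),
    "road":     np.array([0.10, 0.00, 0.80, 0.30]),
}

def string_vector(text, vectors):
    """Combine the word vectors of a string (here, by averaging) into a
    single fixed-size vector, regardless of how many words the string has."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = string_vector("strong grip", word_vectors)
b = string_vector("forceful grip", word_vectors)
c = string_vector("slippery road", word_vectors)

# Strings built from semantically similar words yield similar combined
# vectors; unrelated strings yield dissimilar ones.
sim_similar = cosine_similarity(a, b)
sim_unrelated = cosine_similarity(a, c)
```

Because every combined vector has the same dimensionality, any of them could also be fed directly to a fixed-input classifier such as an artificial neural network.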