This invention relates generally to document embeddings, and more particularly to generating document embeddings using a convolutional neural network architecture.
Online systems, such as content provider systems or recommendation systems, frequently have access to a significant number of documents. For example, a video publishing website may have access to video reviews written by users of the website. An online system may analyze information contained in the documents so that it can perform various tasks based on this information. For example, the video publishing website may classify reviews as positive or negative, and recommend videos associated with positive reviews to the users of the website. As another example, the video publishing website may retrieve reviews that are similar in content to a given review.
Oftentimes, it is advantageous for the online system to determine document embeddings that represent documents as numerical vectors in a latent space. These representations (i.e., the document embeddings) may be used to characterize the documents for various purposes. However, existing document embedding models suffer from difficulties related to computational efficiency and accuracy. For example, some existing models generate document embeddings based on very small subsets of words and do not incorporate long-range semantic relationships in the document, and therefore may fail to accurately characterize the document as a whole. As another example, some existing models require a significant amount of computational power during both the training and inference processes due to the structure of the models. As yet another example, some existing models require an iterative optimization process to generate document embeddings for new documents.