Technical Field
The present invention generally relates to word embedding, and, more particularly, to training word embedding of domain-specific words and phrases.
Description of the Related Art
Word embedding is a collective term for language modeling and feature learning techniques in natural language processing in which identified words and phrases are mapped to vectors of real numbers. Word embedding trained with methods that incorporate unsupervised learning, such as, e.g., skip n-grams, has been found to be successful.
Word embeddings of similar words or phrases, when trained on unsupervised data such as a general domain corpus (e.g., a Wikipedia corpus), have been found to have a large cosine similarity. For example, using unsupervised word embedding, user inquiries such as “Tell me recommended cars of Mercedes-Benz®” and “Tell me recommended cars of BMW®” would likely have vector representations with a large cosine similarity. This is because the two inquiries share nearly all of their wording and both concern car recommendations. However, although both inquiries request car recommendations, accurate results for the query “Tell me recommended cars of Mercedes-Benz®” would be entirely different from accurate results for the query “Tell me recommended cars of BMW®.” This is due to the fact that Mercedes-Benz® and BMW® produce different automobiles.
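By way of illustration only, the effect described above can be sketched in a few lines of Python. The three-dimensional vectors, the averaging of word vectors into a query vector, and the names `TOY_VECTORS`, `embed_query`, and `cosine_similarity` are all assumptions made for this sketch; they are not taken from any trained model or from any particular embodiment.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embed_query(tokens, vectors):
    """Represent a query as the element-wise mean of its word vectors
    (one simple choice, assumed here for illustration)."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        for i, x in enumerate(vectors[token]):
            total[i] += x
    return [x / len(tokens) for x in total]

# Hypothetical toy vectors, hand-picked for illustration only.
# The two brand names are deliberately given nearby vectors, as a
# general-domain corpus would tend to place them in similar contexts.
TOY_VECTORS = {
    "tell": [0.9, 0.1, 0.0],
    "me": [0.8, 0.2, 0.1],
    "recommended": [0.1, 0.9, 0.2],
    "cars": [0.2, 0.8, 0.3],
    "of": [0.7, 0.1, 0.1],
    "mercedes-benz": [0.1, 0.3, 0.9],
    "bmw": [0.2, 0.3, 0.8],
}

query_a = ["tell", "me", "recommended", "cars", "of", "mercedes-benz"]
query_b = ["tell", "me", "recommended", "cars", "of", "bmw"]

similarity = cosine_similarity(
    embed_query(query_a, TOY_VECTORS),
    embed_query(query_b, TOY_VECTORS),
)
print(f"cosine similarity: {similarity:.3f}")
```

Under these assumed vectors, the two query representations are nearly parallel, because five of the six words are shared and the brand vectors are themselves close, even though the correct answers to the two inquiries differ.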
This large cosine similarity can result in confusing, inadequate, and inaccurate results returned to a user, especially since, for a specific domain, the words or phrases can appear similar in scope but can include completely different information. Thus, there is a need for an improved approach for performing training for word embedding.