Recently, semantic analysis (SA) has attracted considerable attention in the computational linguistics (CL) and natural language processing (NLP) communities. In particular, researchers have focused on techniques for measuring lexical and semantic similarity and/or relatedness between words, terms, and/or other natural language objects. Generally, measuring the similarity between natural language terms is directed to measuring the resemblance between the meanings of the terms, and as such focuses on synonymous relationships or synonymy (e.g., “smart,” “intelligent”). Measuring the relatedness between natural language terms, on the other hand, is typically broader than measuring similarity, as it also covers additional relationships between the terms, such as antonymy (e.g., “old,” “new”), hypernymy (e.g., “rooster,” “bird”), and numerous other functional associations (e.g., “money,” “bank”).
Evaluating lexical and semantic similarity and/or relatedness is a knowledge-intensive task. Typically, known evaluation techniques are corpus-based: they leverage the occurrences of, and associations between, words and/or other linguistic terms in a corpus by utilizing a Distributional Semantics (DS) model, for example, by representing each linguistic term as a vector. Relatedness between linguistic terms is then calculated using a vector similarity measure (e.g., cosine similarity or another suitable measure).
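By way of illustration only, the vector-based approach described above can be sketched as follows. The toy corpus, the sentence-level co-occurrence window, and the function names `cooccurrence_vectors` and `cosine_similarity` are assumptions introduced for this example and are not taken from any particular known technique.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
CORPUS = [
    "the smart student solved the problem",
    "the intelligent student solved the puzzle",
    "the bank holds the money",
]


def cooccurrence_vectors(sentences):
    """Map each word to a sparse vector (a Counter) of the words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:  # skip the position of the word itself
                    vectors[word][context] += 1
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(CORPUS)
print(cosine_similarity(vectors["smart"], vectors["intelligent"]))
print(cosine_similarity(vectors["smart"], vectors["money"]))
```

On this toy corpus, “smart” and “intelligent” score about 0.86, while “smart” and “money” score about 0.62. The second score is inflated because both words co-occur with the frequent word “the”; a practical system would typically apply a weighting scheme or stop-word filtering, which this sketch omits.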
In some known semantic analysis techniques, vectors are constructed from direct or explicit mentions of linguistic terms, and from explicit mentions of their associations with other linguistic terms, within a large corpus of text or knowledge base. Examples of such direct or explicit techniques include Explicit Semantic Analysis (ESA), Salient Semantic Analysis (SSA), and NASARI (a Novel Approach to a Semantically-Aware Representation of Items), to name a few. These techniques use the explicit mentions and associations occurring in text corpora and/or in dictionary corpora to generate explicit DS models for use in determining semantic relatedness. Generally, with direct or explicit semantic analysis techniques, semantic relatedness between linguistic terms is deterministically calculated.
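For illustration, an explicit representation in the spirit of the techniques named above can be sketched by weighting each term against a set of named concepts in which it is explicitly mentioned. This is a simplified sketch under stated assumptions, not an implementation of ESA, SSA, or NASARI: the three-concept knowledge base, the smoothed TF-IDF weighting, and the function names `build_concept_vectors` and `relatedness` are hypothetical.

```python
import math

# Hypothetical knowledge base (concept title -> text), invented for illustration.
CONCEPTS = {
    "Finance": "bank money loan bank interest money",
    "River": "river bank water flow",
    "Bird": "rooster bird wing feather bird",
}


def build_concept_vectors(concepts):
    """Return {term: {concept: weight}} using only explicit term mentions."""
    tokenized = {name: text.split() for name, text in concepts.items()}
    n_concepts = len(tokenized)

    # Number of concepts in which each term is mentioned at least once.
    doc_freq = {}
    for tokens in tokenized.values():
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    vectors = {}
    for name, tokens in tokenized.items():
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
            idf = math.log(1 + n_concepts / doc_freq[term])
            vectors.setdefault(term, {})[name] = tf * idf
    return vectors


def relatedness(vectors, a, b):
    """Cosine similarity between the concept vectors of terms a and b."""
    u, v = vectors.get(a, {}), vectors.get(b, {})
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


concept_vectors = build_concept_vectors(CONCEPTS)
print(relatedness(concept_vectors, "money", "bank"))
print(relatedness(concept_vectors, "money", "rooster"))
```

Because every weight is computed directly from counted mentions, rebuilding the vectors from the same knowledge base always yields the same scores, which reflects the deterministic character of explicit techniques noted above.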
Other commonly known corpus-based semantic relatedness techniques construct vectors based on indirect or implicit associations between linguistic terms or concepts represented in a corpus. In these indirect or implicit techniques, the vectors that represent linguistic terms or objects are indirectly derived from the textual information within a corpus or knowledge base (as contrasted with the explicit derivation techniques described above). For example, the vectors representing the linguistic terms may be estimated by statistical modeling and/or through neural embeddings. Examples of such indirect or implicit techniques include probabilistic/count models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), and neural network models such as CW vectors, Word2Vec, and GloVe. Accordingly, with indirect or implicit semantic analysis techniques, semantic relatedness is probabilistically and/or statistically determined.