The present invention relates to text analysis, and more particularly, to a text representation method and apparatus.
Text analysis is widely applied in fields such as information retrieval, data mining, and machine translation. Text analysis refers to extracting a representation of text and its feature items, and converting unstructured original text into structured information that can be identified and processed by a computer; that is, the text is scientifically abstracted and a mathematical model is established to describe and stand in for the text, such that the computer can identify the text by computing and operating on that model.
Latent semantic analysis (LSA), also known as latent semantic indexing (LSI), is a known indexing and retrieval method. Like the traditional vector space model, this method uses vectors to represent terms and documents, and determines the relationship between terms and documents through the relationship between vectors (e.g., the angles between them). The difference is that LSA maps terms and documents to a latent semantic space, thus removing some “noise” in the original vector space and improving accuracy in information retrieval. However, LSA solves only the problem of synonymy and does not solve the problem of polysemy. Because LSA represents each term as a single point in the latent semantic space, the plurality of meanings of one term corresponds to one point in the space, and the meanings are not distinguished.
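As an illustration only, the mapping of terms to a latent semantic space can be sketched as follows. The corpus, the function names, and the choice of two latent dimensions are hypothetical and do not come from the method described here. To stay self-contained, the sketch approximates a truncated singular value decomposition with plain power iteration and deflation instead of a library routine.

```python
import math

# Hypothetical toy corpus: "car" and "automobile" never share a document,
# but both co-occur with "engine".
DOCS = [
    "car engine",
    "automobile engine",
    "apple fruit",
    "banana fruit",
    "apple banana fruit",
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine of the angle between two vectors (0.0 if either is all zeros).
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def term_document_matrix(docs):
    # Rows are terms, columns are documents, entries are raw term counts.
    vocab = sorted({term for doc in docs for term in doc.split()})
    matrix = [[float(doc.split().count(term)) for doc in docs] for term in vocab]
    return vocab, matrix

def top_right_singular_vectors(matrix, k, iters=200):
    # Power iteration with deflation on the Gram matrix A^T A.
    # An assumption of this sketch: it stands in for a library SVD.
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    vectors = []
    for _component in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, non-symmetric start
        for _step in range(iters):
            w = [dot(gram_row, v) for gram_row in gram]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(gram_row, v) for gram_row in gram])
        vectors.append(v)
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return vectors

def lsa_term_vectors(matrix, k):
    # Each term becomes ONE point: its row of U_k * Sigma_k, which equals A v_j.
    basis = top_right_singular_vectors(matrix, k)
    return [[dot(row, v) for v in basis] for row in matrix]

VOCAB, MATRIX = term_document_matrix(DOCS)
LATENT = lsa_term_vectors(MATRIX, k=2)
INDEX = {term: i for i, term in enumerate(VOCAB)}

def similarity(vectors, a, b):
    return cosine(vectors[INDEX[a]], vectors[INDEX[b]])

print(similarity(MATRIX, "car", "automobile"))  # original vector space
print(similarity(LATENT, "car", "automobile"))  # latent semantic space
```

In this toy corpus the two synonyms are orthogonal in the original space but nearly parallel in the latent space. Note that "apple" still receives exactly one vector, so a second sense of the word (for example, a company name) would be folded into the same point, which is the polysemy limitation described above.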
Explicit Semantic Analysis (ESA) is intended to generate, for a given document segment, a semantic interpreter that projects the segment onto related wiki concepts and sorts those concepts by degree of relevancy. However, ESA determines the set of concepts by considering only the similarity between the context of each concept and the text, and does not consider coherence among the concepts.
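A simplified sketch of such concept ranking is given below. It assumes a tiny hypothetical concept collection in place of a real wiki, whitespace tokenization, and TF-IDF weighting with cosine similarity as the relevancy measure; none of these details are taken from an actual ESA implementation.

```python
import math
from collections import Counter

# Hypothetical concept "articles"; a real system would index wiki pages.
CONCEPTS = {
    "Apple Inc.": "apple computer iphone company software",
    "Apple (fruit)": "apple fruit tree orchard sweet",
    "Orchard": "orchard tree fruit farm harvest",
}

def build_index(concepts):
    # One TF-IDF vector per concept, built from that concept's own text.
    tokens = {name: text.split() for name, text in concepts.items()}
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    n = len(tokens)
    idf = {t: math.log(n / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        name: {t: count * idf[t] for t, count in Counter(toks).items()}
        for name, toks in tokens.items()
    }
    return vectors, idf

def cosine_sparse(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    shared = set(u) & set(v)
    num = sum(u[t] * v[t] for t in shared)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def rank_concepts(segment, concepts):
    # Score every concept independently against the segment, then sort.
    vectors, idf = build_index(concepts)
    counts = Counter(t for t in segment.split() if t in idf)
    seg_vec = {t: c * idf[t] for t, c in counts.items()}
    scored = [(name, cosine_sparse(seg_vec, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

for name, score in rank_concepts("sweet apple from the orchard tree", CONCEPTS):
    print(f"{name}: {score:.3f}")
```

Each concept is scored on its own against the segment, so nothing in the ranking rewards a set of concepts for being mutually consistent. This mirrors the limitation noted above: only text-to-concept similarity is considered, not coherence among the selected concepts.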