The present invention relates to a method of analysing textual phrases and in particular to a method of analysing the semantic content of textual phrases and determining semantic similarities between those phrases.
Information that is stored and organized within databases is conventionally referred to as structured information, whereas information that is described in natural language text is often referred to as unstructured, since many concepts can be expressed in multiple, and often imprecise, ways using natural language. There are a number of useful applications that are dependent upon the capability to process and analyse unstructured, natural language content, for example information retrieval (IR), document analysis and understanding, text summarization, machine translation, and question answering.
Natural Language Processing (NLP) tools are known, but methods and systems for Natural Language Understanding (NLU), question answering (QA) and advanced information retrieval (aIR) remain open topics in Computer Science. Given a passage of text, NLP tools can identify sentences, tag the parts of speech of the words in each sentence, parse the sentence structure to identify grammatical components such as noun and verb phrases, extract keywords, provide stemmed root words, resolve co-references, provide synonyms, perform named entity recognition (NER), and perform basic information retrieval of the kind offered by keyword-based search engines. However, advanced IR requires processing at the semantic level, which takes into consideration the meaning of phrases and sentences.
According to a first aspect of the present invention there is provided a method of determining a degree of semantic similarity between a first text phrase and a second text phrase, the method comprising the steps of: a) analysing the grammatical structure of the first text phrase and the second text phrase; b) processing the first text phrase to generate a first keyword set; c) processing the second text phrase to generate a second keyword set; and d) determining the semantic similarity between the first and second text phrases based on (i) the similarities between the grammatical structures of the first and second text phrases; and (ii) the similarities between the first and second keyword sets.
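By way of illustration only, the final determining step may be sketched as follows. The sketch assumes that each phrase has already been reduced to a sequence of part-of-speech tags and a keyword set. The function names, the use of a sequence-matching ratio for the grammatical comparison, the use of Jaccard overlap for the keyword comparison and the equal weighting of the two are all assumptions made for the example and are not prescribed by the method.

```python
from difflib import SequenceMatcher


def structure_similarity(tags_a, tags_b):
    """Similarity of two part-of-speech tag sequences, in the range [0, 1]."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def keyword_similarity(keywords_a, keywords_b):
    """Jaccard overlap of two keyword sets, in the range [0, 1]."""
    union = keywords_a | keywords_b
    if not union:
        return 1.0  # two empty keyword sets are treated as identical
    return len(keywords_a & keywords_b) / len(union)


def semantic_similarity(tags_a, keywords_a, tags_b, keywords_b,
                        structure_weight=0.5):
    """Weighted combination of (i) structural and (ii) keyword similarity.

    structure_weight is a hypothetical tuning parameter.
    """
    s = structure_similarity(tags_a, tags_b)
    k = keyword_similarity(keywords_a, keywords_b)
    return structure_weight * s + (1.0 - structure_weight) * k


# Hand-written tags and keywords for two phrases with the same structure:
# "How do I reset the router" / "How do I restart the router"
tags = ["WRB", "VBP", "PRP", "VB", "DT", "NN"]
score = semantic_similarity(tags, {"reset", "router"},
                            tags, {"restart", "router"})
print(round(score, 3))  # 0.667
```

In this example the grammatical structures are identical (contributing 1.0) and the keyword sets share one of three distinct keywords (contributing 1/3), so the combined score is 2/3.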
If any idiomatic expressions are detected in step a), then one or more alternative expressions with a similar meaning to the or each idiomatic expression may be identified. Step a) may comprise parsing the grammatical structure of each text phrase and inserting part-of-speech tags in accordance with the results of the parsing.
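The handling of idiomatic expressions may be illustrated by the following minimal sketch, which assumes a small hand-built table of idioms and their alternatives. The table contents and the function name are hypothetical, and matching by plain substring is a simplification adopted for brevity.

```python
# Hypothetical idiom table: each idiom maps to alternative expressions
# with a similar meaning.
IDIOM_ALTERNATIVES = {
    "out of the blue": ["unexpectedly"],
    "up and running": ["working", "operational"],
}


def expand_idioms(phrase):
    """Return the lower-cased phrase followed by variants in which each
    detected idiom is replaced by one of its alternatives."""
    lowered = phrase.lower()
    variants = [lowered]
    for idiom, alternatives in IDIOM_ALTERNATIVES.items():
        if idiom in lowered:
            variants = variants + [
                variant.replace(idiom, alternative)
                for variant in variants
                for alternative in alternatives
            ]
    return variants


print(expand_idioms("The server is up and running"))
# ['the server is up and running', 'the server is working',
#  'the server is operational']
```

Each variant can then be analysed in the same way as the original phrase, so that a phrase containing an idiom may still be matched against a phrase that states the same meaning literally.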
In steps b) and c) said first and second keyword sets may be generated by removing one or more stopwords from the respective text phrases. Steps b) and c) may also generate said first and second keyword sets by stemming the words comprising the respective text phrases.
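A minimal sketch of keyword set generation by stopword removal and stemming is given below. The stopword list is a small illustrative sample, and the suffix-stripping stemmer is a deliberately naive stand-in for an established stemming algorithm; both, along with the function names, are assumptions made for the example.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "do", "does", "how", "what",
    "i", "my", "to", "of", "in", "on", "for", "and", "or",
})

# Suffixes tried in order by the naive stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def keyword_set(phrase):
    """Tokenise, drop stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return {naive_stem(token) for token in tokens if token not in STOPWORDS}


print(keyword_set("Connecting the printers") == keyword_set("connected printer"))
# True: both phrases reduce to {'connect', 'printer'}
```

Because both phrases reduce to the same keyword set, differences of inflection and the presence of stopwords do not affect the keyword comparison.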
The semantic similarity between the first and second text phrases may be used to determine the similarity of a first document to a second document. Furthermore, the decision to retrieve one or more documents may be made in accordance with the semantic similarity between the first and second text phrases.
Thus, if a requested phrase can be determined to be sufficiently similar to a key phrase which is associated with one or more documents, the or each document associated with the key phrase may be returned to the requester.
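The retrieval decision may be sketched as follows, assuming an index that maps each key phrase to its associated documents. The index layout, the threshold value and the word-overlap function used as a stand-in for the semantic similarity measure are all assumptions made for the example.

```python
def retrieve_documents(requested_phrase, key_phrase_index, similarity,
                       threshold=0.5):
    """Return the documents whose key phrase is sufficiently similar to
    the requested phrase. The threshold is a hypothetical parameter."""
    results = []
    for key_phrase, documents in key_phrase_index.items():
        if similarity(requested_phrase, key_phrase) >= threshold:
            for document in documents:
                if document not in results:
                    results.append(document)
    return results


def word_overlap(phrase_a, phrase_b):
    """Stand-in similarity: Jaccard overlap of lower-cased word sets."""
    words_a = set(phrase_a.lower().split())
    words_b = set(phrase_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


index = {
    "reset router password": ["doc-17"],
    "install printer driver": ["doc-42"],
}
print(retrieve_documents("router password reset", index, word_overlap))
# ['doc-17']
```

Passing the similarity measure in as a parameter keeps the retrieval logic independent of how the similarity is computed.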
In an alternative application of the present invention, the semantic similarity between the first and second text phrases may be used to provide one or more answers in response to a received question.
Thus, in a support system, if a received question can be determined to be semantically similar to one or more questions that have been received previously, then it is possible to respond automatically to the received question by returning answers that have previously been determined to be relevant to those previous questions.
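An automatic response of this kind may be sketched as follows, assuming a store of previously received questions paired with their answers. The store layout, the threshold, the ordering of answers by similarity and the word-overlap stand-in for the semantic similarity measure are all assumptions made for the example.

```python
import re


def shared_word_ratio(question_a, question_b):
    """Stand-in similarity: Jaccard overlap of the words in each question."""
    words_a = set(re.findall(r"[a-z0-9]+", question_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", question_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def answer_question(question, answered_questions, similarity, threshold=0.6):
    """Return answers of sufficiently similar previous questions,
    most similar first. The threshold is a hypothetical parameter."""
    scored = [
        (similarity(question, previous), answers)
        for previous, answers in answered_questions
    ]
    matches = [item for item in scored if item[0] >= threshold]
    matches.sort(key=lambda item: item[0], reverse=True)
    collected = []
    for _, answers in matches:
        for answer in answers:
            if answer not in collected:
                collected.append(answer)
    return collected


previous_questions = [
    ("How do I reset my password?", ["Use the 'Forgot password' link."]),
    ("How do I change my email address?", ["Open the account settings page."]),
]
print(answer_question("How can I reset my password?",
                      previous_questions, shared_word_ratio))
# ["Use the 'Forgot password' link."]
```

If no previous question reaches the threshold, the function returns an empty list, in which case the question could be passed to a human operator.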
According to a second aspect of the present invention there is provided an apparatus comprising a central processing unit, volatile data storage means and non-volatile data storage means, the apparatus being configured to perform a method as described above.
According to a third aspect of the present invention there is provided a data carrier device comprising computer executable code for performing a method as described above.