Information retrieval is an important aspect of an increasingly digital world. Several techniques exist to access and retrieve information from digital data sources. Typically, the process of retrieving information from unstructured data is triggered by a natural language query submitted by a user, and involves processing the submitted query and retrieving results from one or more databases. The retrieved results are then displayed to the user. For example, in a ticketing solution such as a helpline solution, when the user logs a ticket, the user may be presented with the closest responses or solutions for the query. However, the text entered by the user in natural language may not exist anywhere in the pre-existing knowledge store. The existing techniques therefore need to understand the meaning of the query in order to provide appropriate solutions. Further, in many cases, the results provided by the existing techniques are of poor quality and irrelevant to the input query, which further confuses users.
Typically, when a user inputs a query to find relevant solutions, different similarity measures are applied. For example, in helpline solutions or question-answer (QA) systems, responses to a query rely on different similarity measures to determine the relevance of the query to the information in the helpline database or QA database. Most of these similarity measures use word level matching techniques to identify the most relevant response. Existing word level matching techniques include bag-of-words based matching, N-gram matching, and similarity measures such as cosine similarity and Jaccard similarity. These techniques generally tend to return the closest matching response from the database. Word based matching techniques tend to fail because relevance cannot be decided based on the number of shared words alone. N-gram based methods are computationally very expensive and fail when the order of the words changes.
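The limitations described above can be illustrated with a minimal Python sketch. It is not taken from any particular existing system: the function names (`tokenize`, `cosine_similarity`, `jaccard_similarity`, `bigram_overlap`), the whitespace tokenizer, and the example sentences are all assumptions made for illustration. It shows that bag-of-words measures ignore word order, that an N-gram measure drops sharply when word order changes, and that a paraphrase sharing no words with the stored text scores zero.

```python
import math
from collections import Counter


def tokenize(text):
    # Illustrative assumption: lowercase and split on whitespace only.
    return text.lower().split()


def cosine_similarity(a, b):
    # Bag-of-words cosine similarity over raw term counts.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def bigram_overlap(a, b):
    # Jaccard similarity computed over sets of adjacent word pairs (2-grams).
    def bigrams(text):
        tokens = tokenize(text)
        return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# Same words, different order, different meaning.
q1 = "printer does not connect to laptop"
q2 = "laptop does not connect to printer"

# Same meaning, no shared words.
q3 = "unable to log in"
q4 = "cannot sign into account"

reordered_cosine = cosine_similarity(q1, q2)    # word order is ignored
reordered_jaccard = jaccard_similarity(q1, q2)  # identical vocabularies
reordered_bigram = bigram_overlap(q1, q2)       # 3 shared bigrams out of 7
paraphrase_jaccard = jaccard_similarity(q3, q4) # no shared words
```

In this sketch the two reordered sentences are treated as identical by the bag-of-words measures, while the paraphrase pair is treated as completely unrelated, which is the behavior the paragraph above describes.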
Further, the information retrieval systems based on these techniques often lack mechanisms that enable the systems to prune the result set further and pick the most relevant response if it exists, or to decide not to respond if none of the responses in the database are relevant. Hence, these systems end up giving irrelevant responses to input queries, which may be very annoying to users. This condition primarily arises because the above discussed similarity measures generally tend to weigh all words equally during the similarity matching stage. In other words, these systems do not have a way to understand the meaning of the entered query and work only at the word level. Additionally, a scoring mechanism to identify the relevance of an input query to different responses in the database is very complex, since this mechanism decides whether or not to respond to the query. Current mechanisms therefore tend to cripple the system's performance, as very strict matching mechanisms may cause the system to return only exact matches, while lenient matching mechanisms may lead the system to generalize the matching process and return irrelevant results.
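The trade-off between strict and lenient matching can be sketched in Python as follows. This is a hypothetical illustration, not a description of any specific existing system: the `respond` function, the two-entry `knowledge_base`, the threshold values, and the example queries are all assumed for the purpose of the example, and Jaccard similarity stands in for any word level measure that weighs all words equally.

```python
def jaccard(a, b):
    # Word level Jaccard similarity; every word carries the same weight.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def respond(query, knowledge_base, threshold):
    # Return the answer of the best-scoring stored question,
    # or None (decline to respond) when the best score is below the threshold.
    best_score, best_answer = 0.0, None
    for question, answer in knowledge_base.items():
        score = jaccard(query, question)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None


# Hypothetical knowledge store used only for this illustration.
RESET = "Use the reset link on the login page."
BILLING = "Edit the address under account settings."
knowledge_base = {
    "how do i reset my password": RESET,
    "how do i update my billing address": BILLING,
}

exact = "how do i reset my password"
paraphrase = "i forgot my password and need to reset it"
unrelated = "how do i cancel my order"

# Strict threshold: only the exact wording is answered.
strict_exact = respond(exact, knowledge_base, threshold=0.9)
strict_paraphrase = respond(paraphrase, knowledge_base, threshold=0.9)

# Lenient threshold: the unrelated query receives the password answer.
lenient_paraphrase = respond(paraphrase, knowledge_base, threshold=0.1)
lenient_unrelated = respond(unrelated, knowledge_base, threshold=0.1)
```

In this sketch the unrelated query scores higher against the stored password question (4 shared words out of 8) than the genuine paraphrase does (4 shared words out of 11), because common words such as "how", "do", "i", and "my" count as much as "reset" and "password". No single threshold therefore separates the relevant query from the irrelevant one.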