The invention relates generally to fulfilling keyword-based search over a semantic repository (an RDF triple store). In particular, the invention relates to a method and system for translating user keywords into semantic queries based on a machine-readable domain vocabulary.
Information Retrieval techniques that facilitate machine-based searching and finding of information are crucial in assisting commerce and work-related tasks in an enterprise. Many business application solutions, such as Knowledge Management Systems in product support and maintenance centers, Content Analytics and digital publishing platforms, and learning and diagnostic support systems in Healthcare, rely on Information Retrieval technology to find relevant information in a massive set of documents or work-related information products.
The goal of Information Retrieval (IR) is to obtain a ranked set of information resources or documents that are relevant to a user input or query typically expressed as keywords. This is done by devising a suitable system model to represent the information contained within the documents and then matching the user input or the user query with respect to the system model and ranking the documents in the order of relevance to the query.
The quality of the matching depends on the richness of the system model chosen to represent the information content. IR systems in use today, embodied by web search engines like Google and Yahoo and by enterprise search platforms like Lucene or SOLR, are based on a simplistic full-text bag-of-words model wherein a document is treated as a collection of words or terms. In this model the words in a document are indexed after stripping certain stop-words (like ‘a’, ‘the’, ‘of’, ‘on’, etc.) and the word or term frequency is computed. The distribution of a term across the entire document collection is thus obtained and drives the interpretation used to find relevant documents for a set of user keywords. The information model in use in IR systems examines only the occurrence of words or phrases and not their meaning. Current systems cannot understand how words and phrases are related to semantic concepts or real-world things, or how words or phrases are related to each other. Therefore current IR systems cannot distinguish between varying interpretations of the same word based on its meaning and context. This absence of meaning and context in the system model of existing IR systems and technology can be addressed by a model that relies on meaning-based metadata to derive the proper sense of words and phrases.
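The bag-of-words model described above can be illustrated with a minimal sketch in Python. The stop-word list, the sample documents and the function names (`tokenize`, `build_index`, `search`) are assumptions made only for illustration and do not reflect the internals of any particular search engine. The sketch shows that a query term is matched purely by occurrence, so two documents using the same word in different senses receive the same score.

```python
from collections import Counter

# Illustrative stop-word list (an assumption); real systems use longer lists.
STOP_WORDS = {"a", "the", "of", "on", "for", "in", "and", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop-words."""
    words = (w.strip(".,;:!?'\"()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_index(docs):
    """Map each document id to a Counter of its term frequencies."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

def search(index, query):
    """Rank documents by the summed term frequency of the query terms."""
    terms = tokenize(query)
    scores = {doc_id: sum(tf[t] for t in terms) for doc_id, tf in index.items()}
    hits = [d for d, s in scores.items() if s > 0]
    # Highest score first; ties are broken by document id.
    return sorted(hits, key=lambda d: (-scores[d], d))

# Hypothetical documents that use the same word in two different senses.
docs = {
    "d1": "The jaguar is a large cat of the Americas.",
    "d2": "The Jaguar on the road is a luxury car.",
}
index = build_index(docs)
results = search(index, "jaguar")
```

Both hypothetical documents are returned with equal scores for the keyword "jaguar", because the model has no notion of whether the animal or the car is meant.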
The current models and interpretation in IR systems provide a list of documents as the result of a search task. For example, suppose the user input is ‘Diabetes drugs for blood pressure patients’. The user's information need is best understood as ‘Show me drugs that can be used to treat Diabetes when the patient also has high blood pressure’. However, the results from existing IR systems and technology are a set of documents or web pages deemed relevant by the system. Users have to read through the document contents, guided by highlighted occurrences of the keywords in those documents, and extract and assemble useful answers for their information need. Users are expected to avoid duplicate information, resolve references and assemble the answers they are looking for. This assembly of answers from documents in the result set is a manual task that is time consuming and error prone, and it reduces the productivity of users of current IR systems. In order to address this gap between the granularity of the IR system results, which are in the form of documents, and the user information requirements, which are at a finer level of granularity based on the actual content in the documents, we believe it is vital to have a richer and more fine-grained model.
Semantic technology represents a new technology stack that provides the primitives to annotate data with a precise meaning and context. It also enables rich information models that are closer to user information needs than documents, and thus helps break free from the containers that were the limiting factor in the model used in existing IR technology. Semantic technology refers to a suite of models, languages and associated runtime components that includes RDF as a basic data model and data representation format, knowledge representation languages like OWL or Simple Knowledge Organization System (SKOS), inference engines, and SPARQL query engines. Semantic technology helps to improve the findability of information relevant to a user need.
This information model at the logical level is composed of two parts: a) a disambiguated list of entities, each associated with a set of entity types and uniquely identified by an identifier that may be in the form of a URI and/or a long integer value; and b) a set of named relations between these entities, expressed using the unique identifiers. These relations could be used to describe attributes like name, address or date-of-birth and also relations to other defined entities like ‘friend-of’ or ‘is-part-of’. At the physical level this data is typically managed by triple stores or semantic repositories based on RDF.
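The two-part information model described above can be sketched in Python as a list of subject-predicate-object triples. The namespace `http://example.org/`, the entity identifiers, the sample names and the helper `match` are hypothetical and chosen only for illustration; an actual triple store would manage such data with indexing and persistence that this sketch omits.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Part a) entities with types; part b) named relations between identifiers.
triples = [
    Triple(EX + "person/1", "rdf:type", EX + "Person"),
    Triple(EX + "person/1", EX + "name", "Alice"),
    Triple(EX + "person/2", "rdf:type", EX + "Person"),
    Triple(EX + "person/2", EX + "name", "Bob"),
    Triple(EX + "person/1", EX + "friend-of", EX + "person/2"),
]

def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]

# Who is person/1 a friend of?
friends = [t.obj for t in match(triples, s=EX + "person/1", p=EX + "friend-of")]
# Which entities are typed as Person?
people = [t.subject for t in match(triples, p="rdf:type", o=EX + "Person")]
```

Attributes such as a name and relations such as ‘friend-of’ share the same triple shape; the only difference is whether the object is a literal value or the identifier of another entity.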
A populated triple store, together with the machine-readable domain knowledge or domain vocabulary in the form of a taxonomy or a richer ontology, is called a knowledge base. Information may be retrieved from such knowledge bases by providing structured queries in a specific language known as SPARQL. This scenario is similar to the use of relational database systems in enterprise applications, wherein data is retrieved using structured queries in the SQL language.
A gap is encountered when users with specific information goals attempt to interact with a knowledge base or a semantic system. This gap results from a mismatch between the system query, which has to be expressed as a structured SPARQL query, and the user query, which is expressed using simple natural language keywords. For most practical purposes and real-world use cases it is not feasible to expect users to specify their information requirement in a structured query language. Users are entrenched in the keyword-based search paradigm exemplified by search products like Google.com or Bing.com. This gap is an opportunity for our invention to automatically translate a user's natural language keywords into structured semantic queries based on the domain knowledge and the kinds of entities and relations present in the semantic repository.
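The mismatch described above can be made concrete with a deliberately simplified sketch in Python. It is not the method of the invention; it only illustrates the difference in form between user keywords and a structured SPARQL query. The vocabulary table, the `ex:` prefix, the `ex:treats` relation and the function `translate` are all hypothetical.

```python
# Hypothetical domain vocabulary: keyword -> (kind, identifier).
VOCABULARY = {
    "drug": ("class", "ex:Drug"),
    "drugs": ("class", "ex:Drug"),
    "diabetes": ("entity", "ex:Diabetes"),
}

# Hypothetical relation connecting instances of a class to an entity.
RELATIONS = {("ex:Drug", "ex:Diabetes"): "ex:treats"}

def translate(keywords):
    """Build a SPARQL SELECT query from keywords, or None if no class matches."""
    tokens = [k.lower() for k in keywords.split()]
    matches = [VOCABULARY[t] for t in tokens if t in VOCABULARY]
    classes = [ident for kind, ident in matches if kind == "class"]
    entities = [ident for kind, ident in matches if kind == "entity"]
    if not classes:
        return None
    target = classes[0]
    # The first recognized class becomes the type of the result variable.
    patterns = [f"?x a {target} ."]
    for entity in entities:
        relation = RELATIONS.get((target, entity))
        if relation:
            patterns.append(f"?x {relation} {entity} .")
    body = "\n  ".join(patterns)
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?x WHERE {\n  " + body + "\n}"
    )

query = translate("Diabetes drugs")
print(query)
```

Under these assumptions the two keywords ‘Diabetes drugs’ become a query with two triple patterns, one constraining the result type and one linking it to the recognized entity. A user would have to write such a query by hand in the absence of an automatic translation step.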
Accordingly, there is a need for a method and system for translating user keywords into semantic queries based on a domain vocabulary. Further, there is a need to bridge the gap between user keywords and logic-based structured queries through a semantic representation of information.