1. Field of the Invention
The present invention generally relates to a method and an apparatus for processing natural language questions. More particularly, the present invention relates to a method and an apparatus capable of answering natural language questions using open linked structured information.
2. Description of Related Art
Question Answering (QA) has been a classical and difficult problem in the area of Artificial Intelligence over the past decades. Given a natural language question, e.g., “Justin Henry's first film role as Dustin Hoffman and Meryl Streep's son in this film earned him an Oscar nomination”, a computer system would try to return a correct answer in natural language, e.g., “Kramer vs. Kramer”, just as a human being would.
To meet the need for computer systems to process natural language questions, Natural Language Processing (NLP) techniques have been widely proposed to solve most QA problems by using unstructured data. Developing NLP techniques is reasonable because over 80% of the world's data is unstructured.
FIG. 1 illustrates a general architecture of existing QA systems. As shown in FIG. 1, a general QA system includes a question processing module 101, a document/passage retrieval module 103, and an answer processing module 105. For a natural language question raised by a user, question parsing and focus detection are performed in the question processing module 101, which selects keywords for the question. Then the document/passage retrieval module 103 performs a keyword search in a database, and performs document filtering and passage post-filtering on the documents containing the keywords, so as to generate candidate answers. Afterwards, the answer processing module 105 performs candidate identification and answer ranking on the candidate answers generated by the document/passage retrieval module 103, and finally formulates an answer to the raised natural language question, so as to output a brief answer to the user in natural language.
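The three-module flow of FIG. 1 may be sketched, for illustration only, as follows. This is a minimal sketch and not the implementation of any existing QA system: the function names, the stopword list, the keyword-overlap score, and the two-passage corpus are all assumptions introduced here, and simple keyword overlap stands in for the parsing, filtering, and ranking steps described above.

```python
import re

# Assumed stopword list; a real question processing module would parse the question.
STOPWORDS = {"a", "an", "the", "in", "of", "as", "and", "this", "him", "s"}

# Hypothetical toy corpus mapping a candidate answer to a supporting passage.
DOCUMENTS = {
    "Kramer vs. Kramer": "Justin Henry plays the son of Dustin Hoffman and "
                         "Meryl Streep and earned an Oscar nomination.",
    "Tootsie": "Dustin Hoffman plays an actor who disguises himself as a woman.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_question(question):
    """Question processing (module 101): select keywords for the question."""
    return [t for t in tokenize(question) if t not in STOPWORDS]


def retrieve_passages(keywords, documents):
    """Document/passage retrieval (module 103): keep passages that share keywords."""
    wanted = set(keywords)
    candidates = []
    for title, passage in documents.items():
        overlap = len(wanted & set(tokenize(passage)))
        if overlap > 0:
            candidates.append((title, overlap))
    return candidates


def process_answers(candidates):
    """Answer processing (module 105): rank the candidates and return the best one."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]


def answer(question, documents=DOCUMENTS):
    """Run the three modules in sequence."""
    keywords = process_question(question)
    candidates = retrieve_passages(keywords, documents)
    return process_answers(candidates)


QUESTION = ("Justin Henry's first film role as Dustin Hoffman and Meryl Streep's "
            "son in this film earned him an Oscar nomination")
print(answer(QUESTION))
```

In this sketch the answer is chosen purely by counting shared keywords, which is why such systems depend on the quality of the unstructured passages available to the retrieval module.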
Moreover, QA evaluation systems have been developed to evaluate the performance of QA systems. The TREC QA track is the best-known evaluation platform for QA in the world, providing various datasets and question sets to evaluate the accuracy and performance of different QA systems. However, with the advance of database and semantic Web technologies, structured data are growing rapidly and becoming more important because they are unambiguous, in contrast to the unstructured data processed by NLP. Furthermore, most large commercial firms process structured data in their business and store them in databases without converting them into unstructured data.
To support QA over the structured data inside corporations, new techniques have to be developed, e.g., NLDB (natural language database), which combines NLP with database technologies by providing a natural language interface over the database to make it easier for users to issue questions. NLDB techniques in general depend on the syntax of the database schema, where natural language questions are translated into a few executable SQL statements in the database. Therefore, they restrict users to asking questions with a specific natural language grammar and return answers only within the scope of the database.
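The schema dependence of such an interface may be illustrated with a minimal sketch, assuming a single hypothetical table `film` and a single question template. The table, its row, the regular expression, and the function `ask` are assumptions made for this illustration and do not describe any particular NLDB product; only Python's standard `re` and `sqlite3` modules are used.

```python
import re
import sqlite3

# Hypothetical schema and data, created in memory for this illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, director TEXT, year INTEGER)")
conn.execute("INSERT INTO film VALUES ('Kramer vs. Kramer', 'Robert Benton', 1979)")

# One question template tied to the schema: "who directed <title>?"
PATTERN = re.compile(r"^who directed (.+?)\??$", re.IGNORECASE)


def ask(question):
    """Translate a question matching the template into SQL and execute it."""
    match = PATTERN.match(question.strip())
    if match is None:
        # The question does not follow the grammar the interface understands.
        raise ValueError("question is outside the supported grammar")
    rows = conn.execute(
        "SELECT director FROM film WHERE title = ? COLLATE NOCASE",
        (match.group(1),),
    ).fetchall()
    return [row[0] for row in rows]


print(ask("Who directed Kramer vs. Kramer?"))
```

A question phrased differently, such as "Which film earned Justin Henry an Oscar nomination?", raises an error in this sketch, and a question about a film absent from the table returns an empty list. Both behaviors correspond to the two restrictions noted above: a fixed grammar and answers limited to the scope of the database.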
Besides databases, a large amount of new structured data has emerged with progress toward realizing the semantic Web vision, e.g., RDF (Resource Description Framework) data, a form of linked data. Over RDF data, semantic query languages, e.g., SPARQL, have been proposed to query data based on semantics without depending on syntax. However, so far there is no well-developed technique to process natural language questions over open linked data without the limitation of natural language grammar.
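For illustration, RDF data can be viewed as a set of (subject, predicate, object) triples, and the core of a SPARQL-style query as a list of triple patterns whose variables must be bound consistently. The following is a minimal sketch of that matching idea in plain Python; it is not a SPARQL engine, and the `ex:` identifiers, the toy triples, and the function names are hypothetical.

```python
# Toy linked data as (subject, predicate, object) triples; "ex:" names are hypothetical.
TRIPLES = {
    ("ex:Kramer_vs_Kramer", "ex:starring", "ex:Dustin_Hoffman"),
    ("ex:Kramer_vs_Kramer", "ex:starring", "ex:Meryl_Streep"),
    ("ex:Kramer_vs_Kramer", "ex:starring", "ex:Justin_Henry"),
    ("ex:Tootsie", "ex:starring", "ex:Dustin_Hoffman"),
}


def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return the binding extended so that pattern matches triple, or None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended


def query(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly: SELECT ?film WHERE { ?film ex:starring ex:Dustin_Hoffman .
#                               ?film ex:starring ex:Meryl_Streep . }
BOTH_STARS = [
    ("?film", "ex:starring", "ex:Dustin_Hoffman"),
    ("?film", "ex:starring", "ex:Meryl_Streep"),
]
print(query(BOTH_STARS))
```

The patterns refer only to the relationships between resources and not to any table layout, which is the sense in which such queries are based on semantics. The sketch still requires the patterns to be written by hand; deriving them from a free-form natural language question is the gap identified above.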