Some phrases in questions submitted to free-form natural language question-answering systems are essential for finding relevant answers, while others are less important. For example, certain phrases in a question are very likely to occur in its answers. Automatically identifying the important parts of a question is often difficult, yet it is necessary for building a successful system.
Current solutions rely on bag-of-words models and corpus statistics, such as inverse document frequency (IDF), to assign weights to terms in questions. For instance, most question answering (QA) systems and search engines assign term weights in a context-independent fashion using simple Term Frequency-Inverse Document Frequency (TF-IDF)-like models. Even more recent information retrieval techniques for query term weighting typically rely on the same bag-of-words models and corpus statistics.
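To make the context-independent weighting concrete, the following is a minimal sketch of IDF computed as log(N / df) over a tiny corpus. The corpus, the `idf_table` helper, and the tokenization rule are assumptions made for illustration only; they are not taken from any particular system.

```python
import math
import re

def idf_table(corpus):
    """Return {term: log(N / df)} where df is the number of documents containing the term."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        # Count each term at most once per document.
        for term in set(re.findall(r"[a-z]+", doc.lower())):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}

# Hypothetical four-document corpus, invented for this sketch.
corpus = [
    "how to change a flat tire",
    "change your name after marriage",
    "name change forms by state",
    "day care license renewal",
]

idf = idf_table(corpus)
for term in ("change", "name", "license"):
    print(term, round(idf[term], 3))
```

In this sketch the weight of a term is a fixed table lookup: it depends only on how many documents contain the term, not on the question in which the term appears. Frequent words such as "change" therefore always receive a lower weight than rarer words such as "license".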
Consider, for example, the query “How does one apply for a New York day care license?” A bag-of-words model would likely assign a high score to “New licenses for day care centers in York county, PA” because of the high word overlap, even though that result does not answer the question and refers to the wrong state.
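The overlap problem in this example can be sketched with a simple word-overlap score. The scoring function below is an illustrative stand-in for a bag-of-words model, not the method of any specific system, and the second document (`on_topic`) is a hypothetical text invented for comparison.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, document):
    """Count the distinct words shared by query and document; word order is ignored."""
    return len(tokens(query) & tokens(document))

query = "How does one apply for a New York day care license?"
# Document quoted in the text above.
wrong_region = "New licenses for day care centers in York county, PA"
# Hypothetical document invented for this sketch; not from the source.
on_topic = "Submit the child care provider application to the state office in Albany"

score_wrong = overlap_score(query, wrong_region)
score_on_topic = overlap_score(query, on_topic)
print(score_wrong, score_on_topic)  # prints "5 1"
```

The Pennsylvania document shares "new", "for", "day", "care", and "york" with the query, so it outscores the hypothetical on-topic document, which shares only "care". Because the score ignores word order, it cannot tell that "New" and "York" form a place name in the query but not in the document.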
Important phrases are also not necessarily contiguous. For example, in the question “how does one change his or her name?” the important part is the predicate-argument structure “change name.” A system relying on contiguous n-grams (groups of n contiguous words) and IDF weights will return many irrelevant results because “change” and “name” are not adjacent in the question and, being high-frequency words, receive low weights individually.
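A short sketch shows why contiguous n-grams miss this structure. The `ngrams` helper and its tokenization rule are assumptions made for illustration.

```python
import re

def ngrams(text, n):
    """Return all contiguous n-word tuples from the text, lowercased."""
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

question = "how does one change his or her name?"

bigrams = ngrams(question, 2)
print(("change", "name") in bigrams)  # prints "False"

# The shortest contiguous n-gram covering both words has five words
# and carries the filler "his or her" along with it.
five_grams = ngrams(question, 5)
print(("change", "his", "or", "her", "name") in five_grams)  # prints "True"
```

No bigram pairs "change" with "name", and the only contiguous span that contains both also contains three words that add nothing to the meaning, so an exact n-gram match against candidate answers is unlikely.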