The present invention relates to the field of information extraction, and more particularly to the field of identifying and extracting relevant information from independent sources of information.
The present age is witnessing the generation of large amounts of information. The sources of information, such as the Internet, store information in different forms, and there is no common syntax or form for representing it. Therefore, there is a need for information search techniques that can help in extracting relevant information from the volumes of unstructured information available at different sources.
Various conventional techniques can be used to search and extract the information available at various sources. One commonly used technique is ‘keyword search’. In this technique, a search is conducted based on keywords, provided by a user, that relate to a particular knowledge domain. For example, in the knowledge domain of online purchase of concert tickets, the keywords can pertain to the name of the artist, price, date, etc. However, this technique has limitations. It generates a significant number of irrelevant results, primarily because it does not recognize the context in which a keyword is being used. For example, if a user inputs the name of an artist while looking for the artist's upcoming concerts, the technique may also return results related to the personal life of the artist. Such information is irrelevant to a person who is looking for tickets to the artist's show.
Further, the conventional techniques fail to account for the synonyms and connotations of keywords, which are rife in natural language content. For example, one of the keywords for an upcoming concert's tickets is ‘concert’. The conventional techniques do not incorporate synonyms such as ‘show’, ‘program’, ‘performance’, etc.
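The two limitations described above can be illustrated with a minimal sketch. It assumes the simplest form of keyword search, namely case-insensitive substring matching; the function name `keyword_search`, the artist name, and the sample documents are hypothetical and are not taken from any particular search system.

```python
def keyword_search(documents, keyword):
    """Return the documents that contain the keyword (case-insensitive).

    Illustrative assumption: plain substring matching, with no notion of
    context and no knowledge of synonyms.
    """
    needle = keyword.lower()
    return [doc for doc in documents if needle in doc.lower()]


# Hypothetical sample documents for the concert-ticket domain.
documents = [
    "Jane Doe concert tickets on sale, price 40 dollars",
    "Jane Doe spotted on holiday with family",
    "Tickets available for the upcoming show this weekend",
]

# Searching by artist name also returns the personal-life document,
# because the search does not recognize the context of the query.
by_artist = keyword_search(documents, "Jane Doe")

# Searching for "concert" misses the document that says "show",
# because synonyms are not taken into account.
by_concert = keyword_search(documents, "concert")
```

In this sketch, the artist query returns an irrelevant document, while the ‘concert’ query fails to return a relevant one that uses the synonym ‘show’.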
Another commonly used technique for information extraction is ‘wrapper induction’. It is a procedure designed to extract information from the information sources using pre-defined templates. Instead of reading the text at the sentence level, wrapper induction systems identify relevant content based on the textual qualities that surround the desired data. For example, a job application form may contain pre-defined templates for various fields such as name, age, qualification, etc. The wrappers, therefore, can easily extract information pertaining to these fields without reading the text at the sentence level.
However, different sources of information are not represented in a uniform format, and there is a lack of common structural features across them. Hence, the wrapper induction technique does not work efficiently.
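The behavior of a template-based wrapper, and its dependence on a uniform format, can be sketched as follows. This is only an illustration under stated assumptions: the wrapper is hand-written rather than induced from examples, and the pattern `FIELD_PATTERN`, the function `extract_fields`, and both sample inputs are hypothetical.

```python
import re

# Hypothetical template: each field sits on its own line as "Label: value".
FIELD_PATTERN = re.compile(r"^(Name|Age|Qualification):\s*(.+)$", re.MULTILINE)


def extract_fields(text):
    """Extract labelled fields using the surrounding structure only.

    The text is never parsed at the sentence level; a field is found
    solely because a known label and a colon precede it.
    """
    return {
        label.lower(): value.strip()
        for label, value in FIELD_PATTERN.findall(text)
    }


# A source that follows the template.
form_a = "Name: Jane Doe\nAge: 30\nQualification: B.Sc."

# A source that carries the same facts as free text, without the template.
form_b = "Applicant Jane Doe, aged 30, holds a B.Sc. degree."

fields_a = extract_fields(form_a)  # all three fields are recovered
fields_b = extract_fields(form_b)  # nothing matches the template
```

The second source contains the same information as the first, yet the wrapper extracts nothing from it, since the structural features it relies on are absent.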
Therefore, there exists a need for an extraction technique that can identify the context in which the keywords are being used. The technique should be able to identify the information that is relevant to the context. The technique should also identify and filter out the information that is not relevant to the context, in order to yield efficient search results.