The present invention relates to the field of information processing systems. More specifically, the present invention relates to the field of extracting information from natural language data and re-arranging it in a structural form.
The present age is witnessing the generation of large amounts of information. Sources of information, such as the Internet, store information in different forms. There is no common syntax or form for representing the information. Therefore, there is a need for information search techniques that can help in extracting relevant information from the volumes of unstructured information available at different sources of information.
Different information search techniques are known in the art. One such technique is keyword search. In keyword search, keywords that relate to a particular information domain are used to conduct a search in the information sources.
Another methodology is wrapper induction search. It is a procedure designed to extract information from the information sources using pre-defined templates. Instead of reading the text at the sentence level, wrapper induction systems identify relevant content based on the textual qualities that surround the desired data. For example, a job application form may contain pre-defined templates for various fields such as name, age, and qualification. The wrappers, therefore, can easily extract information pertaining to these fields without reading the text at the sentence level.
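The wrapper idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the form text, the field labels, and the names `FORM_TEXT`, `WRAPPERS`, and `extract_fields` are hypothetical, and the patterns are written by hand, whereas a wrapper induction system would typically learn such patterns from labeled examples.

```python
import re

# Hypothetical job application form; the field labels play the role of the
# pre-defined template that surrounds the desired data.
FORM_TEXT = """Name: Jane Doe
Age: 29
Qualification: M.Sc. Computer Science
"""

# Hand-written wrappers (an assumption for illustration): each pattern keys on
# the label that precedes the value, not on the meaning of any sentence.
WRAPPERS = {
    "name": re.compile(r"^Name:\s*(.+)$", re.MULTILINE),
    "age": re.compile(r"^Age:\s*(\d+)$", re.MULTILINE),
    "qualification": re.compile(r"^Qualification:\s*(.+)$", re.MULTILINE),
}


def extract_fields(text):
    """Return the fields whose surrounding label pattern is found in text."""
    record = {}
    for field, pattern in WRAPPERS.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
    return record


if __name__ == "__main__":
    print(extract_fields(FORM_TEXT))
    # Free text without the expected labels yields nothing.
    print(extract_fields("Jane Doe, 29, holds an M.Sc. in Computer Science."))
```

The second call returns an empty record because the same facts appear without the expected labels, which is the kind of source variation that such wrappers do not handle.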
Yet another methodology for extracting information is an information index system that creates a database by extracting attributes from a plurality of structurally similar texts.
However, the above-mentioned techniques suffer from one or more of the following limitations. Keyword search techniques generally produce inadequate search results because they do not recognize the context in which a searched keyword appears. For example, if a user inputs the name of an artist while looking for the artist's upcoming concerts, the technique may also return results related to the personal life of the artist. This type of information is irrelevant to a person who is looking for tickets to the artist's show. Therefore, many irrelevant results are also displayed in the search results.
Further, the conventional techniques fail to incorporate the synonyms and connotations of the keywords that are rife in natural language content. For example, one keyword that can be used to search for tickets to an upcoming concert is ‘concert’. The conventional techniques do not incorporate its synonyms, such as show, program, and performance.
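Both limitations of keyword search can be seen in a minimal Python sketch. The documents and the names `DOCUMENTS`, `SYNONYMS`, `keyword_search`, and `expanded_search` are hypothetical and chosen only for illustration; plain substring matching stands in here for a conventional keyword search.

```python
# Hypothetical documents about one artist (assumed for illustration).
DOCUMENTS = [
    "Tickets on sale for the artist's concert in Boston next month.",
    "The artist's upcoming show in Chicago has sold out.",
    "The artist was seen on holiday with family after the concert tour ended.",
]

# Synonyms of 'concert' named in the text above.
SYNONYMS = ["concert", "show", "program", "performance"]


def keyword_search(documents, keyword):
    """Return every document that contains the keyword, ignoring context."""
    keyword = keyword.lower()
    return [doc for doc in documents if keyword in doc.lower()]


def expanded_search(documents, keywords):
    """Return every document that contains at least one of the keywords."""
    lowered = [k.lower() for k in keywords]
    return [doc for doc in documents if any(k in doc.lower() for k in lowered)]


if __name__ == "__main__":
    for doc in keyword_search(DOCUMENTS, "concert"):
        print(doc)
```

Searching for ‘concert’ returns the first and third documents. The third concerns the artist's personal life, so it is irrelevant to a ticket buyer, while the second, which uses the synonym ‘show’, is missed. Expanding the query with `SYNONYMS` recovers the second document but still returns the third, since matching remains blind to context.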
The wrapper induction technique faces limitations because of the lack of common structural features across varied information sources. Information index system techniques find specific use in extracting information from texts that have a pre-defined structural form. None of the techniques discussed above restructures the information in a way that highlights the context and circumvents the nuances and complexities of language.
In light of the above limitations, there exists a need for an information extraction methodology that identifies relevant content by identifying the presence of associated attributes that relate to an information domain. Further, there is a need for a methodology that extracts relevant information from a data set and restructures it in a common structural form.