The invention relates to the field of comparison of data objects. More specifically, the invention relates to the conversion of unstructured data objects into structured representations, and subsequent comparison of the structured representations.
The present age is witnessing the generation of large amounts of information. Sources of information, such as the Internet, store information in different forms. There is no common syntax or form for representing the information. Therefore, there is a need for information search techniques that can help in extracting relevant information from the volumes of unstructured information available at different sources of information.
Several information search techniques are known in the art. One such technique is keyword search. In keyword search, keywords that relate to a particular information domain are used to search in the information sources.
Another methodology is wrapper induction search. It is a procedure designed to extract information from the information sources using pre-defined templates. Instead of reading the text at the sentence level, wrapper induction systems identify relevant content based on the textual qualities that surround the desired data. For example, a job application form may contain pre-defined templates for various fields such as name, age, qualification, etc. The wrappers, therefore, can easily extract information pertaining to these fields without reading the text at the sentence level.
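By way of illustration only, a wrapper of this kind may be sketched as follows. The sketch assumes a hypothetical form layout in which each field appears on its own line as a label followed by a colon. The field labels, the sample text, and the function name `extract_fields` are assumptions introduced for illustration and are not drawn from any particular wrapper induction system.

```python
import re

# Hypothetical field labels assumed to be known for this form layout.
FIELD_LABELS = ("Name", "Age", "Qualification")


def extract_fields(text, labels=FIELD_LABELS):
    """Extract the value following each known label, using only the
    surrounding delimiters (label, colon, end of line) as landmarks."""
    record = {}
    for label in labels:
        # Match "<label>: <value>" at the start of a line, ignoring case.
        pattern = re.compile(
            r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            record[label] = match.group(1)
    return record


# A templated form: the landmarks are present, so extraction succeeds.
form = """Name: Jane Doe
Age: 29
Qualification: B.Sc. Computer Science
"""

# The same facts in free-form prose: no landmarks, so nothing is extracted.
free_text = "Jane Doe, aged 29, holds a B.Sc. in Computer Science."

print(extract_fields(form))
print(extract_fields(free_text))
```

The second call returns an empty record, which reflects the dependence of such wrappers on a common structural form in the source text.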
Yet another methodology for extracting information is an information index system that creates a database by extracting attributes from a plurality of structurally similar texts.
However, the above-mentioned methodologies suffer from one or more of the following limitations. Keyword search techniques generally produce inadequate search results, because they do not recognize the context in which a searched keyword appears. For example, if a user inputs the name of an artist while looking for the artist's upcoming concerts, the technique may also return results related to the personal life of the artist. This type of information is irrelevant to a person who is looking for tickets to the artist's show. Therefore, many non-relevant data sets are also displayed in the search results.
Further, the conventional methodologies fail to incorporate the synonyms and connotations of keywords that are rife in natural language content. For example, one of the keywords for an upcoming concert's tickets is ‘concert’. The conventional techniques might not incorporate synonyms such as show, program, performance, etc. The wrapper induction methodology, in turn, proves inefficient in cases where the varied information sources lack common structural features.
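The keyword search limitations described above may be illustrated with a minimal sketch. The sample documents, the synonym table, and the function names `keyword_search` and `expanded_search` are hypothetical and are introduced solely for illustration; the sketch does not represent any particular search engine.

```python
import re

# Hypothetical sample documents, assumed for illustration only.
DOCUMENTS = [
    "Tickets on sale for the artist's concert in Chicago",
    "Buy tickets for the artist's upcoming show in Boston",
    "The artist spoke about family life after the concert tour",
]

# Hypothetical synonym table using the synonyms named in the text above.
SYNONYMS = {"concert": {"show", "program", "performance"}}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(documents, keyword):
    """Return documents containing the literal keyword, ignoring context."""
    needle = keyword.lower()
    return [doc for doc in documents if needle in tokenize(doc)]


def expanded_search(documents, keyword):
    """Return documents containing the keyword or any listed synonym."""
    terms = {keyword.lower()} | SYNONYMS.get(keyword.lower(), set())
    return [doc for doc in documents if terms & set(tokenize(doc))]


# Literal matching misses the ticket page that says "show" and returns
# the personal-life article that happens to mention "concert".
print(keyword_search(DOCUMENTS, "concert"))

# Synonym expansion recovers the "show" page, but the personal-life
# article is still returned because context is not considered.
print(expanded_search(DOCUMENTS, "concert"))
```

In this sketch, literal matching omits a relevant document and includes a non-relevant one, while synonym expansion alone corrects only the former.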
The methodologies discussed above find specific use in extracting information from texts that have a pre-defined structural form. Further, these methodologies do not re-structure the information in any way to highlight the context and circumvent the nuances and complexities of natural language. Furthermore, the above-mentioned methodologies do not provide related results, that is, results containing keywords related to the ones provided in the search string. For example, if a user wants to search for concert tickets for Madonna's show, the websites selling tickets for Britney Spears's show may also be relevant to the user. These related results are not displayed by the existing search methodologies, since the existing techniques do not pass on the weights associated with the relevant search results to other related search results that relate to the same context. In other words, the techniques do not provide a context-based search for related results.
In light of the above limitations, it is apparent that there is a need for a scalable methodology for comparison of data objects that identifies relevant content within the data objects, and compares the data objects based on the identified content. The methodology should be able to identify the presence of certain attributes within the data objects that relate to an information domain or context of interest to the user. The methodology should also assign weights to related search results that may be relevant to a user. Further, there is a need for a methodology that converts data objects into structured representations in order to compare the data objects. Furthermore, there is a need for a methodology that compares the context in which keywords are used in data objects.