Over the years, the amount of content available on websites has increased, and with it the need for efficient content extraction techniques. One way of extracting content includes computing a similarity score for an attribute “A” between the attribute values of a data record stored in a database and those of an input webpage, and then deciding, based on the similarity score, either to extract the content from the webpage as relevant or not to extract it as irrelevant. However, existing methods of computing the similarity score can be error prone.
One existing method of computing the similarity score is explained in conjunction with FIG. 1. Consider a data record 105. The data record 105 includes two attributes of restaurants, for example NAME and ADDRESS. The data record 105 includes a record, for example R1. An exemplary webpage 110 can be available over a network. The webpage 110 has the name and address of a restaurant. The name and address of the restaurant in the webpage 110 and in the record R1 belong to the same real-world entity, which is the Beijing Bites restaurant. The Jaccard similarity technique can be used to compute the similarity score for an attribute “A” between the attribute values of the data record 105 and the webpage 110. The Jaccard similarity can be computed for two sets S1 and S2 as
\[
\mathrm{JC}(S1, S2) = \frac{|S1 \cap S2|}{|S1 \cup S2|}
\]
The similarity score (6/13) between the value (115) of the ADDRESS attribute in the record R1 and the value (120) of the ADDRESS attribute in the webpage 110, which belong to the same real-world entity, is low due to the additional line “(between 28th and 29th St)” in the ADDRESS attribute in the webpage 110 and due to the presence of the abbreviation “Ave” in the webpage 110. Similarly, the value (125) of the NAME attribute in the record R1 and the value (130) of the NAME attribute in the webpage 110, which belong to the same real-world entity, have a low similarity score of ⅓ due to the misspelling of Beijing as Bejing in the webpage 110. A low similarity score for the same real-world entity can lead to the webpage 110 being ignored as non-relevant and hence can cause errors in the extraction of relevant content.
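The NAME comparison can be illustrated with a minimal sketch. It assumes that the attribute values are split into lowercase, whitespace-separated tokens and that the two NAME values are "Beijing Bites" and "Bejing Bites". Neither the tokenization nor the exact strings are stated in the text. They are chosen only because they are consistent with the score of ⅓ given above. The function names `tokenize` and `jaccard` are hypothetical.

```python
from fractions import Fraction


def tokenize(text):
    """Split a value into a set of lowercase, whitespace-separated tokens.

    Assumption: the text does not specify a tokenization scheme.
    """
    return set(text.lower().split())


def jaccard(s1, s2):
    """Return |S1 intersect S2| / |S1 union S2| as an exact fraction."""
    union = s1 | s2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return Fraction(1)
    return Fraction(len(s1 & s2), len(union))


# Assumed NAME values; the misspelling "Bejing" follows the description.
record_name = "Beijing Bites"
webpage_name = "Bejing Bites"

score = jaccard(tokenize(record_name), tokenize(webpage_name))
print(score)  # 1/3: one shared token ("bites") out of three distinct tokens
```

Because Jaccard similarity treats tokens as either identical or different, the one-letter misspelling counts as a complete mismatch, which is why the score falls to ⅓ even though both values name the same restaurant.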