The identification of organizations that possess a strong knowledge of a specific area of research or expertise in a specific technique is of interest to a wide variety of private and public sector entities. For example, identifying the organizations most experienced in research on a disease area of interest can facilitate collaboration and communication among these organizations and between these organizations and governmental agencies. Moreover, private sector entities, such as pharmaceutical companies, spend a percentage of their total marketing budgets on identifying Key Opinion Leaders and organizational Centers of Excellence. Most entities, however, continue to use conventional tools such as literature searches, surveys, observation methods, self-identification methods, informant methods, and sociometric methods. These conventional methods are not entirely accurate and, in at least some instances, are not cost effective.
Methods for extracting, normalizing and/or organizing data from text have been described previously.
For example, the problem of extraction and normalization of organization names has been studied in open domains like Wikipedia and news articles (see, e.g., Khalid et al., “The Impact of Named Entity Normalization on Information Retrieval for Question Answering,” Lecture Notes in Computer Science 4956, pp. 705-710, 2008); however, those systems had an accuracy of less than 80%.
U.S. Pat. No. 7,716,162 describes a method for normalizing geographic locations rather than organization names. Free text is used as the data source, rather than organization-related text such as a PubMed affiliation sentence. The normalization of geographic locations is based upon the generation and combination of histograms.
U.S. Published Patent Application 2010/0023515 A1 teaches a method for clustering and organizing records in a database. The clustering algorithm matches records by comparing deterministic cluster definitions of records against each data record under consideration. These deterministic cluster definitions can employ edit-distance-related metrics. However, U.S. 2010/0023515 A1 does not teach a modified Levenshtein distance for matching phrases or the use of centroids.
U.S. Published Patent Application 2009/0313463 A1 teaches the use of edit distances to match database records from different database custodians, but does not teach the derivation of centroids, the clustering of centroids, or the use of regular expressions to extract phrases.
U.S. Published Patent Application 2007/0067285 A1 teaches normalization of persons from free text and citation databases as well as approximate string matching. This reference also teaches the use of clusters and centroids, but these centroids are based upon multiple weighted variables whose weights are determined by statistical regression analyses. Neither the clustering of centroids based upon geopolitical entity nor the use of a modified Levenshtein distance is mentioned. Phrase extraction and the assignment of geographic and other information using regular expressions are likewise not taught in this reference.
U.S. Published Patent Application 2009/0182755 A1 teaches clustering of entities to determine business locations. Data extraction is taught; however, the clustering method is based upon the well-known Expectation-Maximization algorithm.
WIPO publication WO2009/158492 A1 describes a social-networking-based process for matching entities extracted from PubMed, including the use of data obtained from affiliation sentences in PubMed. However, this reference does not teach the use of edit distances or word similarity metrics such as Smith-Waterman to define centroids and normalize names.