Set expansion refers to the practical problem of expanding a small set of “seed” entities into a more complete set by discovering other entities that belong to the same “concept set”. Here, a “concept set” can be any collection of entities that people regard as conceptually forming a set, and “seeds” are instances of entities in that set. As an example, a person wanting to discover all camera brand names may give a small number of well-known brand names such as “Canon” and “Nikon” as seeds; set expansion techniques would then leverage the given data sources to discover other camera brands, such as “Leica”, “Pentax” and “Olympus”.
Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category) in order to deliver better results for entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools.
There is a substantial amount of data on the web, but present set expansion techniques work poorly with noisy web data. Two readily available forms of general web data sources are Hypertext Markup Language (HTML) lists extracted from web pages by web crawls (henceforth referred to as web lists) and web search query logs (query logs). Such general-purpose web data can be highly useful for set expansion tasks: it is very diverse in nature, with rich information that covers most domains of interest. In addition, since this general data is not domain-specific or seed-specific, it can be pre-processed and optimized for efficiency. However, general web data is inherently noisy. Random walk or other similarity measures alone may not be sufficient to distinguish true results from noise, especially when the number of seeds is limited. Random walk based ranking techniques used in previous work perform poorly on general-purpose web lists and query logs and produce results with low precision and recall. Partly for that reason, previous approaches use seed-specific and page-specific wrappers to reduce the candidate set to a smaller and much cleaner subset over which the random walk based ranking techniques work reasonably well. However, this additional data extraction process comes at the cost of overall architectural complexity and system responsiveness.
One set expansion system for using web data to expand a set of seed entities is presented in U.S. patent application Ser. No. 13/163,736 entitled “ITERATIVE SET EXPANSION USING SAMPLES,” and filed on Jun. 20, 2011, which is hereby incorporated by reference and referred to herein as SEISA. SEISA solves several of the above problems. SEISA uses web-lists as one data source. A web-list is the HTML fragment between a <ul> or <ol> tag and its corresponding closing </ul> or </ol> tag. The text between each <li> tag and its closing </li> tag is considered a named entity. All named entities that belong to the same web-list are considered to be from the same concept set. The similarity between any two named entities is measured by how many web-lists they share versus how many web-lists they belong to, using popular scoring functions such as Jaccard or Cosine. For example, if Boston belongs to List_1 and List_2 and Chicago belongs to List_1 and List_3, then the two entities share one web-list (List_1) out of three distinct web-lists, and using Jaccard as the similarity function, Similarity (Boston, Chicago)=1/(2+2−1)≈0.33.
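The web-list extraction and Jaccard scoring described above can be illustrated with a minimal Python sketch. It is not SEISA's implementation: the class name `WebListParser`, the function `jaccard`, and the sample HTML are hypothetical, and the parser assumes well-formed, non-nested lists.

```python
from html.parser import HTMLParser


class WebListParser(HTMLParser):
    """Collects the <li> texts of each <ul>/<ol> element as one web-list.

    Illustrative assumption: lists are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.web_lists = []    # one set of entity strings per <ul>/<ol>
        self._current = None   # entities of the list being read, or None
        self._item = None      # text chunks of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._current = set()
        elif tag == "li" and self._current is not None:
            self._item = []

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._item is not None:
            text = "".join(self._item).strip()
            if text:
                self._current.add(text)
            self._item = None
        elif tag in ("ul", "ol") and self._current is not None:
            self.web_lists.append(self._current)
            self._current = None
            self._item = None  # drop any unterminated item


def jaccard(a, b, web_lists):
    """Shared web-lists divided by all web-lists containing a or b."""
    lists_a = {i for i, wl in enumerate(web_lists) if a in wl}
    lists_b = {i for i, wl in enumerate(web_lists) if b in wl}
    union = lists_a | lists_b
    return len(lists_a & lists_b) / len(union) if union else 0.0


# Hypothetical sample mirroring the Boston/Chicago example in the text.
html = """
<ul><li>Boston</li><li>Chicago</li></ul>
<ol><li>Boston</li><li>Seattle</li></ol>
<ul><li>Chicago</li><li>Denver</li></ul>
"""
parser = WebListParser()
parser.feed(html)
score = jaccard("Boston", "Chicago", parser.web_lists)
print(round(score, 2))  # 0.33
```

Representing each web-list as a set makes membership tests cheap; a production system would more likely keep an inverted index from entity to web-list identifiers rather than scanning every list per query.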
Experiments show that SEISA works well for concepts of relatively small cardinality, such as countries and colors. However, in practice there are also uses for expanding a large concept that includes many entities, such as all the cities in the United States. Such expanded sets can be used for data cleaning or as features for named entity recognition in document understanding. One typical behavior of set expansion algorithms is that, as the expanded set becomes larger, the expansion precision (that is, the fraction of the expanded set that belongs to the concept set) drops. One particularly interesting application setting is therefore to find as many entities as possible in a large concept while keeping the precision of the expanded set above a relatively high threshold, such as 0.9. There are a few drawbacks when applying SEISA in this setting. First, SEISA treats every web-list as equal, so introducing less popular entities in a large concept is likely to reduce the quality score. Second, SEISA does not use negative seeds, so it is not possible to give feedback that, for example, New Jersey is not a city.
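The precision criterion above can be made concrete with a short Python sketch. It assumes a labeled concept set is available for evaluation, which is not the case at expansion time; the function names `precision` and `largest_prefix_above` and the sample ranking are hypothetical.

```python
def precision(expanded, concept_set):
    """Fraction of the expanded set that belongs to the concept set."""
    if not expanded:
        return 0.0
    return sum(1 for e in expanded if e in concept_set) / len(expanded)


def largest_prefix_above(ranked, concept_set, threshold=0.9):
    """Longest prefix of a ranked candidate list whose precision
    stays at or above the threshold."""
    best = 0
    hits = 0
    for k, entity in enumerate(ranked, start=1):
        hits += entity in concept_set
        if hits / k >= threshold:
            best = k
    return ranked[:best]


# Hypothetical labeled data: every entry except the two states is a city.
cities = {"Boston", "Chicago", "Seattle", "Denver", "Austin",
          "Miami", "Dallas", "Portland", "Phoenix", "Atlanta"}
ranked = ["Boston", "Chicago", "Seattle", "Denver", "Austin", "Miami",
          "Dallas", "Portland", "Phoenix", "New Jersey", "Atlanta", "Texas"]

kept = largest_prefix_above(ranked, cities, threshold=0.9)
print(len(kept), round(precision(kept, cities), 3))  # 11 0.909
```

In this sample the full ranking of twelve candidates has precision 10/12, below the 0.9 threshold, while the first eleven candidates reach 10/11, which illustrates how precision falls as the expanded set grows.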