Set expansion refers to the practical problem of expanding a small set of “seed” entities into a more complete set by discovering other entities that belong to the same “concept set”. Here a “concept set” can be any collection of entities that people conceptually regard as a set, and “seeds” are known instances of entities in that set. As an example, a person wanting to discover all camera brand names may give a small number of well-known brand names, such as “Canon” and “Nikon”, as seeds; set expansion techniques would then leverage the given data sources to discover other camera brands, such as “Leica”, “Pentax” and “Olympus”.
Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category) in order to deliver better results for entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools.
Many efforts have been made over the years to develop high-quality set expansion systems. The most relevant efforts include Google Sets, which employs proprietary algorithms to perform set expansion. However, due to its proprietary nature, the algorithms and data sources behind Google Sets are not publicly available for future research endeavors. Another prominent line of work is the Set Expander for Any Language (SEAL) system, which adopts a two-phase strategy. It first builds customized text wrappers based on the input seeds in order to extract candidate entities from web pages in a precise manner. It then uses a graph-based random walk to rank the candidate entities based on their closeness to the seeds on the graph. While this customized data extraction and ranking process can produce high-quality results, the necessary online data extraction can be costly and time-consuming.
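The graph-based ranking idea can be illustrated with a minimal sketch. This is not SEAL's actual algorithm: it assumes a simple bipartite graph between sources (for example, pages or lists) and the entities they mention, and it uses a random walk with restart at the seeds as a stand-in for the ranking step. The function names, the restart probability and the iteration count are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_graph(sources):
    """Build an undirected bipartite graph linking each source to its entities."""
    graph = defaultdict(set)
    for source_id, entities in sources.items():
        for entity in entities:
            graph[("src", source_id)].add(("ent", entity))
            graph[("ent", entity)].add(("src", source_id))
    return graph


def random_walk_with_restart(graph, seeds, restart=0.15, iterations=50):
    """Approximate the stationary visit probabilities of a walk restarting at the seeds."""
    seed_nodes = [("ent", s) for s in seeds if ("ent", s) in graph]
    if not seed_nodes:
        return {}
    # The walker jumps back to a uniformly chosen seed with probability `restart`.
    restart_mass = {node: 1.0 / len(seed_nodes) for node in seed_nodes}
    scores = dict(restart_mass)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            neighbors = graph[node]
            share = (1.0 - restart) * mass / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        for node, mass in restart_mass.items():
            nxt[node] += restart * mass
        scores = nxt
    return dict(scores)


def rank_candidates(sources, seeds):
    """Return non-seed entities sorted by descending score (ties broken by name)."""
    scores = random_walk_with_restart(build_graph(sources), seeds)
    seed_set = set(seeds)
    candidates = [
        (name, score)
        for (kind, name), score in scores.items()
        if kind == "ent" and name not in seed_set
    ]
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


# Toy data: two camera-related sources and one unrelated source that mentions a seed.
toy_sources = {
    "page1": ["Canon", "Nikon", "Leica"],
    "page2": ["Canon", "Nikon", "Pentax", "Olympus"],
    "page3": ["Apple", "Banana", "Nikon"],
}
ranked = rank_candidates(toy_sources, ["Canon", "Nikon"])
```

In this toy graph, entities that share sources with both seeds receive more of the walk's probability mass than entities reachable only through the unrelated source, which is the notion of closeness to the seeds described above.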
There is a substantial amount of data on the web, but present set expansion techniques work poorly with noisy web data. Two readily available forms of general web data sources are Hypertext Markup Language (HTML) lists extracted from web pages by web crawls (henceforth referred to as web lists) and web search query logs (query logs). Such general-purpose web data can be highly useful for set expansion tasks: they are very diverse in nature and carry rich information that covers most domains of interest. In addition, since these general data are not specific to any domain or seed, they can be pre-processed and optimized for efficiency.
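Because such data is seed-independent, it can be indexed once, offline, and reused for every query. The following is a minimal sketch of that idea under stated assumptions: web lists are given as lists of strings, and each query log record is a hypothetical (query, clicked URL) pair, a format the text does not specify. All function names and the `.example` URLs are illustrative only.

```python
from collections import defaultdict


def normalize(term):
    """Lowercase a term and collapse whitespace so surface variants match."""
    return " ".join(term.lower().split())


def build_index(web_lists, query_log):
    """Precompute context -> terms and term -> contexts maps, independent of any seed."""
    contexts = defaultdict(set)
    # Each web list is one context; its items are the terms.
    for list_id, items in enumerate(web_lists):
        for item in items:
            contexts[("list", list_id)].add(normalize(item))
    # Assumption: queries that led to the same URL share a context.
    for query, url in query_log:
        contexts[("url", url)].add(normalize(query))
    index = defaultdict(set)
    for context, terms in contexts.items():
        for term in terms:
            index[term].add(context)
    return dict(contexts), dict(index)


def candidates_for(seeds, contexts, index):
    """Look up every term that shares at least one context with a seed."""
    seed_terms = {normalize(seed) for seed in seeds}
    found = set()
    for seed in seed_terms:
        for context in index.get(seed, ()):
            found |= contexts[context]
    return found - seed_terms


web_lists = [["Canon", "Nikon", "Leica"], ["Home", "About", "Contact"]]
query_log = [
    ("canon", "cameras.example"),
    ("pentax", "cameras.example"),
    ("weather", "news.example"),
]
contexts, index = build_index(web_lists, query_log)
camera_candidates = candidates_for(["Canon"], contexts, index)
```

The index is built without reference to any seed, so the cost of crawling and parsing is paid once; at query time only dictionary lookups are needed. The candidates returned this way are unranked and may be noisy.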
However, these general web data can be inherently noisy. Random walk or other similarity measures alone may not be sufficient to distinguish true results from noise, especially when the number of seeds is limited. Random-walk-based ranking techniques used in previous work perform poorly on general-purpose web lists and query logs and produce results with low precision and recall. Partly because of that, previous approaches use seed-specific and page-specific wrappers to reduce the candidate set to a smaller and much cleaner subset, over which the random-walk-based ranking techniques work reasonably well. However, this additional data extraction process comes at the cost of overall architectural complexity and system responsiveness.