The following relates to the informational arts, computer arts, clustering and classification arts, and related arts. Some illustrative applications of the following include named entity recognition and annotation, content clustering, and so forth.
Named entity recognition systems are configured to identify and annotate named entities, such as names of persons, locations, or organizations. The system may label such named entities by entity type (for example, person, location, or organization) based on the context in which the named entity appears. Problematically, the same named entity may have different usages. For example, “Oxford” may refer to a city, a university, or a football team, among other usages, and as a city may refer to any of numerous different cities named “Oxford” that exist in England, in Ohio, and elsewhere.
Named entity recognition is a specific instance of the more general problem of soft clustering, in which items are to be assigned to non-exclusive groups based on features of the items. In soft clustering, a given item may be assigned to more than one group; in contrast, hard clustering requires that each item be assigned exclusively to a single group. Named entity recognition is a soft clustering problem since, for example, “Oxford” may be assigned to each of the groups “Cities”, “Universities”, and “Football teams”, among others.
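The distinction between soft and hard clustering can be illustrated with a minimal Python sketch. The sketch is not drawn from any particular clustering system: the names `soft_assignment`, `is_hard_clustering`, and `members_by_group` are hypothetical, and “Cambridge” is an assumed extra item added only so that a group has more than one member.

```python
# Illustrative soft assignment: each item maps to the set of groups it belongs to.
# "Oxford" and its groups come from the text; "Cambridge" is a hypothetical extra item.
soft_assignment = {
    "Oxford": {"Cities", "Universities", "Football teams"},
    "Cambridge": {"Cities", "Universities"},
}


def is_hard_clustering(assignment):
    """Return True only if every item belongs to exactly one group."""
    return all(len(groups) == 1 for groups in assignment.values())


def members_by_group(assignment):
    """Invert an item -> groups mapping into a group -> member items mapping."""
    members = {}
    for item, groups in assignment.items():
        for group in groups:
            members.setdefault(group, set()).add(item)
    return members


groups = members_by_group(soft_assignment)
```

In this sketch the groups overlap: “Oxford” appears as a member of all three groups, so `is_hard_clustering(soft_assignment)` is false. A hard clustering would correspond to the special case in which every set on the right-hand side contains exactly one group.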
Named entity recognition is also an example of a soft clustering problem in which differing levels of specificity in the groupings may be desirable. For example, “Oxford” may be annotated more generally as a city, or more specifically as a city in England. Using numerous small groups provides high specificity in the annotation; on the other hand, an excessively large number of (typically small) groups can lead to high computational complexity and to difficulty in assigning annotations or labels to the groups, whether manually, automatically, or semi-automatically. Existing soft clustering techniques generally require an a priori selection of the number of groups, which limits flexibility and can lead to forced grouping of unrelated items or forced separation of related items.
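The notion of differing levels of specificity can be sketched as follows. This is an illustration under stated assumptions only and does not represent any existing technique: the table `LABEL_PATHS` and the function `annotate` are hypothetical, and the three-level path for “Oxford” is an assumed arrangement of the labels mentioned above, ordered from most general to most specific.

```python
# Hypothetical label path for one usage of an entity, ordered from the most
# general label to the most specific one. The levels are assumed for illustration.
LABEL_PATHS = {
    "Oxford": ("Location", "City", "City in England"),
}


def annotate(entity, depth):
    """Return the label for `entity` at the requested specificity.

    depth=1 selects the most general label; larger values select more
    specific labels and are clamped to the deepest level available.
    """
    path = LABEL_PATHS[entity]
    level = max(1, min(depth, len(path)))
    return path[level - 1]


general = annotate("Oxford", 2)
specific = annotate("Oxford", 3)
```

Here `general` is the label “City” and `specific` is the label “City in England”. Each additional level multiplies the number of groups that must be maintained and labeled, which is the tradeoff between specificity and complexity noted above.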