Databases will often include sets of character strings, i.e., one or more words, that refer to and/or describe a same subject matter, e.g., a name of an entity. Accordingly, in various circumstances, e.g., deduplication, it is desirable that these sets of character strings, are identified as referring to the same subject matter. A system that is trained based on known examples is usable to predict that two character strings might refer to the same subject matter.