Any application dealing with large amounts of structured data (such as databases and spreadsheets) that benefits from knowing an exact semantic category for the various fields of data can utilize the method to be described hereinafter.
Given a list of words of “arbitrary origin” that are assumed to have one or more categories in common, the problem becomes finding the category that best fits (based on human judgment or other methods) the list of words with enough accuracy to be used in automated data processing applications. In one illustrative example a list of numbers can include (2, 3, 5, 7, 11, 13). The question becomes what is the best category that all of, or most of, the aforementioned numbers belong. At least three categories can be listed: prime numbers; odd numbers; and, positive integers. It appears that the category of prime numbers would be the best category for these numbers. Another illustrative example includes (dog, cat, hamster, gold fish). Two categories that this list belongs to are: pets and animals. It would appear that the aforementioned list is best categorized as pets.
It is to be appreciated that categorizing a list of words of “arbitrary origin” requires a large comprehensive, multifaceted categorization system that reflects how a wide range of people would naturally categorize the words in the list. For example, the category of “Presidents of the United States” or “Assassinated Presidents of the United States” would depend on the particular list of Presidents. One present categorization system involves Wikipedia's crowd sourced “folksonomy” which is used to organize Wikipedia articles using a sense of classification and organization that is agreeable to the authors and other individuals that categorize articles. Wikipedia's classification system is effective for classifying a list of words when there is strong semantic similarity between the words in the list. For example, the classification system is good for categories like: countries, ‘Presidents of the USA’, and birds. Wikipedia's classification system is not as good for: ‘objects found in the sky’ or ‘people born in January’. There are many applications where knowing what lists of words represent is useful. One such application is the conversion of tabular data into resource description framework (RDF) triples (linked data). An RDF file can parse down to a list of triples. A triple comprises a subject, a predicate, and an object.
Another application is combining two distinct databases into one database where both of the original databases contain records with the same type of information but have different syntactic schemas. The problem being resolved in the present disclosure is the automatic categorization of arbitrary fields in a spread sheet or database to make easier a class of problems like the two described above. Current approaches to this problem typically require a large amount of contextual information about the specific problem in order to create a one off solution that does not generalize well to the entire class of problems in this area.