Learning from unstructured natural language has many applications. However, the rapid growth in data production makes such large volumes of text difficult to understand. Previous work has attempted to catalog the subject matters that exist on the internet; one such attempt identified over 2000 different subjects, ranging from general categories such as history, cars, travel, and hotels to much more specific subjects such as renaissance dance or bed bugs.
It would be useful to be able to calculate semantic distances between these subjects, both to determine which subjects are “close” to a given subject and to quantify that closeness. Such a measurement could enrich a query about a particular subject with information collected from related subjects, or be used to find clusters of subjects within a predetermined list of all subjects.
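To make the idea of a quantitative closeness measure concrete, the following is a minimal sketch and not the method developed in this work. It assumes each subject is summarized by a bag of term counts and uses cosine distance as a stand-in for semantic distance. The subject profiles, the counts, and the names `SUBJECT_TERMS`, `cosine_distance`, and `nearest_subjects` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical term-count profiles for four subjects (illustrative values only).
SUBJECT_TERMS = {
    "hotels": {"room": 5, "booking": 4, "city": 2},
    "travel": {"booking": 3, "city": 4, "flight": 5},
    "cars": {"engine": 5, "city": 1, "wheel": 4},
    "bed bugs": {"room": 2, "insect": 5, "bite": 4},
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term-count dicts."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile shares nothing with any other profile.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_subjects(query, profiles, k=2):
    """Return the k subjects with the smallest distance to `query`."""
    ranked = sorted(
        (cosine_distance(profiles[query], profile), name)
        for name, profile in profiles.items()
        if name != query
    )
    return [name for _, name in ranked[:k]]


print(nearest_subjects("hotels", SUBJECT_TERMS))  # ['travel', 'bed bugs']
```

Under these assumed profiles, "travel" and "bed bugs" rank as the closest subjects to "hotels" because they share terms with it, while "cars" shares almost none. The same pairwise distances could in principle serve as input to a clustering procedure over the full subject list.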