The present disclosure relates to methods and systems for semantically classifying open class nouns. In some embodiments, the present disclosure relates to methods and systems for semantically classifying a data set including a large number of open class nouns.
Many websites and interactive software programs include or create various data sets including one or more pieces of data. Typically, semantically classifying the data included in these data sets requires comparing each piece of data in the set to an electronic knowledge-base (EKB) to determine the classification for the individual datum. Classifications may include first names, last names, people's full names, street names, business names, product names, song titles, book titles, etc. However, occasionally a data set will include various pieces of data that overlap two or more classifications, thereby making a single comparison to one or more EKBs an unreliable way to classify the data set.
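By way of illustration only, the per-datum comparison described above, and the overlap problem it runs into, may be sketched as follows. The sketch assumes a tiny in-memory stand-in for an EKB; the `EKB` contents, the classification labels, and the `classify_datum` helper are hypothetical and are not part of any particular knowledge-base.

```python
# Hypothetical, tiny stand-in for an electronic knowledge-base (EKB):
# each classification label maps to a set of known entries.
EKB = {
    "first_name": {"jordan", "taylor", "maria"},
    "last_name": {"jordan", "taylor", "smith"},
    "street_name": {"jordan", "main", "elm"},
}


def classify_datum(datum, ekb=EKB):
    """Return every classification whose EKB entries contain the datum."""
    key = datum.strip().lower()
    return sorted(label for label, entries in ekb.items() if key in entries)


# Compare each piece of data in the set to the EKB, one datum at a time.
data_set = ["Maria", "Jordan", "Taylor", "Smith"]
matches = {datum: classify_datum(datum) for datum in data_set}

# Data that overlap two or more classifications cannot be resolved
# by a single lookup.
ambiguous = [datum for datum, labels in matches.items() if len(labels) > 1]

for datum, labels in matches.items():
    print(datum, labels)
print("ambiguous:", ambiguous)
```

In this sketch, "Jordan" and "Taylor" each match more than one classification, so looking up each datum in isolation does not indicate how the data set as a whole should be classified.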
In typical digital marketplaces, a user can use a keyword or semantically based search to identify individual content such as text phrases and custom graphics and images. However, content and other digital media previously organized into data sets, such as a data set including pieces of data that overlap two or more classifications, are difficult to search because the search results may include a large number of returned pieces of data, many of which may be unrelated to the user's search. This can result in lost revenue for the digital marketplace because the user may be unable to sort through the results to find what they are looking for, and thus, the user may not purchase the content.
Additionally, variable-data marketing campaigns rely on one or more data sources to produce dynamic documents consisting of variable content. The dynamic documents are targeted toward particular recipients (e.g., customers, prospects, event invitees, etc.). The cleanliness of the data source content is crucial to the success of a variable-data marketing campaign. Therefore, accurate semantic identification of data source content is highly valued.