Data mining technologies are indispensable for processing large quantities of unstructured data. Data mining relies on identifying and extracting names, places, or words that match particular search criteria or a specification, and typically involves the initial steps of discovering sources and mining those sources for relevant information.
A number of technologies focus on automating aspects of the data mining process, but a complete automated solution has yet to be realized. For example, a large number of open source mining and source discovery tools are available today, such as those from Kapow Software. There are also solutions, such as MarkLogic® and LexisNexis™, that organize information extracted from sources, as well as software such as Esri's ArcGIS™ for the creation and visualization of geospatial data.
Source discovery is often a manual process because it takes an analyst time to evaluate various datasets for usefulness, accuracy, and validity to a specific application. There are tools in use that help with the process of traversing the World Wide Web (WWW) to discover online sources (e.g., Ficstar, Web Grabber, Fetch, Mozenda), but these need to be directed and bounded by specific search parameters to scope that search, all of which require input from a person. In addition, sources that are not online or digital cannot be discovered in an automated fashion.
Once sources are discovered, mining them for specific nuggets of information is possible, especially if the search is done in English. However, extending these searches to foreign languages, especially those with non-Roman character sets, can be a challenge. While many tools support UTF-8 encoding, which allows character matching even in non-Roman character sets, there are often challenges in dealing with misspellings and in characterizing the words within the languages (e.g., identifying that a word is actually the name of a person). Given the variety of Romanization systems, there are often dozens of ways of spelling one name. For example, the name Muhammed has one spelling in Arabic but numerous spellings in Romanized characters. This poses challenges to data mining systems in matching words with multiple spellings, especially when mining online media.
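The spelling-variant problem described above can be illustrated with a minimal sketch. It is not a method taken from any of the tools named here; it assumes a deliberately crude matching key (strip diacritics, collapse doubled letters, drop Latin vowels), and the function names `normalize`, `skeleton`, and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict

VOWELS = set("aeiou")


def normalize(name: str) -> str:
    """Lowercase the name and strip diacritics and non-letter characters."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks produced by NFKD are not alphabetic, so they are dropped here.
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


def skeleton(name: str) -> str:
    """Hypothetical matching key: collapse doubled letters, then drop vowels."""
    out = []
    prev = ""
    for ch in normalize(name):
        if ch != prev and ch not in VOWELS:
            out.append(ch)
        prev = ch
    return "".join(out)


def group_variants(names):
    """Group Romanized spellings that share the same consonant skeleton."""
    groups = defaultdict(list)
    for name in names:
        groups[skeleton(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    spellings = ["Muhammed", "Mohamed", "Mohammad", "Muḥammad", "Ahmed"]
    for key, variants in group_variants(spellings).items():
        print(key, variants)
```

Under these assumptions the first four spellings share the key `mhmd`, while "Ahmed" maps to `hmd`. The key is lossy: a distinct name such as "Mahmoud" also reduces to `mhmd`, which is one reason spelling normalization alone does not resolve who a mined name refers to.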
Human Geography (HG) is becoming increasingly important given the recent uprisings in the Middle East and North Africa, as well as threats from cartels in Mexico and South America. While there are potentially many definitions of HG, the term can be described as tying human information to geospatial locations. Many solutions focus on technology and on automating the process of collecting human geography information, for example with data mining and language technologies. Automated approaches are valuable because they can quickly drill through large quantities of data to discover specific pieces of information and identify patterns. However, there are still limits to the ability of automated mining technologies to find, assimilate, and geospatially locate information.
Existing data mining engines frequently struggle to place the mined data in context, leading to misidentifications of relationships or patterns. For example, many data mining engines would connect the financial institution “Berkshire Hathaway” with the actress “Anne Hathaway.” Although the shared name “Hathaway” matches exactly, appropriate context would show that there is no relationship between these two entities.
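As a rough illustration of how context can separate two entities that share a name, the sketch below scores each candidate by how many of its cue words appear in the surrounding sentence. The `Entity` class, the `resolve` function, and the cue-word lists are all hypothetical and chosen only for this example; they do not describe how any existing engine works.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str
    context_terms: frozenset  # illustrative cue words, not a real lexicon


KNOWN_ENTITIES = [
    Entity("Berkshire Hathaway", "organization",
           frozenset({"shares", "insurance", "investor", "holding", "stock"})),
    Entity("Anne Hathaway", "person",
           frozenset({"actress", "film", "movie", "role", "starred"})),
]


def resolve(mention: str, sentence: str,
            entities: Sequence[Entity]) -> Optional[Entity]:
    """Pick the entity whose cue words best overlap the sentence, if unambiguous."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))

    def overlap(entity: Entity) -> int:
        return len(entity.context_terms & tokens)

    # A naive engine would stop here and treat every candidate as a match.
    candidates = [e for e in entities if mention.lower() in e.name.lower()]
    ranked = sorted(candidates, key=overlap, reverse=True)
    if not ranked or overlap(ranked[0]) == 0:
        return None  # no contextual evidence at all
    if len(ranked) > 1 and overlap(ranked[1]) == overlap(ranked[0]):
        return None  # tie: context does not separate the candidates
    return ranked[0]


if __name__ == "__main__":
    hit = resolve("Hathaway", "Hathaway starred in the film.", KNOWN_ENTITIES)
    print(hit.name if hit else "unresolved")
```

Returning `None` when the evidence is missing or tied is a design choice in this sketch: it avoids asserting a relationship that the context does not support.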
The importance of contextual analysis is even more pronounced in the area of HG data mining. For example, a data mining engine may identify the name “As Sadlan” in an unstructured document. Without any contextual information, there would be no way of knowing any social, cultural, or geographic affiliations of this person and no way to reach new conclusions based on the mined data.
Improved systems for contextualizing data discovered through data mining are desired.