In our information-based society, there are many sources of data and information. In general, data can be found in all forms, sizes, and contexts. For example, data can be found in news media, the Internet, databases, data warehouses, published reports, scientific journals, industry publications, government statistics, court papers, recorded phone conversations, and the like. When the need arises to research a topic or find a solution to a problem, the common approach is to search known data sources and then manually scan the available facts and figures for any useful information.
Some data may be stored in a structured format, e.g., in data warehouses or relational databases. Structured data is typically pre-sorted and organized into a useful information format which is relatively easy to search and digest. In fact, assuming the potential questions are known, the data may be properly organized into customized data marts that the user can readily access to retrieve the needed information with minimal effort.
There also exist vast amounts of unstructured data that are not so easy to access. Unstructured data may be found in newspaper articles, scientific journals, the Internet, emails, letters, and countless other sources that are relatively difficult to organize and search, and from which it is difficult to retrieve any useful information. The unstructured data is typically just words in a document that have little meaning beyond their immediate context to those in possession of the document. It is most difficult to assess or learn anything from unstructured data, particularly when questions from unrelated areas are posed or when the right questions are not even known. The unstructured data may be just as important as the structured type, sometimes even more so, but its elusiveness often leaves a significant gap in the thoroughness of any search and analysis.
The process of searching for relevant and useful information and getting meaningful results is important in many different contexts and applications. The user may be interested in marketing information, medical research, environmental problem solving, business analysis, criminal investigation, or anti-terrorist work, just to name a few. In a typical approach, the user creates a list of key words or topics and uses a search engine to electronically interrogate available data sources, e.g., the Internet or various public and private databases. The user will get one or more hits from the search and must then manually review and analyze each reference of interest. The process takes considerable time and effort and, with present research tools, will often overlook key elements of relevant data.
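The typical key word approach described above can be illustrated with a minimal Python sketch. The function name `keyword_search`, the document identifiers, and the sample text are all hypothetical and chosen only for illustration; real search engines use far more elaborate indexing. The sketch assumes a small in-memory collection of documents and exact word matching.

```python
import re

def keyword_search(documents, keywords):
    """Return (doc_id, matched_keywords) for each document containing any key word.

    Hypothetical helper for illustration: exact, case-insensitive word matching only.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for doc_id, text in documents.items():
        # Split the text into lowercase alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        matched = sorted(wanted & tokens)
        if matched:
            hits.append((doc_id, matched))
    return hits

# Hypothetical sample data, not drawn from any real source.
documents = {
    "memo-1": "Pilot training schedule for the spring term.",
    "note-2": "Student enrolled in flying lessons at a local airfield.",
}

hits = keyword_search(documents, ["pilot", "training"])
print(hits)
```

In this sketch only `memo-1` is returned. The document `note-2` concerns the same subject but shares none of the exact key words, so it is passed over, which is one way a key word search may overlook relevant data.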
Consider the example of a search for potential terrorist threats and targets. Authorities have access to vast amounts of structured information in government databases to use as intelligence gathering tools in the war on terror. However, the numerous government computer systems are generally not linked together, and data from one agency is not necessarily available to another agency. Moreover, the unstructured data which exists in other places is hard to access and even harder to interpret. There is no central depository of all information.
Some key piece of intelligence may exist which, if known to the proper authorities, could avert an attack. The data may come from a newspaper article, email, recorded phone call, or police report. Such information is usually in some innocuous or hard-to-find place. Recall that much of the data related to the 9/11 attack on the World Trade Center was known; it was just not recognized as being relevant or significant. Taken in hindsight, the fact that suspicious individuals were taking limited flying lessons, i.e., learning how to fly but not land, was extremely important. Yet the right people did not understand, the dots were not connected, and no one correlated the fragments of data. The situational dynamics of pre-9/11 remained disjointed and fuzzy.
The authorities responsible for homeland security have learned much about intelligence and routinely conduct intelligence sweeps. Still, it is highly likely that both structured and unstructured data exist today that, if known and understood, would be most helpful in preventing future incidents. But mere access to the data is not enough. Even if the data is known, it may not be appreciated for its relevance or significance. The data is often fuzzy, vague, or ambiguous, or may have special context. Again, the connections between the dots are still not being made. There is a real need for tools to aid in the analysis and interpretation of data that might otherwise be passed over.
The use of computer-based search engines is well-known. More advanced data searching and analysis techniques, such as data mining and various taxonomies (hierarchies of information), exist but do not fully address unstructured data or data interpretation needs. Much of the useful data presently available remains very difficult to access and understand. People looking for information in virtually any area face this common problem. Using present search and analysis techniques, it is impractical to track all data from all sources. The individual slices of data are but pieces in an intelligence gathering jigsaw puzzle that requires better tools to understand. Missing intelligence leads to missed opportunities and poor decisions.
A need exists to organize all types of data to assist in searching data sources and interpreting the retrieved information, particularly from unstructured data sources.