Many government and civilian organizations produce an exceptionally large volume of reports. For various government agencies and civilian entities there may be several reports published per hour. At least for some governmental agencies, these reports are available for searching using a search engine such as Google. The search engine regularly crawls the governmental or civilian document repository to parse the text of the documents and update its search index.
An analyst looking for a particular document enters a search term into the search engine, which returns a set of documents or links to documents that match the text search. Some of these systems also categorize the documents by the day, month and year they were published, and some documents may also be categorized by topic. One can browse the documents using a topic or date search, with summaries of the documents and links to them being returned.
An analyst seeking the content of these reports must choose or guess the right set of key words to obtain good search results. If the key words are chosen too narrowly, the search will miss reports that the analyst might be interested in. If they are chosen too broadly, the search can return an overwhelming number of hits, more than the analyst has time to go through.
It is therefore very difficult to run an accurate and focused search for material in these written reports.
Another large problem associated with such a reporting system is that an analyst may be very interested in the location in the world where a report was published or to which it refers. Current content management systems have no ability to do what is called spatial searching, which typically involves drawing an area on a map, for example, and showing any reports that reference any points inside the area that is drawn. It would be highly desirable to make such a spatial query and to display the results of the search plotted adjacent to the corresponding points on the map.
A benefit of spatial visualization of search results is that it shows where clusters of reports exist in some area, revealing related activity of which the analyst may not be aware.
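By way of illustration only, the spatial search described above, in which an area is drawn on a map and reports referencing points inside it are returned, can be sketched as a point-in-polygon test. The sketch below is not taken from any existing system: the report names, coordinates and drawn area are hypothetical, and the ray-casting test treats longitude and latitude as planar coordinates, which is a simplifying assumption that ignores areas crossing the antimeridian or the poles.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray running east from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical report locations (assumed data, for illustration only).
reports = {
    "report-A": (-71.06, 42.36),
    "report-B": (-0.13, 51.51),
    "report-C": (-73.99, 40.73),
}

# Hypothetical area drawn by the analyst, given as polygon vertices.
drawn_area = [(-75.0, 40.0), (-70.0, 40.0), (-70.0, 43.0), (-75.0, 43.0)]

# Reports that reference a point inside the drawn area.
hits = sorted(name for name, loc in reports.items() if point_in_polygon(loc, drawn_area))
print(hits)
```

In a deployed system this test would more likely be delegated to a spatially enabled database; the in-memory version is shown only to make the operation concrete.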
All content management systems store metadata, i.e. data about data, with each content item. Typical metadata includes title, author and publication date, such as is used by a card catalog in a library. This metadata can be searched, for example to find all documents by a particular author. The augmentation of this typical metadata with spatial location data is required to support spatial searches.
In order to support spatial searches, a system is needed to mine the content of documents for either explicit or implicit geographic references, and to convert these geographic references into geographic metadata, which can later be queried by a spatially enabled relational database management system.
While simple searches have in the past used geo-coding to indicate, for instance, a business establishment and its location, usually by street address and zip code, present query systems for unstructured information do not allow this type of search.
There are two issues at stake in developing a spatially enabled query system. One is extracting spatial location information from documents, and the other is using this spatial information along with other metadata and text information to perform focused searches over a large collection of documents. Geographic information may be provided explicitly in a document as geographic coordinates. More common is implicit information, such as the name of a city or data facility inside the body of the document, that one can transform into an explicit geographic coordinate.
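The distinction between explicit and implicit geographic references can be illustrated with a minimal sketch. It assumes a small hypothetical gazetteer (a table mapping place names to approximate coordinates) and a simple pattern for decimal-degree pairs; the names `GAZETTEER`, `COORD_RE` and `extract_locations` are invented for this example, and a practical system would need a far larger gazetteer and handling of ambiguous or multi-word place names.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "boston": (42.36, -71.06),
    "nairobi": (-1.29, 36.82),
}

# Explicit references: decimal-degree pairs such as "42.36, -71.06".
COORD_RE = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def extract_locations(text: str) -> List[Tuple[str, float, float]]:
    """Return (source, latitude, longitude) tuples found in the text."""
    found: List[Tuple[str, float, float]] = []
    # Explicit coordinates written directly in the document.
    for match in COORD_RE.finditer(text):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            found.append(("explicit", lat, lon))
    # Implicit references: place names resolved through the gazetteer.
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if key in GAZETTEER:
            lat, lon = GAZETTEER[key]
            found.append((key, lat, lon))
    return found


sample = "The trial site near Nairobi reported results; the clinic is at 42.36, -71.06."
locations = extract_locations(sample)
print(locations)
```

The resulting tuples are the kind of geographic metadata that could be stored alongside a document's title, author and publication date.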
Ascertaining the coordinates is only one part of the problem to be solved. What one needs is a method to utilize these coordinates in searching. Current content management systems for searching unstructured information have no spatial search capability.
For instance, if a pharmaceutical company has run a series of drug trials over a period of time, written the results of the trials into documents, and stored those documents in a content management system for safekeeping, there is presently no easy way for a researcher to find out who conducted which trials, when and where.
For instance, the researcher might be interested in learning about drug trials for tropical diseases, in which countries these trials have been held and over what time periods. The researcher might be looking for specific references inside a report about adverse reactions to a particular drug for a tropical disease and where the adverse reactions occurred.
This information cannot typically be ascertained from a computer-generated index of text documents in a large repository in a way that culls out the most relevant documents. This is because there are three things that need to be incorporated into an efficient unstructured information search. The first is the location of things or places within the content of the individual document. The second is data about the documents (metadata), analogous to a card catalog at a library. For instance, one may be interested in a particular author who is an expert in a given field and would like to find reports that were published by that author. Likewise, there may be a date range that the investigator is interested in, for instance what happened in a particular field in the late 1970s, such that the search is focused on that time span.
A third parameter for the search is specific words or phrases that are important to the researcher, assuming that these words and phrases are mentioned in the text.
What is therefore necessary is a system that combines all three ways of searching into a single query, in which the search automatically finds the intersection of the three types of results and returns a list of specific reports to retrieve and read. If such a system could be devised, it would save considerable time, allowing one to specify a focused query for accurate results while looking through thousands or millions of documents in a repository.
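A minimal sketch of such a combined query follows, assuming each report already carries its metadata and extracted locations. The `Report` record, the bounding-box form of the spatial constraint, and the sample data are all hypothetical and are offered only to show how the three kinds of results intersect; a real repository would evaluate these conditions inside a spatially enabled database rather than in memory.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Report:
    title: str
    author: str
    published: date
    text: str
    locations: List[Tuple[float, float]]  # (latitude, longitude) pairs


def in_box(loc: Tuple[float, float], box: Tuple[float, float, float, float]) -> bool:
    """box is (south, west, north, east) in degrees; assumes it does not cross the antimeridian."""
    lat, lon = loc
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def combined_search(reports, box, start, end, phrase):
    """Return reports satisfying the spatial, metadata (date) and text conditions together."""
    phrase = phrase.lower()
    return [
        r for r in reports
        if any(in_box(loc, box) for loc in r.locations)  # spatial condition
        and start <= r.published <= end                  # metadata condition
        and phrase in r.text.lower()                     # text condition
    ]


# Hypothetical repository contents.
repository = [
    Report("Trial 1", "A. Author", date(1978, 5, 1), "Adverse reaction observed in two patients.", [(-1.29, 36.82)]),
    Report("Trial 2", "B. Author", date(1985, 3, 1), "Adverse reaction observed.", [(-1.29, 36.82)]),
    Report("Trial 3", "A. Author", date(1979, 1, 1), "Adverse reaction observed.", [(51.51, -0.13)]),
    Report("Trial 4", "C. Author", date(1977, 9, 1), "No issues reported.", [(-1.29, 36.82)]),
]

results = combined_search(
    repository,
    box=(-5.0, 33.0, 5.0, 42.0),
    start=date(1975, 1, 1),
    end=date(1979, 12, 31),
    phrase="adverse reaction",
)
titles = [r.title for r in results]
print(titles)
```

Only the first report satisfies all three conditions; each of the others fails exactly one of them, which is the narrowing effect the single combined query is meant to provide.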
For purposes of the subject invention, unstructured information is data, such as documents, that is not normally stored in tables in a relational database management system. Examples are text documents, email messages, video clips or pictures. Such items are difficult to search with any precision beyond what is provided by a simple text-based search that matches user-provided words against words found in the body of the document. When run against a large body of documents, text-based searches often return hundreds or thousands of results, which would be much too time-consuming to read.
For those applications where the locations of entities are described in the body of the content or in its metadata, location is an important search attribute. Presently there are no content management systems that enable focused unstructured information searching and/or discovery based on the explicit or implicit spatial attributes of the content.