There are many tools available for organizing and accessing documents that match specific criteria, such as containing certain keywords, key phrases, and their boolean combinations (Salton 1989). An important class of key phrases is named entities, such as the names of people, organizations, places, and dates. In addition to the presence of directly observable entities, there are indirect criteria that enhance document organization and access. For example, a document may describe an illegal act without using the words “illegal” or “unlawful” even once; it may allude to ‘the largest Italian daily’ without mentioning “Corriere della Sera”; or it may describe an oil reservoir at latitude 61.3 N longitude 1.16 W without containing these coordinates, just by saying “a hundred miles north of Lerwick”. Adding explicit markers to the text to distinguish entity names and to make explicit the information that can be inferred about them, usually by means of a formal markup language such as SGML or XML, is commonly called named entity tagging. For a modern introduction to Information Retrieval and Information Extraction, see R. Mitkov (ed.): Handbook of Computational Linguistics, Oxford University Press 2003, chapters 29 and 30.
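By way of illustration only, named entity tagging with XML-style markup can be sketched as follows. The lexicon, the `entity` element, and its `type` and `ref` attributes are hypothetical names chosen for this example and are not taken from any particular tagging standard; a practical tagger would rely on far larger resources and on statistical or rule-based disambiguation.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical toy lexicon: surface phrase -> (entity type, inferred canonical name).
LEXICON = {
    "the largest Italian daily": ("organization", "Corriere della Sera"),
    "Lerwick": ("place", "Lerwick"),
}

# Longest phrases first, so that a longer phrase wins over any shorter one it contains.
_PATTERN = re.compile(
    r"\b(?:%s)\b"
    % "|".join(re.escape(p) for p in sorted(LEXICON, key=len, reverse=True))
)


def tag_entities(text):
    """Return text with every lexicon phrase wrapped in an <entity> element."""
    pieces = []
    pos = 0
    for match in _PATTERN.finditer(text):
        # Copy the untagged text before the match, escaping XML special characters.
        pieces.append(escape(text[pos:match.start()]))
        entity_type, canonical = LEXICON[match.group(0)]
        pieces.append(
            "<entity type=%s ref=%s>%s</entity>"
            % (quoteattr(entity_type), quoteattr(canonical), escape(match.group(0)))
        )
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_entities("An editorial in the largest Italian daily discussed the case."))
```

Here the `ref` attribute carries the inferred information (the name of the newspaper) that the raw text never states explicitly.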
In particular, the use of directly mentioned or inferred geographic coordinates as a document selection criterion is well established (Woodruff and Plaunt 1994). In many cases, documents enrolled in the system either contain explicit geographic coordinates or can have such coordinates assigned to them manually. This labor-intensive process is called manual tagging: human readers inspect the documents, look up the coordinates of key places mentioned in each document in an atlas or database, and add tags by hand. From the perspective of Information Retrieval and Information Extraction, documents without tags (also called raw or untagged documents) are considerably less valuable than tagged documents, and machine algorithms capable of automating the manual work are of great practical interest.
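The lookup step that a human tagger performs with an atlas can be approximated, in its simplest form, by matching place names against a gazetteer. The following is a minimal sketch under stated assumptions: the `GAZETTEER` table, its two entries, and the approximate coordinates are illustrative only, and the sketch does not attempt to resolve ambiguous names or relative expressions such as “a hundred miles north of”.

```python
import re

# Hypothetical miniature gazetteer. Coordinates are approximate decimal degrees,
# with north and east positive (so a western longitude is negative).
GAZETTEER = {
    "Lerwick": (60.15, -1.15),
    "Milan": (45.46, 9.19),
}


def geotag(text):
    """Return one tag per gazetteer place mentioned in text, in reading order."""
    tags = []
    for name, (lat, lon) in GAZETTEER.items():
        for match in re.finditer(r"\b%s\b" % re.escape(name), text):
            tags.append(
                {"name": name, "start": match.start(), "lat": lat, "lon": lon}
            )
    return sorted(tags, key=lambda tag: tag["start"])


if __name__ == "__main__":
    for tag in geotag("The survey vessel sailed from Lerwick on Monday."):
        print(tag)
```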
Many tools commonly used for organizing and accessing documents, in particular web search engines such as Google or Yahoo, also incorporate a step of relevance ranking, whereby documents deemed more relevant to the user's query are presented to the user earlier than less relevant documents. Importantly, such a step cannot rely entirely on manual pre-classification or ranking, since the same document will be relevant to some user queries and irrelevant to many others. The standard method for ranking, called “TF-IDF”, is described, e.g., in S. E. Robertson and K. Sparck Jones: Simple, proven approaches to text retrieval. University of Cambridge Computer Laboratory Technical Report 356, May 1997.
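For concreteness, a simplified form of TF-IDF ranking can be sketched as follows. This is one common textbook variant, in which a document's score is the sum over query terms of the term's count in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. It is not necessarily the exact weighting given in the cited report, and the whitespace tokenizer and function names are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices ordered from most to least relevant to query."""
    n_docs = len(documents)
    term_counts = [Counter(tokenize(doc)) for doc in documents]

    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scored = []
    for index, counts in enumerate(term_counts):
        score = 0.0
        for term in tokenize(query):
            if doc_freq[term]:
                # Term frequency times inverse document frequency.
                score += counts[term] * math.log(n_docs / doc_freq[term])
        scored.append((score, index))

    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


if __name__ == "__main__":
    docs = [
        "oil field north of lerwick",
        "italian daily newspaper",
        "oil prices and oil markets",
    ]
    print(rank("oil", docs))
```

Because the scores depend on the query, the same collection is reordered for every query, which is why ranking of this kind cannot be precomputed by hand.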
For further background, the reader is referred to the description of the Geographic Text Search (GTS) Engine found in U.S. patent application Ser. No. 09/791,533, filed Feb. 22, 2001, and entitled “Spatially Coding and Displaying Information,” incorporated herein by reference.