1. Field of the Invention
The present invention relates generally to document retrieval, and more particularly to the use of predictive word sequences related to a subject in order to access documents from a document collection.
2. Background
Numerous applications require text mining in large document collections. As the availability of digitized document content increases, the importance of efficient methods and systems for text mining, document access, and document retrieval increases. One such application is the text mining or retrieval of aviation safety records, where numerous aircraft and airport reports are searched to discover various safety-related events or concerns.
In many of these applications, a search query is generated and the document collection is searched using the search query to access or retrieve matching documents. The document collection can contain labeled documents as well as unlabeled documents. The labeled documents can include partially labeled, fully labeled, and incorrectly labeled documents. Documents can be manually and/or automatically analyzed, and various tags or labels can be assigned to the respective documents to categorize the documents within the collection. A reliable document access system should be able to handle such omissions and inaccuracies in the labeling of the document collection.
Many conventional approaches address finding highly predictive word sequences to access documents related to a specified subject from document collections. Word sequences constructed from document collections can have high dimensionality, i.e., there may be a large number of distinct word sequences. In order to address issues associated with the high dimensionality of word sequences, many conventional approaches focus on finding the most frequently occurring sequences. While these approaches are useful, in many applications, such as accessing or retrieving aviation safety reports, there are highly predictive word sequences that are relatively rare. Although rare and highly predictive words can often be identified by subject matter experts, such identification requires excessive amounts of manual effort.
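By way of illustration only, the limitation of frequency-based selection can be sketched with a small example. The toy documents, their labels, the `min_support` threshold, and the helper names `ngrams` and `precision` below are all hypothetical and are not drawn from any particular conventional system. The sketch counts two-word sequences in a labeled collection and shows that a simple frequency cutoff can keep common but uninformative sequences while discarding rare sequences that occur only in documents related to the subject.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy collection: (text, label); label 1 = related to the subject.
docs = [
    ("engine fire warning during climb", 1),
    ("crew reported normal landing at airport", 0),
    ("crew reported normal takeoff at airport", 0),
    ("crew reported normal taxi at airport", 0),
    ("bird strike on final approach", 1),
    ("crew reported normal approach at airport", 0),
]

doc_freq = Counter()  # number of documents containing each two-word sequence
pos_freq = Counter()  # number of those documents that are labeled 1
for text, label in docs:
    for gram in set(ngrams(text.split(), 2)):
        doc_freq[gram] += 1
        pos_freq[gram] += label

def precision(gram):
    """Fraction of documents containing `gram` that are labeled 1."""
    return pos_freq[gram] / doc_freq[gram]

# Assumed frequency cutoff: keep sequences seen in at least 2 documents.
min_support = 2
frequent = {g for g, c in doc_freq.items() if c >= min_support}

# Sequences dropped by the cutoff that nonetheless occur only in labeled documents.
rare_predictive = {
    g for g, c in doc_freq.items()
    if c < min_support and precision(g) == 1.0
}

print(sorted(frequent))
print(sorted(rare_predictive))
```

In this toy collection, the sequences that survive the cutoff occur only in documents unrelated to the subject, whereas a sequence such as "bird strike" is discarded even though every document containing it is related to the subject.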
Efficient and accurate methods and systems are therefore desired for accessing documents based on constructed word sequences.