1. Field of the Invention
The present invention relates, in general, to natural language processing, and, more particularly, to systems and methods for filtering and categorizing plain text messages to identify unique or atypical messages.
2. Relevant Background
Today, large volumes of data are communicated electronically between individuals and organizations. Often these electronic messages are transferred automatically in newsfeeds or the like. Some automatic processing and reporting domains, particularly in government, news, or financial applications, generate several thousand reports each day, all of which are transmitted and received as electronic messages.
To use this large volume of information effectively, it is necessary to identify trends and to organize, correlate, and fuse parametric, attribute, and free-formatted data. These processes allow a user to induce knowledge from the raw factual information contained in the plain text messages. Automated tools for analyzing message streams are seldom adequate to induce information from the data stream, and the effectiveness of the available tools depends heavily on the level of experience and training of the system users.
A typical message storage system receives messages over a network or communications line and archives the messages to long-term storage. A retrieval and dissemination system matches messages against user query profiles, and a user interface provides message display, annotation, and retrospective search capabilities. Some applications support counting and graphing capabilities, such as histograms and bar charts, that aid the skilled user in inducing information from the raw messages. Software indexing schemes or specialized hardware accelerators support search of the message database. Systems of this type typically receive thousands of messages and megabytes or gigabytes of data daily.
Trained personnel responsible for analyzing and assimilating the content of the incoming messages typically build complicated profiles to extract specific messages from the message stream. These profiles may be either highly specific or very broad. Highly specific, elaborate, and exact profiles tend to generate high-precision matches against the message stream for known topics of interest. In contrast, very broad profiles are designed to prevent missing important information by returning large amounts of information with a lower probability of being relevant to the profile. High-precision profiles generate short read queues but are prone to miss relevant messages that are near but not contained within the search volume specified by the profile. High-recall (broad) profiles generate long read queues, requiring users to read each message within the broad search volume specified by the profile and determine its potential relevance.
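The trade-off between highly specific and very broad profiles can be illustrated with a minimal sketch. The sketch assumes a profile is simply a list of terms, that a specific profile requires all of its terms to appear in a message, and that a broad profile requires any one of them. The sample messages, the set of relevant messages, and the function names are hypothetical and are not taken from any particular system.

```python
def tokenize(text):
    """Split a plain text message into a set of lowercase words."""
    return set(text.lower().split())

def match_all(terms, text):
    """Highly specific profile: every term must appear in the message."""
    return set(terms) <= tokenize(text)

def match_any(terms, text):
    """Very broad profile: at least one term must appear in the message."""
    return bool(set(terms) & tokenize(text))

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a read queue against known relevant messages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical message stream, keyed by message identifier.
MESSAGES = {
    1: "bank reports wire fraud in quarterly filing",
    2: "fraud suspected at regional bank branch",
    3: "bank opens new branch downtown",
    4: "wire service reports weather delays",
}
RELEVANT = {1, 2}  # assumed ground truth: the messages that concern bank fraud
TERMS = ["bank", "wire", "fraud"]

# Read queues produced by each style of profile.
narrow_queue = [i for i, text in MESSAGES.items() if match_all(TERMS, text)]
broad_queue = [i for i, text in MESSAGES.items() if match_any(TERMS, text)]

print(narrow_queue, precision_recall(narrow_queue, RELEVANT))
print(broad_queue, precision_recall(broad_queue, RELEVANT))
```

In this toy stream the specific profile yields a one-message read queue with full precision but misses message 2, which lies just outside its search volume, while the broad profile retrieves every relevant message at the cost of a queue twice as long as necessary.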
One weakness of a template profile approach is that it requires extensive a priori knowledge to initially construct the profiles. Any time a priori knowledge is required, the system becomes user-skill dependent. Also, significant resources are required to read the large number of messages collected. Moreover, long-term maintenance of the profiles requires expensive user resources because of the dynamic nature of the message streams. Maintaining high levels of query precision and recall efficiency requires continuous tuning of the query profile terms.
A desirable feature of a message processing system is the ability to identify new trends in the message stream. New trend identification depends upon a skilled user spontaneously noticing new patterns in the data. This difficult task is hampered by many factors, such as large volumes of data, user fatigue, differing levels of training and experience, and personnel turnover. Effective trend identification requires that the user be made aware of the content and context of the data contained in the message stream.
Recognizing new situations is fundamentally a process that humans are much better at than computers. When it is not possible to fully automate the cognitive aspects required to process report data, it is desirable to quickly focus the attention of an experienced human user on the most important information first and to reduce the workload of the user to the greatest extent possible without sacrificing performance. What is needed is a machine tool and system that assists a human user in the tasks of situation awareness, analysis, and problem recognition, thereby reducing the workload on the system users and raising overall quality.