Technical Field
The present invention relates generally to data stream applications and, more particularly, to a system and method for resource-adaptive, real-time new event detection.
Description of the Related Art
In a document streaming environment, documents may come from one or more sources. New event detection (NED) is the task of capturing the first documents that mention previously unseen events. This task has practical applications in several domains where useful information is buried in a large amount of data that grows rapidly with time. Such domains include, but are not limited to, intelligence gathering, financial market analyses, news analyses, and so forth. Applications in those domains are often time-critical, and the use of an online new event detection (ONED) system is highly desirable.
Turning to FIG. 1, events in a document stream are indicated generally by the reference numeral 100. In FIG. 1, different shapes correspond to different events, and filled shapes represent the documents that need to be captured.
Recently, ONED has attracted much attention. In order to provide a standard benchmark for comparing different algorithms, the National Institute of Standards and Technology (NIST) has organized a Topic Detection and Tracking (TDT) program, where ONED is one of the main tasks. Despite these efforts, there is still a significant gap between the state-of-the-art ONED systems and a system that can be used in practice.
Most of the existing ONED systems compare a new document D to all the old documents that arrived in the past. If the similarity values between D and the old documents are all below a certain threshold, D is predicted to mention a new event. Because each arriving document is compared against every earlier document, this method has quadratic time complexity with respect to the number of documents and is rather inefficient. For example, in the latest TDT5 competition, many systems spent several days processing just 280,000 news articles, whose total size is less than 600 MB. This processing speed is orders of magnitude slower than a typical document arrival rate.
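The conventional comparison scheme described above can be illustrated with a minimal sketch. The sketch assumes raw term-frequency vectors, cosine similarity, and a threshold of 0.5; none of these choices is specified above, and the function names `cosine` and `detect_new_events` are hypothetical and used for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(docs, threshold=0.5):
    """Flag each document that is dissimilar to every earlier document.

    The threshold value is an illustrative assumption.
    """
    seen = []   # vectors of all old documents
    flags = []  # True where a document is predicted to mention a new event
    for text in docs:
        vec = Counter(text.lower().split())
        # Document D is compared to ALL old documents, so processing n
        # documents costs on the order of n^2 similarity computations.
        is_new = all(cosine(vec, old) < threshold for old in seen)
        flags.append(is_new)
        seen.append(vec)
    return flags


stream = [
    "earthquake hits city",
    "earthquake hits city centre",
    "stock market rallies",
]
print(detect_new_events(stream))
```

In this sketch, the inner loop over `seen` grows with every arriving document, which is the source of the quadratic cost noted above.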
In practice, an ONED system can monitor a large number of document sources. For example, Google News has 4,500 sources and Yahoo! News has more than 5,000 sources. In other applications such as intelligence gathering, document sources can cover an even wider spectrum including, e.g., emails, instant messages, web bulletin boards, blogs, and so forth. Therefore, a practical ONED system needs to handle a high document arrival rate without resorting to an excessive amount of hardware resources. Moreover, due to the bursty nature of document streams, an ONED system should be able to operate gracefully even if it runs out of resources. These performance issues, however, have not been addressed in previous studies.
Turning to FIG. 2, a conventional online new event detection (ONED) system is indicated generally by the reference numeral 200. An output of the ONED system 200 is provided to an output queue 210, where it waits to be consumed by a consumer 220. The consumer 220 can be, for example, a person or a computer program that performs further in-depth analysis (e.g., machine translation). The processing speed of the consumer 220 can be much slower than the peak output rate of the ONED system 200. For example, state-of-the-art machine translation speed is measured in words per second.
None of the existing ONED systems has considered the following user interface issues: (1) when the consumer is overloaded and cannot keep pace with the output rate of the ONED system, less important documents need to be dropped from the queue (or moved to a low-priority queue) so that the consumer can focus on important documents; and (2) depending on the concrete requirement of the consumer, documents can be sorted in the queue according to different criteria (e.g., importance or arrival time) so that desired documents are processed by the consumer first.
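The two queue behaviors listed above can be illustrated with a minimal sketch. It assumes that each document carries a numeric importance score and an arrival sequence number, and that the queue has a fixed capacity; how importance is computed is not specified above. The names `Doc` and `OutputQueue` are hypothetical and used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    importance: float  # assumed score; higher means more important
    arrival: int       # assumed arrival sequence number


class OutputQueue:
    """Bounded output queue with a low-priority overflow list."""

    def __init__(self, capacity, order_by="importance"):
        if order_by not in ("importance", "arrival"):
            raise ValueError("order_by must be 'importance' or 'arrival'")
        self.capacity = capacity
        self.order_by = order_by
        self.items = []         # documents waiting for the consumer
        self.low_priority = []  # documents moved out under overload

    def push(self, doc):
        self.items.append(doc)
        if len(self.items) > self.capacity:
            # Issue (1): when the consumer falls behind, move the least
            # important document to the low-priority queue.
            victim = min(self.items, key=lambda d: d.importance)
            self.items.remove(victim)
            self.low_priority.append(victim)

    def pop(self):
        if not self.items:
            return None
        # Issue (2): hand out documents according to the chosen criterion.
        if self.order_by == "importance":
            nxt = max(self.items, key=lambda d: d.importance)
        else:
            nxt = min(self.items, key=lambda d: d.arrival)
        self.items.remove(nxt)
        return nxt


queue = OutputQueue(capacity=2, order_by="importance")
for d in (Doc("a", 0.2, 1), Doc("b", 0.9, 2), Doc("c", 0.5, 3)):
    queue.push(d)
print(queue.pop().doc_id, [d.doc_id for d in queue.low_priority])
```

The linear scans keep the sketch short; a heap or other indexed structure would be a natural substitute for larger queues.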