1. Technical Field
This invention generally relates to the processing of streamed data, and more specifically relates to real-time mining and reduction of streamed data to reduce the amount of data stored in a database.
2. Background Art
A variety of devices can provide information in electronic form that may need to be analyzed. For example, a system in London, England uses cameras to track the license plate numbers of all vehicles in the downtown London area. This type of system allows tracking the vehicles in the downtown area, and specifically allows for determining whether certain vehicles (such as those with identified license plates that belong to suspected terrorists) are in the downtown London area. One can readily appreciate that a large number of vehicles go in and out of the downtown London area each day. The data corresponding to the license plate numbers for all these vehicles streams in from the data collection system. The data may include, for example, the camera location, date, time, license plate number, speed, and other related data. Typically this data is packaged as an Extensible Markup Language (XML) record, and is streamed via various communication media to a processing facility. At the processing facility, the data is typically written to a database, where it may be accessed to determine whether the data corresponds to a specified list of license plates. This type of system requires a significant amount of storage. Because the vast majority of the license plates belong to law-abiding citizens, the vast majority of the data is discarded once it has been analyzed and the license plate has been determined not to be on the specified list of license plates of interest. However, the mere collection of all this data as it streams in from the cameras requires a substantial amount of storage, and requires complex algorithms for mining the data after it is stored and discarding the data that is not of interest.
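For illustration only, the store-then-mine approach described above might be sketched as follows. The XML element names, the sample records, the table layout, and the function names are all hypothetical and are not taken from any actual system; the sketch merely shows every streamed record being written to a database before the list of license plates of interest is consulted.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML records; the element names are assumptions for illustration.
SAMPLE_RECORDS = [
    "<sighting><camera>C-017</camera><date>2006-03-01</date>"
    "<time>08:15:02</time><plate>AB12CDE</plate><speed>28</speed></sighting>",
    "<sighting><camera>C-022</camera><date>2006-03-01</date>"
    "<time>08:15:09</time><plate>XY34ZZZ</plate><speed>31</speed></sighting>",
    "<sighting><camera>C-017</camera><date>2006-03-01</date>"
    "<time>08:16:40</time><plate>CD56EFG</plate><speed>25</speed></sighting>",
]


def store_all(conn, xml_records):
    """Write every streamed record to the database, whether or not it is of interest."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sightings "
        "(camera TEXT, date TEXT, time TEXT, plate TEXT, speed INTEGER)"
    )
    for xml_text in xml_records:
        rec = ET.fromstring(xml_text)
        conn.execute(
            "INSERT INTO sightings VALUES (?, ?, ?, ?, ?)",
            (
                rec.findtext("camera"),
                rec.findtext("date"),
                rec.findtext("time"),
                rec.findtext("plate"),
                int(rec.findtext("speed")),
            ),
        )
    conn.commit()


def find_plates_of_interest(conn, watch_list):
    """Mine the stored data afterwards for plates on the specified list."""
    if not watch_list:
        return []
    placeholders = ",".join("?" for _ in watch_list)
    query = (
        "SELECT camera, date, time, plate FROM sightings "
        f"WHERE plate IN ({placeholders}) ORDER BY date, time"
    )
    return conn.execute(query, list(watch_list)).fetchall()
```

In this sketch, all three sample records occupy storage even though only one plate is on the list, which is the storage cost noted above.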
Radio Frequency Identification (RFID) presents a new paradigm where vast amounts of data are typically stored for later mining and reduction. Wal Mart and the U.S. Department of Defense have mandated that their suppliers have RFID tags on all items that cost more than one dollar. As a result, systems are being developed that allow collecting the huge amounts of data for RFID systems. These systems typically dump all the RFID data into a database for subsequent processing (e.g., data mining and reduction). One can easily appreciate that a semi-trailer load of goods being delivered to a Wal Mart store may include tens or hundreds of thousands of items, or potentially millions of items. Once the trailer gets within range of an RFID scanner, each RFID tag will respond with its data, and the collecting system will have to receive, store and analyze all of this information. Even with the availability of high density storage devices, retaining the volumes of new information produced by RFID devices for post-processing and reduction can quickly become cost-prohibitive in terms of both hardware and personnel. Traditional tools that store all of the data in a database, then analyze the stored data, require a significant amount of storage. For example, at a Wal Mart distribution warehouse, dozens or hundreds of trucks may be loaded and dispatched to different destinations every day. Tracking this much information using prior art techniques that store all of the data requires a huge amount of storage. In many cases, not all of the individual data is needed. For example, a system may not need the individual identifiers for each bag of candy, but may simply need a total count of the number of bags of the same candy. This type of operation is known as an aggregation in the database world.
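As a minimal sketch of such an aggregation under the prior art approach, the following stores every tag read in a database and only then counts the reads per product. The tag identifiers, product codes, table layout, and function name are hypothetical and chosen purely for illustration.

```python
import sqlite3

# Hypothetical tag reads as (tag_id, product_code) pairs; values are assumptions.
TAG_READS = [
    ("tag-0001", "candy-A"),
    ("tag-0002", "candy-A"),
    ("tag-0003", "candy-B"),
    ("tag-0004", "candy-A"),
]


def count_by_product(tag_reads):
    """Store every individual read, then aggregate to a count per product."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tag_reads (tag_id TEXT PRIMARY KEY, product_code TEXT)"
    )
    # Every read is written to the database before any reduction takes place.
    conn.executemany("INSERT INTO tag_reads VALUES (?, ?)", tag_reads)
    rows = conn.execute(
        "SELECT product_code, COUNT(*) FROM tag_reads "
        "GROUP BY product_code ORDER BY product_code"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Calling `count_by_product(TAG_READS)` returns `{"candy-A": 3, "candy-B": 1}`. The result holds two values, yet four rows had to be stored to obtain it.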
Storing thousands or millions of RFID identifiers in a database simply to count the number of records that have similar RFID identifiers is inefficient, because it requires a huge amount of storage. Without a way to mine and reduce streamed data in real time as the data is collected and before it is stored in a database, the computer industry will continue to suffer from inefficient mechanisms and methods for collecting and analyzing streamed data.