1. Field
This invention relates to complex data streams, and particularly to methods, systems and computer program products for simplifying complex data stream problems involving feature extraction from noisy data.
2. Description of Background
The volume of stored data is growing rapidly, and the majority of it is stored as unstructured information. This data may contain complex entities of interest, such as chemical, gene, protein, bio or nano diagrams, sketches or pictures, embedded in data streams. It is currently difficult for a machine to extract and analyze such structures from data streams efficiently and accurately using existing techniques. The software that implements conventional techniques is also extremely difficult to maintain.
The state of the art is to apply techniques such as natural language processing (NLP) and conditional random fields (CRF) to allow computers to interpret unstructured data. These ‘clean’ data techniques are successful only if the majority of the data is uniform and well formatted. Unfortunately, real data is ‘noisy’ and requires extra effort to remove the noise. A noisy data stream presents a significant challenge to typical stream processing technology, which expects to process the data sequentially, recognizing and annotating or extracting structures on the fly. In particular, it is difficult to recognize a structure of unpredictable length using a set of sequentially applied transformations, because a transformation that cleans up noise may also destroy the structure. The alternative, concurrent data stream processing, is complex and typically expensive to maintain.