Speech recognition and computer vision do not couple well with higher-level processing such as databases, query/retrieval applications, or archiving and indexing applications. These difficulties arise in part because there is no intermediate layer that represents the time-varying features of sound and vision in a domain-independent way that can be easily processed using conventional algorithmic methods. For example, a conventional speech processing algorithm may be arranged to map each sound utterance into one or more text characters that can be used as baseline features.
A conventional algorithmic method can then evaluate these baseline features to form words and phrases, which can in turn be indexed by a database. However, each time the baseline features change, the entire database has to be re-indexed. In some instances, the algorithms themselves may need to be changed when the baseline features change. Scaling such systems can also be difficult because each algorithmic module may need to be isolated from all the others. The close coupling between the algorithms and the baseline features may also make it difficult to optimize higher-level algorithms or to reuse standard algorithms.
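
The re-indexing problem described above can be illustrated with a minimal sketch. The recognizer output, the function names words_from_features and build_index, and the clip identifiers are all hypothetical and chosen only for illustration; the sketch assumes a simple inverted index and is not a description of any particular conventional system.

```python
from collections import defaultdict


def words_from_features(features):
    """Join per-utterance character features into words, splitting on spaces."""
    return "".join(features).split()


def build_index(documents):
    """Build an inverted index mapping each word to the set of clip ids containing it."""
    index = defaultdict(set)
    for doc_id, features in documents.items():
        for word in words_from_features(features):
            index[word].add(doc_id)
    return dict(index)


# Baseline features from a hypothetical recognizer: one text character per utterance.
documents_v1 = {
    "clip1": list("open door"),
    "clip2": list("close door"),
}
index_v1 = build_index(documents_v1)

# A revised (hypothetical) recognizer emits upper-case characters for the same audio.
documents_v2 = {
    doc_id: [ch.upper() for ch in features]
    for doc_id, features in documents_v1.items()
}

# The old index is keyed on the old features, so a query phrased in the
# new features finds nothing until the whole index is rebuilt.
stale_hits = index_v1.get("DOOR", set())
index_v2 = build_index(documents_v2)
fresh_hits = index_v2.get("DOOR", set())
```

In this sketch, the index keys are derived directly from the baseline features, so even a small change in the feature representation invalidates every entry and forces a full rebuild.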