(1) Field of Invention
The present invention relates to a system for embedding visual intelligence and, more particularly, to a system for embedding visual intelligence that enables machines to visually perceive and contemplate through visual intelligence modules and system integration.
(2) Description of Related Art
Visual processing is the flow of information from visual sensors to cognitive processing. Typical visual processing methods first decompose scenes into objects, track them, and then attempt to recognize spatio-temporal actions by using sophisticated hand-coded models. Since these models are either built manually or use a fixed structure (i.e., not extensible), they do not account for wide variations in actions, and cannot generalize to newer actions. Traditional symbolic reasoning systems rely heavily on hand-crafted, domain-specific knowledge, pre-defined symbolic descriptions, and the assumption that perception and reasoning are independent, sequential operations. However, real-world problems require richly intertwined dynamic methods for perception and reasoning in order to envision possible scenarios, acquire new knowledge, and augment cognitive capabilities.
The prior art described below includes limitations in generic event representation; building concept hierarchies and graphical models for action understanding; and reasoning, envisionment, and grounding. For instance, regarding limitations of current spatio-temporal patterns, dynamics-based approaches to visual intelligence rely on optical flow patterns to segment and classify actions (see Literature Reference No. 73). These approaches model velocity patterns of humans (e.g., ballistic, spring-mass movements) and report 92% accuracy. However, this value was reported for only two classes of actions, and the algorithm has not been shown to scale well with more classes and moving clutter. Motion-history-based approaches are generally computationally inexpensive (see Literature Reference Nos. 49, 50, 55, 74). However, these approaches require an image alignment process to make the features position-invariant, which makes the method sensitive to noise in the silhouettes used. Pixel-level “bag of words” based approaches also use space-time features from various sized video “cuboids”, the collection of which is used to represent the action in video (see Literature Reference Nos. 17, 18, 37). These approaches, however, disregard information on the spatial groupings of sub-blocks.
Regarding limitations of current spatio-temporal concepts, the use of AND/OR graphs (see Literature Reference Nos. 21, 41) for behavior recognition offers an elegant solution to represent structure. However, variation in the expression of an action and across classes of action is not handled. Some approaches focus on human pose estimation and dynamics (see Literature Reference Nos. 40, 72). Unfortunately, they lack extensibility in generic action modeling. Use of Latent Semantic Analysis (see Literature Reference No. 53) offers unsupervised learning but lacks spatial and temporal invariance.
Regarding limitations of current reasoning, envisionment, and grounding systems, several cognitive architectures (see Literature Reference Nos. 1, 33) elucidate psychology experiments. However, they do not scale well to large problems and often lack the ability to store perceptual memories, including imagery. Case-based reasoning systems (see Literature Reference Nos. 4, 20, 67) can examine and produce perceptual symbols, but are typically built with little generalization across application domains. Probabilistic logic methods (see Literature Reference Nos. 28, 57) handle uncertainty well but require significant tuning for new domains, and can be computationally cumbersome. Existing symbolic representations of spatio-temporal actions (see Literature Reference Nos. 19, 63) can perform visual inspection, yet lack mental imagery capabilities.
Current approaches cannot accomplish the range of recognition, reasoning, and inference tasks described by the present invention. Thus, a continuing need exists for a system that integrates visual processing and symbolic reasoning to emulate visual intelligence.