People learn language through exposure to a rich perceptual context. Language is grounded by mapping words, phrases, and sentences to meaning representations referring to the world.
It has been shown that, even with referential uncertainty and noise, a system based on cross-situational learning can robustly acquire a lexicon, mapping words to word-level meanings from sentences paired with sentence-level meanings. However, that system did so only for symbolic representations of word- and sentence-level meanings that were not perceptually grounded. An ideal system would not require detailed word-level labelings to acquire word meanings from video, but rather could learn language in a largely unsupervised fashion, just as a child does, from video paired with sentences.
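For illustration only, the idea of cross-situational learning over symbolic meanings can be sketched as follows. This is a minimal toy under stated assumptions, not the system referred to above: each sentence is assumed to be paired with exactly one sentence-level meaning, given as a set of symbols, and the candidate meaning of a word is taken to be the intersection of the meanings of all sentences in which the word occurs. The function name `cross_situational_lexicon`, the example sentences, and the meaning symbols are all hypothetical, and the sketch does not handle noise or referential uncertainty.

```python
from typing import Dict, Iterable, Set, Tuple


def cross_situational_lexicon(
    pairs: Iterable[Tuple[str, Set[str]]]
) -> Dict[str, Set[str]]:
    """Toy cross-situational learner (hypothetical helper).

    For every word, keep only the meaning symbols that are present in
    every sentence-level meaning the word has been paired with.
    """
    candidates: Dict[str, Set[str]] = {}
    for sentence, meaning in pairs:
        for word in sentence.lower().split():
            if word not in candidates:
                # First occurrence: every symbol in the scene is a candidate.
                candidates[word] = set(meaning)
            else:
                # Later occurrences: prune symbols absent from this scene.
                candidates[word] &= meaning
    return candidates


# Hypothetical sentences paired with hypothetical symbolic meanings.
corpus = [
    ("the person picked up the ball", {"PERSON", "PICK_UP", "BALL"}),
    ("the person put down the chair", {"PERSON", "PUT_DOWN", "CHAIR"}),
    ("the ball approached the chair", {"BALL", "APPROACH", "CHAIR"}),
    ("the person picked up the chair", {"PERSON", "PICK_UP", "CHAIR"}),
]

lexicon = cross_situational_lexicon(corpus)
for word in ("person", "ball", "chair", "picked", "the"):
    print(word, sorted(lexicon[word]))
```

In this toy corpus, the nouns are narrowed to a single symbol, the word "picked" remains ambiguous because it only ever co-occurs with the same two symbols, and the word "the" ends with no candidates because no symbol is shared by every scene. The sketch operates purely on symbols and is therefore not perceptually grounded.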
There has been research on grounded language learning. One approach pairs training sentences with vectors of real-valued features extracted from synthesized images depicting 2D blocks-world scenes, in order to learn a specific set of features for adjectives, nouns, and adjuncts.
Another approach pairs training images containing multiple objects with spoken name candidates for those objects, in order to find the correspondence between lexical items and visual features.
Another approach pairs narrated sentences with symbolic representations of their meanings, automatically extracted from video, in order to learn object names, spatial-relation terms, and event names as a mapping from the grammatical structure of a sentence to the semantic structure of the associated meaning representation.
Another approach learns the language of sportscasting by determining the mapping between game commentaries and the meaning representations output by a rule-based simulation of the game.
It has also been shown that Montague-grammar representations of word meanings can be learned together with a combinatory categorial grammar (CCG) from child-directed sentences paired with first-order formulas that represent their meaning.
Although most of these methods succeed in learning word meanings from sentential descriptions, they do so only for symbolic or simple visual input (often synthesized). They fail to bridge the gap between language and computer vision, i.e., they do not attempt to extract meaning representations from complex visual scenes. On the other hand, the computer-vision community has conducted research on training object and event models from large corpora of complex images and video. However, most such work requires training data that labels individual concepts with individual words (i.e., objects delineated via bounding boxes in images as nouns, and events that occur in short video clips as verbs).
Reference is made to: U.S. Pat. No. 5,835,667 to Wactlar et al., issued Nov. 10, 1998; U.S. Pat. No. 6,445,834 to Rising, III, issued Sep. 3, 2002; U.S. Pat. No. 6,845,485 to Shastri et al., issued Jan. 18, 2005; U.S. Pat. No. 8,489,987 to Erol et al., issued Jul. 16, 2013; US2007/0209025 by Jing et al., published Sep. 6, 2007; and US2009/0254515 by Terheggen et al., published Oct. 8, 2009, the disclosure of each of which is incorporated herein by reference. Reference is also made to “Improving Video Activity Recognition using Object Recognition and Text Mining” by Tanvi S. Motwani and Raymond J. Mooney, in the Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), August 2012.
The attached drawings are for purposes of illustration and are not necessarily to scale.