The present invention relates generally to multimedia systems, including hybrid television-computer systems. More specifically, the present invention relates to story segmentation systems and corresponding processing software for separating an input video signal into discrete story segments. Advantageously, the multimedia system implements a finite automaton parser for video story segmentation.
Popular literature is replete with images of personal information systems where the user can merely input several keywords and the system will save any news broadcast, either radio or television broadcast, for later playback. To date, only computer systems running news retrieval software have come anywhere close to realizing the dream of a personal news retrieval system. In these systems, which generally run dedicated software, and may require specialized hardware, the computer monitors an information source and downloads articles of interest. For example, several programs can be used to monitor the Internet and download articles of interest in background for later replay by the user. Although these articles may include links to audio or video clips which can be downloaded while the article is being examined, the articles are selected based on keywords in the text. However, many sources of information, e.g., broadcast and cable television signals, cannot be retrieved in this manner.
The first hurdle which must be overcome in producing a multimedia computer system and corresponding operating method capable of video story segmentation is in designing a software or hardware system capable of parsing an incoming video signal, where the term video signal denotes, e.g., a broadcast television signal including video shots and corresponding audio segments. For example, U.S. Pat. No. 5,635,982 discloses an automatic video content parser for parsing video shots so that they can be represented in their native media and retrieved based on their visual content. Moreover, this patent discloses methods for temporal segmentation of video sequences into individual camera shots using a twin-comparison method, which is capable of detecting both camera shot changes implemented by sharp breaks and gradual transitions implemented by special editing techniques, including dissolve, wipe, fade-in and fade-out. It also discloses content-based keyframe selection of individual shots by analyzing the temporal variation of video content and selecting a keyframe once the difference of content between the current frame and a preceding selected keyframe exceeds a set of preselected thresholds. The patent admits that such parsing is a necessary first step in any video indexing process. However, while the automatic video parser is capable of parsing a received video stream into a number of separate video shots, i.e., cut detection, the automatic video parser is incapable of video indexing the incoming video signal based on the parsed video segments, i.e., content parsing.
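The content-based keyframe selection summarized above can be illustrated with a minimal sketch. It assumes that each frame has already been reduced to a feature vector (for example, a normalized histogram) and, for brevity, uses a single threshold rather than the set of thresholds mentioned for the cited patent. The names `l1_distance` and `select_keyframes`, the choice of an L1 distance, and the rule that the first frame is always a keyframe are illustrative assumptions, not details taken from U.S. Pat. No. 5,635,982.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_keyframes(frames: List[Sequence[float]], threshold: float) -> List[int]:
    """Return the indices of the frames chosen as keyframes.

    Assumption: the first frame of the shot is always a keyframe. A later
    frame becomes a keyframe once its distance from the most recently
    selected keyframe exceeds `threshold` (one hypothetical threshold).
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        if l1_distance(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes
```

Because each frame is compared with the last selected keyframe rather than with its immediate predecessor, slow drift in content eventually triggers a new keyframe even when no single frame-to-frame step is large.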
While there has been significant previous research in parsing and interpreting spoken and written natural languages, e.g., English, French, etc., the advent of new interactive devices has motivated the extension of traditional lines of research. There has been significant investigation into processing isolated media, especially speech and natural language and, to a lesser degree, handwriting. Other research has focused on parsing equations (e.g., a handwritten “5+3”), drawings (e.g., flow charts), and even face recognition, e.g., lip, eye, and head movements. While parsing and analyzing multimedia presents even greater challenges with a potentially commensurate reward, the literature is only now suggesting the analysis of multiple types of media for the purpose of resolving ambiguities in one of the media types. For example, the addition of a visual channel to a speech recognizer could provide further visual information, e.g., lip movements and body posture, which could be used to help in resolving ambiguous speech. However, these investigations have not considered using the output of, for example, a language parser to identify keywords which can be associated with video segments to further identify these video segments.
The article by Deborah Swanberg et al. entitled “Knowledge Guided Parsing in Video Databases” summarized the problem as follows:
“Visual information systems require both database and vision system capabilities, but a gap exists between these two systems: databases do not provide image segmentation, and vision systems do not provide database query capabilities . . . The data acquisition in typical alphanumeric databases relies primarily on the user to type in the data. Similarly, past visual databases have provided keyword descriptions of the visual descriptions of the visual data, so data entry did not vary much from the original alphanumeric systems. In many cases, however, these old visual systems did not provide a sufficient description of the content of the data.”
The paper proposed a new set of tools which could be used to: semiautomatically segment the video data into domain objects; process the video segments to extract features from the video frames; represent desired domains as models; and compare the extracted features and domain objects with the representative models. The article suggests the representation of episodes with finite automatons, where the alphabet consists of the possible shots making up the continuous video stream and where the states contain a list of arcs, i.e., a pointer to a shot model and a pointer to the next state.
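The finite automaton representation described in the article, in which each state holds a list of arcs and each arc pairs a shot model with a next state, can be sketched as a small data structure. The sketch below is an illustration only: strings stand in for the pointers, the automaton is assumed to be deterministic, and the class names `Arc` and `State`, the helper `accepts`, and the anchor/report episode in `build_news_fa` are hypothetical and not taken from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Arc:
    shot_model: str   # stands in for the pointer to a shot model
    next_state: str   # stands in for the pointer to the next state


@dataclass
class State:
    name: str
    arcs: List[Arc] = field(default_factory=list)


def build_news_fa() -> Dict[str, State]:
    """Hypothetical episode model: anchor shots alternating with report shots."""
    return {
        "start": State("start", [Arc("anchor", "anchor")]),
        "anchor": State("anchor", [Arc("report", "report")]),
        "report": State("report", [Arc("anchor", "anchor")]),
    }


def accepts(states: Dict[str, State], start: str, finals: Set[str],
            shots: List[str]) -> bool:
    """Return True if the shot sequence drives the automaton to a final state."""
    current = start
    for shot in shots:
        # Follow the first arc whose shot model matches the observed shot.
        target = next(
            (a.next_state for a in states[current].arcs if a.shot_model == shot),
            None,
        )
        if target is None:
            return False  # no arc for this shot: the sequence is rejected
        current = target
    return current in finals
```

With `finals = {"anchor"}`, the sequence anchor, report, anchor is accepted, while a sequence that ends on a report shot is rejected.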
In contrast, the article by M. Yeung et al., entitled “Video Content Characterization and Compaction for Digital Library Applications,” describes content characterization by a two-step process of labeling, i.e., assigning shots that are visually similar and temporally close to each other the same label, and model identification in terms of the resulting label sequence. Three fundamental models are proposed: dialogue, action, and story unit models. Each of these models has a corresponding recognition algorithm.
The second hurdle which must be overcome in producing a multimedia computer system and corresponding operating method capable of video story segmentation is in integrating other software, including text parsing and analysis software and voice recognition software, into a software and/or hardware system capable of content analysis of any audio and text, e.g., closed captions, in an incoming multimedia signal, e.g., a broadcast video signal. The final hurdle which must be overcome in producing a multimedia computer system and corresponding operating method capable of story segmentation is in designing a software or hardware system capable of integrating the outputs of the various parsing modules or devices into a structure permitting replay of only the story segments in the incoming video signal which are of interest to the user.
What is needed is a multimedia system and corresponding operating program for story segmentation based on plural portions of a multimedia signal, e.g., a broadcast video signal. Moreover, what is needed is an improved multimedia signal parser which either effectively matches story segment patterns with predefined story patterns or which generates a new story pattern in the event that a match cannot be found. Furthermore, a multimedia computer system and corresponding operating program which can extract usable information from all of the information types, e.g., video, audio, and text, included in a multimedia signal would be extremely desirable, particularly when the multimedia source is a broadcast television signal, irrespective of its transmission method.
Based on the above and foregoing, it can be appreciated that there presently exists a need in the art for a multimedia computer system and corresponding operating method which overcomes the above-described deficiencies. The present invention was motivated by a desire to overcome the drawbacks and shortcomings of the presently available technology, and thereby fulfill this need in the art.
The present invention is a multimedia computer system and corresponding operating method capable of performing video story segmentation on an incoming multimedia signal. According to one aspect of the present invention, the video segmentation method advantageously can be performed automatically or under direct control of the user.
One object of the present invention is to provide a multimedia computer system for processing and retrieving video information of interest based on information extracted from video signals, audio signals, and text constituting a multimedia signal.
Another object according to the present invention is to produce a method for analyzing and processing multimedia signals for later recovery. Preferably, the method generates a finite automaton (FA) modeling the format of the received multimedia signal. Advantageously, key words extracted from a closed caption insert are associated with each node of the FA. Moreover, the FA can be expanded to include nodes representing music and conversation.
Still another object according to the present invention is to provide a method for recovering a multimedia signal selected by the user based on the FA class and FA characteristics.
Yet another object according to the present invention is to provide a storage medium for storing program modules for converting a general purpose multimedia computer system into a specialized multimedia computer system for processing and recovering multimedia signals in accordance with finite automatons. The storage medium advantageously can be a memory device such as a magnetic storage device, an optical storage device or a magneto-optical storage device.
These and other objects, features and advantages according to the present invention are provided by a storage medium for storing computer readable instructions for permitting a multimedia computer system receiving a multimedia signal containing unknown information, the multimedia signal including a video signal, an audio signal and text, to perform a parsing process on the multimedia signal to thereby generate a finite automaton (FA) model and to one of store and discard an identifier associated with the FA model based on agreement between user-selected keywords and keywords associated with each node of the FA model extracted by the parsing process. According to one aspect of the invention, the storage medium comprises a rewritable compact disc (CD-RW) and the multimedia signal is a broadcast television signal.
These and other objects, features and advantages according to the present invention are provided by a storage medium for storing computer readable instructions for permitting a multimedia computer system to retrieve a selected multimedia signal from a plurality of stored multimedia signals by identifying a finite automaton (FA) model having a substantial similarity to the selected multimedia signal and by comparing FA characteristics associated with the nodes of the FA model with user-specified characteristics. According to one aspect of the present invention, the storage medium comprises a hard disk drive while the multimedia signals are stored on a digital versatile disc (DVD).
These and other objects, features and advantages according to the present invention are provided by a multimedia signal parsing method for operating a multimedia computer system receiving a multimedia signal including a video shot sequence, an audio signal and text information to permit story segmentation of the multimedia signal into discrete stories, each of which has associated therewith a final finite automaton (FA) model and keywords, at least one of which is associated with a respective node of the FA model. Preferably, the method includes steps for:
(a) analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
(b) comparing the identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
(c) constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
(d) coupling neighboring video shots or similar shots with the identified keyframes when the neighboring video shots are apparently related to a story represented by the identified keyframes;
(e) extracting the keywords from the text information and storing the keywords at locations associated with each node of the constructed FA model;
(f) analyzing and segmenting the audio signal in the multimedia signal into identified speaker segments, music segments, and silent segments;
(g) attaching the identified speaker segments, music segments, laughter segments, and silent segments to the constructed FA model;
(h) when the constructed FA model matches a previously defined FA model, storing the identity of the constructed FA model as the final FA model along with the keywords; and
(i) when the constructed FA model does not match a previously defined FA model, generating a new FA model corresponding to the constructed FA model, storing the new FA model, and storing the identity of the new FA model as the final FA model along with the keywords.
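Steps (h) and (i) above amount to a match-or-create lookup against the previously defined FA models. The following minimal sketch illustrates that control flow under strong simplifying assumptions: a constructed FA model is reduced to a tuple of node labels, "matches" is taken to mean exact equality, and new identifiers follow a `model_N` naming scheme. The function name `resolve_final_model` and all of these choices are hypothetical and are not specified in the text.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]  # simplified stand-in for a constructed FA model


def resolve_final_model(constructed: Pattern, library: Dict[str, Pattern]) -> str:
    """Return the identity of the final FA model for `constructed`.

    Step (h): if a previously defined model matches, reuse its identity.
    Step (i): otherwise store the constructed model as a new model and
    return the new identity. Exact equality is an assumed matching rule.
    """
    for model_id, pattern in library.items():
        if pattern == constructed:
            return model_id
    # Pick an unused identifier (hypothetical naming scheme).
    n = len(library) + 1
    while f"model_{n}" in library:
        n += 1
    new_id = f"model_{n}"
    library[new_id] = constructed
    return new_id
```

In either branch the caller receives a single identifier, which can then be stored together with the keywords extracted in step (e).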
According to one aspect of the present invention, the method also includes steps for:
(j) determining whether the keywords generated in step (e) match user-selected keywords; and
(k) when a match is not detected, terminating the multimedia signal parsing method.
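The keyword test of steps (j) and (k) can be sketched as a simple gate. The sketch assumes that the keywords are held per FA node in a dictionary and that a "match" means at least one case-insensitive keyword in common; the function name `keywords_match` and both assumptions are illustrative only.

```python
from typing import Dict, Iterable, Set


def keywords_match(node_keywords: Dict[str, Set[str]],
                   user_keywords: Iterable[str]) -> bool:
    """Return True if any node keyword equals any user-selected keyword.

    Assumption: comparison is case-insensitive and one shared keyword
    is enough to count as a match.
    """
    wanted = {k.lower() for k in user_keywords}
    return any(
        wanted & {k.lower() for k in keywords}
        for keywords in node_keywords.values()
    )
```

A caller would stop processing the signal when the function returns `False`, which corresponds to terminating the parsing method in step (k).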
These and other objects, features and advantages according to the present invention are provided by a combination receiving a multimedia signal including a video shot sequence, an audio signal and text information for performing story segmentation on the multimedia signal to generate discrete stories, each of which has associated therewith a final finite automaton (FA) model and keywords, at least one of which is associated with a respective node of the FA model. Advantageously, the combination includes:
a first device for analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
a second device for comparing the identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
a third device for constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
a fourth device for coupling neighboring video shots or similar shots with the identified keyframes when the neighboring video shots are apparently related to a story represented by the identified keyframes;
a fifth device for extracting the keywords from the text information and storing the keywords at locations associated with each node of the constructed FA model;
a sixth device for analyzing and segmenting the audio signal in the multimedia signal into identified speaker segments, music segments, and silent segments;
a seventh device for attaching the identified speaker segments, music segments, and silent segments to the constructed FA model;
an eighth device for storing the identity of the constructed FA model as the final FA model along with the keywords when the constructed FA model matches a previously defined FA model; and
a ninth device for generating a new FA model corresponding to the constructed FA model, for storing the new FA model, and for storing the identity of the new FA model as the final FA model along with the keywords when the constructed FA model does not match a previously defined FA model.
These and other objects, features and advantages according to the present invention are provided by a method for operating a multimedia computer system storing a multimedia signal including a video signal, an audio signal and text information as a plurality of individually retrievable story segments, each having associated therewith a finite automaton (FA) model and keywords, at least one of which is associated with each respective node of the FA model, the method comprising steps for:
selecting a class of FA models corresponding to a desired story segment to thereby generate a selected FA model class;
selecting a subclass of the selected FA model class corresponding to the desired story segment to thereby generate a selected FA model subclass;
generating a plurality of keywords corresponding to the desired story segment; and
sorting a set of the story segments corresponding to the selected FA model subclass using the keywords to retrieve ones of the set of the story segments including the desired story segment.
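The retrieval steps above (select a class, select a subclass, generate keywords, then sort) can be illustrated as a filter followed by a ranking. In the sketch below, the record type `StorySegment`, the function `retrieve`, the ranking by number of shared keywords, and the dropping of segments with no shared keyword are all assumptions made for illustration; the text does not specify how the sort is carried out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class StorySegment:
    segment_id: str
    fa_class: str      # hypothetical class label, e.g. "news"
    fa_subclass: str   # hypothetical subclass label, e.g. "evening"
    keywords: Set[str]


def retrieve(segments: List[StorySegment], fa_class: str, fa_subclass: str,
             query: Iterable[str]) -> List[StorySegment]:
    """Filter by FA class and subclass, then rank by shared keyword count."""
    wanted = {k.lower() for k in query}
    candidates = [
        s for s in segments
        if s.fa_class == fa_class and s.fa_subclass == fa_subclass
    ]
    scored = [(len(wanted & {k.lower() for k in s.keywords}), s) for s in candidates]
    # Keep only segments sharing at least one keyword (an assumed cutoff).
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [s for _, s in scored]
```

Filtering on class and subclass first keeps the keyword comparison confined to segments whose FA model already resembles the desired story segment.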
These and other objects, features and advantages according to the present invention are provided by a story segment retrieval device for a multimedia computer system storing a multimedia signal including a video signal, an audio signal and text information as a plurality of individually retrievable story segments, each having associated therewith a finite automaton (FA) model and keywords, at least one of which is associated with each respective node of the FA model. Advantageously, the device includes:
a device for selecting a class of FA models corresponding to a desired story segment to thereby generate a selected FA model class;
a device for selecting a subclass of the selected FA model class corresponding to the desired story segment to thereby generate a selected FA model subclass;
a device for generating a plurality of keywords corresponding to the desired story segment; and
a device for sorting a set of the story segments corresponding to the selected FA model subclass using the keywords to retrieve ones of the set of the story segments including the desired story segment.
These and other objects, features and advantages according to the present invention are provided by a video story parsing method employed in the operation of a multimedia computer system receiving a multimedia signal including a video shot sequence, an associated audio signal and corresponding text information to permit a multimedia signal parsed into a predetermined category having an associated finite automaton (FA) model and keywords, at least one of the keywords being associated with a respective node of the FA model, to be parsed into a number of discrete video stories. Advantageously, the method includes steps for extracting a plurality of keywords from an input first sentence, categorizing the first sentence into one of a plurality of categories, determining whether a current video shot belongs to a previous category, a current category or a new category of the plurality of categories responsive to similarity between the first sentence and an immediately preceding sentence, and repeating the above-mentioned steps until all video clips and respective sentences are assigned to one of the categories.
According to one aspect of the present invention, the categorizing step advantageously can be performed by categorizing the first sentence into one of a plurality of categories by determining a measure $M_k^i$ of the similarity between the keywords extracted during step (a) and a keyword set for an ith story category $C_i$ according to the expression set:

$$M_k^i = \left(\frac{MK}{N_{keywords}} + \mathrm{Mem}^i\right) / 2$$

$$M_k^i = \frac{MK}{N_{keywords}}$$

where $MK$ denotes a number of matched words out of a total number $N_{keywords}$ of keywords in the respective keyword set for a characteristic sentence in the category $C_i$, where $\mathrm{Mem}^i$ is indicative of a measure of similarity with respect to the previous sentence sequence within category $C_i$, and wherein $0 < M_k^i < 1$.
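The similarity measure can be sketched in a few lines. The text gives two expressions without stating when each applies, so the sketch makes a loudly labeled assumption: the averaged form is used when a value of Mem for the category is available, and the plain ratio MK/Nkeywords is used otherwise. The function name `similarity` and the handling of an empty keyword set are likewise hypothetical.

```python
from typing import Iterable, Optional, Set


def similarity(sentence_keywords: Iterable[str], category_keywords: Set[str],
               mem: Optional[float] = None) -> float:
    """Compute the measure M for one category.

    MK is the number of category keywords matched by the sentence and
    Nkeywords is the size of the category keyword set. Assumption: when
    `mem` is None the plain ratio is returned; otherwise the ratio is
    averaged with `mem`.
    """
    n_keywords = len(category_keywords)
    if n_keywords == 0:
        return 0.0  # assumed behavior for an empty keyword set
    mk = len(set(sentence_keywords) & category_keywords)
    ratio = mk / n_keywords
    if mem is None:
        return ratio
    return (ratio + mem) / 2
```

For example, a sentence matching two of four category keywords yields 0.5 from the ratio alone, and 0.375 when averaged with a Mem value of 0.25.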
Moreover, these and other objects, features and advantages according to the present invention are provided by a method for operating a multimedia computer system receiving a multimedia signal including a video shot sequence, an associated audio signal and corresponding text information to thereby generate a video story database including a plurality of discrete stories searchable by one of a finite automaton (FA) model having associated keywords, at least one of which keywords is associated with a respective node of the FA model, and user-selected similarity criteria. Preferably, the method includes steps for:
(a) analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
(b) comparing the identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
(c) constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
(d) coupling neighboring video shots or similar shots with the identified keyframes when the neighboring video shots are apparently related to a story represented by the identified keyframes;
(e) extracting the keywords from the text information and storing the keywords at locations associated with each node of the constructed FA model;
(f) analyzing and segmenting the audio signal of the multimedia signal into identified speaker segments, music segments, laughter segments, and silent segments;
(g) attaching the identified speaker segments, music segments, laughter segments, and silent segments to the constructed FA model;
(h) when the constructed FA model matches a previously defined FA model, storing the identity of the constructed FA model as the final FA model along with the keywords;
(i) when the constructed FA model does not match a previously defined FA model, generating a new FA model corresponding to the constructed FA model, storing the new FA model, and storing the identity of the new FA model as the final FA model along with the keywords;
(j) when the final FA model corresponds to a predetermined program category, performing video story segmentation according to the substeps of:
(j)(i) extracting a plurality of keywords from an input first sentence;
(j)(ii) categorizing the first sentence into one of a plurality of video story categories;
(j)(iii) determining whether a current video shot belongs to a previous video story category, a current video story category or a new video story category of the plurality of video story categories responsive to similarity between the first sentence and an immediately preceding sentence; and
(j)(iv) repeating steps (j)(i) through (j)(iii) until all video clips and respective sentences are assigned to one of the video story categories.
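The loop formed by substeps (j)(i) through (j)(iv) can be illustrated with a simplified sketch. It is not the claimed method: it substitutes a Jaccard overlap between keyword sets for the similarity measure, uses a tiny hypothetical stopword list for keyword extraction, and names categories `story_1`, `story_2`, and so on. A sentence stays in the current category when it overlaps the immediately preceding sentence, returns to an existing category when it overlaps that category's accumulated keywords, and otherwise opens a new category. The names `extract_keywords`, `overlap`, and `assign_categories`, and the default threshold, are assumptions.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are",
             "was", "for", "at"}


def extract_keywords(sentence: str) -> Set[str]:
    """Substep (j)(i): lowercase words minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS}


def overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap, swapped in here for the similarity measure."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_categories(sentences: List[str], threshold: float = 0.15) -> List[str]:
    """Substeps (j)(ii) to (j)(iv): label every sentence with a category."""
    categories: Dict[str, Set[str]] = {}  # category -> accumulated keywords
    labels: List[str] = []
    prev_kw: Set[str] = set()
    current: Optional[str] = None
    for sentence in sentences:
        kw = extract_keywords(sentence)
        if current is not None and overlap(kw, prev_kw) >= threshold:
            label = current  # similar to the preceding sentence
        else:
            best = max(categories, key=lambda c: overlap(kw, categories[c]),
                       default=None)
            if best is not None and overlap(kw, categories[best]) >= threshold:
                label = best  # return to an existing category
            else:
                label = f"story_{len(categories) + 1}"  # open a new category
                categories[label] = set()
        categories[label] |= kw
        labels.append(label)
        prev_kw, current = kw, label
    return labels
```

In this sketch each video clip would inherit the category of its associated sentence, so the loop ends once every sentence, and therefore every clip, carries a label.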