The present invention is directed, in general, to a system and method for locating the boundaries of segments of a video program within a video data stream and, more specifically, to a system and method for locating boundaries of video programs and boundaries of commercial messages by using audio categories such as speech, music, silence, and noise.
A wide variety of video recorders are available in the marketplace. Most people own, or are familiar with, a video cassette recorder (VCR), also referred to as a video tape recorder (VTR). A video cassette recorder records video programs on magnetic cassette tapes. More recently, video recorders that use computer magnetic hard disks rather than magnetic cassette tapes to store video programs have appeared in the market. For example, the ReplayTV™ recorder and the TiVO™ recorder digitally record television programs on hard disk drives using, for example, an MPEG video compression standard. Additionally, some video recorders may record on a readable/writable digital versatile disk (DVD) rather than a magnetic disk.
Video recorders are typically used in conjunction with a video display device such as a television. A video recorder may be used to record a video program at the same time that the video program is being displayed on the video display device. A common example is the use of a video cassette recorder (VCR) to record television programs while the television programs are simultaneously displayed on a television screen.
Video recorders rely on high-level Electronic Program Guide (EPG) information to determine the start times and the end times of television programs for recording purposes. Unfortunately, the EPG information may often be inaccurate, especially for live television broadcasts. There is a need in the art for an improved system and method for locating the boundaries of video programs. Moreover, broadcasters are not motivated to insert any metadata information about the boundaries of commercial messages (“commercials”) in video programs.
Various methods exist to detect the start times and the end times of segments of video programs. These methods are typically used to detect commercials so that the commercials may be automatically skipped over when a video program is being recorded in a video recorder. Several well-known methods involve the detection of a “black frame.” A black frame is a black video frame that is usually found immediately before and after a commercial. Other methods for detecting the boundaries of a commercial include using cut rate change, super histograms, digitized codes with time information, etc.
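The black frame approach mentioned above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the frame representation (rows of 8-bit luminance values), the function names `is_black_frame` and `black_frame_indices`, and the cutoff `max_mean_luma` are all assumptions chosen for illustration.

```python
from typing import List, Sequence

# Assumed representation: a frame is a grid of 8-bit luminance values (0-255).
Frame = Sequence[Sequence[int]]


def is_black_frame(frame: Frame, max_mean_luma: float = 16.0) -> bool:
    """Return True when the frame's mean luminance is at or below the cutoff.

    The cutoff of 16.0 is a hypothetical default, not a value from the source.
    """
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return count > 0 and total / count <= max_mean_luma


def black_frame_indices(frames: Sequence[Frame],
                        max_mean_luma: float = 16.0) -> List[int]:
    """Return the indices of frames that look black (candidate boundaries)."""
    return [i for i, frame in enumerate(frames)
            if is_black_frame(frame, max_mean_luma)]


# Tiny hypothetical 2x2 frames: bright, black, mid-gray.
frames = [
    [[200, 180], [190, 210]],
    [[0, 2], [1, 3]],
    [[120, 130], [125, 135]],
]
candidates = black_frame_indices(frames)
```

In this sketch only the second frame falls below the cutoff, so it would be reported as a candidate boundary. A practical detector would also have to cope with dark scenes inside a program, which is one reason such methods are not fully reliable.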
Another prior art method for detecting the boundaries of a program or a commercial involves inserting a special code or signal in the video signal to designate the beginning and the end of the program or commercial. Special circuitry is needed to detect and identify the special code or signal.
In addition, there are presently existing television standards that insert program identification information in the video signal. The program identification information uniquely identifies the beginning and the end of the program. This information can also be used to detect the boundaries of programs.
These prior art methods all involve the insertion and detection of special codes, special signals, or special program identification information within a video data stream. There is a need in the art for an improved system and method for locating the boundaries of video programs and commercials within a video data stream without using special codes, special signals, or special program identification information.
There is also a need for an improved system and method for automatically locating the boundaries of video programs and the boundaries of commercials in computerized personal multimedia retrieval systems. Computerized personal multimedia retrieval systems exist for identifying and recording segments of a video program (usually from a television broadcast) that contain topics that a user desires to record. The desired segments are usually identified based upon keywords input by the user. In a typical application, a computer system operates in the background to monitor the content of information from a source such as the Internet. The content selection is guided by the keywords provided by the user. When a match is found between the keywords and the content of the monitored information, the information is stored for later replay and viewing by the user. The downloaded information may include links to audio signals and to video clips that can also be downloaded by the user.
A computerized personal multimedia retrieval system that allows users to select and retrieve portions of television programs for later playback usually meets three primary requirements. First, a system and method is usually available for parsing an incoming video signal into its visual, audio, and textual components. Second, a system and method is usually available for analyzing the content of the audio and/or textual components of the broadcast signal with respect to user input criteria and segmenting the components based upon content. Third, a system and method is usually available for integrating and storing program segments that match the user's requirements for later replay by the user. In addition, users prefer to record and play back only program segments and not commercials.
A system that meets these requirements is described in U.S. patent application Ser. No. 09/006,657 filed Jan. 13, 1998 by Dimitrova (a co-inventor of the present invention) entitled “MULTIMEDIA COMPUTER SYSTEM WITH STORY SEGMENTATION CAPABILITY AND OPERATING PROGRAM THEREFOR INCLUDING FINITE AUTOMATON VIDEO PARSER.” U.S. patent application Ser. No. 09/006,657 is hereby incorporated herein by reference within this document for all purposes as if fully set forth herein.
U.S. patent application Ser. No. 09/006,657 describes a system and method that provides a set of models for recognizing a sequence of symbols, a matching model that identifies desired selection criteria, and a methodology for selecting and retrieving one or more video story segments or sequences based upon the selection criteria.
A significant improvement in the operation of video signal processors, such as video recorders and computerized personal multimedia retrieval systems, can be obtained if the locations of the boundaries of the video programs and commercials are known. There is therefore a need in the art for an improved system and method for locating the boundaries of video programs and the boundaries of commercials within a video data stream.
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide an improved system and method for locating the boundaries of video programs and the boundaries of commercials within a video data stream by using the audio content of the program. Specifically, it is a primary object of the present invention to provide an improved system and method for locating the boundaries of video programs and the boundaries of commercials within a video data stream by using audio categories such as speech, music, silence, and noise.
It is also a primary object of the present invention to provide an improved system and method for automatically locating the boundaries of video programs and the boundaries of commercials within a video data stream without requiring the use of special codes, special signals, or special program identification information inserted in the video data stream.
The system of the present invention comprises an audio classifier controller that categorizes sequential portions of audio signals into audio categories such as speech, music, silence, and noise. The audio classifier controller also categorizes sequential portions of audio signals into audio categories such as speech with background music, speech with background noise, speech with background speech, etc. The audio classifier controller also categorizes sequential portions of audio speech signals into speaker categories when the identity of a speaker can be determined. Each speaker category contains audio speech signals of one individual speaker. Speakers who cannot be identified are categorized in an “unknown speaker” category.
The audio classifier controller of the present invention also comprises a category change detector that detects when a first portion of the audio signal categorized in a first category ceases and when a second portion of the audio signal categorized in a second category begins. That is, the category change detector determines when a category of the audio signal changes. In this manner the audio classifier controller of the present invention continually determines the audio category of each portion of the audio signal.
The category change detector also determines when a first portion of the audio signal categorized in a first speaker category ceases and when a second portion of the audio signal categorized in a second speaker category begins. That is, the category change detector determines when a speaker category of the audio signal changes.
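The category change detection described above can be sketched as follows, under the assumption that the classifier has already produced one label per sequential portion of the audio signal. The function name `detect_category_changes` and the label strings are hypothetical; the same comparison applies whether the labels are audio categories or speaker categories.

```python
from typing import List, Sequence, Tuple


def detect_category_changes(labels: Sequence[str]) -> List[Tuple[int, str, str]]:
    """Return (index, previous_label, new_label) for every point at which
    the label of one portion differs from the label of the portion before it."""
    changes = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            changes.append((i, labels[i - 1], labels[i]))
    return changes


# Hypothetical per-portion audio category labels.
audio_labels = ["speech", "speech", "music", "music", "silence", "speech"]
audio_changes = detect_category_changes(audio_labels)

# Hypothetical per-portion speaker category labels for a speech segment.
speaker_labels = ["speaker A", "speaker A", "unknown speaker", "speaker B"]
speaker_changes = detect_category_changes(speaker_labels)
```

Each reported index marks the portion at which the first category ceases and the second category begins.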
The audio classifier controller of the present invention also comprises a category change rate detector that determines the rate at which the audio categories are changing (the “category change rate”). The category change rate detector compares the category change rate to a threshold value. The threshold value can either be a preselected value or can be determined dynamically in response to changing operating conditions. If the category change rate is greater than the threshold value, the existence of a commercial segment may be inferred, and a boundary may therefore be inferred where the commercial segment begins or ends.
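The threshold comparison described above might look like the following minimal sketch. It assumes that category changes are available as timestamps in seconds and that the rate is measured as changes per second over fixed, non-overlapping windows. The function names, the window length, and the threshold value are all hypothetical and use a preselected threshold rather than a dynamically determined one.

```python
import math
from typing import List, Sequence, Tuple


def category_change_rate(change_times: Sequence[float],
                         window_start: float,
                         window_length: float) -> float:
    """Changes per second within [window_start, window_start + window_length)."""
    window_end = window_start + window_length
    count = sum(1 for t in change_times if window_start <= t < window_end)
    return count / window_length


def high_rate_windows(change_times: Sequence[float],
                      total_duration: float,
                      window_length: float,
                      threshold: float) -> List[Tuple[float, float]]:
    """Return the (start, end) windows whose change rate exceeds the threshold.

    Such windows are treated here as candidate commercial segments.
    """
    windows = []
    for k in range(math.ceil(total_duration / window_length)):
        start = k * window_length
        if category_change_rate(change_times, start, window_length) > threshold:
            windows.append((start, min(start + window_length, total_duration)))
    return windows


# Hypothetical change timestamps (seconds) over a 180-second stream.
change_times = [5, 40, 61, 63, 66, 70, 74, 79, 85, 88, 130]
flagged = high_rate_windows(change_times, total_duration=180.0,
                            window_length=30.0, threshold=0.1)
```

With these assumed values only the window from 60 to 90 seconds, which contains eight changes, exceeds the threshold of 0.1 changes per second, so its start and end would be treated as candidate boundaries.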
It is an object of the present invention to provide an improved system and method for identifying boundaries using classification of audio signals to obtain at least one audio category for each segment of an audio signal.
It is also an object of the present invention to provide an improved system and method for identifying boundaries using classification of audio signals into audio categories such as silence, music, noise and speech.
It is also an object of the present invention to provide an improved system and method for identifying boundaries using classification of audio signals into audio subcategories such as speech with background music, speech with background noise, music with background noise, etc.
It is another object of the present invention to provide an improved system and method for identifying boundaries by accessing a speech database to classify speech audio signals of persons who are speaking during a speech segment of an audio signal.
It is an additional object of the present invention to provide an improved system and method for identifying boundaries by determining when an audio category changes.
It is an additional object of the present invention to provide an improved system and method for identifying boundaries by determining when a speaker changes.
It is also an object of the present invention to provide an improved system and method for determining the rates at which audio categories change in an audio signal.
It is another object of the present invention to compare the rate at which an audio category changes in an audio signal with a threshold value to locate boundaries of video program segments and commercials in a video program segment that contains the audio signal.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.
Before undertaking the DETAILED DESCRIPTION, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation, where such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most, instances, such definitions apply to prior, as well as future, uses of such defined words and phrases.