The present invention relates to a method and an apparatus for detecting sound segments of audio data associated with moving pictures such as a video program recorded on a video tape or a disk, and is concerned with a method and an apparatus which can simplify indexing of a leading position of an audio sequence or interval in a video program.
Against the background of the high-speed computers and large-capacity storage devices that have become available in recent years, it has become possible to handle large volumes of moving pictures and the associated audio information in digitized form. In particular, in the field of editing and managing moving pictures, digitized moving pictures can be handled or processed by the pick-up device, the editing apparatus and the managing apparatus used for producing video programs. One such apparatus is a CM managing apparatus (also known as a CM bank), which is designed for managing several thousand varieties of commercial video segments (video clips) (hereinafter also referred to as the CM or CM video) and for preparing given CM videos (video clips) in the order of broadcasting. Heretofore, a plurality of CM video materials have been recorded on a single video tape before broadcasting. In recent years, a CM managing apparatus designed for broadcasting the CM video materials supplied from producers such as advertising agencies has also come into use. The CM video materials are supplied individually, one program per video tape, and the video supplied as the mother material contains the name or identifier of the producer and data concerning the production in addition to the intrinsic CM video entity. Further, so-called idle pictures lasting several seconds are inserted before and after the CM video for the purpose of timing alignment at broadcast. Under these circumstances, it becomes necessary to register the start and the end of the CM video (clip) to be broadcast, in addition to copying the mother material supplied from the producer onto another recording medium such as a tape or a disk for storage.
The work of checking the start and the end of the CM video is currently carried out entirely by hand, which imposes a heavy burden on the operator in charge. Because the idle pictures are shot in continuation with the start and the end of the intrinsic CM video entity, the operator often cannot discern the extent of the CM video that is actually to be broadcast by visual inspection alone. In the case of a CM video or the like constituted by a combination of audio and video, the operator determines the start and the end of the video by listening for sound, because no sound is recorded in the idle intervals of the video (clip). In the present state of the art, no method is available other than having the operator decide by ear the presence or absence of sound while repeating manipulations such as play, stop or pause, and reverse play of the video. These manipulations are certainly made easier by providing the video reproducing apparatus with a dial such as a jog or a shuttle, or by using a scroll bar on a computer screen. Nevertheless, they still consume considerable manpower.
With the present invention, it is contemplated as an object thereof to provide a method and an apparatus which make it possible to automate the work involved in deciding auditorily the presence or absence of sound at the start and the end of a CM video (clip) upon registration of CM video material while automating operation for the registration for simplification thereof.
Another object of the present invention is to provide a method and an apparatus for detecting the start and end points of an intrinsic CM video entity on a real-time basis for registering the positions of the start and end points, respectively.
In an interactive registration processing for registering a video in a video managing apparatus, it is taught according to the present invention to provide an envelope arithmetic means for determining arithmetically an envelope of waveform of a sound signal inputted on a time-serial basis, a sound level threshold value setting means for setting previously a threshold value of sound level for comparison with values of the envelope, and a start/end point detecting means for detecting a time point at which the envelope intersects the level of the aforementioned threshold value as a start point or an end point of a sound segment, to thereby allow the presence or absence of the sound determined heretofore with the auditory sense to be decided quantitatively and automatically. In that case, the start/end point detecting means mentioned above is provided with a silence time duration lower limit setting means for setting previously a lower limit on the duration of a silence state, a silence time duration arithmetic means for determining arithmetically an elapsed time during which the value of the envelope of the sound signal waveform has remained smaller than the threshold value of the sound level, and a silence time duration decision means for deciding that the above-mentioned silence time duration has exceeded the lower limit so that sound interruption of extremely short duration such as punctuation between phrases in a speech can be excluded from the detection. 
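The threshold comparison and the silence-duration lower limit described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the function name `detect_sound_segments`, the representation of the envelope as a list of per-sample values, and the expression of the lower limit as a sample count are all assumptions made for the example.

```python
def detect_sound_segments(envelope, threshold, min_silence):
    """Return (start, end) index pairs of sound segments; end is exclusive.

    envelope    -- envelope values of the sound signal, one per sample
    threshold   -- sound level threshold compared against the envelope
    min_silence -- lower limit on silence duration, in samples (assumed unit);
                   a gap must be longer than this to end a segment
    """
    segments = []
    start = None       # index where the current sound segment began
    last_sound = None  # index of the most recent above-threshold sample
    for i, value in enumerate(envelope):
        if value > threshold:
            if start is None:
                start = i          # envelope rose above the threshold
            last_sound = i
        elif start is not None and i - last_sound > min_silence:
            # Silence has lasted longer than the lower limit, so the segment
            # ended at the last sample that was above the threshold.
            segments.append((start, last_sound + 1))
            start = None
    if start is not None:
        # The input ended while a segment was still open; close it here.
        segments.append((start, last_sound + 1))
    return segments


# A one-sample gap (index 4) is bridged; the four-sample gap is not.
env = [0, 0, 5, 5, 0, 5, 5, 0, 0, 0, 0, 5, 0]
print(detect_sound_segments(env, threshold=1, min_silence=2))
```

Note that in this sketch an end point is only confirmed once the silence has outlasted the lower limit, so the decision is made somewhat later than the end point itself.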
Similarly, the start/end point detecting means mentioned above is provided with a sound time duration lower limit setting means for setting previously a lower limit on the duration of a sound state, a sound time duration arithmetic means for determining arithmetically an elapsed time during which the value of the envelope of the sound signal waveform has exceeded the threshold value of the sound level, and a sound time duration decision means for deciding that the sound time duration has exceeded the lower limit so that noise or sound of one-shot nature can be prohibited from being detected. Furthermore, the envelope arithmetic means mentioned above is provided with a filtering means for performing a filtering processing having a predetermined constant time duration on the sound signal inputted on a time-serial basis. As the filtering means mentioned above, a maximum value filter for determining sequentially maximum values of a predetermined constant time duration for the sound signal inputted on a time-serial basis and a minimum value filter for determining sequentially minimum values of a predetermined constant time duration for the sound signal inputted on a time-serial basis are employed.
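By way of illustration, the envelope computation with a maximum value filter and a minimum value filter, together with the sound-duration lower limit, might look like the sketch below. Several points are assumptions not stated in the text: the signal is rectified with `abs` before filtering, the maximum filter is applied before the minimum filter, both filters use the same trailing window, and durations are counted in samples. The function names are hypothetical.

```python
def max_filter(values, window):
    """Maximum over a trailing window of `window` samples."""
    return [max(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]


def min_filter(values, window):
    """Minimum over a trailing window of `window` samples."""
    return [min(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]


def envelope(samples, window):
    """Envelope sketch: rectify, then maximum filter, then minimum filter.

    The rectification and the max-then-min order are assumptions of this
    example; the text only names the two filters.
    """
    rectified = [abs(s) for s in samples]
    return min_filter(max_filter(rectified, window), window)


def drop_short_sounds(segments, min_sound):
    """Keep only (start, end) segments longer than `min_sound` samples,
    so that one-shot noise is not reported as a sound segment."""
    return [(s, e) for (s, e) in segments if e - s > min_sound]


# The zero crossings of an oscillating signal are bridged by the filters.
print(envelope([4, 0, -4, 0, 4, 0, 0, 0], window=2))
print(drop_short_sounds([(0, 3), (10, 100)], min_sound=5))
```

The slice-based filters are chosen for readability; for long inputs a sliding-window structure would avoid rescanning each window.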
Furthermore, it is taught according to the present invention to provide a video reproducing means for reproducing a video material, a sound input means for inputting a sound signal recorded on an audio track of the video being reproduced as a digital signal on a time-serial basis, a sound processing means for detecting the start and end points of a sound segment from the inputted sound signal, and a display means for displaying the results of the detection, thereby enabling the positions of the start and end points of the sound segment in the video material to be presented to an operator. The sound processing means is provided with a frame position determining means for determining the frame positions of the video at the time points at which the start and end points of the sound segment are detected, in addition to the envelope arithmetic means, the sound level threshold value setting means and the start/end point detecting means mentioned previously. The frame position determining means is provided with a timer means for counting the elapsed time from the beginning of the detection processing, a means for reading out the frame position of the video (or moving pictures), an elapsed time storage means for storing the elapsed time at the time points at which the start and end points are detected and the elapsed time at the time point at which the frame position is read out, and a frame position correcting means for correcting the frame position as read out by using the difference between the two elapsed times, so that the time lag between the detection of the start and end points and the reading of the frame position can be corrected, thereby allowing the frame position at the detection time point to be determined.
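The frame position correction described above amounts to converting the difference between the two stored elapsed times into a number of frames and subtracting it from the frame position as read out. A minimal sketch follows, assuming forward playback at normal speed, a constant frame rate (the default of 30 frames per second is an assumed value), and elapsed times expressed in seconds; the function name `corrected_frame` is hypothetical.

```python
def corrected_frame(frame_read, t_detect, t_read, fps=30.0):
    """Estimate the frame position at the instant of detection.

    frame_read -- frame position read out from the video reproducing means
    t_detect   -- elapsed time (s) at which the start or end point was detected
    t_read     -- elapsed time (s) at which the frame position was read out
    fps        -- frame rate; 30.0 is an assumed value for this example
    """
    # Frames that were played between the detection and the readout.
    lag_frames = round((t_read - t_detect) * fps)
    return frame_read - lag_frames


# The frame position was read 0.5 s after the detection, i.e. 15 frames late.
print(corrected_frame(frame_read=1000, t_detect=12.0, t_read=12.5))
```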
Furthermore, the sound processing means mentioned above is provided with a means for temporarily stopping the reproduction of the video at the start and end points as detected, to thereby enable the reproduction of the video to be paused at the frame positions corresponding to the start and end points. In that case, a video reproducing apparatus whose reproduction can be controlled by a computer is employed as the video reproducing means. By way of example, a video deck equipped with a VISCA (Video System Control Architecture) terminal, or a video deck of the kind generally used for professional editing, may be employed. In this way, head indexing to the detected sound segment can be realized efficiently.
Furthermore, it is taught according to the present invention that the sound processing means mentioned previously is provided with a frame position storage means for storing individually the frame positions of the start point and the end point of the sound segment, and a display means for displaying individually the frame positions of the start point and the end point, so that the positions of the start point and the end point of the sound segment in the video material can be presented individually to the operator. Besides, the sound processing means is provided with a buffer memory means for storing the time-serially inputted sound signals in units of a constant time duration and a reproducing means for reproducing the sound signals as inputted, so that the operator can confirm the detected sound segment both visually and auditorily. Furthermore, on the assumption that the picture subjected to the processing is a CM video material, use is made of the general rule that the CM video entity has a time duration of 15 seconds or 30 seconds per CM program: the sound processing means mentioned above is provided with a time duration setting means for setting previously an upper limit of the length of time duration of the sound segment having a predetermined constant time duration, together with a tolerance range of one or two seconds, and a time duration comparison means for comparing the length of the detected time duration, extending from the start point to the end point of the detected sound segment, with the set time duration length, thereby allowing only a sound segment of the predetermined constant time duration to be detected in a CM video (clip).
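The time duration comparison can be illustrated with the sketch below. It reads the set duration and its tolerance as a symmetric window around each nominal CM length, which is an assumption of this example; the function name `matches_cm_length`, the default tolerance of one second and the use of seconds as the unit are likewise assumptions.

```python
def matches_cm_length(start_s, end_s, nominal_lengths=(15.0, 30.0),
                      tolerance=1.0):
    """Return True if the detected segment has a CM-like duration.

    start_s, end_s  -- start and end points of the detected segment (s)
    nominal_lengths -- set durations; 15 s and 30 s follow the general rule
    tolerance       -- allowed deviation in seconds (assumed symmetric)
    """
    duration = end_s - start_s
    return any(abs(duration - n) <= tolerance for n in nominal_lengths)


print(matches_cm_length(2.0, 17.4))   # about 15.4 s: accepted
print(matches_cm_length(0.0, 22.0))   # 22 s: neither 15 s nor 30 s
```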
Additionally, the sound processing means is provided with a margin setting means for setting margins at front and rear sides, respectively, of the sound segment as detected so that the CM video (clip) for broadcasting which has the predetermined time duration can be registered in the CM managing apparatus from the CM video material.
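As a simple illustration of the margin setting, the detected segment can be widened by a front margin and a rear margin while being kept within the bounds of the video material. The sketch below assumes that positions and margins are expressed in frames and that the material runs from frame 0 to `total_frames`; the function name `add_margins` and its parameters are hypothetical.

```python
def add_margins(start_frame, end_frame, front_margin, rear_margin,
                total_frames):
    """Widen a detected segment by front and rear margins, in frames,
    clamping the result to the extent of the video material."""
    new_start = max(0, start_frame - front_margin)
    new_end = min(total_frames, end_frame + rear_margin)
    return new_start, new_end


print(add_margins(100, 550, front_margin=5, rear_margin=10, total_frames=600))
```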