It is common for radio and television broadcasts of live events to be delayed a few seconds before the audio data are transmitted, so that the content can be reviewed in real time by individuals who are tasked with preventing undesirable language from being received. This form of censorship has become more critical with the increased demand for “cleaning up” the airwaves to eliminate obscenities, profanities, sexually specific words and other content deemed unsuitable for general audiences. The degree of censorship applied to audio content will typically depend upon the nature of the event that is being broadcast and the audience likely to receive the broadcast. Even so, it is apparent that certain expletives and words or phrases will be targeted as unacceptable for a given kind of event, likely audience, and even the time at which a broadcast occurs (depending on whether children would likely be included in the audience at that time).
Humans are remarkably adept at identifying words and phrases that are considered unacceptable. However, for certain types of broadcasts, it would be preferable to employ a more automated approach that avoids the need to pay for or provide a human censor to monitor the broadcast, while still enabling the audio data to be censored at an appropriate level. For example, in massively multiplayer games, it would be impractical to employ a human censor to monitor the multitude of voice chat sessions that might be occurring at one time.
Human censors can adapt to the venue and to different speakers and language traits, so that the content of a speaker's utterance is not unduly censored. It would therefore be desirable for any automatic censorship system and method to be capable of similarly dynamically adapting to the requirements of a specific venue and likely audience. The need for such variability in the censorship process should be apparent. For example, an automatic censor system should apply a much more relaxed level of censorship during an interview of a sports personality occurring in a late night broadcast that would likely not be heard by children. Moreover, a speaker who is a well-known personality would be expected to bear more responsibility for any inappropriate personal utterances than the network over which those utterances are broadcast. In contrast, for speech by an announcer during a children's game show, the list of words included in the unacceptable vocabulary used by an automatic censor system would likely be much longer, and the tolerance for allowing any questionable words or phrases to slip through would be much tighter.
It would also be desirable to enable the automated system to dynamically adjust to the frequency with which a speaker uses expletives and other undesired speech, since the recognition of undesired speech may not be entirely accurate. A lower threshold can be applied if the speaker uses undesired speech more frequently, to avoid any undesired speech being transmitted. Conversely, if a speaker only infrequently appears to use such undesired speech, it would likely be preferable to apply a higher threshold to avoid mistakenly censoring speech that is not truly undesired. Since two words that sound alike can have different meanings, depending upon context, it is important that an automated censor system and method apply appropriate rules to avoid censoring perfectly acceptable words, while censoring unacceptable words and phrases, depending upon the context of the spoken words or phrases. It would be desirable to apply a probability threshold in making such decisions, so that the decisions made can be controlled by varying the probability threshold appropriately.
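The threshold adjustment described above can be illustrated with a minimal sketch. It assumes that a speech recognizer supplies, for each word, a probability that the word is undesired; the class name `SpeakerCensor`, the linear mapping from a speaker's history to a threshold, and all numeric values are hypothetical choices made only for illustration and are not taken from the description.

```python
from dataclasses import dataclass


@dataclass
class SpeakerCensor:
    """Tracks one speaker and adapts the censoring threshold to that speaker.

    All default values are illustrative assumptions.
    """
    floor: float = 0.50            # threshold used for a frequent offender
    ceiling: float = 0.95          # threshold used for a speaker with a clean record
    saturation_rate: float = 0.10  # censored-word rate at which the floor is reached
    words_heard: int = 0
    words_censored: int = 0

    def threshold(self) -> float:
        """Return the current probability threshold for this speaker."""
        if self.words_heard == 0:
            return self.ceiling
        rate = self.words_censored / self.words_heard
        fraction = min(rate / self.saturation_rate, 1.0)
        # The more often the speaker has been censored, the lower the threshold.
        return self.ceiling - (self.ceiling - self.floor) * fraction

    def should_censor(self, probability: float) -> bool:
        """Decide on one word, then update the speaker's history."""
        decision = probability >= self.threshold()
        self.words_heard += 1
        if decision:
            self.words_censored += 1
        return decision


if __name__ == "__main__":
    censor = SpeakerCensor()
    # Hypothetical recognizer outputs for consecutive words from one speaker.
    for p in (0.70, 0.99, 0.70):
        print(p, censor.should_censor(p), round(censor.threshold(), 2))
```

In this sketch the same borderline probability of 0.70 is allowed through while the speaker has a clean record, but is censored once the speaker has been caught using undesired speech, which is the behavior the paragraph above calls for. A practical system would likely smooth the history over a time window rather than use a running total.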
Censorship of spoken language can be annoying if each obscenity or profanity is “bleeped” to obscure it so that it is not understood, particularly if such utterances occur too frequently. Accordingly, it would be desirable for an automatic censor system and method to employ an alternative approach wherein the undesired speech is replaced with an acceptable word or phrase, reduced in volume below audibility, or deleted from the audio data. While a human censor can prevent an utterance from being broadcast or can overwrite the undesired language with a word or phrase, human censors do not have the capability to produce an acceptable simulated utterance for a given speaker, to overwrite the undesired utterance of that speaker. The best technique for preventing undesired language from being heard and/or understood will also depend on the application and on minimizing the adverse impact on the listener. Songs with undesired language will be adversely impacted if that language is bleeped out. A much smaller impact on the listener's experience can be achieved by simply removing the undesired words or attenuating the volume of the words, but not the music, when such words or phrases are automatically detected.
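The three alternatives to bleeping named above (replacement, attenuation, and deletion) can be sketched as operations on a span of audio samples. This is only an illustration under stated assumptions: the function name `censor_span`, the representation of audio as a list of floating-point samples, and the assumption that the start and end of the undesired word are already known are all hypothetical. Attenuating the words but not the music in a song would additionally require separating the voice from the accompaniment, which this sketch does not attempt.

```python
from typing import List, Optional, Sequence


def censor_span(
    samples: Sequence[float],
    start: int,
    end: int,
    mode: str,
    gain: float = 0.0,
    replacement: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return a copy of `samples` with samples[start:end] censored.

    mode "attenuate": scale the span by `gain` (0.0 silences it).
    mode "delete":    drop the span, shortening the audio.
    mode "replace":   substitute `replacement` samples for the span.
    """
    head, tail = list(samples[:start]), list(samples[end:])
    if mode == "attenuate":
        middle = [s * gain for s in samples[start:end]]
    elif mode == "delete":
        middle = []
    elif mode == "replace":
        if replacement is None:
            raise ValueError("replace mode needs replacement samples")
        middle = list(replacement)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return head + middle + tail


if __name__ == "__main__":
    audio = [1.0, 2.0, 3.0, 4.0]  # toy stand-in for real audio samples
    print(censor_span(audio, 1, 3, "attenuate", gain=0.5))
    print(censor_span(audio, 1, 3, "delete"))
    print(censor_span(audio, 1, 3, "replace", replacement=[9.0]))
```

Which mode is appropriate depends on the application: deletion changes the timing of the audio, attenuation preserves timing, and replacement requires a suitable substitute utterance to be available.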