By translating voice data into text, speech recognition has played an important part in many Natural Language Processing (NLP) technologies. For instance, speech recognition has proven useful to technologies involving vehicles (e.g., in-car speech recognition systems), technologies involving health care, technologies involving the military and/or law enforcement, technologies involving telephony, and technologies that assist people with disabilities. Speech recognition systems are often trained and deployed to end-users. The training phase typically involves training an acoustic model in the speech recognition system to recognize text in voice data. The training phase often includes capturing voice data, transcribing the voice data into text, and storing pairs of voice data and text in transcription libraries. The end-user deployment phase typically includes using a trained acoustic model to identify text in voice data provided by end-users.
Conventionally, transcribing voice data into text in the training phase has proved difficult. Transcribing voice data into text often requires analysis of a large amount of voice data and/or variations in voice data. In some transcription processes, dedicated transcription teams listened to voice data and manually entered text corresponding to the voice data into transcription libraries. These transcription processes often proved expensive and/or impractical because of the large number of sound variations (different words, intonations, pitches, tones, etc.) in a given language.
Some transcription processes have distributed transcription tasks to different people, such as individuals with voice-enabled devices. Though some of these crowdsourced transcription processes are less expensive and/or more practical than transcription processes involving dedicated teams, these crowdsourced transcription processes often introduce noise into the transcription process. Examples of noise commonly occurring in crowdsourced transcription processes include errors from incorrect transcriptions and intentionally introduced inaccuracies (e.g., spam, promotional content, inappropriate content, illegal content, etc.).
While conventional noise filtering techniques may reduce noise in many types of crowdsourcing processes, they have not effectively reduced noise in many crowdsourced transcription processes. For example, errors from incorrect transcriptions may exhibit irregular patterns, and may be difficult to identify without a dedicated audit or validation process. As another example, though errors related to intentionally introduced inaccuracies may exhibit regular patterns, spammers and others introducing these errors may be able to circumvent automated validation measures (e.g., test questions, captions that are not machine-readable, audio that is not understandable to machines, etc.). It would be desirable to provide systems and methods that effectively transcribe speech to text without significant noise.