In this digital age, mobile phones have become accessible to almost everyone in the world. This reach can be used to derive a demographic advantage for crowd work. However, the world's languages are read and written in many different scripts.
Speech transcription is the process of writing down spoken words, as heard, in the script of the language being spoken. Speech transcription has generally relied on crowd workers who are native speakers of the source language. Recently, mismatched crowds, whose workers are unfamiliar with the spoken language, have been used to transcribe speech in Roman script. The inventors here have recognized several technical problems with such conventional systems, as explained below. Such crowdsourcing still assumes that the crowd worker is familiar with the Roman script. These requirements clearly limit the addressable crowd size. Thus it is important to explore the utility of a highly mismatched crowd, whose workers are not only unfamiliar with the spoken language but also know only their native script, which may not be the Roman script. In this invention, we utilize such a highly mismatched multilingual crowd for speech transcription.
When a highly mismatched crowd is used for speech transcription, an intermediate transliteration step may take place. This intermediate step may use English as a pivot script from which the transcription in the original script is decoded. Because a simple transliteration process cannot account for the errors made by the transcriber, the system models these errors with a phoneme-level insertion-deletion-substitution channel model. In other words, the multi-scripted crowd responses can first be transliterated into English (Roman) script, and phoneme sequences are then obtained using models that map English graphemes to source-language phoneme sequences. The most likely phoneme sequence is used to model the insertion, deletion, and substitution errors made by the worker. These channels are then used to decode a word in the source script from the maximum-likelihood combination of the crowd work.
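The decoding step described above can be illustrated with a minimal sketch. It assumes that each worker response has already been transliterated and converted to a phoneme sequence, and that a small set of candidate source words with known phoneme sequences is available. The names `Channel`, `alignment_log_score`, and `decode_word`, the error probabilities, and the example phonemes are all hypothetical and chosen for illustration. The score is that of the single best alignment (a Viterbi-style approximation), not necessarily the exact channel model of the invention.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Channel:
    """Hypothetical per-worker error probabilities (illustrative values only)."""
    p_ins: float = 0.05   # worker writes an extra phoneme
    p_del: float = 0.05   # worker drops a source phoneme
    p_sub: float = 0.10   # worker replaces a source phoneme

    @property
    def p_match(self) -> float:
        return 1.0 - self.p_ins - self.p_del - self.p_sub


def alignment_log_score(source: Sequence[str],
                        observed: Sequence[str],
                        ch: Channel = Channel()) -> float:
    """Log score of the best insertion/deletion/substitution alignment
    that turns `source` into `observed` (dynamic programming)."""
    l_ins, l_del = math.log(ch.p_ins), math.log(ch.p_del)
    l_sub, l_match = math.log(ch.p_sub), math.log(ch.p_match)
    n, m = len(source), len(observed)
    score = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = -math.inf
            if i > 0:                       # source phoneme deleted
                best = max(best, score[i - 1][j] + l_del)
            if j > 0:                       # observed phoneme inserted
                best = max(best, score[i][j - 1] + l_ins)
            if i > 0 and j > 0:             # match or substitution
                step = l_match if source[i - 1] == observed[j - 1] else l_sub
                best = max(best, score[i - 1][j - 1] + step)
            score[i][j] = best
    return score[n][m]


def decode_word(candidates: Dict[str, Sequence[str]],
                responses: List[Tuple[Sequence[str], Channel]]) -> str:
    """Return the candidate word whose phonemes best explain all responses,
    treating workers as independent (log scores are summed)."""
    def total(word: str) -> float:
        return sum(alignment_log_score(candidates[word], obs, ch)
                   for obs, ch in responses)
    return max(candidates, key=total)
```

For example, with candidates `{"paani": ("p", "aa", "n", "i"), "paan": ("p", "aa", "n")}` and three worker responses of which two are noisy, `decode_word` combines the per-worker scores and selects the candidate that explains the responses best overall. Giving each worker its own `Channel` lets more reliable workers carry more weight.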
The overall system consists of a pre-filtering unit that uses adaptive tests to remove workers of extremely poor quality, and an allocation strategy that assigns a word to workers in an optimized manner until sufficient confidence in the word transcription is built. The intermediate transliteration step helps obtain the phonetic sequences in the source language using grapheme-to-phoneme modelling. In another embodiment, the phoneme sequences can also be modelled directly from the worker's script, without using any pivot script.
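The allocation strategy described above, in which a word is given to more workers until the transcription is sufficiently confident, might be sketched as follows. This is only one plausible reading: the stopping rule (posterior probability of the best candidate under a uniform prior), the `threshold` and `max_workers` parameters, and the function names are assumptions, and the `ask` and `loglik` callables are hypothetical stand-ins for collecting a worker's phoneme sequence and scoring it under that worker's channel. The worker list is assumed to have already passed pre-filtering and to be ordered as the system prefers.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def posterior(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-candidate log scores into probabilities (uniform prior)."""
    top = max(log_scores.values())
    weights = {w: math.exp(s - top) for w, s in log_scores.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def allocate_until_confident(
    candidates: Dict[str, Sequence[str]],
    workers: List[str],
    ask: Callable[[str], Sequence[str]],
    loglik: Callable[[Sequence[str], Sequence[str]], float],
    threshold: float = 0.95,
    max_workers: int = 10,
) -> Tuple[str, float, List[str]]:
    """Query workers one at a time; stop once the best candidate's posterior
    reaches `threshold` or the worker budget is exhausted."""
    log_scores = {word: 0.0 for word in candidates}
    used: List[str] = []
    best = next(iter(candidates))
    confidence = 1.0 / len(candidates)
    for worker in workers[:max_workers]:
        response = ask(worker)              # worker's phoneme sequence
        used.append(worker)
        for word, phones in candidates.items():
            log_scores[word] += loglik(phones, response)
        probs = posterior(log_scores)
        best = max(probs, key=probs.get)
        confidence = probs[best]
        if confidence >= threshold:         # enough evidence: stop allocating
            break
    return best, confidence, used
```

Stopping as soon as the threshold is met means that easy words consume few worker responses, while ambiguous words draw more responses up to the budget.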