Dictation systems may require a large amount of exemplary labelled speech audio data for training. Acquiring the labelled speech audio data typically requires humans to label the data so as to accurately indicate the words present in the speech audio data. Furthermore, performance of a dictation system may be sensitive to the context of the speech audio data (e.g., speaker accent, speaking style, or domain-specific vocabulary), and good performance in a particular context may require exemplary labelled speech audio data for that particular context. However, in many contexts, human labelling of data may be infeasible. For example, dictation systems may be used to input private and/or confidential data, and it may not be feasible to provide the private and/or confidential data to a third party for labelling. Furthermore, human labelling of data may be expensive, time-consuming, and/or error-prone.