Certain speech recognition applications and tasks benefit from a conversational level of understanding; examples include user interaction with a game console or with a personal assistant application on a personal device. Ideally, the statistical language models used for speech recognition in such tasks are trained with text data that is similar to the target domain for which the application is built, for example, entertainment search. In particular, the training text data is similar to the target domain in format and content (such as word sequences containing not only entities but also the carrier phrases around those entities) and in style (such as natural language word sequences). Such training data may be necessary for training or adapting statistical language models that can be used for real-time speech recognition (e.g., N-gram models or techniques for first-pass decoding).
However, collecting such training data, even through crowdsourcing, can be expensive and time-consuming. Further, existing approaches train language models using queries (such as all search queries that hit a certain set of URLs expected to represent a target domain, or queries that are associated with knowledge graph entities) and entity lists. These data sources either exhibit some content mismatch or style mismatch with the target domain, or lack popularity information, and therefore, without additional data processing, do not satisfy the requirements for real-time conversational understanding.