At any given stage of an interaction between a directed-dialog application and a user, it is often easier to guess the user's intention than the exact choice of words in the user's response. For example, when a user calls in to an Interactive Voice Response (IVR) system of a railway, it can be surmised that (s)he is most likely interested in one of the following: a train status, a reservation, a fare, an agent, or something else. However, it is more difficult to guess how (s)he will phrase the query. The challenge is compounded by the disfluencies (fillers, false starts, repetitions, etc.) inherently present in human speech. Thus, using an Automatic Speech Recognition (ASR) system based on a set of rule-based grammars that enumerate all of the possible user responses is cumbersome and sub-optimal. At the same time, using a standard large-vocabulary Language Model (LM) would also be sub-optimal, as it does not take advantage of the restricted set of words and phrases from which the user can choose. In such situations, a class-LM is typically used.
Class-LMs are similar to standard LMs, with one difference: some of the entries in a class-LM are tokens, or classes, each of which contains one or more words or phrases that typically either occur in similar contexts or convey the same meaning. In addition, entries can be added to the classes of a class-LM (referred to as fanout-increase) without the need to retrain the LM. Classes can also be transferred from one dialog system to another, and class-LMs typically need less training data than standard LMs.
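As an illustration only, the following minimal Python sketch shows why a fanout-increase does not require retraining. It assumes an unsmoothed bigram model over class-tagged text and a uniform distribution over the members of each class. The `<CITY>` class, the toy training sentences, and the function names are hypothetical and are not taken from any particular ASR toolkit.

```python
from collections import Counter

# Toy training data, already tagged with class tokens (an assumption for brevity).
TRAINING = [
    ["<s>", "fare", "to", "<CITY>", "</s>"],
    ["<s>", "train", "status", "for", "<CITY>", "</s>"],
    ["<s>", "fare", "to", "<CITY>", "please", "</s>"],
]

def train_bigrams(sentences):
    """Count bigrams and their histories over class-tagged token sequences."""
    bigrams, histories = Counter(), Counter()
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            bigrams[(prev, cur)] += 1
            histories[prev] += 1
    return bigrams, histories

def sentence_prob(items, bigrams, histories, classes):
    """Probability of a sentence whose items are plain words or
    (class_token, phrase) pairs.

    The n-gram part scores the class token; the class part then scores the
    phrase given the class, assumed uniform over the class members.
    """
    tokens = [it[0] if isinstance(it, tuple) else it for it in items]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if histories[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / histories[prev]  # unsmoothed ML estimate
    for it in items:
        if isinstance(it, tuple):
            cls, phrase = it
            members = classes.get(cls, [])
            if phrase not in members:
                return 0.0
            prob *= 1.0 / len(members)
    return prob

# Hypothetical class and its current members (its "fanout").
classes = {"<CITY>": ["boston", "new york"]}
bigrams, histories = train_bigrams(TRAINING)

query = ["<s>", "fare", "to", ("<CITY>", "chicago"), "</s>"]
before = sentence_prob(query, bigrams, histories, classes)  # "chicago" is not yet a member

classes["<CITY>"].append("chicago")  # fanout-increase: the n-gram counts are untouched
after = sentence_prob(query, bigrams, histories, classes)
print(before, after)
```

In this sketch the n-gram counts are computed once over class tokens, so adding "chicago" to the class changes only the within-class term. A practical system would add smoothing and could weight class members non-uniformly.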
However, for ASR systems in typical IVR setups that include a class-LM, a challenge remains: given a set of classes (or embedded grammars), it is not clear how to determine an optimal way to embed those classes/grammars in the LM.