1. Field of the Invention
The present invention relates to the development and delivery of computer-based instruction in literacy or language skills. More particularly, it relates to the use of computers using speech recognition in a distributed manner over a wide area network such as the World Wide Web.
2. Background Art
Computer-based instruction in reading and language arts was first developed and studied in the 1970s by Patrick Suppes and his colleagues at Stanford University (Fletcher and Suppes, 1972). Although oral reading instruction remained the goal, computer-based approaches evolved toward it slowly: first using text exclusively; then text with graphics enhancements (important for calling attention to particular spelling patterns or associating words with pictures, for example); then text with graphics and recorded or digitized audio output; and finally text, graphics, audio output, and capture and playback of learners' oral input.
As systems progressed along this continuum of complexity, they addressed different component skills of reading or language use. An early system, as detailed in U.S. Pat. No. 4,078,319 of Mazeski et al., utilized optical components to train the student's eyes to focus on particular material and to advance laterally along the line of material to be taught, purportedly conditioning the operation of reading by training the mechanics of eye movement.
A reading teaching system having simultaneous audio and visual display of material is taught in U.S. Pat. No. 4,397,635 of Samuels, wherein the audio and/or the visual display of the material is altered to enhance the student's understanding of the meaning of the material to be learned (e.g., lower volume audio and downwardly-slanted lettering could be used to present the word “down”).
The system taught in U.S. Pat. No. 5,540,589 of Waters involves the use of an audio tutoring system to enable productive trials with feedback and to monitor student progress. Essentially, it offers practice on productions in which the student shows errors, terminating the session when the number of errors exceeds a preset threshold.
A particular challenge associated with speech recognition systems and critical to successful implementation of an automatic interactive reading system for young children is the development of a model of children's speech with which their productions can be correctly identified. Most speech recognition systems have been developed by building acoustic models out of samples of adult speech. However, children's speech differs markedly from that of adults in terms of, for example, segmentation points, fluency, influence of home language or dialects, articulation control, pitch or even the number of distinct phonemes. The system taught in U.S. Pat. No. 6,017,219 of Adams, Jr. et al. was the first known example of a computer-based reading and language instruction system that used an acoustic model developed from children's speech data.
By way of background, many speech-recognition techniques achieve their effectiveness by operating against models of the way the user or groups of users produce language sounds, called acoustic models, and models of the way they string words together, called language models. (For our purposes, we will limit the discussion to acoustic models.)
Current standard practice involves constructing an acoustic model using a representation known as a Hidden Markov Model (HMM). An HMM usually relates two sets of stochastically distributed elements to one another, and an HMM with two such sets is used here for illustration. The first set of elements comprises an observable sequence of events, e1, e2, e3, . . . en.
The second set comprises a sequence of states that are not observable (hence hidden), s1, s2, s3, . . . sn. The probability of an event ei is conditioned on the hidden state in which it occurs, and so is expressed as the conditional probability p(ei|sj).
Because events may occur in different states, for any particular observed sequence of events e there exists a sequence of states si, sj, sk, . . . sn that is the most probable explanation of the events observed.
HMMs provide a general representation for identifying the state or sequence of states most likely to have produced a sequence of events. If the problem is thought of as finding a label or sequence of labels for a sequence of events, then the appropriateness of an HMM for speech recognition becomes clearer.
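By way of illustration only, the search for the most probable state sequence described above is commonly carried out with the Viterbi algorithm. The following is a minimal sketch of that algorithm, not a description of any cited system. The function name, the two state labels ("AH", "EE"), the two event symbols ("low", "high"), and all probabilities are hypothetical values chosen for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def most_probable_states(
    events: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Viterbi decoding: most probable hidden-state path for a non-empty event sequence."""
    # best[t][s] = (probability of the best path ending in state s at step t, predecessor)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {s: (start_p[s] * emit_p[s].get(events[0], 0.0), None) for s in states}
    ]
    for event in events[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor q that maximizes the path probability into s.
            p, q = max((prev[q][0] * trans_p[q][s], q) for q in states)
            column[s] = (p * emit_p[s].get(event, 0.0), q)
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Hypothetical toy model: two hidden sound labels, two observable event symbols.
STATES = ["AH", "EE"]
START = {"AH": 0.5, "EE": 0.5}
TRANS = {"AH": {"AH": 0.7, "EE": 0.3}, "EE": {"AH": 0.3, "EE": 0.7}}
EMIT = {"AH": {"low": 0.9, "high": 0.1}, "EE": {"low": 0.2, "high": 0.8}}

if __name__ == "__main__":
    path, prob = most_probable_states(["low", "low", "high"], STATES, START, TRANS, EMIT)
    print(path)  # ['AH', 'AH', 'EE']
```

In this toy case the returned labels play the role of the "sequence of labels" for the observed events. A practical recognizer would work with log probabilities and continuous acoustic features rather than two discrete symbols.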
In general terms, the efficacy of HMMs for a given problem distills to a comparison of the parameters that describe them. Briefly, these comprise:
1. the number of states in the HMM,
2. the number of distinct observations that can occur within each state,
3. the distribution of transition probabilities from state to state,
4. the distribution of event probabilities,
5. the startup state distribution.
Optimization of the models proceeds by adjusting these parameters of the HMM so that the states it nominates as the most likely match for the observed events are discriminated more clearly from other candidates.
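By way of illustration only, the five parameters listed above can be gathered into a single data structure. The sketch below assumes a discrete HMM in which events are integer symbol indices. The class name, field names, and method names are hypothetical and chosen for the example. The `likelihood` method uses the standard forward algorithm to compute the probability of an event sequence, which is the kind of quantity an optimization procedure would try to improve by adjusting the parameters. No training procedure is shown.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DiscreteHMM:
    """Illustrative container for the five HMM parameters (all names are assumptions)."""
    n_states: int                  # 1. number of states
    n_symbols: int                 # 2. number of distinct observations per state
    transition: List[List[float]]  # 3. transition[i][j] = p(next state j | state i)
    emission: List[List[float]]    # 4. emission[j][k] = p(event k | state j)
    initial: List[float]           # 5. initial[i] = p(first state is i)

    def validate(self) -> None:
        """Raise ValueError unless every row is a distribution of the expected length."""
        if len(self.transition) != self.n_states or len(self.emission) != self.n_states:
            raise ValueError("expected one transition row and one emission row per state")
        rows = [(self.initial, self.n_states)]
        rows += [(row, self.n_states) for row in self.transition]
        rows += [(row, self.n_symbols) for row in self.emission]
        for row, width in rows:
            if (len(row) != width or min(row) < 0.0
                    or not math.isclose(sum(row), 1.0, abs_tol=1e-9)):
                raise ValueError(f"not a probability distribution of length {width}: {row}")

    def likelihood(self, events: List[int]) -> float:
        """Forward algorithm: probability of a non-empty event sequence under the model."""
        n = self.n_states
        # alpha[j] = p(events so far, current state is j)
        alpha = [self.initial[j] * self.emission[j][events[0]] for j in range(n)]
        for e in events[1:]:
            alpha = [
                sum(alpha[i] * self.transition[i][j] for i in range(n)) * self.emission[j][e]
                for j in range(n)
            ]
        return sum(alpha)


if __name__ == "__main__":
    model = DiscreteHMM(
        n_states=2,
        n_symbols=2,
        transition=[[0.7, 0.3], [0.3, 0.7]],
        emission=[[0.9, 0.1], [0.2, 0.8]],
        initial=[0.5, 0.5],
    )
    model.validate()
    print(model.likelihood([0, 0]))
```

Keeping the parameters in one validated structure makes it straightforward to compare two candidate models by evaluating each on the same observed events.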
It is difficult for a young child to provide a system with substantial samples of his or her ways of using language. Today's speech recognition systems require so much training material to create an acoustic model that the child would have to be able to read the material aloud, something the child may not be ready or able to do. Repeating large amounts of speech to provide the system with adequate material from which to build a model also lies beyond the capabilities of most children.
Usually such training of the speech-recognition system involves reading lengthy passages of text so that the system has both the textual material and its acoustic representation with which to adapt its acoustic model. Acoustic models may seem more critical than language models, whose function is to reflect the likelihood that a particular word follows a given word or sequence of words in spoken or written messages. Dictation tasks use such likelihood estimates to help disambiguate acoustic data gleaned from the speaker. In reading tasks, by contrast, the text indicates precisely what the reader should produce, without recourse to probability. Even so, language models can be used to predict miscues, such as word omissions, substitutions, or transpositions, so they are useful in helping to identify erroneous productions based on data collected from previous readings of the material.
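By way of illustration only, the notion of a language model as "the likelihood that a particular word follows a word" can be made concrete with a bigram model estimated by relative frequency. The sketch below is one simple, well-known form of language model and is not drawn from any cited system. The function name and the three-sentence corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def bigram_probabilities(sentences: Iterable[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate p(next word | previous word) by relative frequency of word pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words in sentences:
        # Count each adjacent (previous, next) word pair in the sentence.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for prev, followers in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical toy corpus of previously observed word sequences.
    corpus = [
        ["the", "cat", "sat"],
        ["the", "cat", "ran"],
        ["the", "dog", "sat"],
    ]
    model = bigram_probabilities(corpus)
    print(model["the"])  # "cat" follows "the" in 2 of 3 cases, "dog" in 1 of 3
```

If such counts were collected from children's earlier readings of a passage rather than from the passage itself, frequently observed departures from the text would receive non-zero probability, which is one way the miscue prediction mentioned above could be approached.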
An approach requiring users to provide the system samples of how they read will not work well with students who cannot yet read, so systems using speech recognition to teach reading must use a single pre-existing acoustic model to acceptably recognize what the child has read without requiring the child to provide additional training input. This is not a satisfactory solution because variation among the forms of children's speech militates against the use of a single acoustic model for literacy training that is released along with the application using it.