The present invention relates to telephone answering devices and, in particular, to a telephone answering device having a first speech coder for encoding/decoding fixed voice prompt messages based on a first set of codebooks and a second speech coder for encoding/decoding incoming and outgoing voice messages based on a second set of codebooks that is significantly larger than the first set of codebooks.
In telecommunication devices, such as digital telephone answering devices (DTADs), speech processing systems are employed to store and forward speech sounds. Conventional digital telecommunication devices provide for the storage and playback of incoming voice messages, outgoing voice messages, and fixed prompt voice messages. Incoming voice messages include messages transmitted over the telephone line by the calling party and recorded by the DTAD. Outgoing messages are the pre-recorded messages played by the DTAD in response to receiving a telephone call. For example, the outgoing message might state “I am presently unavailable. At the sound of the tone please leave a brief message.” Incoming and outgoing messages are stored in a read/write memory in the DTAD. These messages are unrestricted in the number of utterances or phrases that may be expressed and in the number of speakers, limited only by the size of the memory. Another type of audio message played by the DTAD is a fixed voice prompt message, or “voice read only message” (VROM), such as a date/time stamp, which contains significantly fewer utterances or phrases, all spoken by a single speaker. Since the fixed voice prompt messages need only be read, and not changed, they are stored in a read only memory (ROM).
In a conventional DTAD, the VROM messages are stored in an external ROM and compressed using the same coding technique used for storing incoming and outgoing messages, for example, code-excited linear predictive (CELP) coding. Alternatively, the VROM messages may be stored on a linear predictive coding (LPC) synthesis chip; however, LPC synthesis provides lower quality than CELP coding. External voice ROMs and LPC synthesis chips are relatively large. The overall size of the circuitry may be reduced by storing the VROM messages in a smaller memory device, such as a digital signal processor read only memory (DSP ROM). However, the cost of a DSP ROM increases significantly with its storage capacity, so it is preferable to use a DSP ROM with a relatively small capacity. For example, in a 16 k DSP ROM, approximately 12 k words are used to store the speech encoding program and other programs, leaving only approximately 4 k words for the fixed voice prompts. The typical total recording time of the time/day stamp fixed voice prompts is approximately 37 seconds. At an encoding rate of 6.8 kbps, which is generally used in DTADs employing a codebook trained on a relatively large number of utterances and speakers, these prompts require at least 15,725 words of storage. The storage required for the fixed voice prompts therefore far exceeds the capacity available in a typical low-cost DSP ROM. Although DSP ROMs with larger capacities, such as 24 k or 32 k, may be used, they are significantly more expensive and thus may be impractical.
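The storage figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a 16-bit DSP word (which reproduces the 15,725-word figure exactly) and treats "4 k words" as 4,000 words. The text states neither assumption explicitly.

```python
# Storage arithmetic for the fixed voice prompts, using the figures quoted above.
# Assumptions (not stated in the text): one DSP ROM word is 16 bits, and
# "4 k words" means 4,000 words.

WORD_BITS = 16            # assumed DSP word width in bits
PROMPT_SECONDS = 37       # typical total duration of the time/day stamp prompts
GENERAL_RATE_BPS = 6800   # 6.8 kbps rate used with the large, general codebook
FREE_ROM_WORDS = 4000     # ~4 k words left for prompts in a 16 k DSP ROM (assumed 4,000)


def words_needed(rate_bps: int, seconds: int, word_bits: int = WORD_BITS) -> int:
    """Return the number of ROM words needed to store `seconds` of coded speech."""
    total_bits = rate_bps * seconds
    return -(-total_bits // word_bits)  # integer ceiling division


def max_rate_bps(free_words: int, seconds: int, word_bits: int = WORD_BITS) -> float:
    """Return the highest average bit rate whose output still fits in `free_words`."""
    return free_words * word_bits / seconds


needed = words_needed(GENERAL_RATE_BPS, PROMPT_SECONDS)
print(f"words needed at 6.8 kbps: {needed}")
print(f"fits in ~4 k words? {needed <= FREE_ROM_WORDS}")
print(f"max rate that fits: {max_rate_bps(FREE_ROM_WORDS, PROMPT_SECONDS):.0f} bps")
```

Under these assumptions, 6.8 kbps × 37 s = 251,600 bits = 15,725 words, roughly four times the space available. The prompts would need an average rate of about 1.7 kbps to fit, which motivates a smaller, prompt-specific codebook.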
It is therefore desirable to develop a DTAD in which the fixed voice prompts are stored in a DSP ROM at a reduced compression bit rate while maintaining the quality of the reconstructed speech or voice data.
For the purposes of this invention, the term “set of codebooks” is defined to include an LPC codebook, an adaptive codebook, and a fixed codebook. In addition, the term “voice message” includes both incoming and outgoing voice messages. The terms “voice read only message” and “fixed voice prompt” are synonymous.
The digital telephone answering device in accordance with the present invention includes two separate coders: a first speech coder for encoding/decoding fixed voice prompts spoken by a single speaker, and a second speech coder for encoding/decoding voice messages spoken by multiple speakers. The first speech coder uses a first set of codebooks generated by training on a first set of utterances spoken by a single speaker, while the second speech coder uses a second set of codebooks generated by training on a second set of utterances spoken by multiple speakers. Because the first set of utterances is significantly smaller than the second set, and because the pitch-period range of the single speaker is significantly narrower than that of the multiple speakers, the first set of codebooks is significantly smaller than the second set of codebooks. As a result, the fixed voice prompt messages may be compressed at a lower bit rate while maintaining relatively high encoding quality, thereby optimizing the codebook and reducing the amount of memory required to store the encoded fixed voice prompts. Furthermore, because the fixed voice prompts can be encoded off line, the DSP need not encode them in real time; it performs only their decoding in real time.
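The link between codebook size and bit rate can be sketched by counting the index bits each frame must carry. In the Python sketch below, every size (codebook entries, pitch-lag ranges, 20 ms frames with four subframes) is a hypothetical value chosen for illustration, not a parameter of the invention. Gain indices are omitted, so the resulting rates are not meant to reproduce the 6.8 kbps figure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookSet:
    """Hypothetical sizes for one "set of codebooks" (LPC, adaptive, fixed)."""
    lpc_entries: int    # entries in the LPC (spectral envelope) codebook
    pitch_lag_min: int  # smallest pitch lag, in samples, covered by the adaptive codebook
    pitch_lag_max: int  # largest pitch lag, in samples (inclusive)
    fixed_entries: int  # entries in the fixed (excitation) codebook


def index_bits(n_entries: int) -> int:
    """Bits needed to address one of `n_entries` codebook entries."""
    return max(1, math.ceil(math.log2(n_entries)))


def bits_per_frame(cb: CodebookSet, subframes: int = 4) -> int:
    """One LPC index per frame plus one adaptive and one fixed index per subframe.
    Gain indices are deliberately left out to keep the sketch short."""
    lag_range = cb.pitch_lag_max - cb.pitch_lag_min + 1
    per_subframe = index_bits(lag_range) + index_bits(cb.fixed_entries)
    return index_bits(cb.lpc_entries) + subframes * per_subframe


def bit_rate_bps(cb: CodebookSet, frame_ms: int = 20, subframes: int = 4) -> float:
    """Average bit rate implied by the index bits of each frame."""
    return bits_per_frame(cb, subframes) * 1000 / frame_ms


# Hypothetical sizes: a general multi-speaker coder vs. a single-speaker prompt coder.
multi_speaker = CodebookSet(lpc_entries=4096, pitch_lag_min=20, pitch_lag_max=147, fixed_entries=1024)
single_speaker = CodebookSet(lpc_entries=256, pitch_lag_min=40, pitch_lag_max=71, fixed_entries=64)

for name, cb in (("multi-speaker", multi_speaker), ("single-speaker", single_speaker)):
    print(f"{name}: {bits_per_frame(cb)} bits/frame, {bit_rate_bps(cb):.0f} bps")
```

With these assumed sizes, the multi-speaker set needs 80 bits per frame (4000 bps) and the single-speaker set needs 52 bits per frame (2600 bps). The saving comes from the narrower pitch-lag range and the smaller LPC and fixed codebooks, each of which shortens an index.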
In addition, the present invention is directed to a method of using the telephone answering device described above. Fixed voice prompts are encoded using a first speech coder having a first set of codebooks generated by training on a first set of utterances spoken by a single speaker. Incoming/outgoing voice messages are encoded using a second speech coder having a second set of codebooks generated by training on a second set of utterances spoken by multiple speakers, wherein the second set of utterances is larger than the first set of utterances. The encoded fixed voice prompts and voice messages are stored in first and second memory devices, respectively, for future retrieval and playback.
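The method can be outlined with a toy example. The sketch below substitutes simple nearest-neighbour vector quantization for CELP and collapses each set of codebooks into a single codebook. The names `VQCoder`, `ROM`, and `RAM`, and all codebook contents, are hypothetical. A read-only mapping stands in for the first memory device, and a list stands in for the second.

```python
import math
from types import MappingProxyType
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


class VQCoder:
    """Toy vector-quantization coder standing in for a CELP coder (assumption).

    Encoding maps each input vector to the index of its nearest codebook entry
    (squared Euclidean distance); decoding looks those entries back up."""

    def __init__(self, codebook: Sequence[Vector]):
        self.codebook = list(codebook)
        self.bits_per_index = max(1, math.ceil(math.log2(len(self.codebook))))

    def encode(self, frames: Sequence[Vector]) -> List[int]:
        return [
            min(range(len(self.codebook)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(f, self.codebook[i])))
            for f in frames
        ]

    def decode(self, indices: Sequence[int]) -> List[Vector]:
        return [self.codebook[i] for i in indices]


# First coder: small codebook (hypothetically trained on one speaker's prompts).
prompt_coder = VQCoder([(0.0, 0.0), (1.0, 1.0)])
# Second coder: larger codebook (hypothetically trained on many speakers).
message_coder = VQCoder([(x / 2, y / 2) for x in range(4) for y in range(4)])

# Offline step: encode the prompts once and expose them read-only, mimicking a ROM.
ROM = MappingProxyType({"monday": prompt_coder.encode([(0.1, 0.2), (0.9, 1.1)])})

# Run time: incoming/outgoing messages are encoded into read/write memory.
RAM: List[List[int]] = []
RAM.append(message_coder.encode([(0.4, 1.3), (1.6, 0.1)]))

# Playback: each stored item is decoded with the coder that encoded it.
print(prompt_coder.decode(ROM["monday"]))
print(message_coder.decode(RAM[0]))
print(prompt_coder.bits_per_index, message_coder.bits_per_index)
```

The prompt indices need 1 bit each and the message indices need 4 bits each, which mirrors, in miniature, how a smaller prompt codebook lowers the storage cost per frame.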