1. Field of the Invention
The present invention relates to systems and methods of generating phonetic and prosody based information signals from data provided by content providers and, more specifically, the present invention relates to an authoring system for creating and editing such signals. The authoring system is preferably applicable to an individualized and interactive radio and communications system and services, including interactively controlled creation, repository, and delivery of content by human quality voice via a digital wireless communication infrastructure, to a wireless network computer, either hand-held or mountable in vehicles.
2. Background of the Prior Art
There have been a number of technology advances in the radio broadcast industry that expand the bandwidth available to mobile customers, add some interactive control functions, improve reception, and allow radio programming to incorporate alphanumeric data. Mobile radios have started integrating additional subsystems, e.g., U.S. Global Positioning System (GPS) receivers to determine vehicle coordinates, and LCD screens for displaying alphanumeric data, e.g., song lyrics or paging messages, or for displaying graphic information, such as a local map retrieved from a co-located CD-ROM.
However, the fundamental radio technology has changed very little ever since its conception. It is still based on the original idea of channeled broadcast, which offers very little customization to listeners. That is, listeners can merely choose among a few channels, limited by the scarcely available bandwidth.
There have been revolutionary changes in information technology in recent years, e.g.: (1) a large amount of information has become available due to the advent of the Internet and other similar repositories; and (2) sufficient overall communications bandwidth has become available to mobile consumers due to the advent of wireless technologies, e.g., cellular telephony and satellite communications.
Currently, this vast amount of information is primarily tuned for visual presentations on computer screens. Similar to computer users, radio listeners have a constantly increasing need for the capability to choose what they want to listen to, and when they want to listen to it.
There have been conventional computer-based attempts to deliver audio presentation of information to computer users on demand. The traditional method of doing so employs audio compression techniques and is quite straightforward. The textual information is first read by a human; the human voice is captured in an audio file; the audio file is then compressed and stored in a network-based information repository for consumption. The playback device, hard-wired to the same wired line network as the repository, retrieves the compressed audio files from the repository, decompresses them, and then plays them back.
Using such a scheme, Real Audio technology delivers AM quality audio (speech or music) if the client communicates with the Real Audio server at about 14 Kbps (kilobits per second), and provides FM quality audio when the available transmission rate is about 28 Kbps or better. AM quality voice compression may be achieved at lower rates. Clearly, there is a trade-off between the compression ratio and the quality of the restored audio. Today, the maximum voice compression accepted by the wireless telephony industry is approximately 7-8 Kbps. For example, the compression scheme used by the digital cellular telephony standard IS-54 is based on a vector-sum excited linear prediction (VSELP) coding technique, which achieves a 7.95 Kbps data rate.
However, this traditional radio on demand scheme assumes transmitting large volumes of digital audio data over long periods of time, i.e., on the order of hours. Over wired lines, with their relatively low communications cost, digital audio transmission is economically acceptable. The customer is usually connected to the Internet or similar services by a 14.4 Kbps or 28.8 Kbps modem over a single local telephone line. Therefore, even FM quality audio can be delivered to the customer very cheaply. The charge usually includes the cost of the local call (usually no additional charge is incurred beyond the basic phone connection cost) and a proportion of the charge paid to the Internet Service Provider (ISP). The latter may also be considered zero (no additional charge) if the ISP service charge is a flat rate.
By contrast, even a contemporary system based on the widely used AMPS (Advanced Mobile Phone Service, the wireless network used by analog cellular phones) modem still only reliably delivers about 4 Kbps to 8 Kbps, depending on the speed of the vehicle, the local geographic landscape, and the number of users simultaneously sharing the available local bandwidth.
Overall, the cost of wireless data transmission is usually about one or two orders of magnitude higher than that of wired data transmission. Clearly, such a method of transmitting compressed voice defeats the purpose of using wireless data communication in the first place, because compressed speech takes at least as much data bandwidth as can be transmitted over a wireless telephony channel. In other words, the cost of digital voice transmission over AMPS is then approximately the same as that of transmitting an analog source without compression.
To allow users to share the cost of wireless data transmission, several companies have introduced the so-called Cellular Digital Packet Data (CDPD) technique. It allows multiple users to be connected to an IP (Internet Protocol) network permanently by sharing an idle AMPS channel and hopping between idle AMPS channels. The average data rate per CDPD user depends on the number of users sharing the channel. For example, if 20 users simultaneously send or receive data via one channel, the individual average data rate will be just about 400 bps, which is sufficient for e-mail and other relatively short messages. The cost of transmission per byte is somewhat higher than using AMPS, but the packet approach to data transmission allows providers to charge users for the amount of data transmitted, not for the connection time. However, the above-described traditional scheme of compressed audio transmission requires much more bandwidth than is available to users connected to the audio source via a CDPD network.
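The size of this bandwidth shortfall can be illustrated with a short calculation. The following Python sketch is illustrative only: it uses the 7.95 Kbps VSELP rate and the approximately 400 bps per-user CDPD rate quoted in this section, and the helper name realtime_factor is a hypothetical name introduced here for the example.

```python
def realtime_factor(required_bps: float, available_bps: float) -> float:
    """Return how many seconds of channel time are needed to carry
    one second of audio (values above 1 mean the link is too slow)."""
    if available_bps <= 0:
        raise ValueError("available_bps must be positive")
    return required_bps / available_bps


# Figures quoted in the text above.
VSELP_BPS = 7950        # IS-54 VSELP compressed speech
CDPD_SHARE_BPS = 400    # average per-user rate with 20 users on one channel

factor = realtime_factor(VSELP_BPS, CDPD_SHARE_BPS)
print(f"Compressed speech needs about {factor:.1f} times the per-user CDPD rate")
```

Under these assumptions, each second of compressed speech would occupy nearly twenty seconds of a user's share of the channel, so real-time playback is not possible.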
It is anticipated that in a few years, Personal Communication Systems (PCS) will have a somewhat better digital data transmission rate than AMPS and CDPD, but still will not be economical for long hours of wireless digital audio transmission.
It is clear from the above discussion that using traditional methods of transmitting large volumes of digital audio data to radio devices is prohibitively expensive, because the cost of the wireless communications media is optimized for relatively short transmissions, e.g., an average voice phone call or electronic mail. In the foreseeable future, known techniques will not yield the compression ratios necessary for economical transmission of audio data over wireless lines while providing an acceptable broadcast audio quality.
Today, the only known method to deliver large amounts of data wirelessly is speech synthesis. A low bit rate may be obtained using Text-To-Speech (TTS) conversion technology. Regular text is represented by about 8 to 20 characters per second, and so requires a maximum transmission data rate of 160 bps; however, the resulting speech does not deliver an acceptable human intonation.
Although arbitrary speech conversion is based on prosody rules as well as syntactic and morphological analysis, achieving a human speaker's voice quality has not been feasible so far. One of the requirements of radio transmission is to deliver a speaker's intonation accurately, because the speaker's prosody reflects certain aspects of his/her personality and state of mind. While speech compression delivers the speaker's intonation precisely, arbitrary speech synthesis frequently does not.
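The 160 bps figure follows from the character rate if each character is assumed to occupy 8 bits; that encoding is an assumption of this sketch, not something the text states. The following Python example, with hypothetical helper names, reproduces the figure and compares the data volume of one hour of text with one hour of speech compressed at the 7.95 Kbps VSELP rate mentioned earlier.

```python
BITS_PER_CHAR = 8  # assumption: one 8-bit byte per character


def text_bit_rate(chars_per_second: int, bits_per_char: int = BITS_PER_CHAR) -> int:
    """Bit rate needed to transmit text at a given speaking rate."""
    return chars_per_second * bits_per_char


def bytes_per_hour(bits_per_second: float) -> float:
    """Data volume, in bytes, of one hour of a constant-rate stream."""
    return bits_per_second * 3600 / 8


slow_text = text_bit_rate(8)    # lower end of the quoted range
fast_text = text_bit_rate(20)   # upper end of the quoted range

print(f"Text stream: {slow_text} to {fast_text} bps")
print(f"One hour of text at {fast_text} bps: {bytes_per_hour(fast_text):,.0f} bytes")
print(f"One hour of 7.95 Kbps speech: {bytes_per_hour(7950):,.0f} bytes")
```

On these assumptions, an hour of text occupies roughly one fiftieth of the data volume of an hour of compressed speech.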
Experiments show that such a "synthetic" intonation is not acceptable to the majority of radio listeners. As a result, most radio listeners become aggravated within a few minutes or lose their attention. This is the reason that TTS has not yet found widespread use, except where the message is short and to the point, like e-mail or a stock market quote.
Despite the many drawbacks described above, as well as others not particularly mentioned, several radio communication service systems have been proposed. The following are examples of such conventional radio communications service systems.
U.S. Pat. No. 5,303,393 to Noreen et al. describes a so-called RadioSat system including key features of nationwide broadcast of FM channels and paging. While some RadioSat data communication services, like paging, could be implemented by using terrestrial communication, e.g., Personal Communication Systems (PCS), only satellite broadcast transmission provides a significant number of additional FM channels nationwide. Also, any substantial amount of data, like digital audio, may be transmitted to mobile RadioSat terminals via satellites only. Many critical interactive RadioSat applications, including two-way voice communication, require satellites to provide a return channel (mobile-to-satellite), which is not the case for many national satellite systems. Even in the United States, the necessary satellite infrastructure to provide full RadioSat services has yet to be built. Next, the user interface and information delivery are based on a touch screen approach, which is unsafe, because user attention has to be switched from the road to the terminal screen frequently, either for receiving the information or for issuing commands. And last, but not least, the scope and spirit of RadioSat services is essentially a radio broadcast. The RadioSat technology merely expands the number of available channels. Thus, each MSAT could support on the order of 166 FM-quality channels, or four times as many talk channels (AM quality). Individualized services to hundreds of thousands of mobile users cannot possibly be provided by the Noreen et al. system.
The USA Digital Radio foundation of U.S. broadcasters has developed a system for the delivery of in-band on-channel (IBOC) digital audio broadcasting (DAB) in order to introduce compact disc quality broadcast radio, while preserving the infrastructure and investment of the broadcast industry AM segment. Key to the realization of IBOC DAB in limited AM band allocations is a powerful source compression algorithm. The AM IBOC audio source encoding scheme is based on MUSICAM®, which is in turn based on the ISO/MPEG I Audio Layer II (ISO 11172-3) standard for audio sub-band encoding. The standard has been advanced through the development of the psycho-acoustic model to the point where music may be transcoded at a rate of 96 Kbps in order to reproduce 16 bit stereo at a 15 kHz audio bandwidth. The resulting 96 Kbps bit stream includes, in addition to compressed music, a 2.4 Kbps ancillary data stream. The compression of music to 96 Kbps enables broadcasting of DAB over the narrow bandwidth available to the AM allocation.
AM offers DAB a readily available network of high quality audio broadcasting facilities and, as such, its data delivery capability can be used to transmit song titles, artists, and album names and lyrics, traffic and weather information, emergency warnings, paging services, stock market quotations, etc. However, IBOC DAB is essentially a broadcast technology which cannot be used for individualized and interactive data or audio transmission.
Still another approach, called the Radio Broadcast Data System (RBDS), allows an FM station to transmit auxiliary data for the newer generation of "smart radios" now coming to the market. The RBDS standard was developed for the U.S. radio market and is an outgrowth of RDS, which has been used in Europe for some time. The RBDS signal is transmitted by an FM station on a 57 kHz subcarrier as a bi-phase coded signal with an overall data rate of 1187.5 bps, including forward error correction. The usable data rate is 730 bps. The signal is made of 16 data groups. Each group delivers data for a different application. Thus, one group is used for Differential GPS data to increase the accuracy of GPS satellite-only based positioning. Another group is used for radio paging. Still other groups are used for station identification. Another group lists alternate station frequencies to let a user stay tuned to the same program when reception is fading. Some groups are used for text transmission, like the radio text group, which allows receiving 64 character messages, and the radio paging group. This list is not complete and differs somewhat between the RDS and RBDS standards. The American RBDS version reserves groups 3, 5, 6 and 7 for renting by station owners to service providers. For example, content providers may transmit newspapers and periodicals, promotional messages and advertising, or an artist's name and the title of a song.
Overall, the useful data transmission rate for a single group is 45.6 bps. This data rate is mostly suitable for scrolling text messages on an LCD screen, e.g., song lyrics. Moreover, it is known that the creators of the RDS standard admit that the Radio Text feature is unlikely to be used in car receivers, due to the distracting effect of a video screen on a driver.
The data transmission rates typical for the RDS/RBDS standards are obviously too slow for any audio-related application. Also, interactive applications are completely out of the scope of those standards. As a result, while the RBDS or RDS standards substantially expand broadcast services, they still do not provide users with individualized and fully interactive audio content transmission.
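The per-group figure is consistent with dividing the 730 bps usable rate evenly among the 16 data groups. The following Python sketch reproduces that division and estimates how long a 64 character radio text message would take at the per-group rate. The even split, the 8 bits per character, and the absence of any framing overhead are simplifying assumptions of this example, and the helper names are hypothetical.

```python
USABLE_RATE_BPS = 730   # usable RBDS data rate quoted in the text
NUM_GROUPS = 16         # number of data groups quoted in the text
BITS_PER_CHAR = 8       # assumption: one 8-bit byte per character


def per_group_rate(usable_bps: float, groups: int) -> float:
    """Usable rate per group, assuming an even split across groups."""
    return usable_bps / groups


def message_seconds(num_chars: int, rate_bps: float) -> float:
    """Time to send a text message at the given rate, ignoring overhead."""
    return num_chars * BITS_PER_CHAR / rate_bps


group_rate = per_group_rate(USABLE_RATE_BPS, NUM_GROUPS)
print(f"Per-group rate: {group_rate:.1f} bps")
print(f"64-character message: {message_seconds(64, group_rate):.1f} s")
```

Under these assumptions, a single 64 character message takes on the order of ten seconds to arrive, which illustrates why the rate suits scrolling text but not audio.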
Another approach is described in U.S. Pat. No. 5,321,514 to Martinez, which proposes making practical use of presently unusable "taboo" frequencies for bi-directional data transfer between consumer digital receivers and television transmitters. The so-called "T-NET" system makes use of a spread spectrum approach to provide bi-directional digital communication on a vacant television channel. The aggregate upstream data rate may reach about 3 Mbps at a service area radius of about 6 miles, so the T-NET system may provide about 10,000 users with an individual upstream data rate of about 300 bps per user and a downstream data rate of about 200 bps per user. While this approach may provide an individualized and interactive data service, interactive audio services are still out of the range of such a system. In addition, using such frequencies may generate unacceptable interference for public television channels, and may not be allowed by the Federal Communications Commission (FCC) routinely and everywhere.
Bell Atlantic has offered a service which allows cellular telephone users to receive pre-recorded voice messages. Those messages may be local news, weather, stock market, traffic announcements and other information. The user requests information by first calling a special number and then punching telephone keys to browse through the menu offered by a pre-recorded voice. However, considering the high cost of a cellular call, such an information system is prohibitively expensive if used more than a few minutes per day. Also, the quality of speech delivered through the cellular phone is typically lower than AM voice quality.
General Motors Corporation introduced its OnStar system for the 1997 Cadillac model. By linking the car's cellular phone to a global positioning satellite, OnStar can locate and send help to a stranded or disabled motorist, including sending medical assistance as soon as it detects that the car's air bag has been deployed. OnStar's service center operator receives the coordinates of an automobile equipped with the OnStar system and can guide its user, over the cellular phone, with continuous directions.