The present invention relates to analysis, processing, transmission, and storage of digital speech data recorded at the network clients of a set of servers.
Use of Internet technologies to interact with customers of enterprises is proliferating. Enterprises are taking advantage of increasing use by consumers of the global, packet-switched network known as the Internet. Through the Internet, consumers may not only obtain information about goods and services, but may also order the goods or services themselves.
To interact with its customers, an enterprise uses a server that is connected to the World Wide Web ("Web"). The World Wide Web includes a network of servers on the Internet. The servers communicate by using the Hypertext Transfer Protocol (HTTP) with clients that are under the control of users, and deliver files referred to as "pages" to the clients. The files typically contain code written in the Hypertext Markup Language (HTML). The page may define a user interface, which is typically a graphical user interface ("GUI").
The pages are delivered to clients that request them. Typically, a client retrieves a page using a computer device that runs a client program, referred to as a browser. A browser is a combination of software and hardware that is capable of using the HTTP protocol to retrieve data from a server connected to the Web. When a browser running on a client receives a page containing code that conforms to HTML, the browser decodes it. The HTML code may define a graphical user interface. Thus, when a browser decodes a page, it generates the GUI. The user interacts with the GUI to enter data, such as text data entered through a keyboard, which the client transmits back to the server.
An organization that operates a server on the Web for the purpose of delivering user interfaces through which customers or constituents of an enterprise may interact with the enterprise is herein referred to as a service provider. A service provider may be an internal organization of an enterprise (e.g. an information systems department), or an external organization, typically hired by the enterprise.
A graphical user interface often includes a display that may contain text, graphical controls (e.g. selection buttons, command buttons, labels), and commands for displaying content defined by other files or sources of data, such as graphical image files or other pages containing HTML code. Thus, HTML code includes commands for displaying content defined by other pages and sources of data. Such commands specify a location ("link") of a source of data (e.g. file, or a server that generates a stream of data).
Most clients and browsers are configured to generate input and output in forms of media other than graphical. Thus, user interfaces generated by a browser are not limited to interacting with users through a mouse or a keyboard. For example, browsers may retrieve digital speech data over the Web from a server. When the digital speech data is received, the client decodes the digital speech data and generates sound. Specifically, a browser under the control of a user may download a page from a server of a service provider. The HTML code in the page may define commands and links for pictures, and, in association with the pictures, (1) a label, and (2) commands and links to retrieve and play digital speech clips. Each picture depicts a product offered by the enterprise associated with the service provider. When HTML code is decoded by a browser, it generates in its display a picture and label adjacent to the picture. The label may be clicked by a user using a mouse. In response to clicking the label, the browser connects to the source of the sound associated with the picture, and begins to receive digital speech data. In response to receiving digital speech data, the client generates sound, and in particular, music and a narrative advertising the product.
Transmission of information between users and service providers is not a one-way process; users may also transmit information to the service providers. For example, a user may download a page that defines a graphical user interface for ordering products. The user enters data for the order through the interface. Entering the order may include typing in information, such as an address and credit card number, which is collected by the browser and eventually transmitted to the service provider.
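The round trip of typed order data can be illustrated with a minimal sketch. It assumes the browser submits the fields as a URL-encoded form body, which is one common convention for HTML forms and is not specified in the text above; the field names and values are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical order fields; the names and values are illustrative only.
order_fields = {
    "address": "1 Main St",
    "card_number": "4111111111111111",
}

# Client side: the form body a browser could place in an HTTP POST request.
body = urlencode(order_fields)
print(body)

# Server side: decode the body back into the fields the user typed.
decoded = {name: values[0] for name, values in parse_qs(body).items()}
print(decoded == order_fields)
```

Text fields like these are small and easy to parse, which is the baseline against which the handling of speech data is later compared.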
Many clients are capable of receiving digital speech input from a user. Thus, an interface downloaded by a browser may be configured not just to convey speech to a user, but also to receive voice input from the user. The voice input received from the user is converted into digital speech data, which is transmitted by the browser to a service provider. The ability to both convey speech to the user and receive speech input from the user provides a method of communication that may be more effective and convenient in many cases.
For example, a service provider sells books using the Web. Customers may download pages that each describe a book which may be purchased from the service provider. The display generated for a particular page contains the following: (1) text describing the book, (2) a command button ("narrative button") associated with a narrative from the author, (3) a set of command buttons ("reader comment buttons") associated with sound clips left by various readers of the book, and (4) a command button ("leave comment button") for leaving a verbal comment about the book. When the user clicks the narrative button, the client retrieves digital speech data for playing the narrative from a server of the service provider, and then plays it back to the user. The user then hears a description of the book in the author's own voice. When the user clicks one of the reader comment buttons, the client retrieves digital speech data for playing a comment left by a reader of the book. The user hears the comment in the reader's own voice, hearing not only the words, but the emotion behind them. Emotion is a concept not easily conveyed in writing for mainstream users.
When the user clicks on the leave comment button, the client prompts the user to provide voice input, which the client records. The client transmits digital speech data of the recording to the service provider's server. Voice input is a method of providing input that is more effective and convenient for many users.
However, receiving voice input is not necessarily as convenient for a service provider. Managing speech input and data requires capabilities beyond those normally needed to process more traditional forms of user entered data, such as ASCII text data entered through a GUI. These capabilities require use of other technologies, and personnel skilled in those technologies to support them. For example, receiving speech input requires techniques for automatically controlling gain (recorded volume of a speaker's voice), compressing speech data, reliably transmitting digital speech data over the Internet, and caching digital speech data transmitted between the client and the server. When the digital speech data is received, it may be stored and further processed. For example, once a server receives digital speech data, the server may apply speech recognition processes to generate, for example, keywords. A database may be needed to manage information used to manage the digital speech data, information that may be extracted from the digital speech data using speech recognition technology.
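As an illustration of one of the capabilities listed above, the following is a minimal sketch of gain adjustment by peak normalization. The text does not say how gain is controlled, so the technique, the function name `normalize_gain`, the 16-bit sample format, and the target level are all assumptions made for the example. Production gain control typically adapts over time rather than scaling a whole clip at once.

```python
def normalize_gain(samples, target_peak=0.8, full_scale=32767):
    """Scale signed 16-bit PCM samples so the loudest one reaches
    target_peak (a fraction of full scale). Hypothetical helper."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Silence or empty input: nothing to scale.
        return list(samples)
    gain = (target_peak * full_scale) / peak
    # Clamp to the valid 16-bit range in case of rounding at the edges.
    return [max(-full_scale - 1, min(full_scale, round(s * gain)))
            for s in samples]


# A quiet recording is scaled up; the relative shape is preserved.
quiet = [0, 1000, -2000, 500]
louder = normalize_gain(quiet)
print(max(abs(s) for s in louder))
```

Even this simple step needs decisions about sample format and target level, which hints at the specialized effort that speech handling adds on the server side.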
Voice processing technology also enables service providers to obtain forms of information not easily obtained from traditional forms of user entered data. For example, anger detection methods may be used to detect anger in digital speech data generated by a customer. Digital speech data may also be used to authenticate a user.
Employing digital speech processing technology requires additional resources to support the technology. Software must be purchased, developed, and maintained. Personnel who are experts in the technology must be hired. The cost of acquiring these resources is often so high that the implementation of digital speech processing technology is uneconomical.
Based on the foregoing, it is clearly desirable to provide a system that lessens the cost associated with processing digital speech data originated from a client on the Internet.
Described herein is a system that enables service providers to integrate speech functionality into their applications. A service provider maintains a set of application servers. To provide a particular speech service to a client of the application server, the application server causes the client to request the speech service from another set of servers. This set of servers is responsible for providing this speech service as well as others. Such speech services include recording digital speech data at the client, and storing the recordings. Later, the application servers may retrieve the recordings and, beyond that, retrieve data derived from the recordings, such as data generated through speech recognition processes.
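The division of labor described above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the described system's implementation: the class names `SpeechServer` and `ApplicationServer`, their methods, and the identifier format are all hypothetical, and network transport is replaced by direct method calls.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Recording:
    recording_id: str
    audio: bytes
    keywords: List[str] = field(default_factory=list)  # derived data


class SpeechServer:
    """Hypothetical stand-in for the servers that provide speech services."""

    def __init__(self) -> None:
        self._recordings: Dict[str, Recording] = {}
        self._ids = itertools.count(1)

    def record(self, audio: bytes) -> str:
        # Store speech data sent by a client and return an identifier.
        recording_id = f"rec-{next(self._ids)}"
        self._recordings[recording_id] = Recording(recording_id, audio)
        return recording_id

    def attach_keywords(self, recording_id: str, keywords: List[str]) -> None:
        # Stand-in for the output of a speech recognition process.
        self._recordings[recording_id].keywords = list(keywords)

    def get_recording(self, recording_id: str) -> Recording:
        return self._recordings[recording_id]


class ApplicationServer:
    """Hypothetical application server that delegates speech handling."""

    def __init__(self, speech_server: SpeechServer) -> None:
        self.speech_server = speech_server
        self.comment_ids: List[str] = []

    def speech_service_for_client(self) -> SpeechServer:
        # Direct the client to the speech servers instead of handling audio.
        return self.speech_server

    def register_comment(self, recording_id: str) -> None:
        self.comment_ids.append(recording_id)

    def comments(self) -> List[Recording]:
        # Later retrieval of recordings and data derived from them.
        return [self.speech_server.get_recording(i) for i in self.comment_ids]


speech = SpeechServer()
app = ApplicationServer(speech)

# Client side: ask the application server where to send speech, then record.
target = app.speech_service_for_client()
rid = target.record(b"\x00\x01\x02")
app.register_comment(rid)

# Speech-server side: derived data becomes available some time later.
speech.attach_keywords(rid, ["great", "book"])
print(app.comments()[0].keywords)
```

In this sketch the application server stores only recording identifiers, so the audio handling and recognition logic stay with the speech servers.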