The rapid uptake of mobile telephones, together with cheaper and more widespread mobile connectivity, is driving adoption of information and communication technology in developing nations, but access to the Internet remains difficult for two major reasons. First, the Internet is mostly in English and is thus largely inaccessible to billions of people for whom English is neither a native nor a second language. Second, the Internet is accessed largely through text-based technologies (e.g., web browsing, email and text messaging) and is thus unusable by people who cannot read and write. Systems have been created to address these problems. Using the Spoken Web system, for example, individual users or organizations can create interactive voice-based applications (voice sites) with inexpensive mobile telephones or touchtone wired telephones. Other Spoken Web users can then access and modify these voice sites with their own telephones. Such systems are interactive voice response (IVR) applications; see, for example, N. Patel et al., “Experiences Designing a Voice Interface for Rural India,” IEEE SLT (Spoken Language Technologies) (2008); N. Patel et al., “A Comparative Study of Speech and Dialed Input Voice Interfaces in Rural India,” CHI Conference on Human Factors in Computing Systems (2009); S. Agarwal et al., “Content Creation and Dissemination by-and-for Users in Rural Areas,” ICTD International Conference on Information and Communication Technologies and Development (2009); and A. Kumar et al., “Voiserv: Creation and Delivery of Converged Services Through Voice for Emerging Economies,” WoWMoM'07 Proceedings of the 2007 International Symposium on a World of Wireless, Mobile and Multimedia Networks (2007) (hereinafter “Kumar”).
IVR is a technology that allows a computer to detect voice and dual-tone multi-frequency (DTMF) signaling keypad inputs (touchtone) and respond with audio. IVR systems are commonly used to automate call center responses, but there are many other applications, including telephone banking, airline reservation booking and other information services (e.g., weather). Users typically navigate by listening to the system's audio output until they are prompted for input, and then respond either with spoken commands, which are recognized by a speech recognition engine, or with touchtone keypad inputs. The system uses this input to generate further audio output. Given the low technological barrier to deploying interactive voice applications, there is considerable promise for widespread adoption, especially in low-literacy populations, provided that users can readily navigate the applications through audio.
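The prompt-and-respond loop described above can be illustrated with a minimal sketch, assuming a menu tree navigated only by DTMF keys (speech input is omitted) and text strings standing in for audio recordings. The `MenuNode` class, the `navigate` function and the sample prompts are hypothetical and are not taken from Spoken Web or any other IVR system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MenuNode:
    """One hypothetical IVR menu state: the prompt played and the keys it accepts."""
    prompt: str  # placeholder for an audio recording
    options: Dict[str, "MenuNode"] = field(default_factory=dict)


def navigate(root: MenuNode, keypresses: str) -> List[str]:
    """Replay a caller's session and return the prompts in the order played.

    At each state the prompt is played, then the pressed key selects the next
    state. An unrecognized key leaves the caller at the same menu, so its
    prompt is played again.
    """
    played = []
    node = root
    for key in keypresses:
        played.append(node.prompt)
        node = node.options.get(key, node)
    played.append(node.prompt)  # prompt of the final state reached
    return played


# Hypothetical two-option menu.
weather = MenuNode("Today's weather: sunny.")
prices = MenuNode("Today's crop prices are as follows.")
root = MenuNode(
    "Press 1 for weather. Press 2 for crop prices.",
    {"1": weather, "2": prices},
)
print(navigate(root, "1"))
```

Even this tiny tree shows why large audio collections are hard to browse this way: every additional recording must hang off some key at some depth, and the caller hears each intervening prompt in full.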
However, effective navigation through large amounts of audio content is a well-known challenge (see, for example, S. Kristoffersen et al., “‘Making Place’ to Make IT Work: Empirical Explorations of HCI for Mobile CSCW,” Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work (GROUP), pp. 276-285 (1999)). The limited input capabilities of touchtone and mobile telephones further compound this issue. Techniques such as auditory icons and earcons (see, for example, A. Sears and J. A. Jacko (eds.), Human-Computer Interaction Fundamentals (Human Factors and Ergonomics), p. 224) and skimming (see, for example, B. Arons, “SpeechSkimmer: A System for Interactively Skimming Recorded Speech,” ACM Transactions on Computer-Human Interaction, vol. 4, no. 1, pp. 3-38 (March 1997)) can improve sequential access to, and local navigation among, voice recordings. Information retrieval based on words and phrases extracted by automatic speech recognition has had some success, but some regions of the world have many languages and dialects, which leads to high recognition error rates. Moreover, users do not always know what they are searching for. Thus, the problem of selecting, filtering and structuring relevant audio content from a large database remains (see, for example, M. Yin, “Dial and See: Tackling the Voice Menu Navigation Problem with Cross-Device User Experience Integration,” Proceedings of the 18th Annual ACM Symposium on User Interface Software and Technology (Seattle, Wash., USA, Oct. 23-27, 2005)). In some cases, metadata such as the speaker's gender can be inferred from the audio content, but such metadata may be of only limited use in structuring relevant audio content. With little metadata available, user-directed browsing through this space is difficult and error-prone.
Therefore, techniques that improve user navigation through audio content on voice-based interactive systems would be desirable. The most common way to improve user navigation is to reorganize the interactive dialogue of user input and audio response. Methods exist for automatically or semi-automatically restructuring IVR menus; see, for example, U.S. Pat. No. 7,076,049 issued to Bushey et al., entitled “Method of Designing a Telecommunications Call Center Interface.”
IVR systems may have to handle large amounts of audio for several reasons. An IVR system may, for instance, access a database of audio recordings that grows over time, or it may store user recordings that other users then navigate. In the Spoken Web (see Kumar), for example, a user can press a key to speak an answer to a question; the answer is recorded, stored in an audio file and later played back to other users. Users must navigate large numbers of these recordings. IVR menus for this audio cannot be optimized ahead of time for different groups or individual users, because the set of recordings and the set of users change over time.
A social recommender is a software system that uses ratings of items by prior users to recommend items to new users. Recommendations help users find items that interest them. Items can be movies, music, books, articles, web pages, products, and so on. A rating is a user's estimate of their interest in an item on a scale, for example, from one to five. Social recommenders typically suggest items of interest (recommending) or filter out items not of interest (collaborative filtering). To do so, they analyze users' ratings of items to find patterns. A pattern could be, for example, a subset of users who tend to agree in their ratings of items (a cohort). For example, most users from New Orleans, USA, might be interested in articles about flooding, while most users from Alaska, USA, might be interested in articles about oil spills. If a user fits a pattern, then recommendations can be made (or filtering performed) based on the item ratings of the other users in that cohort. Social recommenders are common on the Web today.
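The cohort idea can be illustrated with a minimal user-based collaborative filtering sketch. The agreement measure (the inverse of one plus the mean absolute rating difference on co-rated items), the cohort cutoff `min_sim`, the recommendation `threshold`, and all function names and sample ratings are illustrative assumptions, not the method of any particular recommender.

```python
from typing import Dict, List, Optional, Tuple

# user -> {item: rating on a 1-5 scale}
Ratings = Dict[str, Dict[str, float]]


def similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Agreement on co-rated items: 1.0 if identical, smaller as ratings diverge."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    mean_diff = sum(abs(a[i] - b[i]) for i in common) / len(common)
    return 1.0 / (1.0 + mean_diff)


def cohort(ratings: Ratings, user: str, min_sim: float = 0.5) -> Dict[str, float]:
    """Other users whose agreement with `user` is at least `min_sim`."""
    sims = {other: similarity(ratings[user], r)
            for other, r in ratings.items() if other != user}
    return {other: s for other, s in sims.items() if s >= min_sim}


def predict(ratings: Ratings, user: str, item: str,
            min_sim: float = 0.5) -> Optional[float]:
    """Similarity-weighted mean of cohort ratings for `item`; None if the cohort has none."""
    pairs = [(s, ratings[other][item])
             for other, s in cohort(ratings, user, min_sim).items()
             if item in ratings[other]]
    total = sum(s for s, _ in pairs)
    if total == 0:
        return None
    return sum(s * r for s, r in pairs) / total


def recommend(ratings: Ratings, user: str, threshold: float = 3.5,
              min_sim: float = 0.5) -> List[Tuple[str, float]]:
    """Unrated items predicted at or above `threshold`; all others are filtered out."""
    unrated = {i for r in ratings.values() for i in r} - set(ratings[user])
    scored = []
    for item in sorted(unrated):
        p = predict(ratings, user, item, min_sim)
        if p is not None and p >= threshold:
            scored.append((item, p))
    return sorted(scored, key=lambda x: x[1], reverse=True)


# Invented ratings echoing the New Orleans / Alaska example.
ratings: Ratings = {
    "alice": {"flooding": 5, "oil_spill": 2, "hurricanes": 5},
    "bob": {"flooding": 5, "oil_spill": 1, "hurricanes": 4, "levees": 5},
    "carol": {"flooding": 1, "oil_spill": 5, "hurricanes": 2, "levees": 1, "pipelines": 5},
    "dave": {"flooding": 2, "oil_spill": 5, "pipelines": 4},
}
print(recommend(ratings, "alice"))
```

With these sample ratings, alice's cohort is bob alone, so the levees item is recommended to her, while pipelines, rated only by users who disagree with her, receives no prediction. Dave's cohort is carol, whose low ratings for hurricanes and levees cause both items to be filtered out for him.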
Therefore, techniques for applying social recommendation to IVR would be desirable.