The most common and expressive setting for the use of language is face-to-face conversation. It is something that almost everyone in the world has some experience with, and it requires little training. Conversation is both an individual and a social process. It is a joint action that requires common ground for the coordination of meaning and understanding.
Many ways have been developed to establish common ground in face-to-face conversation. Because the exchange is in real time, people engaged in a face-to-face conversation can show understanding with back-channel feedback, by pointing, gesturing, or gazing, and by their choice of words, timing, and turn-taking. Importantly, those engaged in a conversation can also interrupt if they wish to speak before it is their turn.
Over the last 50 to 75 years, technology has removed the need to be face-to-face to communicate in real time. As the telephone has made its way into every house, and now into nearly every pocket, we've learned to converse without co-presence. We've established techniques to continue joint actions and establish common ground without facial expression or gesture and only with language. Because the conversation still occurs in real time, we can use back-channel feedback and turn-taking metaphors to establish common ground and have successful communication.
The answering machine has added a new dimension to distance communication. Asynchronous communication moved us farther from the familiar face-to-face style, requiring new skills. With voicemail, there is no way to continually ground events over the course of the conversation; the lack of feedback interferes with the normally mutual process of grounding events. In addition to the extra burden required to keep common ground in short-term memory, one has to continually remember to check for messages, and often there is an added task of having to respond by calling each person back. While these are all clearly skills we can learn, there might be a cost in the quality or pleasure of communication.
A number of factors confound study of the use of stored voice as a communication medium. First, it spans two very different sorts of technologies: answering machines (stand-alone recording devices, found in domestic settings) and voicemail systems (accessed by telephone only and typically, though not exclusively, found in business settings). Each of these environments produces a different mix of voice message genres (e.g., chatty, information gathering, informing, decision making), though there may be some overlap; message type likely influences user interface requirements. With an answering machine, messages are typically heard and then discarded. In a voice messaging system, the messages may be annotated, forwarded, and archived.
Studies focused on expert users of voicemail have found that there are three main problems experienced when managing voicemail: scanning, information extraction and search (see “All talk and all action: strategies for managing voicemail messages,” by S. Whittaker, J. Hirschberg and C. H. Nakatani in Proceedings of Human Factors in Computing Systems (CHI), 1998, pp. 249-250). Scanning is used to prioritize messages and to locate saved messages. Information extraction is often done by taking notes about a message in order to save important information for future reference. Users also spend a large amount of time searching for archived messages and tracking the status of saved messages.
The problem of information extraction in the context of formulating a reply to a voicemail has been addressed with interfaces that allow users to take notes related to the content of the voicemail or allow them to scan a transcript of the message as they listen. See “Jotmail: a voicemail interface that enables you to see what was said,” by S. Whittaker, R. Davis, J. Hirschberg and U. Muller in Proceedings of Human Factors in Computing Systems (CHI), 2000, pp. 89-96, and “SCANMail: a voicemail interface that makes speech browsable, readable and searchable,” by S. Whittaker, J. Hirschberg, B. Amento, L. Stark, M. Bacchiani, P. Isenhour, L. Stead, G. Zamchick, and A. Rosenberg in Proceedings of Human Factors in Computing Systems (CHI), 2002, pp. 275-282.
Answering machines (or phone-accessed voicemail systems) do not have rich graphical user interfaces, and users are required to either jot down notes or keep the content of the message in memory as they attempt to respond. Voicemail has more recently become a very popular feature for mobile phones. Checking voicemail while mobile and with such a small screen makes it nearly impossible to take notes or view transcripts. As a result, more practical methods of replying to voicemail need to be explored. As is well known, memory or recall from memory deteriorates with age, making this task of extracting and remembering information difficult for the elderly. Message recipients must also juggle functionality between listening to a series of messages and then dialing phone numbers, while keeping the message in memory, to reply.
Additionally, despite the media richness of computer-mediated communication, voicemail still remains a closed, single-medium system. Although prevalent on mobile devices and in networked environments, it has rarely benefited from the devices and connectivity around it. It is accordingly desirable to utilize existing capabilities to perform functions such as accepting and delivering voice messages via the Internet, and to support sender-supplied photos and voice-annotated slide shows as messages.
Previous attempts to provide a “conversational answering machine” include the PhoneSlave, developed nearly two decades ago. See “Phone Slave: A graphical telecommunications interface,” by C. Schmandt and B. Arons in Proceedings of the Society for Information Display, 26(1), 1985, pp. 79-82. PhoneSlave used recorded speech and pause-based audio recording to gather responses to questions such as “Who's calling please?”, “What's this in reference to?”, and “At what number can you be reached?”, and later could play each of these snippets back to the PhoneSlave owner in response to voice commands. PhoneSlave used speech recognition (in lieu of today's telephone caller ID) to try to identify repeat callers, and could deliver personal messages to them when they called back, as well as indicate whether their previous message had been heard.
Part of PhoneSlave's attraction at the time was that voicemail was still new enough that callers were often not facile at leaving messages on a machine; PhoneSlave took complete messages by turning the interaction into a form-filling conversation. Most callers would likely be unwilling to participate in such a routine now, although “Whom may I say is calling?” has been used for call screening in products by Active Voice and Wildfire (available on the World Wide Web at www.activevoice.com and www.wildfire.com).
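The pause-based recording mentioned above can be pictured with a minimal sketch. This is an illustration only and is not taken from PhoneSlave; the function name `answer_end_index`, the per-frame energy representation, and both threshold values are assumptions chosen for the example. The idea is that the answer to each prompt is considered finished once the caller has been silent for a set number of consecutive frames.

```python
from typing import Sequence


def answer_end_index(frame_energies: Sequence[float],
                     silence_threshold: float = 0.1,
                     max_silent_frames: int = 3) -> int:
    """Return how many leading frames to keep as the caller's answer.

    Hypothetical helper: recording is treated as finished once
    `max_silent_frames` consecutive frames fall below `silence_threshold`
    after speech has started. The trailing silence is not kept. If no such
    pause occurs, every frame is kept.
    """
    speech_started = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: remember that the caller has begun talking.
            speech_started = True
            silent_run = 0
        elif speech_started:
            # Silence after speech: count it toward the end-of-answer pause.
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i + 1 - silent_run
    return len(frame_energies)


# Illustrative energies (assumed values): a short hesitation, then a long pause.
energies = [0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7]
kept = answer_end_index(energies)
```

In this sketch a short hesitation does not end the answer; only a pause of the full assumed length does, after which the next prompt could be played.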
A Japanese project that implemented answering machines which would mutter back-channel responses (“hai” in Japanese) to encourage callers to leave longer or more complete messages is described in “A Multi-functional Telephone with Conversational Responses and Pause Deletion Recording,” by K. Gomi, Y. Nishino, H. Matsui, and F. Nakamura, IEEE Transactions on Consumer Electronics, 1988. The “Grunt” system described by C. Schmandt in “Employing Voice Back Channels to Facilitate Audio Document Retrieval,” Proceedings of ACM Conference on Office Information Systems (COIS), 1988, pp. 213-218, presented driving directions over a telephone, pausing between each major route segment and analyzing any user response based on length and pitch contour to decide whether and when to proceed, or offer more explanation.
In the 1990s, several research systems used conversational paradigms bordering on natural language input to control live interactive systems over the phone using speech recognition. MailCall, described by M. Marx and C. Schmandt in “MailCall: Message Presentation and Navigation in a Nonvisual Environment,” Proceedings of Human Factors in Computing Systems (CHI), 1996, pp. 165-172, emphasized text message retrieval, and its successor SpeechActs, described by N. Yankelovich, G. Levow and M. Marx in “Designing SpeechActs: issues in speech user interfaces,” Proceedings of Human Factors in Computing Systems (CHI), 1995, pp. 369-376, used more conversational techniques and covered a wider range of applications. QuietCalls, described by L. Nelson, S. Bly and T. Sokoler in “Quiet Calls: Talking Silently on Mobile Phones,” Proceedings of Human Factors in Computing Systems (CHI), 2001, pp. 174-181, supported live voice interaction over telephones, with one party speaking and the other playing recorded audio snippets, driven by a conversational state model.
U.S. Pat. No. 5,880,840 issued to Lang et al. (Sony Corp.) on Mar. 30, 1999 describes a voice mail reply method for use in answering machines and office voice mail systems in which an incoming voice mail message is stored and then played back. As the voice mail message is being played back, the listener can interrupt the playback and record a response. The original voice mail message, with the responses inserted, is then returned to the originator.
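The reply-insertion idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the method of the cited patent: the function name `interleave_replies`, the tuple-based playlist format, and the clip identifiers are all hypothetical. Each reply is recorded at the moment the listener interrupts playback, and the returned message alternates stretches of the original with the inserted replies.

```python
from typing import List, Tuple


def interleave_replies(message_length: float,
                       replies: List[Tuple[float, str]]) -> List[tuple]:
    """Build a playback list mixing the original message with replies.

    `replies` holds (interrupt_time_seconds, reply_clip_id) pairs, where the
    time is the offset into the original message at which the listener
    interrupted. Both the pair format and the playlist entries are assumed
    for this sketch.
    """
    playlist: List[tuple] = []
    cursor = 0.0
    for interrupt_time, clip_id in sorted(replies, key=lambda r: r[0]):
        # Clamp the interrupt point to the part of the message not yet used.
        t = min(max(interrupt_time, cursor), message_length)
        if t > cursor:
            playlist.append(("original", cursor, t))
        playlist.append(("reply", clip_id))
        cursor = t
    if cursor < message_length:
        # Whatever remains of the original message follows the last reply.
        playlist.append(("original", cursor, message_length))
    return playlist


# Assumed example: a 30-second message interrupted at 8 s and at 20 s.
annotated = interleave_replies(30.0, [(20.0, "r2"), (8.0, "r1")])
```

Keeping the original audio as time ranges rather than copies means the originator hears each response directly after the passage that prompted it, which preserves the context that a separate return call would lose.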