Just like human personal assistants, digital assistants (also called virtual assistants) can perform requested tasks and provide requested advice, information, or services. An assistant's ability to fulfill a user's request depends on the assistant's correct comprehension of the request or instructions. Recent advances in natural language processing have enabled users to interact with digital assistants using natural language, in spoken or textual form, rather than through a conventional user interface (e.g., menus or programmed commands). Such digital assistants can interpret the user's input to infer the user's intent, translate the inferred intent into actionable tasks and parameters, execute operations or deploy services to perform the tasks, and produce outputs that are intelligible to the user. Ideally, the outputs produced by a digital assistant should fulfill the user's intent expressed during the natural language interaction between the user and the digital assistant.
Digital assistants that interact with users via speech inputs and outputs typically employ speech-to-text processing techniques to convert speech inputs to textual forms that can be further processed, and speech synthesis techniques to convert textual outputs to speech. In both cases, accurate conversion between speech and text is important to the usefulness of the digital assistant. For example, if the words in a speech input are incorrectly identified by a speech-to-text process, the digital assistant may not be able to properly infer the user's intent, or may provide incorrect or unhelpful responses. On the other hand, if the words in a speech output are incorrectly pronounced by the digital assistant, the user may have difficulty understanding the digital assistant. Moreover, incorrect pronunciations by the digital assistant make the assistant seem less polished and less capable, and may reduce users' interest and confidence in the digital assistant.
For many words, accurate recognition and synthesis are relatively easy, because their pronunciations are fairly standard, at least among people with similar accents or from similar geographical regions. However, certain words or classes of words may be subject to many different pronunciations, making accurate recognition and synthesis more difficult. For example, proper names are often pronounced differently by different people, and it is often not possible to discern the correct pronunciation from the spelling of the name alone. This ambiguity in the correct (or preferred) pronunciation of names is a possible source of recognition and synthesis errors by a digital assistant.
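The ambiguity described above can be illustrated with a minimal sketch. It assumes a hypothetical lexicon that maps each spelling to one or more candidate pronunciations. The lexicon contents, the phoneme notation, and the names `CANDIDATE_PRONUNCIATIONS`, `candidates_for`, and `is_ambiguous` are illustrative assumptions, not part of any particular digital assistant.

```python
from typing import Dict, List

# Hypothetical lexicon for illustration only. Each spelling maps to its
# candidate pronunciations, written as ARPAbet-style phoneme strings.
# The entries are assumptions, not taken from a real dictionary.
CANDIDATE_PRONUNCIATIONS: Dict[str, List[str]] = {
    "weather": ["W EH1 DH ER0"],
    "andrea": ["AE1 N D R IY0 AH0", "AA0 N D R EY1 AH0"],
}


def candidates_for(spelling: str) -> List[str]:
    """Return all candidate pronunciations known for a spelling."""
    return list(CANDIDATE_PRONUNCIATIONS.get(spelling.lower(), []))


def is_ambiguous(spelling: str) -> bool:
    """A spelling is ambiguous when it has more than one candidate."""
    return len(candidates_for(spelling)) > 1


if __name__ == "__main__":
    for word in ("weather", "Andrea"):
        print(word, candidates_for(word), is_ambiguous(word))
```

In this sketch, a lookup keyed only on spelling returns more than one candidate for the name, so neither a recognizer nor a synthesizer can select the preferred pronunciation without additional information.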
Accordingly, there is a need for systems and methods to allow users to specify pronunciations of words for recognition and synthesis by a digital assistant.