1. Technical Field
The present disclosure relates to synthesizing speech and, more specifically, to providing access to a backend speech synthesis process via an application programming interface (API).
2. Introduction
To a casual observer, any text-to-speech (TTS) system appears to be a black-box solution for creating synthetic speech from input text. In fact, TTS systems are mostly used as black-box systems today. In other words, TTS systems do not require the user or application programmer to have linguistic or phonetic skills. Internally, however, such a TTS system has multiple, clearly separated modules, each with a distinct function. These modules process expensive source speech data for a specific speaker or task using algorithms and approaches that may be closely guarded trade secrets.
Often, one party generates the source speech data by recording many hours of speech for a particular speaker in a high-quality studio environment. Another party has a set of highly tuned, effective, and proprietary TTS algorithms. In order for these two parties to collaborate with each other, each must provide the other with access to its own intellectual property, which one or both parties may oppose. Thus, the current approaches available in the art force parties that may be at arm's length either to cooperate at a much closer level than either party wants or not to cooperate at all. This friction prevents the benefits of TTS from spreading in certain circumstances.