The field of the present disclosure relates generally to speech-to-text recognition and text-to-speech generation, and more particularly, to systems, devices and methods of a framework integrating speech-to-text and text-to-speech modules to provide intelligent interactions with humans.
Some traditional systems intended to support spoken communication between humans and devices, machines, and services are limited. In some systems, the speech spoken by a human might not be accurately understood or interpreted by the machine, application, or service. In some instances, the machine, application, or service may have trouble discerning the individual words in the spoken speech. In some other instances, the spoken words themselves may be understood but the full meaning of the spoken words might not be understood by the machine, application, or service. In some instances, speech generated by a machine may fail to convey an emotion, urgency, or time-critical nature of information in a manner that allows a human to quickly, fully, and accurately comprehend the idea or warning being communicated.
It is noted that human-generated natural language speech is oftentimes understood by humans hearing such speech to include more meaning than the words alone might indicate. The additional meaning of human speech might be conveyed in the sentiments and emotions attached to the spoken words by the speaking human. Some sentiments and emotions might be conveyed in terms of the pace, rate, tone, volume, pitch, etc. of the speech, the emphasis placed on the spoken words, and other aspects. Traditionally, machines, applications, services, and other artificial systems fail to fully and accurately recognize human speech and/or produce natural-sounding speech in varying contexts and situations.
Therefore, there exists a need for methods and systems that support and facilitate speech interactions between humans and artificial systems and that efficiently and intelligently capture and produce natural language.