US 12,170,087 B1
Altering audio to improve automatic speech recognition
Gregory M. Hart, Mercer Island, WA (US); and William Spencer Worley, III, Half Moon Bay, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Oct. 28, 2022, as Appl. No. 17/976,137.
Application 17/976,137 is a continuation of application No. 16/510,060, filed on Jul. 12, 2019, granted, now 11,488,591.
Application 16/510,060 is a continuation of application No. 15/918,608, filed on Mar. 12, 2018, granted, now 10,354,649, issued on Jul. 16, 2019.
Application 15/918,608 is a continuation of application No. 14/994,926, filed on Jan. 13, 2016, granted, now 9,916,830, issued on Mar. 13, 2018.
Application 14/994,926 is a continuation of application No. 13/627,890, filed on Sep. 26, 2012, granted, now 9,251,787, issued on Feb. 2, 2016.
Int. Cl. G10L 15/22 (2006.01); G10L 15/20 (2006.01); G10L 17/00 (2013.01); G11B 27/00 (2006.01); H03G 3/32 (2006.01); H03G 5/02 (2006.01); H04R 3/12 (2006.01); G10L 15/26 (2006.01)
CPC G10L 15/22 (2013.01) [G10L 15/20 (2013.01); G10L 17/00 (2013.01); G11B 27/005 (2013.01); H03G 3/32 (2013.01); H03G 5/02 (2013.01); H04R 3/12 (2013.01); G10L 2015/223 (2013.01); G10L 15/26 (2013.01)]
21 Claims
1. A device comprising:
at least one speaker;
at least one microphone;
one or more processors; and
computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the device to perform operations, the operations comprising:
storing, at the device, first data representing one or more predefined words indicating that a user is going to provide a subsequent command to the device;
causing the at least one speaker to output first content at a first volume;
receiving, while the at least one speaker outputs the first content at the first volume, a first input audio signal generated by the at least one microphone based at least in part on first sound from the user;
performing speech processing on the first audio signal to generate second data indicating one or more spoken words of the user;
determining, using the first data and the second data, that the one or more spoken words of the user correspond to the one or more predefined words indicating that the user is going to provide a subsequent command to the device;
causing the at least one speaker to output the first content at a second volume that is less than the first volume;
receiving, while the at least one speaker outputs the first content at the second volume, a second input audio signal representing a voice command generated by the at least one microphone based at least in part on second sound from the user;
sending, based at least in part on the determining, third data based at least in part on the second input audio signal to one or more computing devices that are remote from an environment of the device;
receiving, from the one or more computing devices and based at least in part on the voice command, output audio data representing an audible response to the voice command; and
causing the at least one speaker to output the audible response represented in the output audio data.
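The sequence of operations recited in claim 1 can be illustrated with a minimal Python sketch. It is only a simulation under stated assumptions, not the patented implementation: `Speaker`, `contains_wake_words`, `run_claimed_flow`, and the `recognize` and `send_to_remote` callables are hypothetical names introduced here, speech processing and the remote computing devices are stubbed by plain functions, and playing the audible response at the lowered volume is an assumption the claim does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Speaker:
    """Hypothetical speaker stand-in that records each playback request."""
    volume: float = 0.0
    log: List[str] = field(default_factory=list)

    def play(self, content: str, volume: float) -> None:
        self.volume = volume
        self.log.append(f"{content}@{volume}")


def contains_wake_words(spoken: List[str], wake_words: List[str]) -> bool:
    """Return True if wake_words appear as a contiguous run in spoken."""
    n = len(wake_words)
    if n == 0:
        return False
    lowered = [w.lower() for w in spoken]
    target = [w.lower() for w in wake_words]
    return any(lowered[i:i + n] == target for i in range(len(lowered) - n + 1))


def run_claimed_flow(
    speaker: Speaker,
    wake_words: List[str],                    # "first data": stored predefined words
    recognize: Callable[[bytes], List[str]],  # stub for on-device speech processing
    send_to_remote: Callable[[bytes], str],   # stub for the remote computing devices
    first_audio: bytes,
    second_audio: bytes,
    content: str,
    first_volume: float,
    second_volume: float,
) -> Optional[str]:
    """Walk through the claimed operations once; return the response, if any."""
    if not second_volume < first_volume:
        raise ValueError("second volume must be less than first volume")

    # Output the first content at the first volume.
    speaker.play(content, first_volume)

    # Speech processing on the first input audio signal ("second data").
    spoken = recognize(first_audio)

    # Compare the spoken words with the stored predefined words.
    if not contains_wake_words(spoken, wake_words):
        return None

    # Keep outputting the same content, now at the lower second volume.
    speaker.play(content, second_volume)

    # Send data based on the second input audio signal to the remote side
    # and receive the audible response to the voice command.
    response = send_to_remote(second_audio)

    # Assumption: the response is played at the current (lowered) volume.
    speaker.play(response, speaker.volume)
    return response
```

For example, calling `run_claimed_flow` with `wake_words=["hey", "device"]`, a `recognize` stub that splits decoded bytes on whitespace, and a first audio payload of `b"hey device"` lowers the playback volume before the command audio is forwarded; a first payload that lacks the predefined words leaves the volume unchanged and sends nothing.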