Headsets are a popular way for a user to listen to music or audio privately, or to make a hands-free phone call, or to deliver voice commands to a voice recognition system. A wide range of headset form factors, i.e. types of headsets, are available, including earbuds. The in-ear position of an earbud when in use presents particular challenges to this form factor. The in-ear position of an earbud heavily constrains the geometry of the device and significantly limits the ability to position microphones widely apart, as is required for functions such as beam forming or sidelobe cancellation. Additionally, for wireless earbuds the small form factor places significant limitations on battery size and thus the power budget. Moreover, the anatomy of the ear canal and pinna somewhat occludes the acoustic signal path from the user's mouth to microphones of the earbud when placed within the ear canal, increasing the difficulty of the task of differentiating the user's own voice from the voices of other people nearby.
Speech capture generally refers to the situation where the headset user's voice is captured and any surrounding noise, including the voices of other people, is minimised. Common scenarios for this use case are when the user is making a voice call, or interacting with a speech recognition system. Both of these scenarios place stringent requirements on the underlying algorithms. For voice calls, telephony standards and user requirements demand that high levels of noise reduction are achieved with excellent sound quality. Similarly, speech recognition systems typically require the audio signal to have minimal modification, while removing as much noise as possible. Numerous signal processing algorithms exist in which it is important for operation of the algorithm to change, depending on whether or not the user is speaking. Voice activity detection, being the processing of an input signal to determine the presence or absence of speech in the signal, is thus an important aspect of voice capture and other such signal processing algorithms. However, even in larger headsets such as booms, pendants, and supra-aural headsets, it is very difficult to reliably ignore speech from other persons who are positioned within a beam of a beamformer of the device, with the consequence that such other persons' speech can corrupt the process of voice capture of the user only. These and other aspects of voice capture are particularly difficult to effect with earbuds, including for the reason that earbuds do not have a microphone positioned near the user's mouth and thus do not benefit from the significantly improved signal to noise ratio resulting from such microphone positioning.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.
Throughout this specification the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
In this specification, a statement that an element may be “at least one of” a list of options is to be understood that the element may be any one of the listed options, or may be any combination of two or more of the listed options.