Mobile devices occupy an increasingly prominent niche in the evolving marketplace, serving as access points at various stages of conducting a seemingly infinite number of activities. As this trend continues, mobile devices and mobile network capabilities provided thereby are leveraged in an increasing number and breadth of scenarios. Recent examples include the extension of mobile technology to provide a host of financial services such as check deposit, bill payment, account management, etc. In addition, location data gathered via mobile devices are utilized in an increasing number of applications, e.g. to provide targeted advertising, situational awareness, etc.
As the mobile development community finds new utility for devices, users are presented with more numerous, complex, and specific opportunities to provide input required by or advantageous to the underlying process the mobile device is utilized to perform. In addition, the context of the situations in which a user may interact with, or provide input to, a process continues diversifying.
This diversification naturally includes expansion into niches where the implemented technique may not be optimal, or even acceptable, from the perspective of the user. In a culture where a fraction of a second determines the difference between an acceptable and an unacceptable solution to a given challenge, developers seek every possible performance advantage to deliver superior technology.
For example, several well-known inefficiencies exist with respect to user input received via a mobile device. A first inefficiency is the small screen size typical of mobile devices, particularly mobile phones. Since the conventional “smartphone” excludes a physical keyboard and pointer device, relying instead on touchscreen technology, the amount of physical space allocated to a given key on a virtual “keyboard” displayed on the mobile device screen is much smaller than a human finger can accurately and precisely select. As a result, typographical errors are common in textual user input received via a mobile device.
In order to combat this limitation, typical mobile devices employ powerful predictive analytics and dictionaries to “learn” a given user's input behavior. Based on the predictive model developed, the mobile device is capable of predicting the user's intended input text when the user's actual input corresponds to text that does not fit within defined norms, patterns, etc. The most visible example of utilizing such a predictive analysis and dictionary is embodied in conventional “autocorrect” functionality available with most typical mobile devices.
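As a rough illustration of the dictionary-plus-learned-behavior idea described above, the following minimal sketch combines a word list with per-user usage counts. It is not the implementation used by any particular mobile platform; the class name `AutocorrectSketch`, its methods, and the similarity cutoff are assumptions introduced purely for illustration, and Python's standard `difflib` module stands in for whatever predictive model a real device employs.

```python
from difflib import get_close_matches


class AutocorrectSketch:
    """Hypothetical autocorrect: a dictionary plus learned usage counts."""

    def __init__(self, dictionary):
        # Every dictionary word starts with a usage count of zero.
        self.counts = {word: 0 for word in dictionary}

    def learn(self, word):
        # Record that the user deliberately entered this word.
        self.counts[word] = self.counts.get(word, 0) + 1

    def predict(self, typed):
        # Input that already fits the known vocabulary is left untouched.
        if typed in self.counts:
            return typed
        # Otherwise collect similar known words (cutoff is an assumed value).
        candidates = get_close_matches(typed, list(self.counts), n=3, cutoff=0.6)
        if not candidates:
            return typed
        # Prefer the candidate this particular user has entered most often.
        return max(candidates, key=lambda w: self.counts[w])


model = AutocorrectSketch(["deposit", "despot", "check"])
model.learn("deposit")
model.learn("deposit")
corrected = model.predict("depsit")
```

Because the prediction depends only on string similarity and past frequency, a sketch of this kind will readily replace an unusual but intended word with a more common one, which is the failure mode discussed in the next paragraph.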
However, these “autocorrect” approaches are notorious in the mobile community for producing incorrect, or even inappropriate, predictions. While in some contexts these inaccuracies are humorous, the prevalence of erroneous predictions results in miscommunication and errors that frustrate both the underlying process and the user, and ultimately hinder the adoption and utility of mobile devices in a wide variety of contexts in which a mobile device could be leveraged for great benefit.
As a result, some developers have turned to alternative sources of input, and techniques for gathering input via a mobile device. For example, most solutions have focused on utilizing audio input as an alternative or supplement to textual input (i.e. tactile input received via a virtual keyboard shown on the mobile device display). In practice, this technique has conventionally been embodied as an integration with the speech recognition functionality of the mobile device (e.g. as conferred via a “virtual assistant” such as “Siri” on an APPLE mobile device running iOS 5.0 or higher).
The illustrative embodiment of this audio input extension being added to a mobile keyboard is demonstrated in the figure depicted below. While this figure displays an interface generated using APPLE's iOS mobile operating system, similar functionality may be found on other platforms such as ANDROID, MICROSOFT SURFACE RT, etc. as well.
Audio input may be received by integrating an extension into the mobile virtual keyboard that allows the user to provide input other than the typical tactile input received via the mobile device display. In one approach, the audio extension appears as a button depicting a microphone icon or symbol, immediately adjacent to the space bar (at left). A user may interact with a field configured to accept textual input, e.g. a field on an online form, PDF, etc. The mobile device leverages the operating system to invoke the mobile virtual keyboard user interface in response to detecting the user's interaction with a field. The user then either provides tactile input to enter the desired text, or interacts with the audio extension to invoke an audio input interface. In the art, this technique is commonly known as “speech-to-text” functionality, which accepts audio input and converts the received audio input into textual information.
Upon invoking the audio input interface, and optionally in response to receiving additional input from the user via the mobile device display (e.g. tapping the audio extension a second time to indicate initiation of audio input), the user provides audio input, which is analyzed by the mobile device speech recognition component, converted into text using a speech-to-text engine, and input into the field with which the user interacted to invoke the mobile virtual keyboard.
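The sequence just described (field interaction, keyboard invocation, microphone tap, speech-to-text conversion, insertion into the field) can be sketched as follows. This is a simplified model under stated assumptions, not any platform's actual keyboard API: `TextField`, `VirtualKeyboard`, and `fake_engine` are hypothetical names, and the stand-in engine merely decodes bytes rather than performing real speech recognition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TextField:
    """Hypothetical form field that accepts textual input."""
    name: str
    value: str = ""


class VirtualKeyboard:
    """Hypothetical virtual keyboard with an audio input extension."""

    def __init__(self, speech_to_text: Callable[[bytes], str]):
        self._speech_to_text = speech_to_text
        self.active_field: Optional[TextField] = None
        self.listening = False

    def on_field_focus(self, target: TextField) -> None:
        # Interacting with a field invokes the keyboard for that field.
        self.active_field = target

    def type_text(self, text: str) -> None:
        # Conventional tactile input path.
        if self.active_field is None:
            raise RuntimeError("no field is active")
        self.active_field.value += text

    def tap_microphone(self) -> None:
        # Tapping the audio extension opens the audio input interface.
        self.listening = True

    def receive_audio(self, audio: bytes) -> str:
        # Convert audio to text and place it in the originating field.
        if not self.listening or self.active_field is None:
            raise RuntimeError("audio input interface has not been invoked")
        text = self._speech_to_text(audio)
        self.active_field.value += text
        self.listening = False
        return text


def fake_engine(audio: bytes) -> str:
    # Stand-in for a real speech-to-text engine (assumption for this sketch).
    return audio.decode("utf-8")


keyboard = VirtualKeyboard(speech_to_text=fake_engine)
payee = TextField("payee")
keyboard.on_field_focus(payee)
keyboard.tap_microphone()
keyboard.receive_audio(b"Jane Doe")
```

Note that the converted text always lands in the field that invoked the keyboard; nothing in this flow lets the spoken input select a different field or trigger application behavior.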
By integrating audio input with the textual input/output capabilities of a mobile device, a user is enabled to input textual information in a hands-free manner, which broadens the utility of the device to a host of contexts in which its use would otherwise not be possible. For example, a user may generate a text message exclusively using audio input, according to these approaches.
However, speech recognition and audio input capabilities of conventional mobile devices are extremely limited. For instance, as noted above, mobile operating systems may conventionally include a “virtual assistant” or analogous function capable of receiving audio input, processing the audio input, and performing a predetermined set of “basic” tasks. Such basic tasks include invoking core OS functions, applications, etc. (e.g. launching a browser application included with the OS, performing an internet search using the browser, querying mobile device hardware for relevant data such as GPS location, device orientation, time, etc.).
These virtual assistant and analogous conventional functions are not capable of processing audio input in a specialized context beyond the general, basic functionalities included with the OS. For example, a virtual assistant is adept at performing internet searches, querying mobile device components, and providing predetermined responses to predefined queries, but is not generally capable of performing functions of a third-party application installed on the mobile device. Typically, this limitation arises because the virtual assistant is not configured to integrate with (or is not even aware of) the third-party application's functionality and mechanisms for accomplishing that functionality.
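The limitation can be pictured as a fixed table of predefined queries mapped to native functions. The sketch below is purely illustrative and does not reflect how any specific virtual assistant is built; `NATIVE_HANDLERS`, `handle_utterance`, and the example phrases are all hypothetical.

```python
# Hypothetical table of predefined queries mapped to native OS functions.
NATIVE_HANDLERS = {
    "open browser": lambda: "launching browser",
    "what time is it": lambda: "reporting device time",
    "where am i": lambda: "querying GPS location",
}


def handle_utterance(utterance: str) -> str:
    """Dispatch recognized speech to a native function, if one is defined."""
    key = utterance.strip().lower().rstrip("?.!")
    handler = NATIVE_HANDLERS.get(key)
    if handler is None:
        # Nothing here knows about third-party applications, so the only
        # remaining option is a generic fallback such as a web search.
        return "web search: " + utterance
    return handler()


native_result = handle_utterance("Open browser")
third_party_result = handle_utterance("Deposit this check in my banking app")
```

In this model, the request aimed at a third-party application never reaches that application; it is simply treated as a search query.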
Other conventional techniques exist for facilitating a machine's comprehension of human language. In particular, natural language processing (NLP) techniques exist which enable much broader machine intelligence with respect to linguistic audio input. NLP techniques are vastly superior to conventional mobile technology (e.g. virtual assistants as described above) in terms of being capable of a broad comprehension of linguistic audio input even absent prior instruction, training, etc. Indeed, some virtual assistants employ NLP techniques to improve the mobile OS audio processing capabilities.
However, the application or addition of NLP techniques to existing audio input processing and responsive functionality remains limited in scope and application to generic, common exchanges of information and “native” operating system functionality. A major source of complication and difficulty in enabling more specific and situationally appropriate functions using audio arises from the difficulty of determining context. The same word may have a different meaning depending entirely upon the circumstances in which it is employed, and this meaning may not be discernible from the content of the statement alone. Accordingly, it is difficult or impossible to develop accurate, reliable audio processing capabilities without the capability to glean appropriate contextual information in addition to the content of the audio.
While certain virtual assistants such as SIRI®, CORTANA®, etc. generally provide speech recognition functionality, and may be used to invoke and/or interface with native functions of a host device and/or operating system such as querying search engines, location information, etc., to date no known generic (i.e. domain-independent) NLP or mobile audio processing techniques specifically address the problem of facilitating a user navigating a mobile application and extracting information, much less a mobile application configured to perform data capture, processing, extraction, and subsequent business workflow integration functions.
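A toy example makes the dependence on context concrete. In the sketch below, the word “check” resolves to different meanings depending on a context label, and cannot be resolved at all when no context is supplied. The `SENSES` table, the context labels, and the `interpret` function are assumptions invented for this illustration and are not drawn from any existing system.

```python
from typing import Optional

# Hypothetical mapping of (word, context) pairs to intended meanings.
SENSES = {
    ("check", "deposit_screen"): "financial instrument to be captured",
    ("check", "review_screen"): "verify the extracted data",
}


def interpret(word: str, context: Optional[str] = None) -> Optional[str]:
    """Return the intended meaning of a word, or None if it is ambiguous."""
    return SENSES.get((word.lower(), context))


on_deposit = interpret("check", "deposit_screen")
on_review = interpret("check", "review_screen")
without_context = interpret("check")
```

The content of the utterance is identical in all three calls; only the contextual label differs, and without it the lookup yields no usable interpretation.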
Such applications would advantageously avoid the common problems associated with tactile input via mobile devices, as well as improve the overall user experience by reducing workload and frustration for the user. The resulting business advantages to customer retention and engagement make audio input an attractive but challenging approach to addressing shortfalls of conventional tactile input techniques and technologies. However, due at least in part to the challenges mentioned above, no such solution is presently available.
Therefore, it would be highly beneficial to provide new methods, systems and/or computer program product technologies configured to supplement and/or replace tactile input as a mechanism for receiving user input and navigating a mobile application, especially a mobile application configured to perform data capture, processing, extraction, and subsequent business workflow integration functions.