In recent years, the capabilities of mobile devices and the data networks supporting them have advanced to the point where it is possible to offer multimodal search capabilities to mobile consumers. For example, applications such as Speak4it® (Speak4it is a registered trademark of AT&T Intellectual Property, Inc., of Reno, Nev.) allow people to find businesses by using spoken queries, and then browse the results on a graphical user interface (GUI).
An important feature of the Speak4it application, and applications with similar local search functionality, such as Google® Mobile (Google is a registered trademark of Google Inc., of Mountain View, Calif.) and Vlingo® (Vlingo is a registered trademark of Vlingo Corporation, of Cambridge, Mass.), is the ability to use global positioning system (GPS), cell tower triangulation, Wi-Fi™ triangulation, and/or other location determining techniques to ascertain the approximate location of a device in order to constrain the results returned so they are relevant to the user's presumed local context. When a user says “gas stations,” the system will return a map showing gas stations in the immediate vicinity of the location of the device. This strategy allows users to conduct searches even when they do not know the name or pronunciation of the town they are in and, like other kinds of multimodal input, is likely to reduce the complexity of their queries, thereby simplifying recognition processes and user understanding.
However, as interactive multimodal dialog capabilities are added to search applications such as Speak4it and a broader set of use cases is considered, the “brute force” approach of assuming that the most salient (i.e., most relevant, important, or significant) location for the user is always the current physical location of the device may not be sufficient. If the device has a touch screen and the search application provides the user with a map, the salient location may be a location the user has explicitly touched. If the map is pannable, through touch (e.g., point, multi-point, and/or gesture), on-screen software controls (e.g., directional pad, joystick, soft buttons, etc.), or hardware interface components (e.g., keypads, scroll wheels, scroll balls, dedicated hardware buttons, etc.), the most salient location may be the last location to which the user panned. Alternatively, if the user is able to refer to locations by voice, for example, “Show the Empire State Building” or “Chinese restaurants on the Upper West Side,” then the relevant location referent may have been introduced as part of that spoken dialog. That is, by interacting with the system the user may have performed a series of actions aimed at grounding some location. Thus, the user would likely consider that grounded location as being most salient and as the location reference to be inferred when the location in which a search is to be conducted is otherwise left ambiguous or unspecified.
As an example of this grounding problem, suppose a user is interacting with a GPS-enabled mobile device in Manhattan and is currently located in the Lower East Side, but is browsing a search application to find a Thai restaurant near Central Park. The user says, “Show Central Park,” and then scrolls and zooms in on the map to view a four-block square area on the Upper West Side next to Central Park. If the user then says, “Thai restaurants,” most people would understand that this user seeks information about Thai restaurants in the four-block zone of the Upper West Side now displayed on the device, because the user's speech and actions have laid down a trail of contextual traces that lead to the Upper West Side as the grounded location, for at least the duration of that particular interaction. However, a system that solely uses GPS to establish the location of the device for a query would fail that simple test of human understanding, and would instead display restaurants in the user's immediate vicinity of the Lower East Side, probably undoing the user's map navigation actions in the process and losing the established context of interaction.
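For illustration only, the Manhattan scenario above can be expressed as a minimal sketch in which the most recent contextual trace (spoken reference, pan, or touch) is preferred over the GPS position of the device. This is one simple heuristic assumed for the purpose of the example, not the method of the embodiments; the names ContextEvent and grounded_location, the source labels, and the timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextEvent:
    """One contextual trace left by the user (hypothetical structure)."""
    source: str       # assumed labels: "speech", "pan", or "touch"
    location: str     # human-readable place label, for illustration only
    timestamp: float  # seconds since the interaction started


def grounded_location(events: List[ContextEvent],
                      gps_location: str,
                      explicit: Optional[str] = None) -> str:
    """Pick the location to search in, under a most-recent-trace-wins assumption."""
    # A location spoken in the current query is taken as salient for that query.
    if explicit is not None:
        return explicit
    # Otherwise prefer the latest location the user grounded through interaction.
    if events:
        return max(events, key=lambda e: e.timestamp).location
    # With no interaction context, fall back to the device's physical position.
    return gps_location


# The scenario from the text: the device is in the Lower East Side, the user
# says "Show Central Park", then pans and zooms to the Upper West Side.
events = [
    ContextEvent("speech", "Central Park", 1.0),
    ContextEvent("pan", "Upper West Side", 5.0),
]
context_aware = grounded_location(events, gps_location="Lower East Side")
gps_only = grounded_location([], gps_location="Lower East Side")
```

Under these assumptions, the context-aware call resolves the query “Thai restaurants” to the Upper West Side, whereas the GPS-only call resolves it to the Lower East Side, which is the failure described above.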
Queries handled by Speak4it and other similar applications typically cover descriptions of categories or names of businesses and, as a result, the queries these applications receive tend to be short and not grammatically complex. Thus, when a user makes an effort to speak the name of a location in a query, it is safe to assume that the uttered location is salient to that person for that query. In the majority of cases, however, people do not explicitly state a location, revealing a need for some mechanism to determine the intended location.
The embodiments presented herein address the aforementioned deficiencies in establishing the grounded location—that is, the location a person believes has been established as mutually salient with a search system when issuing a search request—from the many possible locations that could also be relevant.