This invention relates to automatic spoken dialogue script discovery, and more particularly to automatic configuration of a spoken dialog system for script-based access to applications.
The World-Wide Web (the “web”) includes various types of resources intended to be accessed by human users, including documents that directly incorporate content, as well as interfaces to computer-implemented systems that provide content in response to information provided through the interfaces.
Computer-oriented interfaces to computer-implemented systems are also accessible over the Web, for example, using “Web Services” interfaces, which may provide a way to exchange data using formats such as XML and JSON. To use a web service, an author of a “client” system generally uses documentation for the web service to write a program that accesses information via the web service. The structure of requests to and responses from a web service may be specified (e.g., as an XML schema), permitting some automation of the authoring of clients.
In practice, most computer-implemented systems that are accessible over the Web provide human-oriented interfaces and very few provide computer-oriented (e.g., web service) interfaces. Therefore, there is a need to be able to automatically use human-oriented interfaces without requiring extensive programming for each interface.
Increasingly, users desire to access computer-implemented systems without using conventional GUI-based interfaces. For example, today's voice-based personal assistants (e.g., Apple's Siri) attempt to provide information using a voice-based dialogue rather than using a GUI.
The content of the Web has been automatically indexed since the early 1990s using automated “web robots” that “crawl” accessible content. Generally, such web crawlers start with a web site, then expand their indexing search by following the hyperlinks on each site to other sites, continuing the search in sensible ways. A substantial improvement over these early web search tools was introduced by Google, which ranked each page by a function of the number of other pages that pointed to it.
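By way of illustration only, ranking pages by the number of other pages that point to them can be sketched as follows. This is a deliberately simplified example and is not asserted to be the ranking function used by any actual search engine; the function name `rank_by_inbound_links` and the page names in `toy_web` are hypothetical.

```python
from collections import Counter


def rank_by_inbound_links(link_graph):
    """Rank pages by how many distinct other pages link to them.

    link_graph maps each page name to the list of pages it links to.
    Self-links and repeated links from the same page are counted once or not at all.
    """
    inbound = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):      # ignore duplicate links on one page
            if target != source:         # ignore self-links
                inbound[target] += 1
    # Highest inbound count first; ties broken alphabetically for a stable order.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical four-page web used only for illustration.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "c.example"],
}
ranking = rank_by_inbound_links(toy_web)
```

In this toy graph, `c.example` is linked from three distinct pages and therefore ranks first, while `d.example`, which no page links to, ranks last.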
“Crawling” of the web by search engines provides a way of automatically accessing content incorporated on web pages, for example, in response to keyword-based queries. However, such approaches are generally focused on the explicit content on web pages, and not on the content accessible via interfaces presented on the web.
Later efforts attempted to extract information from web pages by parsing the HTML or the DOM (Document Object Model) information. Since the DOM information is relatively static, these techniques allowed re-sampling of news, weather, and other information pages. However, even these advanced techniques do not allow the user to take advantage of web sites where information is supplied to the site, and data, maps, pictures, or audio are returned. Early attempts at creating a “semantic web,” where the restrictions which allow the automatic use of a web site are annotated and cataloged, have mostly failed. For instance, the W3C refers to “Semantic Web” as a vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. However, this technology has not been widely adopted.
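As a non-limiting sketch of the kind of HTML parsing described above, the following example extracts the text of one element from a page whose structure is assumed to be stable. The class name `TextByIdExtractor`, the element id `forecast`, and the sample markup are hypothetical and are introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that have no closing tag and so must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByIdExtractor(HTMLParser):
    """Collect the text inside the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than zero while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.chunks and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical weather page with a fixed layout.
page = (
    "<html><body><h1>Weather</h1>"
    '<div id="forecast">Sunny, <b>72 F</b></div>'
    "<p>Other content</p></body></html>"
)
extractor = TextByIdExtractor("forecast")
extractor.feed(page)
forecast_text = "".join(extractor.chunks)
```

Such an extractor works only while the page keeps the same element id and layout, and it retrieves content already present on the page; it does not supply information to the site.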
Robotic interaction with web-based interfaces (whether human-oriented or web-service based) can be scripted based on human programming (sometimes referred to as “screen scraping”). For example, a programmer writes a script that mimics the actions of a user to retrieve information from a computer-implemented system.
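A hand-written script of this kind might, for example, reproduce the form submission that a user's browser would make. The following is a minimal sketch under stated assumptions: the URL `https://airline.example/search` and the field names `origin` and `destination` are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields):
    """Build the POST request a browser would send when a user submits a form."""
    body = urlencode(fields).encode("ascii")
    return Request(
        action_url,
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical flight-search form; the programmer must discover the field
# names by inspecting the page, which is the per-site effort described above.
request = build_form_request(
    "https://airline.example/search",
    {"origin": "BOS", "destination": "SFO"},
)
```

The script would then pass the request to `urllib.request.urlopen` and parse the returned page. Each new site requires its own field names and parsing logic, which is why this approach depends on programming effort for every interface.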
One of the challenges in the construction of a general purpose dialogue system is adding additional functionality covering new services or new interactions. For example, one might want to enable users to book tickets on a new airline, order food from a local restaurant, buy movie tickets, or use a new social networking service. Traditionally, programmers would create or use the APIs necessary for interacting with each additional service, costing many man-hours. Moreover, if one wants to integrate these new components into the rest of a dialogue system (by, for example, using the same representation for contact information or flight itineraries) even more care must be taken to ensure all of the components fit together.
In many ways, however, this is duplicated effort: the HTML-powered display Internet as we know it contains most of the components needed for interacting with a broad array of online services and information. Indeed, many if not most online services are built expressly with the display Internet in mind. However, the focus on display-first services means that much of the information available on the Internet is less accessible or even inaccessible to machines or to audio-only interfaces. This is not to say that efforts have not been made to make the web more accessible (to both people and machines). Standards like ARIA enable users with disabilities, including visual impairment, to navigate websites more easily, for instance by using screen readers. Semantic Web standards likewise are an attempt at making machine interpretation of websites easier.
These standards are not uniformly or (especially in the case of Semantic Web) widely employed. Even when accessibility standards are employed, they do not make for an experience that is as easy to use as the default visual display interface. In other words, screen readers are just as their name implies: they read the screen, leaving information integration to the user. The result is not the coherent interactive experience that a person might have if they were to interact with another person who for whatever reason cannot see the website in question (for example, because they are visually impaired, driving, or simply not at a computer).
Voice-based interfaces to computer-implemented systems generally require programming of a “connector” between a human-computer dialogue component and the interface to the computer-implemented system. For example, experimental travel reservation systems have implemented voice-based dialogs and programmed interactions with travel reservation systems (e.g., Sabre).