Many natural language processing algorithms require substantial amounts of domain-relevant data to produce accurate results. Obtaining sufficient data economically and quickly remains an unsolved problem.
Information about a domain is especially needed for the development of natural language understanding applications, including but not limited to dialog interface systems. Such systems may also be multi-modal. An important source of initial data could be a human expert; however, that data must be processed and integrated, usually manually, into various modules of the system, including but not limited to language models, semantic or conceptual analysis, and dialog management. In addition to the problem of processing and integrating the expert information, the amount of data that can be obtained from a domain expert is insufficient on its own to produce a functioning system.
Because the availability of domain experts is limited, another approach is needed to expand the data obtained from domain experts into a data set sufficiently large to build a natural language and/or speech technology system, such as a spoken dialog system, for a domain. Still further, there is a need for integrating large domain-specific data sets into a build environment for a target application, interface, device, or other technology. For example, there is a need to allow engineers or developers to integrate domain-specific data sets with applications through the use of an Integrated Development Environment (IDE). An ideal solution would enable developers to automatically collect the domain-specific data sets using an IDE that can directly engage an MTurk infrastructure and obtain the data sets through crowdsourcing. This solution would further automatically select the useful and valuable data obtained from the MTurk infrastructure, compile that data into a database, and integrate the database into the application.
Crowdsourcing utilizes the labor, services, ideas, opinions or knowledge of a large group of individuals in the general public usually by means of the Internet and for little or no compensation. It can be a fast and effective way to gather a significant sampling of information.
Crowdsourcing has been an emerging technology since the Internet became an effective communications tool. Crowdsourced projects have taken many forms. Crowdsourcing was used to help build and manage Linux, a major open source computer operating system, and was instrumental in the creation of GNU software by the Free Software Foundation.
Crowdsourcing of a different type was used by SETI, the Search for Extraterrestrial Intelligence, where users volunteered their computing resources for a large distributed signal processing attack on radio telescope data. This project, launched in May 1999, became the largest distributed computing system in the world, and earned a place in the Guinness Book of Records.
Among the current crowdsourcing websites are InnoCentive for scientific research and iStockphoto for sharing pictures, as well as YouTube for video sharing. Wikipedia, an extension of Nupedia founded in 2001, likewise became a crowdsourced encyclopedia, created and continually updated through public contributions and editorial work. Most importantly, in 2005, Amazon launched its “crowdsourcing Internet marketplace”, the Mechanical Turk (MTurk) website. The site permits “Requesters” to submit Human Intelligence Tasks (HITs) to “Providers” over the Internet. Using the MTurk process, Requesters can obtain the information, service, etc. that they require quickly and efficiently from many Providers.
A review of crowdsourcing as of 2006 may be found in an article from Wired magazine called “The Rise of Crowdsourcing” (Issue 14.06, June 2006; see URL www.wired.com/wired/archive/14.06/crowds.html?pg=4&topic=crowds&topic_set=). The article reviews how people are using other people's spare cycles to create content, solve problems, and contribute to research and development for enterprises.
As noted in the book “Crowdsourcing: Why the Power of the Crowd is Driving the Future of Business” (Jeff Howe, Random House Digital, Aug. 18, 2008), there are three primary forms of crowdsourcing in common use today. These are:
    a. Information for the prediction market, or information market, in which the crowd purchases “futures” on some outcome;
    b. Problem solving or crowdcasting, in which a potential employer makes a public posting to a large community of undifferentiated workers; and
    c. “Idea jam”, which is a “large, online brainstorming session” (Howe, 2008).
Additional articles directed to the more generic use of crowdsourcing include the following:
    a. Kaisser, M. and Lowe, J., “Creating a research collection of question answer sentence pairs with Amazon's Mechanical Turk”, in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008;
    b. Parent, G. and Eskenazi, M., “Clustering dictionary definitions using Amazon's Mechanical Turk”, in NAACL Workshop on Creating Speech and Language Data With Amazon's Mechanical Turk, 2010;
    c. Franklin, M., Kossmann, D., Kraska, T., Ramesh, S., and Xin, R., “CrowdDB: Answering Queries with Crowdsourcing”, SIGMOD '11, Jun. 12-16, 2011, Athens, Greece, ACM 978-1-4503-0661-4/11/06.
The first two articles (a and b) merely provide examples of the use of MTurk crowdsourcing either for research or as a resource to be used in natural language processing. In the third article (c), the authors review using crowdsourcing to augment database lookup and search tasks. The article does describe how to formulate micro-tasks and how to manage the user interface so that workers complete them, but neither the article nor the methods it describes help build a better or improved system.
Prior research work exists on using crowdsourcing for other tasks, such as transcription (Lee, C. and Glass, J., “A Transcription Task for Crowdsourcing with Automatic Quality Control”, Proceedings of Interspeech 2011, Aug. 28-31, pages 3041-3044), acoustic modeling for automatic speech recognition (Audhkhasi, K. et al., “Reliability-Weighted Acoustic Model Adaptation Using Crowd Sourced Transcriptions”, Proceedings of Interspeech 2011, Aug. 28-31, pages 3045-3048), and building human-in-the-loop applications (McGraw, I. et al., “Growing a Spoken Language Interface on Amazon Mechanical Turk”, Proceedings of Interspeech 2011, Aug. 28-31, pages 3057-3060). The prior work fails to provide an automated method for expanding small amounts of in-domain expert data into data sets large enough to support development of a speech and natural language system for the domain. More importantly, the prior work fails to describe any innovation in the development of an IDE or software application development tool that directly engages the MTurk infrastructure.
Several patents directed to the use of crowdsourcing to harness the collective intelligence of crowds have been applied for or granted:
    a. U.S. patent application 2010/0332281 A1 to Horvitz and Shahaf titled “Task Allocation Mechanisms and Markets for Acquiring and Harnessing Sets of Human and Computational Resources for Sensing, Effecting, and Problem Solving”, filed Jun. 26, 2009;
    b. Application 2012/0293016 to Milan Vojnovic and Dominic Daniel DiPalantino titled “Crowdsourcing”, which describes using contests for optimizing results in crowdsourcing for logo design, code writing, or question answering;
    c. U.S. Pat. No. 8,099,311 to Gioacchino La Vecchia, Alberto Colombo, and Massimo Piccioni titled “System and method for routing tasks to a user in a workforce”, which describes using previously collected skills profiles to accomplish enterprise tasks.
Finally, the USPTO itself is using crowdsourcing to assist in analyzing the suitability of patents for issuance. It has asked for assistance through the web site patents.stackexchange.com/, which was launched in 2012.
These and all other extrinsic materials discussed herein are incorporated by reference in their entirety. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
In discussions of systems and methods for searching databases using natural language sound data, crowdsourcing has been cited as a useful tool. For example, U.S. patent application publication 2012/0233207 to Mohajer titled “Systems and Methods for Enabling Natural Language Processing”, filed May 24, 2012, describes the need to use natural language libraries when determining the meaning of the sound data. Methods to generate these natural language libraries are described. The work additionally indicates that crowdsourcing can be used to generate an aggregated natural language library from content generated by many separate service providers or developers. The work merely describes the possible use of crowdsourcing as a means to supplement or improve natural language libraries for use in database searches using speech recognition input. The work fails to address the need for an IDE or software application development tool that gathers crowdsourced information by directly engaging the MTurk infrastructure.
Other work describes a job distribution platform for evaluating and training a worker who performs crowdsourced tasks online. U.S. patent application publication 2013/00061717 to Olsen et al. titled “Evaluating a Worker in Performing Crowd Sourced Tasks and Providing In-Task Training Through Programmatically Generated Test Tasks”, filed Oct. 17, 2011, describes a platform through which crowdsourced tasks are presented to workers. One method involves generating test tasks with known correct or incorrect results and presenting those test tasks to a worker to evaluate the quality of that worker's output. The work fails to describe or envision the value of an IDE or software application development tool that gathers crowdsourced information by directly engaging the MTurk infrastructure.
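As an illustration only, the general idea of scoring a worker against test tasks with known correct results can be sketched as follows. This is a minimal sketch and not the method of the cited publication: the names `GoldTask`, `worker_accuracy`, and `is_trusted`, the case-insensitive comparison, and the 0.8 threshold are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldTask:
    """A test task whose correct answer is known in advance (hypothetical type)."""
    task_id: str
    expected: str


def worker_accuracy(gold_tasks, answers):
    """Return the fraction of gold tasks the worker answered correctly.

    `answers` maps task_id -> the worker's answer. Missing answers count
    as incorrect. Comparison ignores case and surrounding whitespace.
    """
    if not gold_tasks:
        raise ValueError("at least one gold task is required")
    correct = sum(
        1
        for task in gold_tasks
        if answers.get(task.task_id, "").strip().lower()
        == task.expected.strip().lower()
    )
    return correct / len(gold_tasks)


def is_trusted(gold_tasks, answers, threshold=0.8):
    """Accept the worker only if accuracy meets the (assumed) threshold."""
    return worker_accuracy(gold_tasks, answers) >= threshold


if __name__ == "__main__":
    gold = [GoldTask("t1", "yes"), GoldTask("t2", "no")]
    submitted = {"t1": "Yes", "t2": "yes"}
    print(worker_accuracy(gold, submitted), is_trusted(gold, submitted))
```

Counting unanswered test tasks as incorrect is one possible design choice; a platform could equally exclude them from the denominator.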
In U.S. patent application publication 2009/0240652 to Su et al. titled “Automated Collection of Human-Reviewed Data”, filed Mar. 19, 2008, the applicants describe an automated system that collects, analyzes, and seeks to improve human-reviewed data. The work specifically relates to the MTurk infrastructure and the automated collection of human-reviewed data (HRD). It describes a data processing system that is in communication with one or more systems for collecting human-reviewed data. Wrappers that store parameters specific to the data requests, and libraries for transforming the data requests into human intelligence tasks (HITs), are described. The data processing system is claimed to possess a number of components that facilitate transforming and sending HITs to the HRD collection systems and receiving and analyzing HRD from the HRD collection systems for purposes of improving the quality of the collected HRD. The data processing system described, however, fails to possess the abilities of an IDE or software application development tool that gathers crowdsourced information by directly engaging the MTurk infrastructure.
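For illustration, the notion of a wrapper that stores request-specific parameters and is used to transform data requests into task descriptions can be sketched as below. This is a hedged sketch under stated assumptions, not the system of the cited publication and not the MTurk API: `RequestWrapper`, `to_hits`, and every field name are hypothetical, and nothing is submitted to any service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWrapper:
    """Parameters shared by every task built from one data request (hypothetical)."""
    title: str
    instructions: str
    reward_usd: str            # kept as a string to avoid float rounding
    assignments_per_item: int  # how many workers should see each item


def to_hits(wrapper, items):
    """Turn each item of a data request into one task description (a plain dict)."""
    if wrapper.assignments_per_item < 1:
        raise ValueError("assignments_per_item must be at least 1")
    hits = []
    for index, item in enumerate(items):
        hits.append({
            "hit_index": index,
            "title": wrapper.title,
            "question": f"{wrapper.instructions}\n\n{item}",
            "reward_usd": wrapper.reward_usd,
            "max_assignments": wrapper.assignments_per_item,
        })
    return hits


if __name__ == "__main__":
    wrapper = RequestWrapper(
        title="Paraphrase a travel request",
        instructions="Rewrite the sentence below in your own words.",
        reward_usd="0.05",
        assignments_per_item=3,
    )
    for hit in to_hits(wrapper, ["book a flight", "cancel my hotel"]):
        print(hit["hit_index"], hit["question"].splitlines()[-1])
```

Keeping the shared parameters in one immutable wrapper means a change to the reward or instructions applies uniformly to every task generated from the same request.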
Another work specifically relates to the MTurk infrastructure. U.S. patent application publication 2012/0158732 to Mital et al. titled “Business Application Publication”, filed Dec. 17, 2010, discusses a crowdsourcing infrastructure allowing clients to customize decision applications. The work describes the use of the MTurk infrastructure in developing or distributing applications. Applications are electronically submitted to a data warehouse via a data feed. After users select and download submitted applications, the applications can be evaluated and customized by the user. Subsequently, customized applications can be submitted to the data warehouse for publication with other applications. Other applications may be created through an automated generation process. In some automated generation implementations, an autogenerator engine searches a library of existing applications for visualization and business logic expressions that may be applicable to the data of the data feed. The development process described, however, fails to possess the abilities of an IDE or software application development tool that gathers crowdsourced information by directly engaging the MTurk infrastructure. Furthermore, the work does not disclose construction of a library based on MTurk work submitted from a development tool.
Thus, there remains a need for an efficient automated method for expanding expert data into a larger domain-specific data set using an IDE or software application development tool that gathers crowdsourced information by directly engaging the MTurk infrastructure.
Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints, and open-ended ranges should be interpreted to include commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.