The present application relates to electronic speech recognition and transcription; and, more particularly, to processes and systems for facilitating “free form” dictation, including directed dictation, constrained recognition and/or structured transcription among users having heterogeneous system protocols. The grandparent application Ser. No. 09/996,849, which is herein incorporated by reference, presents a system and processes for facilitating electronic speech recognition and/or transcription among users having heterogeneous system protocols.
As set forth in the parent application, networked application service providers (ASPs) are the most efficient way to make sophisticated speech recognition and/or transcription engines, having robust dictionaries and vocabularies, available to large-scale users, especially in the professions. The networked application service provider (also known as “on-demand” software or “software as a service”) interconnects application software to high-accuracy speech recognition and/or transcription engines, which may reside on a centralized server application; on one of the facilities in a peer-to-peer network (a peer node); on a distributed application architecture that partitions tasks and/or workloads among peers to form a peer-to-peer network; or on a “cloud” computing network configuration.
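The deployment options above can be pictured as a simple dispatch problem: a request for recognition service is routed to whichever engine deployment (centralized server, peer node, or cloud pool) is available and least loaded. The following is a minimal illustrative sketch only; all names (`EngineEndpoint`, `dispatch`, the endpoint labels) are invented for this example and do not appear in the application.

```python
# Hypothetical sketch of the networked-ASP idea: a dispatcher routes a
# recognition request to whichever deployment hosts a suitable engine.
# All names and load figures are illustrative, not from the application.
from dataclasses import dataclass

@dataclass
class EngineEndpoint:
    name: str          # e.g. "central-server", "peer-node-7", "cloud-pool"
    deployment: str    # "centralized", "peer", or "cloud"
    load: float        # current workload, 0.0 (idle) to 1.0 (saturated)

def dispatch(endpoints, prefer="centralized"):
    """Pick the least-loaded engine, preferring the requested deployment type."""
    preferred = [e for e in endpoints if e.deployment == prefer]
    pool = preferred or endpoints      # fall back to any deployment
    return min(pool, key=lambda e: e.load)

endpoints = [
    EngineEndpoint("central-server", "centralized", 0.9),
    EngineEndpoint("peer-node-7", "peer", 0.2),
    EngineEndpoint("cloud-pool", "cloud", 0.5),
]
print(dispatch(endpoints, prefer="peer").name)         # -> peer-node-7
print(dispatch(endpoints, prefer="centralized").name)  # -> central-server
```

The fallback in `dispatch` mirrors the point of the paragraph: the same application software can be served by any of the three network configurations interchangeably.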
However, a barrier to implementation of these networked systems is the user's reliance on internal “business” and/or system protocols (legacy protocols), which in many cases include both unique native communications protocols and application protocols. These protocols are marked by their unique interface with the entity's system and/or organization and are, therefore, not universal in their interconnect capabilities or their application. Thus, most network systems are unavailable to users employing legacy or native systems.
As set forth in the parent and grandparent applications, legacy systems require seamless interfacing with network application service provider software that enables powerful speech recognition and/or transcription engines before those legacy systems can interface effectively with robust network-based systems. Centralized databases (or uniformly accessible databases) that contain information for a number of users, including the widespread availability of specific vocabularies (phraseology, grammar, and dictionaries) as well as formatting structures for users of the system, are usually more efficient than a network of mere direct, point-to-point links between individual users.
But universally available recognition databases, including vocabulary databases and dictionaries, suffer from significant inefficiencies in facilitating communications between users of a more centralized database system, especially if the dictation to be transcribed is “free form” or dynamic. Even though a recognition engine may be very accurate in spoken word (speech) recognition, the transcription may be filled with transcribed material which is “out of context,” misinterpreted, or not formatted correctly. Simply stated, “garbage in, garbage out.”
Thus, even though engine providers advertise in terms of recognition and transcription accuracy, the real issues with these robust engines are ease of use (user friendliness) and the direct usability of the transcribed material without extensive editing, correcting, and/or reformatting. Perhaps most significantly, the content of a single database rarely contains every user's required information, even when that database specializes in information regarding a particular field of expertise, e.g., medicine.
A system for facilitating the exchange of speech (which includes spoken text and verbal and non-verbal commands) and information among users having heterogeneous and/or disparate internal system protocols, which is safe, secure, and easy to use, was set forth in the parent and grandparent applications. However, seamless use of automated speech recognition and/or transcription engines (ASRs) by one or more networked application service providers (ASPs) presents a system restriction which is inherent to this configuration. Even though the remotely located ASRs are more robust and provide for use of larger and more diverse dictionaries and vocabularies, including specific dictionaries, the ability of a remote user to properly select the needed system information for a specific application is restricted and complicated. This is especially true when ASRs, and/or different aspects of a single vocabulary or a specific dictionary, need to be selected “on the fly,” i.e., dynamically, during a “free form,” streamed dictation session, or in response to a streamed, prerecorded session.
When a particular “free form,” streamed dictation session requires access to a myriad of specialized functions, such as medical information, which must serve a number of specialized purposes, these system restrictions may overshadow the usefulness of networked robust ASRs. Similar restrictions are present on these remote, robust ASRs, especially when certain formatting and vocabulary are necessary for very specialized applications or functions, which form a portion of otherwise normal dictation.
Although some prior art systems contain “drop down menus” which can be populated and thus create documents with predetermined word lists and/or short phrases for the system, these systems contain inherent restrictions and interruptions in the dictating session which limit the required functionality for “free form,” streamed dictation. That is, these menus/lists do not provide the flexibility to accept the streaming of dictated sentences and phrases, including jargon, normally associated with and/or recognized by practitioners and/or paraprofessional or administrative personnel in a specific trade or profession such as, for example, medicine or law.
Thus, populating drop down menus/lists with predetermined single words or short phrases has not proven adequate for these higher functionality uses and unduly constrains the speaker and/or interrupts his/her train of thought. Additionally, these types of drop down menus/lists are more easily populated by an administrator on a keyboard or with a mouse; and, do not require the capability or sophistication of a centrally controlled transcription system and robust recognition and/or transcription engines (ASRs). An example of complex, “free form” dictation is a surgeon dictating notes during an open heart procedure or a radiologist reading an X-ray film or an MRI scan.
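The limitation of menu-driven systems described above can be made concrete with a toy sketch: an exact-match “menu” recognizer accepts only its prestored short phrases, so a streamed, jargon-laden free-form sentence finds no match. The phrase list and sentences below are invented for illustration and do not come from any actual system.

```python
# Toy illustration of why predetermined drop-down word lists fall short
# for "free form" dictation: the recognizer can only return one of its
# prestored short phrases, so a streamed sentence with professional
# jargon is simply rejected. Phrases are hypothetical examples.
MENU_PHRASES = {"normal sinus rhythm", "no acute findings", "follow up in 2 weeks"}

def menu_recognize(utterance):
    """Return the utterance only if it exactly matches a prestored menu entry."""
    return utterance if utterance.lower() in MENU_PHRASES else None

print(menu_recognize("No acute findings"))   # matches a menu entry
# A free-form, streamed sentence has no corresponding menu entry:
print(menu_recognize(
    "mild cardiomegaly with tortuous aorta, otherwise no acute findings"))  # None
```

The failure on the second sentence is the interruption the text describes: the speaker must either break the sentence into menu-sized fragments or stop dictating entirely.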
Previous attempts to expand the flexibility of centrally controlled systems involved creating large “user files” or databases which could be accessed only by a single user. These user files contained the needed “user profile” for dictation, as well as the user-specific vocabularies or dictionaries for the ASRs. Thus, all the capability of the system for a single user had to be pre-stored for that user alone. This limited the amount of new indexed data generally accessible to a specific user, as well as the flexibility of using ASRs and/or dynamically (“on the fly”) switching to specialized vocabularies as needed or directed by the user or the system. That is, general databases, for example, dictionaries, could not be universally updated without the necessity of updating each individual user's database associated with each specific ASR. Further, as these databases grew, the ability to navigate the different capabilities of these large databases in a short time frame (“on the fly”) became limiting, especially during “live,” complex transcription that required the ASR to dynamically switch among vocabularies of multiple specialties to obtain optimum recognition accuracy, and/or to handle multiple speakers such as, for example, in legal depositions. This made certain uses impossible such as, for example, in a courtroom setting or in an operating theater.
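The alternative to duplicating every vocabulary into each user's private profile is dynamic switching against shared, centrally updated vocabularies, sketched minimally below. All names (`SHARED_VOCABULARIES`, `RecognitionContext`, the specialty labels and word lists) are hypothetical and used only to illustrate the contrast drawn in the preceding paragraph.

```python
# Sketch of dynamic ("on the fly") vocabulary switching against shared,
# centrally updated vocabularies, rather than pre-storing a copy of every
# vocabulary in each user's private profile. Contents are illustrative.
SHARED_VOCABULARIES = {                  # updated once, visible to all users
    "general":    {"patient", "history", "report"},
    "cardiology": {"stenosis", "angioplasty", "ejection fraction"},
    "radiology":  {"opacity", "contrast", "attenuation"},
}

class RecognitionContext:
    def __init__(self):
        self.active = ["general"]        # vocabularies active for this session

    def switch(self, specialty):
        """Activate a specialty vocabulary mid-stream, keeping 'general'."""
        self.active = ["general", specialty]

    def in_vocabulary(self, word):
        return any(word in SHARED_VOCABULARIES[v] for v in self.active)

ctx = RecognitionContext()
print(ctx.in_vocabulary("stenosis"))     # False: cardiology not yet active
ctx.switch("cardiology")                 # dynamic switch during dictation
print(ctx.in_vocabulary("stenosis"))     # True: shared vocabulary now active
```

Because the vocabularies live in one shared store, updating the cardiology dictionary once makes the update visible to every user's next `switch`, avoiding the per-user duplication described above.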
Moreover, previous systems did not provide a dynamic system interface between the automated speech recognition and/or transcription engine (ASR) and the legacy user by which the system could prompt the user to focus the dictation, thereby providing a more structured set of recognition rules, a constrained recognition, and/or a structured transcription. Such systems required a cumbersome human-machine (system) interface, requiring the user to, for example, pause in order to “command” or instruct the system to accommodate the different scenarios, and then pause again until the system could locate and upload the database required to respond.
Additionally, certain recognition/speech engines, by design, process audio files on a “batch basis.” Although this is a design limitation unrelated to the accuracy or the speed of the engine, the constraint has heretofore foreclosed certain applications, including limiting these engines' capability to transcribe streamed dictation to the amount of information accepted by the recognition engine in a single batch. Other speech engines are only compatible with dictated speech from a specific source such as, for example, a live microphone or line input. This inhibits the ability of these engines to operate with digital systems or systems which digitize speech into data packets for system identification and processing. Thus, even though the capability was provided to access networked and remote functionality, the complete value of this capability was hindered by these inherent limitations.
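The batch constraint described above amounts to a chunking problem: a continuous dictation stream must be cut into pieces no larger than what the batch-oriented engine accepts per call, and the per-batch transcripts rejoined. The sketch below illustrates this under invented assumptions; `batch_transcribe` is a stand-in for a real engine, and the batch size is arbitrary.

```python
# Sketch of feeding a continuous dictation stream to an engine that only
# processes audio on a "batch basis": slice the stream into engine-sized
# batches, transcribe each, and rejoin the results. The engine stub and
# the batch limit are hypothetical stand-ins.
MAX_BATCH_SAMPLES = 4            # engine's per-batch input limit (illustrative)

def batch_transcribe(batch):
    """Stand-in for a batch-oriented engine: returns one transcript per batch."""
    return " ".join(f"w{n}" for n in batch)

def transcribe_stream(samples):
    """Feed a stream to a batch engine by slicing it into engine-sized batches."""
    parts = []
    for i in range(0, len(samples), MAX_BATCH_SAMPLES):
        parts.append(batch_transcribe(samples[i:i + MAX_BATCH_SAMPLES]))
    return " ".join(parts)

stream = list(range(10))         # stands in for streamed, digitized speech packets
print(transcribe_stream(stream)) # -> w0 w1 w2 w3 w4 w5 w6 w7 w8 w9
```

Without such slicing, anything past the first `MAX_BATCH_SAMPLES` of a streamed session would simply be dropped, which is the application-foreclosing limitation the paragraph identifies.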