1. Field of the Invention
The present invention relates to preserving private or confidential information in natural language databases, and more specifically to extraction of private information from natural language databases and to hiding an identity of a person associated with the private information.
2. Introduction
Goal-oriented spoken dialog systems aim to identify intents of humans, expressed in natural language, and take actions accordingly to satisfy their requests. In a spoken dialog system, typically, first the speaker's utterance is recognized using an automatic speech recognizer (ASR). Then, the intent of the speaker is identified from the recognized sequence, using a spoken language understanding (SLU) component. The following is an example dialog between an automated call center agent and a user.
    System: How may I help you?
    User: Hello. This is John Smith. My phone number is 973 area code 1239684. I wish to have my bill, long distance bill, sent to my Discover card for payment.
    System: OK, I can help you with that. What is your credit card number?
    User: My Discover card number is 28743617891257 hundred and it expires on first month of next year.
    System: . . .
As is clear from this example, these calls may include very sensitive information about the callers, such as names, credit card numbers, and phone numbers.
State-of-the-art data-driven ASR and SLU systems are trained using large amounts of task data, which is usually transcribed and then labeled by humans. This tends to be a very expensive and laborious process. In the customer care domain, “labeling” means assigning one or more of the predefined intent(s) (call-type(s)) to each utterance. As an example, consider the utterance “I would like to pay my bill” in a customer care application. Assuming that the utterance is recognized correctly, the corresponding intent or call-type would be Pay(Bill), and the action would be learning the caller's account number and credit card number and fulfilling the request. The transcribed and labeled data may then be used to train automatic speech recognition and call classification models.
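As a minimal sketch of what one labeled training example might look like, the following Python fragment pairs a transcription with its call-type and looks up the follow-up actions described above. The class name `LabeledUtterance`, the `ACTIONS` table, and the action names are hypothetical and chosen only for illustration; only the utterance and the Pay(Bill) call-type come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LabeledUtterance:
    """One transcribed utterance with its human-assigned call-type(s)."""
    transcription: str
    call_types: List[str] = field(default_factory=list)


# The utterance and call-type from the example above.
example = LabeledUtterance(
    transcription="I would like to pay my bill",
    call_types=["Pay(Bill)"],
)

# Hypothetical mapping from a call-type to the actions the system takes.
ACTIONS: Dict[str, List[str]] = {
    "Pay(Bill)": [
        "ask_account_number",
        "ask_credit_card_number",
        "fulfill_request",
    ],
}


def actions_for(utterance: LabeledUtterance) -> List[str]:
    """Collect the actions for every call-type assigned to the utterance."""
    steps: List[str] = []
    for call_type in utterance.call_types:
        # Unknown call-types contribute no actions.
        steps.extend(ACTIONS.get(call_type, []))
    return steps


print(actions_for(example))
```

An utterance may carry more than one call-type, which is why `call_types` is a list rather than a single label.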
The bottleneck in building an accurate statistical system is the time spent preparing high-quality labeled data. Sharing of this data is extremely important for machine learning, data mining, information extraction and retrieval, and natural language processing research. Reuse of the data from one application while building another application is also crucial in reducing the development time and making the process scalable. However, preserving privacy while sharing data is important since such data may contain confidential information. Outsourcing the data and tasks that require private data is another example of information sharing that may jeopardize the privacy of speakers. It is possible to mine natural language databases to gather aggregate information using statistical methods. The gathered information may be confidential or sensitive. For example, in an application from the medical domain, using the caller utterances and their call-types, one can extract statistical information such as the following:
    y % of the U.S. doctors prescribe <DRUG1> instead of <DRUG2>
    x % of company A's customers call the customer care center to cancel their service
Such statistics may be information that should be kept private for business-related reasons. A way of making information available while protecting privacy and confidentiality is needed.
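To illustrate how easily such aggregate statistics could be gathered from utterances and their call-types, the following Python sketch computes the percentage of calls carrying each call-type. The function name `call_type_percentages`, the sample utterances, and the call-type names Cancel(Service) and Ask(Balance) are assumptions made for illustration and do not come from any real data set.

```python
from collections import Counter
from typing import Dict, List, Tuple


def call_type_percentages(
    labeled_calls: List[Tuple[str, List[str]]]
) -> Dict[str, float]:
    """Return the percentage of calls that carry each call-type."""
    total = len(labeled_calls)
    if total == 0:
        return {}
    counts: Counter = Counter()
    for _utterance, call_types in labeled_calls:
        # Count each call-type at most once per call.
        counts.update(set(call_types))
    return {ct: 100.0 * n / total for ct, n in counts.items()}


# Hypothetical labeled calls: (transcription, call-types).
calls = [
    ("I want to cancel my service", ["Cancel(Service)"]),
    ("I would like to pay my bill", ["Pay(Bill)"]),
    ("please cancel my account", ["Cancel(Service)"]),
    ("what is my balance", ["Ask(Balance)"]),
]

stats = call_type_percentages(calls)
print(stats["Cancel(Service)"])  # share of calls asking to cancel service
```

Note that the computation uses only the call-type labels, so removing names and numbers from the transcriptions would not by itself hide this kind of aggregate information.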