1. Field
Embodiments of the invention generally relate to techniques for evaluating document content. More specifically, embodiments provide a method to identify a keyword list used to determine whether domain-specific language is used in a particular document.
2. Description of the Related Art
Data Loss Prevention (DLP) products provide software tools used to evaluate information shared across computer networks. For example, a DLP tool may be configured to evaluate the content of email messages to determine whether they should be blocked because they contain information that should not be shared outside of an enterprise. The DLP tool may use templates to evaluate content relative to a particular domain or standard. For example, a HIPAA template could flag a document as a potential violation if it is leaving the enterprise and contains both personally identifying information and medical terminology (e.g., drug, disease and medical procedure terms).
Frequently, a DLP tool may rely on a keyword list to determine whether a document or message is related to a particular topic or domain (or otherwise subject to a policy regarding dissemination, sharing, or access). If a document contains one or more terms from the keyword list, the document will be flagged (or blocked from being shared, etc.). However, keeping a keyword list up to date is a significant challenge. Further, intelligently identifying keywords to include in a keyword list is itself a difficult task. Such lists may be updated manually, but doing so is often labor intensive and time consuming (e.g., some datasets have over 100K term entries). Further, it is often the case that the author of the list, who may or may not have medical expertise, will not be fully aware of all the possible meanings (i.e., senses) of a word. As a result, the list may include words that, while having a sense related to the particular domain, also have other senses not related to that domain.
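The keyword-list check described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular DLP product: the names `MEDICAL_KEYWORDS`, `tokenize`, `matched_keywords`, and `is_flagged` are hypothetical, and the three-entry keyword list is invented for the example.

```python
import re

# Hypothetical keyword list (assumption for illustration); a real list
# may contain well over 100K entries.
MEDICAL_KEYWORDS = {"abdominouterine", "dialysis", "culture"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matched_keywords(text, keywords):
    """Return the sorted keywords that occur in the text."""
    return sorted(set(tokenize(text)) & keywords)


def is_flagged(text, keywords):
    """Flag the document if it contains one or more listed keywords."""
    return bool(matched_keywords(text, keywords))


if __name__ == "__main__":
    for doc in (
        "The patient began dialysis last week.",
        "Our company culture values openness.",
    ):
        print(is_flagged(doc, MEDICAL_KEYWORDS), matched_keywords(doc, MEDICAL_KEYWORDS))
```

In this sketch the second document is flagged only because "culture" has a medical sense (as in a laboratory culture) in addition to its everyday sense, which illustrates how a word with multiple senses can produce a false match.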
Another approach is to generate a keyword list. For example, medical terminology may be extracted from a coding standard, e.g., ICD-9-CM, a list of codes and descriptions published by the US government for coding medical interactions. Candidate terms may then be selected based on how infrequently they occur in everyday language. However, even if a potential keyword is used relatively infrequently in everyday language, it may not be specific to the domain of interest. Thus, relying on term frequencies alone can result in several poor choices such as “illegitimacy”, “bovine”, “symmetry”, and “turpentine” being selected as keywords along with good choices such as “abdominothoracocervical”, “abdominouterine”, and “abdominovenous”. Terms such as “bovine” are infrequent, but not primarily related to medicine. More generally, simply generating a keyword list from an ontology or standard will result in a keyword list with terms that are indeed infrequent, but that do not have a meaning specific to the target domain (i.e., they are not words having a single, or at least primary, domain-specific “sense”, a property referred to as “monosemy”).
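The shortcoming of frequency-only selection can be illustrated with a short sketch. Everything below is an assumption made for illustration: the function name `rare_terms`, the threshold, and the per-million frequency values are invented placeholders, not measurements from any corpus. The candidate terms are the examples named above plus one common word.

```python
# Candidate terms, standing in for words extracted from a coding standard.
CANDIDATES = [
    "disease", "illegitimacy", "bovine", "symmetry", "turpentine",
    "abdominothoracocervical", "abdominouterine", "abdominovenous",
]

# Invented occurrences-per-million values for a general-language corpus
# (placeholders only, not real data). Terms absent from the table are
# treated as never observed.
GENERAL_FREQ_PER_MILLION = {
    "disease": 45.0,
    "illegitimacy": 0.6,
    "bovine": 0.8,
    "symmetry": 0.9,
    "turpentine": 0.5,
}


def rare_terms(candidates, freq, threshold=1.0):
    """Keep candidates whose general-language frequency is below threshold."""
    return sorted(t for t in candidates if freq.get(t, 0.0) < threshold)


if __name__ == "__main__":
    for term in rare_terms(CANDIDATES, GENERAL_FREQ_PER_MILLION):
        print(term)
```

Under these placeholder values the filter drops only "disease" and keeps "bovine" and "turpentine" alongside "abdominouterine", because frequency alone carries no information about whether a term's primary sense belongs to the target domain.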