1. Field
The disclosed embodiments generally relate to thesaurus management systems and in particular to searching for and retrieving collected data mapped to base dictionaries.
2. Brief Description of Related Developments
In many industries, a variety of terms are used as labels for products, concepts, parts, ingredients, procedures, milestones, and other items commonly referred to in the industry or within a particular company. Often, such terms are applied inconsistently, whether through subtle permutations, the use of a more specific or more general term, or errors. A thesaurus of terms can be beneficial in determining equivalent and related terms. Such a thesaurus can be queried to find an equivalent term so that consistent term usage can be applied across the industry or company.
For example, studies or trials, such as clinical studies, are often undertaken during preparation of a new consumer product. Such studies are used to determine adverse effects, effectiveness, marketability, duration, and other aspects of the new product. In the health and pharmaceutical industries, clinical studies are often mandated and scrutinized under Federal and state governmental regulations prior to the release of a new pharmaceutical or medical product. Typically, a large quantity of clinical data is generated by such studies, and the clinical data is provided by a number of different sources involved with the study. The data may come from human test subjects, physician reports, drug dispensary logs, laboratory test results, and other sources. The clinical data is then entered and analyzed, typically from a text format.
Pharmaceutical companies that develop new drugs are required to produce reports that show that each new drug does not harm individuals who use it. The data used to produce these reports is often hard to analyze because it is collected as freeform text. The data can include adverse events in a patient's health and corresponding medical procedures, the medications a patient received during treatment, and diagnoses. This document refers to the freeform text data collected in studies and trials as verbatim terms.
Verbatim terms are difficult to process because the terminology used across a single study or across different but related studies can vary. For example, different investigators may use different terms when reporting or recording the same or very similar medical conditions. Thus, one or more different terms may be used to report the same condition or related conditions. However, reporting, grouping and further analysis of verbatim terms require the use of consistent terminology. If different terms are used to describe the same condition, it is difficult to collect accurate data related to the condition unless all of the possible terms are searched. If some terms are not searched, not all of the representative cases may be collected and analyzed. In studies where there are many cases, the collected data may then fail to present an accurate analysis of data related to a particular test, condition or trial.
Several vendors, such as WHO, publish dictionaries that can be used in processing verbatim terms. However, dictionaries can only partially process the data, because they cannot take misspellings, term mutations and entirely new terms, such as new drugs, into account. An analysis based simply on matches between the dictionary terms and the verbatim terms may therefore not be usable. This document refers to verbatim terms that do not match a dictionary term as thesaurus omissions.
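The limitation of simple dictionary matching described above can be illustrated with a minimal sketch. The function name, the sample dictionary terms, and the sample verbatim terms below are hypothetical and chosen only for illustration; they are not drawn from any published dictionary or from the disclosed embodiments.

```python
def classify_verbatim_terms(verbatim_terms, dictionary_terms):
    """Split verbatim terms into exact dictionary matches and omissions.

    Matching here is an assumption made for illustration: terms are compared
    after trimming whitespace and ignoring case, and nothing else.
    """
    # Index the dictionary by a normalized key for constant-time lookup.
    normalized = {term.strip().lower(): term for term in dictionary_terms}

    matches = {}    # verbatim term -> dictionary term it matched
    omissions = []  # verbatim terms with no dictionary match

    for verbatim in verbatim_terms:
        key = verbatim.strip().lower()
        if key in normalized:
            matches[verbatim] = normalized[key]
        else:
            omissions.append(verbatim)
    return matches, omissions


# Hypothetical sample data.
dictionary = ["Headache", "Nausea", "Myocardial infarction"]
collected = ["headache", "Hedache", "nausea ", "heart attack"]

matches, omissions = classify_verbatim_terms(collected, dictionary)
print(matches)
print(omissions)
```

In this sketch the misspelling "Hedache" and the lay term "heart attack" both end up as omissions, even though a human reviewer would recognize them as corresponding to dictionary terms.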
When a trial or study is conducted, the entered data is mapped to one or more dictionaries. In a clinical study, the entered data is mapped to one or more medical dictionaries for further analysis and data retrieval. One example of such a medical dictionary is MedDRA. When the mapping of the data is complete, the dictionaries can be used to search and analyze the collected data. Standardised MedDRA Queries (SMQ) were developed by the Council for International Organizations of Medical Sciences (CIOMS) to provide a standard way to search collected data for a given condition. Standardised MedDRA queries are groupings of MedDRA terms that relate to a specific medical condition or area of interest. The queries include terms that can relate to signs, symptoms, diagnoses, physical findings, laboratory and other test data, for example. Standardised MedDRA queries are essentially another dictionary that can be superimposed on top of MedDRA for search purposes.
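The idea of a query as a grouping of dictionary terms superimposed on mapped data can be sketched as follows. This is only an illustrative assumption of how such a search might look: the record layout, the mapping, the query name, and the grouped terms are all hypothetical and do not reproduce the content of any actual Standardised MedDRA Query.

```python
# Hypothetical mapping of verbatim terms to dictionary terms, as might
# exist once the mapping of collected data is complete.
verbatim_to_dictionary = {
    "heart attack": "Myocardial infarction",
    "MI": "Myocardial infarction",
    "pain in chest": "Chest pain",
    "bad headache": "Headache",
}

# Hypothetical query: a named grouping of dictionary terms that relate to
# one area of interest.
example_query = {
    "name": "Hypothetical cardiac query",
    "terms": {"Myocardial infarction", "Chest pain"},
}

# Hypothetical collected records, each carrying a verbatim term.
records = [
    {"patient": 1, "verbatim": "heart attack"},
    {"patient": 2, "verbatim": "bad headache"},
    {"patient": 3, "verbatim": "pain in chest"},
    {"patient": 4, "verbatim": "MI"},
]


def search_by_query(records, mapping, query):
    """Return records whose mapped dictionary term belongs to the query."""
    return [
        record
        for record in records
        if mapping.get(record["verbatim"]) in query["terms"]
    ]


hits = search_by_query(records, verbatim_to_dictionary, example_query)
print([record["patient"] for record in hits])
```

Because the search runs against the grouped dictionary terms rather than the verbatim text, records that use different wording for related conditions are retrieved together.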
The ORACLE™ Thesaurus Management System (TMS) addresses the complexities associated with managing global thesauri. One of the most time-consuming tasks in development processes, such as drug development, is the classification of verbatim terms so that standard terms for use in analysis can be derived from the free text originally captured. Many dictionaries exist for different types of information. The organization and defined hierarchies of these dictionaries can vary considerably. Presently, searching a dictionary in a specific manner for a given condition can require manually combining ad-hoc queries or writing a specific program, which is cumbersome and time-consuming.
It would be advantageous to provide a global facility to standardize terminology use across dictionaries, computer applications, time and organizations. It would also be advantageous to provide a centralized, globally available repository of dictionary terms and associated verbatim terms, where information in the repository is accessible through advanced searching and classification algorithms. It would also be advantageous to provide a retrieval tool to accompany a repository of dictionary terms that allows for the retrieval of data that have no obvious linguistic relationship.