1. Field of the Invention
The invention relates generally to the field of analyzing, managing and acting upon clinical information, and specifically to a system which integrates genetic and phenotypic data from a group of subjects into a standardized format in order to validate the data and to make better decisions related to the genetic and phenotypic information of a particular subject.
2. Description of the Related Art
The current methods by which clinical decisions are made do not make the best possible use of existing information management technologies. Very little data from clinical trials and electronic clinical records—collected as it is by a variety of methods, and stored in many different formats across a wide range of systems—is immediately reusable by other groups in the biomedical community and can be accessed to aid caregivers and other decision makers. This situation will become considerably more complicated once personal genetic data occupies a more central role in understanding the causes and treatments of diseases and other predispositions of subjects. Within the next decade it will be possible to scan the entire genome of a patient either for clinical trials, or for the purpose of personalized drug assignment. Insofar as researchers continue to use different methods and systems for storing and analyzing this data, and the associated phenotypic data, all of the difficulties associated with sharing information within the biomedical community will persist and worsen.
A body of prior art addresses tools that manage the integration of existing data sets. For example, success has been achieved with tools that input textual data and generate standardized terminology in order to achieve information integration, such as the Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Tools have been developed to inhale data into new ontologies from specific legacy systems, using object definitions and Extensible Markup Language (XML) to interface between the data model and the data source, and to validate the integrity of the data inhaled into the new data model. Bayesian classification schemes such as MAGIC (Multisource Association of Genes by Integration of Clusters) have been created to integrate information from multiple sources into a single normative framework, using expert knowledge about the reliability of each source. Several commercial enterprises are also working on techniques to leverage information across different platforms. For example, Expert Health Data Programming provides the Vitalnet software for linking and disseminating health data sets; CCS Informatics provides the eLoader software, which automates loading data into ORACLE® Clinical; PPD Patient Profiles enables visualization of key patient data from clinical trials; and TABLETRANS® enables specification of data transformations graphically.
Depending on the tool, automated approaches to data integration can be far less resource intensive than manual data integration, but will always be more constrained. It is exceedingly difficult to teach a computer how data of heterogeneous types should be sensibly merged. Prior art that most successfully approaches data integration makes use, in some form, of standardized master templates which define a data model, and provide a clear framework to researchers for inputs into, and augmentations of, that data model. This has been successfully applied, for example, in the GO (Gene Ontology) project, which provides a taxonomy of concepts and their attributes for annotating gene products. Similar projects include the Mouse Genome Database (MGD) and the Mouse Gene Expression Database (GXD). However, no system exists today to combine all phenotypic and genetic information associated with a patient into a single data model; to create a series of logical and statistical interrelationships between the data classes of that standard; to continually upgrade those relationships based on the data from multiple subjects and from different databases; and to use that information to make better decisions for an individual subject.
Prior art exists to manage information in support of caregivers and for streamlining clinical trials. Some of the enterprises involved in this space include Clinsource which specializes in software for electronic data capture, web randomization and online data management in clinical trials; Perceptive Informatics which specializes in electronic data capture systems, voice response systems, and web portal technologies for managing the back end information flow for a trial; and First Genetic Trust which has created a genetic bank that enables medical researchers to generate and manage genetic and medical information, and that enables patients to manage the privacy and confidentiality of their genetic information while participating in genetic research. None of these systems make use of expert and statistical relationships between data classes in a standardized data model in order to validate data or make predictions; or provide a mechanism by which electronically published rules and statistical models can be automatically input for validating data or making predictions; or guarantee strict compliance with data privacy standards by verifying the identity of the person accessing the data with biometric authentication; or associate all clinical data with a validator the performance of which is monitored so that the reliability of data from each independent source can be efficiently monitored; or allow for compensation of individuals for the use of their data; or allow for compensation of validators for the validation of that data.
Prior art exists in predictive genomics, which tries to understand the precise functions of proteins, RNA and DNA so that phenotypic predictions can be made based on genotype. Canonical techniques focus on the function of Single-Nucleotide Polymorphisms (SNPs); but more advanced methods are being brought to bear on multi-factorial phenotypic features. These methods include regression analysis techniques for sparse data sets, as is typical of genetic data, which apply additional constraints on the regression parameters so that a meaningful set of parameters can be resolved even when the data is underdetermined. Other prior art applies principal component analysis to extract information from underdetermined data sets. Recent prior art, termed logical regression, also describes methods to search for different logical interrelationships between categorical independent variables in order to model a variable that depends on interactions between multiple independent variables related to genetic data. However, all of these methods have substantial shortcomings in the realm of making predictions based on genetic and phenotypic data. None of the methods provide an effective means of extracting the simplest and most intelligible rules from the data, by exploring a wide array of terms that are designed to model a wide range of possible interactions of variables related to genetic data. In addition, none of these prior techniques enable the extraction of the simplest intelligible rules from the data in the context of logistic regression, which models the outcome of a categorical variable using maximum a-posteriori likelihood techniques, without making the simplifying assumption of normally distributed data. These shortcomings are critical in the context of predicting outcomes based on the analysis of vast numbers of data classes relating to genetic and phenotypic information.
These methods also do not effectively empower individuals to ask questions about the likelihood of particular phenotypic features given genotype, or about the likelihood of particular phenotypic features in an offspring given the genotypic features of the parents.
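By way of illustration only, the general idea discussed above of constraining regression parameters so that an underdetermined categorical-outcome problem still yields a single set of parameters can be sketched as follows. The sketch is not taken from any of the cited prior art and is not the method of the invention. It assumes a plain logistic regression with a quadratic penalty on the weights (equivalent to a maximum a-posteriori estimate under a Gaussian prior), fitted by gradient descent. The function names, the penalty value, and the toy data (four subjects, six binary genetic indicators) are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Probability of the positive phenotype for one subject."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_map_logistic(X, y, penalty=1.0, lr=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood plus
    (penalty / 2) * ||w||^2. The intercept b is not penalized."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        grad_w = [penalty * wj for wj in w]  # gradient of the penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi  # gradient of the log-loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: 4 subjects, 6 binary genetic indicators, so there
# are more parameters than observations (an underdetermined problem).
X = [
    [1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1],
]
y = [1, 1, 0, 0]  # hypothetical binary phenotype

w, b = fit_map_logistic(X, y)
for xi, yi in zip(X, y):
    print(yi, round(predict(w, b, xi), 3))
```

Without the penalty term, the likelihood for these separable data has no finite maximizer, because the weights can grow without bound. With the penalty, the objective is strictly convex in the weights and a single solution exists. This is one simple instance of the kind of added constraint that the regression techniques described above rely on.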