A dataset is conceptually similar to a spreadsheet and comprises rows and columns. Each row is called an observation and represents a subject. Each column is called a variable and represents a feature, trait, or measurement related to the subject. Subject ID is a special variable that is used to identify each subject, such as a patient in a clinical research study.
A distribution of a variable is a basic statistical description of the variable. For a continuous variable, such as a subject's height in inches, the common statistics of interest include the mean, standard deviation, minimum, maximum, median, and various percentiles, such as the 10th percentile, the 25th percentile, etc. For a discrete, or categorical, variable, such as gender or race, the common statistics of interest include the counts for each of the discrete categories.
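The statistics listed above can be sketched in a few lines of Python using only the standard library. The sample values, the variable names, and the helper names `describe_continuous` and `describe_categorical` are hypothetical and serve only as an illustration. The percentiles use the "inclusive" interpolation method, which is one of several common conventions.

```python
import statistics
from collections import Counter

# Hypothetical sample data: one value per subject.
heights_in = [61.0, 64.5, 66.0, 67.5, 70.0, 72.0, 74.5]
gender = ["F", "M", "F", "F", "M", "M", "F"]


def describe_continuous(values):
    """Return common summary statistics for a continuous variable."""
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "p10": cuts[9],
        "p25": cuts[24],
    }


def describe_categorical(values):
    """Return the count of observations in each category."""
    return dict(Counter(values))


summary = describe_continuous(heights_in)
counts = describe_categorical(gender)
print(summary)
print(counts)
```

Other percentile conventions give slightly different values on small samples, so the choice of method should be stated wherever such a summary is reported.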
A regression model is a statistical formula that uses independent variables, referred to as Exposures and Covariates, to predict a dependent variable of interest, referred to as Outcome. The following formula is an example of a regression model: f(SBP), where SBP = β0 + β1*AGE + β2*BMI + e. SBP is the Outcome of the regression model and represents systolic blood pressure of subject patients. AGE is an independent variable and represents the age of the patients. BMI is also an independent variable and represents the body mass index of the patients.
An Exposure is an independent variable in a regression model whose variation is observed to determine how it influences the variation of the Outcome. A Covariate, or adjusting variable, is an independent variable in a regression model that is not an Exposure. In the exemplary regression model, for example, if AGE is selected as the Exposure, BMI is a Covariate, and vice versa. Either or both of the two independent variables can be selected as an Exposure.
A regression coefficient is a constant that represents the rate of change of the Outcome as a function of changes in an independent variable, such as an Exposure, with the other independent variables held fixed. In the exemplary regression model, for example, β1 and β2 are the regression coefficients associated with the AGE and BMI variables, respectively. If β2 is equal to zero, for instance, it means that changes in BMI are not linearly associated with changes in SBP once AGE is accounted for. A regression coefficient thus shows the extent to which the variable associated with the coefficient is associated with the Outcome of a regression model.
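As an illustration of how the coefficients of the exemplary model might be estimated, the following is a minimal sketch of ordinary least squares for SBP on AGE and BMI, written with the Python standard library only. Ordinary least squares is assumed here as the fitting method; the text does not name one. The data are made up and generated without noise from hypothetical coefficients (β0 = 90, β1 = 0.5, β2 = 1.0) so that the fit can be checked by eye, and the function names `solve` and `fit_sbp_model` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_sbp_model(age, bmi, sbp):
    """Estimate (b0, b1, b2) in SBP = b0 + b1*AGE + b2*BMI + e by least squares."""
    design = [[1.0, float(a), float(m)] for a, m in zip(age, bmi)]
    k = 3
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, sbp)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data built from b0 = 90, b1 = 0.5, b2 = 1.0.
age = [35, 42, 50, 58, 63, 71]
bmi = [22.0, 27.5, 24.0, 31.0, 26.5, 29.0]
sbp = [90 + 0.5 * a + 1.0 * m for a, m in zip(age, bmi)]

b0, b1, b2 = fit_sbp_model(age, bmi, sbp)
print(f"b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}")
```

Read this way, b1 is the estimated change in SBP per additional year of AGE with BMI held fixed, and b2 is the estimated change in SBP per unit of BMI with AGE held fixed.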
A variable is said to be associated with another variable if the changes of the two variables are found to be correlated. An association test involves fitting and testing a regression model to determine the regression coefficients and to see whether any of them indicates a statistically significant association with the Outcome. Epidemiological data analysis, for example, focuses on the association of Exposures with an Outcome, wherein the association is tested with and without adjusting for other Covariates.
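The difference between an unadjusted and an adjusted association test can be sketched as follows, again assuming ordinary least squares and using only the Python standard library. The data are made up and deliberately constructed so that SBP depends on BMI while AGE merely tracks BMI. The helper names `invert` and `fit` are hypothetical. The sketch stops at the t statistic (coefficient divided by its standard error); converting it to a p-value requires the Student's t distribution, which a statistics library would normally supply.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(columns, outcome):
    """Least-squares fit; returns {name: (coefficient, std_error, t_statistic)}."""
    names = ["(intercept)"] + list(columns)
    n, k = len(outcome), len(names)
    design = [[1.0] + [float(columns[c][i]) for c in columns] for i in range(n)]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, outcome)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y - sum(b * x for b, x in zip(beta, r)) for r, y in zip(design, outcome)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    result = {}
    for i, name in enumerate(names):
        se = math.sqrt(s2 * inv[i][i])
        result[name] = (beta[i], se, beta[i] / se)
    return result


# Hypothetical data: AGE is the Exposure, BMI the Covariate.
age = [30, 38, 40, 50, 52, 62, 64, 72]
bmi = [20, 22, 24, 26, 28, 30, 32, 34]
sbp = [121, 123, 127, 133, 137, 139, 143, 149]

unadjusted = fit({"AGE": age}, sbp)            # Outcome ~ Exposure
adjusted = fit({"AGE": age, "BMI": bmi}, sbp)  # Outcome ~ Exposure + Covariate

for label, res in (("unadjusted", unadjusted), ("adjusted", adjusted)):
    coef, se, t = res["AGE"]
    print(f"{label}: AGE coef={coef:.4f} se={se:.4f} t={t:.2f}")
```

With these constructed values, AGE shows a large t statistic when tested alone, but its coefficient essentially vanishes once BMI is included. This is the kind of contrast that testing with and without Covariate adjustment is meant to reveal.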
Stratification is defined as the process of partitioning data into distinct, non-overlapping groups. Stratification is used when a study population's sub-domains are of particular interest. A stratified variable is a variable that represents a measurement obtained from a partitioned group of a study population.
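A minimal sketch of stratification, assuming a dataset held as a list of per-subject records, is shown below. The field names (`id`, `sex`, `sbp`), the values, and the helper name `stratify` are hypothetical. Each record lands in exactly one group, so the groups are non-overlapping and together cover the whole dataset.

```python
import statistics
from collections import defaultdict

# Hypothetical dataset: one record (observation) per subject.
records = [
    {"id": 1, "sex": "F", "sbp": 118},
    {"id": 2, "sex": "M", "sbp": 130},
    {"id": 3, "sex": "F", "sbp": 122},
    {"id": 4, "sex": "M", "sbp": 134},
    {"id": 5, "sex": "F", "sbp": 126},
]


def stratify(rows, key):
    """Partition rows into non-overlapping groups by the value of `key`."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    return dict(strata)


strata = stratify(records, "sex")

# A stratified measurement: mean SBP computed separately within each group.
mean_sbp_by_sex = {
    level: statistics.mean(row["sbp"] for row in group)
    for level, group in strata.items()
}
print(mean_sbp_by_sex)
```

The same partition can feed any per-group analysis, for example fitting a separate regression model within each stratum.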
Statistical tools presently available in the prior art are rigidly designed around statistical methods rather than around the ease of obtaining data analysis outputs. Users (e.g., epidemiologists), for instance, have to write a substantial amount of program code in order to apply the statistical methods to analyze available data, extract the relevant information from the outputs of such tools, and put the information into a report.