The failure of clinical trials to detect significant differences in efficacy between treatment groups is a well-recognized and increasingly costly impediment to clinical drug development (Robinson & Rickels, 2000, Journal of Psychopharmacology 20:593-596). The difficulty is particularly acute in clinical trials of psychiatric drugs, where placebo response rates of 30-40% or more are not uncommon (Thase, 1999, Journal of Clinical Psychiatry 60 (Suppl. 4): 23-31; Trivedi & Rush, 1994, Neuropsychopharm 11(1): 33-43; Quitkin et al., 2000, Am. J. Psychiatry. 157: 327-337), making discrimination of active drug effects especially demanding. Some studies of major depression have reported placebo response rates as high as 70% (Brown et al., 1988, Psychiatry Res. 26: 259-264).
Placebo-controlled trials are increasingly difficult to justify on ethical grounds when an effective treatment is known (Quitkin, 1999, Am. J. Psychiatry 156: 829-836). In order to assure assay sensitivity, and/or in response to regulatory requirements, more trials are incorporating active comparators—drugs which are known to have efficacy in treating a particular disorder. In such trials sensitivity assumes an even more important role for detecting the small differences between two positive outcomes.
To enhance the statistical power of a given clinical trial, investigators can simply include greater numbers of patients. However, this approach has several major drawbacks. First, it adds substantially to the cost of performing clinical trials, as the cumulative per-subject costs often represent the majority of the total costs of a trial. More importantly, this approach requires that larger numbers of patients be exposed to experimental drugs, or drugs that may not yet have shown clear benefit for their illness. Unfortunately, increasing the number of patients is also associated with an increased placebo response rate (Keck et al., 2000, Biol. Psychiatry 47: 748-755; Keck et al., 2000, Biol. Psychiatry 47: 756-761; Schatzberg & Kraemer, 2000, Biol. Psychiatry 47: 736-744), thereby negating to some extent the benefit of an increased sample size.
Another mechanism for enhancing the power of a clinical trial involves improving the reliability of outcome measurements (Leon et al., 1995, Arch. Gen. Psychiatry 52: 867-871). When outcome measurements require human evaluation of clinical status, reliability depends on the skills of the human raters performing the evaluation. Improving the skills of human raters, and making their ratings reliable and sensitive, presents a significant hurdle in designing, conducting, and even analyzing clinical trials.
Another feature of clinical trials is the need for one or more launch meetings to, inter alia, train raters, provide information to study coordinators and leaders, and discuss the underlying methodology. Such launch meetings for training raters can be quite expensive, particularly if many such meetings are required.
Current Methods of Rater Training
Although many clinical trials depend on raters, previously known methods of rater training leave much to be desired. In psychiatric clinical trials, for example, ratings by human raters are often the primary outcome measures. Despite the critical role of human raters, large clinical trials typically offer only cursory rater training at a study launch meeting just prior to rater certification. Over the course of a 1-3 day launch meeting, the time allotment for training and rater certification is usually 2-4 hours. In addition, experience shows that human raters are frequently unaware of how important it is to the success of a study that rater reliability be maintained.
Often, current rater training is limited to reading through items on the rating scale(s). Some trials offer raters verbal and written conventions to help standardize the approach to common rater dilemmas (e.g., round up when a rating falls between two anchor points, rate each item independently of the contribution of concomitant drugs or general medical conditions). Trials generally do not provide raters with scripts for the primary outcome measures or other instruction for limiting the variation in a scale score due to the interview itself.
Problems in Rater Training
Initial rater training, though necessary, is often of limited value. The interval between the launch meeting and local site enrollment of patients into the clinical trial is seldom less than three weeks and is often more than three months. Even when training manuals are provided, the raters often fail to consult them.
Variations in scoring conventions from one trial to another may further dilute the benefits of initial rater training. As a result, raters may deviate from the training instructions. It is not unusual to find, for example, raters in a study using the Young Mania Rating Scale (YMRS) (Young et al., 1978, Br. J. Psychiatry 133: 429-435) applying scoring conventions that were taught in a previous training session for a completely different study that used another rating scale such as the Schedule for Affective Disorders and Schizophrenia, Current symptom version (“SADS-C”) mania rating scale.
Even the most skillful training at a launch meeting cannot train raters who do not attend the meeting. Some studies cope with this problem by staging multiple launch meetings, but this is expensive and does not address the need (which commonly arises) to hire additional raters after study start-up. Variations between the different launch meetings (which may be conducted by different personnel) may result in further rater variability.
Problems in Rater Certification
Rater certification refers to the process by which rater performance is documented to be within an acceptable range. Common practice for certification requires raters to score a rating scale based on viewing a videotaped patient interview. Certification is typically based on achieving agreement as determined by calculation of an intra-class correlation coefficient or more often by reference to an expert consensus score that serves as a “gold standard.” Most clinical trials attempt to certify raters at the launch meeting itself, and require raters to meet a certification standard when tested on a single occasion. Since certification is typically carried out immediately following the training, the frequency at which raters achieve the targets of certification is likely to be much higher than might be expected with a delay between launch meeting and certification.
There is thus a need to improve the certification of raters, and to reduce the time between certification and clinical administration of rating scales. There is also a need to permit certification outside of a clinical launch meeting environment.
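The two certification criteria described above, an intra-class correlation coefficient and agreement with an expert consensus ("gold standard") score, may be illustrated by the following minimal sketch. It is offered only as an illustration under stated assumptions: the ICC form shown is the two-way random effects, absolute agreement, single-rater form commonly written ICC(2,1), which is one of several forms a trial might choose; the function names, the tolerance parameter, and the example scores are hypothetical and do not come from any particular trial.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the total scale score given to taped interview i by rater j.
    Assumes a complete table (every rater scores every interview) with some
    variation in the scores; identical scores everywhere give a zero denominator.
    """
    n = len(scores)      # number of taped interviews
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least 2 interviews and 2 raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: interviews (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def within_gold_standard(rater_score: float, gold_score: float,
                         tolerance: float) -> bool:
    """Hypothetical pass rule: the rater's total score lies within
    `tolerance` points of the expert consensus score."""
    return abs(rater_score - gold_score) <= tolerance


# Hypothetical scores: 4 taped interviews (rows) rated by 3 raters (columns).
example_scores = [[24, 26, 23], [12, 14, 11], [31, 30, 33], [18, 17, 20]]
print(round(icc_2_1(example_scores), 2))
print(within_gold_standard(rater_score=22, gold_score=24, tolerance=3))
```

An ICC near 1 indicates that raters agree closely on the same interviews. The gold standard check is the simpler criterion and can be applied to a single rater on a single tape, which is consistent with the single-occasion testing described above.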
The Need for Improved Rater Reliability
The need for standardized rater training has been described (Muller & Wetzel, 1998, Acta. Psychiatr. Scand. 98: 135-139). To minimize measurement error, investigators seek more consistent and better-trained raters. Unfortunately, since most large trials occur over a long period of time and involve multiple centers, each with its own raters, the logistical obstacles to standardized training continue to be serious hurdles.
Even modest gains in rater reliability can result in a substantial reduction in the sample size requirement, time, cost and risk of failure that can thwart development of promising therapeutic agents. For example, Perkins et al. (Biol. Psychiatry 47: 762-766 (2000)) calculated that an improvement in reliability from R=0.7 to R=0.9 could reduce sample size requirements by 22%. This may often translate into significant cost savings.
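The arithmetic behind a figure of this size can be sketched as follows. The sketch assumes the classical attenuation relationship, in which the observed standardized effect equals the true effect multiplied by the square root of the reliability R, so that the required sample size for fixed power is proportional to 1/R. This is an assumption made for illustration and is not necessarily the exact calculation used in the cited reference; the function name is hypothetical.

```python
def relative_sample_size(r_old: float, r_new: float) -> float:
    """Ratio of required sample sizes (new / old) when reliability changes.

    Assumes n is proportional to 1 / R (classical attenuation of effect size),
    with everything else (true effect, alpha, power) held fixed.
    """
    if not (0 < r_old <= 1 and 0 < r_new <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_old / r_new


ratio = relative_sample_size(0.7, 0.9)   # 0.7 / 0.9, about 0.78
reduction_pct = (1 - ratio) * 100        # about 22.2
print(f"{reduction_pct:.0f}% fewer subjects")  # prints "22% fewer subjects"
```

Under this assumption, a trial that would need 100 subjects per arm at R=0.7 would need about 78 per arm at R=0.9.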
In psychiatric trials (i.e., trials of therapy for a psychiatric disorder), where objective biological outcome measures may be lacking, reliability may be particularly poor, as investigators typically rely on rating scales completed by human clinicians. Seeking to quantify subjective experiences or behavior introduces substantial measurement error. Thus, the need for increased rater reliability is even greater in psychiatric trials.
Problems of Ongoing Reliability
Certification of raters under controlled conditions leaves room for error and abuse as well as simple incompetence during the actual conduct of study ratings. Many studies utilize raters operating under considerable time pressure. Experience indicates that rating scale scores are significantly correlated with the duration of the rating interview and that time allotted for the interview tends to decrease over the course of a study. Attempting to interview patients in a fixed time tends to lower the scores of symptomatic patients, reducing potential drug-placebo differences.
Audio or video taping of interviews could effectively ameliorate this problem, but is a costly, time-intensive, and intrusive methodology that requires an elaborate system of expert review, resolution of differences and remediation. Each tape must be reviewed in its entirety by an expert, or panel of experts, effectively doubling or tripling the amount of time required to obtain a particular rating. Moreover, this methodology is often unacceptable to patients and raters. Awareness of the tape recording may alter patient behavior (and the resulting ratings). For example, the patient may feel more self-conscious about discussing sensitive or embarrassing topics while being recorded.
There is thus a need for efficient monitoring of raters during the course of a trial in order to detect rater drift and variance so that remediation efforts and recertification may be instituted when necessary.
Problems of Recertification
Re-certification of raters refers to the process by which previously certified raters are reexamined to confirm that their ratings remain calibrated to study standards. This process aims to measure and reduce the tendency for raters to drift away from the rating norms established at study start-up. In theory, this is a relatively simple process that can be accomplished by having raters rate videotapes for which consensus or “gold standard” ratings have been established.
Despite the desirability of re-certification, few studies ever recertify raters. The simple requirement for additional tapes with gold standard ratings is not particularly challenging. More significant obstacles include the expense of reassembling the raters in a central location or coordinating rater schedules with those of a visiting monitor. Additionally, there is a risk that failure of a single rater to recertify may cripple a site in the midst of study operations.
There is thus a need for a re-certification process that is more convenient and better integrated into the conduct of clinical trials.