The majority of prior art evaluation systems rely on a few techniques. Item response theory is the predominant interpretation strategy in the prior art. Item response theory guides the interpretation of multiple-choice tests, in which a test comprises a list of items to be answered by a respondent, and wherein each item often comprises a stem, a best response, and a list of distracting responses. Items typically have variable difficulty, inversely related to the proportion of respondents who select the best response. A distracting response may be said to confer a degree of difficulty related to the proportion of respondents who select it instead of the best response. Experience using an item defines its degree of difficulty, and the difficulty posed by each distracting response.
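The proportion-based notion of difficulty described above can be sketched as follows. This is a minimal illustration only, not a full item response theory model: the function name `item_statistics`, the option labels, and the sample responses are hypothetical, and difficulty is assumed here to be one minus the proportion of respondents selecting the best response.

```python
from collections import Counter


def item_statistics(responses, best):
    """Return (difficulty, distractor_shares) for a single item.

    responses: option labels chosen by respondents (hypothetical data).
    best: label of the best response.
    """
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    total = len(responses)
    # Assumption: difficulty is 1 minus the share choosing the best response,
    # one simple quantity that is inversely related to that share.
    difficulty = 1.0 - counts[best] / total
    # Share of respondents drawn to each distracting response.
    distractor_shares = {
        option: n / total for option, n in counts.items() if option != best
    }
    return difficulty, distractor_shares


# Ten hypothetical respondents answering one item whose best response is "A".
difficulty, shares = item_statistics(list("AABACADABB"), best="A")
```

In this sketch, a distractor with a large share contributes more to the difficulty of the item than one that is rarely chosen.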
Experience with multiple items also establishes a correlation, if any, between respondents' choices on different items. Accordingly, the responses selected for simple or difficult items may be used to predict responses to items of intermediate difficulty. An adaptive test may take advantage of this knowledge. A respondent who selects the best response to the most difficult items is likely to score well on items of intermediate and low difficulty. The test administrator may classify a very competent or very incompetent respondent simply by inspecting answers to a few difficult or simple items. Based on the responses given to the difficult or simple items, the respondent may conclude the test very quickly, without completing a large number of intermediate-difficulty items. This computer adaptive testing process does not adapt the test to likely respondent weaknesses, to new knowledge, or to content that is relevant given earlier choices.
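The early-termination behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular adaptive testing product: the function name `adaptive_classify`, the `probe` and `pass_fraction` parameters, and the classification labels are all hypothetical.

```python
def adaptive_classify(items, answers_correctly, probe=2, pass_fraction=0.6):
    """Classify a respondent, stopping early when a few items suffice.

    items: list of (item_id, difficulty) pairs.
    answers_correctly: function item_id -> bool (hypothetical respondent).
    Returns (label, number_of_items_administered).
    """
    if len(items) < 2 * probe:
        raise ValueError("need at least 2 * probe items")
    ordered = [item_id for item_id, _ in sorted(items, key=lambda pair: pair[1])]
    results = {}

    def ask(item_id):
        results[item_id] = bool(answers_correctly(item_id))
        return results[item_id]

    # Lists (not generators) so that every probe item is actually administered.
    if all([ask(i) for i in ordered[-probe:]]):      # hardest items all correct
        return "competent", len(results)
    if not any([ask(i) for i in ordered[:probe]]):   # easiest items all missed
        return "not competent", len(results)
    # Otherwise administer the intermediate items and score the whole test.
    for item_id in ordered[probe:-probe]:
        ask(item_id)
    fraction = sum(results.values()) / len(results)
    label = "competent" if fraction >= pass_fraction else "not competent"
    return label, len(results)


items = [("q1", 0.1), ("q2", 0.2), ("q3", 0.5), ("q4", 0.6), ("q5", 0.8), ("q6", 0.9)]
strong = adaptive_classify(items, lambda item_id: True)
weak = adaptive_classify(items, lambda item_id: False)
mixed = adaptive_classify(items, lambda item_id: item_id in {"q1", "q2", "q3"})
```

Note that the item sequence depends only on correctness of earlier answers; nothing in the sketch selects content according to the respondent's specific weaknesses, which is the limitation noted above.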
Variations on the multiple-choice theme include True/False questions, items with multiple acceptable responses, and clinical set problems. A clinical set problem presents clinical information in one or more blocks. A cluster of questions, typically True/False questions about appropriate testing and treatment strategies, follows each block of information. Later blocks of information must be carefully designed such that they do not compromise the items in earlier blocks of information. For instance, if an early block of information precedes an item asking whether to obtain an x-ray, a subsequent block disclosing the results of an x-ray implies that it should be ordered. These item formats do not show that the respondent knows what to ask, how to interpret answers, or how to react to problems.
An alternative embodiment of the clinical set problem presents a block of information on paper or computer, and then allows the user to select questions and view answers using a highlighting pen on paper or a computer display. The user may have many options, and can demonstrate the ability to select questions, interpret answers, and recommend treatments during a single clinical encounter. However, paper and static computer cases generally cannot anticipate the progress of the virtual patient through subsequent visits. A programmer can create a static computer case that is grossly responsive to likely interventions, but such cases are tightly constrained by their fixed program. For instance, a static program cannot simulate the consequences of the time and dose of medication in detail.
An alternative prior art evaluation process, commonly used in medical fields, requires an assessor to monitor a clinician's behavior with a checklist. In an exemplary implementation, the assessor is a human actor with or without a real medical problem, who portrays a medical problem (or a plurality thereof) to the respondent. This actor is called a standardized patient. Typically, the test sponsor trains the standardized patient both to portray a medical problem and to observe the respondent's behavior. The standardized patient notes the presence or absence of each monitored behavior of a respondent. For instance, the standardized patient may check that a respondent introduced himself by name, and that the respondent ignored a direct question. Limited quantitative reporting is possible.
The checklist used in this evaluation method is necessarily fixed for several reasons. First, standardized patients must be thoroughly familiar with their checklists in order to properly evaluate respondents. These lists are often long, and adding multiple versions and variations would complicate training. Second, a comparison between respondents is a typical analysis that is undertaken to evaluate a training program, or to convince a respondent that the evaluation of his performance is normal or unusual. Standardization is required to ensure comparability. Third, standardized patients would typically be ill-equipped or unable to dynamically and objectively modify their checklists in response to respondents' decisions.
In other implementations, a third party observer may use a checklist to describe a physician's actions. The observer is typically trained to look for specific behaviors during the physician-patient interaction. A third party may observe directly, through a two-way mirror, or may analyze a written, audio, or audiovisual record of the encounter. In a third party observation, less constrained data entry is typical, such as recording of numeric data or subjective impressions.
A further common evaluation process is the oral examination of a respondent by an expert examiner. An oral examination is a cycle in which the examiner provides a block of information with implied or explicit questions. The respondent answers, and then the examiner provides more information and questions. The examiner may have a script, but may also improvise content and questions at any time. The examiner typically judges the responses—or the respondent—to determine whether the respondent passes. This method suffers from the disadvantage that oral examinations are notoriously difficult to standardize, and could easily be influenced by examiner biases.
Evaluation of free text, drawings, and similarly rich response content by an expert also suffers from subjective inconsistencies, as documented by Trenholm, et al. in U.S. Pat. No. 6,234,806, “System and method for interactive scoring of standardized test responses,” incorporated herein by reference. Trenholm describes a system whereby a test analyst, analogous to an expert examiner, may interactively modify the grading rules for various features of a response. In this system, expert examiners work with and constantly update a database intended to support more consistent scoring. The expert modifies the database after the respondent completes the test. The respondent has no direct interaction with the database. A response may be automatically re-evaluated when the database criteria change.
Another prior art evaluation process requires assessment of actual processes and outcomes. In the case of clinical medicine skills assessment, outcomes may include longevity; markers that correlate with longevity, such as blood pressure; perceived health; functional status, such as the ability to walk a given distance; resource consumption, such as length of hospitalization; and productivity measures, among others. A process assessment would typically determine whether queries and interventions were done, with the expectation that these queries and interventions lead to improved outcomes.
Yet another evaluation process attempts to evaluate the value of information that a user collects during a patient simulation, based on decision theoretic principles (Downs S M, Friedman C P, Marasigan F, Gartner G, A decision analytic method for scoring performance on computer-based patient simulations, Proceedings of the AMIA Annual Fall Symposium 1997: 667-71, incorporated herein by reference, and Downs S M, Marasigan F, Abraham V, Wildemuth B, Friedman C P, Scoring performance on computer-based patient simulations: beyond value of information. Proceedings of the AMIA Annual Fall Symposium 1999: 520-4, incorporated herein by reference). The decision analytic scoring mechanisms described thus far, while novel and appealing, have several limitations. First, published methods describe a single simulated patient encounter, not a series of encounters.
Second, published methods penalize a user for gathering data with value for managing disease rather than making a diagnosis. Third, published methods do not address correct interpretation of Query responses. Fourth, published methods do not address selection of Interventions. Fifth, published value of information (VOI) algorithms do not attempt to tabulate the contribution of a Query to several relevant problems in a multi-problem system. Finally, even this simplified form is very complex. Methods that are either simpler or more complex are required for many applications.
The prior art includes physical simulation devices. U.S. Pat. Nos. 5,704,791; 5,800,177; 5,800,178; and 5,882,206, all by Gillio, et al., incorporated herein by reference, teach processes for evaluating a user's skill in executing a given procedure using a physical surgical simulation device. Furthermore, Gillio anticipates combination of the system with other systems, including systems that dynamically modify the physical simulation and user performance criteria. A virtual patient program comprises one such other system. However, the range of options available in managing a virtual patient could require an extensive array of physical simulators. For instance, if management options include a needle biopsy option, a laparoscopic option, and an open surgery option, then each examinee could require access to three physical simulators. Therefore, an evaluation process would benefit from processes that allow the user to view and critique Interventions, rather than perform them.
The prior art further includes evaluation by peers. Peer evaluation comprises recorded descriptions of an individual's skill, made by professional colleagues. In the field of medicine, physicians and nurses are now called upon to critique the skills and professional behavior of physicians certified by the American Board of Internal Medicine. Such a peer evaluation process is not easily standardized. It necessarily requires large numbers of evaluations or wide confidence intervals, meaning low thresholds for passing.
All of the foregoing evaluation processes have significant limitations. Although relatively inexpensive to devise and administer, multiple choice questions are very unlike the tasks that clinicians—or any other experts—actually perform. Checklist assessments can efficiently evaluate performance at a single encounter or in retrospect when respondents share a common experience. However, traditional checklists do not dynamically adjust to evolving situations. Oral examinations may adjust to evolving situations, but standardization is typically compromised and potentially forfeited. Actual outcome and process measures have many desirable features, but the collection of outcome measures is very time consuming and expensive, and does not efficiently convey or reinforce knowledge regarding effective clinical practice, especially for a series of steps in a guideline. Furthermore, it is difficult to use actual measures to assess knowledge of rare events, such as unusual complications or combinations of problems.
New methods for simulating complex systems and theoretical frameworks for evaluating complex systems also demonstrate a need for new evaluation methods. U.S. Pat. No. 6,246,975, Rovinelli et al., incorporated herein by reference, and recent literature (Sumner W, Hagen M D, Rovinelli R. The item generation methodology of an empiric simulation project, Advances in Health Sciences Education 1999; 4(1):25-35), also incorporated herein by reference, demonstrate new methods for simulating a complex system. A medical simulation may produce a virtual patient with a plurality of Health States. Each Health State may evolve independently of the others, while the presence of other Health States may influence the rate of progression of a Health State. One feature of this design is a structure called a Parallel Network comprising a series of Health States representing stages of a disease. The simulation program creates a patient based on any combination of Parallel Networks, with the presenting health state in each Parallel Network specified. Thus, a first Parallel Network that works properly with a second set of other Parallel Networks in a first virtual patient simulation will work again when combined with any third set of Parallel Networks in a second virtual patient simulation.
This design demonstrates a general principle that other scalable complex system simulation programs are likely to reproduce: distinct system problems, including those with important interactions, deserve independent representation in a complex system model. Independent representation greatly facilitates reconfiguration of simulations to portray multiple problems, and does not sacrifice the ability to model interactions between problems.
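The independent representation of problems described above can be sketched as follows. This is an illustrative data structure only, not the implementation of the cited patent or paper: the class names, the `base_rate` and `interactions` fields, the stage names, and the numeric rates are hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelNetwork:
    name: str
    stages: list             # ordered Health State names for one problem
    current: int = 0         # index of the presenting Health State
    base_rate: float = 0.25  # hypothetical progression per simulation step
    progress: float = 0.0

    @property
    def health_state(self):
        return self.stages[self.current]

    def advance(self, multiplier=1.0):
        """Accumulate progress and move to later stages when it reaches 1."""
        self.progress += self.base_rate * multiplier
        while self.progress >= 1.0 and self.current < len(self.stages) - 1:
            self.progress -= 1.0
            self.current += 1


@dataclass
class VirtualPatient:
    networks: dict = field(default_factory=dict)
    # (influencing Health State, affected network name) -> rate multiplier
    interactions: dict = field(default_factory=dict)

    def add(self, network):
        self.networks[network.name] = network

    def step(self):
        # Snapshot the active states so all networks see the same moment.
        active = {net.health_state for net in self.networks.values()}
        for net in self.networks.values():
            multiplier = 1.0
            for (state, target), factor in self.interactions.items():
                if target == net.name and state in active:
                    multiplier *= factor
            net.advance(multiplier)


def renal_network():
    return ParallelNetwork("renal", ["normal", "microalbuminuria", "nephropathy"])


# The same renal network is reused unchanged in two differently configured patients.
rules = {("type 2 diabetes", "renal"): 2.0}  # hypothetical interaction
with_diabetes = VirtualPatient(interactions=rules)
with_diabetes.add(ParallelNetwork("glucose", ["normal", "type 2 diabetes"], current=1))
with_diabetes.add(renal_network())
renal_only = VirtualPatient(interactions=rules)
renal_only.add(renal_network())
for _ in range(2):
    with_diabetes.step()
    renal_only.step()
```

Each network advances on its own, and the interaction table only scales its rate, so a network can be combined with any other set of networks without modification.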
A second feature of complex system models is that a user (or respondent) may seek a wide variety of information about the system at any time. In the aforementioned Rovinelli patent and Sumner paper, information gathering objects are called Queries (or equivalently, Reveals). A simulator may provide a fixed, stochastically selected, or dynamically generated response to a Query. Normal responses to Queries may not imply any need for additional Queries. Abnormal responses to Queries typically imply that further evaluation or management is required, even if the Query result is a false positive. In clinical settings, for instance, a suspicious mammography result demands repeat testing in a short period of time, even though the results are likely to be normal.
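The three kinds of Query response mentioned above (fixed, stochastically selected, and dynamically generated) can be sketched as follows. The `Query` class, its `kind` and `source` fields, the patient dictionary, and the example texts and weights are hypothetical and are not taken from the cited patent or paper.

```python
import random
from dataclasses import dataclass
from typing import Any


@dataclass
class Query:
    name: str
    kind: str    # "fixed", "stochastic", or "dynamic"
    source: Any  # text, list of (text, weight) pairs, or function of the patient

    def respond(self, patient, rng):
        if self.kind == "fixed":
            return self.source
        if self.kind == "stochastic":
            texts = [text for text, _ in self.source]
            weights = [weight for _, weight in self.source]
            return rng.choices(texts, weights=weights, k=1)[0]
        if self.kind == "dynamic":
            return self.source(patient)
        raise ValueError(f"unknown Query kind: {self.kind}")


patient = {"systolic": 150, "diastolic": 90}  # hypothetical patient state
rng = random.Random(0)

family = Query("family history", "fixed", "No family history of cancer.")
mammogram = Query("mammogram", "stochastic", [("normal", 0.9), ("suspicious", 0.1)])
pressure = Query(
    "blood pressure", "dynamic",
    lambda p: f"{p['systolic']}/{p['diastolic']} mmHg",
)

answers = [q.respond(patient, rng) for q in (family, mammogram, pressure)]
```

A stochastic Query can return an abnormal result even for a healthy virtual patient, which is how a simulator might produce the false positives discussed above.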
A third feature of a complex system model is that the user may apply (or in the clinical setting, prescribe) a wide variety of Interventions at any time. An Intervention is the application of some new process to the complex system with the intention of changing the progression of a problem or its manifestations, and is part of a broader concept called Course Of Action (COA) in some publications. Interventions typically result in harmful and beneficial effects in defined periods of time, and the user is normally responsible for monitoring both. Because Interventions have different side effects, users are especially responsible for anticipating the side effects of the Interventions they select.
A previously published mechanism for evaluating physician interaction with a patient simulation describes some general principles for using Bayesian networks (Sumner W, Hagen M D, Rovinelli R. The item generation methodology of an empiric simulation project, Advances in Health Sciences Education 1999; 4(1):25-35, incorporated herein by reference). The published method has significant limitations. First, the published method assumes a complicated object called a “Condition” (or Relational Condition) as a common element of virtual patients and state definitions in Bayesian network nodes. Conditions support a plurality of patient characteristics, including two related to physician queries and interventions; record elaborate age information; and describe time only as patient age. Creating grading criteria in Conditions is difficult because the structure is tailored to a different task, tracking patient data. An object devoted to physician actions and describing time relative to other events is preferable.
Second, Conditions do not describe how users schedule subsequent encounters with a virtual patient. This is a serious omission, because the user may waste resources on too frequent encounters or risk deteriorating health between too infrequent visits.
Third, the publication describes Plan objects as comprehensive management strategies, such as “practice guidelines”. However, it is often useful to divide guidelines into reusable component parts. While vast Plans may be useful sometimes, component parts are more reusable and maintainable, such as a Plan for checking renal function, a Plan for checking liver function, a Plan for monitoring response to therapy, and a Plan for monitoring evidence of a side effect. A Master Plan may list these component parts.
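The division of a guideline into reusable component Plans listed by a Master Plan can be sketched as follows. The `Plan` and `MasterPlan` classes, their fields, and the action names are hypothetical illustrations; the source does not specify how Plans are represented.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    expected_actions: frozenset  # Queries or Interventions the Plan expects

    def evaluate(self, performed):
        """Compare the expected actions with the set of actions performed."""
        return {
            "plan": self.name,
            "done": sorted(self.expected_actions & performed),
            "missed": sorted(self.expected_actions - performed),
        }


@dataclass
class MasterPlan:
    name: str
    components: list  # reusable component Plans

    def evaluate(self, performed):
        return [plan.evaluate(performed) for plan in self.components]


# Component Plans that several Master Plans could share.
renal = Plan("check renal function", frozenset({"serum creatinine"}))
liver = Plan("check liver function", frozenset({"ALT", "AST"}))

new_drug = MasterPlan("start hypothetical drug", [renal, liver])
report = new_drug.evaluate({"serum creatinine", "ALT"})
```

Because each component Plan is evaluated on its own, the same renal or liver Plan can appear in any number of Master Plans without duplication.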
Fourth, the publication describes evaluating Plans “when the simulation concludes,” but other evaluation processes may be simpler. Some highly interactive skill assessment and feedback processes require dynamic evaluation during a simulation. Checklists of actions can be created dynamically at the onset of a simulation and when relevant to events during the simulation. Moreover, the checklists facilitate evaluation. Furthermore, some Plans can or should be evaluated before the conclusion of a simulation. For instance, if a Health State changes during a simulation, the Plan for the old Health State can be evaluated immediately, and the Plan for the new Health State becomes relevant thereafter. Therefore, Plans may provide information to the simulator when the Plan becomes relevant, and interpret information from the simulation when the Plan is evaluated.
Fifth, the publication describes “an automated critique process,” but we have determined that for feedback purposes, a user who is a trainee may wish to view status reports on demand. For instance, the trainee working with a practice test may ask for a status report, and then react to any deficiencies identified in the report. Such a report must be produced on demand.
Sixth, the publication anticipates that “Plans can assemble textual feedback in the same way as queries,” but more and different structure is desirable. A Query response structure typically comprises one block of text to return to a user. However, a description of Plan adherence benefits from an overview of context and multiple descriptors of good and bad ideas done and not done, and is therefore more complex than a Query response.
Seventh, the publication does not teach evaluation of the “treatment to goal” principle, although a robust simulator is capable of supporting such an assessment. “Treatment to goal” states that, over time, the user should continue adjusting treatment until a specific goal is achieved, and then should maintain the goal, making further adjustments only to minimize side effects, simplify the treatment, or reduce costs. In the case of blood pressure, the user managing a virtual patient with elevated blood pressure usually should continue adjusting treatment until blood pressure reaches a normal range. Whether the blood pressure is above, below, or within a normal range at the conclusion of a simulation is a valid evaluation criterion: users should receive credit or praise when the pressure is within range, and penalties or critiques otherwise.
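A "treatment to goal" check of the kind described above can be sketched as follows. The function name `treatment_to_goal`, the returned fields, and the numeric range and readings are hypothetical; the range is illustrative only and is not a clinical recommendation.

```python
def treatment_to_goal(readings, low, high):
    """Evaluate a chronological series of measurements against a goal range.

    Returns the status of the final reading, whether credit is earned, the
    index at which the goal was first reached, and whether it was maintained.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    final = readings[-1]
    if final < low:
        status = "below"
    elif final > high:
        status = "above"
    else:
        status = "within"
    # First reading inside the goal range, or None if the goal was never met.
    first_at_goal = next(
        (i for i, value in enumerate(readings) if low <= value <= high), None
    )
    maintained = first_at_goal is not None and all(
        low <= value <= high for value in readings[first_at_goal:]
    )
    return {
        "status": status,
        "credit": status == "within",
        "first_at_goal": first_at_goal,
        "maintained": maintained,
    }


# Hypothetical systolic pressures (mmHg) at successive virtual visits.
reached = treatment_to_goal([162, 150, 138, 128, 126], low=90, high=139)
lapsed = treatment_to_goal([150, 130, 145], low=90, high=139)
```

The `maintained` field distinguishes a user who reaches the goal and holds it from one who reaches it and then lets it lapse, which the final reading alone would not always reveal.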
Eighth, the publication does not teach any process wherein the user critiques the process of care. Queries and Interventions are often useless or harmful when executed incorrectly. For instance, abrupt questioning may discourage patients from answering completely. Chewing nicotine gum, inhaling before discharging a pulmonary inhaler, and a lapse in sterile technique during surgery are all procedural problems that compromise otherwise useful Interventions. In the surgical example alone, it can be useful to evaluate how the user performs an Intervention. In addition, evaluating the user's ability to critique how a patient or colleague performs a task is a distinct process from evaluating the user's technical skill.
Multimedia evaluations
Educational assessment literature and current assessment practices describe multiple techniques for using multimedia in assessment processes. Examiners may use audio recordings to assess skills such as language comprehension or identification of sounds. Examiners use pictures to test recognition of diagnoses in dermatology and radiology. Examiners use graphical tools, such as polygon enclosures overlaid on still pictures, to test plans for applying radiotherapy to malignant masses. Similarly, architectural examinations use incomplete blueprints to test planning skills by having the examinee add shapes representing additions or changes to the blueprint. Tactile feedback devices offer additional opportunities to evaluate procedure skills. Examiners have used video clips, especially in videodisc training programs, to illustrate disorders of movement, sensation, bodily function, or thought that the user should diagnose or manage.
We have determined that a number of valuable multimedia assessment processes are obvious extensions of existing methods. Specifically, we believe that the following tasks are extensions of the polygon enclosure task: (1) drawing excision boundaries around a lesion; (2) drawing the boundaries of a cast, brace, or bandage; and (3) locating an anatomical point or region for injection, examination, or imaging.
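One way a polygon enclosure task such as drawing an excision boundary might be checked is sketched below. The source does not specify a scoring method; the ray-casting test, the function names `point_in_polygon` and `boundary_encloses`, and the coordinates are assumptions made for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the polygon.

    polygon: list of (x, y) vertices in drawing order.
    Points exactly on an edge are not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def boundary_encloses(drawn_polygon, lesion_points):
    """True if every sampled lesion point lies inside the drawn boundary."""
    return all(point_in_polygon(p, drawn_polygon) for p in lesion_points)


# Hypothetical image coordinates: a user-drawn square and sampled lesion points.
drawn = [(0, 0), (10, 0), (10, 10), (0, 10)]
lesion = [(4, 4), (5, 6), (6, 5)]
enclosed = boundary_encloses(drawn, lesion)
missed = boundary_encloses(drawn, lesion + [(15, 5)])
```

The same point test could serve the third task listed above, checking whether a user-selected injection or examination point falls within an acceptable anatomical region.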
In addition, examiners may use the following media to require users to make diagnoses: (1) computer-generated fly-through simulations of endoscopies of hollow organ systems; and (2) video clips of functional studies, such as radiographic dye studies and radioisotope tracing studies.
We have determined that multimedia enables two additional evaluation techniques. One group of novel techniques comprises at least one prior art task modified by dynamically generated boundary conditions that maintain consistency with a dynamic simulation. A second group of novel techniques comprises user critiques of decisions portrayed on at least one multimedia image.