It is well known in the computer-aided detection (CAD) research community that proper tuning and training of a pattern recognition code (such as the CAD code referred to in this invention) requires a large database of training cases. Anil K. Jain and Richard C. Dubes, “Algorithms for Clustering Data”, Prentice Hall, March 1988, contains a discussion of the number of training examples required as a function of degrees of freedom. Some review papers that describe the concepts of feature extraction and classification by neural networks in a CAD application are: Matt Kupinski et al., “Computerized detection of mammographic lesions: Performance of artificial neural network with enhanced feature extraction”, SPIE Vol. 2434, p. 598, and Maryellen Giger and Heber MacMahon, “Image Processing and Computer-aided Diagnosis”, RSNA Vol. 34, No. 3, May 1996.
A large database is needed for two reasons. First, abnormalities such as lesions in mammograms have a wide spectrum of differing appearances, and the training database should contain examples of all types. Second, these codes typically contain both rule-based criteria and neural network classifiers to reduce the number of false positives. The proper values of all parameters used in these rules and classifiers depend on the code having seen many more training examples than there are parameters, or features. Otherwise the code is prone to “overtraining”, or “over-optimizing”, which is the tendency of the code to memorize its training data.
A rule of thumb is that one should have at least 10 times as many training cases as there are degrees of freedom in the decision-making code. Another conservative practice is to separate the training database from the test database, and to maintain absolute independence between the two in order to avoid biased performance results. In a study performed by Burhenne et al. (Burhenne et al., “Potential Contribution of Computer-aided Detection to the Sensitivity of Screening Mammography”, Radiology, May 2000, pp. 554-562), the performance of a particular CAD code was tested on an independent database of 1083 breast cancers. This particular code was “tuned”, or “trained”, on a “training database” of approximately 1500 cancer cases.
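The two practices above can be illustrated with a minimal sketch. It is not part of any particular CAD code; the function names, the default ratio of 10, the test fraction, and the case identifiers are all hypothetical and chosen only for illustration.

```python
import random


def has_enough_training_cases(n_cases, n_degrees_of_freedom, ratio=10):
    """Rule of thumb from the text: at least `ratio` training cases
    per degree of freedom in the decision-making code."""
    return n_cases >= ratio * n_degrees_of_freedom


def split_independent(case_ids, test_fraction=0.4, seed=0):
    """Split case identifiers into disjoint training and test sets.

    Duplicated identifiers are collapsed first, so that no case can
    appear on both sides of the split (hypothetical helper).
    """
    ids = sorted(set(case_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test = set(ids[:n_test])
    train = set(ids[n_test:])
    return train, test


if __name__ == "__main__":
    # Under the 10x rule of thumb, a training database of about 1500 cases
    # would support a code with at most 150 degrees of freedom.
    print(has_enough_training_cases(1500, 150))
    print(has_enough_training_cases(1500, 151))

    train, test = split_independent(range(100))
    print(len(train), len(test), train.isdisjoint(test))
```

Splitting by case identifier rather than by individual image or mark is a design choice in this sketch: it keeps all data from one case on the same side of the split, which is one way to preserve the independence described above.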
FIG. 10 shows an example of the use of a rule to separate true lesions from false positives. In this example, one feature is plotted versus another feature in a scatter plot. The true lesions 1020 appear on this scatter plot as dark x's, the false positives as light dots 1030. It is apparent that the true lesions tend to cluster in a band near the center of the scatter plot, while the false positives are mostly in a vertical cluster below the true positives. By using the dark dashed line 1010 as a “decision surface”, and accepting only the marks above the line, most of the false positives will be eliminated while most of the true positives are retained. It can be appreciated that the more “training”, or “example”, cases one has, the better the line or decision surface will be placed, i.e., the closer to optimal the separation of the true lesions from the false positives will be. In practice, a typical CAD code may have dozens of such rules, as well as a classifier to allow decisions to be made in a complex multi-featured decision space.
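A rule of the kind shown in FIG. 10 can be sketched as a straight line in a two-feature plane, with only the marks on or above the line being accepted. The sketch below is an illustration under stated assumptions, not the rule used by any actual CAD code: the feature values are made up, the slope is fixed by hand, and the line is placed by a simple hypothetical criterion (lose at most a given number of training lesions).

```python
def offset_from_line(mark, slope):
    """Vertical offset of a mark (feature_x, feature_y) from a line
    of the given slope that passes through the origin."""
    x, y = mark
    return y - slope * x


def choose_intercept(true_lesions, slope, max_lesions_lost=1):
    """Place the line as high as possible while rejecting at most
    `max_lesions_lost` of the training lesions (hypothetical criterion)."""
    offsets = sorted(offset_from_line(m, slope) for m in true_lesions)
    return offsets[max_lesions_lost]


def apply_rule(marks, slope, intercept):
    """Keep only the marks lying on or above the decision line."""
    return [m for m in marks if offset_from_line(m, slope) >= intercept]


# Made-up training data: (feature 1, feature 2) for each mark.
TRUE_LESIONS = [(1.0, 5.0), (2.0, 5.5), (3.0, 6.0), (4.0, 5.8), (5.0, 6.5)]
FALSE_POSITIVES = [(1.0, 1.0), (2.0, 2.5), (3.0, 0.5), (3.5, 5.9), (4.0, 3.0)]

if __name__ == "__main__":
    slope = 0.25  # assumed; a real code would tune this as well
    intercept = choose_intercept(TRUE_LESIONS, slope)
    kept_lesions = apply_rule(TRUE_LESIONS, slope, intercept)
    kept_false_positives = apply_rule(FALSE_POSITIVES, slope, intercept)
    print(len(kept_lesions), "of", len(TRUE_LESIONS), "lesions retained")
    print(len(kept_false_positives), "of", len(FALSE_POSITIVES),
          "false positives retained")
```

With only five lesions in the training set, moving a single point changes where the line falls, which is the sense in which more example cases lead to a better-placed decision surface.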
Recently a new mammographic x-ray detector has been approved by the FDA: the Senograph 2000 produced by GE. This product will soon be followed by other similar digital detectors produced by such companies as Lorad, Fisher, Siemens, and Fuji. In the field of chest radiography, digital detectors have been available for some time. There is, however, a critical barrier to applying CAD codes to the medical images obtained by these new detectors: the devices have existed for such a short time that the number of cancer cases acquired and archived is not yet sufficient to train or tune these codes. The number of cancers detected with these digital detectors is not yet sufficient even to test these codes with great confidence. Using CAD codes on direct digital medical images with any confidence therefore requires a method of obtaining the parameters and feature values needed by the code from a source other than the small number of existing cases.