The present invention relates to the analysis of molecular arrays, or biochips, and, in particular, to a method and system for processing a scanned image of a molecular array in order to index the regions of the image that correspond to features of the molecular array and to extract data from indexed positions within the scanned image that correspond to optical or radiometric signals emanating from features of the molecular array.
Molecular arrays are widely used and increasingly important tools for rapid hybridization analysis of sample solutions against hundreds or thousands of precisely ordered and positioned features containing different types of molecules within the molecular arrays. Molecular arrays are normally prepared by synthesizing or attaching a large number of molecular species to a chemically prepared substrate such as silicone, glass, or plastic. Each feature, or element, within the molecular array is defined to be a small, regularly-shaped region on the surface of the substrate. The features are arranged in a regular pattern. Each feature within the molecular array may contain a different molecular species, and the molecular species within a given feature may differ from the molecular species within the remaining features of the molecular array. In one type of hybridization experiment, a sample solution containing radioactively, fluorescently, or chemoluminescently labeled molecules is applied to the surface of the molecular array. Certain of the labeled molecules in the sample solution may specifically bind to, or hybridize with, one or more of the different molecular species that together comprise the molecular array. Following hybridization, the sample solution is removed by washing the surface of the molecular array with a buffer solution, and the molecular array is then analyzed by radiometric or optical methods to determine to which specific features of the molecular array the labeled molecules are bound. Thus, in a single experiment, a solution of labeled molecules can be screened for binding to hundreds or thousands of different molecular species that together comprise the molecular array. Molecular arrays commonly contain oligonucleotides or complementary deoxyribonucleic acid ("cDNA") molecules to which labeled deoxyribonucleic acid ("DNA") and ribonucleic acid ("RNA") molecules bind via sequence-specific hybridization.
Generally, radiometric or optical analysis of the molecular array produces a scanned image consisting of a two-dimensional matrix, or grid, of pixels, each pixel having one or more intensity values corresponding to one or more signals. Scanned images are commonly produced electronically by optical or radiometric scanners, and the resulting two-dimensional matrix of pixels is stored in computer memory or on a non-volatile storage device. Alternatively, analog methods of analysis, such as photography, can be used to produce continuous images of a molecular array that can then be digitized by a scanning device and stored in computer memory or in a computer storage device.
FIG. 1 shows a generalized representation of a molecular array. Disk-shaped features of the molecular array, such as feature 101, are arranged on the surface of the molecular array in rows and columns that together comprise a two-dimensional matrix, or grid. Features in alternative types of molecular arrays may be arranged to cover the surface of the molecular array at higher densities, as, for example, by offsetting the features in adjacent rows to produce a more closely packed arrangement of features. Radiometric or optical analysis of a molecular array, following a hybridization experiment, results in a two-dimensional matrix, or grid, of pixels. FIG. 2 illustrates the two-dimensional grid of pixels in a square area of a scanned image encompassing feature 101 of FIG. 1. In FIG. 2, pixels have intensity values ranging from 0 to 9. Intensity values of all non-zero pixels are shown in FIG. 2 as single digits within the pixel. The non-zero pixels of this scanned image representing feature 101 of FIG. 1 inhabit a roughly disk-shaped region corresponding to the shape of the feature. The pixels in a region surrounding a feature generally have low or 0 intensity values due to an absence of bound signal-producing radioactive, fluorescent, or chemoluminescent label molecules. However, background signals, such as the background signal represented by non-zero pixel 202, may arise from non-specific binding of labeled molecules due to imprecision in preparation of molecular arrays and/or imprecision in the hybridization and washing of molecular arrays, and may also arise from imprecision in optical or radiometric scanning and various other sources of error that may depend on the type of analysis used to produce the scanned image. Additional background signal may be attributed to contaminants in the surface of the molecular array or in the sample solutions to which the molecular array is exposed.
In addition, pixels within the disk-shaped image of a feature, such as pixel 204, may have 0 values or may have intensity values outside the range of expected intensity values for a feature. Thus, scanned images of molecular array features may often show noise and variation and may depart significantly from the idealized scanned image shown in FIG. 1.
FIG. 3 illustrates indexing of a scanned image produced from a molecular array. A set of imaginary horizontal and vertical grid lines, such as horizontal grid line 301, are arranged so that the intersections of vertical and horizontal grid lines correspond with the centers of features. The imaginary grid lines establish a two-dimensional index grid for indexing the features. Thus, for example, feature 302 can be specified by the indices (0,0). For alternative arrangements of features, such as the more closely packed arrangements mentioned above, a slightly more complicated indexing system may be used. For example, feature locations in odd-indexed rows having a particular column index may be understood to be physically offset horizontally from feature locations having the same column index in even-indexed rows. Such horizontal offsets occur, for example, in hexagonal, closest-packed arrays of features.
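The mapping from feature indices to physical positions described above can be illustrated with a minimal sketch. The function name `feature_center`, the parameters `origin` and `pitch`, and the choice of a half-pitch horizontal offset with a row spacing of pitch times sqrt(3)/2 for the hexagonal case are all assumptions made for illustration; they are not taken from the described method.

```python
import math

def feature_center(row, col, origin=(0.0, 0.0), pitch=1.0, hex_packed=False):
    """Return the (x, y) center of the feature at grid indices (row, col).

    Hypothetical helper: `origin` is the center of feature (0, 0) and
    `pitch` is the center-to-center spacing of neighboring features.
    """
    x0, y0 = origin
    x = x0 + col * pitch
    y = y0 + row * pitch
    if hex_packed:
        # Assumed hexagonal layout: odd-indexed rows are shifted by half a
        # pitch horizontally, and rows sit closer together vertically.
        if row % 2 == 1:
            x += pitch / 2.0
        y = y0 + row * pitch * math.sqrt(3.0) / 2.0
    return (x, y)
```

With a rectangular grid, indices (1, 2) at a pitch of 10 map to the point (20, 10); with `hex_packed=True`, every feature in an odd-indexed row is displaced by half a pitch relative to the feature with the same column index in an even-indexed row.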
In order to interpret the scanned image resulting from optical or radiometric analysis of a molecular array, the scanned image needs to be processed to: (1) index the positions of features within the scanned image; (2) extract data from the features and determine the magnitudes of background signals; (3) compute, for each signal, background-subtracted magnitudes for each feature; (4) normalize signals produced from different types of analysis, as, for example, dye normalization of optical scans conducted at different light wavelengths to normalize different response curves produced by chromophores at different wavelengths; and (5) determine the ratios of background-subtracted and normalized signals for each feature while also determining a statistical measure of the variability of the ratios or confidence intervals related to the distribution of the signal ratios about a mean signal ratio value. These various steps in the processing of scanned images produced as a result of optical or radiometric analysis of molecular arrays together comprise an overall process called feature extraction.
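Steps (3) through (5) above can be sketched for a single feature as follows. This is only an illustration under stated assumptions: the helper names `net_signal` and `normalized_ratio` are hypothetical, background is estimated as a simple mean of local background pixels, and normalization is reduced to one multiplicative scale factor per signal. The text does not prescribe any of these choices.

```python
from statistics import mean

def net_signal(feature_pixels, background_pixels):
    """Step (3): mean feature intensity minus mean local background intensity."""
    return mean(feature_pixels) - mean(background_pixels)

def normalized_ratio(red_net, green_net, red_scale=1.0, green_scale=1.0):
    """Steps (4) and (5): scale each signal, then take the ratio.

    The per-signal scale factors are an assumed, simplified stand-in for
    dye normalization. Returns None when the denominator signal is not
    positive, because the ratio is then undefined.
    """
    green = green_net * green_scale
    if green <= 0:
        return None
    return (red_net * red_scale) / green
```

For example, feature pixels `[8, 9, 7]` with local background pixels `[1, 0, 2]` give a background-subtracted signal of 7. A fuller treatment would also report a variability measure for the ratio, as step (5) requires.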
Designers, manufacturers, and users of molecular arrays have recognized a need for automated feature extraction. Automated feature extraction, like any other automated technique, can produce enormous savings in the time and cost of using molecular arrays for chemical and biological analysis. Automated feature extraction can also eliminate inconsistencies caused by user error and can greatly increase the reproducibility and objectivity of feature extraction.
One embodiment of the present invention comprises a method and system for automated feature extraction from scanned images produced by optical, radiometric, or other types of analysis of molecular arrays. First, horizontal and vertical projections of pixel values, called row and column vectors, are computationally produced from the scanned image. The row and column vectors are analyzed to determine the positions of peaks, and the positions of the first and last peaks in the row and column vectors are used to estimate the positions of the corner features within the scanned image. Typically, bright control features, i.e. features designed to hybridize to labeled sample molecules of any sample solution to which a molecular array is exposed, are placed on the border of the molecular array to facilitate this process. When necessary, row and column vectors can be calculated over a range of rotations of a two-dimensional, orthogonal coordinate system in order to select the most favorable rotation angle at which to fix the coordinate system. Analysis of regions of the scanned image representing the corner features can be used to more exactly locate the positions of the corner features. Then, using the established positions of the corner features, an initial coordinate system is computationally established for the scanned image. Using the initial coordinate system, the centroids of features producing strong signals, or, in other words, pixels having high signal-to-noise ratios and located close to expected positions in the scanned image, are determined, and a regression analysis is used to refine the coordinate system to best correspond to the determined positions of the strong features. The refined coordinate system is employed to locate the positions of weak features and the positions of the background regions local to each feature. 
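The first stage of this process, projecting pixel values into row and column vectors and using the first and last peaks to estimate the corner features, can be sketched as below. The function names, the use of simple sums as projections, and the fixed `threshold` for peak detection are assumptions made for illustration; rotation search, centroid refinement, and the regression step are omitted.

```python
def projections(image):
    """Return (row_vector, column_vector) as sums of pixel values.

    `image` is a list of equal-length rows of pixel intensities.
    """
    row_vector = [sum(row) for row in image]
    column_vector = [sum(column) for column in zip(*image)]
    return row_vector, column_vector

def peak_positions(vector, threshold):
    """Return indices of local maxima in `vector` that exceed `threshold`."""
    peaks = []
    for i, value in enumerate(vector):
        left = vector[i - 1] if i > 0 else float("-inf")
        right = vector[i + 1] if i + 1 < len(vector) else float("-inf")
        if value > threshold and value >= left and value > right:
            peaks.append(i)
    return peaks

def estimate_corners(image, threshold):
    """Estimate (row, col) positions of the four corner features.

    Uses the first and last peaks of the row and column vectors.
    Returns None if no peaks are found in either direction.
    """
    row_vector, column_vector = projections(image)
    row_peaks = peak_positions(row_vector, threshold)
    column_peaks = peak_positions(column_vector, threshold)
    if not row_peaks or not column_peaks:
        return None
    top, bottom = row_peaks[0], row_peaks[-1]
    left, right = column_peaks[0], column_peaks[-1]
    return ((top, left), (top, right), (bottom, left), (bottom, right))
```

Bright control features on the border of the array make the first and last peaks stand out in both vectors, which is why a simple threshold can be adequate in a sketch like this one; real scanned images would likely need smoothing and a data-driven threshold.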
Next, a process is used to analyze various different signals generated by different analytical methods in order to select the most reliable portions of each feature and the local background surrounding the feature for subsequent signal extraction and signal variability determinations. For example, the fluorescence of hybridized labeled molecules may be measured at green light wavelengths and at red light wavelengths, with the intensities produced at each position of the surface of the molecular array at red and green wavelengths corresponding to two different signals. Finally, signal data and variability of signal data are extracted from the reliable regions of each feature and each local background region of the scanned image.