A. Field of the Invention
The invention relates to replicator networks trainable to create a plurality of basis sets of basis vectors used to reproduce data for confirming identification of the data.
B. Description of the Related Art
Computers have long been programmed to perform specific functions and operations by means of sophisticated computer programming. However, in order to distinguish between data having similar features, human intervention is often required to make decisions about identification, categorization and/or separation of such data. There are no automated analysis systems that can perform sophisticated classification and analysis tasks at levels comparable to those of skilled humans.
A computer is, in essence, a number processing device. In other words, the basic vocabulary computers use to communicate and perform operations consists of numbers, mathematical relationships and mathematically expressed concepts. One portion of a computer's vocabulary includes basis vectors.
Basis vectors play an important role in multimedia and telecommunications. For instance, images transmitted across the Internet and by digital satellite television rely on powerful data compression technologies that encode images and sounds using predetermined sets of basis vectors to reduce the size of the transmitted data. After transmission, the encoded images and sounds are decoded by a receiver using the same predetermined basis vector sets. By using predetermined basis vectors in the transmission and reception of images and sounds, the data can be stored and transmitted in a much more compact form. Typical codecs (coders and decoders) that use basis vector sets include:
JPEG and MPEG codecs: cosine waves form the basis vector sets,
Wavelet codecs: wavelets form the basis vector sets, and
Fractal codecs: fractals form the basis vector sets.
FIG. 1 is a grey-scale rendering of the basis vector set used in the JPEG compression technique. FIG. 1 shows an 8×8 array of basis vectors, each basis vector being a two-dimensional cosine wave having a different frequency and orientation. When an object image is to be transmitted over the Internet, the JPEG coder identifies a combination of these basis vectors that, when put together, define each section of the object image. Identification of the combination of basis vectors is transmitted over the Internet to a receiving computer. The receiving computer reconstructs the image using the basis vectors. In any given image, only a relatively small subset of the basis vectors is needed in order to define the object image. The amount of data transmitted over the Internet is greatly reduced by transmitting identification of the basis vectors rather than a pixel-by-pixel rendering of the object image. The basis vectors in the JPEG technique are the limited vocabulary used by the computer to code and decode information. Similar basis vector sets are used in other types of data transmission, such as MP3 audio files. The smaller the vocabulary is, the more rapid the data transmission. Each data compression technique has its own predetermined, fixed set of basis vectors, and these fixed sets are the vocabulary used by the compression technique. One of the primary purposes of the basis vector sets in data compression is to minimize the amount of data transmitted, thereby speeding up data transmission. For instance, the JPEG data compression technique employs a predetermined and fixed set of basis vectors. Cellular telephone data compression techniques have their own unique basis vectors suitable for compressing audio signals.
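By way of illustration only, the decomposition of an image section into a fixed set of cosine basis vectors can be sketched as follows. The sketch assumes the orthonormal two-dimensional DCT-II form of the 8×8 cosine basis; the function names (`dct_basis`, `encode`, `decode`) and the sample block are hypothetical, and the quantization and entropy coding steps of an actual JPEG codec are omitted.

```python
import math

N = 8  # block size assumed for this sketch (8 x 8 image section)

def dct_basis(u, v, n=N):
    """Return the n x n two-dimensional cosine basis vector for frequency pair (u, v)."""
    def scale(k):
        # Scaling that makes the basis vectors orthonormal (DCT-II convention).
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return [[scale(u) * scale(v)
             * math.cos((2 * x + 1) * u * math.pi / (2 * n))
             * math.cos((2 * y + 1) * v * math.pi / (2 * n))
             for y in range(n)] for x in range(n)]

def inner(a, b):
    """Inner product of two n x n blocks."""
    return sum(a[x][y] * b[x][y] for x in range(len(a)) for y in range(len(a)))

def encode(block, n=N):
    """Coder: one coefficient per basis vector (projection onto that vector)."""
    return [[inner(block, dct_basis(u, v, n)) for v in range(n)] for u in range(n)]

def decode(coeffs, n=N):
    """Decoder: rebuild the block as a weighted sum of the same basis vectors."""
    block = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            basis = dct_basis(u, v, n)
            for x in range(n):
                for y in range(n):
                    block[x][y] += coeffs[u][v] * basis[x][y]
    return block

# Hypothetical image section built from only two basis vectors.
b00, b10 = dct_basis(0, 0), dct_basis(1, 0)
block = [[3 * b00[x][y] + 2 * b10[x][y] for y in range(N)] for x in range(N)]

coeffs = encode(block)
# Only the basis vectors actually present need to be identified to the receiver.
significant = [(u, v) for u in range(N) for v in range(N) if abs(coeffs[u][v]) > 1e-9]
rebuilt = decode(coeffs)
```

In this sketch the 64 pixel values of the section are described by the two entries listed in `significant`, which corresponds to the reduction in transmitted data described above.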
Traditionally, basis vectors have been, in essence, a vocabulary used by computers to more efficiently compress and transmit data. Basis vectors may also be useful for other purposes, such as identification of information by computers. However, if an identification system is provided with a limited vocabulary, then only limited types of data are recognizable. For instance, identification systems have been developed to scan and recognize information in order to sort that information into categories. Such systems are pre-programmed to recognize a limited and usually very specific type of data. Bar code readers are a good example of such systems. The bar code reader is provided with a vocabulary that enables it to distinguish between the various widths of the bars, and of the spaces between the bars, that correlate to numeric values. However, such systems are fixed in that they can only recognize data pre-programmed into their computer systems. Once programmed, their function and vocabulary are fixed.
Another type of pre-programmed recognition system is used in genome-based research and diagnostics. Specifically, sequencers have been developed for analyzing nucleic acid fragments, and for determining the nucleotide sequence of each fragment or the length of each fragment. Both Perkin-Elmer Corporation and Pharmacia Corporation currently manufacture and market such sequencer devices. In order to utilize such devices, a variety of different procedures are used to break the nucleic acid fragment into a variety of smaller portions. These procedures include use of various dyes that label predetermined nucleic acids within the fragment at specific locations in the fragment. Next, the fragments are subjected to gel electrophoresis and exposed to laser light by one of the above mentioned devices, and the color and intensity of light emitted by the dyes are measured. The color and intensity of light are then used to construct an electropherogram of the fragment under analysis.
The color and intensity of light measured by a device indicate the presence of a dye, which in turn indicates the location of the corresponding nucleic acid within the sequence. Such sequencers include scanners that detect fluorescence from several illuminated dyes. For instance, there are dyes that are used to identify the A, G, C and T nucleotide extension reactions. Each dye emits light at a different wavelength when excited by laser light. Thus, all four colors, and therefore all four reactions, can be detected and distinguished in a single gel lane.
Specific software is currently used with the above mentioned sequencer devices to process the scanned electropherograms. The software is pre-programmed to recognize the light pattern emitted by the pre-designated dyes. The vocabulary of the software is limited to enable the system to recognize the specific patterns. Even with pre-designated patterns and logical results, such systems still require human intervention for proper identification of the nucleic acid sequences under study. Such systems yield significant productivity enhancements over manual methods, but further improvements are desirable.
There exists a need for a reliable, expandable and flexible means for identifying and classifying data. In particular, there is a need for more flexible identification systems that can be easily enhanced for identification of new and differing types of data.
One object of the invention is to provide a simple and reliable system for identifying data.
Another object of the present invention is to provide a data classification system with more flexible means for identifying and classifying data.
The invention relates to a method and apparatus that is trainable to identify data. The invention includes inputting several previously identified data sets into a computer and creating within the computer a plurality of unique basis vector sets. Each basis vector set includes a plurality of basis vectors, each basis vector set being in one-to-one correspondence with one of the identified data sets. For each data set, a comparison set of data is created using only the created basis vector sets. A comparison is made between each of the previously identified data sets and the corresponding comparison data set, thereby generating an error signal for each comparison. A determination is then made as to the acceptability of the error. Once the error is determined to be acceptable, the training phase is completed.
Once the basis vector sets have been established as being acceptable, new unidentified data is inputted into the computer. The new data is replicated separately using each individual basis vector set constructed during the training phase. For each basis vector set, a comparison is made between the inputted data and the replicated data. The inputted data is accurately replicated by only one of the basis vector sets, thereby providing a means for classifying the now identified data.
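By way of illustration only, the training and identification phases summarized above can be sketched as follows. The sketch is not the replicator network of the invention: it substitutes a Gram-Schmidt orthonormalization for the trained network as a simple stand-in for creating one basis vector set per identified data set. All function names, the error threshold `max_error`, and the four-element sample data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_basis(samples, tol=1e-9):
    """Create one basis vector set from the samples of one identified data set.
    Assumption: Gram-Schmidt orthonormalization stands in for network training."""
    basis = []
    for sample in samples:
        residual = list(sample)
        for b in basis:
            c = dot(residual, b)
            residual = [r - c * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # keep only directions not already covered
            basis.append([r / norm for r in residual])
    return basis

def replicate(data, basis):
    """Reproduce the data using only the given basis vector set."""
    out = [0.0] * len(data)
    for b in basis:
        c = dot(data, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def replication_error(data, basis):
    """Error signal: distance between the data and its replica."""
    replica = replicate(data, basis)
    return math.sqrt(sum((d - r) ** 2 for d, r in zip(data, replica)))

def train(identified_sets, max_error=1e-6):
    """Training phase: one basis vector set per identified data set, then an error check."""
    bases = {label: build_basis(samples) for label, samples in identified_sets.items()}
    for label, samples in identified_sets.items():
        worst = max(replication_error(s, bases[label]) for s in samples)
        if worst > max_error:
            raise ValueError(f"replication error {worst} unacceptable for {label!r}")
    return bases

def classify(data, bases):
    """Identification phase: replicate with every basis set, pick the smallest error."""
    errors = {label: replication_error(data, basis) for label, basis in bases.items()}
    return min(errors, key=errors.get), errors

# Hypothetical previously identified data sets.
identified_sets = {
    "A": [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]],
    "B": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
}
bases = train(identified_sets)
label, errors = classify([2.0, 3.0, 0.0, 0.0], bases)
```

In this sketch the new data is replicated with negligible error only by the basis vector set labelled "A", so it is classified as belonging to that data set, while the basis vector set labelled "B" yields a large error.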
The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description, when taken in conjunction with the accompanying drawings.