Mass spectrometry is an important analytical method for biopolymers. Coupled with protein database searching, it is a crucial assay in proteomics studies. Conventionally, a protein is identified from a mass spectrum using software such as SEQUEST (Eng et al., J. Am. Soc. Mass Spectrom. 5:976-989, 1994; Thermo Electron Corp., USA), Mascot (Perkins et al., Electrophoresis, 20:3551-3567, 1999; Matrix Science Ltd., USA, www<dot>matrixscience<dot>com), Sonar (Field, H. I. et al., Proteomics, 2:36-47, 2002; bioinformatics<dot>genomicsolutions<dot>com/) and X!Tandem (Craig et al., Bioinformatics, 20:1466-1467, 2004; Proteome Software Inc., USA). In these methods, each candidate protein is extracted from a database of amino acid sequences, its mass spectrum pattern is predicted, and the predicted pattern is compared with the experimentally measured mass spectrum. Search algorithms used in such protein-searching software include the MOWSE algorithm (Pappin et al., Curr. Biol. 3:327-332, 1993) and the SEQUEST algorithm (Eng et al., J. Am. Soc. Mass Spectrom. 5:976-989, 1994).
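The predict-and-compare procedure described above can be illustrated with a minimal sketch. It is not the MOWSE or SEQUEST algorithm: it assumes a simple tryptic digest with no missed cleavages, compares peptide masses only (not fragment spectra), and scores each protein by counting matched masses. The function names, the tolerance value, and the toy sequences are hypothetical and chosen only for illustration.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def digest_trypsin(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Predicted monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def score_protein(sequence, observed_masses, tol=0.02):
    """Count observed masses that match any predicted peptide mass within tol (Da)."""
    predicted = [peptide_mass(p) for p in digest_trypsin(sequence)]
    return sum(
        any(abs(obs - pred) <= tol for pred in predicted)
        for obs in observed_masses
    )


def search(database, observed_masses, tol=0.02):
    """Score every database entry and return (name, score) pairs, best first."""
    scores = {
        name: score_protein(seq, observed_masses, tol)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy database; the observed masses are simulated from prot_a.
    toy_db = {"prot_a": "MKWVTFISLLRAYSK", "prot_b": "GASPVTCLNDQEHFRK"}
    observed = [peptide_mass(p) for p in digest_trypsin(toy_db["prot_a"])]
    for name, score in search(toy_db, observed):
        print(name, score)
```

Because every database entry is digested and scored, the cost of a search in this sketch grows in proportion to the number of sequences in the database.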
With the rapid accumulation of protein-related information, protein sequence databases have grown tremendously. If a database contains every known sequence, search efficiency drops to the point that the search is of no practical use. To overcome this problem, a more refined sequence database such as UniProtKB/SwissProt (Bairoch et al., Nucleic Acids Res. 33:D154-159, 2005; www<dot>ebi<dot>uniprot<dot>org/uniprot-srv/) or the IPI database (Kersey et al., Proteomics 4(7):1985-1988, 2004; www<dot>ebi<dot>ac<dot>uk/IPI/IPIhelp<dot>html) can be used for database searching with a mass spectrum. These databases contain representative protein sequences selected from known protein databases, with similar sequences eliminated. They contain only 20% of the proteins listed in the NCBI nr database of the National Center for Biotechnology Information, USA, which is an integrated protein sequence database, so they can be searched more efficiently with a mass spectrum.
Although these refined sequence databases make protein searching easier and faster by keeping only representative sequences, they may fail to identify proteins whose sequences are similar, but not identical, to a database entry. A general sequence-searching program can identify a target sequence as long as a similar sequence exists in the database. Mass spectrometry, however, identifies a peptide by measuring its molecular weight. Therefore, if the sequences differ greatly, the search result is unreliable. Thus, for practical use the database has to be small, but for accurate analysis it has to include a variety of sequences. In particular, when a modification at a specific region of a protein is to be detected from its molecular weight, whether the corresponding sequence exists in the database is critical to search accuracy. For this reason, algorithms that search for similar sequences using a mass spectrum have been proposed (Creasy et al., Proteomics 2(10):1426-34, 2002; Kayser et al., J. Biomol. Tech. 15(4):285-95, 2004). However, these algorithms take a long time to run, and the reliability of their results is questionable.
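The dependence of mass-based identification on the presence of the exact sequence can be shown with a small worked example. The sketch below assumes two hypothetical peptides that differ by one residue (Y versus F), a single optional phosphorylation (+79.96633 Da), and an arbitrary tolerance of 0.02 Da. It does not implement the similar-sequence algorithms cited above; it only shows why a database lacking the variant sequence returns no match.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PHOSPHO = 79.96633  # mass added by one phosphorylation (HPO3)


def peptide_mass(peptide, n_phospho=0):
    """Predicted monoisotopic mass with n_phospho phosphate groups attached."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + n_phospho * PHOSPHO


def find_matches(observed_mass, peptides, tol=0.02, max_phospho=1):
    """Return (peptide, n_phospho) pairs whose predicted mass fits the observation."""
    hits = []
    for pep in peptides:
        for n in range(max_phospho + 1):
            if abs(peptide_mass(pep, n) - observed_mass) <= tol:
                hits.append((pep, n))
    return hits


# Hypothetical data: the sample contains a phosphorylated variant peptide.
reduced_db = ["AYSK"]          # representative sequence only
full_db = ["AYSK", "AFSK"]     # representative sequence plus the variant
observed = peptide_mass("AFSK", n_phospho=1)

hits_reduced = find_matches(observed, reduced_db)  # variant absent: nothing fits
hits_full = find_matches(observed, full_db)        # variant present: modification found
```

In this example the single Y-to-F difference shifts the peptide mass by about 16 Da, far outside the tolerance, so the modification can be assigned only when the variant sequence itself is in the database.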