A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark files or records, but otherwise reserves all copyright rights whatsoever.
The present application includes and incorporates by reference a computer sequence listing appendix on a single compact disc and its duplicate. Each compact disc was created on Oct. 23, 2003 and includes the files Appendix A.doc (96 kB) and Appendix B.doc (160 kB).
The invention relates to a system, methods and products for managing, finding, and/or displaying biomolecular interactions.
Technological advances and mounting interest have pushed proteins into the scientific spotlight. The growing field of proteomics encompasses the study of proteins, both in structure and in function, contained in a proteome, the protein equivalent of a genome. Because of increased interest and technique automation (Mendelsohn et al., 1999), the rate of proteomic data production is growing much as that of genomics did a decade ago. For example, mass spectrometers, gene chips, and two-hybrid systems have made cellular signaling pathway mapping faster and easier, and consequently these are becoming large producers of data. Protein-protein interaction and more general biomolecule-biomolecule (protein-DNA, protein-RNA, protein-small molecule, etc.) interaction information is being generated and recorded in the literature. Lessons from the genomic era have taught us that large amounts of related data recorded in scientific journals soon become unmanageable. A well-designed common data specification based on a model of the biological information is therefore required to describe and store biomolecular interaction data.
The present inventors have designed a data specification for the storage and management of biomolecular interaction and biochemical pathway data that possesses the following properties:
1. It describes the full complexity of the biological data, from simple binary interactions to large-scale molecular complexes and networks of pathways and interactions. It stores protein, DNA, RNA, and other molecules in full atomic detail, since character based sequence abstractions of biomolecules often miss important chemical features, such as methylation on DNA. This allows as much data as possible to be stored for scientific use in electronic form rather than in print.
2. It is easily computable. A computer can easily read, write, and traverse the specification. This facilitates maintenance of a database of such information, creation of advanced queries and querying tools and development of computer programs that use the information for data visualization, data mining, and visual data entry.
3. It is platform and database independent. Tools written for one platform can read data created on another platform directly. It handles the data structure without modification as well.
4. It is succinct and easy for humans to understand. Field-to-data correspondence is very clear and a human-readable format of the specification is available.
The data structure was designed for a database referred to herein as “BIND” (Biomolecular Interaction Network Database). The data structure is written in a data specification language called Abstract Syntax Notation One (ASN.1, also known as X.208 or ISO-8824). The U.S. National Center for Biotechnology Information (NCBI) uses ASN.1 to describe and store all of its biological and publication data and all of GenBank, MMDB and PubMed (Ostell and Kans, 1998). BIND inherits the NCBI data model, which provides a solid foundation for the BIND data specification through the use of mature NCBI data types that describe sequence, 3D structure, and publication reference information.
Although the specification is written in ASN.1, it is not restricted to this syntax. The data structures can be readily translated to other common data specification languages such as CORBA IDL (Object Management Group, 1996) or XML if the need arises. Aside from ASN.1, no other biological data specification is sufficiently rich in mature data types to use as a foundation for BIND without first building and testing those base data types.
The BIND data specification represents complex cellular pathway information efficiently in a computer. BIND defines three main data types: interactions, molecular complexes, and pathways. Each of these objects is composed of various component and descriptor objects that are either defined in the specification proper or inherited from the NCBI ASN.1 data specifications. For example, an interaction record contains, among other data objects, two BIND-objects. A BIND-object describes a molecule of any type and is itself defined using simpler sub-objects. Normally, a BIND-object describing a biopolymer sequence will store a simple link to a sequence database, such as GenBank (Benson et al., 1999). If, however, the sequence is not present in a public database, it can be fully represented using an embedded NCBI-Bioseq object. The NCBI-Bioseq object is how NCBI stores all of the sequences in GenBank and is a mature data structure. BIND also inherits the NCBI taxonomy model (also used and supported by EMBL, DDBJ and Swiss-Prot) and data, via an inherited NCBI-BioSource, and is designed so that interactions can be both inter- and intra-organismal. Sequence, structure, publication, taxonomy and small molecule databases provide a strong foundation for BIND.
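By way of illustration only, the relationship between an interaction record and its two BIND-objects may be sketched in Python as follows. This is a minimal sketch and not the ASN.1 specification itself: the class names, field names, and the `sequence_source` and `is_inter_organismal` helpers are hypothetical and were chosen only to mirror the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatabaseLink:
    """Hypothetical pointer to a record in a sequence database such as GenBank."""
    database: str
    accession: str


@dataclass(frozen=True)
class BindObject:
    """Illustrative stand-in for a BIND-object describing a molecule of any type."""
    name: str
    molecule_type: str                        # e.g. "protein", "dna", "rna", "small-molecule"
    link: Optional[DatabaseLink] = None       # the normal case: a simple database link
    embedded_sequence: Optional[str] = None   # used when no public record exists
    taxonomy_id: Optional[int] = None         # organism identifier (assumed integer)

    def sequence_source(self) -> str:
        """Report where the sequence comes from, preferring the database link."""
        if self.link is not None:
            return "link"
        if self.embedded_sequence is not None:
            return "embedded"
        return "none"


@dataclass(frozen=True)
class Interaction:
    """Illustrative interaction record holding two BIND-objects."""
    interaction_id: int
    object_a: BindObject
    object_b: BindObject

    def is_inter_organismal(self) -> bool:
        """True only when both organisms are known and they differ."""
        a, b = self.object_a.taxonomy_id, self.object_b.taxonomy_id
        return a is not None and b is not None and a != b
```

In this sketch a molecule without a public database entry carries its own sequence, which corresponds to the embedded NCBI-Bioseq case described above; the real specification stores far richer descriptors.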
Broadly stated, the present invention contemplates a system for electronically managing, finding, and/or visualizing biomolecular interactions comprising a computer system including at least one computer receiving data on biomolecular interactions from a plurality of providers and processing such data to create and maintain images and/or text defining biomolecular interactions, said computer system, in response to data requests, creating and transmitting to a plurality of end-users, the images and/or text defining biomolecular interactions.
In an embodiment, a system for electronically managing, finding, and/or visualizing biomolecular interactions is provided comprising:
(a) a maintenance entity for receiving data on biomolecular interactions from a plurality of providers and means for receiving and processing such data to create and maintain images and/or text defining biomolecular interactions; and
(b) one or more computer systems maintained by the maintenance entity and having means for creating and transmitting to a plurality of end-users the images and/or text defining biomolecular interactions.
The system is useful in managing, finding, and/or displaying biomolecular interactions including interactions involving proteins, nucleic acids (RNA, DNA), and ligands, molecular complexes, and signaling pathways. The interactions are defined both at the molecular and atomic levels and in particular they may be defined by chemical graphs.
The invention also provides a method for displaying on a computer screen information concerning biomolecular interactions comprising retrieving an image and/or text defining a biomolecular interaction from a system of the invention.
The present invention also provides a data structure stored in the memory of a computer, the data structure having a plurality of records, each record containing a biomolecular interaction and information relating to the biomolecular interaction. In an embodiment the biomolecular interaction is identified by chemical graphs. The information in the data structure may be accessible by using indices which may represent selections of information from the chemical graphs.
The term “record” used herein generally refers to a row in a database table. Each record contains one or more fields or attributes. A given record may be uniquely specified by one or a combination of fields or attributes known as the record's primary key. A record of a biomolecular interaction as used herein is generally a record containing information identifying the biomolecular interaction as a chemical graph and a plurality of other attributes with information pertaining to the biomolecular interaction (e.g. information on the cellular place of interaction, experimental conditions used to observe the interaction, conserved sequence comment of molecules in the interaction if they are biological sequences, information on molecules in the interaction, description of metabolic and signaling pathways, cell cycle stages in which an interaction is involved, locations of binding sites on the molecules in an interaction, chemical actions mediated by the interactions, and chemical states of the molecules in the interaction).
The term “chemical graph” refers to a connectivity graph of all the atoms and bonds in a molecule in a biomolecular interaction. The graph may include three-dimensional coordinates.
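A chemical graph of this kind may be sketched, purely for illustration, as a list of atoms with optional three-dimensional coordinates and a set of bonds between atom indices. The names `Atom`, `ChemicalGraph`, `add_atom`, `add_bond`, and `neighbors` are hypothetical, and the integer bond order is an assumption not stated in the definition above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coords = Tuple[float, float, float]


@dataclass
class Atom:
    element: str                      # element symbol, e.g. "C", "O"
    coords: Optional[Coords] = None   # optional 3D coordinates


@dataclass
class ChemicalGraph:
    """Connectivity graph: atoms are nodes, bonds are undirected edges."""
    atoms: List[Atom] = field(default_factory=list)
    # Each bond is keyed by an ordered (low, high) pair of atom indices.
    bonds: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def add_atom(self, element: str, coords: Optional[Coords] = None) -> int:
        """Append an atom and return its index."""
        self.atoms.append(Atom(element, coords))
        return len(self.atoms) - 1

    def add_bond(self, i: int, j: int, order: int = 1) -> None:
        """Connect two distinct existing atoms with a bond of the given order."""
        n = len(self.atoms)
        if i == j or not (0 <= i < n) or not (0 <= j < n):
            raise ValueError("bond must join two distinct existing atoms")
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i: int) -> List[int]:
        """Return the sorted indices of atoms bonded to atom i."""
        return sorted(b if a == i else a for (a, b) in self.bonds if i in (a, b))
```

For a water molecule, for example, one oxygen atom would be added followed by two hydrogen atoms, each bonded to the oxygen, so that `neighbors` on the oxygen index returns both hydrogen indices.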
The invention also provides a method for storing a representation of a biomolecular interaction in a memory of a computer system, the method executed on a computer system and comprising the steps of:
(a) identifying a chemical graph of a biomolecular interaction; and
(b) storing a record in a data structure of the invention.
The invention further provides a method for storing a representation of a biomolecular interaction in a memory of a computer system, the method executed on a computer system and comprising the steps of:
(a) identifying a chemical graph of a biomolecular interaction;
(b) generating one or more indices from information in the chemical graph; and
(c) storing a record in a data structure of the invention.
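Steps (a) to (c) above may be sketched as follows. The sketch assumes, for illustration only, that each chemical graph is a dictionary with an `"atoms"` list of element symbols and that the index generated from the graph is a simple element-composition string; the specification does not prescribe this particular index. `composition_index` and `InteractionStore` are hypothetical names.

```python
from collections import Counter, defaultdict


def composition_index(elements):
    """Derive an index key from a chemical graph's atoms, e.g. ['O','H','H'] -> 'H2O1'."""
    counts = Counter(elements)
    return "".join(f"{el}{counts[el]}" for el in sorted(counts))


class InteractionStore:
    """Toy in-memory data structure holding interaction records plus one index."""

    def __init__(self):
        self.records = {}                       # record_id -> record dict
        self.by_composition = defaultdict(set)  # index key -> set of record ids

    def store(self, record_id, graph_a, graph_b, **attributes):
        # Step (b): generate indices from information in the chemical graphs.
        keys = (composition_index(graph_a["atoms"]),
                composition_index(graph_b["atoms"]))
        # Step (c): store the record together with its other attributes.
        self.records[record_id] = {"graphs": (graph_a, graph_b),
                                   "index_keys": keys, **attributes}
        for key in keys:
            self.by_composition[key].add(record_id)
        return keys

    def find_by_composition(self, elements):
        """Look up record ids through the index rather than scanning all records."""
        key = composition_index(elements)
        return sorted(self.by_composition.get(key, set()))
```

The design choice illustrated here is that the index is computed once at storage time, so that later lookups need not traverse every chemical graph.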
The invention still further provides a method for identifying a biomolecular interaction that is similar to a reference biomolecular interaction, the method executed on a computer and comprising the steps of:
(a) conducting a similarity search for each molecule in a test biomolecular interaction;
(b) screening the results of the similarity search preferably by selected taxonomy;
(c) assembling a putative biomolecular interaction to create a test record;
(d) accessing one or more records in a data structure stored in the memory, the data structure having a plurality of records, each of the records containing a reference biomolecular interaction and information relating to the reference biomolecular interaction; and
(e) matching the test record with each record in the data structure to produce a matching record containing a reference biomolecular interaction matching the test biomolecular interaction.
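The steps of this method may be sketched as shown below. The sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for a real similarity search tool, which is an assumption made only to keep the example self-contained; the dictionary layouts, the `threshold` parameter, and the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def similar_molecules(query_seq, known_molecules, threshold=0.8):
    """Step (a): return known molecules whose sequence resembles the query."""
    return [m for m in known_molecules
            if SequenceMatcher(None, query_seq, m["sequence"]).ratio() >= threshold]


def find_matching_interactions(test_pair, known_molecules, reference_records,
                               taxon=None, threshold=0.8):
    """Return ids of reference interactions matching a test pair of sequences."""
    hits_per_molecule = []
    for seq in test_pair:
        hits = similar_molecules(seq, known_molecules, threshold)    # step (a)
        if taxon is not None:                                        # step (b)
            hits = [h for h in hits if h["taxon"] == taxon]
        hits_per_molecule.append(hits)

    matches = set()
    for a, b in product(*hits_per_molecule):
        test_record = frozenset((a["id"], b["id"]))                  # step (c)
        for record in reference_records:                             # step (d)
            if frozenset(record["molecules"]) == test_record:        # step (e)
                matches.add(record["id"])
    return sorted(matches)
```

Each putative pair is compared as an unordered set of molecule identifiers, so a reference interaction recorded as (A, B) also matches a test interaction assembled as (B, A).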
The similarity searches may be based for example on sequence similarity or identity, or similarities in molecular weights, pIs, mass fingerprinting data or mass spectrometric data, fragment-ion tag data, peptide masses from enzymatic digestion, fragment ion masses, isotope patterns, and sequence tag data. Standard tools available in the art for similarity searching and screening can be used. For example, the following tools may be used: BLAST, BioScan, Fasta3, PropSearch, SAMBA, SAWTED, Scanps, FDF, ExPASy Proteomics Tools, TagIdent, PeptIdent, ProteinProspector, MultiIdent, PeptideSearch, PROWL, Mascot, BioSCAN, Pro.
Another aspect of the invention provides a computer system for storing a representation of one or more biomolecular interactions in a memory in the computer system and for comparing one or more reference biomolecular interactions to a test biomolecular interaction, comprising:
(a) a database means stored in the memory representing one or more biomolecular interactions; each of the biomolecular interactions represented by a chemical graph; and
(b) a data structure means for storing a plurality of record means, each record means containing chemical graphs of the test biomolecular interaction.
The invention also provides a computer system comprising memory means, storage means, program means, and stored means for building virtual-models of biomolecular interactions in the computer system comprising:
(a) one or more libraries of reference biomolecular interactions that comprise any number of attributes or components of the biomolecular interaction which values are either being used to describe characteristics of the types of biomolecular interactions in the computer system, or values or data structures used by the program at runtime, or are to be used to more specifically describe characteristics of individual components of the biomolecular interaction that each instance of a type of biomolecular interaction is to represent, or characteristics of each instance of biomolecular interaction in the computer system; wherein the attributes have values of any type in the computer system or in a network accessible by the computer system;
(b) means for manipulating the biomolecular interaction by domain experts or program means comprising visual means for making the biomolecular interactions available through menus or palettes or programmatic means; and
(c) constructor means to create new instances from the definitions of the biomolecular interactions, and means to establish directional output-input links between complementary instances of the biomolecular interactions directly or through components.
Also provided is a computer system comprising:
(a) a database having a plurality of records, wherein each record contains a reference biomolecular interaction defined by a chemical graph and descriptive information from an external database which information correlates the biomolecular interactions to records in the external database; and
(b) a user interface allowing a user to selectively view information regarding a biomolecular interaction.
In an embodiment, a computer system is provided comprising:
(a) a database having a plurality of records, each of said records containing a reference biomolecular interaction defined by a chemical graph and descriptive information from an external database, which information correlates the biomolecular interactions to records in the external database;
(b) a processor in communication with said database and responsive to user input to access records in said database; and
(c) a user interface allowing a user to provide user input to said processor to selectively view information regarding a biomolecular interaction.
Still further the invention provides a database system comprising a plurality of internal records, the database comprising a plurality of records, wherein each record contains a biomolecular interaction defined by chemical graphs and descriptive information from an external database which information correlates the biomolecular interactions to records in the external database.
In an embodiment the external database is PubMed. The interface of the computer system may further comprise user selectable links to enable a user to access additional information for a biomolecular interaction. The links may comprise HTML links.
Additionally provided is a method of using a computer system to present information, or a method of presenting information pertaining to records of biomolecular interactions in a database, the records containing information identifying the biomolecular interaction and defining the biomolecular interaction by chemical graphs, the method comprising:
(a) providing an interface for entering query information relating to a biomolecular interaction;
(b) locating data corresponding to the entered query information; and
(c) displaying the data corresponding to the entered query information.
In step (b) the data is located by examining records in the database.
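A minimal sketch of steps (a) to (c) is given below, assuming that each record is a flat dictionary of text fields and that the query is a case-insensitive substring. The field names (`id`, `molecule_a`, `molecule_b`, `description`) and the functions `locate`, `display`, and `run_query` are hypothetical; an actual embodiment would examine the full records, including the chemical graphs.

```python
def locate(records, query):
    """Step (b): examine each record and keep those containing the query text."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [r for r in records
            if any(needle in str(value).lower() for value in r.values())]


def display(records):
    """Step (c): render the located records as plain text, one per line."""
    return "\n".join(
        f"{r['id']}: {r['molecule_a']} / {r['molecule_b']} ({r['description']})"
        for r in records
    )


def run_query(records, query):
    """Step (a): a function-call interface standing in for a user-facing form."""
    return display(locate(records, query))
```

A graphical or web interface would replace `run_query` in practice; the separation between locating and displaying is kept so that either part can be changed independently.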
The invention further provides a computer program product comprising a computer-usable medium having computer-readable program code embodied thereon relating to a plurality of records of biomolecular interactions, the records identifying the biomolecular interactions and defining chemical graphs of the biomolecular interactions, the computer program product comprising computer-readable program code for effecting the following steps within a computing system:
(a) providing an interface for entering query information relating to a biomolecular interaction;
(b) locating data corresponding to the entered query information; and
(c) displaying the data corresponding to the entered query information.
The invention contemplates a database storing data relating to biomolecular interactions comprising:
(a) first data types describing biomolecular interactions between chemical objects;
(b) second data types describing collections of biomolecular interactions; and
(c) third data types describing pathways between said collections of interactions.
The first data types may include objects for the chemical objects, each of the objects including at least one of a pointer to an external database describing the chemical object, a sequence, and a chemical graph. The first data types may be stored as records and further include objects identifying the biomolecular interactions and defining chemical graphs of the biomolecular interactions.
The second data types may include lists of identifications referencing the biomolecular interactions in the collections. The third data types may include objects for the chemical objects that can form networks of interactions. The networks of interactions may include metabolic pathways and cell signaling pathways. The third data types may additionally include sequences of identifications referencing biomolecular interactions that make up the pathways.
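The relationship among the three data types may be sketched as follows, with collections holding lists of interaction identifiers and pathways holding ordered sequences of them. All class, field, and method names are hypothetical, molecules are reduced to plain names for brevity, and the identifier validation is an illustrative design choice and not a stated requirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InteractionRecord:            # first data type
    iid: int
    molecule_a: str
    molecule_b: str


@dataclass
class MolecularComplex:             # second data type: a collection of interactions
    cid: int
    interaction_ids: List[int]


@dataclass
class Pathway:                      # third data type: an ordered sequence of interactions
    pid: int
    interaction_ids: List[int]


@dataclass
class InteractionDatabase:
    interactions: Dict[int, InteractionRecord] = field(default_factory=dict)
    complexes: Dict[int, MolecularComplex] = field(default_factory=dict)
    pathways: Dict[int, Pathway] = field(default_factory=dict)

    def _check_ids(self, ids: List[int]) -> None:
        """Reject references to interactions that have not been stored."""
        missing = [i for i in ids if i not in self.interactions]
        if missing:
            raise KeyError(f"unknown interaction ids: {missing}")

    def add_interaction(self, iid: int, a: str, b: str) -> None:
        self.interactions[iid] = InteractionRecord(iid, a, b)

    def add_complex(self, cid: int, ids: List[int]) -> None:
        self._check_ids(ids)
        self.complexes[cid] = MolecularComplex(cid, list(ids))

    def add_pathway(self, pid: int, ids: List[int]) -> None:
        self._check_ids(ids)
        self.pathways[pid] = Pathway(pid, list(ids))

    def pathway_molecules(self, pid: int) -> List[str]:
        """List the molecules of a pathway in order of first appearance."""
        seen: List[str] = []
        for iid in self.pathways[pid].interaction_ids:
            rec = self.interactions[iid]
            for mol in (rec.molecule_a, rec.molecule_b):
                if mol not in seen:
                    seen.append(mol)
        return seen
```

Because complexes and pathways store only identifiers, an interaction record is held once and may be shared by any number of collections and pathways.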
The systems and products of the present invention may be used to study and identify biomolecular interactions. Such information is of significant interest in pharmaceutical research, particularly to identify potential drugs and targets for drug development. The systems and products provide great power and flexibility in analyzing biomolecular interactions.