1. Field of the Invention
The present invention relates generally to information databases, specifically to database similarity joins and more specifically to a system and method for information organization whereby characteristics regarding entities are inferred from the characteristics of similar entities. This is referred to herein as a "fuzzy similarity join" and is exemplified using a chemical similarity join.
2. Background Information
Chemists, biologists and other users regularly create and test series of chemical compounds in investigating and verifying a hypothesis. In this process, the users often seek to obtain chemical compounds exhibiting certain characteristics, or behaving according to certain metrics, and may seek to synthesize compounds having similar characteristics or behavior patterns.
The process of searching for a chemical compound of some commercial value usually starts with broad-based selection and testing. An example of this is the high-throughput screening typically used in the initial phase of pharmaceutical agent discovery. Pharmaceutical discovery is used as an example, but the same type of process is used for agricultural chemical discovery and material science research, as well as in other related fields.
In High-throughput Screening ("HTS"), the number of compounds examined and tested for a desirable biological response can often range from 50,000 to 500,000, or more. The goal is to find some smaller set of compounds within the larger set that are active in a biological screen, and to treat these compounds as "leads" that can be further developed into an eventual drug candidate. The initial library of compounds tested represents many different types of chemicals.
The chemicals in the initial library can come from several sources, including those developed in-house by conventional synthesis, commercial acquisition, combinatorial chemistry, and natural product extraction. These compounds are typically placed in micro-titer plates. Typical formats for the plates include 96 and 384 well plates, but there is a trend to higher-density plates such as 1536 and 3456 well plates. These plates are typically manipulated by robots to perform the biological screening.
The screens themselves are usually based on a biological receptor. The receptor is either isolated so that binding to the receptor can be measured somewhat directly, or a cell line is engineered to give a detectable response when the receptor is modulated by the potential drug lead.
Although most initial libraries comprise thousands of chemical compounds, even the most extensive library represents a mere subset of the trillions (or more) of potential chemical structures that might have "drug-like" characteristics. It is estimated that the total of all compounds available from commercial vendors is currently limited to about 1 million compounds.
The list of compounds to be screened can be selected randomly from those available, but is often chosen with some intuitive "bias" on the part of the chemists or biologists involved in a particular project. This bias can often be advantageous to the project in that chemists often have unique insights into the types of chemicals that may lead to viable drug candidates. However, as with any bias, an intuitive approach can at times result in potentially novel chemicals being overlooked.
In the last few years, the trend in the art has been to select compounds based on the diversity of the compounds within the final selection set. This process is intended to ensure that many broad classes of compounds are tested. Both the measure of diversity (diversity metric) and the diversity selection method have been much discussed, but these always depend on a measure of similarity between two compounds. The general tendency is to choose compounds that are as different from each other as possible, but this can often lead to selection of the most chemically "unique" compounds in the set; accordingly, this approach can lead to overlooking or missing potentially active lead compounds.
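The dependence of diversity selection on a pairwise similarity measure can be illustrated with a minimal sketch. It assumes, purely for illustration, that each compound is described by a set of structural feature identifiers, that similarity is the Tanimoto coefficient over those sets, and that selection uses a greedy "MaxMin" rule; none of these choices is stated in the text above, and the function names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def maxmin_pick(features, k):
    """Greedy MaxMin selection: repeatedly add the compound whose highest
    similarity to any already-picked compound is the lowest."""
    ids = sorted(features)
    picked = [ids[0]]  # arbitrary but deterministic starting compound
    while len(picked) < min(k, len(ids)):
        remaining = [i for i in ids if i not in picked]
        best = min(
            remaining,
            key=lambda i: max(tanimoto(features[i], features[p]) for p in picked),
        )
        picked.append(best)
    return picked


# Hypothetical library: compound id -> set of feature identifiers.
library = {
    "a": frozenset({1, 2, 3}),
    "b": frozenset({1, 2, 4}),
    "c": frozenset({7, 8, 9}),
    "d": frozenset({1, 2, 3, 4}),
}

selection = maxmin_pick(library, 2)
print(selection)
```

In this toy library the rule picks compound "c" second because it shares no features with "a". This mirrors the concern raised above: the most dissimilar compound is favored even though it is a singleton with no near neighbors.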
In conducting these studies, researchers rarely desire selection methods that find large clusters of structurally similar compounds within the library (e.g., 5000 benzodiazepine derivatives would not be desirable). Singletons, i.e., compounds that have no similar structure in the dataset, are also generally considered undesirable because these do not allow for the opportunity to develop any structure-activity correlation information. Rather, selection methods that lead to sets of 10-15 similar structures are considered preferable. Such small sets of similar compounds allow for some analysis of the effect of small structure variations on the activity of the compounds (referred to as Structure-Activity Relationships, or SAR studies). In addition, the small clusters help validate the screening: if 5 of 10 compounds in a small cluster evidence biological activity, then, because the cluster consists of chemically related structures, the activity is more likely to be reproducible and "optimizable."
The initial biological screening produces compounds that are generally referred to as "chemical hits" or simply "hits"; hits are compounds that have been screened in an assay and evidence biological activity above a desired threshold. These hits rarely include the final drug candidate that will be further analyzed in animal toxicology studies and, ultimately, in human clinical trials. Indeed, these hits generally represent leads that are optimized by producing small changes in their chemical structures; these changes are generally intended to improve or enhance the biological activity of the leads until a commercial candidate is identified via additional screening. These follow-on compounds can be referred to as analogues of the initial hits. This process of optimization of the hits is generally referred to as "lead follow-up."
Lead follow-up has generally been accomplished by medicinal chemists, who make small sets of analogues of some of the lead compounds. As with the initial screen that led to the initial hits, the analogues are then also tested for biological efficacy. The structure modifications that resulted in reduced activity are usually discarded in favor of those that increased the activity, and new modifications to the analogue compounds are often also made and tested. The medicinal chemist follows the leads until a compound (or a small set of compounds) is identified that has appropriate efficacy for a drug candidate.
In the last several years, the medicinal chemist has often been aided by computer-based design technologies such as Quantitative Structure Activity Relationships (QSAR). These programs use efficacy data for previously tested compounds to predict the efficacy of compounds yet to be tested. The goal of a QSAR program is to give accurate predictions of the activities prior to testing the compounds. QSAR programs have generally been successful, not in predicting the activity of the eventual drug candidate, but in allowing more efficient selection for each round of analogue synthesis. While the compounds predicted to be active by QSAR methods do not always have the activity predicted, generally these compounds have an increased chance of being active compared to the general population.
Pharmaceutical development is generally very competitive. Therefore, and almost without deviation, once a drug candidate is selected, extensive patent searches are conducted in order to ensure that neither the candidate nor the use of the candidate is restricted by another's patent position. Animal toxicology studies generally follow the patent search. If the animal toxicology results are acceptable, human clinical trials of the drug candidate are pursued.
The process of screening, analoging and identification of potential drug candidates can be very time consuming and expensive. Patent searching, particularly in the area of chemical compounds, can also be very time consuming and expensive. Animal toxicology studies involving the potential drug candidate can easily cost hundreds of thousands of dollars. Human clinical studies designed to establish the safety and efficacy of the potential drug candidate in humans exceed tens of millions of dollars. It is, therefore, imperative that as much information relating to the potential drug candidate be understood as early in the process as possible, such that substantial investments in time, effort and financial resources are not directed to, e.g., a potential drug candidate that is covered by the claims of a third party patent, or, e.g., a potential drug candidate that is chemically related to another compound that evidenced safety issues in human clinical studies.
Relational Database Systems ("RDS") are used prevalently throughout industry and academia to store and search information on a plethora of subjects. RDS employ a table structure to store information about the various instances of each entity. These tables have defined columns that are the attributes of each data item (row). The data in each column can be of several types, including text, numeric, date/time, binary, etc. Data in certain columns can be indexed for faster retrieval.
In the relational model for database design, data that is repeated for several rows is usually split out into a new table/entity definition. This process is referred to as "normalization" and is generally accomplished to protect the data integrity and to save disk space. The relationship between the data in the tables is, however, maintained.
The data in RDS is generally queried by the user or application program by generating a specific query in a query-directed language. The Oracle™ system is a preferred example of an RDS. In Oracle, as in many other RDS, queries are posed using the Structured Query Language (SQL). This language allows easy retrieval of the information stored in the various tables, and allows related data in a different table to be combined. The construct of an SQL query that performs this combination of data is called a "join." The word join, in this context, is a term of art; it is a noun, and not a verb. A join links rows of one table with rows of another based on some common or related columns (attributes). The join can be performed "on the fly" (i.e., the join itself is added to each query as it is created), or can be predefined to give a pseudo table, generally referred to as a "view." The view has the appearance of a new table, but generally, the view is not stored as such.
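As a minimal sketch of the two forms of join described above, the following uses Python's built-in sqlite3 module rather than Oracle, purely so that the example is self-contained. The table names, column names and data are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database with two hypothetical, normalized tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (compound_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE assay_result (compound_id INTEGER, activity REAL);
    INSERT INTO compound VALUES (1, 'compound A'), (2, 'compound B');
    INSERT INTO assay_result VALUES (1, 0.82), (2, 0.10);

    -- A view: a predefined join that can be queried like a table.
    CREATE VIEW compound_activity AS
        SELECT c.compound_id, c.name, r.activity
        FROM compound c
        JOIN assay_result r ON r.compound_id = c.compound_id;
""")

# Join performed "on the fly": the join is written into the query itself.
on_the_fly = conn.execute(
    "SELECT c.name, r.activity "
    "FROM compound c "
    "JOIN assay_result r ON r.compound_id = c.compound_id "
    "ORDER BY c.compound_id"
).fetchall()

# The same rows retrieved through the predefined view.
via_view = conn.execute(
    "SELECT name, activity FROM compound_activity ORDER BY compound_id"
).fetchall()

print(on_the_fly)
print(via_view)
```

Both queries return the same rows. The join condition here is an exact match on the shared compound_id column, which is the conventional case.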
The Internet provides a useful technique for making information available to a variety of individuals each of whom may be located at a variety of different locations. Indeed, within the vast Internet environment, individuals can access information tools from remote locations. Beneficially, the Internet is a preferred way for accessing information stored in relational databases, such as those described above.
The Internet, which originally came about in the late 1960s, is a computer network made up of many smaller networks spanning the entire globe. The host computers or networks of computers on the Internet allow public access to databases containing information in numerous areas of expertise. Hosts can be sponsored by a wide range of entities including, for example, universities, government organizations, commercial enterprises and individuals.
Internet information is made available to the public through servers running on an Internet host. The servers make documents or other files available to those accessing the host site. Such files can be stored in databases and on storage media such as, for example, optical or magnetic storage devices, preferably local to the host.
Networking protocols can be used to facilitate communications between the host and a requesting client. TCP/IP (Transmission Control Protocol/Internet Protocol) is one such networking protocol. Computers on a TCP/IP network utilize unique identification ("ID") codes, allowing each computer or host on the Internet to be uniquely identified. Such codes can include an IP (Internet Protocol) number or address, and corresponding network and computer names.
Created in 1991, the World-Wide Web (Web, or www) provides access to information on the Internet, allowing a user to navigate Internet resources intuitively, without IP addresses or other specialized knowledge. The Web comprises hundreds of thousands of interconnected "pages", or documents, which can be displayed on a user's computer monitor. The Web pages are provided by hosts running special servers. Software that runs these Web servers is relatively simple and is available on a wide range of computer platforms including PCs. Equally available is Web browser software, used to display Web pages as well as traditional non-Web files on the user's system.
The Web is based on the concept of hypertext and a transfer method known as "HTTP" (Hypertext Transfer Protocol). HTTP is designed to run primarily over TCP/IP and uses the standard Internet setup, where a server issues the data and a client displays or processes it. One format for information transfer is to create documents using Hypertext Markup Language (HTML). HTML pages are made up of standard text as well as formatting codes indicating how to display the page. The browser reads these codes to display the page.
Each Web page may contain pictures and sounds in addition to text. Associated with certain text, pictures or sounds are connections, known as hypertext links, to other pages within the same server or even on other computers within the Internet. For example, links may appear as underlined or highlighted words or phrases. Each link is directed to a web page by using a special name called a URL (Uniform Resource Locator). URLs enable the browser to go directly to the associated file, even if it is on another Web server.
In addition to the Internet, which allows for general, public retrieval of information, other means of accessing such information exist and are commonly utilized. For example, direct modem connections between two computers, proprietary internal networks within large institutions and organizations, etc. are equally available and useful means for accessing catalogued information stored in databases.
The present invention is directed toward a system and method for information organization whereby characteristics regarding entities can be inferred from the characteristics of similar entities. According to one aspect of the invention, database similarity joins can be used to allow characteristics or parameters regarding information contained in a first database to be inferred from characteristics or parameters regarding information contained in a second database. According to this aspect of the invention, the invention can provide for the retrieval of information that is not organized in a manner that a specific user may require or desire. This allows retrieval based upon common characteristics or a similarity between entities organized in unrelated databases. The information can be retrieved and organized in a manner that makes the information more useful to the user.
One approach that allows for such retrieval and organization is referred to as a fuzzy similarity join. According to this approach, it is not necessary that the relationship between the retrieved information be intuitively or organizationally related in the manner in which it is retrieved. Instead, the retrieval of desired information can be based upon a similarity among entities in one or more databases.
These and other aspects of the invention, which can be implemented individually or collectively, are perhaps best described in terms of an example application. For example, consider the application of chemical searching, where a scientist may wish to obtain certain information about one or more compounds of interest. According to conventional chemical database strategy, information of interest to a scientist about a compound of interest may not be readily available in a database, or may not be available at all. According to one aspect of the invention, the scientist can perform a database join to obtain information about the compound of interest from another database.
According to another aspect of the invention in this application, the scientist can perform a chemical similarity join (or fuzzy similarity join) to infer information about the compound of interest, based upon the characteristics or other parameters of "similar" compounds. According to this aspect of the invention, the chemical similarity join allows the scientist to search one or more databases to obtain information about the "similar" compounds. The scientist can use this information to infer behavior or other characteristics or parameters about the compound of interest.
In one implementation, for example, if the chemical space can be defined such that a neighborhood effect exists for the property in question (for example, toxicology), then the property for the compound(s) of interest in one database can be inferred from the property data of similar compounds in another database. Thus, this aspect of the invention in this application allows two tables to be joined by a similarity comparison of the two structures. An exact match of the two structures is not required to perform the join operation.
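The idea of joining two tables on structural similarity rather than on an exact match can be sketched as follows. This is only an illustration under stated assumptions: each structure is represented by a set of feature identifiers, similarity is the Tanimoto coefficient, and a fixed threshold defines the neighborhood. The text above does not prescribe any of these, and all names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(left, right, threshold):
    """Pair every left row with each right row whose structural
    similarity meets the threshold; no exact match is required."""
    pairs = []
    for left_id, left_features in left.items():
        for right_id, right_features in right.items():
            score = tanimoto(left_features, right_features)
            if score >= threshold:
                pairs.append((left_id, right_id, score))
    return pairs


# Hypothetical first table: compounds of interest with no toxicology data.
candidates = {"cand-1": frozenset({1, 2, 3, 4})}

# Hypothetical second table: reference compounds with known toxicology.
reference = {
    "ref-1": frozenset({1, 2, 3, 5}),  # shares 3 of 5 distinct features
    "ref-2": frozenset({7, 8, 9}),     # shares no features
}
toxicology = {"ref-1": "hepatotoxic", "ref-2": "none reported"}

matches = similarity_join(candidates, reference, threshold=0.5)

# Property data of the neighbors, from which a property might be inferred.
inferred = {
    cand: [toxicology[ref] for c, ref, _ in matches if c == cand]
    for cand in candidates
}
print(matches)
print(inferred)
```

The nested loop compares every pair of rows, which is adequate for a sketch but scales with the product of the two table sizes. Whether a neighbor's property can reasonably be transferred depends on the neighborhood effect actually holding for that property.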
According to another aspect of the invention, a searching tool can be used that combines actual compound data with a virtual data set to facilitate neighborhood searching around a preferred set of property metrics. The neighborhood relationship can be the basis for the similarity join.
According to another aspect of the invention, the data set can be screened to eliminate records having particular or identified properties or characteristics. Additionally, the data set can be combined with other data to allow further filtering to exclude unwanted classes of records.
According to yet another aspect of the invention, the searching tool can be linked to an ordering system, allowing the users to purchase identified items.
These and other features, advantages and aspects of the invention are discussed in more detail below.