Researchers often use the scientific method to seek undocumented information or investigate occurrences related to a subject. Certain embodiments of the scientific method may include at least the steps of identifying a question about the subject, forming a hypothesis about the subject, conducting an experiment about the subject, and assessing whether the experimental results support the hypothesis. If the experimental results support the hypothesis, the hypothesis may become an experimental conclusion. If the experimental results do not support the hypothesis, then the not-hypothesis may become an experimental conclusion.
A single cycle of the scientific method may not be sufficient for a community to accept that an experimental conclusion is accurate. If an experimental conclusion is supported by many well-designed experiments, the experimental conclusion may become generally accepted as accurate or factual. However, even when accepted conclusions are supported by a large number of experimental results, accepted conclusions may still be altered or overturned if new evidence renders those conclusions uncertain or unsupported.
Other elements of the scientific method may include submitting results for peer review, publishing results and experimental conclusions, and attempting to replicate the experimental results obtained by the same researcher or other researchers.
While not every scientific inquiry may follow a strict scientific method, scientific inquiry almost always includes identifying some research subject and seeking to answer some question by conducting research, such as experimental research.
Also, certain experimental research may identify some results that do not specifically answer the question, do not support the hypothesis, or possibly do not provide practical solutions to a question. Such results do lead to an experimental conclusion and provide negative information about the subject. Negative information can be useful and valuable, but possibly less valuable than positive information, that is, the information that results from experimental research that supports or confirms a hypothesis. To illustrate the difference between negative information and positive information, Thomas Edison is said to have tried over 2,000 different materials as light bulb filaments and found them to be unsuitable for a commercial product. The information that each of those 2,000 filaments did not fit his requirements for a commercial light bulb constituted the valuable negative information generated by Edison's experiments. However, arguably the most significant and valuable experiment and respective conclusion, which also constituted positive information, was his identification of a material that could be used as a filament that burned for 40 hours. This material was the precursor to a practical filament useful in a commercially viable light bulb. For purposes of this application, the term “positive information” means the solution to a problem or the answer to a question, whether incremental or partial or complete or full.
For inexperienced researchers, and even experienced researchers, it can be challenging to identify questions or hypotheses that may lead or are likely to lead to valuable or significant conclusions. A conclusion may be valuable or significant, for example, because of its importance relative to known information, because it inspires other research questions, or because of the monetary value assigned to the conclusion.
Researchers' educational or career options may be dependent on the significance and value of the experimental conclusions reached as a result of their research. Also, experimental research is typically time consuming and expensive. Accordingly, efficiently finding valuable or significant conclusions can be very important to a researcher.
In addition, those who provide research funds often wish to have a high likelihood of reaching valuable or significant conclusions, for example, to improve the likelihood of receiving a return on their investment or even improve the reputation of the entity that provides the funds.
Clearly, there is a need to identify research questions and hypotheses (collectively referred to as “hypotheses” for purposes of this application) that may result or are likely to result in valuable or significant conclusions. For purposes of this application, a question or hypothesis that is likely to result in a valuable or significant conclusion is termed a “quality hypothesis”. A quality hypothesis or “quality prediction” is one that is shown to have an 80% or better accuracy rate. This accuracy rate can be characterized for purposes of this application also as “high precision”.
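The 80% criterion above can be illustrated with a short sketch. The text does not specify how the accuracy rate is measured, so the sketch assumes it is the fraction of predicted hypotheses that are later confirmed. The function names, the threshold constant, and the sample outcomes are hypothetical and serve only as illustration.

```python
QUALITY_THRESHOLD = 0.80  # the 80% accuracy rate named in the text


def precision(outcomes):
    """Fraction of predicted hypotheses that were later confirmed.

    `outcomes` holds one boolean per predicted hypothesis:
    True if it was confirmed, False otherwise (assumed bookkeeping).
    """
    if not outcomes:
        raise ValueError("need at least one prediction")
    return sum(outcomes) / len(outcomes)


def is_high_precision(outcomes, threshold=QUALITY_THRESHOLD):
    """True when the confirmed fraction meets or exceeds the threshold."""
    return precision(outcomes) >= threshold


# Hypothetical outcomes: 4 of 5 predicted hypotheses were confirmed.
outcomes = [True, True, True, True, False]
print(precision(outcomes), is_high_precision(outcomes))
```

Under this assumed definition, a set of predictions in which four of five are confirmed just meets the “high precision” threshold.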
Some techniques have been developed which attempt to identify quality hypotheses. More specifically, certain known techniques utilize computer systems configured to more quickly or automatically identify quality hypotheses. While such known computer-implemented techniques may have some advantages, such as greater speed than previously known techniques, there are certain disadvantages associated with these techniques. Generally, some known techniques are slow to implement, may provide too many irrelevant results mixed in with possibly relevant results, do not take into account all or enough relevant factors correlated with quality hypotheses, are computationally expensive, and/or have one or more other disadvantages associated with them. The following will illustrate some of the disadvantages associated with known techniques that seek to identify quality hypotheses from the perspective of research in the field of biology.
One known technique developed to identify quality hypotheses includes mining (by a computer) of published research literature for co-occurrence of concepts. This technique may utilize the ABC inference model, which states that if concept B is related to both concept A and concept C, then concept A is likely related to concept C, whether directly or indirectly. However, this technique is considered to often return low-quality results and, accordingly, requires manual intervention to facilitate the identification of quality hypotheses. As a result, such approaches have generated few quality hypotheses since their inception decades ago.
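The ABC inference model described above can be sketched in a few lines. This is a minimal illustration rather than any particular published system: it assumes each document has already been reduced to a list of concept labels, and the labels ("A", "B1", and so on), the function names, and the toy corpus are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(documents):
    """Map each concept to the set of concepts it co-occurs with."""
    related = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            related[a].add(b)
            related[b].add(a)
    return related


def abc_candidates(related, a):
    """Return {C: linking B concepts} for every C reachable from A
    through some B but never co-occurring with A directly."""
    direct = related.get(a, set())
    candidates = defaultdict(set)
    for b in direct:
        for c in related.get(b, set()):
            if c != a and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


# Toy corpus: each inner list stands for the concepts found in one document.
documents = [["A", "B1"], ["B1", "C"], ["A", "B2"], ["B2", "C"], ["A", "D"]]
related = build_cooccurrence(documents)
print(abc_candidates(related, "A"))
```

Here "C" is proposed as a candidate for "A" because both "B1" and "B2" link them. Every shared intermediate concept produces a candidate in this way, which suggests why such an approach can return many low-quality results on a large corpus.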
Another proposed technique for identifying quality hypotheses includes reflective random indexing, an improved version of random indexing for indirect inference and knowledge discovery. This technique generally outperforms latent semantic analysis (“LSA”) and random indexing in the task of predicting hidden relationships between bio-entities. However, published literature with wide-ranging topics (e.g., open discovery) may cause false positive results, often characterized as “noise”.
More efficient techniques for identifying quality hypotheses may be developed using supervised machine learning of the information appearing in published articles and other available information. Generally, machine learning is intended to allow computers “to learn” without being explicitly programmed. Supervised machine learning may include the steps of using a training set of information in which one or more concepts are each associated with a category. Then, uncategorized examples are inputted into the system and the machine attempts to categorize the examples through evaluation and using what it “learned” from the training set.
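The supervised learning steps described above (a training set of categorized examples, followed by categorization of new examples) can be sketched as follows. The text does not name a learning algorithm, so a nearest-centroid classifier is used here purely for brevity. The feature vectors, category names, and function names are hypothetical.

```python
import math
from collections import defaultdict


def train(training_set):
    """Compute one centroid per category from (features, category) pairs."""
    sums = {}
    counts = defaultdict(int)
    for features, category in training_set:
        if category not in sums:
            sums[category] = [0.0] * len(features)
        sums[category] = [s + x for s, x in zip(sums[category], features)]
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def classify(model, features):
    """Assign the category whose centroid is nearest (Euclidean distance)."""
    return min(model, key=lambda c: math.dist(model[c], features))


# Hypothetical training set: two numeric features per concept pair.
training_set = [
    ([1.0, 1.0], "related"),
    ([0.9, 0.8], "related"),
    ([0.1, 0.2], "unrelated"),
    ([0.0, 0.1], "unrelated"),
]
model = train(training_set)
print(classify(model, [0.8, 0.9]), classify(model, [0.2, 0.1]))
```

The model "learns" only the per-category averages of the training examples, and an uncategorized example is then assigned to whichever average it most resembles.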
One challenge in using machine learning methods to identify quality hypotheses is how to generate instances for training and evaluation without introducing excess noise into the method. For example, if a pair of concepts is input as a “positive example” in the training set, it is difficult to define a “negative example”, since the absence of a known interaction between two or more concepts does not mean that an interaction is not possible, just that it has not been found. Accordingly, treating such pairs as negative examples may introduce noise to the training set.
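The difficulty with negative examples can be made concrete with a small sketch. It assumes a naive strategy in which any concept pair not known to interact is sampled as a negative example; this strategy, the concept labels, and the function names are hypothetical and are not drawn from any specific prior system.

```python
import random
from itertools import combinations


def sample_negatives(concepts, positives, k, seed=0):
    """Sample up to k concept pairs that are not known positive pairs."""
    known = {frozenset(p) for p in positives}
    unobserved = [
        frozenset(pair)
        for pair in combinations(sorted(concepts), 2)
        if frozenset(pair) not in known
    ]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(unobserved, min(k, len(unobserved)))


def mislabeled(negatives, later_discovered):
    """Negatives that later turn out to be real interactions (label noise)."""
    later = {frozenset(p) for p in later_discovered}
    return [n for n in negatives if n in later]


concepts = ["A", "B", "C", "D"]
positives = [("A", "B"), ("C", "D")]
negatives = sample_negatives(concepts, positives, k=4)

# Suppose the A-C interaction is reported only after the training set is built.
print(mislabeled(negatives, [("A", "C")]))
```

Because the A-C pair was merely unobserved when the training set was assembled, it was labeled negative even though the interaction exists, which is the kind of noise described above.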
Clearly, there is a demand for an improved system and methods for automatically generating quality hypotheses using machine learning techniques. The present invention satisfies this demand.