The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Machine Learning
In machine learning, input variables are used to predict an output variable. The input variables are often called features and are denoted by $X = (X_1, X_2, \ldots, X_k)$, where each $X_i$, $i \in \{1, \ldots, k\}$, is a feature. The output variable is often called the response or dependent variable and is denoted by $Y$. The relationship between $Y$ and the corresponding $X$ can be written in a general form:
$$Y = f(X) + \epsilon$$
In the equation above, $f$ is a function of the features $(X_1, X_2, \ldots, X_k)$ and $\epsilon$ is the random error term. The error term is independent of $X$ and has a mean value of zero.
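The two assumptions on the error term (independence from the features and zero mean) are what make the function of the features the natural target of estimation. A short derivation, using only those two assumptions, shows that the expected output given the features equals that function:

```latex
\begin{aligned}
\mathbb{E}[Y \mid X] &= \mathbb{E}[f(X) + \epsilon \mid X] \\
                     &= f(X) + \mathbb{E}[\epsilon \mid X] \\
                     &= f(X) + \mathbb{E}[\epsilon] && \text{(the error is independent of } X\text{)} \\
                     &= f(X) && \text{(the error has mean zero)}
\end{aligned}
```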
In practice, the features $X$ are available without having $Y$ or knowing the exact relation between $X$ and $Y$. Since the error term has a mean value of zero, the goal is to estimate $f$:
$$\hat{Y} = \hat{f}(X)$$
In the equation above, $\hat{f}$ is the estimate of $f$ and $\hat{Y}$ is the resulting prediction of $Y$. The estimate $\hat{f}$ is often considered a black box, meaning that only the relation between its input and output is known, while the question of why it works remains unanswered.
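The estimation step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the unknown relation is a straight line (`true_f`, a hypothetical choice) and that the estimate is obtained by ordinary least squares; neither assumption comes from the text above, which leaves both the form of the relation and the estimation method open. The names `make_samples`, `fit_line`, and `f_hat` are likewise illustrative.

```python
import random

def true_f(x):
    # Hypothetical "unknown" relation, chosen only for illustration.
    return 2.0 * x + 1.0

def make_samples(n, noise_sd, seed=0):
    """Draw n samples of (X, Y) with Y = f(X) + error."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # The error is drawn independently of X and has mean zero.
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs, ys = make_samples(n=200, noise_sd=0.5)
slope, intercept = fit_line(xs, ys)

def f_hat(x):
    # The estimate of the relation: usable for prediction without knowing true_f.
    return slope * x + intercept

print(f"estimated slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the error averages out to zero, the fitted line should land close to the hypothetical relation even though no individual sample lies exactly on it.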
The function $\hat{f}$ is found by learning. Supervised learning and unsupervised learning are two approaches used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (the labels), the function $\hat{f}$ is optimized such that it approximates the output. In unsupervised learning, the goal is to find a hidden structure in unlabeled data. Because no labels are available, the algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning.
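The contrast between the two approaches can be sketched on a toy one-dimensional data set. The sketch assumes a nearest-centroid classifier for the supervised case and two-cluster k-means for the unsupervised case; both are illustrative choices, not methods named in the text, and the data values and function names are hypothetical.

```python
def fit_centroids(points, labels):
    """Supervised: average the points carrying each label (one centroid per label)."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        sums[lab] = sums.get(lab, 0.0) + p
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def predict(centroids, p):
    """Assign the label whose centroid is closest to p."""
    return min(centroids, key=lambda lab: abs(p - centroids[lab]))

def two_means(points, iterations=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (k-means, k=2).

    Assumes the points are not all identical, so both clusters stay non-empty.
    """
    c0, c1 = min(points), max(points)
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

# Hypothetical toy data.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised: labels are available, so accuracy can be measured.
centroids = fit_centroids(points, labels)
train_accuracy = sum(
    predict(centroids, p) == lab for p, lab in zip(points, labels)
) / len(points)

# Unsupervised: only the points are used; the result is structure, not a score.
clusters = two_means(points)
```

The supervised routine can report an accuracy because it has labels to compare against. The unsupervised routine recovers two groups from the same points but has nothing to score them against.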
Neural Networks
A neural network is a system of interconnected artificial neurons (e.g., a1, a2, a3) that exchange messages with one another. The illustrated neural network has three inputs, two neurons in the hidden layer, and two neurons in the output layer. The hidden layer has an activation function ƒ(•) and the output layer has an activation function g(•). The connections have numeric weights (e.g., w11, w21, w12, w31, w22, w32, v11, v22) that are tuned during the training process, so that a properly trained network responds correctly when fed an image to recognize. The input layer processes the raw input; the hidden layer processes the output from the input layer based on the weights of the connections between the input layer and the hidden layer. The output layer takes the output from the hidden layer and processes it based on the weights of the connections between the hidden layer and the output layer. The network includes multiple layers of feature-detecting neurons. Each layer has many neurons that respond to different combinations of inputs from the previous layers. These layers are constructed so that the first layer detects a set of primitive patterns in the input image data, the second layer detects patterns of those patterns, and the third layer detects patterns of the second layer's patterns.
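The layer-by-layer processing described above can be sketched as a forward pass through a network with three inputs, two hidden neurons, and two output neurons. Several details are assumptions made only for illustration: the weight values are arbitrary, `w[i][j]` is taken to connect input `i` to hidden neuron `j` (the text does not fix an index convention), both activation functions default to the logistic sigmoid, and bias terms are omitted because the text lists only connection weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, activation):
    """Apply one layer: weights[i][j] connects input i to neuron j of the next layer."""
    n_out = len(weights[0])
    return [
        activation(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]

def forward(x, w, v, f=sigmoid, g=sigmoid):
    """Return (hidden activations, output activations) for input x."""
    hidden = layer(x, w, f)        # input layer -> hidden layer, activation f
    output = layer(hidden, v, g)   # hidden layer -> output layer, activation g
    return hidden, output

# Arbitrary illustrative weights, not trained values.
w = [[0.1, 0.4],
     [0.2, 0.5],
     [0.3, 0.6]]   # 3 inputs x 2 hidden neurons
v = [[0.7, 0.9],
     [0.8, 1.0]]   # 2 hidden neurons x 2 output neurons

x = [1.0, 0.0, -1.0]
hidden, output = forward(x, w, v)
```

Training would adjust the entries of `w` and `v`; the sketch covers only how a fixed set of weights maps an input to an output.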
A neural network model is trained using training samples before it is used to predict outputs for production samples. The quality of the predictions of the trained model is assessed using a test set of samples that is withheld from training. If the model correctly predicts the outputs for the test samples, then it can be used in inference with high confidence. However, if the model fits the training samples but does not correctly predict the outputs for the test samples, then the model is said to be overfitted to the training data and has not generalized to the unseen test data.
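The role of the held-out test set can be sketched with two toy models evaluated on the same split. Everything here is a hypothetical illustration rather than a neural network: the data rule (label 1 when the input exceeds 5), the memorizing model, and the threshold model are assumptions chosen so that the overfitting pattern is easy to see.

```python
def accuracy(model, samples):
    """Fraction of (input, label) samples the model predicts correctly."""
    return sum(model(x) == y for x, y in samples) / len(samples)

def make_memorizer(train, default=0):
    """Overfitting-prone model: a lookup table of the training samples."""
    table = {x: y for x, y in train}
    return lambda x: table.get(x, default)

def make_threshold_model(train):
    """Simple model: cut at the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0

# Hypothetical data: the label is 1 when the input exceeds 5.
data = [(x, 1 if x > 5 else 0) for x in range(1, 11)]
train = data[0::2]   # inputs 1, 3, 5, 7, 9
test = data[1::2]    # inputs 2, 4, 6, 8, 10 (never seen during training)

memorizer = make_memorizer(train)
threshold = make_threshold_model(train)

for name, model in [("memorizer", memorizer), ("threshold", threshold)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both models are perfect on the training samples, so training accuracy alone cannot tell them apart. Only the withheld samples show that the memorizer has not generalized.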
Surveys of applications of deep learning in genomics can be found in the following publications:
T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017;
Angermueller C, Parnamaa T, Parts L, Stegle O. Deep Learning For Computational Biology. Mol Syst Biol. 2016; 12:878;
Park Y, Kellis M. 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825-826. (doi:10.1038/nbt.3313);
Min, S., Lee, B. & Yoon, S. Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016);
Leung M K, Delong A, Alipanahi B et al. Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets 2016; and
Libbrecht M W, Noble W S. Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015; 16(6):321-32.