Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, 9 October 2015. Presentation by: Ahmad Alsahaf, research collaborator at the Hydroinformatics Lab, Politecnico di Milano; MSc in Automation and Control Engineering.
A Long History of Animal Breeding
The first step in animal breeding: domestication.
Breeding for desirable traits: longevity, fertility, more wool, milk, eggs, meat, etc. Faster race horses, better hunting dogs, cuter kittens.
Early History
1900s: Selective breeding relied on observable traits and human intuition. Rediscovery of Mendel's laws of inheritance; Gregor Mendel (1822-1884). The biometrician Karl Pearson and the rejection of Mendel's laws; Karl Pearson (1857-1936).
1930s: Animal Breeding Plans, a 1937 book by Lush: the first application of statistics and quantitative genetics to animal breeding (cattle).
1940s: Artificial insemination became common practice in dairy cattle.
The Dairy Cattle Example
One of the sectors of the animal industry that benefited most from selective breeding and the use of data:
Pedigree records have been kept well.
Few and easily measurable traits (milk/protein/fat yields, feed efficiency).
Bulls deemed good can be fully utilized.
Advanced artificial insemination technology.
The Holstein Friesian Dairy Cattle Breed
Genotype vs. Phenotype
Progeny Testing (test bull: genetic information not available)
Artificial insemination produces the bull's milk-producing daughters; measure the quality of their milk to determine the economic value of the bull.
Progeny Testing (test bull: genetic information now AVAILABLE, 50,000-70,000 genetic markers)
Artificial insemination produces the bull's milk-producing daughters; measure the quality of their milk to determine the economic value of the bull.
Machine Learning Examples
1. Using classification models (supervised learning) to detect problems in artificial insemination.
Grzesiak, Wilhelm, et al. "Detection of cows with insemination problems using selected classification models." Computers and Electronics in Agriculture 74.2 (2010): 265-273.
Data: 1,200 cows; genetic information, nominal and categorical phenotypes, and environmental factors (lactation number, % HF genome, sex of the calf, age of the cow, AI season, health metric, % of fat/protein in milk).
Methods: linear classifiers, logistic regression, artificial neural networks, multivariate adaptive regression splines.
Output: classifying each cow as a good cow or a bad cow, with the logistical and economic implications of the classification outcome (false positives vs. false negatives).
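As a minimal sketch of this kind of supervised classification setup: the feature list below mirrors the slide, but the data are synthetic random values, not the study's data, and logistic regression stands in for the full set of methods the paper compared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1200  # number of cows, as in the study; the values here are synthetic

# Features mirroring the slide (toy distributions, not real data)
X = np.column_stack([
    rng.integers(1, 8, n),      # lactation number
    rng.uniform(50, 100, n),    # % Holstein-Friesian genome
    rng.integers(0, 2, n),      # sex of the calf
    rng.uniform(2, 12, n),      # age of the cow (years)
    rng.integers(0, 4, n),      # AI season (categorical code)
    rng.uniform(3, 5, n),       # % fat in milk
])
# Synthetic label: "good cow" (1) vs. "bad cow" (0), driven by two features
y = (X[:, 1] + 5 * X[:, 5] + rng.normal(0, 5, n) > 90).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

In a real deployment the decision threshold would be tuned to reflect the asymmetric cost of false positives vs. false negatives mentioned on the slide.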
Machine Learning Examples
2. Clustering dairy cows based on their phenotypic traits.
"Principal component and clustering analysis of functional traits in Swiss dairy cattle." Turk. J. Vet. Anim. Sci. (2008).
3. Prediction of insemination outcome.
Shahinfar, Saleh, et al. "Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms." Journal of Dairy Science (2014).
4. Predicting the lactation yield of dairy cows using multiple regression or neural networks.
Grzesiak, W., et al. "A comparison of neural network and multiple regression predictions for 305-day lactation yield using partial lactation records." Canadian Journal of Animal Science (2003).
Phenotype-phenotype prediction studies:
# | Data | ML methods
1 | 10 phenotypes and environmental factors | ANN, logistic regression
2 | 5 phenotypes | hierarchical clustering, PCA
3 | 26 phenotypes and environmental factors | naive Bayes, decision trees
4 | 7 phenotypes | ANN, multiple regression
Machine Learning with High-Dimensional Genetic Data
A unit of genetic variation, or genetic marker: the SNP (single nucleotide polymorphism).
The goal is to associate an SNP (or several) with a phenotype, e.g. a disease. This is typically done by GWAS (Genome-Wide Association Studies): which SNPs (or other markers) occur frequently within a population that has the trait of interest?
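A single-marker GWAS scan can be sketched as one association test per SNP. The sketch below uses a chi-square test on a 2x3 table (trait status vs. genotype class) over synthetic genotypes; real GWAS pipelines add quality control, population-structure correction, and multiple-testing adjustment, none of which is shown here.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n_animals, n_snps = 500, 200  # toy sizes; real panels have 50,000-70,000 SNPs
genotypes = rng.integers(0, 3, size=(n_animals, n_snps))  # minor-allele counts 0/1/2
# Synthetic binary trait driven by SNP 7, so the scan should flag that marker
trait = (genotypes[:, 7] + rng.normal(0, 1, n_animals) > 1.5).astype(int)

p_values = []
for j in range(n_snps):
    # 2x3 contingency table: trait status (rows) vs. genotype class (columns)
    table = np.zeros((2, 3))
    for g, t in zip(genotypes[:, j], trait):
        table[t, g] += 1
    _, p, _, _ = chi2_contingency(table)
    p_values.append(p)

top_snp = int(np.argmin(p_values))
print("most associated SNP:", top_snp)
```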
Why Machine Learning?
Quantitative traits (e.g. milk yield, disease, longevity) are controlled by multiple markers. Machine learning can associate multiple genetic markers with a phenotype AND find complex interactions between markers. It can also facilitate dealing with redundant and irrelevant variables.
Example: From Genotype to Milk Yield, Using Neural Networks
Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC Genetics 12.1 (2011): 87.
Input: n = 297 cows, p = 35,798 SNPs, i.e. a small-n, large-p problem.
Output: milk yield, protein yield, fat yield.
Dealing with dimensionality:
Bayesian regularized backpropagation, commonly used to avoid overfitting in BP networks.
297 variables derived from the original 35,798: using genome-derived (SNP) relationships between the cows as inputs instead of the SNPs themselves, by constructing a matrix of genomic relationships that is analogous to a covariance matrix and is based on allele frequencies in the population.
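One common construction of such a genomic relationship matrix is VanRaden-style, G = ZZ' / (2 * sum_j p_j(1 - p_j)), with Z the allele-frequency-centered genotype matrix; the paper's exact formulation may differ, and the sizes below are toy values rather than the study's 297 x 35,798.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 500  # toy sizes; the paper used 297 cows x 35,798 SNPs
M = rng.integers(0, 3, size=(n, p)).astype(float)  # minor-allele counts 0/1/2

# Allele frequencies per SNP, estimated from the sample
freq = M.mean(axis=0) / 2.0
Z = M - 2.0 * freq  # center each SNP by its expected allele count

# Genomic relationship matrix: n x n, analogous to a covariance matrix
# between animals, normalized by the total expected marker variance
G = Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

print(G.shape)  # one row/column of relationships per animal
```

Each animal's row of G (length n) can then replace its raw SNP vector (length p) as the network input, which is how the dimensionality drops from 35,798 to 297.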
Results (figures): effective number of parameters; mean squared error of the predictions.
Example: From Genotype to Feed Efficiency, Using Random Forests (Decision Trees)
Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of Dairy Science 96.10 (2013).
Input: 395 Holstein cows, 42,275 SNPs, i.e. a small-n, large-p problem.
Output: residual feed intake of the cow, adjusted for environmental and external factors.
Methods: decision trees. A predictive model with a tree structure based on if-else statements; at each node, pick the best split (the best question to ask).
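The if-else structure can be made concrete with a tiny fitted tree on synthetic SNP data (names and effect sizes below are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(200, 5)).astype(float)    # 5 toy SNPs coded 0/1/2
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(0, 0.3, 200)  # trait driven by SNPs 0 and 3

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
# Each internal node is an if-else question on one variable: the "best split"
print(export_text(tree, feature_names=[f"snp_{i}" for i in range(5)]))
```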
Methods: the Random Forests algorithm (an ensemble method). The output is the averaged outcome of all the weak learners (decision trees) in the ensemble.
Dealing with dimensionality:
Bootstrapping for each tree in the forest.
At each node of each tree, choosing the best split out of a random subset of the p variables, not all of them (e.g. 100 or 1,000).
Results and findings:
Ranking SNPs according to their importance to the phenotype output (the implicit feature-ranking capability of decision trees).
Identifying pairs of epistatic genes through the RF structure, as they tend to fall into the same branches of trees (parent-child splits).
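A hedged sketch of the whole recipe (not the paper's implementation): a forest with bootstrapping and per-split feature subsampling, on synthetic genotypes where one SNP acts additively and two interact epistatically, ranked afterwards by impurity-based importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n, p = 300, 1000  # toy small-n, large-p; the paper: 395 cows, 42,275 SNPs
X = rng.integers(0, 3, size=(n, p)).astype(float)
# Synthetic "residual feed intake": additive effect of SNP 10 plus an
# epistatic (interaction) effect between SNPs 20 and 21
y = X[:, 10] + X[:, 20] * X[:, 21] + rng.normal(0, 0.5, n)

rf = RandomForestRegressor(
    n_estimators=300,
    max_features=100,  # best split chosen from a random subset, not all p
    bootstrap=True,    # each tree sees a bootstrap sample of the cows
    random_state=0,
).fit(X, y)

# Implicit feature ranking: impurity-based importance per SNP
ranking = np.argsort(rf.feature_importances_)[::-1]
print("top SNPs:", ranking[:5])
```

Importance scores recover which SNPs matter; detecting the epistatic pair, as in the paper, additionally requires inspecting the tree structures for variables that repeatedly split in parent-child order.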
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in Automation and Control Engineering, Dec 2014
Master thesis by: Ahmad Alsahaf
Supervisor: Andrea Castelletti. Co-supervisors: Stefano Galelli, Matteo Giuliani.
What is Model-Order Reduction (Emulation Modelling)?
Build an emulator of a physically based (PB) model such that:
the emulator is less computationally intensive than the PB model;
its input-output behavior accurately reproduces the PB model's behavior;
the emulator is credible from the user/analyst's point of view (physically interpretable).
Recursive Variable Selection: a feature selection algorithm.
(Diagram: candidate state variables, exogenous inputs, and control variables are selected to explain the output variable; selection threshold >2%.)
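The slide gives only the candidate sets and the >2% threshold, so the following is a generic forward-selection sketch under those assumptions, not the thesis's exact algorithm: candidates are added one at a time while the gain in cross-validated explained variance exceeds 2%, with an Extra-Trees regressor as the underlying model.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 400
# Toy candidate inputs standing in for state variables, exogenous
# inputs, and control variables; only columns 0 and 4 are informative
X = rng.normal(size=(n, 8))
y = 3 * X[:, 0] + 1.5 * X[:, 4] + rng.normal(0, 0.5, n)  # output variable

def r2(features):
    """Cross-validated explained variance using the given feature subset."""
    if not features:
        return 0.0
    model = ExtraTreesRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X[:, features], y, cv=3, scoring="r2").mean()

selected, best = [], 0.0
while True:
    gains = {j: r2(selected + [j]) for j in range(X.shape[1]) if j not in selected}
    j_best = max(gains, key=gains.get)
    if gains[j_best] - best <= 0.02:  # stop when the gain drops below 2%
        break
    selected, best = selected + [j_best], gains[j_best]

print("selected inputs:", selected)
```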
PCA vs. Sparse PCA: coefficients heat map (figure).
PCA vs. Sparse and Weighted PCA: emulator performance.
(Figure: R2 / explained variance vs. number of principal components, 1-20, for PCA, WPCA, and SPCA.)
Emulator structure: Extra-Trees (Geurts et al., 2006).
Ref: Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
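The interpretability argument behind the coefficient heat map can be illustrated on toy data (synthetic, not the thesis dataset): sparse PCA drives many component loadings exactly to zero, so each component depends on only a few physical variables, while ordinary PCA mixes all of them.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(6)
# Toy data: 200 samples, 10 variables, generated from two latent factors
latent = rng.normal(size=(200, 2))
load = np.zeros((2, 10))
load[0, :3] = 1.0   # factor 1 loads only on variables 0-2
load[1, 5:8] = 1.0  # factor 2 loads only on variables 5-7
X = latent @ load + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Sparse PCA's L1 penalty zeroes out loadings, aiding physical interpretation
print("zero loadings, PCA:      ", int(np.sum(pca.components_ == 0)))
print("zero loadings, SparsePCA:", int(np.sum(spca.components_ == 0)))
```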