Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, 9 October 2015. Presentation by: Ahmad Alsahaf, research collaborator at the Hydroinformatics Lab, Politecnico di Milano; MSc in Automation and Control Engineering.

A long History of Animal Breeding

A long History of Animal Breeding First step in animal breeding: Domestication

Breeding for desirable traits: longevity, fertility, more wool, milk, eggs, meat, etc.

Breeding for desirable traits: longevity, fertility, more wool, milk, eggs, meat, etc. Faster race horses, better hunting dogs, cuter kittens.

Early History. Selective breeding relied on observable traits and human intuition.

Early History. Selective breeding relied on observable traits and human intuition. 1900s: rediscovery of Mendel's laws of inheritance (Gregor Mendel, 1822-1884).

Early History. Selective breeding relied on observable traits and human intuition. 1900s: rediscovery of Mendel's laws of inheritance; the biometrician Karl Pearson (1857-1936) and the rejection of Mendel's laws.

Early History. Selective breeding relied on observable traits and human intuition. 1900s: rediscovery of Mendel's laws of inheritance; the biometrician Karl Pearson and the rejection of Mendel's laws. 1930s: Animal Breeding Plans, the 1937 book by Lush; the first application of statistics and quantitative genetics to animal breeding (cattle).

Early History. Selective breeding relied on observable traits and human intuition. 1900s: rediscovery of Mendel's laws of inheritance; the biometrician Karl Pearson and the rejection of Mendel's laws. 1930s: Animal Breeding Plans, the 1937 book by Lush; the first application of statistics and quantitative genetics to animal breeding (cattle). 1940s: artificial insemination became common practice in dairy cattle.

The Dairy Cattle Example. One of the sectors of the animal industry that has benefitted most from selective breeding and from the use of data: pedigree records have been kept well; there are few and easily measurable traits (milk/protein/fat yields, feed efficiency); bulls deemed good can be fully utilized; advanced artificial insemination technology is available.

The Holstein Friesian Dairy Cattle Breed

Genotype Vs. Phenotype

Progeny Testing. Test bull: genetic information not available. Artificial insemination produces the bull's milk-producing daughters; measure the quality of their milk to determine the economic value of the bull.

Progeny Testing. Test bull: genetic information now AVAILABLE (50,000-70,000 genetic markers). Artificial insemination produces the bull's milk-producing daughters; measure the quality of their milk to determine the economic value of the bull.

Machine Learning Examples 1. Using classification models (supervised learning) to detect problems in artificial insemination. Grzesiak, Wilhelm, et al. "Detection of cows with insemination problems using selected classification models." Computers and electronics in agriculture 74.2 (2010): 265-273.

Machine Learning Examples 1. Using classification models (supervised learning) to detect problems in artificial insemination. Inputs (for 1200 cows): lactation number, % HF genome, sex of the calf, age of cow, AI season, health metric, % of fat/protein in milk (nominal phenotypes, categorical phenotypes, environmental factors). Output classes: good cow vs. bad cow.

Machine Learning Examples 1. Using classification models (supervised learning) to detect problems in artificial insemination. Inputs (for 1200 cows): lactation number, % HF genome, sex of the calf, age of cow, AI season, health metric, % of fat/protein in milk (nominal phenotypes, categorical phenotypes, environmental factors). Classifiers: linear classifiers, logistic regression, artificial neural networks, multivariate adaptive regression splines. Output classes: good cow vs. bad cow.

Machine Learning Examples 1. Using classification models (supervised learning) to detect problems in artificial insemination. Inputs (for 1200 cows): lactation number, % HF genome, sex of the calf, age of cow, AI season, health metric, % of fat/protein in milk (genetic information, nominal phenotypes, categorical phenotypes, environmental factors). Output classes: good cow vs. bad cow. Logistical and economical implications of the classification outcome: false positives vs. false negatives.
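As a rough, hedged illustration of this classification setup (not the pipeline used by Grzesiak et al.), the sketch below trains a logistic regression classifier on cow-level features; the file name cows.csv, the column names, and the label column are hypothetical placeholders.

```python
# Hypothetical sketch of the "good cow / bad cow" classification task.
# The file "cows.csv" and all column names are placeholders, not the study's data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("cows.csv")  # ~1200 rows, one per cow (placeholder file)
features = ["lactation_number", "pct_hf_genome", "calf_sex", "cow_age",
            "ai_season", "health_metric", "pct_fat_protein"]
X = pd.get_dummies(df[features], columns=["calf_sex", "ai_season"])  # encode categorical inputs
y = df["insemination_problem"]  # 1 = problem ("bad cow"), 0 = none ("good cow")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# False positives and false negatives carry different economic costs,
# so inspect the full confusion matrix rather than accuracy alone.
print(confusion_matrix(y_test, clf.predict(X_test)))
```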

Machine Learning Examples 2. Clustering dairy cows based on their phenotypic traits: "Principal component and clustering analysis of functional traits in Swiss dairy cattle." Turk. J. Vet. Anim. Sci. (2008). 3. Prediction of insemination outcome: Shahinfar, Saleh, et al. "Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms." Journal of Dairy Science (2014). 4. Predicting the lactation yield of dairy cows using multiple regression or neural networks: Grzesiak, W., et al. "A comparison of neural network and multiple regression predictions for 305-day lactation yield using partial lactation records." Canadian Journal of Animal Science (2003).
Phenotype-to-phenotype prediction studies (study: data; ML methods):
1: 10 phenotypes and environmental factors; ANN, logistic regression.
2: 5 phenotypes; hierarchical clustering, PCA.
3: 26 phenotypes and environmental factors; naïve Bayes, decision trees.
4: 7 phenotypes; ANN, multiple regression.
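For study 2 above (clustering on phenotypic traits), here is a minimal sketch of combining hierarchical clustering and PCA; the trait matrix is synthetic placeholder data, not the Swiss dairy data.

```python
# Placeholder sketch: hierarchical clustering + PCA of a cow-by-trait matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
traits = rng.normal(size=(200, 5))           # 200 cows x 5 functional traits (synthetic)
traits_std = StandardScaler().fit_transform(traits)

Z = linkage(traits_std, method="ward")       # agglomerative (hierarchical) clustering
clusters = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 groups

pca = PCA(n_components=2).fit(traits_std)    # principal components of the traits
scores = pca.transform(traits_std)           # 2-D projection for plotting/inspection
print(pca.explained_variance_ratio_, np.bincount(clusters)[1:])
```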

Machine Learning with High Dimensional Genetic Data. Genome Wide Association Studies. A unit of genetic variation, or genetic marker: the SNP (single nucleotide polymorphism). The goal is to associate one SNP (or several) with a phenotype, e.g. a disease. This is typically done by GWAS (Genome Wide Association Studies): which SNPs (or other markers) occur frequently within a population that has the trait of interest?
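As a toy illustration of the single-marker association idea behind GWAS (synthetic genotypes and phenotype; not a real GWAS pipeline, which would also correct for population structure and multiple testing), each SNP is tested against a binary trait and the markers are ranked by p-value.

```python
# Toy single-SNP association scan (synthetic genotypes coded 0/1/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_animals, n_snps = 500, 1000
genotypes = rng.integers(0, 3, size=(n_animals, n_snps))   # 0/1/2 allele counts per SNP
phenotype = rng.integers(0, 2, size=n_animals)             # 1 = animal has the trait

p_values = np.empty(n_snps)
for j in range(n_snps):
    # Contingency table of genotype class (0/1/2) vs. trait status for SNP j
    table = np.array([[np.sum((genotypes[:, j] == g) & (phenotype == t))
                       for t in (0, 1)] for g in (0, 1, 2)])
    p_values[j] = stats.chi2_contingency(table)[1]

# SNPs with the smallest p-values are the candidate associations.
print("Top associated SNPs:", np.argsort(p_values)[:10])
```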

Machine Learning with High Dimensional Genetic Data. Why Machine Learning? Quantitative traits (e.g. milk yield, disease, longevity) are controlled by multiple markers. Machine learning can associate multiple genetic markers with a phenotype AND find complex interactions between markers. Machine learning can facilitate dealing with redundant and irrelevant variables.

Example: From genotype to milk yield. 1. Using Neural Networks. Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC genetics 12.1 (2011): 87. Input: n = 297 cows, p = 35,798 SNPs, i.e. a small-n, large-p problem. Output: milk yield, protein yield, fat yield.

Example: From genotype to milk yield. 1. Using Neural Networks. Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC genetics 12.1 (2011): 87. Input: n = 297 cows, p = 35,798 SNPs, i.e. a small-n, large-p problem. Output: milk yield, protein yield, fat yield. Dealing with dimensionality: Bayesian regularized back-propagation, commonly used to avoid overfitting in BP; 297 variables derived from the original 35,798 by using genome-derived (SNP) relationships between the cows as inputs instead of the SNPs themselves, i.e. by constructing a matrix of genomic relationships that is analogous to a covariance matrix and is based on allele frequencies in the population.
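A genomic relationship matrix of this kind is commonly built with VanRaden's method; the sketch below shows that construction on synthetic 0/1/2 genotype codes, under the assumption that the paper's matrix follows this standard recipe.

```python
# Sketch of a VanRaden-style genomic relationship matrix:
#   G = Z Z' / (2 * sum(p_i * (1 - p_i)))
# Synthetic genotypes; the paper's exact construction may differ.
import numpy as np

rng = np.random.default_rng(2)
M = rng.integers(0, 3, size=(297, 35798)).astype(float)   # cows x SNPs, coded 0/1/2

p = M.mean(axis=0) / 2.0                    # allele frequency of each SNP
Z = M - 2.0 * p                             # center each SNP by twice its allele frequency
G = Z @ Z.T / (2.0 * np.sum(p * (1 - p)))   # 297 x 297 genomic relationship matrix

# Each row of G (one per cow) can then replace the raw 35,798 SNPs as network input.
print(G.shape)
```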

Example: From genotype to milk yield. 1. Using Neural Networks. Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC genetics 12.1 (2011): 87. Results: effective number of parameters.

Example: From genotype to milk yield. 1. Using Neural Networks. Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC genetics 12.1 (2011): 87. Results: mean squared error of the predictions.

Example: Genotype to feed efficiency. 1. Using Random Forests (decision trees). Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of dairy science 96.10 (2013). Input: 395 Holstein cows, 42,275 SNPs, i.e. a small-n, large-p problem. Output: residual feed intake of the cow, adjusted for environmental and external factors.

Example: Genotype to feed efficiency. 1. Using Random Forests (decision trees). Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of dairy science 96.10 (2013). Methods: decision trees, a predictive model with a tree structure based on if-else statements. At each node, pick the best split (the best question to ask).
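A minimal sketch of "picking the best split at a node" for a regression tree: scan the candidate thresholds of one feature and keep the one that most reduces the variance of the target (synthetic data; illustrative only).

```python
# Find the best split of one feature by variance reduction (regression-tree node).
import numpy as np

def best_split(x, y):
    """Return (threshold, variance_reduction) of the best split on feature x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, 0.0)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold separates identical values
        left, right = y[:i], y[i:]
        # Reduction = parent variance minus size-weighted child variances.
        reduction = np.var(y) - (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
        if reduction > best[1]:
            best = ((x[i] + x[i - 1]) / 2.0, reduction)
    return best

rng = np.random.default_rng(3)
snp = rng.integers(0, 3, size=200).astype(float)   # one SNP, coded 0/1/2
rfi = 0.5 * snp + rng.normal(size=200)             # synthetic residual feed intake
print(best_split(snp, rfi))
```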

Example: Genotype to feed efficiency. 1. Using Random Forests (decision trees). Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of dairy science 96.10 (2013). Methods: the Random Forests algorithm (an ensemble method). The output is the averaged outcome of all the weak learners (decision trees) in the ensemble.

Example: Genotype to feed efficiency. 1. Using Random Forests (decision trees). Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of dairy science 96.10 (2013). Dealing with dimensionality: bootstrapping for each tree in the forest; at each node of each tree, choosing the best split out of a subset of the p variables (e.g. 100 or 1,000), not all of them.
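A hedged sketch of the last two slides with scikit-learn's RandomForestRegressor: each tree is grown on a bootstrap sample of the cows, only a random subset of the SNPs is considered at each split (max_features), and the forest's prediction is the average over the trees. The data are synthetic and use fewer SNPs than the study's 42,275.

```python
# Random Forest regression of residual feed intake on SNPs (synthetic sketch).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(395, 5000)).astype(float)        # cows x SNPs (trimmed for the sketch)
y = X[:, :20] @ rng.normal(size=20) + rng.normal(size=395)    # synthetic residual feed intake

rf = RandomForestRegressor(
    n_estimators=500,      # number of trees in the ensemble
    max_features=1000,     # SNPs considered at each split (cf. the 100 / 1,000 on the slide)
    bootstrap=True,        # each tree sees a bootstrap sample of the cows
    n_jobs=-1,
    random_state=0,
).fit(X, y)

# The forest's prediction is the average of the individual trees' predictions.
print(rf.predict(X[:5]))
```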

Example: Genotype to feed efficiency. 1. Using Random Forests (decision trees). Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of dairy science 96.10 (2013). Results and findings: ranking SNPs according to their importance to the phenotype output (the implicit feature-ranking capability of decision trees); identifying pairs of epistatic genes through the RF structure, as they will tend to fall into the same branches of trees (parent-child node pairs).
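The importance-ranking part of this result can be read off a fitted forest's importances; below is a standalone sketch on synthetic data (the SNP indices mean nothing outside this example, and the epistasis analysis via shared branches is not reproduced here).

```python
# Rank SNPs by Random Forest importance (standalone sketch on synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(395, 2000)).astype(float)        # cows x SNPs (synthetic)
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=395)    # synthetic phenotype

rf = RandomForestRegressor(n_estimators=200, max_features=100,
                           n_jobs=-1, random_state=0).fit(X, y)

importances = rf.feature_importances_       # one score per SNP, summing to 1
ranking = np.argsort(importances)[::-1]     # most important SNPs first
print("Top 10 SNP indices:", ranking[:10])
print("Their importance scores:", importances[ranking[:10]])
```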

Dipartimento di Elettronica, Informazione e Bioingegneria. Master of Science in Automation and Control Engineering, Dec 2014. Supervisor: Andrea Castelletti. Co-supervisor: Stefano Galelli. Co-supervisor: Matteo Giuliani. Master Thesis by: Ahmad Alsahaf.

What is Model-order reduction (Emulation Modelling)? Such that: the emulator is less computationally intensive than the PB (physically based) model; the input-output behaviour accurately reproduces the PB model's behaviour; the emulator is credible from the user's/analyst's point of view (physical interpretability).

Recursive Variable Selection - a feature selection algorithm. (Diagram: state variables, exogenous inputs, and control variables as candidate inputs; one output variable; a >2% threshold appears in the diagram.)
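The slide does not spell out the algorithm; purely as an illustration of a threshold-based recursive selection loop (a generic sketch, not the thesis method), the code below repeatedly refits an Extra-Trees model and drops candidate variables whose importance falls below 2%.

```python
# Illustrative recursive variable selection with a 2% importance threshold.
# Generic sketch on synthetic data; not the specific algorithm used in the thesis.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 30))                        # candidate state/exogenous/control variables
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500)    # synthetic output variable

selected = list(range(X.shape[1]))
while len(selected) > 1:
    model = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X[:, selected], y)
    keep = [v for v, imp in zip(selected, model.feature_importances_) if imp > 0.02]
    if not keep or len(keep) == len(selected):        # stop when nothing (or everything) would be dropped
        break
    selected = keep                                   # recurse on the surviving variables

print("Selected variables:", selected)
```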

PCA vs Sparse PCA Coefficients heat map

PCA vs. Sparse and Weighted PCA: emulator performance. (Plot: explained variance / R2 against the number of principal components, 1-20, for PCA, WPCA, and SPCA.) Emulator structure: Extra-Trees (Geurts et al., 2006). Ref: Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
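A sketch of this kind of comparison with scikit-learn: reduce the inputs with PCA or SparsePCA (weighted PCA is omitted because scikit-learn has no implementation of it), feed the components to an Extra-Trees emulator, and report cross-validated R2 against the number of components. The data and target below are placeholders, not the thesis case study.

```python
# Sketch: compare PCA vs. SparsePCA features as inputs to an Extra-Trees emulator.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 40))                     # state variables of a (placeholder) PB model
y = X[:, :5].sum(axis=1) + rng.normal(size=300)    # placeholder emulator target

for n in (2, 5, 10, 20):
    for name, reducer in (("PCA", PCA(n_components=n)),
                          ("SPCA", SparsePCA(n_components=n, random_state=0))):
        Z = reducer.fit_transform(X)               # reduced inputs for the emulator
        r2 = cross_val_score(ExtraTreesRegressor(n_estimators=200, random_state=0),
                             Z, y, cv=5, scoring="r2").mean()
        print(f"{name:5s} with {n:2d} components: mean R2 = {r2:.2f}")
```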