GOBII. Genomic & Open-source Breeding Informatics Initiative


My Background
  BS Animal Science, University of Tennessee
  MS Animal Breeding, University of Georgia
    Random regression models for longitudinal traits
  PhD Statistical Genetics, University of Georgia
    Feature selection and prediction algorithms
  Dow AgroSciences
    Quantitative Geneticist (2008-2011)
    Quantitative Genetics Group Leader (2011-2015)
    Development and implementation of global trial analysis system
    Development and implementation of genomic selection into NA corn breeding program

Genomic Data: More Data, More Information?
  Genomic data is becoming increasingly cost effective to generate.
    High-volume and high-dimensional data
    Need effective data management tools
    Need analysis pipelines to turn data into information
  Genomic information does not replace phenotypic information.
    Must have quality multi-year and multi-environment data to take full advantage of genomic information
    Must be able to integrate genomic and phenotypic information
    Must have well-designed training datasets to achieve needed prediction accuracies

Genomic Selection
  $R = \frac{i \, r \, \sigma_g}{L}$
    $i$: selection intensity
    $r$: selection accuracy
    $\sigma_g$: genetic standard deviation
    $L$: generation interval
  [Diagram labels: Phenotype, Environment, Genotype; Train; Predict; $i$, $\sigma_g$, $r$, $L$]
  Potential Advantages of Genomic Selection
    Early discarding: first-stage screening based on genomic information
    Incorporate genomic information into early-stage trials and multi-year evaluations
    Early recycling: reduce stages to variety release
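The breeder's equation on this slide can be tried out numerically. The sketch below is only an illustration: the function name and all input values are hypothetical and do not come from the talk. It compares a scheme with higher accuracy but a long generation interval against one with lower accuracy but a shorter interval.

```python
def response_per_year(i, r, sigma_g, L):
    """Expected genetic gain per year, R = i * r * sigma_g / L.

    i       : selection intensity
    r       : selection accuracy
    sigma_g : genetic standard deviation
    L       : generation interval in years
    """
    if L <= 0:
        raise ValueError("generation interval L must be positive")
    return i * r * sigma_g / L


# Hypothetical inputs, chosen only to show the trade-off between r and L.
# i = 1.755 is the standardized selection intensity for keeping the top 10%.
phenotypic = response_per_year(i=1.755, r=0.7, sigma_g=1.0, L=6.0)
genomic = response_per_year(i=1.755, r=0.5, sigma_g=1.0, L=3.0)

print(f"phenotypic: {phenotypic:.4f} per year")
print(f"genomic:    {genomic:.4f} per year")
```

With these assumed numbers, the lower-accuracy scheme still gains more per year because halving L outweighs the drop in r.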

Accuracy ($r$): Key Drivers
  Genetic architecture and heritability
  Model
  Training population
  Data
When properly implemented, is genomic selection accurate enough to drive increased genetic gain? Yes*

Z. Lin et al. Crop & Pasture Science 2014

[Figure: Histogram of Accuracy. x-axis: Accuracy (0.2 to 1.0); y-axis: Frequency (0 to 20).]

Correlation = 0.7
  [Figure: scatter plot, both axes from -5 to 5]
  Discarding: lose ~0.5%
  Picking winners: advance ~33%

Correlation = 0.5
  [Figure: scatter plot, both axes from -5 to 5]
  Discarding: lose ~9%
  Picking winners: advance ~20%

Correlation = 0.3
  [Figure: scatter plot, both axes from -5 to 5]
  Discarding: lose ~21%
  Picking winners: advance ~8%
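The effect of prediction accuracy on discarding and on picking winners can be explored with a small simulation. This is a sketch under stated assumptions: the slides do not give the thresholds behind their percentages, so the discard fraction (bottom 50% of predictions) and the definition of a winner (top 10% of true values) are hypothetical, and the output is not expected to reproduce the slide figures.

```python
import math
import random


def simulate_selection(corr, n=20000, discard_frac=0.5, top_frac=0.1, seed=0):
    """Simulate true and predicted values with a given correlation.

    Returns (lost, hit):
      lost: share of true top lines that fall in the discarded group
      hit : share of true top lines that are also in the predicted top group
    discard_frac and top_frac are assumed thresholds, not values from the talk.
    """
    rng = random.Random(seed)
    true_vals = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_sd = math.sqrt(1.0 - corr ** 2)
    preds = [corr * t + noise_sd * rng.gauss(0.0, 1.0) for t in true_vals]

    n_top = int(n * top_frac)
    n_discard = int(n * discard_frac)
    by_true = sorted(range(n), key=lambda k: true_vals[k], reverse=True)
    by_pred = sorted(range(n), key=lambda k: preds[k])  # ascending prediction

    true_top = set(by_true[:n_top])
    discarded = set(by_pred[:n_discard])
    picked = set(by_pred[n - n_top:])

    lost = len(true_top & discarded) / n_top
    hit = len(true_top & picked) / n_top
    return lost, hit


for corr in (0.7, 0.5, 0.3):
    lost, hit = simulate_selection(corr)
    print(f"corr={corr}: lost {lost:.1%} of true winners, picked {hit:.1%}")
```

The qualitative pattern matches the slides: as the correlation drops, more true winners are discarded and fewer are picked.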

Modifying the Funnel
  [Funnel diagram stages: Training, Prediction, Early Stage Screening, Characterization, Release; symbols: $i$, $\sigma_g$, $L$]
  Widen the funnel
    Discard lines with low likelihood of success or absence of key traits based on genomic information.
    Can increase lines screened without increasing yield trial plot load (heavier nursery plot load)
    Increase selection intensity
  Shorten the funnel
    As accuracies of genomic predictions increase, it becomes possible to replace the first stage of screening with GS and make recycling decisions earlier.
    Reduce the generation interval

Key Components
  Breeding Strategy
  Phenotypic Information
    Data Management (BMS)
  Analysis Pipelines
  Skilled Breeders
  Genomic Information
    Data Management (GOBII)

GOBII
  Mission
    To work closely with CGIAR centers to develop open-source capabilities and enable the implementation of genomic and marker-assisted selection for staple crops in the developing world.
  Vision
    Effective deployment of genomic information in breeding programs has the potential to significantly increase genetic gain in key crop performance traits. This can lead to staple crop varieties with improved yields and better adaptation to growing conditions in South Asia and Sub-Saharan Africa, bringing us closer to providing a sustainable and reliable food supply.


Execution and Implementation
  Many Transformative Efforts Fail
    Many failed initiatives have great strategies
    They fall apart in the execution
  Need to have clearly defined objectives
    Define the most critical elements and focus on those (must-haves)
    Clearly defined deliverables aligned to those critical elements
  Action
    Avoid planning paralysis
  Engagement
  Commitment

Initial Phase Strategy
  Prioritize initial deliverables based on
    Urgency of the need across CG centers
    Technical feasibility (look for low-hanging fruit)
    Dependencies on other deliverables
  Leverage existing components to the fullest extent possible
  Direct all user interaction through an API (focus on BRAPI), allowing the development team to switch out components on the back end with minimal user disruption
  Quickly piecing together a system to meet immediate needs of users should buy time to develop a truly next-gen solution for Phase 2 implementation
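The idea of routing all user interaction through an API so that back-end components can be swapped can be sketched as follows. All class and method names here are hypothetical illustrations; they are not BRAPI endpoints and not GOBII code.

```python
from abc import ABC, abstractmethod


class MarkerStore(ABC):
    """Hypothetical back-end interface; any storage engine can implement it."""

    @abstractmethod
    def allele_calls(self, sample_id):
        """Return {marker_name: call} for one sample."""


class InMemoryStore(MarkerStore):
    """Back end 1: calls held in a nested dict."""

    def __init__(self, calls):
        self._calls = calls

    def allele_calls(self, sample_id):
        return dict(self._calls.get(sample_id, {}))


class DelimitedTextStore(MarkerStore):
    """Back end 2: calls parsed from 'sample,marker,call' lines."""

    def __init__(self, text):
        self._rows = [line.split(",") for line in text.strip().splitlines()]

    def allele_calls(self, sample_id):
        return {m: c for s, m, c in self._rows if s == sample_id}


class GenomicApi:
    """The only layer client code talks to; the store behind it can change."""

    def __init__(self, store):
        self._store = store

    def get_allele_calls(self, sample_id):
        return self._store.allele_calls(sample_id)


# The same client call works against either back end.
api_v1 = GenomicApi(InMemoryStore({"line_A": {"m1": "A/A", "m2": "A/T"}}))
api_v2 = GenomicApi(DelimitedTextStore("line_A,m1,A/A\nline_A,m2,A/T\nline_B,m1,T/T"))
print(api_v1.get_allele_calls("line_A"))
print(api_v2.get_allele_calls("line_A"))
```

Because clients depend only on `GenomicApi`, a quickly assembled Phase 1 store could in principle be replaced later without changing client code.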

[System diagram components: Sequence Data File Store; Meta Data DB; Pipeline: Genomic Variant Calls and Imputation; BRAPI; LIMS; Marker Variant DB; Client Side Application and GUI; Field Trial Management System]

Work Packages
  WP1: Breeding Workflow Mapping/Project Prioritizations
  WP2: Data Warehouse/DataMart
  WP3: Server Application (Data Analysis Pipelines)
  WP4: Genomic API/ETL
  WP5: Client Application(s) (Breeder Tools)

Breeding Workflow Mapping/Project Prioritizations
  Breeding processes and strategy for each breeding program
    Line development process and timelines
    Key decision points
    Key traits
    GS and MAS strategies
  Understand marker workflows
    How marker data is pulled and filtered
    Common marker analyses
    Where markers are deployed in the breeding process
  Set initial prioritizations
    Understand critical marker needs that are not being met with current systems

Data Warehouse/DataMart
  Sequence Data
    Compressed FASTQ files
  Meta Data
    Relational database linking sample information to compressed FASTQ files
    Sample and marker meta information
      Support basic BRAPI marker statistic calls
    Physical and linkage map information
      Support BRAPI genomic maps calls
  Marker Calls
    Set up initial solution using currently implemented marker DBs
      Support BRAPI allele matrix call
    Select, mock up, and test large matrix store DB solutions
      PostgreSQL/Citus, MonetDB, Cassandra, HBase, MongoDB
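A relational metadata store that links samples to compressed FASTQ files might look like the minimal sketch below. The table names, column names, and file paths are all hypothetical, and SQLite is used only because it ships with Python; the talk does not specify a schema or an engine for this store.

```python
import sqlite3

# In-memory database with an assumed two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        crop      TEXT NOT NULL
    );
    CREATE TABLE fastq_file (
        file_id   INTEGER PRIMARY KEY,
        sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
        path      TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(1, "line_A", "maize"), (2, "line_B", "rice"), (3, "line_C", "maize")],
)
conn.executemany(
    "INSERT INTO fastq_file VALUES (?, ?, ?)",
    [
        (1, 1, "store/line_A.fastq.gz"),
        (2, 2, "store/line_B.fastq.gz"),
        (3, 3, "store/line_C.fastq.gz"),
    ],
)


def select_fastq_paths(connection, crop):
    """Return FASTQ paths for all samples of one crop, via a metadata query."""
    rows = connection.execute(
        """
        SELECT f.path
        FROM fastq_file AS f
        JOIN sample AS s ON s.sample_id = f.sample_id
        WHERE s.crop = ?
        ORDER BY f.path
        """,
        (crop,),
    )
    return [row[0] for row in rows]


maize_files = select_fastq_paths(conn, "maize")
print(maize_files)
```

A query of this kind is one way a file selection tool based on SQL queries of metadata could choose inputs for a variant calling pipeline.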

Server Application
  Variant Calling and Imputation Pipeline
    Leverage existing pipeline(s)
  File Selection Tool
    Based on SQL queries of meta data
  Analysis
    G matrix calculations (possibly using TASSEL implementations)
    Calculations of LD
    PCoA decompositions
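To make the G matrix item concrete, here is a minimal sketch of one common formulation, VanRaden's first method, in which genotypes coded 0/1/2 are centered by twice the allele frequency and scaled by 2 Σ p(1 - p). The slide does not say which formulation is intended, and this is not the TASSEL implementation; the function name and example data are hypothetical.

```python
def g_matrix(genotypes):
    """Genomic relationship matrix from 0/1/2 genotype codes.

    genotypes: list of rows, one row per line, one column per marker.
    Uses G = Z Z' / (2 * sum_j p_j (1 - p_j)), with Z = M - 2p,
    where p_j is the allele frequency estimated from the data itself.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic; G is undefined")
    centered = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(zi, zk)) / scale for zk in centered]
        for zi in centered
    ]


# Three lines, three markers (made-up data).
example = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 1, 1],
]
G = g_matrix(example)
for row in G:
    print(["%.3f" % value for value in row])
```

For real marker panels a vectorized implementation would be preferable; the pure-Python version here is only meant to show the arithmetic.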

Genomic API/ETL
  API
    BRAPI implementation via web interface
    Custom GOBII API calls when needed
  ETL
    Mapping common queries to DB schemas
    Pull large blocks of data, filtering on sample and marker characteristics
    Pull lines carrying haplotypes of interest

Client Application(s)
  Visualizations
    PCoA
    Connection to Flapjack
    LD matrices and LD decay
  SNP calling pipeline
  File selection tool
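The ETL item about pulling lines that carry haplotypes of interest can be illustrated with a small filter. The data layout (a dict of per-line marker calls), the function name, and the marker names are assumptions made for this sketch, not a description of the GOBII schema.

```python
def lines_with_haplotype(calls, haplotype):
    """Return the names of lines whose calls match every marker in the haplotype.

    calls     : {line_name: {marker_name: allele}}
    haplotype : {marker_name: required_allele}
    Lines with a missing call at any haplotype marker are excluded.
    """
    return sorted(
        line
        for line, alleles in calls.items()
        if all(alleles.get(marker) == allele for marker, allele in haplotype.items())
    )


# Made-up calls for three lines at three markers.
calls = {
    "line_A": {"m1": "A", "m2": "T", "m3": "G"},
    "line_B": {"m1": "A", "m2": "C", "m3": "G"},
    "line_C": {"m1": "A", "m2": "T"},  # m3 not genotyped
}

carriers = lines_with_haplotype(calls, {"m1": "A", "m2": "T"})
print(carriers)
```

In a production store the same filter would more likely be pushed down into the database query than applied in application code.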

Thank You