GOBII Genomic & Open-source Breeding Informatics Initiative
My Background
BS Animal Science, University of Tennessee
MS Animal Breeding, University of Georgia
  Random regression models for longitudinal traits
PhD Statistical Genetics, University of Georgia
  Feature selection and prediction algorithms
Dow AgroSciences
  Quantitative Geneticist (2008-2011)
  Quantitative Genetics Group Leader (2011-2015)
  Development and implementation of global trial analysis system
  Development and implementation of genomic selection into NA corn breeding program
Genomic Data
More Data, More Information?
Genomic data is becoming increasingly cost-effective to generate.
  High-volume, high-dimensional data
  Need effective data management tools
  Need analysis pipelines to turn data into information
Genomic information does not replace phenotypic information.
  Must have quality multi-year and multi-environment data to take full advantage of genomic information
  Must be able to integrate genomic and phenotypic information
  Must have well-designed training datasets to achieve the needed prediction accuracies
Genomic Selection
Breeder's equation: $R = \dfrac{i \, r \, \sigma_g}{L}$, where $i$ is the selection intensity, $r$ the selection accuracy, $\sigma_g$ the genetic standard deviation, and $L$ the generation interval.
[Diagram: Phenotype, Environment, and Genotype feed a Train step, followed by a Predict step]
Potential Advantages of Genomic Selection
  Early discarding: first-stage screening based on genomic information
  Incorporate genomic information into early-stage trials and multi-year evaluations
  Early recycling: reduce stages to variety release
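The slide's labels (selection intensity i, selection accuracy r, genetic standard deviation σ_g, generation interval L) correspond to the breeder's equation R = i·r·σ_g / L. A minimal sketch of that arithmetic follows. The function name and the numeric values are illustrative assumptions, not figures from the talk; they only show how a lower accuracy can still give more gain per unit time when the generation interval shrinks.

```python
def genetic_gain_per_unit_time(i, r, sigma_g, L):
    """Expected response to selection per unit time: R = i * r * sigma_g / L."""
    if L <= 0:
        raise ValueError("generation interval L must be positive")
    return i * r * sigma_g / L


# Illustrative values only (assumptions, not taken from the slides).
# Phenotypic selection: higher accuracy, longer cycle.
phenotypic = genetic_gain_per_unit_time(i=1.4, r=0.8, sigma_g=1.0, L=6.0)
# Genomic selection: lower accuracy, shorter cycle.
genomic = genetic_gain_per_unit_time(i=1.4, r=0.5, sigma_g=1.0, L=3.0)

print(f"phenotypic: {phenotypic:.3f}  genomic: {genomic:.3f}")
```

Under these assumed inputs, halving L more than offsets the drop in r, which is the trade-off the "early recycling" point relies on.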
Accuracy ($r$): Key Drivers
  Genetic Architecture and Heritability
  Model
  Training Population
  Data
When properly implemented, is genomic selection accurate enough to drive increased genetic gain? Yes*
Z. Lin et al. Crop & Pasture Science 2014
[Figure: Histogram of Accuracy. x-axis: Accuracy (0.2 to 1.0); y-axis: Frequency (0 to 20)]
[Figure: scatter plot, Correlation = 0.7. Discarding: lose ~0.5%. Picking winners: advance ~33%.]
[Figure: scatter plot, Correlation = 0.5. Discarding: lose ~9%. Picking winners: advance ~20%.]
[Figure: scatter plot, Correlation = 0.3. Discarding: lose ~21%. Picking winners: advance ~8%.]
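The three correlation panels show that discarding on predictions gets costlier, and picking winners less reliable, as the correlation between predicted and true values falls. The simulation below is a sketch of that idea under stated assumptions: true and predicted values are bivariate normal, the bottom half is discarded on predictions, and the top 10% are the "winners". The slides do not state the thresholds behind their percentages, so these settings are hypothetical and the output is not expected to reproduce the slide numbers.

```python
import numpy as np


def simulate_selection(corr, n=200_000, discard_frac=0.5, top_frac=0.1, seed=0):
    """Simulate selecting on predictions that correlate `corr` with true values.

    Returns (lost, hit):
      lost = share of truly top lines that fall in the discarded group
      hit  = share of lines picked on predictions that are truly top lines
    """
    rng = np.random.default_rng(seed)
    true = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    # Construct predictions with the requested correlation to the true values.
    pred = corr * true + np.sqrt(1.0 - corr**2) * noise

    true_top = true >= np.quantile(true, 1.0 - top_frac)
    discarded = pred < np.quantile(pred, discard_frac)
    picked = pred >= np.quantile(pred, 1.0 - top_frac)

    lost = float(np.mean(discarded[true_top]))
    hit = float(np.mean(true_top[picked]))
    return lost, hit


for corr in (0.7, 0.5, 0.3):
    lost, hit = simulate_selection(corr)
    print(f"corr={corr}: lost {lost:.1%} of true top lines; "
          f"{hit:.1%} of picks are true top lines")
```

Changing `discard_frac` and `top_frac` shows how sensitive both quantities are to where the cut-offs sit, which matters when deciding how aggressively to screen at a given accuracy.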
Modifying the Funnel
[Diagram: funnel with stages Training, Prediction, Early Stage Screening, Characterization, Release; annotated with $i$, $\sigma_g$, and $L$]
Widen the funnel
  Discard lines with low likelihood of success or absence of key traits based on genomic information
  Can increase lines screened without increases in yield trial plot load (heavier nursery plot load)
  Increase selection intensity
Shorten the funnel
  As accuracies of genomic predictions increase, it becomes possible to replace the first stage of screening with GS and make recycling decisions earlier
  Reduce the generation interval
Key Components Breeding Strategy Phenotypic Information Data Management (BMS) Analysis Pipelines Skilled Breeders Genomic Information Data Management (GOBII)
GOBII
Mission
To work closely with CGIAR centers to develop open-source capabilities and enable the implementation of genomic and marker-assisted selection for staple crops in the developing world.
Vision
Effective deployment of genomic information in breeding programs has the potential to significantly increase genetic gain in key crop performance traits. This can lead to staple crop varieties with improved yields and better adaptation to growing conditions in South Asia and Sub-Saharan Africa, bringing us closer to providing a sustainable and reliable food supply.
Execution and Implementation
Many Transformative Efforts Fail
  Many failed initiatives have great strategies
  They fall apart in the execution
Need to have clearly defined objectives
  Define the most critical elements and focus on those (must-haves)
  Clearly defined deliverables aligned to those critical elements
Action
  Avoid planning paralysis
Engagement
Commitment
Initial Phase Strategy
Prioritize initial deliverables based on
  Urgency of the need across CG centers
  Technical feasibility (look for low-hanging fruit)
  Dependencies on other deliverables
Leverage existing components to the fullest extent possible
Direct all user interaction through an API (focus on BRAPI), allowing the development team to switch out components on the back end with minimal user disruption
Quickly piecing together a system to meet users' immediate needs should buy time to develop a truly next-generation solution for Phase 2 implementation
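The point about routing all user interaction through an API so that back-end components can be swapped is a design pattern that a short sketch can make concrete. Everything named here (`MarkerBackend`, `InMemoryBackend`, `GenomicApi`, `get_allele_matrix`) is hypothetical and is not part of BRAPI or any GOBII code; the sketch only illustrates the separation between a stable API layer and a replaceable store.

```python
from __future__ import annotations

from typing import Protocol, Sequence


class MarkerBackend(Protocol):
    """Hypothetical contract that any marker store must satisfy."""

    def allele_matrix(
        self, samples: Sequence[str], markers: Sequence[str]
    ) -> list[list[str]]: ...


class InMemoryBackend:
    """Stand-in for an existing marker database (assumption for illustration)."""

    def __init__(self, calls: dict[tuple[str, str], str]):
        self._calls = calls

    def allele_matrix(self, samples, markers):
        # Missing calls are returned as "N".
        return [[self._calls.get((s, m), "N") for m in markers] for s in samples]


class GenomicApi:
    """API layer: clients depend on this class, never on a backend directly."""

    def __init__(self, backend: MarkerBackend):
        self._backend = backend

    def get_allele_matrix(self, samples, markers):
        return {
            "samples": list(samples),
            "markers": list(markers),
            "data": self._backend.allele_matrix(samples, markers),
        }


backend = InMemoryBackend({("L1", "m1"): "A/A", ("L2", "m1"): "A/T"})
api = GenomicApi(backend)
print(api.get_allele_matrix(["L1", "L2"], ["m1", "m2"]))
```

Replacing `InMemoryBackend` with a class backed by a different store would leave `GenomicApi` callers unchanged, which is the minimal-disruption property the slide describes.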
[Diagram: system architecture. Components: Sequence Data File Store; Meta Data DB; Pipeline: Genomic Variant Calls and Imputation; Marker Variant DB; BRAPI; LIMS; Client-Side Application and GUI; Field Trial Management System]
Work Packages
WP1 Breeding Workflow Mapping/Project Prioritizations
WP2 Data Warehouse/DataMart
WP3 Server Application: Data Analysis Pipelines
WP4 Genomic API/ETL
WP5 Client Application(s): Breeder Tools
Breeding Workflow Mapping/Project Prioritizations
Breeding processes and strategy for each breeding program
  Line development process and timelines
  Key decision points
  Key traits
  GS and MAS strategies
Understand marker workflows
  How marker data is pulled and filtered
  Common marker analyses
  Where markers are deployed in the breeding process
Set initial prioritizations
  Understand critical marker needs that are not being met with current systems
Data Warehouse/DataMart
Sequence Data
  Compressed FASTQ files
Meta Data
  Relational database linking sample information to compressed FASTQ files
  Sample and marker meta information
    Support basic BRAPI marker statistic calls
  Physical and linkage map information
    Support BRAPI genomic maps calls
Marker Calls
  Set up initial solution using currently implemented marker DBs
    Support BRAPI allele matrix call
  Select, mock up, and test large matrix store DB solutions: PostgreSQL/Citus, MonetDB, Cassandra, HBase, MongoDB
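As a sketch of the relational metadata store that links sample information to compressed FASTQ files, the example below builds a tiny in-memory SQLite database and selects files by a sample attribute. The table names, column names, sample records, and file paths are all hypothetical, and SQLite is used only so the example runs on its own; the slides do not specify a schema or a database engine for this layer.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory metadata DB with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample (
            sample_id      INTEGER PRIMARY KEY,
            germplasm_name TEXT NOT NULL,
            crop           TEXT NOT NULL
        );
        CREATE TABLE fastq_file (
            file_id   INTEGER PRIMARY KEY,
            sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
            path      TEXT NOT NULL
        );
        """
    )
    # Made-up records for illustration only.
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(1, "B73", "maize"), (2, "Mo17", "maize"), (3, "IR64", "rice")],
    )
    conn.executemany(
        "INSERT INTO fastq_file VALUES (?, ?, ?)",
        [
            (1, 1, "/data/fastq/B73_R1.fastq.gz"),
            (2, 2, "/data/fastq/Mo17_R1.fastq.gz"),
            (3, 3, "/data/fastq/IR64_R1.fastq.gz"),
        ],
    )
    return conn


def fastq_paths_for_crop(conn, crop):
    """Return FASTQ paths for all samples of a crop, via a parameterized join."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM fastq_file AS f
        JOIN sample AS s ON s.sample_id = f.sample_id
        WHERE s.crop = ?
        ORDER BY f.path
        """,
        (crop,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(fastq_paths_for_crop(conn, "maize"))
```

A query of this shape is also the kind of metadata lookup a file selection tool could issue before handing paths to a variant-calling pipeline.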
Server Application
Variant Calling and Imputation Pipeline
  Leverage existing pipeline(s)
File Selection Tool
  Based on SQL queries of meta data
Analysis
  G matrix calculations (possibly using TASSEL implementations)
  Calculations of LD
  PCoA decompositions
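To make the G matrix and PCoA items concrete, here is a minimal NumPy sketch. It assumes genotypes coded as 0/1/2 allele dosages with no missing calls and uses the widely cited VanRaden (method 1) formula for G; the slides do not say which formula GOBII would use, and this is not the TASSEL implementation. The PCoA function is classical multidimensional scaling on a distance matrix. The genotype values are made up.

```python
import numpy as np


def vanraden_g(genotypes):
    """Genomic relationship matrix G = Z Z' / (2 * sum(p * (1 - p))).

    genotypes: lines x markers array of 0/1/2 dosages, no missing values.
    """
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0          # allele frequency per marker
    Z = M - 2.0 * p                   # center each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))
    if denom == 0:
        raise ValueError("all markers are monomorphic")
    return Z @ Z.T / denom


def pcoa(dist, k=2):
    """Classical PCoA: coordinates from the top-k eigenvectors of the
    double-centered squared distance matrix."""
    D2 = np.asarray(dist, dtype=float) ** 2
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[order], vecs[:, order]
    # Clip tiny negative eigenvalues caused by floating-point error.
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


# Made-up genotypes: 4 lines x 5 markers.
genotypes = np.array([
    [0, 1, 2, 0, 1],
    [1, 1, 0, 2, 1],
    [2, 0, 1, 1, 0],
    [0, 2, 2, 1, 1],
])
G = vanraden_g(genotypes)

# Euclidean distances between lines, then a 2-D PCoA for plotting.
diff = genotypes[:, None, :] - genotypes[None, :, :]
dist = np.sqrt((diff.astype(float) ** 2).sum(axis=2))
coords = pcoa(dist, k=2)
print(np.round(G, 3))
print(np.round(coords, 3))
```

At production scale the dense `Z @ Z.T` product and the full eigendecomposition become the bottleneck, which is one reason to run these on the server side rather than in the client.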
Genomic API/ETL
API
  BRAPI implementation via web interface
  Custom GOBII API calls when needed
ETL
  Mapping common queries to DB schemas
  Pull large blocks of data, filtering on sample and marker characteristics
  Pull lines carrying haplotypes of interest
Client Application(s)
Visualizations
  PCoA
  Connection to Flapjack
  LD matrices and LD decay
SNP calling pipeline
File selection tool
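The LD matrices mentioned among the visualizations can be sketched as a pairwise r² matrix between markers. This example assumes 0/1/2 dosage coding with no missing calls and uses the squared Pearson correlation of dosages, a genotype-based approximation that does not require phased haplotypes. The slides do not specify an LD estimator, so this choice and the toy data are assumptions.

```python
import numpy as np


def ld_r2(genotypes):
    """Pairwise r^2 between markers from a samples x markers dosage matrix."""
    M = np.asarray(genotypes, dtype=float)
    if np.any(M.std(axis=0) == 0):
        raise ValueError("monomorphic markers must be filtered out first")
    r = np.corrcoef(M, rowvar=False)  # markers are columns
    return r ** 2


# Made-up dosages: 6 samples x 3 markers.
genotypes = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 2],
    [1, 1, 2],
    [2, 2, 1],
])
r2 = ld_r2(genotypes)
print(np.round(r2, 3))
```

An LD decay plot would pair each off-diagonal r² value with the physical or genetic distance between the two markers, taken from the map information held in the metadata store.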
Thank You