Organization and analysis of NGS variations. Alireza Hadj Khodabakhshi Research Investigator



Similar documents
Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Next Generation Sequencing: Technology, Mapping, and Analysis

What is Cancer? Cancer is a genetic disease: Cancer typically involves a change in gene expression/function:

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Delivering the power of the world s most successful genomics platform

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

escience and Post-Genome Biomedical Research

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

SAP HANA Enabling Genome Analysis

PhD theses. Dr. Reiniger Lilla. Semmelweis University Doctoral School of Pathology

Sequencing and microarrays for genome analysis: complementary rather than competing?

Antibody Structure, and the Generation of B-cell Diversity CHAPTER 4 04/05/15. Different Immunoglobulins

Data deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Simplifying Data Interpretation with Nexus Copy Number

G E N OM I C S S E RV I C ES

PART 3.3: MicroRNA and Cancer

MUTATION, DNA REPAIR AND CANCER

-> Integration of MAPHiTS in Galaxy

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Expression Quantification (I)

Genomes and SNPs in Malaria and Sickle Cell Anemia

LifeScope Genomic Analysis Software 2.5

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Next generation DNA sequencing technologies. theory & prac-ce

Targeted. sequencing solutions. Accurate, scalable, fast TARGETED

PREDA S4-classes. Francesco Ferrari October 13, 2015

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

Analysis of NGS Data

GeneProf and the new GeneProf Web Services

High Performance Compu2ng Facility

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Disease gene identification with exome sequencing

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Next Generation Sequencing

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

Agilent CytoGenomics Software A Complete Solution for Cytogenetic Research Data Analysis

Basic processing of next-generation sequencing (NGS) data

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Introduction to next-generation sequencing data

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Version 5.0 Release Notes

Cancer Genomics: What Does It Mean for You?

Human Genome Organization: An Update. Genome Organization: An Update

Information leaflet. Centrum voor Medische Genetica. Version 1/ Design by Ben Caljon, UZ Brussel. Universitair Ziekenhuis Brussel

Special report. Chronic Lymphocytic Leukemia (CLL) Genomic Biology 3020 April 20, 2006

GC3 Use cases for the Cloud

School of Nursing. Presented by Yvette Conley, PhD

Core Facility Genomics

Overview of Next Generation Sequencing platform technologies

CHALLENGES IN NEXT-GENERATION SEQUENCING

Visualizing Networks: Cytoscape. Prat Thiru

A Primer of Genome Science THIRD

HPC Cloud. Focus on your research. Floris Sluiter Project leader SARA

Biological Sciences Initiative. Human Genome

NCBI resources III: GEO and ftp site. Yanbin Yin Spring 2013

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Tutorial. Reference Genome Tracks. Sample to Insight. November 27, 2015

Module 1. Sequence Formats and Retrieval. Charles Steward

Cloud-Based Big Data Analytics in Bioinformatics

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Introduction to NGS data analysis

Biomedical Big Data and Precision Medicine

Frequently Asked Questions Next Generation Sequencing

What s New in Pathway Studio Web 11.1

Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

CHAPTER 2: UNDERSTANDING CANCER

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

SOP 3 v2: web-based selection of oligonucleotide primer trios for genotyping of human and mouse polymorphisms

Genomic Analysis of Mature B-cell Malignancies

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

New solutions for Big Data Analysis and Visualization

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Overview sequence projects

Bioinformatics Grid - Enabled Tools For Biologists.

Factors for success in big data science

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

DNA, RNA, Protein synthesis, and Mutations. Chapters

mygenomatix - secure cloud for NGS analysis

Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs

HENIPAVIRUS ANTIBODY ESCAPE SEQUENCING REPORT

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Genome and DNA Sequence Databases. BME 110/BIOL 181 CompBio Tools Todd Lowe March 31, 2009

Protein Protein Interaction Networks

Challenges associated with analysis and storage of NGS data

GenomeStudio Data Analysis Software

Transcription:

Organization and analysis of NGS variations. Alireza Hadj Khodabakhshi Research Investigator

Why is the NGS data processing a big challenge? Computation cannot keep up with the Biology.

Source: illumina

Source: illumina 1000 human gnome 50 whole genome per day 5 tera bytes (only mapped reads) per day

Source: strand-

Less tools are av ioinformatics of NGS data Performed by the sequencing instrument. Has been the main focu Bioinformatics research nesifter: Next Generation Data Management and Analysis for Next Generation Sequencing. p://www.geospiza.com/

rganizing the variation data. Scalable. Enable insightful queries in a timely manner. Support various NGS data (variations, expressions, annotations, ).

consortium of databases for genomic iscovery.

Sample Database(SampleDB): clinical and experimental information of the samples (type of disease, pathology, age, sex, ). Annotation Database(AnnotationDB): annotations of genomic regions (sources: UCSC, Ensembl,.) Structural Variation Database(SVDB): genomic structural variations (translocations, inversions, large indels). Expression Database (ExpressionDB): expression levels of genomic regions (RPKM values). Human Variation Database (HVDB): small genomic variants (SNP, small indels) Loss of Heterozygosity & Copy Number Variation (LOH_CNV)

uman Variation Database (HVDB) Starting point of the consortium. Stores SNPs and small indels. Contains more than 4 billion variations across over 6000 samples. Implemented with PostgreSQL and Java. Its template and APIs are publically available.

nalyzing the data Mutated pathways in types of cancer. Variation hotspots. Correlation between various variation types (eg. correlation between SNVs and genomic translocations). Correlation between variations and expressions.

utation analysis pipeline: A high throughput pipeline on top of the genomic database consortium. Current version identifies statistically significant mutational hotspots.

alidating the variations Through the analysis of mapped read (raw data) at the variation site. Calculates the confidence based the ratio of good reads that support the variation. Uses the mapped read of the matched normal if available. The process is performed on a computing cluster in a parallel way

ment report.

erformance ata size: 2.5 billion variations ~2000 cancer samples. ~2000 normal samples. latform: Linux Centos PostgreSQL database. Java APIs. Database server: eight core Xeon 3.00 GHz, 64 GB memory Application machine: 4 core, 8 GB memory he pipeline run completes in about 23 hours.

Analysis of 40 DLBCL genomes.

oal: Identify mutational hotspots in DLBCL genome. ohort: 40 whole genome DLBCL samples and their matched normal amples. onclusion: Small regions in the promoter of certain genes harbor an xtraordinary amount of somatic mutations. hese regions undergo somatic hypermutation.

omatic HyperMutation (SHM) Naturally occurs in B-Cell development to generate diverse antibodies. It occurs in variable region of immunoglobulin genes. 10 5-10 6 fold greater than the normal rate of mutation across the genome. Mutations are mostly single base substitution (insertion and deletions are less common).

SHM Characteristics SHM has a tendency toward certain motifs in DNA sequence, most significantly WRCY (where W denotes A or T; R denotes A or G; and Y denotes C or T) or its reverse complement RGYW

SHM can aberrantly target proto-oncogenes oncogenes (BCL6, PIM1, MYC, RHOH, PAX5) and tumor suppressors (CD95). Such mistargeting of SHM (ashm) contributes to the development of diffuse large B-ell lymphomas. SHM also has a driving role in chromosomal translocations in B-cell lymphomas. In the past decade twelve genes had been identified to have ashm. In addition to these genes our analysis identifies many more. Are these novel genes really targeted by ashm?

Do they show characteristics of SHM? More Transition than Transversion SNVs. Tendency toward WRCY/RGYW motif. More C:G mutations than A:T A bell shape mutation distribution around TSS. e studied these characteristics for the genes that had similar or higher mutation te than those known to be ashm targets (44 genes).

Oncotarget 3: 1308-13

All known targets of ashm (12 genes) are in the list and 75% of them have a significant ashm indicator (a good control for our analysis). More than 81 and 90 percent of the SHM-targets showed a bias for SHM criteria Motif enriched and C:G vs A:T mutation bias. If these gene are enriched with ashm mutations, a random mutated gene should have a significantly less ashm indicator value.

The difference in RPKM values reflects a trend towards higher mrna abundance of the mutated genes. This coincide with the observation that gene expression promotes SHM.

lation between tions and ngements

uture works The processing pipeline is specially in early stages (include more analysis). Data visualization and GUI to browse results. Utilizing Big Data technologies to improve performance. Incorporate other data sources in a systematic way (pathways, PPI networks, ). Implement mechanisms to share data.

Anthony Fejes Steven Jones Inanc Birol Ryan Morin Maria Mendez-Lago Nina Thiessen An He Richard Varhol Tina Wang Richard Corbett Misha Bilenky Gordon Robertson Andy Chu Readman Chiu Karen Mungall

Thanks