Open source analytics for Big Data in Big Pharma
Transcription
1 Open source analytics for Big Data in Big Pharma
Applications in next generation sequencing data
Miika Ahdesmaki, AstraZeneca
Cambridge Wireless Big Data SIG, 23 April 2015
2 Crash course to molecular biology
Central dogma
DNA is the ~static part
RNA is the dynamic middle man
- Only 1% of DNA is protein-coding (or exonic)
Proteins are involved in virtually all cell functions
We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
Figure: "Centraldogma nodetails" by Narayanese at English Wikipedia, own work. Licensed under Public Domain via Wikimedia Commons
3 Why NGS?
Personalised medicine:
- One drug for all patients is no longer realistic (especially in oncology)
- Different demographics have different variations of risks
- Understanding patient-specific needs will help guide their individual medication
Cancer is a genetic disease, most often the result of spurious mutations in DNA
- Understanding changes in cancer DNA can help defeat the disease
Next generation high throughput sequencing offers genome DNA analyses in days and for under $10k
4 What is next generation sequencing? Sequencing
NGS: massively parallel DNA sequencing
Oncology is the biggest consumer of NGS at AZ
We sequence RNA and DNA e.g. from
- Clinical samples
- Cell lines
- Xenografts / explants
5 What is next generation sequencing? Sequencing
The DNA/RNA is pre-processed and fragmented, and the short fragments are sequenced (in random order)
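The fragmentation step can be pictured with a toy sketch. The sequence, read length and function name below are illustrative assumptions, not part of the talk; real library preparation and sequencing chemistry are far more involved.

```python
import random


def shotgun_fragments(sequence, read_length, n_reads, seed=0):
    """Sample n_reads fixed-length fragments from random positions of sequence."""
    if read_length > len(sequence):
        raise ValueError("read_length must not exceed the sequence length")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    last_start = len(sequence) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(sequence[start:start + read_length])
    return reads


# Hypothetical toy "genome"; real genomes are billions of bases long.
toy_genome = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
reads = shotgun_fragments(toy_genome, read_length=8, n_reads=5)
print(reads)
```

The reads come back with no record of where they started, which is why the alignment step on the next slide is needed.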
6 What is next generation sequencing? Alignment
The short fragments are aligned to a reference sequence, such as the human reference HG19
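As a rough illustration of what alignment means, the sketch below slides a read along a reference and keeps the position with the fewest mismatches. This brute-force, ungapped search is an assumption made for clarity only; production aligners such as the bwa and bowtie2 tools named on slide 13 use indexed algorithms and handle gaps.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference, read):
    """Return (best_position, mismatches) for an ungapped brute-force alignment."""
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if best_mm is None or mm < best_mm:  # keep the first best position
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


# Toy reference and a read carrying one mismatch (both hypothetical).
reference = "ACGTTGCATGCCGATAGCTAGGCTAACG"
position, mismatches = align_read(reference, "GCCGTTAG")
print(position, mismatches)  # 9 1
```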
7 What is next generation sequencing? Downstream processing (variants, expression)
The alignments are further processed to answer the following questions
- How are the alignments different from the reference (SNPs, indels)?
- Which genes are expressed?
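The first question can be sketched as a pileup: count the bases that aligned reads place at each reference position and report positions where the consensus differs from the reference. The thresholds, names and data below are illustrative assumptions; the sketch ignores indels, base qualities and everything else a real variant caller models.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=2, min_fraction=0.8):
    """Report (position, ref_base, alt_base) where reads disagree with the reference.

    aligned_reads is a list of (start_position, read_sequence) pairs, ungapped.
    """
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence at this position
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


# Toy data: three overlapping reads all carry an A where the reference has G.
reference = "ACGTTGCATGCC"
aligned = [(0, "ACGTTACA"), (2, "GTTACATG"), (4, "TACATGCC")]
snps = call_snps(reference, aligned)
print(snps)  # [(5, 'G', 'A')]
```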
8 Uses of NGS
Sample types: explants, tumours (FFPE), tumours (fresh frozen), cell lines, clinical samples
NGS data: RNA-Seq; DNA-Seq (targeted, whole exome, whole genome)
Readouts: expression, variants, fusions, coding variants, coding and noncoding variants
Applications:
- Patient stratification
- Biomarkers for prognosis, drug response, safety
- New target ID
- Mechanism of drug action
- Mechanism of disease
- Mechanisms of resistance
9 Data generation and volumes
AZ: Mix of outsourced sequencing and internal data generation
Typical size of files per sample:
- Whole genome: GB
- Exome DNA-seq: 10-20GB
- RNA-seq: 10-15GB
- Single gene targeted: MB
In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
Typical study sizes: 100GB - 1TB raw compressed data
One of our most frequent Big Data problems
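A back-of-the-envelope sketch shows how the per-sample sizes and the pairing rule combine into a study size. The per-sample ranges are taken from the slide; the cohort size of 25 individuals and the function name are hypothetical.

```python
def study_volume_gb(n_individuals, per_sample_gb, paired=True):
    """Return (low, high) raw data volume in GB for a study.

    per_sample_gb is a (low, high) range; paired studies sequence two
    samples per individual (e.g. tumour and normal), doubling the volume.
    """
    low, high = per_sample_gb
    samples = n_individuals * (2 if paired else 1)
    return samples * low, samples * high


EXOME_GB = (10, 20)   # exome DNA-seq range from the slide
RNASEQ_GB = (10, 15)  # RNA-seq range from the slide

# Hypothetical cohort: 25 individuals, each sequenced as a tumour/normal pair.
low, high = study_volume_gb(25, EXOME_GB)
print(f"{low}-{high} GB")  # 500-1000 GB
```

Under these assumptions a modest paired exome study already reaches the upper end of the 100GB - 1TB range quoted on the slide.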
10 Data generation and volumes
Over the past 3-4 years we accumulated ~400TB of sequencing data via
- Acquiring public data sets (TCGA, ICGC)
- Vendor sequencing (major)
- Internal sequencing (minor)
Over we expect
- Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in )
- 1PB of sequencing data by mid 2016
Long term prediction of volumes is difficult
3-tiered storage for processing, short term storage and long term storage
- Amazon Glacier strongly considered for long term storage
11 Partnering with the leaders
Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology
Illumina, Inc. announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal NGS-based oncology test system
- The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection, resulting in a more comprehensive tool for precision medicine
12 Pipelines and analytics
13 Production: Dealing with the complexity
Number of NGS tools increases daily...
annotatebed bcbio_nextgen.py ctest hash_tar plot_roc.r srf_info vcffilter vcfrandom append_sff bcftools cuffcompare index_tar plot-vcfstats srf_list vcffixup vcfrandomsample bam12auxmerge bed12tobed6 cuffdiff interpolate_sam.pl prep_reads STAR vcfflatten vcfregionreduce bam12split bedgraphtobigwig cufflinks intersectbed psl2sam.pl subtractbed vcfgeno2alleles vcfregionreduce_and_cut bam12strip bedpetobam cuffmerge io_lib-config qualimap tabix vcfgeno2haplo vcfregionreduce_pipe bam2fastx bedpetobed12 dbilogstrip isnovoindex randombed tabtk vcfgenosamplenames vcfregionreduce_uncompressed bamadapterclip bedpetovcf dbiprof juncs_db rtg tagbam vcfgenosummarize vcfremap bamadapterfind bedtobam dbiproxy kmerprob s3cmd tophat vcfgenotypecompare vcfremoveaberrantgenotypes bamauxsort bedtobigbed expandcols liftover sam2vcf.pl tophat2 vcfgenotypes vcfremovenonatgc bamcat bedtoigv export2sam.pl linksbed sambamba tophat-fusion-post vcfglbound vcfremovesamples bamchecksort bed_to_juncs extract_fastq long_spanning_reads samblaster tophat_reports vcfglxgt vcfroc bamclipreinsert bedtools extract_qual lumpy sam_juncs trace_dump vcfgtcompare.sh vcfsample2info bamcollate bgzip extract_seq makescf samtools twobitinfo vcfhetcount vcfsamplediff bamcollate2 bigbedinfo facount map2gtf samtools.pl twobittofa vcfhethomratio vcfsamplenames bamdownsamplerandom bigbedsummary fasize mapbed scalpel unionbedgraphs vcfindelproximity vcfsitesummarize bamfilteraux bigbedtobed fastafrombed maq2sam-long scf_dump variant_effect_predictor.pl vcfindels vcfsnps bamfilterflags bigwiginfo fastqc maq2sam-short scf_info vcf2fasta vcfindex vcfsom bamfilterheader bigwigsummary fastqtobam maskfastafrombed scf_update vcf2sqlite.py vcfintersect vcfsort bamfilterrg bigwigtobedgraph fatotwobit md5fa scramble vcf2tsv vcfkeepgeno vcfstats bamfixmateinformation bigwigtowig featurecounts md5sum-lite scram_flagstat vcfaddinfo vcfkeepinfo
vcfstreamsort bamindex blast2sam.pl fetchchromsizes mergebed scram_merge vcfafpath vcfkeepsamples vcf_strip_extra_headers bamleftalign bowtie2 filter_vep.pl multibamcov scram_pileup vcfallelicprimitives vcfleftalign vcftobedpe bammapdist bowtie2-align fix_map_ordering multiintersectbed segment_juncs vcfaltcount vcflength vcfuniq bammarkduplicates bowtie2-build flankbed mutect jar seqtk vcfannotate vcfmultiallelic vcfuniqalleles bammarkduplicates2 bowtie2-inspect freebayes normalisefasta shufflebed vcfannotategenotypes vcfmultiway vcfutils.pl bammaskflags bowtie2sam.pl gatk-framework novo2paf slopbed vcfbiallelic vcfmultiwayscripts vcfvarstats bammdnm brew GenomeAnalysisTK.jar novo2sam.pl snpeff vcfbreakmulti vcfnobiallelicsnps vep_convert_cache.pl bammerge bwa genomecoveragebed novoalign soap2sam.pl vcfcat vcfnoindels vep_install.pl bam_merge ccmake get_comment novoaligncs SomaticAnalysisTK.jar vcfcheck vcfnosnps vt bamrank closestbed getoverlap novoaligncsmpi sortbed vcfclassify vcfnulldotslashdot wgsim bamrecompress clusterbed gffread novoalignmpi speedseq vcfcleancomplex vcfnumalt wgsim_eval.pl bamreset cmake glia novobarcode speedseq.config vcfclearid vcfoverlay wigtobigwig bamseqchksum complementbed grabix novoindex splitreadsamtobedpe vcfclearinfo vcfparsealts windowbed bamsort contig_to_chr_coords groupby novomethyl splittertobreakpoint vcfcombine vcfplotaltdiscrepancy.r windowmaker bamsplit convert_trace gtf_juncs novope2bed.pl sra_to_solid vcfcommonsamples vcfplotaltdiscrepancy.sh xmlwf bamsplitdiv coveragebed gtf_to_fasta novorun.pl srf2fasta vcfcomplex vcfplotsitediscrepancy.r zoom2sam.pl bamtobed cpack gtftogenepred novosort srf2fastq vcfcountalleles vcfplottstv.sh ztr_dump bamtofastq cpanm gtf_to_sam novoutil srf_dump_all vcfcreatemulti vcfprimers bamtofastq cram_dump hash_exp nucbed srf_extract_hash vcfdistance vcfprintaltdiscrepancy.r bamtools cram_index hash_extract pairtobed srf_extract_linear vcfecho vcfprintaltdiscrepancy.sh bamtools cramtools 
hash_list pairtopair srf_filter vcfentropy vcfqual2info bamzztoname crc32 hash_sff platypus srf_index_hash vcfevenregions vcfqualfilter
300+ (OSS) tools within our production framework
Infinite number of combinations to get it wrong
14 Production: Overcoming the complexity
Scalability, Reproducibility, Flexibility, Accessibility
Forced to use open source tools and OS (Linux), as no closed source alternatives exist
- Integration is challenging
- Variant calling and expression analysis are still very much open research questions, with rapidly changing code
- No licensing costs, but costs in internal and external consulting
Bcbio-nextgen
- An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
- Main developer Brad Chapman (HSPH)
- Unit tested, version controlled, development on GitHub
- Scalable across different clusters, schedulers, Amazon cloud
AZ is an active, recognised contributor and collaborator to HSPH and bcbio-nextgen
15 Production: Overcoming the complexity
Bcbio-nextgen overview
The user writes/modifies a high level configuration file specifying inputs and analysis parameters
- Very few tuning parameters -> Given the same data, two analysts will produce the same results
16 Getting it right
Given the rapid changes in the individual analysis tools, how do we know the pipeline gets it right?
Solution: reference standards
For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
- Samples from NA12878 can be bought off the shelf
- Compare sequencing and analytics results to the gold standard, establish sensitivity and PPV of variant calls, compare to other people's results
For tumour sequencing, several standards exist
- Horizon Diagnostics tumour standard
- ICGC-TCGA DREAM Mutation Calling challenge
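The comparison against a gold standard boils down to counting true positives, false positives and false negatives. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), on exact-match variant tuples. The variant coordinates are made up for illustration, and a real benchmark would also normalise variant representation and restrict the comparison to the regions the standard covers.

```python
def benchmark_calls(called, truth):
    """Compare called variants to a gold standard; both are iterables of
    (chrom, pos, ref, alt) tuples matched exactly."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the gold standard
    fp = len(called - truth)   # called but not in the gold standard
    fn = len(truth - called)   # in the gold standard but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "ppv": ppv}


# Hypothetical gold standard and pipeline output.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A"),
         ("chr2", 75, "T", "C"), ("chr3", 10, "A", "T")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A"),
          ("chr3", 99, "C", "G")]

metrics = benchmark_calls(called, truth)
print(metrics)
```

With these toy inputs the pipeline finds 3 of the 5 true variants (sensitivity 0.6) and 3 of its 4 calls are correct (PPV 0.75).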
17 Processing and managing the data
NGS HPC clusters on 4 main R&D sites
- UK (SGE, ~200 cores, GPFS)
- Sweden (SLURM, >500 cores, Lustre)
- China (SGE, >100 cores, GPFS)
- US (UGE, >200 cores, GPFS)
Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
- Processed data handed over to disease area bioinformaticians in a controlled manner
Quick pipes between the sites allow data sharing when required
Cloud computing
18 NGS + Cloud
NGS is suited to using the cloud
Large scale storage needs
High computational power that can continue to scale
Inherently (embarrassingly) parallel, easily ported
Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
Launch-able computing centre utilising Amazon EC2
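"Embarrassingly parallel" here means that each sample can be processed without reference to any other, so work simply fans out over however many workers are available. The sketch below is a stand-in under stated assumptions: the per-sample task (GC fraction of toy reads) and all names are hypothetical, and a thread pool replaces what would be cluster scheduler jobs or cloud instances in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Toy per-sample task: GC fraction of all reads in one sample."""
    name, reads = sample
    bases = "".join(reads)
    gc = sum(base in "GC" for base in bases) / len(bases)
    return name, round(gc, 3)


def run_all(samples, workers=4):
    """Process every sample independently; no task depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_sample, samples))


# Hypothetical tumour/normal pair with a couple of toy reads each.
samples = [("tumour", ["GGCC", "ATGC"]), ("normal", ["ATAT", "ATGC"])]
results = run_all(samples)
print(results)  # {'tumour': 0.75, 'normal': 0.25}
```

Because the tasks share nothing, adding workers needs no change to the per-sample code, which is what makes bursting into the cloud at peak times practical.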
19 StarCluster from MIT with our pipeline
Diagram: 12 nodes labelled "32 Core", 6 units labelled "320 SSD", and a 40 TB GlusterFS volume labelled /ngs
20 Why not Hadoop?
We use a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
No pipeline wraps the above tools in a Hadoop framework
Disk I/O is admittedly the bottleneck in current parallel file system architectures for NGS analytics
- GPFS locally at AZ
- Lustre in AWS, local scratch SSD
21 Visualising the data: JBrowse genome browser
The most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java-based standalone program
- Requires a Java app
- Requires configuration
JBrowse, a web browser based genome viewer, is inherently easier for non-tech-savvy people: point your browser to it and it just works
- Physical location of data is less important; only the part that is shown is transferred
Data of interest, such as genomic variants, can be annotated with a URL to JBrowse
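Annotating a variant with a JBrowse link can be as simple as building a URL that opens the browser at the variant's location. The sketch below assumes the `loc` and `tracks` query parameters as used by JBrowse 1 URLs; the host name, track names and the variant position are hypothetical, so check them against your own installation.

```python
from urllib.parse import urlencode

# Hypothetical JBrowse installation; replace with a real host.
JBROWSE_BASE = "https://jbrowse.example.org/index.html"


def jbrowse_link(chrom, pos, flank=50, tracks=("DNA", "variants")):
    """Build a URL that opens JBrowse on a window around one position."""
    start = max(1, pos - flank)
    end = pos + flank
    query = {"loc": f"{chrom}:{start}..{end}", "tracks": ",".join(tracks)}
    # Keep ':' and ',' unescaped so the link stays readable.
    return f"{JBROWSE_BASE}?{urlencode(query, safe=':,')}"


# Illustrative coordinate only, not a real reported variant.
link = jbrowse_link("chr13", 32900000)
print(link)
```

A link like this can be stored next to each variant in a results table, so a reader reaches the supporting reads with one click.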
22 JBrowse BRCA2 gene screenshot
Screenshot annotations:
- Reference DNA sequence and amino acids
- BRCA2 alternative exons
- Detected gene variant (G to A mutation)
- Evidence in the data for the variant
- Noise in the data
23 Summary
24 Summary
NGS data is accumulating faster and faster
Analysing and interpreting the data is I/O intensive (+CPU and RAM)
Easily parallelised using SMP and simple schedulers (SGE, Slurm)
Current challenges in integrating all the processed data (in e.g. NoSQL databases)
Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier
New solutions for Big Data Analysis and Visualization
New solutions for Big Data Analysis and Visualization From HPC to cloud-based solutions Barcelona, February 2013 Nacho Medina [email protected] http://bioinfo.cipf.es/imedina Head of the Computational Biology
Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center
Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing
Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute
Data Analysis & Management of High-throughput Sequencing Data Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute Current Issues Current Issues The QSEQ file Number files per
A leader in the development and application of information technology to prevent and treat disease.
A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today
Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA
Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow Barry Bolding Cray Inc Seattle, WA 1 CUG 2013 Paper Genomic Applications on Cray supercomputers: Next Generation Sequencing
OpenCB a next generation big data analytics and visualisation platform for the Omics revolution
OpenCB a next generation big data analytics and visualisation platform for the Omics revolution Development at the University of Cambridge - Closing the Omics / Moore s law gap with Dell & Intel Ignacio
Hadoopizer : a cloud environment for bioinformatics data analysis
Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) [email protected], INRIA/Irisa, Campus de Beaulieu, 35042,
GENOMICS SERVICES
GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E
ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013
ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE October 2013 Introduction As sequencing technologies continue to evolve and genomic data makes its way into clinical use and
Data Sharing Initiative: International Cancer Genome Consortium
Data Sharing Initiative: International Cancer Genome Consortium Tom Hudson, MD President and Scientific Director Ontario Institute for Cancer Research 1 Sharing Data Sharing BIG Genome Initiative: DATA
UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production
Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department
Delivering the power of the world s most successful genomics platform
Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE
Personalized Medicine and IT
Personalized Medicine and IT Data-driven Medicine in the Age of Genomics www.intel.com/healthcare/bigdata Ketan Paranjape General Manager, Life Sciences Intel Corp. @Portlandketan 1 The Central Dogma of
Introduction to NGS data analysis
Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High
Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo
Preparing the scenario for the use of patient s genome sequences in clinic Joaquín Dopazo Computational Medicine Institute, Centro de Investigación Príncipe Felipe (CIPF), Functional Genomics Node, (INB),
CHALLENGES IN NEXT-GENERATION SEQUENCING
CHALLENGES IN NEXT-GENERATION SEQUENCING BASIC TENETS OF DATA AND HPC Gray s Laws of data engineering 1 : Scientific computing is very dataintensive, with no real limits. The solution is scale-out architecture
High Throughput Sequencing Data Analysis using Cloud Computing
High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom ([email protected]) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure
The NGS IT notes. George Magklaras PhD RHCE
The NGS IT notes George Magklaras PhD RHCE Biotechnology Center of Oslo & The Norwegian Center of Molecular Medicine University of Oslo, Norway http://www.biotek.uio.no http://www.ncmm.uio.no http://www.no.embnet.org
Genomic Medicine The Future of Cancer Care. Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America
Genomic Medicine The Future of Cancer Care Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America Personalized Medicine Personalized health care is a broad term for interventions
Challenges associated with analysis and storage of NGS data
Challenges associated with analysis and storage of NGS data Gabriella Rustici Research and training coordinator Functional Genomics Group [email protected] Next-generation sequencing Next-generation sequencing
Big data in cancer research : DNA sequencing and personalised medicine
Big data in cancer research: DNA sequencing and personalised medicine Philippe Hupé Conférence BIGDATA 04/04/2013 Deciphering
Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik
Leading Genomics Diagnostic harma Discove Collab Shanghai Cambridge, MA Reykjavik Global leadership for using the genome to create better medicine WuXi NextCODE provides a uniquely proven and integrated
Big Data on AWS. Services Overview. Bernie Nallamotu Principal Solutions Architect
Big Data on AWS Services Overview Bernie Nallamotu Principal Solutions Architect So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze
Cloud-based Analytics and Map Reduce
1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,
Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation
PN 100-9879 A1 TECHNICAL NOTE Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation Introduction Cancer is a dynamic evolutionary process of which intratumor genetic and phenotypic
Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille
Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Journées SUCCES Stéphane Le Crom (UPMC IBENS) [email protected] Paris November 2013 The Sanger DNA sequencing method Sequencing
Cloud-Based Big Data Analytics in Bioinformatics
Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large
Worldwide Collaborations in Molecular Profiling
Worldwide Collaborations in Molecular Profiling Lillian L. Siu, MD Director, Phase I Program and Cancer Genomics Program Princess Margaret Cancer Centre Lillian Siu, MD Contracted Research: Novartis, Pfizer,
Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community
Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute [email protected] http://www.jcvi.org/cms/about/bios/kkrampis/
GC3 Use cases for the Cloud
GC3: Grid Computing Competence Center GC3 Use cases for the Cloud Some real world examples suited for cloud systems Antonio Messina Trieste, 24.10.2013 Who am I System Architect
Digital Health: Catapulting Personalised Medicine Forward STRATIFIED MEDICINE
Digital Health: Catapulting Personalised Medicine Forward STRATIFIED MEDICINE CRUK Stratified Medicine Initiative Somatic mutation testing for prediction of treatment response in patients with solid tumours:
Practical Solutions for Big Data Analytics
Practical Solutions for Big Data Analytics Ravi Madduri Computation Institute ([email protected]) Paul Dave ([email protected]) Dinanath Sulakhe ([email protected]) Alex Rodriguez ([email protected])
Focusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs
Provisional Translation (as of January 27, 2014)* November 15, 2013 Pharmaceuticals and Bio-products Subcommittees, Science Board Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer
Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar
Computational infrastructure for NGS data analysis José Carbonell Caballero Pablo Escobar Computational infrastructure for NGS Cluster definition: A computer cluster is a group of linked computers, working
Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013
Next Generation Sequencing: Adjusting to Big Data Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Outline Human Genome Project Next-Generation Sequencing Personalized Medicine
OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution
OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution Ignacio Medina, Paul Calleja, John Taylor (University of Cambridge, UIS, HPC Service (HPCS)) Abstract The advent
Genetic diagnostics the gateway to personalized medicine
Micronova 20.11.2012 Genetic diagnostics the gateway to personalized medicine Kristiina Assoc. professor, Director of Genetic Department HUSLAB, Helsinki University Central Hospital The Human Genome Packed
Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable
DDN Whitepaper Putting Genomes in the Cloud with WOS TM Making data sharing faster, easier and more scalable Table of Contents Cloud Computing 3 Build vs. Rent 4 Why WOS Fits the Cloud 4 Storing Sequences
Comparing Methods for Identifying Transcription Factor Target Genes
Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF
Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of Chicago @madduri
Large-scale Research Data Management and Analysis Using Globus Services Ravi Madduri Argonne National Lab University of Chicago @madduri Outline Who we are Challenges in Big Data Management and Analysis
BioHPC Web Computing Resources at CBSU
BioHPC Web Computing Resources at CBSU 3CPG workshop Robert Bukowski Computational Biology Service Unit http://cbsu.tc.cornell.edu/lab/doc/biohpc_web_tutorial.pdf BioHPC infrastructure at CBSU BioHPC Web
Text file One header line meta information lines One line : variant/position
Software Calling: GATK SAMTOOLS mpileup Varscan SOAP VCF format Text file One header line meta information lines One line : variant/position ##fileformat=vcfv4.1! ##filedate=20090805! ##source=myimputationprogramv3.1!
Installation Guide for Windows
Installation Guide for Windows Overview: Getting Ready Installing Sequencher Activating and Installing the License Registering Sequencher GETTING READY Trying Sequencher: Sequencher 5.2 and newer requires
Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT
Building Bioinformatics Capacity in Africa Nicky Mulder CBIO Group, UCT Outline What is bioinformatics? Why do we need IT infrastructure? What e-infrastructure does it require? How we are developing this
Analysis of NGS Data
Analysis of NGS Data Introduction and Basics Folie: 1 Overview of Analysis Workflow Images Basecalling Sequences denovo - Sequencing Assembly Annotation Resequencing Alignments Comparison to reference
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office
2013 Laboratory Accreditation Program Audioconferences and Webinars Implementing Next Generation Sequencing (NGS) as a Clinical Tool in the Laboratory Nazneen Aziz, PhD Director, Molecular Medicine Transformation
How-To: SNP and INDEL detection
How-To: SNP and INDEL detection April 23, 2014 Lumenogix NGS SNP and INDEL detection Mutation Analysis Identifying known, and discovering novel genomic mutations, has been one of the most popular applications
Considering De-Identification? Legacy Data. Kymberly Lee 16-Jul-2015
Considering De-Identification? Legacy Data Kymberly Lee 16-Jul-2015 Introduction This presentation provides an overview of Clinical data sharing, clinical data privacy, and clinical transparency. Discuss
A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System
A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System Young-Ho Kim, Eun-Ji Lim, Gyu-Il Cha, Seung-Jo Bae Electronics and Telecommunications
IOmark- VDI. Nimbus Data Gemini Test Report: VDI- 130906- a Test Report Date: 6, September 2013. www.iomark.org
IOmark- VDI Nimbus Data Gemini Test Report: VDI- 130906- a Test Copyright 2010-2013 Evaluator Group, Inc. All rights reserved. IOmark- VDI, IOmark- VDI, VDI- IOmark, and IOmark are trademarks of Evaluator
Bursting to a Hybrid Cloud for Services OFC 2015
Bursting to a Hybrid Cloud for Services OFC 2015 Big Data applications Big Compute in the cloud Why burst to the cloud? Opportunities 2 Big Data Apps Need Big Compute Life Sciences Bioinformatics Next
Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers
Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers Ntinos Krampis Asst. Professor J. Craig Venter Institute [email protected] http://www.jcvi.org/cms/about/bios/kkrampis/
IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS
IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS 29 OCTOBER 2015 DR. DIRK J. EVERS BACKGROUND TreatmentMAP
Managing and Conducting Biomedical Research on the Cloud Prasad Patil
Managing and Conducting Biomedical Research on the Cloud Prasad Patil Laboratory for Personalized Medicine Center for Biomedical Informatics Harvard Medical School SaaS & PaaS gmail google docs app engine
Big Data Challenges. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.
Big Data Challenges technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Data Deluge: Due to the changes in big data generation Example: Biomedicine
Disease gene identification with exome sequencing
Disease gene identification with exome sequencing Christian Gilissen Dept. of Human Genetics Radboud University Nijmegen Medical Centre [email protected] Contents Infrastructure Exome sequencing
An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing
An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates
Data management challenges in todays Healthcare and Life Sciences ecosystems
Data management challenges in todays Healthcare and Life Sciences ecosystems Jose L. Alvarez Principal Engineer, WW Director Life Sciences [email protected] Evolution of Data Sets in Healthcare
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
ORACLE HEALTH SCIENCES INFORM ADVANCED MOLECULAR ANALYTICS
ORACLE HEALTH SCIENCES INFORM ADVANCED MOLECULAR ANALYTICS INCORPORATE GENOMIC DATA INTO CLINICAL R&D KEY BENEFITS Enable more targeted, biomarker-driven clinical trials Improves efficiencies, compressing
LifeScope Genomic Analysis Software 2.5
USER GUIDE LifeScope Genomic Analysis Software 2.5 Graphical User Interface DATA ANALYSIS METHODS AND INTERPRETATION Publication Part Number 4471877 Rev. A Revision Date November 2011 For Research Use
Overview of Next Generation Sequencing platform technologies
Overview of Next Generation Sequencing platform technologies Dr. Bernd Timmermann Next Generation Sequencing Core Facility Max Planck Institute for Molecular Genetics Berlin, Germany Outline 1. Technologies
Basic processing of next-generation sequencing (NGS) data
Basic processing of next-generation sequencing (NGS) data Getting from raw sequence data to expression analysis! 1 Reminder: we are measuring expression of protein coding genes by transcript abundance
Integrated Rule-based Data Management System for Genome Sequencing Data
Integrated Rule-based Data Management System for Genome Sequencing Data A Research Data Management (RDM) Green Shoots Pilots Project Report by Michael Mueller, Simon Burbidge, Steven Lawlor and Jorge Ferrer
School of Nursing. Presented by Yvette Conley, PhD
Presented by Yvette Conley, PhD What we will cover during this webcast: Briefly discuss the approaches introduced in the paper: Genome Sequencing Genome Wide Association Studies Epigenomics Gene Expression
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless
The most powerful open source data science technologies in your browser. Yves Hilpisch
The most powerful open source data science technologies in your browser. Yves Hilpisch I. The Market and The Problem II. How We Solve The Problem III. Market Size and Facts IV. Strategic Opportunities
How Real-time Analysis turns Big Medical Data into Precision Medicine?
Medical Data into Dr. Matthieu-P. Schapranow GLOBAL HEALTH, Rome, Italy August 27, 2014 Important things first: Where to find additional information? Online: Visit http://we.analyzegenomes.com for latest
Building your Big Data Architecture on Amazon Web Services
Building your Big Data Architecture on Amazon Web Services Abhishek Sinha @abysinha [email protected] AWS Services Deployment & Administration Application Services Compute Storage Database Networking
Automating installation, testing and development of bcbio-nextgen pipeline
Automating installation, testing and development of bcbio-nextgen pipeline GUILLERMO CARRASCO HERNÁNDEZ [email protected] June 2013 Final project at Barcelona School of Informatics (FIB)
CSE-E5430 Scalable Cloud Computing. Lecture 4
Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and eBusiness Platforms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Data Analysis for Ion Torrent Sequencing
IFU022 v140202 Research Use Only Instructions For Use Part III Data Analysis for Ion Torrent Sequencing MANUFACTURER: Multiplicom N.V. Galileilaan 18 2845 Niel Belgium Revision date: August 21, 2014 Page
Single-Cell DNA Sequencing with the C1 Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples
DATA Sheet Single-Cell DNA Sequencing with the C1 Single-Cell Auto Prep System Reveal hidden populations and genetic diversity within complex samples Single-cell sensitivity Discover and detect SNPs,
THE ROLE OF BIG DATA IN HEALTH AND BIOMEDICAL RESEARCH. John Quackenbush Dana-Farber Cancer Institute Harvard School of Public Health
THE ROLE OF BIG DATA IN HEALTH AND BIOMEDICAL RESEARCH John Quackenbush Dana-Farber Cancer Institute Harvard School of Public Health CONFIDENTIAL Background and Disclosures Professor of Biostatistics and
Next generation sequencing (NGS)
Next generation sequencing (NGS) Vijayachitra Modhukur BIIT [email protected] 1 Bioinformatics course 11/13/12 Sequencing 2 Bioinformatics course 11/13/12 Microarrays vs NGS Sequences do not need to be known
Netapp HPC Solution for Lustre. Rich Fenton ([email protected]) UK Solutions Architect
Netapp HPC Solution for Lustre Rich Fenton ([email protected]) UK Solutions Architect Agenda NetApp Introduction Introducing the E-Series Platform Why E-Series for Lustre? Modular Scale-out Capacity Density
Information for patients and the public and patient information about DNA / Biobanking across Europe
Information for patients and the public and patient information about DNA / Biobanking across Europe BIOBANKING / DNA BANKING SUMMARY: A biobank is a store of human biological material, used for the purposes
Building a Collaborative Informatics Platform for Translational Research: Prof. Yike Guo Department of Computing Imperial College London
Building a Collaborative Informatics Platform for Translational Research: An IMI Project Experience Prof. Yike Guo Department of Computing Imperial College London Living in the Era of BIG Big Data : Massive
HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis
HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis HPC4NGS 2012, Valencia Ignacio Medina [email protected] Scientific Computing Unit Bioinformatics and Genomics Department
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 Twenty-first
NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons
The NIH Commons Summary The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage,
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data? Bulk Amount Unstructured Lots of Applications which need to handle huge amounts of data (in terms of 500+ TB per day) If a regular machine needs to
Cloud Computing and Amazon Web Services
Cloud Computing and Amazon Web Services Gary A. McGilvary edinburgh data.intensive research 1 OUTLINE 1. An Overview of Cloud Computing 2. Amazon Web Services 3. Amazon EC2 Tutorial 4. Conclusions 2 CLOUD
