Data Management Tools: practical approaches and lessons learned when scaling up a computing and data environment to keep up with the pace of data intensive research

1 Data Management Tools: practical approaches and lessons learned when scaling up a computing and data environment to keep up with the pace of data intensive research

2 Declaration of Potential Conflicts-of-Interest, Consulting and Corporate Collaborators NHCA Group

3 Scientific research is now data driven: analysis costs now exceed data generation costs, most data goes unanalyzed, and the most competitive institutes will be those that embrace informatics as a discipline.

4 Computations in the life and medical sciences are unique I
  Emphasis on symbolic/integer (non-floating point) intense computations (yet with floating point capabilities).
  Diverse types of computations that are continuously evolving, and demand different hardware, software and compute environment configurations.
  Emphasis on a mix of computing technologies (microprocessor, GPGPU, FPGA) with an objective to build the capability to optimize different codes (or code parts) on the different platforms for maximum performance in a pipeline.
  Emphasis on scalable, sustainable hardware/software/environmental architectures and installations (including support staff).
  Emphasis on data intense computations, so significant storage and bandwidth to/from the HPC installation and to/from storage to processors is essential.

5 Computations in the life and medical sciences are unique II
  Emphasis on providing answers through very simple web interfaces (especially for biomedical researchers, health care providers, or applications that appeal to lay people) by creating or porting applications that demand real-time HPC intense resources. Life and medical scientists want to solve cancer, not become programmers.
  Installation would have readily available en masse data from major public and local databases and a semantic web approach to gathering and accessing the larger data and knowledge bases that are available and essential to extract new knowledge.
  Typical datasets, such as NextGen human genome sequences and medical images, are TB in size, and some projects are in the PB size.

6 Informatics involves hardware, software, and expertise. Because scientists want answers, these should be thought of as an integrated mix that is continuously evolving, and complementary, not competitive
  Computing hardware (local, centralized/supercomputer centers, cloud)
  Computing software (many, varied, continuously evolving, few standards; the best software becomes proprietary; comparisons of different implementations are biased; and, most important, there is little funding for sustaining software)
  Data analysts (many flavors: programmers, informaticians, bioinformaticians, statisticians, clinical informaticians, anthropologists)

7 Local computing is primarily done on a machine we developed: SHADOWFAX
  A heterogeneous computing environment for data intensive computations
  ~2,524 CPUs, > 12 TB RAM (Dell/Intel)
  ~27,000 GPUs (NVIDIA)
  8 FPGA hybrid core systems (Convey)
  ~0.8 PB disk arrays (DDN)
  100 PB Sun/Oracle tape storage system

8 Local computing is primarily done on a machine we developed: SHADOWFAX
  With local synchronized copies of major databases: Medline, arXiv, PubMed Central, GenBank, SwissProt, 1000 Genomes Project, The Cancer Genome Atlas, Wikipedia
  Designed to meet the needs of applications that demand HPC: deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, Health IT
  Deals with vendors greatly controlled cost

9 Data Analysis Core (DAC) provides turnkey study design, monitoring and analysis
  Projects are diverse: ~80 projects completed in 2 years
    Genome assembly focuses on nonhuman and especially challenging genomes: turkey, bacteria, insects (butterfly), fish
    Genome variation discovery and annotation projects
    RNAseq: multiple projects ranging from binary comparisons to multifactor time studies
    miRNA expression and discovery
    SNP population studies
    Metagenomic studies
  Co-authored papers where contributions are made to the science: 9 published, 4 submitted, ~10 currently in draft
  Core personnel directly participate in grant applications: USDA grant submitted (PI), 7 grants submitted as co-PI

10 A data intense example: NextGen sequence analysis and exploitation

11 NextGen DNA sequence analysis is now the rate limiting step
  The cost of sequencing has dropped from $3B/genome to ~$1K/genome. New genomes are sequenced daily. It is estimated that 30,000 human genomes are complete, with 15,000 of these in the public domain.
  Analysis has focused on Single Nucleotide Polymorphisms (SNPs), which are single letter changes in the DNA code.
  For complex diseases like cancer, heart disease and mental disorders, extensive work still explains only 10-20% of the known genetic component.
  Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.
  Data analysis pipelines are built from a number of standard tools. There are many public and proprietary analysis pipelines, and their performance and accuracy are highly contested.
  Truth data is just beginning to be assembled. Different types of DNA sequencing do not cross-validate.

12 Microsatellites, or repetitive DNA sequences, are particularly challenging
  Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our "junk DNA" or, more recently, "dark matter DNA"; research focus has been on Single Nucleotide Polymorphisms (SNPs).
  Microsatellites have known value: they have long been used for paternity and forensic testing and are linked to neurological diseases (e.g. Huntington's and Fragile X).
  None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.

13 Microsatellite myths dispelled, enabling new discoveries
  Myth 1: Accurate and efficient analysis of the ~1 million microsatellites is not possible.
    Microsatellite genotypes in the 1000 Genomes Project and The Cancer Genome Atlas were demonstrated to be only 20% accurate [1]; a new proprietary algorithm is 96% accurate.
  Myth 2: Microsatellites are hyper-variable, and will therefore not be usable in genotype-phenotype association studies.
    Analysis of 1,200 healthy genomes demonstrated that 98% of the ~150,000 microsatellites in genes are highly invariant.
  Myth 3: Heritable and spontaneous components of disease will be explained by SNPs.
    A recent iCOGS study involving over 200,000 subjects demonstrated that known and new SNPs explain less than 50% of heritability in breast, ovarian and prostate cancer.
  [1] McIver, 2010

14 Research Pipeline
  Download and rebuild thousands of healthy and affected genomes
  Create genotype distributions for healthy and affected populations
  Compute Fisher's exact test p-value for each of ~1 million loci and rank results
  Identify Patterns of Informative Microsatellites (PIM) from loci that pass Bonferroni and Benjamini-Hochberg False Discovery Rate tests
  Manually review, do QC, compute sensitivity and specificity
  Annotate with ontologies, literature, input from experts
  Validate PIM with sequencing of well-characterized samples
  Business analysis; product definition; IP
  Publish; translate, regulatory approval, reimbursement; team with established clinical services co.
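
The statistical core of the pipeline above (a per-locus Fisher's exact test followed by Bonferroni and Benjamini-Hochberg filtering) can be sketched in a few lines. This is a minimal illustration only, not the proprietary implementation referred to in the slides: the function names, the locus identifiers, the 2x2 table layout (affected/healthy by variant/reference genotype) and all counts are hypothetical, and alpha = 0.05 is an assumed threshold.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(pvalues, alpha=0.05):
    """Flag p-values that pass the Bonferroni-corrected threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag p-values accepted by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank  # largest rank whose p-value is under its threshold
    passed = [False] * m
    for i in order[:cutoff]:
        passed[i] = True
    return passed


if __name__ == "__main__":
    # Hypothetical counts per locus:
    # (affected variant, affected reference, healthy variant, healthy reference)
    loci = {
        "locus_A": (40, 60, 10, 90),
        "locus_B": (22, 78, 20, 80),
        "locus_C": (35, 65, 15, 85),
    }
    names = list(loci)
    pvals = [fisher_exact_two_sided(*loci[n]) for n in names]
    bonf = bonferroni(pvals)
    bh = benjamini_hochberg(pvals)
    # Rank loci from most to least significant
    for i in sorted(range(len(names)), key=lambda i: pvals[i]):
        print(names[i], pvals[i], "Bonferroni" if bonf[i] else "-", "BH" if bh[i] else "-")
```

With ~1 million loci, the Bonferroni threshold at an assumed alpha of 0.05 is 0.05 / 1,000,000 = 5e-8, so Benjamini-Hochberg will generally retain more loci than Bonferroni. A production pipeline would more likely call an established statistics library than hand-rolled functions like these.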

15 Genomeon has created a unique library of over 8000 genomes from 1000 Genomes Project and The Cancer Genome Atlas with corrected microsatellites
  Healthy population representing many ethnicities
  Ovarian cancer
  Breast cancer
  Brain cancer: glioma; glioblastoma; medulloblastoma
  Lung cancer
  Prostate cancer
  Melanoma
  Autism

16 Comparative analysis has yielded new actionable clinical diagnostics and drug targets for cancer, for example Breast Cancer

17 Pattern of 55 informative microsatellites differentiates Breast Cancer germlines from healthy germlines
  Sensitivity = 88%
  Specificity = 77%
  BRCA1/2 positive samples
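
For readers less familiar with the two figures on this slide, the sketch below shows how sensitivity and specificity are computed from classification counts. The cohort sizes (100 affected and 100 healthy germlines) are hypothetical, chosen only so that the arithmetic reproduces the 88% and 77% quoted above; the slides do not state the actual sample counts.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of affected samples that the pattern correctly flags."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of healthy samples that the pattern correctly leaves unflagged."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts: 100 affected germlines, 88 flagged by the pattern
    print("sensitivity:", sensitivity(true_positives=88, false_negatives=12))
    # Hypothetical counts: 100 healthy germlines, 77 not flagged by the pattern
    print("specificity:", specificity(true_negatives=77, false_positives=23))
```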

18 Genes proximate to 55 BC Informative Loci
  52 loci are in genes, 3 loci are intergenic
  Of the 52 loci, 1 is in an exon and 4 are in untranslated regions, while the rest are intronic, located very close to the intron-exon boundary.
  Many of the genes are known to be alternatively spliced and are differentially expressed, both of which imply mechanism
  Ontologies: notch signaling, genome stability, alternative splicing, programmed cell death, cell cycle and apoptosis
  32 of the 52 genes previously associated with cancer, 18 with breast cancer
  Several genes are known and highly pursued drug targets; new targets include several kinase and membrane bound proteins.
  11 of the 52 genes are targets of or affected by pharmaceuticals, including 5 that are prescribed or in clinical trials for BC.

19 Applications of these microsatellite loci variations
  Cancer Risk Diagnostics - Microsatellite profiling for increased risk of cancer, and the tissues at highest risk
  Companion/Treatment Diagnostics - Many informative microsatellites are functional elements implicated in therapeutic response
  Clinical Trial Support - Use of microsatellite profile to differentiate subpopulations in clinical trials
  Drug Targets - Identification of large number of genes previously unassociated with cancer, many with functions associated with cancer processes
  Toxicology - Quantification of stress induced exposures/stressors via microsatellite mutation screen
  Prognosis - Comparison of microsatellite variations between germlines and tumors
  Non-cancer Diseases - PTSD, autism, MS, cardiac diseases, aging

20 Another data intense example: Text analytics to quantify publication ethics violations and fraud... but let's talk about that and the fallout later.

21 Lessons Learned: Informatics has become a critical bottleneck, is evolving quickly, is expensive and requires continuous investment ($, people, recognition), but it is here to stay and is required to be competitive.

22 A few things to keep in mind
  Grow and evolve
    Systems (hardware/cloud, software and people) should be obtained in smaller, diverse chunks (including for jobs with high memory requirements, fast database access, intense parallel message passing) and grow as demand grows, to take advantage of Moore's Law, changing requirements, and vendor competition
    Systems should and can be operational on day one
    Provide for public-facing real time web services
  Verification and retention
    Data AND analysis history must be verified and retained; this will be required and will make one more competitive
  Restrictions
    Access to public databases is variable; for example, TCGA cannot be downloaded/analyzed in the cloud, and there are minimum systems/personnel required for access
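
One common way to make data and analysis history verifiable is to store a checksum of every input and output next to the command that produced it. The sketch below illustrates that idea with Python's standard hashlib module. It is an assumption about how such a record could look, not a description of the system used on SHADOWFAX; the function names, record fields and log format are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(input_path, output_path, command):
    """Build a record tying an analysis command to checksums of its input and output."""
    return {
        "command": command,
        "input": {"path": input_path, "sha256": sha256_of_file(input_path)},
        "output": {"path": output_path, "sha256": sha256_of_file(output_path)},
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def verify_record(record):
    """Return True only if both files still match the checksums stored in the record."""
    return all(
        sha256_of_file(entry["path"]) == entry["sha256"]
        for entry in (record["input"], record["output"])
    )


def append_to_log(record, log_path):
    """Append the record as one JSON line to a plain-text history log."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
```

A record would be written once when an analysis step finishes, and verify_record can be rerun later (for example before publication or a data release) to confirm that neither the data nor the result has changed since.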

23 A few things to keep in mind
  Security
    Server security via multiple layers, limited access, invisibility
    Collaboratoriums are hard to secure; most times simple solutions are best (Google Drive)
    Varying and changing requirements of institutes, governments, projects
  Liability
    Material Transfer Agreements now involve data and are getting more complex
    Release of data, software, etc.
  Uncertainty
    Changing demand, fluctuating funding, and impact of breakthroughs
  HIPAA
    Clinical data access may not be possible even behind their firewalls, driven by fear of loss of control, discovering an adverse event or comparisons across practices
    Access to data
  Commercialization/translation
    Patentability, proprietary/trade secret
    The world's best bioinformatics company is worth. The world's worst pharma company is worth..

24 Cloud computing is not yet the answer to computing in the life and medical sciences
  Locality/Dependencies
    Where is the data, and what about data that must be merged from many sources?
  Compute match
    Some jobs require non-standard hardware configurations for performance: some genomic assemblies require 2+ TB of memory, some simulations require extremely high data exchange/update rates
  Bandwidth
    Getting the data to the cloud from local sources can be limiting, as will be cases where data is moved from cloud to cloud.
  Cost
    The initial cost is low, but the sustained cost can be high, and in academic settings, funding to support work beyond 3 years is very difficult to obtain
  Security
    There are HIPAA compliant clouds, but there are issues with acceptance
  Storage
    Costs are still high for sustained storage. Known amounts of local storage drive scientists to be economical in experimental design.
  Unknowns
    What happens when the cloud goes down? What happens if a supplier goes out of business? What if
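
The bandwidth point can be made concrete with a back-of-the-envelope transfer-time estimate. The sketch below is illustrative only: the link speeds and the efficiency factor are assumptions, not measurements from the talk, and sizes use decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes).

```python
TB = 10**12  # bytes, decimal terabyte
PB = 10**15  # bytes, decimal petabyte
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link, given the usable fraction of its bandwidth."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed link speeds: 1 Gb/s and 10 Gb/s; assumed 50% usable bandwidth
    for label, size in (("1 TB dataset", TB), ("1 PB project", PB)):
        for gbps in (1, 10):
            days = transfer_days(size, gbps * 10**9, efficiency=0.5)
            print(f"{label} over {gbps} Gb/s at 50% efficiency: {days:.2f} days")
```

Under these assumptions, 1 PB over a fully used 10 Gb/s link takes roughly 9.3 days, and about twice that at 50% efficiency, which is why moving PB-scale projects to or between clouds can be limiting.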

25 One possible solution: Create and support critical mass sized entities that span AIRI members, so that members together take advantage of scale

26 Discipline-specific informatics entities: condo computing organization where members buy in and excess capacity is available to new/unfunded researchers
  Mix of compute technologies and bulk purchasing
  Best of class software, algorithms and data warehouses
  Automated pipelines
  Data analysts as independent researchers and as collaborators
  Complete data analysis solutions: computing, statistics, experimental design, data monitoring/archiving, data and analysis reproducibility validation/checking
  Data and analysis delivery portals (required by funders and journals)
  Critical mass so all needed expertise and infrastructure is available and continuously upgraded to meet changing needs

27 Thank you. Any Questions?

Human Genome Organization: An Update. Genome Organization: An Update

Human Genome Organization: An Update. Genome Organization: An Update Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion

More information

A leader in the development and application of information technology to prevent and treat disease.

A leader in the development and application of information technology to prevent and treat disease. A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today

More information

A Primer of Genome Science THIRD

A Primer of Genome Science THIRD A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:

More information

An Introduction to Genomics and SAS Scientific Discovery Solutions

An Introduction to Genomics and SAS Scientific Discovery Solutions An Introduction to Genomics and SAS Scientific Discovery Solutions Dr Karen M Miller Product Manager Bioinformatics SAS EMEA 16.06.03 Copyright 2003, SAS Institute Inc. All rights reserved. 1 Overview!

More information

Clinical Genomics at Scale: Synthesizing and Analyzing Big Data From Thousands of Patients

Clinical Genomics at Scale: Synthesizing and Analyzing Big Data From Thousands of Patients Clinical Genomics at Scale: Synthesizing and Analyzing Big Data From Thousands of Patients Brandy Bernard PhD Senior Research Scientist Institute for Systems Biology Seattle, WA Dr. Bernard s research

More information

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation

More information

Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing

Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing Center Yale University PIs: Ivet Bahar, Jeremy Berg,

More information

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources 1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools

More information

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE ACCELERATING PROGRESS IS IN OUR GENES AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE GENESPRING GENE EXPRESSION (GX) MASS PROFILER PROFESSIONAL (MPP) PATHWAY ARCHITECT (PA) See Deeper. Reach Further. BIOINFORMATICS

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Factors for success in big data science

Factors for success in big data science Factors for success in big data science Damjan Vukcevic Data Science Murdoch Childrens Research Institute 16 October 2014 Big Data Reading Group (Department of Mathematics & Statistics, University of Melbourne)

More information

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs) Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs) Single nucleotide polymorphisms or SNPs (pronounced "snips") are DNA sequence variations that occur

More information

G E N OM I C S S E RV I C ES

G E N OM I C S S E RV I C ES GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each

More information

Medical Informatics II

Medical Informatics II Medical Informatics II Zlatko Trajanoski Institute for Genomics and Bioinformatics Graz University of Technology http://genome.tugraz.at zlatko.trajanoski@tugraz.at Medical Informatics II Introduction

More information

Introduction to Genome Annotation

Introduction to Genome Annotation Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT

More information

The Human Genome Project. From genome to health From human genome to other genomes and to gene function Structural Genomics initiative

The Human Genome Project. From genome to health From human genome to other genomes and to gene function Structural Genomics initiative The Human Genome Project From genome to health From human genome to other genomes and to gene function Structural Genomics initiative June 2000 What is the Human Genome Project? U.S. govt. project coordinated

More information

Balancing Big Data for Security, Collaboration and Performance

Balancing Big Data for Security, Collaboration and Performance Balancing Big Data for Security, Collaboration and Performance Sai Balu Lineberger Cancer Center UNC Chapel Hill Oct 14, 2014 About UNC Oldest Public University -1793 Top 5 Public University. 46th World

More information

Attacking the Biobank Bottleneck

Attacking the Biobank Bottleneck Attacking the Biobank Bottleneck Professor Jan-Eric Litton BBMRI-ERIC BBMRI-ERIC Big Data meets research biobanking Big data is high-volume, high-velocity and highvariety information assets that demand

More information

Big Data Trends A Basis for Personalized Medicine

Big Data Trends A Basis for Personalized Medicine Big Data Trends A Basis for Personalized Medicine Dr. Hellmuth Broda, Principal Technology Architect emedikation: Verordnung, Support Prozesse & Logistik 5. Juni, 2013, Inselspital Bern Over 150,000 Employees

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

Scaling up to Production

Scaling up to Production 1 Scaling up to Production Overview Productionize then Scale Building Production Systems Scaling Production Systems Use Case: Scaling a Production Galaxy Instance Infrastructure Advice 2 PRODUCTIONIZE

More information

ITT Advanced Medical Technologies - A Programmer's Overview

ITT Advanced Medical Technologies - A Programmer's Overview ITT Advanced Medical Technologies (Ileri Tip Teknolojileri) ITT Advanced Medical Technologies (Ileri Tip Teknolojileri) is a biotechnology company (SME) established in Turkey. Its activity area is research,

More information

Cancer Genomics: What Does It Mean for You?

Cancer Genomics: What Does It Mean for You? Cancer Genomics: What Does It Mean for You? The Connection Between Cancer and DNA One person dies from cancer each minute in the United States. That s 1,500 deaths each day. As the population ages, this

More information

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova Using the Grid for the interactive workflow management in biomedicine Andrea Schenone BIOLAB DIST University of Genova overview background requirements solution case study results background A multilevel

More information

How Can Institutions Foster OMICS Research While Protecting Patients?

How Can Institutions Foster OMICS Research While Protecting Patients? IOM Workshop on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials How Can Institutions Foster OMICS Research While Protecting Patients? E. Albert Reece, MD, PhD, MBA Vice

More information

Integrating Bioinformatics, Medical Sciences and Drug Discovery

Integrating Bioinformatics, Medical Sciences and Drug Discovery Integrating Bioinformatics, Medical Sciences and Drug Discovery M. Madan Babu Centre for Biotechnology, Anna University, Chennai - 600025 phone: 44-4332179 :: email: madanm1@rediffmail.com Bioinformatics

More information

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik Leading Genomics Diagnostic harma Discove Collab Shanghai Cambridge, MA Reykjavik Global leadership for using the genome to create better medicine WuXi NextCODE provides a uniquely proven and integrated

More information

Big Data Challenges. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.

Big Data Challenges. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres. Big Data Challenges technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Data Deluge: Due to the changes in big data generation Example: Biomedicine

More information

Hacking Brain Disease for a Cure

Hacking Brain Disease for a Cure Hacking Brain Disease for a Cure Magali Haas, CEO & Founder #P4C2014 Innovator Presentation 2 Brain Disease is Personal The Reasons We Fail in CNS Major challenges hindering CNS drug development include:

More information

Human Genome and Human Genome Project. Louxin Zhang

Human Genome and Human Genome Project. Louxin Zhang Human Genome and Human Genome Project Louxin Zhang A Primer to Genomics Cells are the fundamental working units of every living systems. DNA is made of 4 nucleotide bases. The DNA sequence is the particular

More information

Cloud-Based Big Data Analytics in Bioinformatics

Cloud-Based Big Data Analytics in Bioinformatics Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large

More information

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:

More information

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013 ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE October 2013 Introduction As sequencing technologies continue to evolve and genomic data makes its way into clinical use and

More information

School of Nursing. Presented by Yvette Conley, PhD

School of Nursing. Presented by Yvette Conley, PhD Presented by Yvette Conley, PhD What we will cover during this webcast: Briefly discuss the approaches introduced in the paper: Genome Sequencing Genome Wide Association Studies Epigenomics Gene Expression

More information

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons The NIH Commons Summary The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage,

More information

IO Informatics The Sentient Suite

IO Informatics The Sentient Suite IO Informatics The Sentient Suite Our software, The Sentient Suite, allows a user to assemble, view, analyze and search very disparate information in a common environment. The disparate data can be numeric

More information

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Report on the Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Background and Goals of the Workshop June 5 6, 2012 The use of genome sequencing in human research is growing

More information

SNP Essentials The same SNP story

SNP Essentials The same SNP story HOW SNPS HELP RESEARCHERS FIND THE GENETIC CAUSES OF DISEASE SNP Essentials One of the findings of the Human Genome Project is that the DNA of any two people, all 3.1 billion molecules of it, is more than

More information

Q&A: Kevin Shianna on Ramping up Sequencing for the New York Genome Center

Q&A: Kevin Shianna on Ramping up Sequencing for the New York Genome Center Q&A: Kevin Shianna on Ramping up Sequencing for the New York Genome Center Name: Kevin Shianna Age: 39 Position: Senior vice president, sequencing operations, New York Genome Center, since July 2012 Experience

More information

TECHNOLOGIES, PRODUCTS & SERVICES for MOLECULAR DIAGNOSTICS, MDx ABA 298

TECHNOLOGIES, PRODUCTS & SERVICES for MOLECULAR DIAGNOSTICS, MDx ABA 298 DIAGNOSTICS BUSINESS ANALYSIS SERIES: TECHNOLOGIES, PRODUCTS & SERVICES for MOLECULAR DIAGNOSTICS, MDx ABA 298 By ADAMS BUSINESS ASSOCIATES MAY 2014. May 2014 ABA 298 1 Technologies, Products & Services

More information

ebook Utilizing MapReduce to address Big Data Enterprise Needs Leveraging Big Data to shorten drug development cycles in Pharmaceutical industry.

ebook Utilizing MapReduce to address Big Data Enterprise Needs Leveraging Big Data to shorten drug development cycles in Pharmaceutical industry. Utilizing MapReduce to address Big Data Enterprise Needs Leveraging Big Data to shorten drug development cycles in Pharmaceutical industry. www.persistent.com 3 4 5 5 7 9 10 11 12 13 From the Vantage Point

More information

MediSapiens Ltd. Bio-IT solutions for improving cancer patient care. Because data is not knowledge. 19th of March 2015
