Big data in cancer research: DNA sequencing and personalised medicine

Philippe Hupé, Conférence BIGDATA, 04/04/2013

Deciphering the cancer genome with high-throughput technologies
[Figure: normal karyotype vs. cancer karyotype]
Cancer is a gene disease.
Sequence the cancer genome (i.e. read its DNA sequence) to:
- understand the molecular mechanisms of tumour progression
- tailor the therapy to each patient individually
Use high-throughput sequencing methods (next-generation sequencing, NGS).

30 years ago... the era of DNA sequencing
Walter Gilbert (Harvard), Nobel laureate, 1980. In 1977, he and Frederick Sanger each developed an eponymous DNA sequencing method.
"I expect that within a few years, our technology will be able to sequence one megabase/technician-year. At that rate 100 technicians could sequence the genome in 30 years. An effort to improve the technology over a 10-year period should raise the rate by a factor of 10."
The Scientist, October 20, 1986
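The arithmetic behind Gilbert's estimate can be checked in a few lines. This is a minimal sketch, assuming a genome size of 3 billion bases (the figure used later in this talk); the function and variable names are illustrative only.

```python
# Assumption: human genome of about 3 billion bases, as quoted later in the talk.
GENOME_SIZE_BASES = 3_000_000_000
RATE_BASES_PER_TECH_YEAR = 1_000_000  # "one megabase/technician-year"
TECHNICIANS = 100


def years_to_sequence(genome_bases, rate_per_tech_year, technicians):
    """Elapsed years needed for a team to read the whole genome once."""
    return genome_bases / (rate_per_tech_year * technicians)


baseline_years = years_to_sequence(
    GENOME_SIZE_BASES, RATE_BASES_PER_TECH_YEAR, TECHNICIANS
)
# Same team after the hoped-for tenfold improvement in technology.
improved_years = years_to_sequence(
    GENOME_SIZE_BASES, 10 * RATE_BASES_PER_TECH_YEAR, TECHNICIANS
)

print(baseline_years)  # 30.0
print(improved_years)  # 3.0
```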

Evolution of sequencing technologies and decreasing cost
Year  Genome       Cost ($)       Duration    Technology  Nb. of scientists
2003  HGP          2,700,000,000  13 years    Sanger      2,800
2007  Venter       100,000,000    4 years     Sanger      31
2008  Watson       1,500,000      4.5 months  Roche 454   27
2009  (not named)  50,000         4 weeks     Helicos     3
Sources: Pushkarev et al. (2009), Wadman et al. (2008)
Sequencing platforms: Roche 454, Illumina, SOLiD, Helicos
In 2013, it costs around $5,000 to sequence a human genome in one week with one technician (about 1,500 times faster than the 30 years of Gilbert's prediction, which assumed 100 technicians).
Toward the $1,000 genome
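The "1,500 times faster" figure can be reproduced from the numbers on this slide and Gilbert's quote. The sketch below assumes 52 weeks per year and compares elapsed time only; it also shows the larger per-technician gain and the cost ratio between the 2003 and 2013 figures. All variable names are illustrative.

```python
WEEKS_PER_YEAR = 52  # assumption used for the conversion

# Gilbert's 1986 estimate: 30 years with 100 technicians.
gilbert_years = 30
gilbert_technicians = 100

# 2013 figures from the slide: one week, one technician, about $5,000.
weeks_2013 = 1
technicians_2013 = 1
cost_2013 = 5_000
cost_2003_hgp = 2_700_000_000

# Speed-up in elapsed time only (what the slide quotes, rounded to 1,500).
elapsed_speedup = gilbert_years * WEEKS_PER_YEAR / weeks_2013

# Speed-up per technician, since the 2013 run needs 1 person instead of 100.
per_technician_speedup = elapsed_speedup * gilbert_technicians / technicians_2013

# Cost ratio between the Human Genome Project and a 2013 genome.
cost_fold = cost_2003_hgp / cost_2013

print(elapsed_speedup)         # 1560.0
print(per_technician_speedup)  # 156000.0
print(cost_fold)               # 540000.0
```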

Data tsunami in cancer research
Drivers: low-cost sequencing + availability to every lab
Cost is divided by 2 every:
- CPU (Moore's law): 18 months
- Storage (Kryder's law): 12-14 months
- Network (Butters' law): 9 months
- NGS: 5 months
Consequence: computing challenges
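To see why these halving times create a computing problem, one can compare how far each cost falls over the same period. This is a minimal sketch assuming a simple exponential model, cost(t) = 0.5^(t / halving time); the 13-month storage value is an assumed midpoint of the 12-14 month range, and the 36-month horizon is arbitrary.

```python
# Halving times in months, from the slide. Storage uses 13 as an assumed
# midpoint of the quoted 12-14 month range.
HALVING_MONTHS = {
    "CPU (Moore)": 18,
    "Storage (Kryder)": 13,
    "Network (Butters)": 9,
    "NGS": 5,
}


def relative_cost(months_elapsed, halving_months):
    """Fraction of the initial cost remaining after months_elapsed."""
    return 0.5 ** (months_elapsed / halving_months)


HORIZON_MONTHS = 36  # arbitrary horizon chosen for illustration
for name, halving in HALVING_MONTHS.items():
    print(f"{name}: {relative_cost(HORIZON_MONTHS, halving):.4f}")
```

Under this model, sequencing cost falls much faster than the cost of the computing, storage and network resources needed to process the output, so the data volume per unit of budget outgrows the infrastructure.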

Next-generation sequencing... some figures...
Sequencing with Illumina HiSeq 2500:
- 6 billion sequences; 1 sequence = 100 bases (A, T, C, G)
- 1 experiment = 600 billion bases = 200,000 copies of Les Misérables
- 1 Tb of data (per week)
Human genome = 3 billion bases = 1,000 copies of Les Misérables
Reference human genome (known sequence) = dictionary
Cancer genome = faulty copy of the dictionary
In cancer, genes (= words) contain mutations (= mistakes): gene1 = GIRAFFE becomes gene1 = GILAFFE
Cancer creates new words (= fusion genes): gene1 = GIRAFFE and gene2 = ZEBRA give new gene = GIBRA
The 6 billion sequences are compared to the reference genome to find the mutations and fusion genes, taking into account the fact that the sequencer itself makes errors when reading the sequence.
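The figures and the dictionary analogy on this slide can be made concrete with a short sketch. The size of Les Misérables (3 million characters) is not stated on the slide; it is inferred here from "3 billion bases = 1,000 copies". The helper names `point_mutations` and `fuse` are hypothetical.

```python
READS = 6_000_000_000
READ_LENGTH = 100
GENOME_BASES = 3_000_000_000
# Assumption inferred from the slide: 3 billion bases = 1,000 books.
LES_MISERABLES_CHARS = GENOME_BASES // 1_000

bases_per_run = READS * READ_LENGTH
books_per_run = bases_per_run // LES_MISERABLES_CHARS
genome_equivalents = bases_per_run / GENOME_BASES  # times the genome is read

print(bases_per_run)       # 600000000000
print(books_per_run)       # 200000
print(genome_equivalents)  # 200.0


def point_mutations(reference, observed):
    """Positions where two words of equal length differ."""
    return [
        (i, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if ref != obs
    ]


def fuse(word_a, prefix_len, word_b, suffix_len):
    """Toy fusion: start of one word joined to the end of another."""
    return word_a[:prefix_len] + word_b[-suffix_len:]


print(point_mutations("GIRAFFE", "GILAFFE"))  # [(2, 'R', 'L')]
print(fuse("GIRAFFE", 2, "ZEBRA", 3))         # GIBRA
```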

Extraction of the biological signal from the raw data
- Development of algorithms and statistical methods
- Interdisciplinary work with bioinformaticians, computer scientists, biologists, mathematicians, statisticians and algorithmists
- HPC infrastructure
[Figure: pieces of the cancer genome (short reads such as CGAGCTG, ACGAGCT, TCCTAGC) aligned to the reference genome sequence (= dictionary); mismatches reveal mutations]
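The idea in the figure, placing short reads on the reference and separating true mutations from sequencer errors, can be sketched as follows. This is a deliberately naive illustration and not the pipeline used in practice: the reference and reads are made-up toy sequences, alignment is an ungapped brute-force scan, and the thresholds `max_mismatches`, `min_depth` and `min_fraction` are arbitrary assumptions.

```python
from collections import Counter, defaultdict


def best_alignment(read, reference, max_mismatches=1):
    """Best ungapped placement of read on reference, or None if too divergent."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


def call_mutations(reads, reference, min_depth=3, min_fraction=0.8):
    """Report positions where most overlapping reads disagree with the reference."""
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for read in reads:
        hit = best_alignment(read, reference)
        if hit is None:
            continue  # read could not be placed; ignore it
        offset, _ = hit
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        # A single discordant read is treated as a likely sequencer error.
        if base != reference[pos] and depth >= min_depth and support / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls


# Toy data (hypothetical): the tumour carries C->T at position 10, and one
# read ("ACGATGC") contains an isolated sequencing error at position 3.
reference = "ACGTTGCATGCCGATAGCTA"
reads = [
    "TGCATGT", "GCATGTC", "CATGTCG", "ATGTCGA", "TGTCGAT",
    "ACGTTGC", "CGTTGCA", "ACGATGC",
]
mutations = call_mutations(reads, reference)
print(mutations)  # [(10, 'C', 'T')]
```

The mutation is supported by every read covering position 10, whereas the isolated error at position 3 is outvoted by the other reads, which is the intuition behind requiring sufficient depth before calling a variant.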

Visualisation of the significant fusions
[Figure: intra-chromosome fusions]
Source: MCF-7 breast cancer cell line, Hampton et al., Genome Research 2009

Application to personalised medicine: the SHIVA clinical trial
Question: is molecularly targeted therapy better than conventional therapy?
[Figure: molecular profile, molecular abnormality, then targeted agent or chemotherapy]
Aim: compare the efficacy of molecularly targeted therapy based on tumour molecular profiling versus conventional therapy in patients with refractory cancer.

SHIVA clinical trial: the workflow
Time frame: 4 weeks
- Patient inclusion; biopsy at the clinic
- Shipment to pathology; IHC for ER/PR/AR (oestrogen, progesterone and androgen receptors)
- Shipment to CRB; DNA extraction
- Shipment of DNA to the Affymetrix platform; Affymetrix Cytoscan HD; bioinformatics analysis: detection of amplified/deleted genes; list of amplified/deleted genes; validation of amplified/deleted genes by IHC
- Shipment of DNA to the sequencing platform; Ion Torrent sequencing; bioinformatics analysis: detection of mutated genes
- Bioinformatics integration
- Elaboration of a report that is sent to the Molecular Biology Board
- Therapeutic decision

The therapeutic decision is based on a report with the list of molecular abnormalities
Simple decision rules:
- If STK11 is mutated, then targeted therapy = everolimus
- Other simple rules are used for other targeted therapies
Cancer biology is much more complex, and these naive rules need to be improved.
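Such decision rules amount to a lookup from a molecular abnormality to a targeted agent. The sketch below encodes only the one rule stated on the slide (STK11 mutated, everolimus); the table name, the function `recommend` and the example gene lists are hypothetical, and the other rules of the trial are deliberately not guessed.

```python
# Only the rule given on the slide is encoded; the trial's other rules
# are not reproduced here.
TARGETED_THERAPY_RULES = {
    "STK11": "everolimus",
}


def recommend(mutated_genes, rules=TARGETED_THERAPY_RULES):
    """Map each mutated gene that has a rule to its targeted agent."""
    return {gene: rules[gene] for gene in sorted(mutated_genes) if gene in rules}


# Illustrative reports (gene lists are made up for the example).
print(recommend({"STK11", "TP53"}))  # {'STK11': 'everolimus'}
print(recommend({"TP53"}))           # {}
```

A flat lookup like this ignores interactions between abnormalities, which is exactly the limitation the slide points out.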

Cancer is a complex disease
- Multiple biological layers
- Interactions between chemical species
The multidimensional nature of cancer (genome, proteome, epigenome, kinome, etc.) has to be considered to unravel the complexity of the disease. Mathematical models and computational systems biology are needed to improve current decision rules and to understand the emergent properties of cancer cells. To perform such integrative analyses with sophisticated mathematical models, this multidimensional information must be integrated within an efficient information system.

Data integration is a major challenge in cancer research
[Figure: private data sources (clinical, medical images, copy number, NGS, MS, gene expression, RPPA, phenotyping, biobank) and public data sources (Reactome, TCGA, CCLE, ICGC)]
A large Volume of patients' data is disseminated across a large Variety of databases, which increase in size at a huge Velocity. In order to extract most of the hidden Value from these data, we must face challenges at:
- the technical level: develop a powerful computing architecture
- the organisational and management levels: define the procedures to collect data with the highest confidence and quality
- the scientific level: create sophisticated mathematical models to predict the disease evolution and the patient's risk
At Institut Curie we are currently building an information system to fully integrate all the molecular, biological and clinical data.
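At its simplest, integration means bringing records from separate sources together under a shared patient identifier. The sketch below is only an illustration of that idea, not a description of the Institut Curie system: the source names, patient identifiers and field values are all made up.

```python
def integrate(sources):
    """Group records from several sources under each patient identifier.

    sources maps a source name to a dict {patient_id: record}.
    """
    merged = {}
    for source_name, records in sources.items():
        for patient_id, record in records.items():
            merged.setdefault(patient_id, {})[source_name] = record
    return merged


# Hypothetical toy sources keyed by a shared patient identifier.
sources = {
    "clinical": {"P001": {"age": 54}, "P002": {"age": 61}},
    "ngs": {"P001": {"mutated_genes": ["STK11"]}},
    "copy_number": {"P002": {"amplified_genes": []}},
}

integrated = integrate(sources)
print(sorted(integrated))            # ['P001', 'P002']
print(sorted(integrated["P001"]))    # ['clinical', 'ngs']
```

In a real setting the hard part lies in what this sketch assumes away: consistent identifiers, data quality, and formats that differ from one source to the next.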

Can we dream of an online prediction system to help therapeutic prediction?
[Figure: private sources (clinical data, LIMS NGS, LIMS RPPA, LIMS gene expression) and public sources (Reactome, ...) are each connected through a wrapper to a centralised bioinformatics database / virtual database]
Every day, for several patients, information is collected: pathological complete response (pCR), survival, response to therapy, molecular profiles, etc.
Integrative analyses aim at building signatures to predict disease evolution (e.g. risk of metastasis) and to support the therapeutic decision.
Re-evaluate the prediction rules in real time, taking this new information into account: apply online machine learning techniques.
[Figure: over time, mathematical models are trained on observed pCR and used to predict pCR for each new patient]
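The "predict, observe, update" loop can be illustrated with a small online learner. The sketch below uses logistic regression trained by stochastic gradient descent, chosen here purely as an example of an online technique; the slide does not name a specific model. The single binary feature and the synthetic patient stream are assumptions made for the demonstration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one patient at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Predicted probability of the outcome (here, pCR) for features x."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One gradient step on the log-loss once the outcome y is observed."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Synthetic stream (hypothetical): one binary marker; pCR is observed
# exactly when the marker is present.
stream = [([1.0], 1), ([0.0], 0)] * 200

model = OnlineLogisticRegression(n_features=1)
predictions = []
for features, observed_pcr in stream:
    predictions.append(model.predict_proba(features))  # predict for new patient
    model.update(features, observed_pcr)               # learn once pCR is known

print(model.predict_proba([1.0]) > 0.5)  # True
print(model.predict_proba([0.0]) < 0.5)  # True
```

Each prediction is made before the corresponding outcome is used for training, which mirrors the timeline in the figure: the model in use for a new patient reflects every outcome observed so far.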

Towards P4 medicine
The term "P4 medicine" was coined by Leroy Hood (president of the Institute for Systems Biology).
The practice of medicine is mainly reactive, i.e. the physician reacts to the disease state of the patient, and little is done to prevent the occurrence of the disease.
Predictive medicine was first introduced by Jean Dausset (Nobel Prize in Medicine, 1980).
P4 medicine:
- Predictive: considers the genetic background of the individual and his or her environment
- Preventive: adapting lifestyle, taking preventive drugs
- Personalised: tailoring the treatment to the unique features of the individual (such as the patient's genetic background, the tumour's genetic and epigenetic landscape, and the life environment)
- Participatory: many healthcare options, which require in-depth exchanges between the individual and his or her physician
P4 medicine = managing a patient's health instead of managing a patient's disease

Big basket with a large variety of

Data integration + mathematical models leverage new information

Welcome to GATTACA