Deliverable 7.3.1 First report on sample storage, DNA extraction and sample analysis processes

Model Driven Paediatric European Digital Repository
Call identifier: FP7-ICT-2011-9 - Grant agreement no: 600932
Thematic Priority: ICT - ICT-2011.5.2: Virtual Physiological Human
Due date of delivery: 31st August 2014
Actual submission date: 4th November 2014
Start of the project: 1st March 2013
Ending date: 28th February 2017
Partner responsible for this deliverable: OPBG
Version: 1.3

Dissemination Level: Public

Document Classification
Title: First report on sample storage, DNA extraction and sample analysis processes
Deliverable: 7.3.1
Reporting Period: 2
Authors: OPBG
Work Package: WP7
Security: PU
Nature: RE
Keyword(s): Sample storage, DNA extraction and sample analysis processes

NB. The content of the present Deliverable 7.3.1 is strictly related to Deliverable 7.2.1, titled First Report Data Collection Process, for the experimental, laboratory and bioinformatic procedures.

Document History
Name: Lorenza Putignani; Remark: Preliminary draft (with the first two paragraphs already prepared in common with D7.2.1); Version: 1.1; Date: 15/09/2014
Name: Lorenza Putignani; Remark: Final version; Version: 1.2; Date: 31/10/2014

List of Contributors
Name: Baban Anwar; Affiliation: OPBG
Name: Barbara Simionati; Affiliation: BMR GENOMICS SRL
Name: Manco Melania; Affiliation: OPBG
Name: Putignani Lorenza; Affiliation: OPBG

List of reviewers
Name: Bruno Dallapiccola; Affiliation: OPBG

Abbreviations

Table of Contents
1.1 Introduction
1.2 Materials and Methods
2. Details of Task activities
2.1 Task T7.3 DNA analysis (mm18)
3. Detailed results
Conclusions and Future Perspective

Index of Figures
Figure 1. The assignment of extended genotype (gut microbiota) to complement the genomic reservoir and to fully interpret phenotype profiling.
Figure 2. Example of BioProject for metagenome annotation, usually employed at the Metagenomics Unit of OPBG.
Figure 3. Original bioinformatic pipelines designed and set to generate and process metagenomic and metabolomic data.
Figure 4. Biobank of reference and disease samples available at the Metagenomics Unit of OPBG.
Figure 5. Operational steps for bioinformatic data integration workflows and dissemination activities linked to WP7.

1.1 Introduction

NB: This paragraph is the same as in Deliverable 7.2.1: First report on data collection process.

In the context of the Model Driven Paediatric European Digital Repository (MD-PAEDIGREE), the collection and management of genomic and metagenomic data may complement instrumental, routine laboratory and clinical data as a staple resource for medical research. Clinical data are collected during the course of ongoing patient care, and the -omic and meta-omic information may complement the electronic health records, providing evidence on the entire spectrum of ontological features of the patients. In detail, the entire set of age (i.e., stratification), flare-up conditions, naïve baseline of the pathology manifestation, and external perturbations such as diet, antibiotic administration and stress-related symptoms may be summarized by the term phenomics, the expression of the several phenotyping traits of the patient. Over the past 15 years, many authors have proposed that phenomics (large-scale phenotyping) is the natural complement to genome sequencing as a route to rapid advances in systems biology, preparing the route to systems medicine (Schork, N. J. Genetics of complex disease: approaches, problems, and solutions. Am. J. Respir. Crit. Care Med. 156, S103-S109, 1997; Schilling, C. H., Edwards, J. S. & Palsson, B. O. Toward metabolic phenomics: analysis of genomic data using flux balances. Biotechnol. Prog. 15, 288-295, 1999; Houle, D. In The Character Concept in Evolutionary Biology (ed. Wagner, G.) 109-140, Academic Press, 2001; Bilder, R. M. et al. Phenomics: the systematic study of phenotypes on a genome-wide scale. Neuroscience 164, 30-42, 2009; Freimer, N. & Sabatti, C. The human phenome project. Nature Genet. 34, 15-21, 2003).
Phenomic-level data are necessary to understand which genomic variants affect phenotypes, to understand pleiotropy, and to furnish the raw data needed to decipher the causes of complex diseases (obesity, juvenile idiopathic arthritis, cardiopathies). Our limited ability to understand many important biological phenomena suggests that we are not yet measuring all important variables and that broadening the range of measurements will pay rich dividends. Fundamentally, we can choose to include in this new point of view additional parameters or data, such as genomic fingerprinting indexes (e.g., disease-gene candidates, polymorphisms) and metagenomic gene scaffolds (microbiome) linked to metabolic activities (metabolome), to provide additional and useful indexes of disease. All genotyping and phenotyping parameters need to be measured by omics and meta-omics technologies. WP7 provides this added value to the Project, thanks to technologies for high-throughput phenotyping and genotyping that are fully available in the MD-PAEDIGREE Consortium at the OPBG facilities, and that include conceptual and analytical frameworks combined with advanced bioinformatic approaches that enable the use of very high-dimensional data. Additionally, dynamic models that link clinical phenomena across levels have been designed and are currently being developed further. However, phenotypic data continue to be the most powerful predictors of important biological outcomes, such as disease progression and mortality. Although analyses of genomic data have been successful at uncovering biological phenomena, they are, in most cases, supplementing rather than supplanting phenotypic information.

In WP7, we have identified the scientific and operational rationales for carrying out phenomics research and for integrating phenomic with genomic and metagenomic data by advanced approaches. We have employed conceptual frameworks to take full advantage of phenomic-level data, considering phenomics and metagenomics as independent disciplines. To evaluate the role of genomic (assessed by disease-gene or candidate-gene analysis) and metagenomic (based on gut microbiota signatures) profiling in the development and progress of diseases and in their outcomes, the post-analytical data collection and analysis processes have represented one of the milestones of WP7. The theoretical and operational framework is based on the concept of the extended genotype, associated with the new idea of the superorganism (Putignani et al., Pediatric Research-Nature, 2014), in which the host genome and the gut microbiota metagenome are considered in the context of the functional and structural activities synergistically produced by the host and its tissue microbiota. Because of different internal and external stimuli, the individual phenotype of the patient and/or individual can be considered the product of different variables such as: i) diet; ii) inflammation; iii) environment; iv) xeno-metabolites. The individual phenotype, therefore, is the combination of all these trans-acting elements with the genomic and metagenomic reservoirs, through genetic and epigenetic controls. Once the single microbiota is fully described, a genetic fingerprint is available to complement the individual genetic reservoir (code), through multi-level meta-omic platforms (metagenomics, metabolomics, metaproteomics). The data produced can be employed at the individual level, to assist in the design of therapeutic and diagnostic pipelines, and at the population level, for risk prediction of important diseases at early onset (Figure 1).

Figure 1. The assignment of extended genotype (gut microbiota) to complement the genomic reservoir and to fully interpret phenotype profiling.

During this first year of activities, we decided to leave the diet factor out of the integration pipelines, because of the complexity of the nutritional algorithms in the assessment of the microbiota components; we hope this aspect will be developed in dedicated future EU projects. The other factors have been fully considered in the first step of patient recruitment and sample collection (baseline, onset) and will progressively be considered during follow-up (e.g., flare-up). They have been analyzed for each patient, and the associated ontologies or categories of clinical-diagnostic traits have been uploaded onto the Gnubila database as qualitative and quantitative metagenomics and metabolomics maps, expressed in terms of relative abundances of OTUs (operational taxonomic units) and metabolites (volatilome). During this year, the process of data collection (D7.2.1) has taken place at three levels: i) OPBG repository database, with in-house data processing and storage procedures; ii) NCBI BioProject submission and EBI repository database (Figure 2); iii) Gnubila data submission, with the intent to generate a shared platform for model generation.

Figure 2. Example of BioProject for metagenome annotation, usually employed at the Metagenomics Unit of OPBG.
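As a minimal illustration of the relative abundances of OTUs mentioned above, the following Python sketch converts raw per-OTU read counts into fractions of the total. The function name, the phylum labels used as keys and the read counts are hypothetical values chosen for illustration; they are not taken from the WP7 pipelines or data.

```python
def relative_abundances(counts):
    """Convert raw per-OTU read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {otu: n / total for otu, n in counts.items()}


# Hypothetical read counts for a single fecal sample (illustrative only).
sample_counts = {"Firmicutes": 600, "Bacteroidetes": 300, "Proteobacteria": 100}
profile = relative_abundances(sample_counts)

for otu, fraction in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{otu}: {fraction:.1%}")
```

Normalizing each sample to fractions in this way makes profiles comparable across samples sequenced to different depths, which is presumably why abundance maps are expressed in relative rather than absolute terms.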

1.2 Materials and Methods

NB: This paragraph is the same as in Deliverable 7.2.1: First report on data collection process.

All the fecal samples (please see Details of Task activities) have been collected, labelled with a software-assisted barcoding system and stored at the Biobank of OPBG under controlled conditions. The analysis of the samples was made possible by the technological platform and related pipelines developed so far (Figure 3). Several original pipelines have been designed and applied to the analytical phase of the data processing, also in collaboration with bioinformatic groups, ranging from statistics to systems biology pipelines for data integration (Figure 3).

Figure 3. Original bioinformatic pipelines designed and set to generate and process metagenomic and metabolomic data.

The large reference database can furthermore provide differential fingerprinting profiles, comparing obese and JIA microbiota with other disease signatures, to develop a phenotyping map for pediatric diseases (Figure 4).

Figure 4. Biobank of reference and disease samples available at the Metagenomics Unit of OPBG.

The integration of phenotyping and genotyping traits will represent the next step in the activities of the Consortium and the Metagenomics Unit, with the generation of data repositories at local (server with 6 CPUs) and remote sites (Gnubila, EBI) and with dissemination linked to bioinformatic activities (Figure 5).

Figure 5. Operational steps for bioinformatic data integration workflows and dissemination activities linked to WP7.

2. Details of Task activities

2.1 Task T7.3 DNA analysis (mm18)

Progress

T7.3.1 DCMP. Molecular results from blood target enrichment sequencing are expected from BMR by May, as previously described in the project. Regarding the samples from UCL and DHZ, it was decided during the first internal meeting that they will transit through OPBG for DNA extraction and will then be shipped to BMR. However, to date OPBG has not received any samples from these institutions. In the meantime, OPBG is proceeding with DNA extraction and QC verification.

T7.3.2 Rheumatology. In order to analyze the OTU content of JIA patients, a targeted approach based on pyrosequencing of the variable regions V1 and V3 of the 16S rRNA locus has been performed. Qualitative and quantitative metagenomic analyses of gut microbiota OTUs at Phylum and Order level have been provided, including the bioinformatic elaboration of the JIA gut microbiota type, described by weighted/unweighted UniFrac and Bray-Curtis algorithms.

T7.3.3 CVD Obesity. Blood SNP analysis is in progress at BMR Genomics. Qualitative and quantitative metagenomic analyses of gut microbiota OTUs at Phylum and Order level have been provided, including the bioinformatic elaboration of the obesity microbiota type, described by weighted/unweighted UniFrac and Bray-Curtis algorithms.
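To make the Bray-Curtis comparison mentioned above concrete, the sketch below computes the Bray-Curtis dissimilarity between two OTU count profiles as the sum of absolute differences divided by the sum of all counts. It is a minimal illustration under assumed inputs: the function name, the sample names and the counts are hypothetical and do not come from the project data, and the actual WP7 pipeline (see Deliverable 7.2.1) may compute the index differently, for example on rarefied or normalized tables. UniFrac is not sketched here because it additionally requires a phylogenetic tree of the OTUs.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two OTU count profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles with no
    shared OTUs. OTUs missing from a profile are treated as count 0.
    """
    otus = set(a) | set(b)
    numerator = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    denominator = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical phylum-level counts for two samples (illustrative only).
sample_a = {"Firmicutes": 60, "Bacteroidetes": 40}
sample_b = {"Firmicutes": 30, "Bacteroidetes": 50, "Proteobacteria": 20}

print(f"Bray-Curtis dissimilarity: {bray_curtis(sample_a, sample_b):.2f}")
```

Computing this value for every pair of samples yields a distance matrix, which is the usual input for ordination or clustering of microbiota types.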

Significant Results

T7.3.1 DCMP. The first 18 months of the project have been dedicated to designing a custom gene panel and to validating the protocol for target enrichment of 56 genes involved in CMD and other forms of inherited cardiomyopathies (HCM, ARVC, CPVT, LQT, SQT and Brugada Syndrome). In agreement with the other partners of the project, the number of genes has been expanded from 18 to 56, in order to obtain a more comprehensive cardiomyopathy profile for the clinical samples at similar cost. The genes of interest are listed in the attached Excel file Gene list-md-paedigree-wp7. The sequence data obtained from the processing of the first samples were analyzed using a custom bioinformatic pipeline, including variant calling and annotation of the detected variants. These preliminary data were also used to verify the quality of the panel in terms of coverage, on-target reads and specificity of the probes.

Gene panel design. Since none of the commercially available standard kits allows the selective enrichment of all genes of interest, we opted to design a custom gene panel. A careful preliminary analysis of the performance and costs of several enrichment kits led us to choose the Agilent HaloPlex Custom Target Enrichment kit (1-500 kb, cod: G9901C). The panel design was carried out using the web tool Agilent SureDesign: https://earray.chem.agilent.com/suredesign/index.htm. The parameters have been optimized in order to improve the coverage in sequence regions characterized by high GC content and low mappability. The design included all coding exons of the genes of interest, UTR regions and 25 to 50 bp of flanking intronic regions. The target region size is 443 kb and the target coverage is 99.38%.

Sample preparation and sequencing. Each genomic DNA (gDNA) sample was first checked for quality and quantity, and then an individual target-enriched, indexed library was prepared, following the official Agilent protocol (in attachment) for the Illumina platform.
For each sequencing run, equimolar amounts of 22 libraries were multiplexed and the final pool was sequenced in paired-end format (2 x 150 bp) on the Illumina MiSeq system using the Illumina "MiSeq Reagent Kit v2 (300 cycle)". For the bioinformatic pipeline for CMPD panel sequencing, please see Deliverable 7.2.1, Annex 1.

T7.3.2 Rheumatology, JIA. Bioinformatic pipeline for metagenomic analyses: please see Deliverable 7.2.1.

T7.3.3 CVD Obesity. Bioinformatic pipeline for CVD-risk assessment: please see Deliverable 7.2.1, Annex 2.
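The equimolar pooling of libraries mentioned above for each sequencing run can be sketched as follows. This is a minimal illustration, not the laboratory protocol actually followed: it assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA to convert a mass concentration into molarity, and the library names, concentrations, fragment lengths and the 20 fmol target are hypothetical values.

```python
def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Approximate library molarity in nM (equivalently fmol/uL).

    Assumes an average mass of 660 g/mol per base pair of dsDNA.
    """
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def pooling_volumes(libraries, fmol_per_library):
    """Volume (uL) of each library that contributes the same molar amount.

    `libraries` maps a library name to (concentration in ng/uL,
    mean fragment length in bp).
    """
    return {
        name: fmol_per_library / molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }


# Hypothetical libraries (illustrative values, not measured data).
libraries = {
    "lib01": (3.3, 500),  # about 10 nM
    "lib02": (6.6, 500),  # about 20 nM
}
volumes = pooling_volumes(libraries, fmol_per_library=20.0)

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
```

The same calculation extends unchanged to a pool of 22 libraries: a more concentrated library simply contributes a proportionally smaller volume, so that each index is represented by roughly the same number of molecules in the final pool.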

Explanation of reasons for failing to achieve critical objectives and its impact

Blood and fecal sample collection for genetic and metagenomic analyses, respectively, is still at the beginning, or has not even started, for the DHZ, UCL and Utrecht samples.

Reasons for deviations from DoW

Sampling from DHZ and Utrecht patients has started. Sample collection by UCL should be started immediately. We do not have specific explanations.

Proposed corrective actions

We suggest that sampling begin immediately at the remaining Consortium Centers that have not yet started sample collection.

3. Detailed results

a) Metagenomics and metabolomics profiling of gut microbiota; b) host genomics.

Please see the Deliverable 7.2.1 report, which is strictly related to the present Deliverable 7.3.1 for both experimental and bioinformatic procedures.

4. Conclusions and Future Perspective

Based on these preliminary results, the sample and DNA biobank will be enlarged, and the procedures standardized during the first year of activities will be followed for both genomic and metagenomic activities. Ontological categories and phenomics features will be deposited in the MD-Paedigree Infostructure database for all patients analyzed. Integration of omics data will be performed at the local level (OPBG, Metagenomics Unit) for metabolomics and metagenomics, by optimized and dedicated bioinformatic pipelines, and at the Infostructure level by considering the other features, including host genomics and clinical variables, for obese, JIA and CVR-associated patients.