Big data for better science Data Science Institute
Sense and Sensibility
Julie A. McCann

Adaptive Embedded Systems
The aim of the Adaptive Emergent Systems Engineering (AESE) group in the Department of Computing is to examine the relationships between embedded systems and their environments (physical and human), to better understand their behaviours and impacts, and to exploit this knowledge to enhance the performance of such systems.

ICRI Cities London Living Lab (L3)
The Hyde Park L3 platform will advance the use of sensing and social platforms deployed in the wild to support research into ecology, air quality, water quality, noise and light pollution, public engagement, and the communication and manageability of sensed data. This will enable, for example, the Royal Parks authority to visualise real-time and near-real-time data through a simple dashboard, alongside deeper analysis of the raw data. It will also allow educators at the Isis Education Centre to engage school children and the general public with a better understanding of the park and its ecology, usage and history.

Crowdsourcing and Opportunistic Networking
Future smart cities will require sensing on a scale hitherto unseen. Fixed infrastructures have limitations regarding sensor maintenance, placement and connectivity. Employing the ubiquity of mobile phones, whereby the phone carries the data, is one approach to overcoming some of these problems. This work is the first to exploit underlying social networks and financial incentivisation: by combining network science principles with Lyapunov optimisation techniques, we have shown that global social profit across a hybrid sensor and mobile phone network can be maximised.

Smart Water Systems
Water networks are moving away from sparsely instrumented telemetry systems. The vast majority of next-generation approaches to managing such networks consist of denser sensor networking, but these still require data to be sent back to core management servers. Actuation technologies are becoming more on-line and in-line with sensor networking. This brings about opportunities to make water networks smarter and, in turn, more resilient and optimal. Such a network is an example of a cyber-physical system (CPS). With sample rates of up to 120/s there is a strong need for big data analytics and adaptive cloud computing.

Acknowledgements
London Living Labs is sponsored by Intel and Future Cities Catapult. Smart Water Systems is sponsored by NEC Japan and FP7 WISDOM. Photos: Ivan Stoianov. Opportunistic Sensing is sponsored by the Intel Collaborative Research Institute Sustainable Connected Cities.

Department of Computing, Huxley Building, South Kensington Campus, Imperial College London, SW7 2AZ. Email: jamm@imperial.ac.uk wp.doc.ic.ac.uk/aese
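The Lyapunov optimisation idea behind the opportunistic networking work can be illustrated with a toy sketch (not the group's actual formulation): at each time slot a sensor node chooses whether to pay an incentive to offload buffered readings via a passing phone, trading cost against queue backlog using the standard drift-plus-penalty rule. All parameters and the random traffic below are illustrative assumptions.

```python
import random

# Toy drift-plus-penalty scheduler: a sensor node decides each slot whether to
# pay an incentive to offload its queued readings via a passing phone.
# All parameters (V, COST, RATE) are illustrative assumptions.
V = 5.0          # penalty weight: larger V favours lower cost over smaller queues
COST = 1.0       # assumed incentive paid per offload opportunity used
RATE = 3         # readings a phone can carry per slot
Q = 0            # queue backlog (readings waiting at the node)

for t in range(1000):
    arrivals = random.randint(0, 2)            # new sensor readings this slot
    phone_nearby = random.random() < 0.5       # opportunistic contact
    # Drift-plus-penalty: offload only if the queue-reduction benefit Q*RATE
    # outweighs the weighted incentive cost V*COST.
    offload = phone_nearby and Q * RATE > V * COST
    served = RATE if offload else 0
    Q = max(Q + arrivals - served, 0)

print("final backlog:", Q)
```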
Co-design of Cyber-Physical Systems
Eric Kerrigan

Cyber-Physical Systems
Cyber-physical systems (CPS) are composed of physical systems that affect computations, and vice versa, in a closed loop. By tightly integrating computing with physical systems one can design CPS that are smarter, cheaper, more reliable, efficient and environmentally friendly than systems based on physical design alone. Examples include modern automobiles (the 2013 Ford Fusion generates 25GB of data per hour), aircraft and trains, power systems, medical devices and manufacturing processes. The dramatic increase in sensors and computing power in CPS presents unique big data challenges to the engineer of today and tomorrow. The key big data questions for CPS are what, where, when and how accurately to measure, compute, communicate and store. My team is providing answers to these by developing control systems theory and mathematical optimization methods to automatically design the computer architecture and algorithms at the same time as the physical system. This co-design process results in a better overall system compared to iterative methods, where sub-systems are independently designed and optimized.

In the co-design loop, the computing system turns measurements $y$ (subject to numerical errors) into optimal inputs for the physical system (subject to disturbances) by solving
$u^*(y) := \arg\min_u f(u, y) \quad \text{s.t.} \quad g(u, y) = 0,\; h(u, y) \le 0,$
while the co-designer chooses the optimal design parameters $p$ for the physical system and $c$ for the computing system by solving
$(p^*, c^*) := \arg\min_{p,c} \varphi(p, c) \quad \text{s.t.} \quad \alpha(p, c) = 0,\; \beta(p, c) \le 0.$

By understanding the nature and timescales of the physical dynamics one can dramatically reduce the amount of data needed in order to make a decision and/or increase the quality and quantity of information extracted from a given data set. Current work is concerned with model-based feedback methods that allow one to minimize the amount of measurements and computational resources needed to estimate, in real time, information that can then be used to control and optimize the behaviour of the overall system.

Mathematical Optimization
Most CPS co-design problems can be formulated as multi-objective and constrained mathematical optimization problems. Furthermore, CPS are optimal only if the computing system is executing tasks with the goal of optimising given performance criteria. We are therefore developing methods to: model and solve the non-smooth and uncertain optimization problems that result during the co-design process, and execute constrained, nonlinear optimization algorithms in real time on embedded and distributed computing systems.

Control and Dynamical Systems Theory
The main technical challenge in the co-design of CPS is to merge abstractions from physics with computer science: the study of physical systems is based on differential equations, continuous mathematics and analogue data, whereas the study of computing systems is based on logical operations, discrete mathematics and digital data. Furthermore, while a computation is being carried out, time is ticking and the system continues to evolve according to the laws of physics. A designer therefore has to trade off system performance, robustness and physical resources against the timing and accuracy of measurements, communications, computations and model fidelity. We are developing system-theoretic methods to understand and exploit this hybrid and real-time nature of CPS. Current work includes the co-design of parallel computing architectures, linear algebra and optimization algorithms to increase the efficiency of the computations.
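Purely as an illustration of the nested structure above (not the group's actual formulation or tooling), the inner control problem and the outer co-design problem can be prototyped with an off-the-shelf solver; the objective, constraint and cost functions below are made-up toy examples.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_input(y, p):
    """Inner problem: u*(y) = argmin_u f(u, y) s.t. h(u, y) <= 0 (toy example)."""
    f = lambda u: (u[0] - p * y) ** 2 + 0.1 * u[0] ** 2          # toy tracking cost
    cons = [{"type": "ineq", "fun": lambda u: 1.0 - abs(u[0])}]  # |u| <= 1
    return minimize(f, x0=[0.0], constraints=cons, method="SLSQP").x[0]

def codesign_cost(pc):
    """Outer problem: trade closed-loop error against an assumed computing cost."""
    p, c = pc
    ys = np.linspace(-1.0, 1.0, 11)                    # representative measurements
    tracking = sum((optimal_input(y, p) - y) ** 2 for y in ys)
    compute_cost = 0.5 / c                             # cheaper hardware -> slower solves (toy)
    return tracking + compute_cost + 0.01 * c          # penalise expensive hardware

res = minimize(codesign_cost, x0=[0.5, 1.0],
               bounds=[(0.1, 2.0), (0.1, 10.0)], method="L-BFGS-B")
print("co-designed (p*, c*):", res.x)
```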
Acknowledgements
This research is in collaboration with George Constantinides, Jonathan Morrison, Rafael Palacios, Mike Graham and Jan Maciejowski (Univ. of Cambridge).

Department of Electrical & Electronic Engineering and Department of Aeronautics, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: e.kerrigan@imperial.ac.uk www.imperial.ac.uk/people/e.kerrigan
Crystallisation of Biological Molecules for X-Ray Crystallography
Lata Govada, Sahir Khurshid, Tim Ebbels, Naomi E. Chayen

The Problem
Detailed understanding of protein structure is essential for the rational design of therapeutic treatments and also for a variety of industrial applications. The most powerful method for determining the structure of proteins is X-ray crystallography, which is totally reliant on the availability of high quality crystals. The crystallisation of proteins involves purified protein undergoing slow precipitation from an aqueous solution, during which the protein molecules organise themselves into a repeating lattice structure.

The Challenge
There is currently no means of predicting suitable crystallisation conditions for a new protein. Figure 3 illustrates a sample cross-section of the enormous chemical space explored during screening. Finding crystallisation conditions for a new protein is like searching for a needle in a haystack. Initial attempts (referred to as screening) involve the exploration of a multi-dimensional parameter space using thousands of candidate conditions. The miniaturisation and automation of such screening trials has been of great benefit, but crystallisation remains the rate-limiting step of structure determination (Figure 1). Figure 2 shows a crystal of the Human Macrophage Migration Inhibitory Factor.

Figure 1. Results from structural genomics centres worldwide (TargetTrack, PSI).
Figure 2. Crystal of Human Macrophage Migration Inhibitory Factor.
Figure 3. Plot of crystal hits for 269 macromolecules from the structural genomics community. Dark blue indicates five or more crystal hits for that cocktail, medium blue 3-4 and light blue 1-2. White areas are unsampled areas of chemical space.

The relevant parameters include the type and concentration of precipitating agent, the concentration of protein, the type and concentration of a secondary precipitating agent and/or of an additive, the pH and temperature, amongst others. One or more of these conditions may show some promise, most often in the form of microcrystals, clusters, or microcrystalline suspension. The following optimisation step consists of fine-tuning these promising conditions by changing the values of the various parameters, such as concentrations and pH, in small increments, until useful crystals are obtained. This common approach fails in 80% of cases even when high-throughput methods are employed. High throughput has not yielded high output, and significant amounts of protein sample, time and resources are wasted.

A wealth of public data (PDB, BMCD) exists which is not being tapped into efficiently. The ability to predict crystallisation conditions would revolutionise this field. Addressing this challenge will require two aspects of big data science. Firstly, the data generated from structural genomics projects on crystallisation conditions is huge, with millions of combinations of protein sequence and conditions attempted in high-throughput screens. Storage, search and retrieval of these data in an efficient way will require big database tools. Secondly, the discovery of patterns in sequences and other molecular properties which predict optimal crystallisation conditions will require sophisticated statistical and machine learning algorithms in order to make sense of the high-dimensional but still sparsely sampled data. The desired result would be a more efficient methodology for conducting crystallisation experiments and an in silico approach to the prediction of crystallisability. This would save immense amounts of experimental time, protein sample and other resources, and transform the field.

Computational and Systems Medicine, Department of Surgery and Cancer, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: n.chayen@imperial.ac.uk http://www.imperial.ac.uk/medicine/people/n.chayen/
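As a hedged illustration of the machine-learning aspect described above, the sketch below trains a standard classifier to predict crystallisation success from a handful of made-up sequence- and condition-derived features; the feature set and the random data are assumptions, not the project's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy feature matrix: rows are (protein, cocktail) trials, columns are assumed
# descriptors such as sequence hydrophobicity, isoelectric point, precipitant
# concentration and pH. Labels mark whether the trial yielded crystals.
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)          # cross-validated hit-rate prediction
print("mean CV accuracy:", scores.mean())

clf.fit(X, y)
print("feature importances:", clf.feature_importances_)
```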
Global fits of dark matter theories
Roberto Trotta, Pat Scott, Charlotte Strege

The Dark Matter mystery
The experimental hunt for dark matter is entering a crucial phase. Decades of astrophysical and cosmological studies have shown almost conclusively that 80% of the matter in the Universe is made of a new type of particle. One of the key questions of cosmology and particle physics today is to determine the nature and characteristics of such a particle. The aim of our work is to put constraints on the physical parameters of theoretical models for dark matter (such as Supersymmetry) by combining four complementary probes: cosmology, direct detection, indirect detection and colliders. This is the so-called global fits approach.

Experimental probes of Dark Matter
Cosmology: Observations of the relic radiation from the Big Bang, the cosmic microwave background, constrain the amount of dark matter in the Universe with very high precision.
Direct detection: Direct detection experiments aim at detecting dark matter by measuring the recoil energy of nuclei undergoing a collision with a dark matter particle. Some highly controversial claims for detection are directly contradicted by other experiments, which have not found any statistically significant signal.
Indirect detection: Dark matter particles annihilating into Standard Model particles produce high energy photons and neutrinos, which can be detected using dedicated space- and ground-based observatories.
Colliders: The Large Hadron Collider at CERN is putting strong limits on the properties of putative particles beyond the Standard Model. The recent discovery of the Higgs boson (for which the 2013 Nobel Prize in Physics was awarded) also puts strong constraints on the properties of such speculative theories.
Our work implements, for the first time, the entire spectrum of these constraints in a statistically correct way, in order to extract the maximum information possible about the nature of dark matter.

Statistical constraints from global fits on the dark matter mass and scattering cross section in a 15-dimensional theory (Strege et al., to appear).

Big Data challenges
Our group has developed a world-leading Bayesian approach to the problem, allowing us to explore, in a statistically convergent way, theoretical parameter spaces previously inaccessible to detailed numerical study. Our methodology couples advanced Bayesian techniques with fast, approximate likelihood evaluations. Even so, it remains computationally very challenging: each likelihood evaluation requires numerical simulation of the ATLAS detector. This involves the generation of a large number of simulated events, the production of a numerical likelihood function based on a binned analysis and the evaluation of the ensuing constraint. This process is CPU- and disk-space-intensive: our current study (see above) required 100s of TB of disk space and 400 CPU-years of computing power. We studied theoretical models with up to 15 free parameters. The most general models have up to 105 parameters, so novel techniques are needed to explore such complex parameter spaces.

Acknowledgements
We thank Imperial High Performance Computing services and the University of Amsterdam for providing computing resources. This project is in collaboration with G Bertone, R Ruiz de Austri and S Caron.

Map of the relic radiation from the Big Bang, used to measure the amount of dark matter in the Universe.
Credit: Planck/ESA Astrophysics Group, Blackett Laboratory, Imperial College London, Prince Consort Road, London SW7 2AZ. Email: r.trotta@imperial.ac.uk www.robertotrotta.com
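A minimal sketch of the Bayesian machinery behind such global fits, assuming a toy two-parameter Gaussian likelihood in place of the real (detector-simulation-based) one; the actual analyses use far more sophisticated samplers and likelihoods.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_likelihood(theta):
    """Toy stand-in for the real likelihood (which would require detector simulation)."""
    m, sigma = theta          # e.g. dark matter mass and log cross section (toy scales)
    return -0.5 * ((m - 100.0) / 15.0) ** 2 - 0.5 * ((sigma + 8.0) / 0.5) ** 2

def log_prior(theta):
    m, sigma = theta
    return 0.0 if (10 < m < 1000 and -12 < sigma < -4) else -np.inf   # flat within bounds

# Random-walk Metropolis-Hastings over the 2D parameter space.
theta = np.array([300.0, -6.0])
chain = []
for _ in range(20000):
    proposal = theta + rng.normal(scale=[10.0, 0.2])
    log_alpha = (log_likelihood(proposal) + log_prior(proposal)
                 - log_likelihood(theta) - log_prior(theta))
    if np.log(rng.random()) < log_alpha:
        theta = proposal
    chain.append(theta.copy())

chain = np.array(chain[5000:])                      # discard burn-in
print("posterior mean mass, log cross section:", chain.mean(axis=0))
```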
Digital City Exchange
David Birch, Yike Guo, Nilay Shah, Orestis Tsinalis, John Polak, Koen van Dam, Eric Yeatman

Context and Challenge
Cities are now home to more than half of the world's population. They face significant challenges, such as congestion, air quality, and the provision of food and electricity, but also offer opportunities for innovation and collaboration, as well as increased efficiency enabled by their density. A smart city is a connected city: efficient use of resources through interaction and integration. This requires a better understanding of the complexity of cities and urban living. We have the data, but how can we make the most of city data and cope with integration and the vast scale? City infrastructures are connected and influence one another. Currently data is collected, analysed and used in the traditional silos of energy, transport, education, waste, etc., but the hypothesis of the Digital City Exchange is that better decisions can be made through data integration. We are building the infrastructure to facilitate this and will then test it with analytical and predictive models.

Approach
A three-tier solution, comprising an ontology-supported sensor data store, a workflow engine and a web-based interface for building chains of connected data sets and models, enables the creation of services which take advantage of (real-time) data, analytics and predictive models.

City Data
Data is collected by utility companies, (local) governments and service providers, but also by residents. This includes induction loops in the roads to measure traffic flows, air quality monitors, pothole reporting via smartphone, smart bins that report when they are full, social media messages, etc. Much of this data is closed, with only one party having access to it, while other data is shared (possibly paid for) or even released as open data for anyone to use. Platforms are needed to store, analyse and collaborate using this data.

Acknowledgements
Digital City Exchange is a five-year programme at Imperial College London funded by Research Councils UK's Digital Economy Programme (EPSRC Grant No. EP/I038837/1).

Email: D.Stokes@imperial.ac.uk www.imperial.ac.uk/dce
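The three-tier idea of chaining data sets and models can be sketched, purely illustratively, as a small pipeline of composable steps; the step names, the in-memory "store" and the sample readings below are assumptions rather than the actual DCE platform.

```python
from statistics import mean

# Illustrative stand-in for an ontology-tagged sensor data store:
# each reading carries the concept it measures plus location and value.
SENSOR_STORE = [
    {"concept": "air_quality:NO2", "location": "Hyde Park", "value": 41.0},
    {"concept": "air_quality:NO2", "location": "Hyde Park", "value": 38.5},
    {"concept": "traffic:flow",    "location": "Hyde Park", "value": 1200},
]

def query(concept):
    """Tier 1: pull readings for one concept from the data store."""
    return [r["value"] for r in SENSOR_STORE if r["concept"] == concept]

def aggregate(values):
    """Tier 2: a trivial analytic model (here, just the mean)."""
    return mean(values)

def run_workflow(steps, seed):
    """Tier 2/3: the workflow engine chains steps; a web UI would configure this list."""
    result = seed
    for step in steps:
        result = step(result)
    return result

no2_mean = run_workflow([query, aggregate], "air_quality:NO2")
print("mean NO2 reading:", no2_mean)
```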
Astronomically Big Data
David L. Clements, Steve Warren, Daniel Mortlock, Alan Heavens

Large-scale catalogues in astrophysics are already large, but the next generation of surveys will boost that size by orders of magnitude. In particular, the Euclid mission will provide Hubble Space Telescope quality near-IR images across the entire sky, while the Large Synoptic Survey Telescope (LSST) will image the entire (accessible) sky in 5 different colours every 5 days. Conventional methods of classifying objects (using image metrics or citizen science) may be inadequate for fully exploiting the discovery space of these vast surveys. Statistical analysis of these vast datasets, to test Einstein's theory of gravity and shed light on the Big Bang, also presents formidable data analysis challenges which need to be met if the power of the surveys is to be realised.

Current state of the art: SDSS
The Sloan Digital Sky Survey (SDSS) observed ¼ of the sky in 5 optical bands, obtaining imaging and photometry for 500 million sources, and spectroscopy for 1 million. Images and spectra are automatically analysed, but human-eyeball citizen science through Zooniverse has proved useful in finding truly unusual objects, for example Hanny's Voorwerp, the green object shown below, a previously unknown and poorly understood ionised gas cloud in the intergalactic medium, found through the citizen science project Galaxy Zoo. (Source: NASA/ESA/W)

Euclid & LSST: The coming deluge
The forthcoming Euclid and LSST projects will be orders of magnitude beyond the scale of SDSS and similar current projects. Euclid will observe ~40% of the sky at resolutions comparable to the Hubble Space Telescope (HST). 10 billion galaxies will be imaged, each of which will have 100 times the number of pixels of an SDSS image, for ~2000x the amount of data per night. LSST (the Large Synoptic Survey Telescope) will be a wide-field 8m telescope which will survey ~½ of the sky (20,000 sq. deg.) in 5 colours every 5 days. Its images can be combined to give time resolution to search for transient sources (e.g. supernovae), stacked to go deep, or some combination of the two. The data rate is 30 Terabytes per night and it will run for more than 10 years. The discovery space for these projects is so big that it cannot be handled by either conventional computing or citizen science approaches.

Physics Department, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: d.clements@imperial.ac.uk http://astro.ic.ac.uk/dclements/home; davecl.wordpress.com; @davecl42
Future Computational Platforms
Christos Bouganis

Specialized Computational Platforms
The increasing need to process large amounts of data as fast as possible, combined with the development of increasingly complex computational models for more accurate modelling of the underlying processes, has led researchers and practitioners to adopt suboptimal approximation models or, in certain cases, to make heavy use of High-Performance Computing clusters. However, neither approach is desirable: the former does not provide the best possible solution, while the latter results in low silicon efficiency and high power consumption, as these systems are not tailored to the structure of a specific application. In the Circuits and Systems group of the Department of Electrical and Electronic Engineering, we conduct research into core computational platforms that can be adapted to specific applications, leading to high performance gains within a power budget compared to classical computer architectures. Our current work involves the design of computational platforms for the acceleration of the training stage of computationally demanding Machine Learning algorithms, and the acceleration of probabilistic algorithms for Bayesian inference when they are applied to health care.

Machine Learning
Our group has developed a computational platform that accelerates the training stage of a Support Vector Machine algorithm, making it possible to achieve high classification rates within a limited time and power budget. By designing the architecture of the system to match the targeted algorithm, the system has achieved a speed-up of two orders of magnitude while consuming only a fraction of the power footprint of a personal computer.

Probabilistic Inference Acceleration
Our work also focuses on the bioinformatics domain, where it is often required to analyse large amounts of data using complex probabilistic models. As probabilistic inference algorithms are computationally expensive, our work focuses on the design of computational platforms with an architecture that is tuned to the probabilistic inference algorithm. Recent results obtained from the acceleration of population-based MCMC algorithms show that two orders of magnitude speed-ups over traditional CPU code can be achieved with a minimal power footprint.

Other key aspects of our research are the optimization of the memory interface to maximize the bandwidth between computation and SDRAM memory, and data-path optimization, including computer arithmetic, for low power and high performance.

Department of Electrical and Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: christos-savvas.bouganis@imperial.ac.uk www.cas.ee.ic.ac.uk/people/ccb98
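For context on the algorithm being accelerated, here is a hedged software-only sketch of SVM training with a standard library; the custom hardware described above implements the computationally heavy training kernels in silicon, which this sketch does not attempt to show, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for a real classification workload.
X = rng.normal(size=(5000, 16))
y = (X[:, :4].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The kernel evaluations inside fit() dominate the runtime; this is the stage
# a tailored FPGA architecture can accelerate by orders of magnitude.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```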
Big Data in Medical Imaging
Daniel Rueckert, Ben Glocker

Overview
In medical imaging, a vast amount of information is collected about individual subjects, groups of subjects or entire populations. A characteristic of medical imaging is that the sensors or devices (e.g. CT or MR machines) can produce 2D, 3D or even 4D datasets. While each dataset is large in itself, the amount of information derived from each dataset is often much larger than the original information. In the following we outline the challenges of big data in the context of medical imaging that are addressed in the Biomedical Image Analysis Group at Imperial College London.

Big data from clinical studies/trials
Over the last few years there has been an explosion of imaging data generated from clinical trials. In addition to imaging data collected for drug development, there is an increasing amount of data available for research purposes. Two of the most prominent examples of this are the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Human Connectome Project (HCP). The latter project aims to build a comprehensive map of neuronal connections at the macroscale. For this, state-of-the-art diffusion and functional MR imaging (see figure below left) is collected from 1,200 subjects, producing more than 25GB of raw data per subject. The analysed data (see figure below right) requires more than 1 PB of storage.

Machine learning for medical imaging
The use of machine learning in the analysis of medical images plays an increasingly important role in many real-world clinical applications, ranging from the acquisition of images of moving organs such as the heart, liver and lungs to computer-aided detection, diagnosis and therapy. For example, machine learning techniques such as manifold learning can be used to identify classes in the image data, and classifiers may be used to differentiate clinical groups across images (see figure below left). In addition, these approaches allow the combination of imaging information with non-imaging information, e.g. genetics (see figure below right; special vertices encode non-imaging information such as ApoE genotype). The figure below shows the application of these ideas to the automatic identification of subjects with dementia.

Big data from population studies
An example of big data from population studies is the UK Biobank imaging effort. This project has recently received funding for a large-scale feasibility study which, if successful, will allow it to conduct detailed imaging assessments of 100,000 UK Biobank participants. This more detailed characterisation of the participants will allow scientists to develop an even greater understanding of the causes of a wide range of different diseases (including dementia) and of ways to prevent and treat them. The imaging study will involve magnetic resonance imaging of the brain, heart and abdomen (see figure right), low-power X-ray imaging of bones and joints, and ultrasound of the neck arteries.

Biomedical Image Analysis Group, Department of Computing, Huxley Building, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: d.rueckert@imperial.ac.uk, b.glocker@imperial.ac.uk http://biomedic.doc.ic.ac.uk
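A hedged sketch of the manifold-learning idea mentioned above: embed high-dimensional image-derived features into a low-dimensional space and then train a classifier to separate clinical groups. The synthetic features and labels are assumptions; real pipelines work on carefully extracted imaging biomarkers rather than random data.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for image-derived features (e.g. regional volumes or
# cortical thickness measures) for two clinical groups.
controls = rng.normal(loc=0.0, size=(100, 50))
patients = rng.normal(loc=0.6, size=(100, 50))
X = np.vstack([controls, patients])
y = np.array([0] * 100 + [1] * 100)

# Manifold learning: embed the 50-D features into a 2-D space that preserves
# local neighbourhood structure, then classify in that space.
embedding = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), embedding, y, cv=5)
print("cross-validated accuracy in embedded space:", scores.mean())
```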
Effects of high-frequency company-specific news on individual stocks
Robert Kosowski, Ras Molnar

Research Objectives
The aim of this research is to study the impact of high-frequency company-specific news on individual stocks. The term "high-frequency news" in this context means news items that are reported electronically by news companies during the day. Why is high-frequency news interesting to study? High-frequency news is an important information source for all market participants and sheds light on economic transmission mechanisms that cannot be observed using lower-frequency data, for example daily closing prices or low-frequency economic indicators. How is our research novel? The contribution of our research lies in the fact that we not only measure the sentiment extracted from news but other news characteristics as well. We also utilize high-frequency data, which has not been studied extensively from this perspective. What are the expected outputs? We expect to find that high-frequency news and novel sentiment measures have an economically significant impact on asset prices. It is likely that the innovations in our methodology will lead to more significant results compared with existing studies.

Big Data
For the purpose of our project, we use two main sources of high-frequency information. Both imply a vast amount of data relating to news and to trades. We use a news database based on the Reuters Site Archive. This dataset contains about 5.6 million Reuters news items from the beginning of 2007 until the end of 2012. The raw HTML files take about 426GB, while the database containing news identifiers and news text is around 31GB in size.
(Chart: number of high-frequency news items by year, 2007-2012.)
We use the TAQ database for high-frequency stock data. This dataset contains trades and quotes from the major American stock exchanges. In our research we intend to use trades only. Trade data is an example of Big Data because the number of trades increased over time from 92 million trades in 1993 to 7.5 billion in 2008. The extensive number of trades implies a large size for the database itself. The cumulative size of the databases containing TAQ trades from the beginning of 2007 until the end of 2012 is expected to be around 4TB.
(Chart: number of trades by year, in millions.)

Methodology
The methodology we use in this research is in line with the existing literature (for example, Gross-Klussmann, A. and N. Hautsch, "When machines read the news: Using automated text analytics to quantify high frequency news-implied market reactions", Journal of Empirical Finance 18(2), 321-340, 2011). The frequency and amount of data we have to process mean that we pre-process data within the database before we progress with the analysis. In the case of the news data, we calculate the sentiment, relevance and novelty of news using textual analysis similar to Boudoukh et al. ("Which news moves stock prices? A textual analysis", Technical Report, National Bureau of Economic Research, 2013). Stock market data are sampled and only the parts necessary for our analysis are selected. The analysis itself consists of two parts, an event study and a vector autoregression model. The goal is to explain the reaction of the stock market given the characteristics of the news.

Acknowledgements
Our news database is based on the Reuters News Web Archive.

Finance Group, Imperial College Business School, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: r.kosowski@imperial.ac.uk www.imperial.ac.uk/people/r.kosowski
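A hedged sketch of the event-study part of the methodology: align high-frequency returns around each news timestamp and cumulate them into a post-event profile. The column names, sampling interval and random data are assumptions, not the project's actual TAQ/Reuters processing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy one-minute return series for a single stock over one trading day.
idx = pd.date_range("2012-06-01 09:30", "2012-06-01 16:00", freq="1min")
returns = pd.Series(rng.normal(scale=1e-4, size=len(idx)), index=idx)

# Toy news events with a sentiment score in [-1, 1].
events = pd.DataFrame({
    "time": pd.to_datetime(["2012-06-01 10:15", "2012-06-01 14:02"]),
    "sentiment": [0.8, -0.6],
})

def post_event_return(ts, minutes=10):
    """Cumulative return from the event time to `minutes` after it."""
    window = returns.loc[ts: ts + pd.Timedelta(minutes=minutes)]
    return window.sum()

events["post_event_return"] = events["time"].apply(post_event_return)
print(events)
# An event study would average these profiles across thousands of events,
# split by sentiment, relevance and novelty.
```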
Development of an ovarian cancer database for translational research
Haonan Lu, Christina Fotopoulou, Ioannis Pandis, Yike Guo, Hani Gabra

Ovarian cancer is a systemic disease which can be dysregulated through multiple mechanisms; it is therefore crucial to understand the detailed molecular pathways behind it. Recently, The Cancer Genome Atlas (TCGA) project has generated multiple levels of OMIC data from genome to phenome, which gives us a comprehensive view of high grade epithelial ovarian cancer. However, cross-correlation of good quality clinical data with the multilevel molecular profile is required to obtain valid biomarkers. Furthermore, the difficulty in accessing, and also reproducing, the TCGA data has been a known issue impairing interpretation and implementation of the findings.

Multiple molecular profiles constructed for 175 ovarian cancer cases
We have previously systematically collected samples from 175 primary epithelial ovarian cancer patients and obtained molecular information across multiple platforms, including gene expression microarray, SNP array, exome sequencing and Reverse Phase Protein Array (Figure 1). A great advantage of these data is that the samples were collected from a single institute, with much less bias in sample type; the clinical data are therefore cleaner and the molecular data more reliable.

Figure 1. (a) Types of molecular profile data obtained from the 175 ovarian cancer patients, with the coverage of each platform and the collaborators: metabolomics (serum and urine, to be done; Imperial College); gene expression profile (>47,000 transcripts; Genome Institute of Singapore); DNA copy number variation (5,677 CNV regions; Genome Institute of Singapore); exome sequencing (whole exome; London Research Institute); proteomics (>160 proteins; MD Anderson). (b) Published result using part of the gene expression data. We compared the gene expression profile among three subtypes of ovarian cancer (benign, borderline and malignant). We found distinct gene expression patterns between benign and malignant tumours, whereas borderline tumours showed two distinct subgroups: one benign-like and the other malignant-like. Adapted from "Molecular subtypes of serous borderline ovarian tumor show distinct expression patterns of benign tumor and malignant tumor-associated signatures", Mod Pathol, 0893-3952, Curry EW, Stronach EA, Rama NR, et al., 2013.

Continuously updated clinical data
In order to place these molecular data within the correct frame of context and to be able to define valid biomarkers of surgical and clinical outcome, we are currently generating robust, updated and detailed surgical and clinical data to be cross-correlated with the molecular biological information (Figure 2).

Figure 2. (a) Comparison of the number of clinical parameters (total, surgical and chemotherapy) collected by TCGA and by Hammersmith. (b) Planned workflow after obtaining the new clinical data: novel clinical parameters are correlated with outcome (e.g. overall survival and progression-free survival) and with the molecular profile, in order to personalise surgical operation and drug treatment and to derive biomarkers to stratify patients.

Data interpretation using tranSMART
Apart from generating quality data, we have also been working on making the data more accessible to researchers by collaborating with the tranSMART project. tranSMART is a database platform with built-in analytical tools that is ready to use for all researchers. We are currently creating the Ovarian Cancer Database within the tranSMART platform, which contains our dataset together with other popular datasets to help researchers perform data analysis across multiple studies (a worked example is shown in Figure 3). We are aiming to significantly accelerate ovarian cancer research for both clinicians and scientists.

Figure 3. Example workflow using the Ovarian Cancer Database in tranSMART. (i) Discovering the association between chemotherapy response and overall survival using the GIS dataset. The Kaplan-Meier plot shows that patients responding well to chemotherapy (blue) have a significantly higher survival rate compared with the chemo-resistant patients (red). (ii) Differential gene expression between the two patient cohorts (complete response and progressive disease). As the corresponding gene expression profile is available for these patients, differential gene expression analysis can be performed to discover potential marker genes for chemo-resistance. (iii) Cross-validation of genes of interest in multiple datasets (e.g. the Tothill and TCGA datasets) to guide subsequent experimental research. All the analysis shown is performed within tranSMART.

Acknowledgements
We specially thank Prof. Yike Guo, Dr. Ioannis Pandis and other group members for their help with the tranSMART platform.

Ovarian Cancer Action Research Centre, Department of Surgery and Cancer, Imperial College London, Hammersmith Campus, London W12 0NN. Email: h.gabra@imperial.ac.uk
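As a hedged illustration of the kind of survival analysis Figure 3 describes, here is a sketch with the open-source lifelines package on simulated data rather than the actual GIS cohort or the tranSMART implementation; group sizes and survival scales are invented.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)

# Simulated follow-up times (months) and event indicators for two cohorts:
# chemotherapy responders vs chemo-resistant patients.
t_resp = rng.exponential(scale=60, size=80)
t_resist = rng.exponential(scale=30, size=80)
e_resp = rng.random(80) < 0.7        # True = death observed, False = censored
e_resist = rng.random(80) < 0.7

kmf = KaplanMeierFitter()
kmf.fit(t_resp, event_observed=e_resp, label="responders")
print("responders median survival:", kmf.median_survival_time_)

kmf.fit(t_resist, event_observed=e_resist, label="chemo-resistant")
print("resistant median survival:", kmf.median_survival_time_)

# Log-rank test for a difference in survival between the two cohorts.
result = logrank_test(t_resp, t_resist,
                      event_observed_A=e_resp, event_observed_B=e_resist)
print("log-rank p-value:", result.p_value)
```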
Data Science Institute, Imperial College London
Institute for Security Science & Technology
Donal Simmie, Maria Grazia Vigliotti, Erwan Le Martelot, Chris Hankin

Influence in Social Networks
Influential agents in networks play a pivotal role in information diffusion. Influence may rise or fall quickly over time, so capturing this evolution of influence is of benefit to a varied number of application domains. We propose a new model for capturing both time-invariant influence and temporal influence. We performed a primary survey of the users in our population to elicit their views on influential users; the survey allowed us to validate the results of our classifier. We introduce a novel reward-based transformation of the Viterbi path of the observed sequences which provides an overall ranking for users. Our results show an improvement in ranking accuracy over using solely topology-based methods for the particular area of interest we sampled. Utilising the evolutionary aspect of the HMM, we predict future states using current evidence. Our prediction algorithm significantly outperforms a collection of models, especially in the short term (1-3 weeks).

Automated Sensemaking Recovery
Complex data analysis is often multi-modal, incorporating visualisations and structured and unstructured data, possibly from numerous disparate data sources. Making sense of the presented data and interrogating it successfully to form hypotheses and conclusions are non-trivial tasks, but they are aided by leveraging applications and bespoke tools designed for exactly this purpose. Humans are skilled at solving difficult problems, exploring data and discovering new insights. However, computers can provide benefits to us in solving these problems by improving our memory and recall and by presenting data to us in a manner that leads to insight and/or questions our decisions for a more positive outcome. Sensemaking provenance captures the reasoning flow of an analyst during a specific task. We perform machine learning on the interactions of the analyst with the computer and the context of those actions to determine their probable reasoning.

Fast Multiscale Community Detection
Many systems can be described using graphs, or networks. Detecting communities in these networks can provide information about the underlying structure and functioning of the original systems. Yet this detection is a complex task and a large amount of work has been dedicated to it in the past decade. One important feature is that communities can be found at several scales, or levels of resolution, indicating several levels of organisation; solutions to the community structure may therefore not be unique. Networks also tend to be large and hence require efficient processing. In this work, we present a new algorithm for the fast detection of communities across scales using a local criterion. We exploit the local aspect of the criterion to enable parallel computation and improve the algorithm's efficiency further.

Acknowledgements
Influence in Social Networks and Fast Multiscale Community Detection are supported by the Making Sense project under EPSRC grant EP/H023135/1. Automated Sensemaking Recovery is supported by the UKVAC project, funded by the US DHS and the UK Home Office.

Institute for Security Science and Technology, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: d.simmie@imperial.ac.uk http://www3.imperial.ac.uk/securityinstitute
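For reference alongside the community detection work above (and explicitly not the group's own multiscale algorithm), the sketch below runs a standard greedy modularity community detection on a small synthetic network with NetworkX; the graph size and probabilities are illustrative.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Synthetic network with planted community structure: three dense groups
# connected by a few cross-links.
G = nx.random_partition_graph([30, 30, 30], p_in=0.3, p_out=0.01, seed=42)

# Standard greedy modularity maximisation (single scale, for comparison only;
# the group's algorithm detects communities across several scales in parallel).
communities = greedy_modularity_communities(G)
print("communities found:", len(communities))
print("sizes:", sorted(len(c) for c in communities))
```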
Future science on exabytes of climate data
David Ham

When climate models execute on 100 million cores and generate exabytes of data, how will we work with this data? How will we account for the diverse numerical schemes used to produce it? How will the users of climate research know that our calculations were valid and that our results can be relied on?

Climate Model Intercomparison
Climate modelling is the basis for the UN Intergovernmental Panel on Climate Change (IPCC) assessment reports, and a very large component of modern climate science is based on the analysis of data from the CMIP simulations. As computing power increases, climate model resolutions become ever finer, and the resulting data sets demonstrate exponential growth: CMIP Phase 3 (2006) produced 36 Terabytes; CMIP Phase 5 (2011) produced 3.3 Petabytes; CMIP Phase 6 (~2020) is expected to yield 100s of Petabytes to 1 Exabyte.

Climate science queries
Climate science questions typically require mathematical functions to be applied to reduce vast spatial and temporal field data sets to meaningful climate statistics. Across the vast field of climate science, each research project has its own specialised questions to ask. For example: Which models predict an increase in coastal flooding for the UK? How does Atlantic sea surface temperature differ in different simulations? What is the strength of the Gulf Stream in all of the CMIP simulations?

Current methodology
Data is downloaded by each researcher, and custom analysis scripts are developed for each query. This is labour-intensive (researchers, often PhD students and postdocs, around the world are constantly re-implementing very similar work), error-prone (every query script is bespoke and is a new source of errors, and there is no systematic mechanism for finding them), and untraceable and unverifiable (there is no effective mechanism to publish the actual techniques applied to the data, and verifying their correctness is next to impossible). The results published in the literature must currently be taken on trust, as there is no mechanism for establishing their provenance.

A proposed toolchain for high productivity, scalable and verifiable climate data science
Rather than hand-writing bespoke low-level processing tools, climate researchers need to be able to state their questions in high-level mathematical form. The code implementing the query will be automatically generated by the Firedrake system and applied to the climate model data. A query for the mean sea surface temperature in the North Atlantic might appear as:

    north_atlantic = domain(latitude=(0., 60.), longitude=(-60., 0.))
    for date in <list of dates>:
        atlantic_multidecadal_oscillation = \
            integral(sea_surface_temperature*dx(north_atlantic)) / \
            area(north_atlantic)

The toolchain builds on Firedrake, an Imperial-developed system for the automatic generation of high-performance, parallel numerical code from the mathematical query. Different numerics will be generated to execute the same mathematics on the outputs of different models. The code generator can be extensively tested to provide verifiably correct results. Generated code will be applied to the data using cloud resources attached to the archive site; the original data is not downloaded by the user. The original query is short and expressive and can therefore be included in publications. This will enable verification and reproduction of results, which is currently effectively impossible.

Departments of Mathematics and Computing, Imperial College London.
Email: david.ham@imperial.ac.uk www.imperial.ac.uk/people/david.ham
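For contrast with the proposed toolchain, a bespoke script of the kind researchers write today might look roughly like the hedged sketch below, which uses xarray on a locally downloaded file; the file name, variable name and coordinate names are assumptions.

```python
import xarray as xr

# Bespoke, hand-written analysis of the kind the proposed toolchain replaces:
# the researcher downloads the model output and writes a one-off script.
ds = xr.open_dataset("cmip_model_output.nc")        # assumed local file
sst = ds["sea_surface_temperature"]                 # assumed variable name

# Mean sea surface temperature over a crude North Atlantic box, per time step
# (assumes ascending latitude/longitude coordinates; area weighting by
# cos(latitude) is omitted for brevity).
north_atlantic = sst.sel(latitude=slice(0, 60), longitude=slice(-60, 0))
amo_index = north_atlantic.mean(dim=["latitude", "longitude"])
print(amo_index)
```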
Intelligent Neural Interfacing Systems
Amir Eftekhar, Sivylla Paraskevopoulou, Timothy Constandinou, Christofer Toumazou

Bio-Inspired Paradigm
Within the Centre for Bio-Inspired Technology we utilise biological principles and mechanisms to create more efficient healthcare technology. This bio-inspired paradigm allows for (1) learning from biology to create more efficient healthcare technologies and (2) modelling biology to understand it better. Expanding this principle, we apply local intelligence to our devices to create more efficient data transmission and to implement closed-loop protocols.

(Diagram: biologists' electrophysiology models meet electrical engineering and applied physics; modelling, understanding and simulating biology across organs/systems, architectures and applications.)

Brain Interfacing
The brain is a complex network of 100 billion neurons. To transmit the full quantity of data it produces would require nearly 16,000 Tb/s per person. In a chronic disease population of 1 million people, monitoring what can be achieved with modern electrodes and communication (100 electrodes) equates to 16Tb/s. The same is true for other monitoring schemes: heart activity (ECG, 2-3 channels) and non-invasive brain monitoring (EEG, up to 64 channels). Although lower in sampling frequency, these still equate to 3Gb/s per channel for a population of 1M, or 11Tb/hour.

Bio-Inspired Architectures
Examples from our group include a closed-loop artificial pancreas, a cochlea implant and a retina chip. Some of our more recent work applies local intelligence to neural interfacing.

Closed-Loop Appetite Control
Obesity is one of the greatest public health challenges of the 21st century. Affecting over half a billion people worldwide, it increases the risk of stroke, heart disease, diabetes, cancers, depression and complications in pregnancy. Bariatric surgery is currently the only effective treatment available but is associated with significant risks of mortality and long-term complications. The peripheral nervous system is a complex network of over 45 miles of nerve with impulses travelling at speeds of 275mph. In this project we are tapping into the Vagus nerve to extract the signals that control appetite, and electrically stimulating it to regulate appetite. The gut is densely innervated by the Vagus nerve, so its signals represent an integrated response to nutrients, gut physiology and hormones, and have a powerful effect on appetite. The nerve is a complex structure and so requires interfacing with dozens of electrodes monitoring chemical and electrical activity. Here we are utilising real-time, self-learning algorithms for closed-loop control of appetite.

(Diagram: implanted biotelemetry with a nerve cuff electrode and microspike array (connectors, cuff contacts and microchip) feeding a processing chain of amplification, conditioning and pre-processing, spike detection, spike sorting, analysis and stimulation, with an external transponder/power unit, towards intelligent next-generation neural interfaces for spinal stimulation, prosthetic control and brain-computer interfaces.)

With the advent of High Density Microelectrode Arrays we can tap into a subset of these signals. Neural activity can be monitored from 100s of channels, with data rates exceeding 20Mbps, which is not possible in medical implants. Local, intelligent processing of neural signals can reduce this to less than 1Mbps, which facilitates closed-loop systems such as spinal cord stimulation.
We have developed low-power, real-time spike detection and sorting algorithms, which form part of the processing chain for neural signals, i.e. identifying which neuron has fired in the vicinity of the electrode. We are currently developing the final generation of microchip with this processing embedded. With it, we can reduce the 1Tb/s to less than a Mb/s for 500 neurons.

Acknowledgements
This work is a multi-disciplinary effort among many researchers and students at the Centre for Bio-Inspired Technology and collaborators.

Centre for Bio-Inspired Technology, Dept. of Electrical and Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: amir.eftekhar@imperial.ac.uk www.imperial.ac.uk/a.eftekhar
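As a hedged illustration of the spike detection step (not the group's low-power hardware algorithm), a common software baseline thresholds the signal at a multiple of a robust noise estimate; the sampling rate, amplitudes and injected spikes below are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated extracellular recording: noise plus a few injected spike-like events.
fs = 24000                                  # assumed sampling rate (Hz)
signal = rng.normal(scale=10e-6, size=fs)   # 1 s of noise (volts)
spike_times = [3000, 9000, 15000, 21000]
for t in spike_times:
    signal[t:t + 30] += -60e-6 * np.hanning(30)   # crude negative-going spikes

# Robust noise estimate from the median absolute deviation, then a threshold
# at a multiple of it (a widely used software baseline).
sigma = np.median(np.abs(signal)) / 0.6745
threshold = 4 * sigma
crossings = np.where(signal < -threshold)[0]

# Keep one detection per spike by enforcing a 1 ms refractory gap.
detections = [crossings[0]] if len(crossings) else []
for idx in crossings[1:]:
    if idx - detections[-1] > fs // 1000:
        detections.append(idx)
print("detected spike samples:", detections)
```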
Digital Money
Llewellyn Thomas, Antoine Vernet, David Gann

Project Context and Goals
Money is one of the most influential factors shaping human history, driving not only wealth creation and socio-economic development but also religion, ethics, morality, and fine art (Eagleton & Williams, 2011). Some have argued that digital money, as distinct from earlier forms of money, has the potential to provide major economic and social benefits, such as by removing friction in transactions or enabling inclusive innovation (Dodgson et al., 2012). Moreover, the big data generated by digital money can be used to improve business operating efficiency, develop novel business models, and complement or even extend the notion of identity. However, there is little, if any, systematic research into digital money, its adoption and its impact. Given this gap, it is our ambition to address the following: Does digital money adoption make a difference? What are the big data implications of digital money? Is it possible to quantify the benefits to governments, corporations and individuals? What are the factors that affect the outcome of a digital money initiative?

Conceptualizing Digital Money
We define digital money as currency exchange by electronic means. Digital money is a socio-technical system that fulfils societal functions through technological production, diffusion and use (Geels, 2004). It is a system of value interchange relying on information and communication technologies that themselves form a system. As a result, and given the importance of regulation to digital money, we conceptualised the digital money system as four interacting components: the national institutional context, the enabling technological and financial infrastructure, the demand for digital money, and the industries that drive digital money supply.

Digital Money Readiness
To provide better insight into the differing readiness of countries for digital money, we have developed a Digital Money Readiness Index. By readiness we mean the level of development of the country with respect to the institutional, financial, technological, and economic factors that underpin digital money. Taking the four components above as the pillars of the composite index, we selected a range of indicators which measure progress along each pillar, ranked countries according to their digital money readiness, and, using cluster analysis, identified four stages of readiness. We also correlated our index with existing cashlessness measures, and found that although there is strong correlation, there are also developed- and developing-world outliers that reflect the social and cultural aspects of money.

Future Directions
This research has begun to widen the discussion of digital money to a broader academic audience. It has also provided a comprehensive definition of digital money that encompasses both the wide variety of existing digital means of exchange and those future technologies that are undoubtedly to come. Our digital money readiness index also has important implications for policy makers. Moving forward, we intend to: improve the transparency of the index; include measures of digital currencies, such as Bitcoin; implement a penalty for bottlenecks to improve the policy implications of the index; investigate the big data implications of digital money; and investigate whether the claimed economic and social benefits of digital money are indeed present.
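A hedged sketch of how a composite readiness index and readiness stages could in principle be assembled; the indicator values, equal weighting and the choice of k-means clustering below are illustrative assumptions, not the actual index methodology.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy indicator data for 40 countries across the four pillars.
pillars = ["institutions", "infrastructure", "demand", "supply"]
data = pd.DataFrame(rng.uniform(size=(40, 4)), columns=pillars)

# Min-max normalise each indicator, then average equally into a composite score.
normalised = (data - data.min()) / (data.max() - data.min())
data["readiness_index"] = normalised.mean(axis=1)

# Cluster countries on the pillar scores to identify discrete stages of readiness.
stages = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(normalised[pillars])
data["stage"] = stages
print(data.groupby("stage")["readiness_index"].mean().sort_values())
```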
Acknowledgements
We gratefully acknowledge both the financial and intellectual support of Citigroup, and would particularly like to thank Greg Baxter, Sandeep Dave, and Ashwin Shirvaikar. We also thank Lazlo Szerb and Erkko Autio for their suggestions on composite indices.

Business School, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: llewellyn.thomas@imperial.ac.uk www.imperial.ac.uk/people/llewellyn.thomas
Impact of Changes in Primary Health Care Provision
Elizabeth Cecil, Alex Bottle, Mike Sharland, Sonia Saxena

Unplanned hospital admissions in children have been rising across England over the last decade [1]. Access to timely and effective primary care for minor or non-urgent conditions prevents potentially avoidable hospital admission [2]. GPs' withdrawal from out-of-hours care in 2004 may have resulted in children being seen in hospital emergency departments where previously parents would have contacted their GP, particularly for acute infectious illness. The Quality and Outcomes Framework (QOF) has been successful in incentivising primary care to improve adult health outcomes for chronic disease. Yet children, who make up 25% of GP workload, are under-represented in quality improvement targets in primary care. Hence children may access hospital-based alternatives to primary care for acute exacerbations of chronic conditions [3].

Aim: To investigate whether changes to GP services have impacted on unplanned and short-stay hospital admissions in children for infectious and chronic disease.
Design: National population-based time trends study.

(Diagram: routes to unplanned care: the GP, alternatives such as walk-in centres and telecare, and A&E.)

Methods
We used Hospital Episode Statistics (HES) data from all English hospitals from 2000-2011 on children aged <15 years to calculate age/sex standardized admission rates for all unplanned admissions, short stays (<=2 days with no readmission) and very short stays (no overnight stay). We adjusted for deprivation. The interrupted time series analysis design allowed for a step change at, and a gradient change after, 2004 in the rate of unplanned hospital admissions in children.
Outcomes: Total unplanned, short-stay and very-short-stay hospital admission rates, for all causes, infectious disease and chronic disease.
Exposure: Post 2004.

Results
Crude unplanned admission rates increased between 2000/1 and 2010/11 in all developmental age bands in children aged <15 years. The adjusted rate of all-cause unplanned admissions increased by 2% per year after the introduction of the GP service changes in 2004, compared with the trend in previous years (rate ratio (RR) = 1.02 (95% CI: 1.02, 1.03)). The biggest changes were observed in very short stay admissions, those unplanned admissions with no overnight stay. There was an estimated step change of 8.5% (RR = 1.08 (95% CI: 1.07, 1.10)) in adjusted unplanned admission rates for all chronic diseases in 2004. There was no evidence of a step change in the adjusted unplanned admission rates for infectious disease, but the rate of increase doubled after 2004, from 1.2% to 2.3% per year.

(Figure: standardized and fitted unplanned admission rates by year, 2000-2010, for all-cause, chronic disease and infectious disease admissions.)

Department of Primary Care and Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Paediatric Infectious Diseases Unit, St George's, University of London, Cranmer Terrace, London SW17 0RE. Email: e.cecil@imperial.ac.uk www.imperial.ac.uk/medicine/people/e.cecil
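A hedged sketch of the interrupted time series design described above, using simulated monthly admission counts and a Poisson model with level-change and slope-change terms at 2004 (statsmodels GLM); the data and effect sizes are invented and no denominator offset is included.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated monthly admission counts, 2000-2011, with a step and slope change in 2004.
months = pd.period_range("2000-01", "2011-12", freq="M")
t = np.arange(len(months))
post = (months.year >= 2004).astype(int)
t_post = np.where(post, t - np.argmax(post), 0)       # months since the change
rate = np.exp(4.0 + 0.002 * t + 0.08 * post + 0.002 * t_post)
y = rng.poisson(rate)

# Interrupted time series: baseline trend, step change at 2004, post-2004 trend change.
X = sm.add_constant(pd.DataFrame({"t": t, "post": post, "t_post": t_post}))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(np.exp(fit.params))        # rate ratios for the trend, step and slope-change terms
```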
Early in-hospital mortality following trainee doctors' first day at work
Min Hua Jen, Alex Bottle, Azeem Majeed, Derek Bell, Paul Aylin

There is a commonly held assumption that early August is an unsafe period to be admitted to hospital in England, as newly qualified doctors start work in NHS hospitals on the first Wednesday of August. A previous UK study using national death certificate data found no effect, but could not discriminate between in-hospital and out-of-hospital deaths. US studies have suggested an equivalent "July effect". We investigate whether in-hospital mortality is higher in the week following the first Wednesday in August than in the previous week, using national hospital administrative data.

Methods
We constructed two retrospective cohorts of all emergency patients admitted on the last Wednesday in July and the first Wednesday in August for 2000 to 2008, each followed up for one week. If, by the end of the following Tuesday, a patient had died in hospital, we counted them as a death; otherwise we presumed them to have survived. We calculated the odds of death for admissions occurring in the week after the first Wednesday in August compared with those in the week before, adjusted for age (20 groups: <1 year, 1-4, 5-9, and five-year bands up to 90+), sex, area-level socio-economic deprivation (quintile of the Carstairs index of deprivation), year (NHS financial year of discharge, from 1st April each year to 31st March the next year) and comorbidity (using the Charlson index of co-morbidity, ranging from 0 to 6+).

Results
Odds ratios comparing the odds of death for patients admitted on the first Wednesday in August with those admitted on the last Wednesday in July (unadjusted and adjusted*).

Discussion
Strengths: a large national study covering 9 years; only deaths in hospital were included; the denominator involved no overlap in care.
Limitations: we only looked at those admitted on a single day; our figures equate to just 11 extra deaths per year; short follow-up (how long does the effect last?).
Patients admitted on the first Wednesday in August have a higher death rate than those admitted on the last Wednesday in July in hospitals in England. There is also a statistically significantly higher death rate for medical patients that was not evident for surgical admissions or patients with malignancy. If this effect is due to the changeover of junior hospital staff, then it has potential implications not only for patient care but also for NHS management approaches to delivering safe care. We suggest further work to look at other measures, such as patient safety, quality of care, process measures or medical chart review to identify preventable deaths rather than overall early mortality, to further evaluate the effect of the junior doctor changeover.

Acknowledgements
PA, MHJ and AB are employed within the Dr Foster Unit at Imperial College London. The Unit is funded by a research grant from Dr Foster Intelligence (an independent health service research organization). The Unit is also affiliated with the CPSSQ at Imperial College Healthcare NHS Trust, which is funded by the NIHR. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript or poster.

Dr Foster Unit at Imperial College & Department of Primary Care and Public Health, School of Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Department of Medicine, Imperial College London, Chelsea and Westminster Campus, 369 Fulham Road, London SW10 9NH.
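A hedged sketch of the kind of adjusted odds-ratio calculation described in the Methods, fitted to simulated admissions with a single binary exposure (August week vs July week) and an age covariate; the coefficients and data are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20000

# Simulated emergency admissions: exposure = admitted in the week after the
# first Wednesday in August (1) vs the week before (0).
df = pd.DataFrame({
    "august_week": rng.integers(0, 2, n),
    "age": rng.integers(0, 95, n),
})
logit_p = -4.5 + 0.06 * df["august_week"] + 0.03 * df["age"]
df["died"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# Logistic regression: adjusted odds of death by admission week.
X = sm.add_constant(df[["august_week", "age"]])
fit = sm.Logit(df["died"], X).fit(disp=0)
print("adjusted OR (August vs July):", np.exp(fit.params["august_week"]))
print("95% CI:", np.exp(fit.conf_int().loc["august_week"]).values)
```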
Interrupted time-series analysis of London stroke services re-organisation
Roxana Alexandrescu, John Tayu Lee, Alex Bottle, Paul Aylin

Stroke accounts for around 11% of all deaths in England. Most people survive a first stroke, but often have significant morbidity. In England, approximately 110,000 people have a first or recurrent stroke each year, and it is estimated that stroke costs the economy around £7 billion per year, of which £2.8 billion is a direct cost to the NHS. Prior to 2010, provision of stroke care in London was complex, with care spread across a number of units and only 53% of patients treated on a dedicated stroke ward [1]. To improve the quality of service, eight Hyper Acute Stroke Units (HASUs) were established in London from February 2010. The units, which are dedicated to treating stroke patients, are open 24 hours a day, seven days a week, to offer immediate access to stroke investigations and imaging, including CT brain scans and clot-busting thrombolysis drugs. Our aim was to assess the impact of the HASU policy using established stroke performance indicators based on national routine hospital administrative data.

Methods
We used Hospital Episode Statistics (HES) from April 2006 to March 2012 to include a time period before and after the policy introduction. We identified all admissions with a primary diagnosis of stroke in any episode of care, based on an ICD-10 disease code of I60, I61, I62, I63 or I64. We examined six indicators defined previously. These were: brain scan on the day of admission; thrombolysis treatment; diagnosis of aspiration pneumonia in the hospital; seven-day in-hospital mortality; discharge to usual place of residence within 56 days; and thirty-day emergency readmission (all causes). We plotted the unadjusted rates for the process and outcome indicators by time (quarter of year). We tested for linear trends pre and post intervention (excluding a six-month intervention period, January 2010 to June 2010) and for a step change at the time of the intervention for each indicator, using an interrupted time series (ITS) negative binomial regression model.

Results
During the 6-year period April 2006 to March 2012, we identified 536,034 stroke admissions to hospitals in England, 61,643 of these (11.5%) being in the London area. Compared with areas outside London, the 7-day in-hospital death rate reduced significantly following the restructuring of services, as did the rate of aspiration pneumonia. However, same-day brain scans showed a small but significant reduction following the intervention, as well as a slowing down in the rate of increase. This study suggests that the HASU policy was effective in improving the treatment of stroke patients in the London area, the intervention being associated with decreasing in-hospital mortality and decreasing rates of aspiration pneumonia in the post-intervention period.
Our model also included a seasonal effect (dummy variable for each month) and patient characteristics, including age (six categories: 0-44, 45-54, 55-64, 65-74, 75-84 and 85 years or over), sex and socioeconomic deprivation status (Carstairs deprivation quintiles).

Figure 1. Unadjusted temporal changes for the performance indicators for stroke care by study area (London vs England without London), by quarter of year, April 2006 - March 2012, with the reorganisation of London stroke services (Jan-July 2010) marked as the intervention. Panels: rates of same-day brain scan; thrombolysis; aspiration pneumonia; deaths within 7 days; discharge to usual place of residence; and emergency readmission.

Acknowledgements
This poster represents independent research supported by NIHR Patient Safety Translational Research Centre.
Dr Foster Unit at Imperial College & Department of Primary Care and Public Health, School of Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Department of Medicine, Imperial College London, Chelsea and Westminster Campus, 369 Fulham Road, London SW10 9NH.
Adverse events recorded in English primary care
Carmen Tsang, Alex Bottle, Azeem Majeed, Paul Aylin

The epidemiology of patient safety incidents in primary care remains inconclusive, with fluctuating estimates and a narrow focus on drug-related harm. More accurate and recent estimates of adverse events in primary care are necessary to assign resources for patient safety improvement, while predictors must be identified to mitigate patient risk. This study determined the incidence of recorded iatrogenic harm in general practice and identified risk factors for these events, using standardised clinical diagnosis codes.

Methods
Cross-sectional sample of 74,763 patients at 457 English general practices between 1st January 1999 and 31st December 2008, obtained from the General Practice Research Database. Patient age at study entry, sex, ethnicity, deprivation, practice region, duration registered at the practice, continuity of care, comorbidities and health service use (GP consultations, referrals and emergency admissions) were analysed. Adverse events were defined by designated diagnosis codes for complications of care from three Read Code chapters for external causes of injury and poisoning, including complications of medical and surgical care (chapters S, T and U). Comorbidities were measured by a modified Charlson Index and the Johns Hopkins Adjusted Clinical Groups (ACG) Case-Mix System. Crude and adjusted analyses were performed by Poisson regression using Generalized Estimating Equations (GEE). All analyses were performed using SAS Version 9.2a.

Results
The incidence was 6.0 adverse events per 1,000 person-years (95% CI 5.74-6.27), equivalent to 8 adverse events per 10,000 consultations (n=2,540,877). Greater risk of adverse events (adjusted results):
Aged 65 to 84 years: RR 5.62, 95% CI 4.58-6.91; p<0.001.
Greater number of consultations: RR 2.14, 95% CI 1.60-2.86; p<0.001.
5 emergency admissions: RR 2.08, 95% CI 1.66-2.60; p<0.001.
More comorbidities (Johns Hopkins ACG Expanded Diagnosis Clusters): RR 8.46, 95% CI 5.68-12.6; p<0.001.

The low incidence of recorded adverse events is comparable with other studies. The results demonstrate potential uses of routinely collected data for active safety surveillance, with identification of some risk factors that may be associated with iatrogenic harm. Data on the care setting where AEs occurred were unavailable, but the low rate may reflect under-recording of AEs occurring in primary care. Temporal sequencing of risk factors and case ascertainment would benefit from data triangulation. Future studies might explore whether first adverse events predict future incidents.

Data Sources
The Johns Hopkins Adjusted Clinical Groups (ACG) Case-Mix System, version 9.01i, was used in analyses; the currently available version of the software is 10i. This study is based in part on data from the Full Feature General Practice Research Database (GPRD) obtained under licence from the UK Medicines and Healthcare Products Regulatory Agency (MHRA). The interpretation and conclusions contained in this study are those of the authors alone. Access to the GPRD database was funded through the Medical Research Council's licence agreement with the MHRA. The GPRD Group has Trent Multi-Centre Research Ethics Committee approval for all observational research using GPRD data (reference: 05/MRE04/87). This study was granted approval by GPRD's Independent Scientific Advisory Committee (ISAC).
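The analyses above were run in SAS; for illustration only, the same kind of Poisson GEE model can be sketched in Python with statsmodels. The file, column names and covariate coding below are hypothetical, not the study's actual variables.

```python
# Hedged sketch: Poisson regression with GEE, clustering patients within
# practices, analogous in spirit to the SAS analysis described above.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

patients = pd.read_csv("gprd_cohort.csv")  # hypothetical extract

gee = smf.gee(
    "adverse_events ~ C(age_band) + C(sex) + consultations_band"
    " + emergency_admissions_band + comorbidity_count",
    groups="practice_id",                   # account for clustering by practice
    data=patients,
    family=sm.families.Poisson(),
    offset=np.log(patients["person_years"]),
).fit()

print(np.exp(gee.params))                   # rate ratios with person-time offset
```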
Acknowledgements This poster represents independent research supported by NIHR Patient Safety Translational Research Centre. Dr Foster Unit at Imperial College & Department of Primary Care and Public Health, School of Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Department of Medicine, Imperial College London, Chelsea and Westminster Campus, 369 Fulham Road, London SW10 9NH.
Systems medicine of severe asthma: the U-BIOPRED experience
Ian M. Adcock, Kian Fan Chung, U-BIOPRED consortium

The U-BIOPRED cohort
The Unbiased Biomarkers for the Prediction of Respiratory Disease Outcomes (U-BIOPRED) consortium is a pan-European public-private collaboration funded by the Innovative Medicines Initiative (IMI) of the European Union and EFPIA. U-BIOPRED aims to sub-phenotype adult and paediatric patients with severe refractory asthma using an innovative systems medicine approach, moving from patient recruitment and clinical presentation to severe asthma "handprint(s)".

Patient recruitment and samples collected
Severe asthma was defined according to IMI criteria, and recruiting centres from 12 European countries participated. A baseline visit included an assessment of current health status, atopy, pulmonary function, sputum, HRCT scans (n=179) and questionnaires (QOL, anxiety/depression, adherence), together with blood and urine tests, exhaled-air VOCs and induced sputum, with biobanking of samples for eicosanoid, lipidomic, transcriptomic, proteomic and immunohistochemistry analyses. A sub-set of patients underwent fibreoptic bronchoscopy (n=194). Severe asthma cohorts are followed up longitudinally at 12 to 18 months and, in some cases, at an additional exacerbation visit. The overall workflow runs from sample collection, through omics data acquisition and knowledge management, to data integration (networks, clusters, mapping and statistical analysis).

Patients recruited (Jan 2012 - Mar 2013)
Adult cohorts: A, severe asthma, 366; B, severe asthma with smoking, 121; C, non-severe asthma, 130; D, non-asthma, 112.
Paediatric cohorts: A, severe asthma (school age), 102; B, mild/moderate asthma (school age), 58; C, severe asthma (pre-school), 83; D, mild/moderate asthma (pre-school), 56.

Data sets collected
Clinical data: 900 clinical variables per patient (adults) and 660 clinical variables per patient (paediatrics). Omics data across multiple biomatrices (plasma, sputum, urine, breath, biopsies, airway epithelial cells, airway smooth muscle cells): transcriptomics (33K data points per array); lipidomics (330-1000K data points per sample); proteomics (500K data points used from 4000 collected per sample); Sonamalogics (1250 analytes tested); eicosanoids (50 data points per sample); breathomics (106 samples).

Data analysis and integration
A systems medicine approach will be used to integrate high-dimensional data from invasive data, non-invasive data and patient-reported outcomes (PRO). Analysis methods will include WGCNA, GSEA & GSVA, Topological Data Analysis (TDA), Bayesian Belief Networks (BBN), bi-clustering and pathway analysis. Each analysis will be done by a domain-knowledge expert in close collaboration with a data analyst or statistician.

Airways Disease Section, NHLI, Imperial College London, Royal Brompton Campus, London SW3 6LY. Email: f.chung@imperial.ac.uk or ian.adcock@imperial.ac.uk
Imperial Space Lab
Steven J. Schwartz

Overview
140 academics from 7 departments in 4 faculties, with a wide range of multidisciplinary interests and PI-led funds of ~£90M across 290 grants. The Space Lab promotes internal collaboration and external engagement. Examples are given here; see accompanying posters for some details.

Dark Matter Theories
A scan of plausible models requires hundreds of CPU-years and hundreds of TB of storage; the bottleneck will hit at the LHC restart. Dr Roberto Trotta, Dr Pat Scott & Charlotte Strege, Astrophysics, Blackett Laboratory (r.trotta@imperial.ac.uk); see accompanying poster.

Earth Observation
Space and Atmospheric Physics: satellite data addressing climate, e.g. the impact of aerosols on the energy budget. See the Data Science Institute webpages for a case study. Dr Helen Brindley (h.brindley@imperial.ac.uk).
MODIS (NASA Terra & Aqua): green vegetation plus CO2 observations to improve models of primary CO2 production. Prof Colin Prentice, Grantham Institute for Climate Change (c.prentice@imperial.ac.uk).

Astronomically Big Data
Data science challenges are already large, e.g. SDSS, WISE, UKIDSS, VISTA. The next generation is bigger: Euclid will deliver Hubble-quality imaging over the entire sky; LSST will deliver multicolour imaging of the entire sky, around 100 TB every 5 days. Systems and algorithms to handle and exploit these data are needed; see accompanying posters. Dr D. L. Clements, Prof. S. J. Warren, Dr D. Mortlock, Prof. A. Heavens, Astrophysics, Blackett Laboratory (d.clements@imperial.ac.uk).

Electrical & Electronic Engineering
Ongoing projects address the handling of Big Data, e.g.: tensor decomposition of Big Data (Danilo Mandic, d.mandic@imperial.ac.uk); machine learning to improve the handling of large datasets, e.g. by FPGA devices (christos-savvas.bouganis@imperial.ac.uk); and guidance, control and optimization of mobile sensor networks, e.g. for Earth observation and autonomous robotic exploration (see accompanying poster; e.kerrigan@imperial.ac.uk).

Space Weather Modelling
Natural hazards to space infrastructure and power grids; forecasting requires developing and mining a large parameter space of inputs to, and responses of, Earth's magnetosphere. Dr Jonathan Eastwood (SPAT, Blackett), Dr Jeremy Chittenden (Plasmas) (jonathan.eastwood@imperial.ac.uk).

Blackett Laboratory, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: s.schwartz@imperial.ac.uk www.imperial.ac.uk/spacelab
Discovery Sciences Group
David Johnson, Orestis Tsinalis, Ioannis Pandis, Yang Xian, Yike Guo

The Discovery Sciences research group at the Department of Computing of Imperial College London investigates and develops infrastructure, methods and systems in the areas of Cloud Computing, Sensor Informatics and Translational Informatics. The group's research sections and projects: Cloud Computing (IC Cloud, Elastic Algorithms), Translational Informatics (eTRIKS, U-BIOPRED, Ovarian Cancer) and Sensor Informatics (DCE, Concinnity).

Cloud Computing
The research group develops and runs the IC Cloud, a cloud computing test bed designed specifically to research the use of cloud computing in data-intensive research. It extends the basic cloud computing concepts to enable a set of generic, scalable and resource-efficient services for scientific research. The group is also developing new algorithms to understand and exploit elastic qualities in cloud computing infrastructures to mediate factors such as performance, output quality, resource availability and pricing.

Sensor Informatics
We are developing a novel theoretical framework and associated computational model for information management in large-scale sensor networks. This uses attention-like mechanisms based on Bayesian techniques, inspired by the human sensory system, to identify and focus on relevant information and so avoid information overload within a complex sensor network environment. Our work on sensor informatics takes place through the development of the Concinnity platform under the Digital City Exchange programme (see poster), building infrastructure to harvest and integrate city-wide sensor data.

Translational Informatics
Our bioinformatics research aims at efficient storage and knowledge management of translational medicine data and the extraction of biological knowledge by combining clinical, transcriptomics, proteomics and lipidomics data through systems biology modelling, employing machine learning and data mining techniques. Our work on translational informatics is centred on three projects: eTRIKS (Translational Information and Knowledge Management Services), U-BIOPRED (Unbiased Biomarkers in Prediction of respiratory disease outcomes; see poster), and supporting ovarian cancer studies (see poster).

Figure: six-step severe-asthma analysis workflow. Step 1: finding fingerprints of severe asthma from experimental platforms (expression arrays, proteomics, lipidomics). Step 2: generating partial handprints of severe asthma (regulation, signalling and metabolic pathways; molecules, genes, proteins, lipids). Step 3: finding clusters within severe asthma. Step 4: finding fingerprints of severe asthma subgroups, incorporating other clinical and physical features (e.g. CT). Step 5: generating handprints of severe asthma subgroups. Step 6: constructing in silico models of biological processes.

Discovery Sciences Group, Department of Computing, William Penney Laboratory, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: discoverysciencesgroup@imperial.ac.uk; y.guo@imperial.ac.uk http://dsg.doc.ic.ac.uk

Acknowledgements
Digital City Exchange is a five-year programme at Imperial College London funded by Research Councils UK's Digital Economy Programme (EPSRC Grant No.
EP/I038837/1). eTRIKS and U-BIOPRED are funded by the Innovative Medicines Initiative, a public-private partnership between the European Union and the European Federation of Pharmaceutical Industries and Associations (EFPIA).
Data-intensive science: Higgs boson discovery at the LHC
Duncan Rand, Daniela Bauer, Simon Fayer, Adam Huffman, David Colling

The CMS experiment at the LHC
CMS is one of the four main experiments on the Large Hadron Collider (LHC) at CERN and is the general-purpose experiment that, along with ATLAS, discovered the Higgs boson in 2012. The experiment collaboration consists of more than 4,300 scientists, engineers and students from 180 institutes in 40 countries.
Figure: collision event recorded with the CMS detector in 2012 at a proton-proton centre-of-mass energy of 8 TeV, showing characteristics expected from the decay of the SM Higgs boson to a pair of τ leptons.

Worldwide LHC Computing Grid (WLCG)
The WLCG is composed of four levels, or Tiers, called 0, 1, 2 and 3. Data produced by the CMS detector are stored on tape at the Tier-0 at CERN and a copy is simultaneously sent out over a dedicated optical fibre network, known as the LHCOPN, to eight Tier-1s for archiving and further processing. After processing at the Tier-1s, the data are copied out to the Tier-2s for analysis. Imperial College runs a significantly sized Tier-2 and collaborates with other universities in London to form, collectively, the distributed London Tier-2, which is part of the GridPP collaboration; GridPP is the UK part of the WLCG. The Computing Team in the High Energy Physics (HEP) Group at Imperial College runs the Tier-2, which, in addition to CMS, supports the LHCb and ATLAS experiments from the LHC, experiments studying neutrinos and rare kaon decays, two accelerator experiments and also biomedical research. The cluster has approximately 3,200 computing cores and almost 3 PB of disk storage. Physicists submit jobs from a number of institutes around the world which run on the cluster, streaming and analysing data from the disk storage; results are copied back to storage at the remote site. Roughly a third of the CPU time consumed is made up of Monte Carlo simulations. The team also runs the GridPP Cloud cluster, developed the Real Time Monitor, and has pioneered the adoption of the IPv6 protocol within the WLCG.

Any time, any place, anywhere
CMS data are copied in over the wide area network (WAN) using the PhEDEx data management software. During Run 1 of the LHC, data were pre-placed at the Imperial College Tier-2 using the File Transfer System (FTS). Since then a data federation using Xrootd has been developed, such that it is now possible for a site's worker nodes to read from storage at a remote location such as CERN. This new paradigm, known as AAA (Any time, Any place, Any where), is exciting in that it allows CMS physicists much greater flexibility in the way that they process data, but it also has the potential to significantly increase the usage of the wide area network. Imperial College, as one of the best-connected Tier-2s in the UK with a 40 Gb/s link to Janet (the UK's research and education network), is well placed in this regard. We are in the process of testing the ability of our storage systems and network to ingest both pre-placed data and AAA data simultaneously.
Figures: pre-placed data transfers to Imperial College over the last year, a large proportion of which came from Fermilab near Chicago; test data transferred with FTS (peak 28 Gb/s); AAA data from CERN (peak 10 Gb/s).

Department of Physics, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: d.colling@imperial.ac.uk, duncan.rand@imperial.ac.uk www.imperial.ac.uk/highenergyphysics
Big data in multiphase flow research
Omar Matar, Christopher Pain

Multi-scale Examination of MultiPhase physics in flows (MEMPHIS) is a £5m EPSRC Programme Grant awarded to help create the next generation of modelling tools for complex multiphase flows. The programme will lead to the resolution of a number of fundamental open problems in multiphase flows, leading to the development of validated simulation tools for the academic and industrial community.

Experimental Visualization
We use tools such as high-speed photography to record interfacial configurations thousands of times per second. We also log data such as pressure, temperature, liquid holdup and velocity. The spatiotemporally resolved measurements, with pixel-to-pixel distances of the order of μm and sampling frequencies of up to 10,000 Hz, present a challenge not only from the experimental point of view but also for evaluation and data processing. To analyse the data, we process the images and the signals automatically through cross-correlation, pattern recognition, various transforms, etc.

LIF + PIV/PTV Tracking
The image below shows the application of non-intrusive, laser-based flow-field measurement techniques to co-current gas-liquid downwards annular flows. During the measurements, Laser Induced Fluorescence (LIF) and Particle Image/Tracking Velocimetry (PIV/PTV) are used to characterise the topological features of these interfacial flows and to obtain 2D velocity vector maps of the thin liquid films, respectively.

Simulation
Data science also allows us to use high-performance computers to understand and optimise physical processes and products. Numerical simulations of complex multiphase systems rapidly generate enormous data sets, so efficient, large-scale parallelisation rapidly becomes essential, especially for the computation of inverse problems. Example simulations include interface capturing and a 4-phase model of a polydispersed fluidized bed. A further figure shows the model geometry, phase volume fractions, streamlines and velocity vectors for a simulation of flow through a static mixer for two miscible non-Newtonian complex fluids, totalling several hundred thousand degrees of freedom. This visualisation alone required 3 processors on a college workstation; the simulation was calculated using load-balancing parallel mesh adaptivity across 12 processors on the college cluster.

Chemical Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: contact@memphis-multiphase.org, o.matar@imperial.ac.uk www.memphis-multiphase.org
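As a flavour of the automated signal processing mentioned above, the sketch below estimates the displacement of a particle pattern between two interrogation windows by FFT-based cross-correlation, the basic operation behind PIV. It is a minimal illustration with synthetic data, not the MEMPHIS processing chain.

```python
# Minimal FFT-based cross-correlation, the core step of PIV displacement
# estimation. Synthetic data; window size and shift are illustrative.
import numpy as np

rng = np.random.default_rng(0)
window_a = rng.random((64, 64))                         # first interrogation window
true_shift = (3, -5)                                    # (rows, cols) displacement
window_b = np.roll(window_a, true_shift, axis=(0, 1))   # shifted copy

# Circular cross-correlation via the Fourier domain.
corr = np.fft.ifft2(np.fft.fft2(window_a).conj() * np.fft.fft2(window_b)).real

# The correlation peak gives the displacement (modulo the window size).
peak = np.unravel_index(np.argmax(corr), corr.shape)
shift = [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]
print("estimated displacement:", shift)                 # expected: [3, -5]
```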
Bioinformatics Support Service
James Abbott, Geraint Barton, Derek Huntley, Chris Tomlinson, Mark Woodbridge, Sarah Butcher

The Imperial College Bioinformatics Support Service is part of the Imperial College Centre for Integrative Systems Biology and Bioinformatics (CISBIO). Our aim is to provide state-of-the-art bioinformatics support and resources to members of Imperial College for their research.

The Biodata Problem
More and more, project success is measured by publications AND by the data made available to the wider community. Data, including legacy data, are increasingly combined across fields and sources and often analysed in ways unforeseen by the original generators. Our ability to generate data is greater than ever before due to a combination of changing technology, decreasing unit price and increasing speed of acquisition. In contrast to data from other disciplines, biodata show:
lack of structure, rapid growth but not huge volume, and high heterogeneity;
multiple file formats, with widely differing sizes and rates of acquisition;
considerable manual data collection;
multiple format changes over the data lifetime, including production of (evolving) exchange formats;
a huge range of analysis methods, algorithms and software in use, with wide-ranging computational profiles;
association with multiple metadata standards and ontologies, some of which are still evolving;
increasing reference or linkage to patient data, with associated security requirements.

What We Do
We provide solutions supporting all stages in the data lifecycle, from experimental design, data and metadata capture, through primary and later-stage analyses, to data management, visualisation, sharing and publication:
1. Large-scale genomics & next-generation sequencing analyses
2. Mobile application development
3. Tools for multiplatform data and metadata management
4. Bespoke clinical and biological databases, tissue-banking
5. Software and script development, data visualisation
6. Web applications and sites for project-based data-sharing and project outcomes; also data management plans
7. Full grant-based collaboration across disciplines
8. New ways of working, e.g. cloud, workflows
9. Teaching, workshops and one-to-one tutorials

Examples include: genome-scale analyses (assembly, automated annotation, management of collaborative third-party annotation, visualisation, submission to public repositories); mobile applications for customisable geo-tagged data capture in the field with automated remote database storage; data management for tissue-banking and clinical databases, including sample and experimental metadata (e.g. MRIdb sites, the Chernobyl Tissue Bank and the IC Tissue Bank); LabBook, a format-agnostic lab data capture, sharing and backup tool (https://labbook.cc); new ways of working, such as using GenomeThreader in the MapReduce framework; and software and tool development for analysis, quality assurance and visualisation, including confocal image analysis and feature detection, dynamic visualisation of GEB genome features, 16S microbiome OTUs and Circos plots of complex position-specific genomic features. One figure shows part of a whole-genome variant calling pipeline (GATK), illustrating data volume per sample and computational profile at different stages; one dataset may comprise 1 to 100s of samples.

Find us: Centre for Integrative Systems Biology & Bioinformatics, Department of Life Sciences, Faculty of Natural Sciences, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: s.butcher@imperial.ac.uk bsshelp@imperial.ac.uk www.imperial.ac.uk/bioinfsupport
Advanced Statistical and Computational Methods for Emerging Challenges in Astronomy and Solar Physics
Professor David A. van Dyk, Statistics Section (Mathematics), CHASC International Center for Astrostatistics

Big Data Challenges in Astrostatistics
In recent years, technological advances have dramatically increased the quality and quantity of data available to astronomers. Newly launched or soon-to-be-launched space-based telescopes are tailored to the data-collection challenges associated with specific scientific goals. These instruments provide massive new surveys resulting in new catalogs containing terabytes of data, high-resolution spectroscopy and imaging across the electromagnetic spectrum, and incredibly detailed movies of dynamic and explosive processes in the solar atmosphere. These new data streams are helping scientists make impressive strides in our understanding of the physical universe, but at the same time are generating massive data-analytic and data-mining challenges for the scientists who study them. The complexity of the instruments, the complexity of the astronomical sources, and the complexity of the scientific questions lead to many subtle inference problems that require sophisticated statistical tools. For example, data are typically subject to non-uniform stochastic censoring, heteroscedastic measurement errors, and background contamination. Scientists wish to draw conclusions as to the physical environment and structure of the source, the processes and laws which govern the birth and death of planets, stars and galaxies, and ultimately the structure and evolution of the universe. Sophisticated astrophysics-based computer models are used along with complex parameterized and/or flexible multi-scale models to predict the data observed from astronomical sources and populations of sources. The CHASC International Center for Astrostatistics tackles outstanding statistical problems generated in astro- and solar physics by establishing frameworks for the analysis of complex data using state-of-the-art statistical, astronomical and computer models. In doing so, the researchers in the Center not only develop new methods for astronomy, but also use these problems as springboards for the development of new general methods, especially in signal processing, multilevel modelling, computer modelling, and computational statistics. Here we outline a number of our current research activities.

The Statistical Analysis of Stellar Evolution
The physical processes that govern the evolution of sun-like stars, first into red giants and then into white dwarf stars, can be described with mathematical models and explored using sophisticated computer models. These models can predict observed stellar brightness (magnitude) as a function of parameters of scientific interest, such as stellar age, mass and metallicity. We embed these computer models into multilevel statistical models (see diagram) that are fitted using Bayesian analysis. This requires sophisticated computing, corrects for data contamination by field stars, accounts for complications caused by unresolved binary-star systems, and allows us to compare competing physics-based computer models for stellar evolution. Parameters of scientific interest can exhibit complex non-linear correlations (see figure below) that cannot be uncovered or summarized using standard methods. Principled statistical models and adaptive computational techniques are specially designed to fully explore such relationships.
Embedding the Big Bang Cosmological Model into a Bayesian Hierarchical Model
The 2011 Nobel Prize in Physics was awarded for the discovery that the expansion of the Universe is accelerating. We have developed a Bayesian model that relates the difference between the apparent and intrinsic brightnesses of objects to their distance, which in turn depends on parameters that describe this expansion. Type Ia supernovae, for example, occur only in a particular physical scenario; this allows us to estimate their intrinsic brightness and thus study the expansion history of the Universe. Sophisticated Markov chain Monte Carlo methods are used for model fitting, and a secondary Bayesian analysis is conducted for residual analysis and model checking. (Image credit: http://hyperphysics.phy-astr.gsu.edu/hbase/astro/snovcn.html)

Identifying Unspecified Structure in Low-Count X-ray Images
Image restoration, including deconvolution techniques, offers a powerful tool to improve resolution in images and to extract information on the multiscale structure stored in astronomical observations. Using a Bayesian model-based framework allows us both to quantify the uncertainty in the reconstructed images and to conduct formal statistical tests for unexpected structure in the image. NGC 6240, for example, is a nearby ultraluminous infrared galaxy that is the remnant of a merger between two smaller galaxies. The restored (EMC2) X-ray image of NGC 6240 shows a faint extended loop of hot gas (figure: original and EMC2 images, with a possible extended loop of hot gas marked). We are developing computationally efficient Monte Carlo techniques for quantifying the evidence for such structure.

Classification, Tracking and Prediction of Solar Features
In order to take full advantage of the high-resolution and high-cadence solar images that are now available, we must develop methods to automatically process and analyze large batches of such images. This involves reducing complex images to simple representations, such as binary sketches or numerical summaries, that capture the embedded scientific information. The morphology of sunspot groups, for example, is predictive not only of their future evolution but also of explosive events associated with sunspots, such as solar flares and coronal mass ejections. Using techniques involving mathematical morphology, we demonstrate how to reduce solar images into simple sketch representations and numerical summaries that can be used as features for automated classification and tracking (figure: panels (a)-(l) illustrating the processing steps).

Identifying Solar Thermal Features Using H-Means Image Segmentation
Properly segmenting multi-band images of the Sun by their thermal properties helps to determine the thermal structure of the solar corona. Off-the-shelf segmentation algorithms, however, are typically inappropriate because temperature information is captured by the relative intensities in different pass-bands, while the absolute levels are not relevant. Input features are therefore the pixel-wise proportions of photons observed in each band. To segment solar images based on these proportions, we use a modification of k-means clustering that we call the H-means algorithm, because it uses the Hellinger distance to compare probability vectors. H-means has a closed-form expression for cluster centroids, so computation is as fast as k-means. Application of our method reveals never-before-seen structure in the solar corona; see the large S-shaped feature in the right-hand panel of the figure.
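Since H-means is described above only in words, here is a minimal sketch of one reading of the idea: k-means-style clustering of per-pixel photon-proportion vectors under the Hellinger distance, whose centroid update has a closed form (normalise the mean of the square-root vectors, then square). This is illustrative code on synthetic data, not the CHASC implementation.

```python
# Minimal sketch of H-means: k-means with the Hellinger distance on
# probability vectors. Synthetic data; not the CHASC production code.
import numpy as np

def h_means(p, k, n_iter=50, seed=0):
    """Cluster rows of p (each a probability vector) into k groups."""
    rng = np.random.default_rng(seed)
    sqrt_p = np.sqrt(p)                                  # work on square roots
    centroids = sqrt_p[rng.choice(len(p), k, replace=False)]
    for _ in range(n_iter):
        # Squared Hellinger distance is proportional to squared Euclidean
        # distance between the square-root vectors.
        d = ((sqrt_p[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):                               # closed-form update
            members = sqrt_p[labels == j]
            if len(members):
                m = members.mean(axis=0)
                centroids[j] = m / np.linalg.norm(m)     # renormalise sqrt vector
    return labels, centroids ** 2                        # centroids as proportions

# Toy example: two-band photon proportions from two thermal "classes".
rng = np.random.default_rng(1)
a = rng.dirichlet([8, 2], size=200)
b = rng.dirichlet([2, 8], size=200)
labels, centres = h_means(np.vstack([a, b]), k=2)
print(centres)
```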
Low-Count Spectral Analysis in High-Energy Astrophysics
Spectra describe the energy distribution of photons emitted from an astronomical source and carry information as to the composition and physical processes at work in the source. The space-based observatories that study high-energy (X-ray and γ-ray) spectra are subject to stochastic data distortion processes (blurring, heterogeneous censoring, and background contamination). We build sophisticated multi-level statistical models that account both for the physical processes in the sources themselves and for the stochastic data distortion (see diagram). Uncertainty in the calibration of the instruments is a particular challenge and is typically ignored in practice. The figure illustrates the fitting of two spectral parameters (θ1, θ2) ignoring calibration uncertainty with the default effective area (left panel), with an ad hoc "pragmatic Bayes" correction (middle panel), and with a principled fully Bayesian correction (right panel); fitted values with errors are shown against the true values. The statistically principled method captures the true parameter values without overstating the uncertainty in the fit.

Acknowledgements
We gratefully acknowledge funding for this project partially provided by the Royal Society (Wolfson Merit Award), the US National Science Foundation (DMS-12-08791), the European Commission (Marie Curie Career Integration Grant), and the UK Science and Technology Facilities Council.
Collaborators: Elizabeth Jeffery (James Madison), William Jefferys (Texas and Vermont), Xiyun Jiao (Imperial), Vinay Kashyap (Harvard-Smithsonian Center for Astrophysics), Thomas Lee (UC Davis), Xiao-Li Meng (Harvard), Erin O'Malley (Dartmouth), Aneta Siemiginowska (Harvard-Smithsonian Center for Astrophysics), Shijing Si (Imperial), Nathan Stein (U Penn), David Stenning (UC Irvine), Roberto Trotta (Imperial), Ted von Hippel (Embry-Riddle), Jin Xu (UC Irvine), and Yaming Yu (UC Irvine).
Big data in cosmology
Andrew H. Jaffe

The Cosmic Microwave Background
The Cosmic Microwave Background (CMB) consists of light that last interacted with other matter in the Universe about 400,000 years after the big bang, nearly 14 billion years ago. At that time, the Universe was much hotter and denser than it is today. It cooled over time, transitioning from a time when the atoms were mostly ionized (separate protons and electrons) to being mostly neutral (electrons orbiting around protons in the nucleus). The ionized plasma was opaque to light, and the neutral gas was transparent. Hence, as we look further away, and further back in time, we eventually see what appears to us as the surface of an opaque cloud: this is the last scattering surface, when the CMB was formed. The Universe has expanded by a factor of more than 1,000 in length since then, and its photons are now in the microwave band. By examining the surface of this cloud in the light of the laws of physics, we can reconstruct the contents and history of the Universe.

The Planck Satellite
To observe the CMB, the European Space Agency launched the Planck satellite in 2009. Planck observed the sky for about two and a half years, orbiting the Sun about a million miles further out than us, in order to keep the bright Sun, Moon and Earth all at its back and stare out into the blackness of space.

Precision cosmology as data compression
Planck produced nearly a quadrillion individual data points, all transmitted back to Earth. From those noisy and correlated samples, we created maps of 50 million pixels over nine frequencies. Each of these contains a mixture of the cosmological signal and astrophysical foregrounds accumulated over the 14-billion-year flight of the photons. More processing is necessary to separate the components, using our understanding of the different physical and statistical processes. This enables us to produce a single map of the primordial signal and determine its power spectrum, which contains almost all of the cosmological information in the map, compressed down to just 2,500 numbers. Even more amazing than the 2,500 numbers on this graph is the smooth curve, which needs just six numbers (and the big bang theory!) to describe it: data compression by a factor of 100 trillion!

Acknowledgements
Data and figures courtesy of the Planck Collaboration and the European Space Agency. More information and acknowledgements at http://sci.esa.int/planck
Imperial Centre for Inference in Cosmology, Department of Physics, Blackett Laboratory, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: a.jaffe@imperial.ac.uk astro.imperial.ac.uk/~jaffe
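The map-to-power-spectrum compression step described above can be illustrated in a few lines with the healpy package: the sketch below generates a toy full-sky map from an assumed spectrum and recovers its angular power spectrum. It is purely illustrative and is not the Planck pipeline, which must also handle noise, masking, beam effects and foreground separation.

```python
# Illustrative only: compress a full-sky map to its angular power spectrum.
# Requires healpy (pip install healpy). Toy input spectrum, not Planck data.
import numpy as np
import healpy as hp

nside = 256                       # map resolution: 12 * nside**2 pixels
lmax = 3 * nside - 1

# Assume a simple power-law input spectrum purely for demonstration.
ell = np.arange(lmax + 1)
cl_in = np.zeros(lmax + 1)
cl_in[2:] = 1.0 / ell[2:] ** 2

cmb_map = hp.synfast(cl_in, nside, lmax=lmax)   # ~786k pixels
cl_out = hp.anafast(cmb_map, lmax=lmax)         # ~768 numbers: the compression

print("pixels:", cmb_map.size, "-> spectrum coefficients:", cl_out.size)
```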
Measuring the World's Cardiometabolic Risk
Majid Ezzati, Global Burden of Metabolic Risk Factors of Chronic Diseases Group

Goal
To monitor major cardiometabolic risks, nationally and globally, and to provide evidence for priority setting and accountability.

Challenge 1: Data availability and access. There is no centralised repository for the large number of data sources. Response: collaboration with WHO to access national sources, and a global network of collaborators.
Challenge 2: Data comparability. Different metrics and measurement methods are used. Response: incorporate clinical knowledge of risk factor measurement into the analytical models.
Challenge 3: Data synthesis. Complex relationships and non-linear trends. Response: sophisticated algorithms, including Bayesian hierarchical models, with high-performance and parallel computing.
Challenge 4: Visualisation and communication. A rich set of outputs and a diverse group of users. Response: develop dynamic and interactive visualisation tools.

School of Public Health, Imperial College London, Medical Faculty Building, Norfolk Place, London. E-mail: gbdmetabolicrisks@imperial.ac.uk www.imperial.ac.uk/medicine/globalmetabolics/
Tissue identification using metabolomic and lipidomic approaches
Zoltan Takats

iKnife
Rapid evaporative ionisation mass spectrometry (REIMS) is an emerging technique that allows near real-time characterisation of human tissue in vivo by analysis of the aerosol released during electrosurgical dissection. (Figure: scheme of the iKnife method - electrosurgical unit, custom-designed handpiece, Venturi air jet pump and the atmospheric interface of the mass spectrometer.) The tissue characterisation workflow includes the construction of a tissue-specific spectral database and the use of a multivariate classification algorithm and a spectral identification algorithm. Our aim is to separate healthy and cancerous tissue (and different pathological alterations) and to characterise the tumour based on the REIMS fingerprint of each tissue type. The iKnife data analysis workflow has three stages: I, database building - spectra are acquired with the REIMS method and histologically validated to give a histologically validated database covering classes such as healthy tissue, adenocarcinoma, squamous cell carcinoma and metastasis; II, model building - multivariate statistical classifiers are created and saved into the database; III, testing - a model is selected for surgery and evaluated against manually uploaded test spectra. Our database contains spectra obtained from more than 700 patients, altogether 4,959 cancerous and 6,341 healthy entries from different organs. Our findings suggest that REIMS-based tissue characterisation using complete lipid profiles of tissue spectra is feasible for real-time, in vivo classification of human tissue.

Tissue Imaging
Mass Spectrometry Imaging (MSI) is a rapidly advancing bio-analytical approach that enables the simultaneous measurement of thousands of molecular species from intact tissue sections in a spatially resolved manner. At present, the gold standard for cancer diagnosis remains histological interpretation of tissue biopsies. Here, we present a translational bioinformatics solution that covers a complete computational workflow for histology-driven MSI. Using this bioinformatics solution, we have investigated region-specific lipid biochemistry in colorectal cancer, breast cancer and oesophago-gastric cancer tissue sections by Desorption Electrospray Ionisation (DESI)-MSI. (Figure: chemical reconstruction of tissue regions of interest using multivariate molecular ion patterns - optical H&E-stained image, aligned DESI-MSI RGB image, and reconstruction of three distinct histological regions: fibrous lymphoid tissue, normal lymphocytes and tumour metastasis from gastric adenocarcinoma.)
Unique lipid patterns were observed using this approach according to tissue type, and a tissue recognition system using multivariate molecular ion patterns allowed highly accurate (>98%) identification of pixels according to morphology (e.g. cancer, healthy mucosa, smooth muscle and microvasculature). This solution for histology-driven MSI has the potential to provide fully automated, next-generation molecular cancer diagnostics within hours of a patient attending their doctor.

Identification of Microorganisms
In the same way as for tissue identification, REIMS can be applied to the characterisation and identification of microorganisms and other unicellular organisms, such as cell lines. A large-scale database of bacteria and fungi has been created and will form the basis for a microbial identification algorithm to be used on pure cultures and directly on clinical samples. The database contains 4,053 spectral profiles of 161 different bacterial species belonging to 75 genera of all major bacterial phyla, recorded on four different instruments. Spectral features can be attributed to cell membrane lipids and to intra- and extracellular metabolites such as quorum-sensing molecules. (Figure: data analysis workflow for microorganisms.) REIMS-based profiles for microorganisms proved highly specific to the microbial species and subspecies analysed, require no sample preparation, and give identification results within seconds. This makes REIMS a highly competitive technique as a routine microbial identification tool.
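The multivariate classification step at the heart of both workflows is not spelt out on the poster; a common generic pattern for spectral data is dimensionality reduction followed by a discriminant classifier. The sketch below shows that pattern with scikit-learn on placeholder data; it is an illustration of the approach, not the actual iKnife classifier.

```python
# Generic sketch of multivariate spectral classification (PCA + LDA),
# in the spirit of the REIMS workflow. Data and labels are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# X: one row per spectrum, binned to a common m/z grid; y: tissue class.
rng = np.random.default_rng(0)
X = rng.random((300, 1000))                    # placeholder spectra
y = rng.integers(0, 3, size=300)               # e.g. healthy / adeno / SCC

classifier = make_pipeline(
    Normalizer(norm="l1"),                     # total-ion-count normalisation
    PCA(n_components=25),                      # compress correlated m/z bins
    LinearDiscriminantAnalysis(),              # supervised class separation
)

scores = cross_val_score(classifier, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```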
Software Environments for the Flexible and Efficient Analysis of Big Data
John Darlington, Jeremy Cohen, Peter Austing

Big data: Opportunities and Issues
Big data is characterised by the availability of large data sets, sophisticated analytical methods and high-performance computational infrastructure, spanning data sources, software methods and computational infrastructure. Big data requires the resolution of a range of issues. High-performance computing is, itself, a complex technology, with challenges in method development, use and efficient execution on a variety of platforms. Big data in addition presents problems because the sheer size of the data often militates against transmission over networks, and the analytical methods themselves are complex and computationally demanding.

Application-specific User Interfaces
A generic environment for managing large-scale data and processing needs to handle domain-specific requirements, data formats and metadata. Metadata may be used to generate web-based user interfaces to handle input and output data. The example (above, left) shows Nekkloud, a user interface for interacting with Nektar++ (http://www.nektar.info), a high-order finite element framework. Application metadata such as input parameters drive the required content of the user interface. Application processes can be represented using a decision tree (above, right) to aid automated operation and decision making by frameworks that simplify data processing and management; the tree runs from user-level choices (accuracy, speed/cost) down to method-level choices (solver type, polynomial order).

Libhpc II
The EPSRC libhpc II project seeks to develop generic methods that will enable high-performance computing capacity to be used easily by the end-user, while enabling method developers and e-infrastructure operators to develop and maintain a repertoire of high-performance methods. Libhpc focuses on the development of a set of software entities that handle both control (co-ordination forms) and data processing (components). Critically, both co-ordination forms and components can have alternative realisations and implementations that are suitable for different circumstances. An end-user constructs an application by composing these constructs and, at deployment time, an intelligent mapper selects appropriate instances and machines. For example, an abstract linear-solver component (matrix and vector in, vector out) may be realised as a sequential or parallel LU or Jacobi implementation, depending on the target hardware.

libhpc Architecture
The EPSRC-funded libhpc and libhpc 2 projects are developing an architecture to support user-friendly deployment of complex scientific applications to heterogeneous resources. High-level application definitions, based on data and control orchestration, are mapped to appropriate hardware and software resources that are used to run a user's job. (Figure: the architecture links the end-user, application-specific interfaces and a user interface to a co-ordination form (CF) repository, a software component and metadata repository, a hardware metadata repository, the libhpc mapper and the libhpc deployment services, which together select concrete implementations - for example sequential LU, parallel LU (OpenMP or MPI), sequential Jacobi or parallel Jacobi (UPC) - and target machines.) Given the generic libhpc architecture and repositories, easy-to-use menu-driven interfaces can be derived by partially instantiating particular analysis pipelines and method and machine mappings.
This technology would both enable domain-specialist end-users to access Big Data processing capabilities and enable method developers, e-infrastructure providers and system administrators to maintain and sustain repositories of Big Data analysis methods and metadata. Co-ordination forms such as FARM, PIPE, MAP, PAR, REDUCE, FILTER and FIND can be deployed across a range of targets: public cloud (e.g. Amazon EC2), private cloud (e.g. OpenStack), and standalone clusters or local batch resources (e.g. PBS-based). A sketch of how such co-ordination forms might be composed is given after this section.

Acknowledgements
We would like to thank the EPSRC for funding the libhpc 1 (EP/I030239/1) and libhpc 2 (EP/K038788/1) projects.
London e-Science Centre, Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: Prof John Darlington, j.darlington@imperial.ac.uk http://www.imperial.ac.uk/lesc
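To make the co-ordination-forms idea concrete, here is a small, hypothetical sketch in Python: PIPE and PAR are modelled as higher-order functions over interchangeable component implementations, with a toy "mapper" choosing an implementation from a repository according to a hardware tag. The names and structure are invented for illustration and are not the libhpc API.

```python
# Hypothetical illustration of co-ordination forms (PIPE, PAR) composed over
# interchangeable component implementations. Not the actual libhpc API.
from typing import Callable, Dict, List

Component = Callable[[float], float]

# A toy component repository: abstract components, several realisations each.
REPOSITORY: Dict[str, Dict[str, Component]] = {
    "solve": {
        "sequential_lu": lambda x: x + 1.0,      # stand-in implementations
        "parallel_jacobi": lambda x: x + 1.0,
    },
    "postprocess": {"default": lambda x: x * 2.0},
}

def mapper(component: str, hardware: str) -> Component:
    """Pick an implementation suited to the target hardware (toy heuristic)."""
    impls = REPOSITORY[component]
    if hardware == "cluster" and "parallel_jacobi" in impls:
        return impls["parallel_jacobi"]
    return next(iter(impls.values()))

def pipe(stages: List[Component]) -> Component:
    """PIPE: feed the output of each stage into the next."""
    def run(x: float) -> float:
        for stage in stages:
            x = stage(x)
        return x
    return run

def par(tasks: List[Component]) -> Callable[[float], List[float]]:
    """PAR: apply independent tasks to the same input (conceptually in parallel)."""
    return lambda x: [task(x) for task in tasks]

# Compose an application abstractly; bind implementations at "deployment" time.
app = pipe([mapper("solve", "cluster"), mapper("postprocess", "cluster")])
print(app(1.0))
```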
StrainWise: Wireless Strain Sensor Nodes for Aircraft Environments
Tzern Toh, Steve Wright, Michail Kiziroglou, Paul Mitcheson, Eric Yeatman

Motivation
Aircraft electrical systems are becoming ever more complex; in the case of the Airbus A380, approximately 500 km of internal wiring is used. Wireless solutions are therefore desirable within the aircraft industry. These wireless systems are conventionally powered by finite energy sources such as batteries, which need periodic replacement with an associated cost. A better solution would be to create a wireless sensor network (WSN) that is powered by a localised energy harvesting system. Using energy harvesting technology will increase the autonomy and ubiquity of wireless sensor nodes.

Objectives
The StrainWise project aims to deliver a fully autonomous WSN dedicated to the measurement of strain in the vertical tail plane and in the landing gear. The WSN will comprise several sensor nodes (SN), a pair of wireless data concentrators (WDC and WDC backup) for each cell, and a WSN server. Each SN will consist of a thermoelectric energy harvester, power management electronics, an energy storage element and a microcontroller for sensor data management and wireless data transmission. The energy required per sensor node is 28.5 J per flight of 60-90 minutes.

Heat Storage Thermoelectric Energy Harvesting
A phase change material, stored in a heat storage unit, is used to convert temperature variations in time into spatial temperature differences across the thermoelectric generators (TEGs). Power management electronics perform voltage rectification, maximum power transfer, energy storage and output voltage regulation. (Figure: harvester circuit comprising the TEGs, polarity detection and active rectification (TSM9117, TSM9119, MAX1720, BSP149, SUD42N03), a BQ25504 boost converter and battery charger with enable signal, a TLV70033 LDO regulator providing 3.3 V, and a LIR2450 Li-ion battery, 4 V, 120 mAh; flight-test plots of temperature, TEG voltage and cumulative energy against elapsed time show 126 J at the TEG output, 104 J at the rectifier output and 81 J at the battery input.) Energy delivered to the secondary Li-ion battery is roughly 3 times that required by the SN.

Wireless Data Communication
The wireless communication system must support a data rate of up to 1 kb/s per node (2 bytes of samples at a 500 Hz acquisition rate) and a maximum of 20 nodes per cell. The WSN server will synchronise the transmission of large amounts of data to and from the sensor nodes. The nodes must be synchronised with the server to allow time-stamping of each data sample with less than 1 ms error.

Conclusions
The prototype has been flight tested by Airbus. This is an example of a successful collaboration and demonstration of pervasive sensing and energy harvesting technologies.

Acknowledgements
This work was supported by the Clean Sky Joint Technology Initiative: JTI-CS-2010-1-SFWA-01-016. Collaborators: CSEM (Switzerland), SERMA (France), Airbus (Germany).
Electrical and Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: tzern.toh@imperial.ac.uk www3.imperial.ac.uk/people/tzern.toh02
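The budget figures above can be sanity-checked with a few lines of arithmetic. The sketch below simply recomputes the implied average power per node, the margin provided by the harvested energy, and the raw per-node sample rate; the 81 J battery-input figure is read from the flight-test plot and should be treated as approximate.

```python
# Back-of-envelope check of the StrainWise energy and data budget,
# using only the figures quoted on the poster (81 J read from the plot).
flight_minutes = (60, 90)
energy_required_J = 28.5
energy_to_battery_J = 81.0

for minutes in flight_minutes:
    avg_power_mW = energy_required_J / (minutes * 60) * 1e3
    print(f"{minutes} min flight: average node power ~{avg_power_mW:.1f} mW")

print(f"harvested/required margin ~{energy_to_battery_J / energy_required_J:.1f}x")

# Raw data rate per node: 2-byte samples at a 500 Hz acquisition rate.
bytes_per_s = 2 * 500
print(f"raw sample rate per node: {bytes_per_s} bytes/s")
```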
Small Area Health Statistics Unit
Peter Hambly, Rebecca Ghosh, Anna Hansell, Paul Elliott
A Big Data approach to the complexity of human walking: discerning healthy and pathological gait
Margarita Kotti, Lynsey D. Duffell, Aldo A. Faisal, Alison H. McGregor

Overview
Osteoarthritis (OA) is the most common form of joint disease and is currently the 2nd leading cause of disability. It affects over 250 million people worldwide and approximately 8.5 million people in the UK. OA rates are rising because of a rapidly growing ageing population, and the condition has a high socio-economic cost. We aim to detect movement patterns that are characteristic of knee OA and to demonstrate the complexity of the underlying structure of movement, utilising machine learning: Probabilistic Principal Component Analysis (PPCA), a Bayes classifier and random forests.

Data Capture
Big Data: volume, variety, veracity, value. 180 subjects, 47 of whom reported knee OA. Several activities: walking, stair ascent/descent, sit-to-stand/stand-to-sit, squat, and timed up-and-go. Data types: kinetics (forces, moments), kinematics (angles), ground reaction forces, spatiotemporal measures (velocity, stride length, etc.), anthropometrics (height, age, etc.) and clinical assessments (such as answers to questionnaires), in all three planes: vertical, medio-lateral and anterior-posterior. Subjects walked at normal speed until 3 clean foot strikes had been recorded. Ground reaction forces were captured with two force plates (Kistler) at 1000 Hz.

Data Analysis
PPCA and a Bayes classifier achieved 82.62% accuracy. Random forests showed that 20% of subjects who reported no knee OA present gait patterns similar to those who suffer from knee OA.

Discussion
Machine learning is a powerful tool for developing a novel, sensitive and objective screening tool and for designing subject-tailored interventions. Pathological walking produces characteristic variability. Future work: ensembles of classifiers and novel sensors.

Acknowledgements
M. Kotti, L. Duffell and A. McGregor acknowledge support from the Medical Engineering Solutions in Osteoarthritis Centre of Excellence funded by the Wellcome Trust and the EPSRC. A. Faisal acknowledges support from his Human Frontiers in Science Program grant (grant number RGP0022/2012).
Dept. of Surgery & Cancer, Dept. of Bioengineering, Imperial College London. Email: {m.kotti,l.duffell,a.faisal,a.mcgregor}@imperial.ac.uk http://www1.imperial.ac.uk/msklab/research/boneandjointdisease/computational_detection http://www.faisallab.com
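A minimal sketch of the classification set-up described above, dimensionality reduction followed by a Bayes classifier with a random forest for comparison, is shown below using scikit-learn. The feature matrix and labels are placeholders, and plain PCA plus Gaussian naive Bayes stand in for the probabilistic PCA and Bayes classifier used in the study.

```python
# Hedged sketch: reduced-dimension Bayes classifier and a random forest on
# gait features. Placeholder data; PCA + GaussianNB stand in for PPCA + Bayes.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((180, 60))                        # e.g. kinetic/kinematic features
y = (rng.random(180) < 47 / 180).astype(int)     # 1 = reported knee OA

bayes = make_pipeline(StandardScaler(), PCA(n_components=10), GaussianNB())
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("Bayes classifier accuracy:", cross_val_score(bayes, X, y, cv=5).mean())
print("Random forest accuracy:  ", cross_val_score(forest, X, y, cv=5).mean())
```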
OPTIMISE: A Vision for Personalised Medicine in Multiple Sclerosis
Joel Raffel, Yi-Ke Guo, Paul Matthews, OPTIMISE Consortium

Multiple sclerosis (MS)
MS is the commonest cause of neurological disability in young adults. The course of the disease is highly heterogeneous and difficult to predict,1 ranging from relapses followed by periods of remission, through relapses with incomplete recovery, to gradual progressive accumulation of disability (figure: schematic of disability against time in years). In recent years, a choice of medications has become available which decrease relapse rate and slow the progression of disability. Efficacy varies between drugs, and response to treatment varies from patient to patient.2 At present, there is poor data capture of this heterogeneous response to treatment between individuals. This is a key priority for the future of MS research.

Big questions from patients (that doctors still cannot answer)
1. What is my individual prognosis?
2. Which treatment option is right for me?
3. I've been taking xxx medication for 6 months: is it working? Would I benefit from switching to an alternative?
We hypothesise that the answers to these questions, and others, are obtainable if we integrate multivariate data acquired in the real-world context of usual NHS care.

OPTIMISE: personalised medicine in MS
OPTIMISE (Optimisation of Prognosis and Treatment In MultIple SclErosis) is a UK national collaborative effort between 15 NHS MS specialist centres, corporate partners, and people with MS. The objective is to prospectively acquire computerised data in a large observational cohort study, integrated as an adjunct to usual care. The ultimate aim is to develop and evaluate multivariate predictive models, to give the right treatment to the right patient at the right time.

A big data solution for real-world data capture
From patients: home tests of cognition, dexterity, etc.; periodic questionnaires; patient-reported outcome measures; personal health sensors (fitbit, GPS, gait sensors).
From healthcare professionals: clinical events; periodic clinical examination; electronic healthcare records (data mining); neuroimaging, genomics, transcriptomics and biochemistry.
The platform being developed will allow: capture of periodically updated clinical data from patients and healthcare professionals; integration of large volumes of complex bioinformatic data; real-world outcome measures to capture the patient experience (GPS, gait sensors and patient-reported outcomes); data mining of electronic healthcare records using natural language processing; and analysis of codified data using machine learning.

Creating a tool that meets user needs
Data entry must be quick, easy and pain-free. The interface must engage and empower the person with MS (visual representations of individual data; promoting understanding of MS; the ability to interact with others) and must add value to the clinical consultation. A smart interface is needed to limit missing or unreliable data, and the tool must be accessible to those with significant disability. (Partner: Imperial Data Science Institute.)

References
1. Disanto G, et al. Heterogeneity in multiple sclerosis: scratching the surface of a complex disease. Autoimmune Dis. 2010:932351.
2. Derfuss T. Personalized medicine in multiple sclerosis: hope or reality? BMC Med. 2012;10:116.

Clinical Neurosciences, Faculty of Medicine, Department of Medicine, Imperial College London. Email: j.raffel@imperial.ac.uk; p.matthews@imperial.ac.uk
Data: The Visible & the Hidden
Anil A. Bharath, Danilo Mandic

Big Data?
Big Data is often associated with transactional logs, location coordinates, text messages and e-mails. Making sense of such data opens avenues for commerce and other useful analysis. Social media postings are considered Big Data: Twitter carries around 500 million tweets per day,1 an average of about 6,000 tweets per second (TPS), or around 12 Mbit/s. Other sources of Big Data also exist: audio streams, video streams and images. For example, a pair of human retinae produces roughly 15 Mbit/s2 - approximately the same rate of data as Twitter (see the comparison of bandwidth, in bits/s on a log scale, of different types of sensory data; vision, a source of structured data, occupies quite a bit of bandwidth). One irony is that visual data, interpretable to humans, is less interpretable to machines: we must map it into an intermediate form in order to be able to index, retrieve and analyse items appropriately.

Tools for Structured Big Data
By structured, we refer to temporally ordered or spatially organised data, relevant to time series, image data or video sequences. A recurring theme in analysing such data is that we must transform it in some way: for example, into the discrete Fourier domain, or the wavelet domain. Some types of component analysis (e.g. PCA, ICA) allow us to learn data dictionaries, or atoms - a sort of data vocabulary. Examples of atoms used to analyse visual data are often hidden inside a piece of code, but implicit in its function. To encode data in this vocabulary, algorithms often expand the data sizes (temporarily) even more!

Tensor3 Tools
There exist well-developed descriptions of algorithms using vector-matrix formulations for both 1D and 2D structured sources of data. What about 3-, 4- or X-dimensional data? One solution is to use tensor subspace transformations to model common latent variables across both dependent and independent data. Multilinear singular value decomposition is a tensor-based decomposition for multichannel and multi-dimensional data; a minimal sketch is given at the end of this poster. Multi-channel time series can be analysed using tensor decompositions, allowing, for example, arm motion to be predicted from multi-channel electrocorticograms. Interest has also been growing in tensor-based descriptions from the point of view of reproducibly describing algorithms, and also of making them efficient. For example, the expansion in data required for automated visual analysis, described in the left-hand panel, can be tamed by appropriate tools. Using tensors allows petabytes of information to be efficiently stored and computed.

Notes
1. Twitter estimates: August 2013 average tweets-per-second figures (slightly adjusted) from https://blog.twitter.com/2013/newtweets-per-second-record-and-how. Assumed 250 (not 144) bytes/tweet, allowing for metadata.
2. Estimates of human vision bandwidth: estimates from Cavia porcellus, reported by K. Koch et al. in Current Biology 16, pp. 1428-1434, July 25, 2006, scaled up based on retinal cell population numbers. Maximum of 2 retinae per person.
3. The term "tensor" is used in a different context to its classical use in physics. See T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review 51.3 (2009): pp. 455-500.

See Also
Q. Zhao, et al., "Higher Order Partial Least Squares (HOPLS): A Generalized Multilinear Regression Method," IEEE Transactions on Pattern Analysis and Machine Intelligence 35.7 (2013): 1660-1673.
A. Cichocki, D. P. Mandic, et al., "Tensor decompositions for signal processing applications," IEEE Signal Processing Magazine, in press, 2014.
A. A. Bharath and J. Ng, "Keypoint Descriptor Generation by Complex Wavelet Analysis," US Patent App. US 13/127,901 (accepted, award pending).
A. A. Bharath and M. Petrou, Next-Generation Artificial Vision Systems: Reverse-Engineering the Human Visual System, Artech House, 2008.

Departments of Bioengineering & Electrical & Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. Email: {a.bharath d.mandic}@imperial.ac.uk http://www.bicv.org http://www.commsp.ee.ic.ac.uk/~mandic