Big data for better science. Data Science Institute



Sense and Sensibility
Julie A. McCann

Adaptive Embedded Systems
The aim of the Adaptive Emergent Systems Engineering (AESE) group in the Department of Computing is to examine the relationships between embedded systems and their environments (physical and human), to better understand their behaviours and impacts, and to exploit this knowledge to enhance the performance of such systems.

ICRI Cities London Living Lab (L3)
The Hyde Park L3 platform will advance the use of sensing and social platforms deployed in the wild to support research into ecology, air quality, water quality, noise and light pollution, public engagement, and the communication and manageability of sensed data. This will enable, for example, the Royal Parks authority to visualise real-time and near-real-time data through a simple dashboard alongside deeper analysis of the raw data. In the Isis Education Centre, educators can also engage school children and the general public with a better understanding of the park and its ecology, usage and history.

Crowdsourcing and Opportunistic Networking
Future smart cities will require sensing on a scale hitherto unseen. Fixed infrastructures have limitations regarding sensor maintenance, placement and connectivity. Employing the ubiquity of mobile phones is one approach to overcoming some of these problems, whereby the phone carries the data. This work is the first to exploit underlying social networks and financial incentivisation: by combining network science principles with Lyapunov optimisation techniques, we have shown that global social profit across a hybrid sensor and mobile phone network can be maximised. (A minimal sketch of the drift-plus-penalty idea follows this section.)

Smart Water Systems
Water networks are moving away from sparsely instrumented telemetry systems. The vast majority of next-generation approaches to managing such networks rely on denser sensor networking, but these still require data to be sent back to core management servers. Actuation technologies are becoming more on-line and in-line with sensor networking. This brings opportunities to make water networks smarter and, in turn, more resilient and optimal. Such a network is an example of a cyber-physical system (CPS). With sample rates of up to 120/s there is a strong need for big data analytics and adaptive cloud computing.

Acknowledgements
London Living Labs is sponsored by Intel and Future Cities Catapult. Smart Water Systems is sponsored by NEC Japan and FP7 WISDOM; photos by Ivan Stoianov. Opportunistic Sensing is sponsored by the Intel Collaborative Research Institute Sustainable Connected Cities.

Department of Computing, Huxley Building, South Kensington Campus, Imperial College London, SW7 2AZ. jamm@imperial.ac.uk wp.doc.ic.ac.uk/aese
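The following is a minimal, self-contained sketch of the Lyapunov drift-plus-penalty idea behind such incentivised crowdsensing: at each time slot a controller decides whether to pay a passing phone to carry queued sensor data, trading payment cost against queue backlog. The queue model, cost values and the weight V are illustrative assumptions, not the group's actual formulation.

```python
# Minimal drift-plus-penalty sketch for incentivised opportunistic sensing.
# All numbers (arrival rates, payment costs, V) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
V = 5.0                 # weight on payment cost relative to queue backlog
Q = 0.0                 # backlog of sensed data awaiting upload (MB)

for t in range(1000):
    arrivals = rng.poisson(2.0)            # new sensor data generated this slot (MB)
    price = rng.uniform(0.1, 1.0)          # cost of paying a passing phone to carry 5 MB
    # Drift-plus-penalty rule: upload iff backlog pressure outweighs weighted cost.
    upload = 5.0 if Q * 5.0 > V * price else 0.0
    Q = max(Q + arrivals - upload, 0.0)    # queue update

print(f"final backlog: {Q:.1f} MB")
```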

Co-design of Cyber-Physical Systems
Eric Kerrigan

Cyber-Physical Systems
Cyber-physical systems (CPS) are composed of physical systems that affect computations, and vice versa, in a closed loop. By tightly integrating computing with physical systems one can design CPS that are smarter, cheaper, more reliable, more efficient and more environmentally friendly than systems based on physical design alone. Examples include modern automobiles (the 2013 Ford Fusion generates 25 GB of data per hour), aircraft and trains, power systems, medical devices and manufacturing processes. The dramatic increase in sensors and computing power in CPS presents unique big data challenges to the engineer of today and tomorrow.

The key big data questions for CPS are: what, where, when and how accurately to measure, compute, communicate and store? My team is providing answers to these by developing control systems theory and mathematical optimization methods that automatically design the computer architecture and algorithms at the same time as the physical system. This co-design process results in a better overall system than iterative methods, where sub-systems are independently designed and optimized.

[Figure: block diagram of a cyber-physical system in closed loop with its computing system and a co-designer. Given measurements $y$ and disturbances, the computing system selects optimal inputs $u^*(y) := \arg\min_u f(u,y)$ subject to $g(u,y) = 0$ and $h(u,y) \le 0$ (up to numerical errors); the co-designer selects optimal design parameters for the physical and computing systems, $(p^*, c^*) := \arg\min_{p,c} \varphi(p,c)$ subject to $\alpha(p,c) = 0$ and $\beta(p,c) \le 0$.]

By understanding the nature and timescales of the physical dynamics one can dramatically reduce the amount of data needed to make a decision and/or increase the quality and quantity of information extracted from a given data set. Current work is concerned with model-based feedback methods that minimize the measurements and computational resources needed to estimate, in real time, information that can then be used to control and optimize the behaviour of the overall system.

Mathematical Optimization
Most CPS co-design problems can be formulated as multi-objective, constrained mathematical optimization problems. Furthermore, CPS are optimal only if the computing system is executing tasks with the goal of optimising given performance criteria. We are therefore developing methods to: model and solve the non-smooth and uncertain optimization problems that arise during the co-design process, and solve constrained, nonlinear optimization problems in real time on embedded and distributed computing systems.

Control and Dynamical Systems Theory
The main technical challenge in the co-design of CPS is to merge abstractions from physics with computer science: the study of physical systems is based on differential equations, continuous mathematics and analogue data, whereas the study of computing systems is based on logical operations, discrete mathematics and digital data. Furthermore, while a computation is being carried out, time is ticking and the system continues to evolve according to the laws of physics. A designer therefore has to trade off system performance, robustness and physical resources against the timing and accuracy of measurements, communications, computations and model fidelity. We are developing system-theoretic methods to understand and exploit this hybrid, real-time nature of CPS. Current work includes the co-design of parallel computing architectures, linear algebra and optimization algorithms to increase the efficiency of the computations.
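As a toy illustration of the co-design formulation above, the sketch below jointly chooses a physical parameter p and a computing parameter c under a constraint, using a general-purpose solver. The objective, constraint and bounds are invented for illustration and are not the group's models.

```python
# Toy co-design: jointly pick a physical parameter p (e.g. actuator size)
# and a computing parameter c (e.g. sampling rate) minimising a combined
# cost phi(p, c) subject to a performance constraint beta(p, c) <= 0.
# The functions and numbers are illustrative assumptions only.
from scipy.optimize import minimize

def phi(x):
    p, c = x
    return p**2 + 0.1 * c          # hardware cost grows with p, energy with c

def beta(x):
    p, c = x
    return 1.0 - p * c             # need enough combined "capability": p*c >= 1

res = minimize(phi, x0=[1.0, 1.0],
               constraints=[{"type": "ineq", "fun": lambda x: -beta(x)}],
               bounds=[(0.1, 10.0), (0.1, 10.0)])
print("optimal (p, c):", res.x)
```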
Acknowledgements
This research is in collaboration with George Constantinides, Jonathan Morrison, Rafael Palacios, Mike Graham and Jan Maciejowski (Univ. of Cambridge).

Department of Electrical & Electronic Engineering and Department of Aeronautics, Imperial College London, South Kensington Campus, London SW7 2AZ. e.kerrigan@imperial.ac.uk

Crystallisation of Biological Molecules for X-Ray Crystallography
Lata Govada, Sahir Khurshid, Tim Ebbels, Naomi E. Chayen

The Problem
Detailed understanding of protein structure is essential for rational design of therapeutic treatments and for a variety of industrial applications. The most powerful method for determining the structure of proteins is X-ray crystallography, which is totally reliant on the availability of high quality crystals. The crystallisation of proteins involves purified protein undergoing slow precipitation from an aqueous solution, in which the protein molecules organise themselves in a repeating lattice structure.

The Challenge
There is currently no means of predicting suitable crystallisation conditions for a new protein; finding them is like searching for a needle in a haystack. Initial attempts (referred to as screening) involve exploring a multi-dimensional parameter space using thousands of candidate conditions. The miniaturisation and automation of such screening trials has been of great benefit, but crystallisation remains the rate-limiting step of structure determination (Figure 1). Figure 3 illustrates a cross-section of the enormous chemical space explored during screening.

Figure 1. Results from structural genomics centres worldwide (TargetTrack, PSI).
Figure 2. Crystal of the Human Macrophage Migration Inhibitory Factor.
Figure 3. Plot of crystal hits for 269 macromolecules from the structural genomics community. Dark blue indicates five or more crystal hits for that cocktail, medium blue 3-4 and light blue 1-2. White areas are unsampled regions of chemical space.

The relevant parameters include the type and concentration of precipitating agent, the concentration of protein, the type and concentration of a secondary precipitating agent and/or of an additive, the pH and the temperature, amongst others. One or more of these conditions may show some promise, most often in the form of microcrystals, clusters or microcrystalline suspension. The following optimisation step consists of fine-tuning these promising conditions by changing the values of the various parameters, such as concentrations and pH, in small increments until useful crystals are obtained. This common approach fails in 80% of cases even when high-throughput methods are employed. High throughput has not yielded high output, and significant amounts of protein sample, time and resources are wasted.

A wealth of public data (PDB, BMCD) exists which is not being tapped efficiently. The ability to predict crystallisation conditions would revolutionise this field. Addressing this challenge will require two aspects of big data science. Firstly, the data generated from structural genomics projects on crystallisation conditions is huge, with millions of combinations of protein sequence and conditions attempted in high-throughput screens; storing, searching and retrieving this data efficiently will require big database tools. Secondly, discovering the patterns in sequences and other molecular properties that predict optimal crystallisation conditions will require sophisticated statistical and machine learning algorithms in order to make sense of high-dimensional but still sparsely sampled data. The desired result would be a more efficient methodology for conducting crystallisation experiments and an in silico approach to predicting crystallisability. This would save immense amounts of experimental time, protein sample and other resources, and transform the field.

Computational and Systems Medicine, Department of Surgery and Cancer, Imperial College London, South Kensington Campus, London SW7 2AZ. n.chayen@imperial.ac.uk
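A schematic sketch of the second aspect described above: learning to predict crystallisation success from sequence- and condition-derived features. The data file, feature names and labels are hypothetical placeholders, not the group's pipeline.

```python
# Schematic sketch: predict whether a (protein, cocktail) pair yields crystals.
# 'screening_results.csv' and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("screening_results.csv")    # one row per protein/condition trial
features = ["seq_length", "pI", "hydrophobicity", "precipitant_conc", "pH", "temperature"]
X, y = df[features], df["crystal_hit"]        # y = 1 if the trial produced crystals

clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
print("cross-validated AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```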

Global Fits of Dark Matter Theories
Roberto Trotta, Pat Scott, Charlotte Strege

The Dark Matter Mystery
The experimental hunt for dark matter is entering a crucial phase. Decades of astrophysical and cosmological studies have shown almost conclusively that 80% of the matter in the Universe is made of a new type of particle. One of the key questions of cosmology and particle physics today is to determine the nature and characteristics of this particle. The aim of our work is to put constraints on the physical parameters of theoretical models for dark matter (such as supersymmetry) by combining four complementary probes: cosmology, direct detection, indirect detection and colliders. This is the so-called global fits approach.

Experimental Probes of Dark Matter
Cosmology: observations of the relic radiation from the Big Bang, the cosmic microwave background, constrain the amount of dark matter in the Universe with very high precision.
Direct detection: direct detection experiments aim to detect dark matter by measuring the recoil energy of nuclei undergoing a collision with a dark matter particle. Some highly controversial claims of detection are directly contradicted by other experiments, which have not found any statistically significant signal.
Indirect detection: dark matter particles annihilating into Standard Model particles produce high-energy photons and neutrinos, which can be detected using dedicated space- and ground-based observatories.
Colliders: the Large Hadron Collider at CERN is putting strong limits on the properties of putative particles beyond the Standard Model. The recent discovery of the Higgs boson (for which the 2013 Nobel Prize in Physics was awarded) also puts strong constraints on the properties of such speculative theories.

Our work implements, for the first time, the entire spectrum of these constraints in a statistically correct way, in order to extract the maximum information possible about the nature of dark matter.

[Figure: statistical constraints from global fits on the dark matter mass and scattering cross section in a 15-dimensional theory (Strege et al., to appear).]
[Figure: map of the relic radiation from the Big Bang, used to measure the amount of dark matter in the Universe. Credit: Planck/ESA.]

Big Data Challenges
Our group has developed a world-leading Bayesian approach to the problem, allowing us to explore, in a statistically convergent way, theoretical parameter spaces previously inaccessible to detailed numerical study. Our methodology couples advanced Bayesian techniques with fast approximate likelihood evaluations. Even so, it remains computationally very challenging: each likelihood evaluation requires numerical simulation of the ATLAS detector. This involves generating a large number of simulated events, producing a numerical likelihood function based on a binned analysis and evaluating the ensuing constraint. The process is CPU- and disk-space-intensive: our current study required hundreds of terabytes of disk space and 400 CPU-years of computing power. We studied theoretical models with up to 15 free parameters; the most general models have up to 105 parameters, so novel techniques are needed to explore such complex parameter spaces.

Acknowledgements
We thank Imperial High Performance Computing services and the University of Amsterdam for providing computing resources. This project is in collaboration with G. Bertone, R. Ruiz de Austri and S. Caron.

Astrophysics Group, Blackett Laboratory, Imperial College London, Prince Consort Road, London SW7 2AZ. r.trotta@imperial.ac.uk
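As a cartoon of the global-fits idea, the sketch below combines several independent, entirely made-up log-likelihood terms into a joint posterior over a two-parameter toy model and scans it on a grid. Real analyses use high-dimensional models, detector simulation and Bayesian samplers rather than grids; nothing here reflects the group's actual likelihoods.

```python
# Cartoon global fit: combine independent likelihood terms (relic density,
# direct detection, collider) for a toy 2-parameter dark matter model.
# All "data" and model functions are invented for illustration.
import numpy as np

mass = np.linspace(10, 1000, 200)          # toy dark matter mass grid (GeV)
logsig = np.linspace(-47, -43, 200)        # toy log10 cross-section grid (cm^2)
M, S = np.meshgrid(mass, logsig)

def loglike_relic(M, S):    return -0.5 * ((np.log10(M) - 2.3) / 0.3) ** 2
def loglike_direct(M, S):   return -0.5 * np.clip(S - (-45.0), 0, None) ** 2 / 0.25
def loglike_collider(M, S): return -0.5 * np.clip(300.0 - M, 0, None) ** 2 / 100.0**2

logpost = loglike_relic(M, S) + loglike_direct(M, S) + loglike_collider(M, S)
best = np.unravel_index(np.argmax(logpost), logpost.shape)
print("toy best-fit mass, log10(sigma):", M[best], S[best])
```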

Digital City Exchange
David Birch, Yike Guo, Nilay Shah, Orestis Tsinalis, John Polak, Koen van Dam, Eric Yeatman

Context and Challenge
Cities are now home to more than half of the world's population. They face significant challenges, such as congestion, air quality, and the provision of food and electricity, but they also offer opportunities for innovation and collaboration, as well as increased efficiency enabled by their density. A smart city is a connected city: efficient use of resources through interaction and integration. This requires a better understanding of the complexity of cities and urban living.

Approach
A three-tier solution comprising an ontology-supported sensor data store, a workflow engine and a web-based interface to build chains of connected data sets and models enables the creation of services that take advantage of (real-time) data, analytics and predictive models. We have the data, but how can we make the most of city data and cope with its integration and vast scale? City infrastructures are connected and influence one another. Currently data is collected, analysed and used in the traditional silos of energy, transport, education, waste, etc., but the hypothesis of the Digital City Exchange is that better decisions can be made through data integration. We are building the infrastructure to facilitate this and will then test it with analytical and predictive models.

City Data
Data is collected by utility companies, (local) governments and service providers, but also by residents. This includes induction loops in the roads to measure traffic flows, air quality monitors, pothole reporting via smartphone, smart bins that report when they are full, social media messages, and more. Much of this data is closed, with only one party having access to it, while other data is shared (possibly paid for) or even released as open data for anyone to use. Platforms are needed to store, analyse and collaborate using this data.

Acknowledgements
Digital City Exchange is a five-year programme at Imperial College London funded by Research Councils UK's Digital Economy Programme (EPSRC Grant No. EP/I038837/1). D.Stokes@imperial.ac.uk

Astronomically Big Data
David L. Clements, Steve Warren, Daniel Mortlock, Alan Heavens

Large-scale catalogues in astrophysics are already large, but the next generation of surveys will boost that size by orders of magnitude. In particular, the Euclid mission will provide Hubble Space Telescope quality near-IR images across the entire sky, while the Large Synoptic Survey Telescope (LSST) will image the entire (accessible) sky in 5 different colours every 5 days. Conventional methods of classifying objects (using image metrics or citizen science) may be inadequate for fully exploiting the discovery space of these vast surveys. Statistical analysis of these vast datasets, to test Einstein's theory of gravity and shed light on the Big Bang, also presents formidable data analysis challenges which need to be met if the power of the surveys is to be realised.

Current state of the art: SDSS
The Sloan Digital Sky Survey (SDSS) observed a quarter of the sky in 5 optical bands, obtaining imaging and photometry for 500 million sources, and spectroscopy for 1 million. Images and spectra are automatically analysed, but human-eyeball citizen science through Zooniverse has proved useful for finding truly unusual objects, for example Hanny's Voorwerp, the green object shown below: a previously unknown and poorly understood ionised gas cloud in the intergalactic medium, found through the citizen science project Galaxy Zoo. (Source: NASA/ESA/W)

Euclid & LSST: the coming deluge
The forthcoming Euclid and LSST projects will be orders of magnitude beyond the scale of SDSS and similar current projects. Euclid will observe ~40% of the sky at resolutions comparable to the Hubble Space Telescope (HST). Ten billion galaxies will be imaged, each of which will have 100 times the number of pixels of an SDSS image, for roughly 2000x the amount of data per night. LSST will be a wide-field 8 m telescope that will survey about half the sky (20,000 sq. deg.) in 5 colours every 5 days. The exposures can be combined to give time resolution to search for transient sources (e.g. supernovae), stacked to go deep, or some combination of the two. The data rate is 30 terabytes per night, and it will run for more than 10 years. The discovery space for these projects is so big that it cannot be handled by either conventional computing or citizen science approaches.

Physics Department, Imperial College London, South Kensington Campus, London SW7 2AZ. d.clements@imperial.ac.uk
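A back-of-envelope check of the raw data volume implied by the LSST figures above (30 terabytes per night over a ten-year survey); the number of usable nights per year is our own assumption.

```python
# Rough LSST raw-data volume implied by the quoted 30 TB/night over 10 years.
# The 300 usable nights/year figure is an illustrative assumption.
tb_per_night = 30
nights_per_year = 300
years = 10
total_pb = tb_per_night * nights_per_year * years / 1000
print(f"~{total_pb:.0f} PB of raw images over the survey")   # ~90 PB
```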

Future Computational Platforms
Christos Bouganis

Specialized Computational Platforms
The increasing need to process large amounts of data as fast as possible, combined with the development of increasingly complex computational models for more accurate modelling of the underlying processes, has led researchers and practitioners to adopt suboptimal approximation models or, in certain cases, to make heavy use of High-Performance Computing clusters. Neither approach is desirable: the former does not provide the best possible solution, while the latter results in low silicon efficiency and high power consumption, as these systems are not tailored to the structure of a specific application. In the Circuits and Systems group of the Department of Electrical and Electronic Engineering, we conduct research into core computational platforms that can be adapted to specific applications, leading to high performance gains within a power budget compared to classical computer architectures. Our current work involves the design of computational platforms for accelerating the training stage of computationally demanding machine learning algorithms, and for accelerating probabilistic algorithms for Bayesian inference applied to health care.

Machine Learning
Our group has developed a computational platform that accelerates the training stage of a Support Vector Machine algorithm, making it possible to achieve high classification rates within a limited time and power budget. By designing the architecture of the system to match the targeted algorithm, the system has achieved a speed-up of two orders of magnitude while consuming only a fraction of the power footprint of a personal computer. Other key aspects of our research are the optimization of the memory interface, to maximize the bandwidth between computation and SDRAM memory, and data-path optimization, including computer arithmetic, for low power and high performance.

Probabilistic Inference Acceleration
Our work also focuses on the bioinformatics domain, where it is often necessary to analyse large amounts of data using complex probabilistic models. As probabilistic inference algorithms are computationally expensive, our work focuses on the design of computational platforms whose architecture is tuned to the probabilistic inference algorithm. Recent results obtained from the acceleration of population-based MCMC algorithms show that speed-ups of two orders of magnitude over traditional CPU code can be achieved with a minimal power footprint. (A CPU reference sketch of population-based MCMC follows this section.)

Department of Electrical and Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. christos-savvas.bouganis@imperial.ac.uk
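The sketch below is a minimal CPU reference implementation of population-based (parallel-tempering) MCMC on a toy bimodal target, i.e. the class of algorithm being mapped onto tailored hardware. The target density, temperatures and step sizes are illustrative assumptions; this is not the group's accelerated code.

```python
# Minimal population-based (parallel-tempering) MCMC on a toy bimodal target.
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Toy bimodal density: mixture of two Gaussians centred at +/-3.
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

temps = np.array([1.0, 2.0, 4.0, 8.0])      # one chain per temperature
x = rng.normal(size=temps.size)             # current state of each chain

for it in range(10000):
    # Local Metropolis move within each chain, at its own temperature.
    prop = x + rng.normal(scale=1.0, size=x.size)
    accept = np.log(rng.random(x.size)) < (log_target(prop) - log_target(x)) / temps
    x = np.where(accept, prop, x)
    # Exchange move: propose swapping the states of a random adjacent pair of chains.
    i = rng.integers(temps.size - 1)
    log_ratio = (log_target(x[i + 1]) - log_target(x[i])) * (1 / temps[i] - 1 / temps[i + 1])
    if np.log(rng.random()) < log_ratio:
        x[i], x[i + 1] = x[i + 1], x[i]

print("cold-chain state after sampling:", x[0])
```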

Big Data in Medical Imaging
Daniel Rueckert, Ben Glocker

Overview
In medical imaging, a vast amount of information is collected about individual subjects, groups of subjects or entire populations. A characteristic of medical imaging is that the sensors or devices (e.g. CT or MR machines) can produce 2D, 3D or even 4D datasets. While each dataset is large in itself, the amount of information derived from each dataset is often much larger than the original information. In the following we outline the challenges of big data in the context of medical imaging that are addressed in the Biomedical Image Analysis Group at Imperial College London.

Big data from clinical studies/trials
Over the last few years there has been an explosion of imaging data generated from clinical trials. In addition to imaging data collected for drug development, there is an increasing amount of data available for research purposes. Two of the most prominent examples of this are the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Human Connectome Project (HCP). The latter project is building a comprehensive map of neuronal connections at the macroscale. For this, state-of-the-art diffusion and functional MR imaging (see figure below left) is collected from 1200 subjects, producing more than 25 GB of raw data per subject. The analysed data (see figure below right) requires more than 1 PB of storage.

Machine learning for medical imaging
The use of machine learning in the analysis of medical images plays an increasingly important role in many real-world clinical applications, ranging from the acquisition of images of moving organs such as the heart, liver and lungs to computer-aided detection, diagnosis and therapy. For example, machine learning techniques such as manifold learning can be used to identify classes in the image data, and classifiers may be used to differentiate clinical groups across images (see figure below left). In addition, these approaches allow imaging information to be combined with non-imaging information, e.g. genetics (see figure below right, where special vertices encode non-imaging information such as ApoE genotype). The figure below shows the application of these ideas to the automatic identification of subjects with dementia.

Big data from population studies
An example of big data from population studies is the UK Biobank imaging effort. This project has recently received funding for a large-scale feasibility study which, if successful, will allow it to conduct detailed imaging assessments of 100,000 UK Biobank participants. This more detailed characterisation of the participants will allow scientists to develop an even greater understanding of the causes of a wide range of diseases (including dementia) and of ways to prevent and treat them. The imaging study will involve magnetic resonance imaging of the brain, heart and abdomen (see figure right), low-power X-ray imaging of bones and joints, and ultrasound of the neck arteries.

Biomedical Image Analysis Group, Department of Computing, Huxley Building, Imperial College London, South Kensington Campus, London SW7 2AZ. d.rueckert@imperial.ac.uk, b.glocker@imperial.ac.uk
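A schematic sketch of the manifold-learning idea mentioned above: embed high-dimensional image-derived features into a low-dimensional space, then classify clinical groups there. Synthetic random vectors stand in for real image features and labels; the choice of Isomap and logistic regression is ours, not the group's method.

```python
# Schematic manifold learning + classification on synthetic image-derived features.
import numpy as np
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))          # 200 subjects x 5000 voxel/feature values (synthetic)
y = rng.integers(0, 2, size=200)          # 0 = control, 1 = dementia (synthetic labels)

model = make_pipeline(Isomap(n_components=10), LogisticRegression(max_iter=1000))
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())
```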

Effects of High-Frequency Company-Specific News on Individual Stocks
Robert Kosowski, Ras Molnar

Research Objectives
The aim of this research is to study the impact of high-frequency company-specific news on individual stocks. The term high-frequency news in this context means news items reported electronically by a news company during the day. Why is high-frequency news interesting to study? It is an important information source for all market participants, and it sheds light on economic transmission mechanisms that cannot be observed at lower frequencies, for example with daily end-of-day closing prices or low-frequency economic indicators. How is our research novel? The contribution of our research lies in the fact that we not only measure the sentiment extracted from news but other news characteristics as well. We also utilise high-frequency data, which has not been studied extensively from this perspective. What are the expected outputs? We expect to find that high-frequency news and novel sentiment measures have an economically significant impact on asset prices. It is likely that the innovations in our methodology will lead to more significant results compared to existing studies.

Big Data
For the purposes of our project, we use two main sources of high-frequency information. Both imply a vast amount of data related to both news and trades. We use a news database based on the Reuters Site Archive. This dataset contains about 5.6 million Reuters news items from the beginning of 2007 onwards. The raw HTML files take up about 426 GB, while the database containing news identifiers and news text is around 31 GB.

[Figure: number of high-frequency news items by year.]

We use the TAQ database for high-frequency stock data. This dataset contains trades and quotes from the major American stock exchanges; in our research we intend to use trades only. Trade data is an example of big data because the number of trades has increased over time from 92 million trades in 1993 to 7.5 billion more recently. The extensive number of trades implies a large database: the cumulative size of the databases containing TAQ trades from the beginning of 2007 until the end of 2012 is expected to be around 4 TB.

[Figure: number of trades by year (in millions).]

Methodology
The methodology we use in this research is in line with the existing literature (for example Gross-Klussmann, A. and N. Hautsch, "When machines read the news: Using automated text analytics to quantify high frequency news-implied market reactions", Journal of Empirical Finance 18(2)). The frequency and amount of data we have to process mean that we pre-process data within the database before proceeding with the analysis. For the news data we calculate the sentiment, relevance and novelty of news using textual analysis similar to Boudoukh et al. ("Which news moves stock prices? A textual analysis", Technical report, National Bureau of Economic Research, 2013). Stock market data are sampled and only the parts necessary for our analysis are selected. The analysis itself consists of two parts: an event study and a vector autoregression model. The goal is to explain the reaction of the stock market given the characteristics of the news. (A schematic event-study sketch follows this section.)

Acknowledgements
Our news database is based on the Reuters News Web Archive.

Finance Group, Imperial College Business School, Imperial College London, South Kensington Campus, London SW7 2AZ. r.kosowski@imperial.ac.uk
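A schematic sketch of the event-study part of the analysis: market-adjusted abnormal returns cumulated around each news arrival. The data files, column names and the +/-10 minute window are illustrative assumptions, not the project's specification.

```python
# Schematic event study: market-adjusted abnormal returns around news arrivals.
import numpy as np
import pandas as pd

returns = pd.read_csv("minute_returns.csv", parse_dates=["timestamp"])  # ticker, timestamp, ret, mkt_ret (hypothetical)
news = pd.read_csv("news_events.csv", parse_dates=["timestamp"])        # ticker, timestamp, sentiment (hypothetical)

paths = []
for _, event in news.iterrows():
    window = returns[(returns["ticker"] == event["ticker"]) &
                     (returns["timestamp"] >= event["timestamp"] - pd.Timedelta(minutes=10)) &
                     (returns["timestamp"] <= event["timestamp"] + pd.Timedelta(minutes=10))]
    if len(window) == 21:                                   # full 21-minute window available
        # Abnormal return = stock return minus market return (simple market-adjusted model).
        paths.append((window["ret"].values - window["mkt_ret"].values).cumsum())

print("mean cumulative abnormal return at window end:", np.mean(paths, axis=0)[-1])
```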

Development of an Ovarian Cancer Database for Translational Research
Haonan Lu, Christina Fotopoulou, Ioannis Pandis, Yike Guo, Hani Gabra

Ovarian cancer is a systemic disease which can be dysregulated through multiple mechanisms, so it is crucial to understand the detailed molecular pathways behind it. Recently, the Cancer Genome Atlas (TCGA) project has generated multiple levels of OMIC data from genome to phenome, giving a comprehensive view of high-grade epithelial ovarian cancer. However, cross-correlation of good quality clinical data with multi-level molecular profiles is required to obtain valid biomarkers. Furthermore, the difficulty of accessing, and also reproducing, the TCGA data has been a known issue impairing interpretation and implementation of the findings.

Multiple molecular profiles constructed for 175 ovarian cancer cases
We have previously systematically collected samples from 175 primary epithelial ovarian cancer patients and obtained molecular information across multiple platforms, including gene expression microarray, SNP array, exome sequencing and Reverse Phase Protein Array (Figure 1). A great advantage of these data is that the samples were collected from a single institute, with much less bias in sample type; the clinical data is therefore cleaner and the molecular data more reliable.

Figure 1. (a) Types of molecular profile data obtained from the 175 ovarian tumour samples, with the coverage of each platform and the collaborating centre: gene expression profile (>47,000 transcripts; Genome Institute of Singapore), DNA copy number variation (5,677 CNV regions; Genome Institute of Singapore), exome sequencing (whole exome; London Research Institute), proteomics (>160 proteins; MD Anderson) and metabolomics (serum and urine, to be done; Imperial College). (b) Published result using part of the gene expression data: we compared the gene expression profile among three subtypes of ovarian cancer (benign, borderline and malignant). We found distinct gene expression patterns between benign and malignant tumours, whereas borderline tumours showed two distinct subgroups, one benign-like and the other malignant-like. Courtesy of Curry EW, Stronach EA, Rama NR, et al., "Molecular subtypes of serous borderline ovarian tumor show distinct expression patterns of benign tumor and malignant tumor-associated signatures", Mod Pathol.

Continuously updated clinical data
In order to place these molecular data within the correct frame of context and to be able to define valid biomarkers of surgical and clinical outcome, we are currently generating robust, updated and detailed surgical and clinical data to be cross-correlated with the molecular biological information (Figure 2).

Figure 2. (a) Comparison of the number of clinical parameters (total, surgical, chemotherapy) collected by TCGA and at Hammersmith. (b) Planned workflow after obtaining the new clinical data: correlate novel clinical parameters with outcome (e.g. overall survival and progression-free survival) to personalise surgical operations, and correlate biomarkers that stratify patients with the molecular profile to personalise drug treatment.

Data interpretation using tranSMART
Apart from generating quality data, we have also been working on making the data more accessible to researchers by collaborating with the tranSMART project. tranSMART is a database platform with built-in analytical tools that is ready to use for all researchers. We are currently creating the Ovarian Cancer Database within the tranSMART platform, which contains our dataset together with other popular datasets (e.g. the Tothill and TCGA datasets) to help researchers perform data analysis across multiple studies (a worked example is shown in Figure 3). We aim to significantly accelerate ovarian cancer research for both clinicians and scientists.

Figure 3. Example workflow of using the Ovarian Cancer Database in tranSMART. (i) Discovering the association between chemotherapy response and overall survival using the GIS dataset: the Kaplan-Meier plot shows that patients who respond well to chemotherapy (blue) have a significantly higher survival rate than chemo-resistant patients (red). (ii) Differential gene expression between the two patient cohorts (complete response and progressive disease): as the corresponding gene expression profiles are available for these patients, differential gene expression analysis can be performed to discover potential marker genes for chemo-resistance. (iii) Cross-validating genes of interest in multiple datasets to guide subsequent experimental research. All the analysis shown is performed within tranSMART.

Acknowledgements
We especially thank Prof. Yike Guo, Dr. Ioannis Pandis and other group members for their help with the tranSMART platform.

Ovarian Cancer Action Research Centre, Department of Surgery and Cancer, Imperial College London, Hammersmith Campus, London W12 0NN. h.gabra@imperial.ac.uk
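The sketch below illustrates the kind of Kaplan-Meier comparison shown in Figure 3(i). The data file, column names and the use of the lifelines library are illustrative assumptions, not the actual tranSMART workflow.

```python
# Schematic Kaplan-Meier comparison of chemo-sensitive vs chemo-resistant patients.
# 'clinical.csv' and its columns are hypothetical placeholders for a database export.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("clinical.csv")          # columns: os_months, death_observed, chemo_response
kmf = KaplanMeierFitter()

groups = {"responder": df[df.chemo_response == "complete_response"],
          "resistant": df[df.chemo_response == "progressive_disease"]}
for name, g in groups.items():
    kmf.fit(g["os_months"], event_observed=g["death_observed"], label=name)
    kmf.plot_survival_function()

result = logrank_test(groups["responder"]["os_months"], groups["resistant"]["os_months"],
                      groups["responder"]["death_observed"], groups["resistant"]["death_observed"])
print("log-rank p-value:", result.p_value)
```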

Institute for Security Science & Technology
Donal Simmie, Maria Grazia Vigliotti, Erwan Le Martelot, Chris Hankin

Influence in Social Networks
Influential agents in networks play a pivotal role in information diffusion. Influence may rise or fall quickly over time, so capturing this evolution of influence is of benefit to a wide range of application domains. We propose a new model for capturing both time-invariant and temporal influence. We performed a primary survey of our population of users to elicit their views on influential users; the survey allowed us to validate the results of our classifier. We introduce a novel reward-based transformation of the Viterbi path of the observed sequences which provides an overall ranking for users. Our results show an improvement in ranking accuracy over purely topology-based methods for the particular area of interest we sampled. Utilising the evolutionary aspect of the HMM, we predict future states using current evidence. Our prediction algorithm significantly outperforms a collection of baseline models, especially in the short term (1-3 weeks). (A minimal Viterbi sketch follows this section.)

Automated Sensemaking Recovery
Complex data analysis is often multi-modal, incorporating visualisations and structured and unstructured data, possibly from numerous disparate sources. Making sense of the presented data and interrogating it successfully to form hypotheses and conclusions are non-trivial tasks, but they are aided by applications and bespoke tools designed for exactly this purpose. Humans are skilled at solving difficult problems and at exploring data to discover new insights; computers, however, can help by improving our memory and recall and by presenting data in a manner that leads to insight and/or questions our decisions for a more positive outcome. Sensemaking provenance captures the reasoning flow of an analyst during a specific task. We perform machine learning on the interactions of the analyst with the computer, and the context of those actions, to determine their probable reasoning.

Fast Multiscale Community Detection
Many systems can be described using graphs, or networks. Detecting communities in these networks can provide information about the underlying structure and functioning of the original systems. Yet this detection is a complex task, and a large amount of work has been dedicated to it in the past decade. One important feature is that communities can be found at several scales, or levels of resolution, indicating several levels of organisation; therefore solutions to the community structure may not be unique. Networks also tend to be large and hence require efficient processing. In this work, we present a new algorithm for the fast detection of communities across scales using a local criterion. We exploit the local aspect of the criterion to enable parallel computation and improve the algorithm's efficiency further.

Acknowledgements
Influence in Social Networks and Fast Multiscale Community Detection are supported by the Making Sense project under EPSRC grant EP/H023135/1. Automated Sensemaking Recovery is supported by the UKVAC project, funded by the US DHS and the UK Home Office.

Institute for Security Science and Technology, Imperial College London, South Kensington Campus, London SW7 2AZ. d.simmie@imperial.ac.uk
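A minimal sketch of Viterbi decoding for a two-state "influential"/"not influential" hidden Markov model, the building block behind the ranking described above. The transition and emission probabilities and the observation sequence are made up; the reward-based transformation itself is not shown.

```python
# Minimal Viterbi decoding for a two-state influence HMM (toy numbers).
import numpy as np

states = ["not_influential", "influential"]
start = np.log([0.8, 0.2])
trans = np.log([[0.9, 0.1],    # row: current state, column: next state
                [0.3, 0.7]])
emit = np.log([[0.7, 0.3],     # P(observation | state); obs 0 = low activity, 1 = high
               [0.2, 0.8]])
obs = [0, 1, 1, 1, 0, 1]       # toy observed activity levels for one user

v = start + emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = v[:, None] + trans            # scores[i, j]: best path ending in i, then moving to j
    back.append(scores.argmax(axis=0))
    v = scores.max(axis=0) + emit[:, o]

path = [int(v.argmax())]
for b in reversed(back):                   # backtrack the most likely state sequence
    path.append(int(b[path[-1]]))
print([states[s] for s in reversed(path)])
```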

Future Science on Exabytes of Climate Data
David Ham

When climate models execute on 100 million cores and generate exabytes of data, how will we work with this data? How will we account for the diverse numerical schemes used to produce it? How will the users of climate research know that our calculations were valid and that our results can be relied on?

Climate Model Intercomparison
The Climate Model Intercomparison Project (CMIP) provides the basis for the UN Intergovernmental Panel on Climate Change (IPCC) assessment reports, and a very large component of modern climate science is based on the analysis of data from the CMIP simulations. As computing power increases, climate model resolutions become ever finer, and the resulting data sets grow exponentially:
CMIP Phase 3 (2006) produced 36 terabytes;
CMIP Phase 5 (2011) produced 3.3 petabytes;
CMIP Phase 6 (~2020) is expected to yield hundreds of petabytes to 1 exabyte.

Climate science queries
Climate science questions typically require mathematical functions to be applied to reduce vast spatial and temporal field data sets to meaningful climate statistics. Across the vast field of climate science, each research project has its own specialised questions to ask. For example: Which models predict an increase in coastal flooding for the UK? How does Atlantic sea surface temperature differ in different simulations? What is the strength of the Gulf Stream in all of the CMIP simulations?

Current methodology
Data is downloaded by each researcher, and custom analysis scripts are developed for each query. This is:
Labour-intensive: researchers, often PhD students and postdocs, around the world are constantly re-implementing very similar work.
Error-prone: every query script is bespoke and is a new source of errors. There is no systematic mechanism for finding errors.
Untraceable and unverifiable: there is no effective mechanism to publish the actual techniques applied to the data, and verifying their correctness is next to impossible. The results published in the literature must currently be taken on trust, as there is no mechanism for establishing their provenance.

A proposed toolchain for high-productivity, scalable and verifiable climate data science
Rather than hand-writing bespoke low-level processing tools, climate researchers need to be able to state their questions in high-level mathematical form. The code implementing the query will be automatically generated by Firedrake, an Imperial-developed system for the automatic generation of high-performance, parallel numerical code from the mathematical query, and applied to the climate model data. Different numerics will be generated to execute the same mathematics on the outputs of different models, and the code generator can be extensively tested to provide verifiably correct results. Generated code will be applied to the data using cloud resources attached to the archive site, so the original data is not downloaded by the user. The original query is short and expressive and can therefore be included in publications, enabling verification and reproduction of results, which is currently effectively impossible. A query for the mean sea surface temperature in the North Atlantic might appear as:

    north_atlantic = domain(latitude=(0., 60.), longitude=(-60., 0.))
    for date in <list of dates>:
        atlantic_multidecadal_oscillation = \
            integral(sea_surface_temperature * dx(north_atlantic)) / area(north_atlantic)

Departments of Mathematics and Computing, Imperial College London. david.ham@imperial.ac.uk

Intelligent Neural Interfacing Systems
Amir Eftekhar, Sivylla Paraskevopoulou, Timothy Constandinou, Christofer Toumazou

Bio-Inspired Paradigm
Within the Centre for Bio-Inspired Technology we utilise biological principles and mechanisms to create more efficient healthcare technology. This bio-inspired paradigm allows for (1) learning from biology to create more efficient healthcare technologies and (2) modelling biology to understand it better. Expanding this principle, we apply local intelligence to our devices to create more efficient data transmission and to implement closed-loop protocols. Examples from our group include a closed-loop artificial pancreas, a cochlear implant and a retina chip. Some of our more recent work applies local intelligence to neural interfacing.

Brain Interfacing
The brain is a complex network of 100 billion neurons. To transmit the full quantity of data it produces would require nearly 16,000 Tb/s per person. In a chronic disease population of 1 million people, monitoring what can be achieved with modern electrodes and communication (100 electrodes) equates to 16 Tb/s. The same is true for other monitoring schemes: heart activity (ECG, 2-3 channels) and non-invasive brain recording (EEG, up to 64 channels). Although lower in sampling frequency, these still equate to 3 Gb/s per channel for a population of 1 million, or 11 Tb/hour.

Closed-Loop Appetite Control
Obesity is one of the greatest public health challenges of the 21st century. Affecting over half a billion people worldwide, it increases the risk of stroke, heart disease, diabetes, cancers, depression and complications in pregnancy. Bariatric surgery is currently the only effective treatment available, but it is associated with significant risks of mortality and long-term complications. The peripheral nervous system is a complex network of over 45 miles of nerve carrying impulses at speeds of up to 275 mph. In this project we are tapping into the vagus nerve to extract the signals that control appetite, and electrically stimulating it to regulate appetite. The gut is densely innervated by the vagus nerve, so its signals represent an integrated response to nutrients, gut physiology and hormones, and have a powerful effect on appetite. The nerve is a complex structure, so it requires interfacing with dozens of electrodes monitoring chemical and electrical activity. Here we are utilising real-time, self-learning algorithms for closed-loop control of appetite.

Towards Intelligent Next-Generation Neural Interfaces
[Figure: typical neural recording and stimulation chain - amplification, conditioning and pre-processing, spike detection, spike sorting, analysis, stimulation - with applications including prosthetic control, brain-computer interfaces and spinal stimulation, using nerve cuff and microspike electrodes and an external transponder/power unit.]
With the advent of high-density microelectrode arrays we can tap into a subset of these signals. Neural activity can be monitored from hundreds of channels, but with data rates exceeding 20 Mbps this is not possible in medical implants. Local, intelligent processing of neural signals can reduce this to less than 1 Mbps, which facilitates closed-loop systems such as spinal cord stimulation. We have developed low-power, real-time spike detection and sorting algorithms, part of the process of processing neural signals, i.e. identifying which neuron has fired in the vicinity of the electrode. We are currently developing the final generation of microchip with this processing embedded. With it, we can reduce 1 Tb/s to less than a Mb/s for 500 neurons. (A simplified threshold-based spike-detection sketch follows this section.)

Acknowledgements
This work is primarily a multi-disciplinary effort among many researchers and students at the Centre for Bio-Inspired Technology and collaborators.

Centre for Bio-Inspired Technology, Dept. of Electrical and Electronic Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ. amir.eftekhar@imperial.ac.uk
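A simplified amplitude-threshold spike-detection sketch, illustrating the first stage of the on-chip processing described above. The synthetic signal and the threshold rule are illustrative assumptions, not the group's algorithm.

```python
# Simplified amplitude-threshold spike detection on a synthetic recording.
import numpy as np

fs = 24_000                                   # sampling rate (Hz)
rng = np.random.default_rng(0)
signal = rng.normal(0, 1, fs)                 # 1 s of synthetic background noise
spike_times = rng.choice(fs - 48, size=20, replace=False)
for t in spike_times:                         # inject 20 crude spike shapes
    signal[t:t + 48] += 8 * np.hanning(48)

# Robust noise estimate and threshold (a common choice: ~5 x sigma from the median).
sigma = np.median(np.abs(signal)) / 0.6745
threshold = 5 * sigma
crossings = np.flatnonzero((signal[1:] > threshold) & (signal[:-1] <= threshold))

print(f"detected {len(crossings)} threshold crossings (20 spikes injected)")
```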

Digital Money
Llewellyn Thomas, Antoine Vernet, David Gann

Project Context and Goals
Money is one of the most influential factors shaping human history, driving not only wealth creation and socio-economic development but also religion, ethics, morality and fine art (Eagleton & Williams, 2011). Some have argued that digital money, as distinct from earlier forms of money, has the potential to provide major economic and social benefits, for example by removing friction from transactions or enabling inclusive innovation (Dodgson et al., 2012). Moreover, the big data generated by digital money can be used to improve business operating efficiency, develop novel business models, and complement or even extend the notion of identity. However, there is little, if any, systematic research into digital money, its adoption and its impact. Given this gap, it is our ambition to address the following: Does digital money adoption make a difference? What are the big data implications of digital money? Is it possible to quantify the benefits to governments, corporations and individuals? What are the factors that affect the outcome of a digital money initiative?

Conceptualizing Digital Money
We define digital money as currency exchange by electronic means. Digital money is a socio-technical system that fulfils societal functions through technological production, diffusion and use (Geels, 2004). It is a system of value interchange relying on information and communication technologies that themselves form a system. As a result, and given the importance of regulation to digital money, we conceptualised the digital money system as four interacting components: the national institutional context, the enabling technological and financial infrastructure, the demand for digital money, and the industries that drive digital money supply.

Digital Money Readiness
To provide better insight into the differing readiness of countries for digital money, we have developed a Digital Money Readiness Index. By readiness we mean the level of development of a country with respect to the institutional, financial, technological and economic factors that underpin digital money. Taking the four components above as the pillars of the composite index, we selected a range of indicators which measure progress along each pillar, ranked countries according to their digital money readiness, and, using cluster analysis, identified four stages of readiness. We also correlated our index with existing cashlessness measures and found that, although there is strong correlation, there are also developed- and developing-world outliers that reflect the social and cultural aspects of money. (A schematic sketch of such a composite index follows this section.)

Future Directions
This research has begun to widen the discussion of digital money to a broader academic audience. It has also provided a comprehensive definition of digital money that encompasses both the wide variety of existing digital means of exchange and the future technologies that are undoubtedly to come. Our Digital Money Readiness Index also has important implications for policy makers. Moving forward, we intend to: improve the transparency of the index; include measures of digital currencies, such as Bitcoin; implement a penalty for bottlenecks to improve the policy implications of the index; investigate the big data implications of digital money; and investigate whether the claimed economic and social benefits of digital money are indeed present.
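The sketch below shows the general mechanics of a composite readiness index: normalise indicators, average them into pillar scores, combine the pillars, then cluster countries into stages. The data file, indicator names, equal weighting and use of k-means are illustrative assumptions, not the actual index methodology.

```python
# Schematic composite-index construction and clustering into readiness stages.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("indicators.csv", index_col="country")   # hypothetical indicator table
pillars = {
    "institutions":   ["rule_of_law", "regulatory_quality"],
    "infrastructure": ["mobile_penetration", "bank_branches"],
    "demand":         ["internet_users", "urbanisation"],
    "supply":         ["card_payments", "fintech_firms"],
}

norm = (df - df.min()) / (df.max() - df.min())             # min-max normalise each indicator
scores = pd.DataFrame({p: norm[cols].mean(axis=1) for p, cols in pillars.items()})
scores["index"] = scores.mean(axis=1)                      # equal-weighted composite index
scores["stage"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores[["index"]])
print(scores.sort_values("index", ascending=False).head())
```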
Acknowledgements
We gratefully acknowledge both the financial and intellectual support of Citigroup, and would particularly like to thank Greg Baxter, Sandeep Dave and Ashwin Shirvaikar. We also thank Lazlo Szerb and Erkko Autio for their suggestions on composite indices.

Business School, Imperial College London, South Kensington Campus, London SW7 2AZ. llewellyn.thomas@imperial.ac.uk

Impact of Changes in Primary Health Care Provision
Elizabeth Cecil, Alex Bottle, Mike Sharland, Sonia Saxena

Unplanned hospital admissions in children have been rising across England over the last decade [1]. Access to timely and effective primary care for minor or non-urgent conditions prevents potentially avoidable hospital admission [2]. GPs' withdrawal from out-of-hours care in 2004 may have resulted in children being seen in hospital emergency departments where previously parents would have contacted their GP, particularly for acute infectious illness. The Quality and Outcomes Framework (QOF) has been successful in incentivising primary care to improve adult health outcomes for chronic disease. Yet children, who make up 25% of GP workload, are under-represented in quality improvement targets in primary care. Hence children may access hospital-based alternatives to primary care (e.g. walk-in centres, telecare, A&E) for acute exacerbations of chronic conditions [3].

Aim: to investigate whether changes to GP services have had an impact on unplanned and short-stay hospital admissions in children for infectious and chronic disease.
Design: national population-based time trends study.

Methods
We used Hospital Episode Statistics (HES) data from all English hospitals on children aged <15 years to calculate age- and sex-standardised admission rates for all unplanned admissions, short stays (<=2 days with no readmission) and very short stays (no overnight stay), adjusting for deprivation. The interrupted time series analysis allowed for a step change at 2004 and a gradient change post-2004 in the rate of unplanned hospital admissions in children. Outcomes: total unplanned, short-stay and very-short-stay hospital admission rates for all-cause, infectious and chronic disease. Exposure: post-2004. (A schematic interrupted-time-series model follows this section.)

Results
Crude unplanned admission rates increased between 2000/01 and 2010/11 in all developmental age bands in children aged <15 years. The adjusted rate of all-cause unplanned admissions increased by 2% per year after the introduction of the GP service changes in 2004, compared with the trend in previous years (rate ratio (RR) = 1.02, 95% CI: 1.02 to 1.03). The biggest changes were observed in very short stay admissions, i.e. unplanned admissions with no overnight stay. There was an estimated step change of 8.5% (RR = 1.08, 95% CI: 1.07 to 1.10) in adjusted unplanned admission rates for all chronic diseases in 2004. There was no evidence of a step change in the adjusted unplanned admission rates for infectious disease, but the rate of increase doubled after 2004, from 1.2% to 2.3% per year.

[Figure: standardised and fitted unplanned admission rates over time for all-cause, chronic disease and infectious disease admissions.]

Department of Primary Care and Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Paediatric Infectious Diseases Unit, St George's, University of London, Cranmer Terrace, London SW17 0RE. e.cecil@imperial.ac.uk
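A schematic segmented (interrupted-time-series) regression with a step change and a slope change after the 2004 service change, of the kind described in the Methods. The simulated counts, data frame and negative binomial family choice are illustrative placeholders, not the study's data or final model specification.

```python
# Schematic interrupted-time-series regression: level and slope change after 2004.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"quarter": np.arange(44)})            # 11 years of quarterly data
df["post"] = (df.quarter >= 16).astype(int)              # 1 after the 2004 change
df["time_post"] = np.maximum(df.quarter - 16, 0)         # slope-change term
df["population"] = 9_000_000                             # child population denominator (toy)
df["admissions"] = rng.poisson(2000 + 10 * df.quarter + 150 * df.post + 5 * df.time_post)

model = smf.glm("admissions ~ quarter + post + time_post", data=df,
                family=sm.families.NegativeBinomial(),
                offset=np.log(df["population"])).fit()
print(np.exp(model.params))                               # rate ratios: step change and trend change
```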

Early In-Hospital Mortality Following Trainee Doctors' First Day at Work
Min Hua Jen, Alex Bottle, Azeem Majeed, Derek Bell, Paul Aylin

There is a commonly held assumption that early August is an unsafe period to be admitted to hospital in England, as newly qualified doctors start work in NHS hospitals on the first Wednesday of August. A previous UK study using national death certificate data found no effect, but could not discriminate between in- and out-of-hospital deaths. US studies have suggested an equivalent "July effect". We investigate whether in-hospital mortality is higher in the week following the first Wednesday in August than in the previous week, using national hospital administrative data.

Methods
We constructed two retrospective cohorts of all emergency patients admitted on the last Wednesday in July and the first Wednesday in August for 2000 to 2008, each followed up for one week. If a patient had died in hospital by the end of the following Tuesday, we counted them as a death; otherwise we presumed them to have survived. We calculated the odds of death for admissions occurring in the week after the first Wednesday in August compared with those in the week before, adjusted for age (20 groups: <1 year, 1-4, 5-9, and five-year bands up to 90+), sex, area-level socio-economic deprivation (quintile of the Carstairs index of deprivation), year (NHS financial year of discharge, from 1 April each year to 31 March the next year) and comorbidity (Charlson index of co-morbidity, ranging from 0 to 6+). (A schematic version of this adjusted model follows this section.)

Results
[Table: odds ratios comparing the odds of death for patients admitted on the first Wednesday in August with those admitted on the last Wednesday in July, unadjusted and adjusted.]

Discussion
Strengths: a large national study covering nine years; only deaths in hospital were included; a well-defined denominator; no overlap in care.
Limitations: we only looked at those admitted on a single day; our figures equate to just 11 extra deaths per year; follow-up was short, so how long the effect lasts is unknown.

Patients admitted on the first Wednesday in August have a higher death rate than those admitted on the last Wednesday in July in hospitals in England. There is also a statistically significantly higher death rate for medical patients that was not evident for surgical admissions or patients with malignancy. If this effect is due to the changeover of junior hospital staff, then it has potential implications not only for patient care but also for NHS management approaches to delivering safe care. We suggest further work to look at other measures, such as patient safety, quality of care, process measures or medical chart review to identify preventable deaths rather than overall early mortality, to further evaluate the effect of junior doctor changeover.

Acknowledgements
PA, MHJ and AB are employed within the Dr Foster Unit at Imperial College London. The Unit is funded by a research grant from Dr Foster Intelligence (an independent health service research organisation). The Unit is also affiliated with the CPSSQ at Imperial College Healthcare NHS Trust, which is funded by the NIHR. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript or poster.

Dr Foster Unit at Imperial College & Department of Primary Care and Public Health, School of Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Department of Medicine, Imperial College London, Chelsea and Westminster Campus, 369 Fulham Road, London SW10 9NH.
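A schematic version of the adjusted analysis: logistic regression for in-hospital death within one week, comparing August-changeover admissions with late-July admissions. The data file and column names are hypothetical placeholders for the HES extract, not the study's actual variables.

```python
# Schematic adjusted logistic regression for the August-changeover comparison.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cohort.csv")   # one row per emergency admission on the two index days (hypothetical)
model = smf.logit(
    "died_in_week ~ august_week + C(age_band) + C(sex) + C(carstairs_quintile)"
    " + C(fin_year) + C(charlson_band)",
    data=df).fit()
print("adjusted odds ratio for the August week:", np.exp(model.params["august_week"]))
```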

20 Interrupted time-series analysis of London stroke services re-organisation Roxana Alexandrescu, John Tayu Lee, Alex Bottle, Paul Aylin

Stroke accounts for around 11% of all deaths in England. Most people survive a first stroke, but often with significant morbidity. In England, approximately 110,000 people have a first or recurrent stroke each year, and stroke is estimated to cost the economy around £7 billion per year, of which £2.8 billion is a direct cost to the NHS. Prior to 2010, provision of stroke care in London was complex, with care spread across a number of units and only 53% of patients treated on a dedicated stroke ward.1 To improve the quality of service, eight Hyper Acute Stroke Units (HASUs) were established in London from February 2010. The units, which are dedicated to treating stroke patients, are open 24 hours a day, seven days a week, to offer immediate access to stroke investigations and imaging, including CT brain scans and clot-busting thrombolysis drugs. Our aim was to assess the impact of the HASU policy using established stroke performance indicators based on national routine hospital administrative data.

Methods. We used Hospital Episode Statistics (HES) from April 2006 to March 2012 to cover a period before and after the policy's introduction. We identified all admissions with a primary diagnosis of stroke in any episode of care, based on ICD-10 disease codes I60, I61, I62, I63 and I64. We examined six previously defined indicators: brain scan on the day of admission; thrombolysis treatment; diagnosis of aspiration pneumonia in hospital; seven-day in-hospital mortality; discharge to usual place of residence within 56 days; and thirty-day emergency readmission (all causes). We plotted the unadjusted rates for the process and outcome indicators over time (by quarter of year). We tested for linear trends pre and post intervention (excluding a six-month intervention period, January 2010 to June 2010) and for a step change at the time of the intervention for each indicator, using an interrupted time series (ITS) negative binomial regression model. The model also included a seasonal effect (a dummy variable for each month) and patient characteristics including age (six categories: 0-44, 45-54, 55-64, 65-74, 75-84, and 85 years or over), sex and socio-economic deprivation status (Carstairs deprivation quintiles).

Results. During the six-year period April 2006 to March 2012, we identified 536,034 stroke admissions to hospitals in England, 61,643 of them (11.5%) in the London area. Compared with areas outside London, the seven-day in-hospital death rate reduced significantly following the restructuring of services, as did the rate of aspiration pneumonia. However, same-day brain scans showed a small but significant reduction following the intervention, as well as a slowing in their rate of increase. This study suggests that the HASU policy was effective in improving the treatment of stroke patients in the London area, the intervention being associated with decreasing in-hospital mortality and decreasing rates of aspiration pneumonia in the post-intervention period.

Figure 1. Unadjusted temporal changes in the performance indicators for stroke care by study area (London versus England without London), by quarter of year, April 2006 to March 2012: rates of same-day brain scan, thrombolysis, aspiration pneumonia, deaths within seven days, discharge to usual place of residence, and emergency readmission, with the "Reorganisation of stroke services: London, Jan.-July" intervention period marked on each panel.

Acknowledgements. This poster represents independent research supported by the NIHR Patient Safety Translational Research Centre. Dr Foster Unit at Imperial College & Department of Primary Care and Public Health, School of Public Health, Imperial College London, South Kensington Campus, London SW7 2AZ. Department of Medicine, Imperial College London, Chelsea and Westminster Campus, 369 Fulham Road, London SW10 9NH.
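As a rough illustration of the segmented regression used here, the sketch below fits a negative binomial model with a pre-intervention linear trend, a step change at the intervention and a post-intervention change in slope to a hypothetical quarterly series. The file and variable names are assumptions, and the sketch omits the seasonal dummies, case-mix adjustment and London-versus-rest-of-England comparison described in the methods.

```python
# Minimal interrupted time-series sketch, assuming a hypothetical quarterly file with
# one row per quarter containing 'quarter_start' (ISO date string), 'events' (e.g.
# aspiration pneumonia cases) and 'admissions' (stroke admissions) for one area.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

ts = pd.read_csv("stroke_indicator_quarterly_london.csv")  # hypothetical extract

# Drop the six-month intervention window and build the segmented-regression terms:
# an overall linear trend, a step change after the intervention, and a slope change.
ts = ts[~ts["quarter_start"].between("2010-01-01", "2010-06-30")].copy()
ts["time"] = np.arange(len(ts))                                 # linear trend
ts["post"] = (ts["quarter_start"] >= "2010-07-01").astype(int)  # level (step) change
ts["time_post"] = ts["time"] * ts["post"]                       # post-intervention slope

# Negative binomial regression of event counts with admissions as the exposure,
# so the coefficients describe rates per admission.
model = smf.glm(
    "events ~ time + post + time_post",
    data=ts,
    family=sm.families.NegativeBinomial(),
    offset=np.log(ts["admissions"]),
).fit()

# exp(coefficient) gives rate ratios: 'post' is the step change at the intervention,
# 'time_post' is the change in the underlying trend after it.
print(np.exp(model.params[["post", "time_post"]]))
```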
