Statistical Inference for Big Data Problems in Molecular Biophysics


Arvind Ramanathan 1, Andrej Savol 2,4, Virginia Burger 2,4, Shannon Quinn 2,4, Pratul K. Agarwal 3, Chakra Chennubhotla 4

1 Computational Data Analytics Group, Computer Science and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN
2 Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology
3 Computational Biology Institute, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN
4 Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA

Abstract

We highlight the role of statistical inference techniques in extracting biological insights from long time-scale molecular simulation data. Technological and algorithmic improvements in computation have brought molecular simulations to the forefront of techniques for investigating the basis of living systems. While these longer and increasingly complex simulations, now reaching petabyte scales, promise a detailed view into microscopic behavior, teasing out the important information has become a challenge in its own right. Mining this data for important patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mechanistic basis of cellular homeostasis.

1 Molecular Biophysics

Over the last 30 years, biophysicists have taken advantage of advances in computing power to run increasingly detailed simulations of biomolecules in order to investigate the mechanistic basis of their function. The structure, dynamics and function of biological macromolecules such as proteins, deoxyribonucleic and ribonucleic acids (DNA/RNA), carbohydrates and lipids control cellular function, and thus life. Proteins, the workhorses of the cell, are long polymers of amino-acid residues which fold into three-dimensional structures to perform their function.
The biological function controlled by the dynamical interactions between various biomolecules can occur at multiple time scales, from femtoseconds up to microseconds, milliseconds, seconds and beyond, spanning more than 15 orders of magnitude. Molecular dynamics (MD) simulations provide insights into the dependence of biological function on interactions at multiple length and time scales. In this paper, we focus on fully atomistic simulations of proteins/biomolecules in solution, as they best represent the cellular environment. MD simulations are governed by a potential energy function (the force field) that includes both bonded and non-bonded interaction terms. The negative gradient of this energy function gives the force applied to every atom in the molecule. At each time step, Newton's laws of motion are integrated to generate a trajectory. A time step on the order of a femtosecond (10^-15 s) is necessary for capturing the smallest vibrations of interest, whereas biologically interesting events typically occur at microsecond (10^-6 s) and longer time scales. With improvements in sampling techniques and available hardware resources, MD simulations have successfully crossed the millisecond (10^-3 s) time-scale barrier [1] and have provided novel insights into the functioning of biomolecular systems.
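The integration step described above can be illustrated with a minimal sketch. It assumes the velocity Verlet scheme (a common choice in MD codes, though the text does not name a specific integrator) and a toy one-dimensional harmonic potential in place of a real force field; the function and variable names are hypothetical.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet scheme."""
    traj = np.empty((n_steps + 1,) + np.shape(x))
    traj[0] = x
    f = force(x)
    for i in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force = negative gradient of the energy
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        traj[i + 1] = x
    return traj, v

# Toy potential (assumption): U(x) = 0.5 * k * x^2, so F(x) = -k * x.
k = 1.0

def harmonic_force(x):
    return -k * x

def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * float(v @ v) + 0.5 * k * float(x @ x)

x0 = np.array([1.0])
v0 = np.array([0.0])
traj, v_end = velocity_verlet(x0, v0, harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
print("energy drift:", total_energy(traj[-1], v_end) - total_energy(x0, v0))
```

In a real simulation the force routine evaluates all bonded and non-bonded terms for every atom, which is why each femtosecond-scale step is expensive and long trajectories require specialized hardware.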

Assuming that a typical protein has O(1000) atoms, representing a protein in full atomistic detail in Cartesian space (x, y, z) requires at least 3 × O(1000) single-precision numbers per conformation. Even if one stores the output from a molecular dynamics (MD) simulation at 100 ps (10^-10 s) intervals, assuming regular access to millisecond-long simulations in the near future, typical datasets would have somewhere between conformations, which could easily occupy several terabytes of storage. Indeed, datasets of this order are being made available online by several research groups, and one can certainly expect more of these datasets in the near future. Such large-scale (and potentially large-volume) datasets from molecular simulations pose a significant big-data challenge.

The purpose of MD simulations is to reveal the mechanistic basis of protein function. Traditionally, biophysicists have relied on the availability of order parameters (e.g., experimental observables such as dihedral angles, distance constraints between different parts of the protein/biomolecule, or other thermodynamic measurements) as a means of analyzing these simulations. These knowledge-based parameters are often difficult to obtain a priori and can be considered a biological filter applied to the large dataset, from which only a small number of functionally relevant dimensions are drawn. Given an expert's knowledge, these parameters were sufficient to analyze shorter time-scale simulations. However, with the growth in MD datasets (reaching petabyte scale), there is a need to develop automated tools that can discover potential order parameters as well as reveal novel (hitherto unknown) features of the complex energy landscape. Machine learning and statistical inference techniques offer new avenues for elucidating novel relationships in the conformational landscape. Hence, our goal here is to bridge machine learning with molecular biophysics in the hope of discovering new biology.
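The storage arithmetic at the start of this section can be reproduced with a short calculation. The sketch below covers only raw single-precision coordinates for a single trajectory; the 50,000-atom case is a hypothetical solvated system added for comparison and is not a figure from the text.

```python
def trajectory_size_bytes(n_atoms, sim_length_ps, save_interval_ps, bytes_per_number=4):
    """Return (number of stored frames, raw coordinate storage in bytes)."""
    n_frames = sim_length_ps // save_interval_ps
    # Each frame stores x, y, z for every atom.
    return n_frames, n_frames * 3 * n_atoms * bytes_per_number

MS_IN_PS = 10**9  # one millisecond expressed in picoseconds

# Protein only: O(1000) atoms, 1 ms simulated, one frame every 100 ps.
frames, size = trajectory_size_bytes(1_000, MS_IN_PS, 100)
print(frames, "frames,", size / 1e9, "GB")

# Hypothetical solvated system (assumption): 50,000 atoms including water.
solv_frames, solv_size = trajectory_size_bytes(50_000, MS_IN_PS, 100)
print(solv_frames, "frames,", solv_size / 1e12, "TB")
```

Under these assumptions the protein coordinates alone occupy on the order of a hundred gigabytes per trajectory, and keeping the solvent, storing velocities, or running several replicas pushes the total into the terabyte range.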
2 Computational Challenges

We ask how statistical inference tools can help sift through petascale simulation data to reveal the organizing principles of conformational landscapes. The challenges include: (1) building statistical insights into the time-dependent structural changes that the protein undergoes in the course of a simulation; (2) exploiting these structural dependencies to build a biophysically/biochemically relevant low-dimensional representation of the simulation data; (3) using the low-dimensional representations to generate kinetically and energetically coherent conformational sub-states; and finally (4) drawing causality relations from time-dependent structural changes within each conformational sub-state to implicate functionally relevant residues. We discuss the first three issues in more detail below.

Building statistical insights into time-dependent structural changes that a protein undergoes in the course of a simulation

Naturally occurring ensembles, such as images, sounds and videos, have been shown to possess scale-invariant statistics. However, such invariances may be hard to find in molecular simulation data, because protein data is more likely to resemble an object-specific ensemble, such as a dataset of face images, which is known to exhibit multi-scale behavior but not in a convenient scale-invariant form. Thermal fluctuations allow the molecule to cross energy barriers and tumble to places far removed from the starting configuration (Fig. 1). Moreover, recent experimental evidence suggests that these rare, low-probability excursions from the mean conformation may have a significant bearing on biological function, including protein folding, enzyme catalysis and molecular recognition. Sufficiently describing these rare excursions and their properties, and building rich representations from the simulation data, remains a hard algorithmic problem.
McCammon's group published an early result pointing out the long-tailed behavior of positional fluctuation data from picosecond-long simulation trajectories [4]. More recently, we documented this long-tail property in the positional data of ubiquitin and adenylate kinase simulations using microsecond-long simulation trajectories [3, 5]. We observed that the long tails give rise to non-orthogonal couplings between the various portions of the biomolecule. While the long tails hint at the use of independent component analysis algorithms, just as for natural datasets of images and sounds, we also observed that respecting the non-orthogonal correlations intrinsic to the data is the key to discovering energetically coherent clusters and building low-dimensional, biophysically relevant projections.
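The long-tail property discussed above is commonly quantified with kurtosis, the normalized fourth moment, which equals 3 for a Gaussian. The following sketch uses synthetic samples (a Laplace distribution stands in for long-tailed positional deviations; it is an assumption, not simulation data) to show how a value well above 3 signals heavy tails.

```python
import numpy as np

def kurtosis(x):
    """Non-excess kurtosis: E[(x - mean)^4] / variance^2 (equals 3 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.mean(d**4) / np.mean(d**2) ** 2)

rng = np.random.default_rng(0)
gaussian = rng.normal(size=200_000)      # harmonic-like fluctuations
long_tailed = rng.laplace(size=200_000)  # stand-in for anharmonic, long-tailed fluctuations

k_gauss = kurtosis(gaussian)
k_tail = kurtosis(long_tailed)
print("Gaussian kurtosis:", k_gauss)
print("long-tailed kurtosis:", k_tail)
```

Applied to per-atom deviations from the mean conformation, the same statistic would flag the coordinates whose fluctuations depart most strongly from the harmonic picture.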

Figure 1: Biophysical insights gained from a statistical characterization of the simulation data. (A) Time-dependent structural changes are anharmonic, as shown by the long tails in the positional fluctuations from 0.5 µs of simulation data for the protein ubiquitin. Log-histograms of the positional deviations from the mean conformation for (i) backbone alpha-carbon atoms (red, kurtosis κ = 6.3), (ii) all atoms of the protein (blue, kurtosis κ = 8.2) and (iii) the best-fitting Gaussian to the alpha-carbon data (dotted, kurtosis κ = 3). (B) Non-orthogonal or anharmonic coupling in the positional deviations of the alpha-carbon atoms of residues 31 and 45. PCA (black arrows) imposes orthogonality and misses the intrinsic orientation of this data, but a variant of JADE [2] (red arrows) that deconvolves fourth-order dependencies successfully captures the intrinsic anharmonic directions. (C) An emergent behavior from using higher-order statistics is the ability to discover energetically coherent clusters of conformations. (D) Biological insights derived from (C), implicating motions of different regions of ubiquitin as being important for recognizing diverse substrates (2D3G and 2G45). Figure modified from [3].

Learning biophysically/biochemically relevant low-dimensional representations

For a molecular system with N atoms and t simulation frames, a full conformational description requires 3N × t variables, that is, the x, y, z Cartesian coordinates of every atom in every frame. Dimensionality reduction methods are concerned with describing the complexities of molecular motions with far fewer variables, such that important structural shifts can be visualized and interpreted. Conceptually, such approaches map each 3N-dimensional snapshot to a data point on a lower-dimensional manifold, but the nature and complexity of such a mapping are nontrivial, and no current method can claim optimality.
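As a concrete instance of mapping 3N-dimensional snapshots to a few coordinates, the sketch below applies PCA, the simplest linear and orthogonal case, through a singular value decomposition. The trajectory is synthetic, the names are hypothetical, and structural alignment of frames (normally required before PCA on real trajectories) is omitted.

```python
import numpy as np

def pca_project(traj, n_components=2):
    """Project a (t, N, 3) trajectory onto its top principal components."""
    t = traj.shape[0]
    X = traj.reshape(t, -1)        # one 3N-dimensional row per frame
    X = X - X.mean(axis=0)         # deviations from the mean conformation
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]               # orthonormal directions in 3N space
    projection = X @ components.T                # low-dimensional coordinates per frame
    explained = S[:n_components] ** 2 / np.sum(S**2)
    return projection, components, explained

# Synthetic trajectory (assumption): one dominant collective motion plus small noise.
rng = np.random.default_rng(1)
t, n_atoms = 500, 20
mode = rng.normal(size=(n_atoms, 3))
amplitude = rng.normal(size=(t, 1, 1))
traj = amplitude * mode + 0.05 * rng.normal(size=(t, n_atoms, 3))

proj, comps, explained = pca_project(traj)
print(proj.shape, explained)
```

The components returned here are orthogonal by construction, which is exactly the constraint that Figure 1(B) shows to be a poor fit for anharmonically coupled motions.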
Designing this embedding, or set of reaction coordinates, such that biophysically relevant transitions are observed has actually become more challenging, because longer simulations necessarily access more structural diversity and rare transitions. A truly useful mapping must thus (1) filter out thermal fluctuations, (2) be sensitive to the non-Gaussian/anharmonic character of conformational shifts, and (3) capture rare (i.e., outlier) transitions. Existing methods primarily identify basis vectors (in 3N-dimensional conformational space) that align with high-variance directions. Principal component analysis (PCA) methods, adapted from the factor analysis statistics literature, require strong assumptions about the nature of correlated motions within the studied biomolecules. We have shown that two such assumptions, linearity and orthogonality, are not valid for MD simulations. These findings are consistent with the mechanistic interpretation of intramolecular forces, where bonded and non-bonded interactions promote inherently anharmonic coupling between different parts of the biomolecule.

We further emphasize two important drawbacks of existing dimensionality reduction methods. Firstly, current reaction coordinate formulations are not protein specific, meaning that neither the inherent properties of molecular interactions nor the restrictions on protein structure are considered. Secondly, existing methods scale poorly, requiring matrix operations that become intractable for systems of biological interest. Substantial atomistic simulations have become possible primarily through hardware speedups. Useful interpretation of molecular simulations, on the other hand, requires a physics-centric approach to analyzing collective atomic motions. Our experience in highlighting potential weaknesses of existing methods and suggesting alternatives will be critical in expanding the set of tools researchers can apply to molecular systems.

Clustering simulation data with biological and thermodynamical relevance

A second class of analysis techniques concerns grouping simulation frames that share important structural or kinetic features.
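A minimal sketch of grouping frames by structural similarity is given below. It uses plain k-means (Lloyd's algorithm) on synthetic feature vectors, which is a generic technique chosen for illustration and not the authors' method; the deterministic initialization from evenly spaced frames and all names are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with centers initialized from evenly spaced frames."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].copy()
    for _ in range(n_iter):
        # Squared distance from every frame to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic frames (assumption): the first 300 sample one state, the last 300 another.
rng = np.random.default_rng(2)
state_a = rng.normal(loc=0.0, size=(300, 6))
state_b = rng.normal(loc=5.0, size=(300, 6))
frames = np.vstack([state_a, state_b])

labels, centers = kmeans(frames, k=2)
print(np.bincount(labels))
```

Note that this sketch treats frames as an unordered set; it uses the time ordering only to pick initial centers and ignores it during assignment.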
Clustering conformer snapshots presents a considerable challenge in that the data has a natural ordering, namely the temporal relationships between successive frames, that many other datasets lack. Clustering approaches that appreciate the temporal associations among snapshots as well as their structural relationships can provide insight into how protein motions facilitate function, or dysfunction. Specifically, clustering enables the determination of a small number of parameters (referred to as order parameters) that are highly relevant for estimating rate kinetics and thermodynamics, which are directly measurable via experiments. Although many clustering techniques are regularly applied to MD simulations, the challenge is to determine which techniques can provide adequate insights into the biological underpinnings of function and relate them back seamlessly to experimental observables such as rate kinetics and thermodynamics.

3 Solution Space and Outlook

We have outlined the current state of the art in developing statistical inference approaches for big-data problems in molecular biophysics. However, with the astronomical increase in the size of datasets from molecular simulations, it is clear that there is a need to develop integrated machine learning toolkits that provide for:

- handling access to and analysis of big simulation data in a distributed fashion, such as using Hadoop;
- online or near real-time analytics of simulations as they are progressing, to facilitate anomaly detection and large-scale biophysically relevant motion signatures [6, 7];
- developing toolkits that are particularly suited to exploit heterogeneous computing resources such as graphics processing units (GPUs) for molecular biophysics applications [8].

The availability of packages such as HiMach [9], MDAnalysis [10] and HOST4MD [5] will certainly facilitate the development of the computing infrastructure required to tackle the big data challenges from molecular biophysics.
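The idea of analyzing a simulation while it is still running can be sketched with a streaming statistic. The example below assumes a single scalar observable per frame (for instance a deviation from a reference structure, which is a hypothetical choice) and uses Welford's online mean and variance to flag outlying frames; it is a generic illustration and not the approach of [6, 7].

```python
import numpy as np

class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

def flag_anomalies(values, threshold=4.0, warmup=50):
    """Return indices of frames deviating by more than threshold * std from the running mean."""
    stats = RunningStats()
    flagged = []
    for i, x in enumerate(values):
        if stats.n >= warmup and stats.std > 0 and abs(x - stats.mean) > threshold * stats.std:
            flagged.append(i)
        stats.update(x)  # every frame is seen once and never stored
    return flagged

# Synthetic observable stream (assumption) with one injected rare excursion.
rng = np.random.default_rng(3)
stream = rng.normal(size=1000)
stream[700] = 12.0
flagged = flag_anomalies(stream)
print(flagged)
```

Because each frame is processed once with constant memory, a detector of this kind can run alongside the simulation instead of waiting for the full trajectory to be written to disk.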
References

[1] David E. Shaw, Ron O. Dror, John K. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, Cliff Young, Martin M. Deneroff, Brannon Batson, Kevin J. Bowers, Edmond Chow, Michael P. Eastwood, Douglas J. Ierardi, John L. Klepeis, Jeffrey S. Kuskin, Richard H. Larson, Kresten Lindorff-Larsen, Paul Maragakis, Mark A. Moraes, Stefano Piana, Yibing Shan, and Brian Towles. Millisecond-scale molecular dynamics simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 39:1-39:11, New York, NY, USA, ACM.
[2] Jean-Francois Cardoso. High-order contrasts for independent component analysis. Neural Comput., 11(1): ,
[3] A. Ramanathan, A. J. Savol, C. J. Langmead, P. K. Agarwal, and C. S. Chennubhotla. Discovering conformational sub-states relevant to protein function. PLoS ONE, 6(1):e15827,
[4] S. H. Northrup, M. R. Pearl, J. D. Morgan, J. A. McCammon, and M. Karplus. Molecular dynamics of ferrocytochrome c: magnitude and anisotropy of atomic displacements. J. Mol. Biol., 153: ,
[5] A. Ramanathan, A. J. Savol, P. K. Agarwal, and C. S. Chennubhotla. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: Application to enzyme adenylate kinase. Proteins: Struct., Funct., and Bioinform., 80(11): ,
[6] Willy Wriggers, Kate A. Stafford, Yibing Shan, Stefano Piana, Paul Maragakis, Kresten Lindorff-Larsen, Patrick J. Miller, Justin Gullingsrud, Charles A. Rendleman, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Automated event detection and activity monitoring in long molecular dynamics simulations. Journal of Chemical Theory and Computation, 5(10): ,
[7] A. Ramanathan, J.-O. Yoo, and C. J. Langmead. On-the-fly identification of conformational substates from molecular dynamics simulations. J. Chem. Theory Comput., 7(3): ,
[8] D. Brandt. Investigation of GPGPU for use in Processing of EEG in Real-time. PhD thesis, Kate Gleason College of Engineering,
[9] Tiankai Tu, Charles A. Rendleman, David W. Borhani, Ron O. Dror, Justin Gullingsrud, Morten Ø. Jensen, John L. Klepeis, Paul Maragakis, Patrick Miller, Kate A. Stafford, and David E. Shaw. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 56:1-56:12, Piscataway, NJ, USA, IEEE Press.
[10] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10): ,


Math 215 HW #6 Solutions Math 5 HW #6 Solutions Problem 34 Show that x y is orthogonal to x + y if and only if x = y Proof First, suppose x y is orthogonal to x + y Then since x, y = y, x In other words, = x y, x + y = (x y) T

More information

HPC & Visualization. Visualization and High-Performance Computing

HPC & Visualization. Visualization and High-Performance Computing HPC & Visualization Visualization and High-Performance Computing Visualization is a critical step in gaining in-depth insight into research problems, empowering understanding that is not possible with

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

CHEM 451 BIOCHEMISTRY I. SUNY Cortland Fall 2010

CHEM 451 BIOCHEMISTRY I. SUNY Cortland Fall 2010 CHEM 451 BIOCHEMISTRY I SUNY Cortland Fall 2010 Instructor: Dr. Frank Rossi Office: Bowers 135 Office Hours: Mon. 2:30-4:00, Wed. 4:00-5:30, Friday 2:30-3:00, or by appointment. Extra evening office hours

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Chemistry Course Descriptions

Chemistry Course Descriptions Chemistry Course Descriptions Please note: Course numbers and descriptions are given based on the UCF course offerings, if available. Courses Offered UCF BCC CFCC DBCC LSCC SCC VCC CHM 1015 (Pre-College

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Unit I: Introduction To Scientific Processes

Unit I: Introduction To Scientific Processes Unit I: Introduction To Scientific Processes This unit is an introduction to the scientific process. This unit consists of a laboratory exercise where students go through the QPOE2 process step by step

More information

Course Curriculum for Master Degree in Medical Laboratory Sciences/Clinical Biochemistry

Course Curriculum for Master Degree in Medical Laboratory Sciences/Clinical Biochemistry Course Curriculum for Master Degree in Medical Laboratory Sciences/Clinical Biochemistry The Master Degree in Medical Laboratory Sciences /Clinical Biochemistry, is awarded by the Faculty of Graduate Studies

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

Data Driven Discovery In the Social, Behavioral, and Economic Sciences Data Driven Discovery In the Social, Behavioral, and Economic Sciences Simon Appleford, Marshall Scott Poole, Kevin Franklin, Peter Bajcsy, Alan B. Craig, Institute for Computing in the Humanities, Arts,

More information

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices overview Pipeline Pilot Enterprise Server Pipeline Pilot Enterprise Server (PPES) is a powerful client-server platform that streamlines the integration and analysis of the vast quantities of data flooding

More information

1) Chemical Engg. PEOs & POs Programme Educational Objectives

1) Chemical Engg. PEOs & POs Programme Educational Objectives 1) Chemical Engg. PEOs & POs Programme Educational Objectives The Programme has the following educational objectives: To prepare students for successful practice in diverse fields of chemical engineering

More information

Visualization of the Phosphoproteomic Data from AfCS with the Google Motion Chart Gadget

Visualization of the Phosphoproteomic Data from AfCS with the Google Motion Chart Gadget Visualization of the Phosphoproteomic Data from AfCS with the Google Motion Chart Gadget Huilei Xu 1, and Avi Ma ayan 1,* 1 Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine,

More information

Protein Dynamics by NMR. Why NMR is the best!

Protein Dynamics by NMR. Why NMR is the best! Protein Dynamics by NMR Why NMR is the best! Key Points NMR dynamics divided into 2 regimes: fast and slow. How protein mobons affect NMR parameters depend on whether they are faster or slower than the

More information

Conformational analysis of lipid molecules by self-organizing maps

Conformational analysis of lipid molecules by self-organizing maps THE JOURNAL OF CHEMICAL PHYSICS 126, 054707 2007 Conformational analysis of lipid molecules by self-organizing maps Teemu Murtola and Mikko Kupiainen Laboratory of Physics, Helsinki University of Technology,

More information

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,

More information

CHEMISTRY. Real. Amazing. Program Goals and Learning Outcomes. Preparation for Graduate School. Requirements for the Chemistry Major (71-72 credits)

CHEMISTRY. Real. Amazing. Program Goals and Learning Outcomes. Preparation for Graduate School. Requirements for the Chemistry Major (71-72 credits) CHEMISTRY UW-PARKSIDE 2015-17 CATALOG Greenquist 344 262-595-2326 College: Natural and Health Sciences Degree and Programs Offered: Bachelor of Science Major - Chemistry Minor - Chemistry Certificate -

More information

Introduction to Engineering System Dynamics

Introduction to Engineering System Dynamics CHAPTER 0 Introduction to Engineering System Dynamics 0.1 INTRODUCTION The objective of an engineering analysis of a dynamic system is prediction of its behaviour or performance. Real dynamic systems are

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Lecture 19: Proteins, Primary Struture

Lecture 19: Proteins, Primary Struture CPS260/BGT204.1 Algorithms in Computational Biology November 04, 2003 Lecture 19: Proteins, Primary Struture Lecturer: Pankaj K. Agarwal Scribe: Qiuhua Liu 19.1 The Building Blocks of Protein [1] Proteins

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Advanced Medicinal & Pharmaceutical Chemistry CHEM 5412 Dept. of Chemistry, TAMUK

Advanced Medicinal & Pharmaceutical Chemistry CHEM 5412 Dept. of Chemistry, TAMUK Advanced Medicinal & Pharmaceutical Chemistry CHEM 5412 Dept. of Chemistry, TAMUK Dai Lu, Ph.D. dlu@tamhsc.edu Tel: 361-221-0745 Office: RCOP, Room 307 Drug Discovery and Development Drug Molecules Medicinal

More information

Visualization methods for patent data

Visualization methods for patent data Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, bjgrant@med.umich.edu) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

Tracking in flussi video 3D. Ing. Samuele Salti

Tracking in flussi video 3D. Ing. Samuele Salti Seminari XXIII ciclo Tracking in flussi video 3D Ing. Tutors: Prof. Tullio Salmon Cinotti Prof. Luigi Di Stefano The Tracking problem Detection Object model, Track initiation, Track termination, Tracking

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Concept and Project Objectives

Concept and Project Objectives 3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the

More information

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA 315 DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA C. K. Lowe-Ma, A. E. Chen, D. Scholl Physical & Environmental Sciences, Research and Advanced Engineering Ford Motor Company, Dearborn, Michigan, USA

More information

MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information

Joint/Interdisciplinary Degree Programs

Joint/Interdisciplinary Degree Programs Joint/Interdisciplinary Degree Programs BACHELOR OF ENGINEERING (BENG) PROGRAM IN COMPUTER ENGINEERING Program Director: Amine BERMAK, Associate Professor of Electronic and Computer Engineering The Computer

More information

AS-D1 SIMULATION: A KEY TO CALL CENTER MANAGEMENT. Rupesh Chokshi Project Manager

AS-D1 SIMULATION: A KEY TO CALL CENTER MANAGEMENT. Rupesh Chokshi Project Manager AS-D1 SIMULATION: A KEY TO CALL CENTER MANAGEMENT Rupesh Chokshi Project Manager AT&T Laboratories Room 3J-325 101 Crawfords Corner Road Holmdel, NJ 07733, U.S.A. Phone: 732-332-5118 Fax: 732-949-9112

More information

TIETS34 Seminar: Data Mining on Biometric identification

TIETS34 Seminar: Data Mining on Biometric identification TIETS34 Seminar: Data Mining on Biometric identification Youming Zhang Computer Science, School of Information Sciences, 33014 University of Tampere, Finland Youming.Zhang@uta.fi Course Description Content

More information

The Challenge of Handling Large Data Sets within your Measurement System

The Challenge of Handling Large Data Sets within your Measurement System The Challenge of Handling Large Data Sets within your Measurement System The Often Overlooked Big Data Aaron Edgcumbe Marketing Engineer Northern Europe, Automated Test National Instruments Introduction

More information

A Statistician s View of Big Data

A Statistician s View of Big Data A Statistician s View of Big Data Max Kuhn, Ph.D (Pfizer Global R&D, Groton, CT) Kjell Johnson, Ph.D (Arbor Analytics, Ann Arbor MI) What Does Big Data Mean? The advantages and issues related to Big Data

More information

USING SELF-ORGANISING MAPS FOR ANOMALOUS BEHAVIOUR DETECTION IN A COMPUTER FORENSIC INVESTIGATION

USING SELF-ORGANISING MAPS FOR ANOMALOUS BEHAVIOUR DETECTION IN A COMPUTER FORENSIC INVESTIGATION USING SELF-ORGANISING MAPS FOR ANOMALOUS BEHAVIOUR DETECTION IN A COMPUTER FORENSIC INVESTIGATION B.K.L. Fei, J.H.P. Eloff, M.S. Olivier, H.M. Tillwick and H.S. Venter Information and Computer Security

More information

M.Sc. in Nano Technology with specialisation in Nano Biotechnology

M.Sc. in Nano Technology with specialisation in Nano Biotechnology M.Sc. in Nano Technology with specialisation in Nano Biotechnology Nanotechnology is all about designing, fabricating and controlling materials, components and machinery with dimensions on the nanoscale,

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Section 1.1. Introduction to R n

Section 1.1. Introduction to R n The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

Molecular Docking: A Problem With Thousands Of Degrees Of Freedom

Molecular Docking: A Problem With Thousands Of Degrees Of Freedom Molecular Docking: A Problem With Thousands Of Degrees Of Freedom Miguel L. Teodoro 1 mteodoro@rice.edu George N. Phillips Jr 2 phillips@biochem.wisc.edu Lydia E. Kavraki 3 kavraki@rice.edu 1 Department

More information

A Chromium Based Viewer for CUMULVS

A Chromium Based Viewer for CUMULVS A Chromium Based Viewer for CUMULVS Submitted to PDPTA 06 Dan Bennett Corresponding Author Department of Mathematics and Computer Science Edinboro University of PA Edinboro, Pennsylvania 16444 Phone: (814)

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information