On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data:
- Adele McLaughlin
- 8 years ago
Transcription
1 On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: Case Study in Distributed Structural Biology using MapReduce. Boyu Zhang (1), Trilce Estrada (2), Pietro Cicotti (3), Michela Taufer (1). (1) University of Delaware, (2) University of New Mexico, (3) San Diego Supercomputer Center
2 The docking process in drug design. High-throughput screening in drug design: docking of ligand conformations into protein. P-L docking algorithm [flowchart labels: Model Prot.-Lig. onto 3D grid; Alter Lig. Configuration; Dock Lig. into Prot.; MD Simulated Annealing; Energy Minimization; Trial; Calculate Scores; Evaluate Candidate Solutions]
4 Docking simulation = large conformation dataset. How can we select well-docked (near-native) conformations among the sampled conformations in the large dataset?
5 It is not just about the lowest energy. [Plot axis labels: number of conformations, energy] Docked complexes are traditionally scored based on energy. A ligand scoring min energy does NOT always have a near-native structure.
7 Searching for most popular pose. Compare geometry of ligands, searching for the most popular pose. [Plot axis labels: number of conformations, RMSD, energy]
8 Comparing ligand geometries. Compare geometry of molecules, searching for dense spaces of similar poses. Compute RMSDs, each between two 3D poses.
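The RMSD comparison mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes both poses list the same atoms in the same order and already share the protein's coordinate frame, so no superposition is applied. The function names `rmsd` and `pairwise_rmsd` are hypothetical.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses.

    Assumption: each pose is a list of (x, y, z) tuples with matching atom order.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def pairwise_rmsd(poses):
    """RMSD for every unordered pair of poses, keyed by (i, j) with i < j."""
    n = len(poses)
    return {
        (i, j): rmsd(poses[i], poses[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
```

For n poses, the all-pairs comparison needs n(n-1)/2 RMSD evaluations, so its cost grows quadratically with the size of the dataset.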
9 Comparing ligand geometries. Distributed data generation and storage; centralized data analysis [diagram labels: node, data movement, node]. When dealing with big, distributed datasets of conformations, clustering kills performance. Dealing with uncertainties, e.g., number of clusters. Dealing with scalability, e.g., I/O and storage limits.
10 Capturing relevant properties. Extract geometrical shape (property) of docked ligand in the docking pocket of protein [diagram labels: conformation, geometry]. Perform space reduction from data (atom coordinates of the ligand conformations) to extracted property (3D point). Expect conformations with similar geometry to be mapped into close points.
11 Capturing relevant properties. Encode ligand conformations into single 3D points. Best-fit linear regression line of the 2D points. Line slopes become coordinates of 3D point encoding geometry.
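A possible reading of this encoding step is sketched below. The slide does not say how the 2D points are obtained; the sketch assumes they are the projections of the ligand's atom coordinates onto the xy, yz, and zx planes, and that each slope is an ordinary least-squares fit of the second coordinate against the first. Both choices, and the names `slope` and `encode_conformation`, are assumptions made for illustration.

```python
def slope(points_2d):
    """Least-squares slope of v against u for a list of (u, v) pairs."""
    n = len(points_2d)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_u = sum(u for u, _ in points_2d) / n
    mean_v = sum(v for _, v in points_2d) / n
    var_u = sum((u - mean_u) ** 2 for u, _ in points_2d)
    if var_u == 0.0:
        raise ValueError("all u values are equal: slope is undefined")
    cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in points_2d)
    return cov_uv / var_u


def encode_conformation(atoms):
    """Map a conformation (list of (x, y, z) atoms) to one 3D point.

    Assumption: the three coordinates are the regression slopes of the
    projections onto the xy, yz, and zx planes.
    """
    xy = [(x, y) for x, y, z in atoms]
    yz = [(y, z) for x, y, z in atoms]
    zx = [(z, x) for x, y, z in atoms]
    return (slope(xy), slope(yz), slope(zx))
```

Under these assumptions, a conformation of any size is reduced to three numbers, which is the space reduction from atom coordinates to a single property point that the slides describe.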
12 From clustering problem to density search
13 Counting property aggregates. Deal with property-encoding points rather than raw data. Transform the analysis problem from a clustering or classification problem into a density search problem. Build octree by assigning an octkey to each point representing a ligand conformation based on its position in 3D space.
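One way to assign an octkey to a property point is sketched below. The slides do not specify the key format; this sketch assumes a string with one digit (0 to 7) per octree level, obtained by repeatedly halving an axis-aligned bounding box. The function name `octkey`, the bounding-box parameters, and the bit order are all hypothetical.

```python
def octkey(point, lower, upper, depth):
    """Return a string of octant digits (0-7), one per octree level.

    Assumptions: `lower` and `upper` are the corners of a bounding box that
    contains `point`; bit 0 of each digit encodes x, bit 1 encodes y, and
    bit 2 encodes z (set when the point lies in the upper half of that axis).
    """
    lo = list(lower)
    hi = list(upper)
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis
                lo[axis] = mid   # descend into the upper half
            else:
                hi[axis] = mid   # descend into the lower half
        digits.append(str(digit))
    return "".join(digits)
```

With this format, the first d digits of a key identify the octant that contains the point at depth d, so counting points per octant at any level reduces to counting key prefixes.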
14 Octree-based encoding and search. Binary search through the octree hierarchy to find the deepest, most dense octant.
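The binary search over octree depths might look like the sketch below. It assumes octkeys are digit strings in which a prefix of length d names the octant at depth d, and it treats an octant as dense when it holds at least `min_points` points; that threshold and the function names are assumptions, not taken from the slides. The search is valid because the count of the densest octant can only stay equal or shrink as depth increases.

```python
from collections import Counter


def densest_octant(octkeys, depth):
    """Return (prefix, count) for the most populated octant at `depth`."""
    counts = Counter(key[:depth] for key in octkeys)
    return counts.most_common(1)[0]


def deepest_dense_octant(octkeys, max_depth, min_points):
    """Binary search on depth for the deepest octant with >= min_points points.

    Returns (depth, prefix, count), or None if even the root is too sparse.
    """
    if not octkeys:
        raise ValueError("octkeys must not be empty")
    low, high = 0, max_depth   # depth 0 is the root octant holding every point
    best = None
    while low <= high:
        mid = (low + high) // 2
        prefix, count = densest_octant(octkeys, mid)
        if count >= min_points:
            best = (mid, prefix, count)
            low = mid + 1      # dense enough: try deeper levels
        else:
            high = mid - 1     # too sparse: back off to shallower levels
    return best
```

Each probe is a single counting pass over the keys, and the number of probes is logarithmic in `max_depth`.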
15 Search for dense spaces
16 Search for dense spaces [figure labels: octree nodes; reengineered ligand conformations]
17 Search for dense spaces [figure labels: deepest, most dense octant found by our algorithm; near-native ligand structures]
18 Implementation in MapReduce. Ligand conformations distributed across multiple nodes. Define map and reduce functions; different variants possible: From global to local -> move properties. From local to global -> move densities.
19 From global to local. [Diagram labels: Nodes 1-3; store locally; encode properties (map properties); exchange properties (shuffle properties); count densities (SPAs); count locally]
20 From local to global. [Diagram labels: Nodes 1-3; store locally; encode properties (map properties); count densities (SPAs); exchange densities (SPAs) (shuffle densities); count globally]
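The difference between the two variants can be illustrated with a plain-Python stand-in for the map, shuffle, and reduce phases. This is not a real MapReduce job and not the authors' implementation: nodes are modelled as lists of octkeys, and the second return value simply counts how many records would cross the shuffle. All names are hypothetical.

```python
from collections import Counter


def global_to_local(nodes, depth):
    """Move properties: every encoded point crosses the shuffle, then is counted.

    Returns (octant counts, number of records shuffled).
    """
    shuffled = [key[:depth] for node in nodes for key in node]
    return Counter(shuffled), len(shuffled)


def local_to_global(nodes, depth):
    """Move densities: each node counts locally, only the counts are shuffled.

    Returns (octant counts, number of records shuffled).
    """
    partials = [Counter(key[:depth] for key in node) for node in nodes]
    moved = sum(len(partial) for partial in partials)  # one record per octant per node
    total = Counter()
    for partial in partials:
        total.update(partial)  # global sum of the per-node densities
    return total, moved
```

Both functions return the same octant counts. In this model the local-to-global variant shuffles at most one record per occupied octant per node, so it never moves more records than the global-to-local variant.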
21 Logical distribution of data. Dataset: million protein-ligand records. [Panel titles: Strong convergence towards one ligand conformation; Strong convergence towards two ligand conformations; Weak convergence towards one ligand conformation; No convergence!!!]
22 Physical distribution of data. Distributed datasets generated in semi- or fully-decentralized systems. UNIFORM: property-encoding points that belong to the same subspace in the logical distribution are stored in the same physical storage. ROUND-ROBIN: points that belong to the same subspace in the logical distribution are stored in separate physical storages in a round-robin manner. RANDOM: points are randomly stored in the physical storages of all the system nodes.
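The three placement policies can be mimicked with a small helper for experimenting with synthetic data. This is a sketch under assumptions: points arrive grouped by logical subspace, UNIFORM maps each subspace to one node by its index, and the function name `place` and its parameters are hypothetical.

```python
import random


def place(points_by_subspace, num_nodes, policy, seed=0):
    """Distribute points over `num_nodes` storages under a placement policy.

    `points_by_subspace` maps a subspace label to its list of points.
    Returns one list of (subspace, point) pairs per node.
    """
    nodes = [[] for _ in range(num_nodes)]
    rng = random.Random(seed)   # seeded so the random policy is repeatable
    next_node = 0
    for index, (subspace, points) in enumerate(sorted(points_by_subspace.items())):
        for point in points:
            if policy == "uniform":
                target = index % num_nodes      # whole subspace on one node
            elif policy == "round-robin":
                target = next_node              # consecutive points on consecutive nodes
                next_node = (next_node + 1) % num_nodes
            elif policy == "random":
                target = rng.randrange(num_nodes)
            else:
                raise ValueError("unknown policy: " + policy)
            nodes[target].append((subspace, point))
    return nodes
```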
23 [Chart labels: Map, Shuffle, Overhead, Reduce; GL, LG; Round-robin]
24 [Chart labels: Map, Shuffle, Overhead, Reduce; GL, LG; Round-robin, Random]
25 Accuracy: Self-docking. Self-docking: 23, 2, and 2 ligands dock into HIV, trypsin, and p38alpha respectively. Search across 56 datasets of, poses each. [Chart labels: Percentage (%); HIV protease, Trypsin, P38alpha; octree-based, energy; bar labels as extracted: /23, /2, 3/2, 8/23, 5/2, /2]
26 Lessons learned. We can avoid data movement when analyzing big scientific data distributed across multiple nodes. Our approach performs a single pass of data to extract relevant properties: geometry of a ligand conformation in a large dataset of million conformations. Only properties or property densities are exchanged among nodes. When exchanging property densities, our approach delivers scalable performance and is NOT sensitive to scientific contents.
27 Acknowledgments. Collaborators: volunteers; Roger Armen (TJU). Request for information: Michela Taufer: Global Computing. Sponsors: