Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
|
|
- Abel Horn
- 8 years ago
- Views:
Transcription
1 /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of Science and Technology Thuwal, Saudi Arabia 2 IBM Watson Research Center Yorktown Heights, NY
2 /35 Importance of Graphs Graphs abstract application-specific algorithms into generic problems represented as interactions using vertices and edges Max flow in road network Diameter of WWW Ranking in social networks Simulating protein interactions Different applications vary in their computation requirements
3 3/35 How to Scale Graph Processing Pregel was introduced by Google as a scalable abstraction for graph processing Overcomes the limitations of processing graphs on MapReduce Based on vertex-centric computation Utilizes bulk synchronous parallel (BSP) System Programming Data Computations Abstraction Exchange map() Key-based MapReduce combine() grouping Disk-based reduce() compute() Message Pregel combine() passing In memory aggregate()
4 4/35 Pregel s BSP Graph Processing Superstep 1 Superstep 2 Superstep 3 Worker 1 Worker 1 Worker 1 Worker 2 Worker 2 Worker 2 Worker 3 Worker 3 Worker 3 BSP Barrier Balanced computation and communication is fundamental to Pregel s efficiency
5 5/35 Optimizing Graph Processing Existing work focus on optimizing for graph structure (static optimization): Optimize graph partitioning: Simple graph partitioning schemes (e.g., hash or range): Giraph User-defined partitioning function: Pregel Sophisticated partitioning techniques (e.g., min-cuts): GraphLab, PowerGraph, GoldenOrb and Surfer Optimize graph data access: Distributed data stores and graph indexing: GoldenOrb, Hama and Trinity
6 5/35 Optimizing Graph Processing Existing work focus on optimizing for graph structure (static optimization): Optimize graph partitioning: Simple graph partitioning schemes (e.g., hash or range): Giraph User-defined partitioning function: Pregel Sophisticated partitioning techniques (e.g., min-cuts): GraphLab, PowerGraph, GoldenOrb and Surfer Optimize graph data access: Distributed data stores and graph indexing: GoldenOrb, Hama and Trinity What about algorithm behavior? Pregel provides coarse-grained load balancing, is it enough?
7 6/35 Types of Algorithms Algorithms behaves differently, we classify algorithms in two categories depending on their behavior Algorithm Type In/Out Vertices Examples Messages State PageRank, Stationary Predictable Fixed Diameter estimation, WCC Distributed minimum spanning tree, Non-stationary Variable Variable Belief propagation, Graph queries, Ad propagation
8 7/35 Types of Algorithms First Superstep V2 V8 V2 V8 V4 V1 V6 V4 V1 V6 V7 V3 V5 V7 V3 V5 PageRank DMST
9 8/35 Types of Algorithms k Superstep V2 V8 V2 V8 V4 V1 V6 V4 V1 V6 V7 V3 V5 V7 V3 V5 PageRank DMST
10 9/35 Types of Algorithms k + m Superstep V2 V8 V2 V8 V4 V1 V6 V4 V1 V6 V7 V3 V5 V7 V3 V5 PageRank DMST
11 10/35 What Causes Computation Imbalance in Non-stationary Algorithms? Superstep 1 Superstep 2 Superstep 3 Worker 1 Worker 1 Worker 1 Worker 2 Worker 2 Worker 2 Worker 3 Worker 3 Worker 3 BSP Barrier -Vertex response time -Time to receive in messages -Time to send out messages
12 11/35 Why to Optimize for Algorithm Behavior? Difference between stationary and non-stationary algorithms In Messages (Millions) SuperSteps PageRank - Total PageRank - Max/W DMST - Total DMST - Max/W
13 12/35 What is Mizan? Open source BSP-based graph processing Provides runtime load balancing through fine grain vertex migrations Follows Pregel model
14 3/35 PageRank s compute() in Mizan void compute(messageiterator * messages, uservertexobject * data, messagemanager * comm) { double currval = data->getvertexvalue().getvalue(); double newval = 0; double c = 0.85; if (data->getcurrentss() > 1) { while (messages->hasnext()) { newval = newval + messages->getnext().getvalue(); } newval = newval * c + (1.0 - c) / ((double) vertextotal); data->setvertexvalue(mdouble(newval)); } else {newval = currval;} if (data->getcurrentss() <= maxsuperstep) { mdouble outval(newval / ((double) data->getoutedgecount())); for (int i = 0; i < data->getoutedgecount(); i++) { comm->sendmessage(data->getoutedgeid(i), outval); } } else {data->votetohalt();} }
15 14/35 Migration Planning Objectives Decentralized Simple Transparent No need to change Pregel s API Does not assume any a priori knowledge to graph structure or algorithm
16 15/35 Mizan s Migration Barrier Mizan performs both planning and migrations after all workers reach the BSP barrier Superstep 1 Superstep 2 Superstep 3 Worker 1 Worker 1 Worker 1 Worker 2 Worker 2 Worker 2 Worker 3 Worker 3 Worker 3 Migration Barrier BSP Barrier Migration Planner
17 16/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance: By comparing the worker s execution time against a normal distribution and flagging outliers
18 16/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance: By comparing the worker s execution time against a normal distribution and flagging outliers Mizan monitors for each vertex: Remote outgoing messages All incoming messages Response time High level summaries are broadcast to each worker Vertex Response Time Remote Outgoing Messages V4 V5 V6 V2 V3 V1 Mizan Worker 1 Local Incoming Messages Mizan Worker 2 Remote Incoming Messages
19 17/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance 2 Select the migration objective: Mizan finds the strongest cause of workload imbalance by comparing statistics for outgoing messages and incoming messages of all workers with the worker s execution time
20 17/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance 2 Select the migration objective: Mizan finds the strongest cause of workload imbalance by comparing statistics for outgoing messages and incoming messages of all workers with the worker s execution time The migration objective is either: Optimize for outgoing messages, or Optimize for incoming messages, or Optimize for response time
21 18/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance 2 Select the migration objective 3 Pair over-utilized workers with under-utilized ones: All workers create and execute the migration plan in parallel without centralized coordination Each worker is paired with one other worker at most W7 W2 W1 W5 W8 W4 W0 W6 W9 W
22 19/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance 2 Select the migration objective 3 Pair over-utilized workers with under-utilized ones 4 Select vertices to migrate: Depending on the migration objective
23 20/35 Mizan s Migration Planning Steps 1 Identify the source of imbalance 2 Select the migration objective 3 Pair over-utilized workers with under-utilized ones 4 Select vertices to migrate 5 Migrate vertices: How to migrate vertices freely across workers while maintaining vertex ownership and fast updates? How to minimize migration costs for large vertices?
24 21/35 Vertex Ownership Mizan uses distributed hash table (DHT) to implement a distributed lookup service: V s home worker ID = (hash(id) mod N) V can execute at any worker Workers ask the home worker of V for its current location The home worker is notified with the new location as V migrates Where is V5 & V8 Compute V1 V3 V2 V5 V8 V4 V2 V5 V6 V9 V7 V1 V3 V9 V8 DHT V1:W1 V4:W2 V7:W3 V2:W1 V5:W2 V8:W3 V3:W1 V6:W2 V9:W3 Worker 1 Worker 2 Worker 3
25 22/35 Migrating Vertices with Large Message Size Introduce delayed migration for very large vertices: At SS t: only ownership of vertex v is moved to worker new Superstep t Superstep t+1 Superstep t+2 Worker_old Worker_old Worker_old ' Worker_new 5 6 Worker_new 5 6 Worker_new
26 22/35 Migrating Vertices with Large Message Size Introduce delayed migration for very large vertices: At SS t: only ownership of vertex v is moved to worker new At SS t + 1: worker new receives vertex 7 messages worker old process vertex 7 Superstep t Superstep t+1 Worker_old Worker_old Worker_old ' Worker_new 5 6 Worker_new 5 6 Worker_new
27 22/35 Migrating Vertices with Large Message Size Introduce delayed migration for very large vertices: At SS t: only ownership of vertex v is moved to worker new At SS t + 1: worker new receives vertex 7 messages worker old process vertex 7 After SS t + 1: worker old moves state of vertex 7 to worker new Superstep t Superstep t+1 Superstep t+2 Worker_old Worker_old Worker_old ' Worker_new 5 6 Worker_new 5 6 Worker_new
28 23/35 Experimental Setup Mizan is implemented on C++ with MPI with three variations: Static Mizan: Emulates Giraph, disables dynamic migration Work stealing (WS): Emulates Pregel s coarse-grained dynamic load balancing Mizan: Activates dynamic migration Local cluster of 21 machines: Mix of i5 and i7 processors with 16GB RAM Each IBM Blue Gene/P supercomputer with 1024 compute nodes: Each is a 4 core PowerPC450 CPU at 850MHz with 4GB RAM
29 24/35 Datasets Experiments on public datasets: Stanford Network Analysis Project (SNAP) The Laboratory for Web Algorithmics (LAW) Kronecker generator name nodes edges kg1 (synthetic) 1,048,576 5,360,368 kg4m68m (synthetic) 4,194,304 68,671,566 web-google 875,713 5,105,039 LiveJournal1 4,847,571 68,993,773 hollywood ,180, ,985,632 arabic ,744, ,999,458
30 25/35 Giraph vs. Static Mizan Run Time (Min) Giraph Static Mizan kg1 web-google kg4m68m LiveJournal1 Run Time (Min) Giraph Static Mizan Vertex count (Millions) Figure : PageRank on social network and random graphs Figure : PageRank on regular random graphs, each has around 17M edge
31 26/35 Effectiveness of Dynamic Vertex Migration PageRank on a social graph (LiveJournal1) The shaded columns: algorithm runtime The unshaded columns: initial partitioning cost 30 Run Time (Min) Static WS Mizan Static WS Mizan Static WS Mizan Hash Range Metis
32 27/35 Effectiveness of Dynamic Vertex Migration Comparing the performance on highly variable messaging pattern algorithms, which are DMST and ad propagation simulation, on a metis partitioned social graph (LiveJournal1) Run Time (Min) Static Work Stealing Mizan 0 DMST Advertisment
33 28/35 Mizan s Overhead with Scaling Speedup 8 Mizan - hollywood Compute Nodes Figure : Speedup on Linux Cluster Speedup Mizan - Arabic Compute Nodes Figure : Speedup on IBM Blue Gene/P supercomputer 1024
34 29/35 Future Work Work with skewed graphs Improving Pregel s fault tolerance
35 30/35 Conclusion Mizan is a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps Mizan is an open source project developed within InfoCloud group in KAUST in collaboration with IBM, programmed in C++ with MPI Mizan scales up to thousands of workers Mizan improves the overall computation cost between 40% up to two order of magnitudes with less than 10% migration overhead Download it at: Try it on EC2 (ami-52ed743b)
36 31/35 Extra: Migration Details and Costs Run Time (Minutes) Superstep Runtime Average Workers Runtime Migration Cost Balance Outgoing Messages Balance Incoming Messages Balance Response Time Supersteps
37 32/35 Extra: Migrated Vertices per Superstep Migrated Vertices (x1000) Migration Cost Total Migrated Vertices Max Vertices Migrated by Single Worker Migration Cost (Minutes) Supersteps
38 33/35 Extra: Mizan s Migration Planning Compute() Migration Barrier No BSP Barrier Superstep Imbalanced? Yes Identify Source of Imbalance Pair with Underutilized Worker Select Vertices and Migrate No Worker Overutilized? Yes
39 34/35 Extra: Architecture of Mizan Mizan Worker Vertex Compute() Migration Planner BSP Processor Communicator - DHT Storage Manager IO HDFS/Local Disks
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat Karim Awara Amani Alonazi Hani Jamjoom Dan Williams Panos Kalnis King Abdullah University of Science and Technology,
More informationSoftware tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
More informationOverview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012
Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
More informationAn Experimental Comparison of Pregel-like Graph Processing Systems
An Experimental Comparison of Pregel-like Graph Processing Systems Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, Tianqi Jin David R. Cheriton School of Computer Science, University
More informationEvaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
More informationAdapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
More informationLARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
More informationSystems and Algorithms for Big Data Analytics
Systems and Algorithms for Big Data Analytics YAN, Da Email: yanda@cse.cuhk.edu.hk My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining
More informationApache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
More informationReplication-based Fault-tolerance for Large-scale Graph Processing
Replication-based Fault-tolerance for Large-scale Graph Processing Peng Wang Kaiyuan Zhang Rong Chen Haibo Chen Haibing Guan Shanghai Key Laboratory of Scalable Computing and Systems Institute of Parallel
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationOptimizations and Analysis of BSP Graph Processing Models on Public Clouds
Optimizations and Analysis of BSP Graph Processing Models on Public Clouds Mark Redekopp, Yogesh Simmhan, and Viktor K. Prasanna University of Southern California, Los Angeles CA 989 {redekopp, simmhan,
More informationFast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction
Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource
More informationBig Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationLarge-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationMinimize Response Time Using Distance Based Load Balancer Selection Scheme
Minimize Response Time Using Distance Based Load Balancer Selection Scheme K. Durga Priyanka M.Tech CSE Dept., Institute of Aeronautical Engineering, HYD-500043, Andhra Pradesh, India. Dr.N. Chandra Sekhar
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationInformation Processing, Big Data, and the Cloud
Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive
More informationLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph Sebastian Schelter Invited talk at GameDuell Berlin 29th May 2012 the mandatory about me slide PhD student at the Database Systems and Information Management
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationConvex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
More informationSurvey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
More informationMOSIX: High performance Linux farm
MOSIX: High performance Linux farm Paolo Mastroserio [mastroserio@na.infn.it] Francesco Maria Taurino [taurino@na.infn.it] Gennaro Tortone [tortone@na.infn.it] Napoli Index overview on Linux farm farm
More informationPresto/Blockus: Towards Scalable R Data Analysis
/Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew
More informationEFFICIENT SCHEDULING STRATEGY USING COMMUNICATION AWARE SCHEDULING FOR PARALLEL JOBS IN CLUSTERS
EFFICIENT SCHEDULING STRATEGY USING COMMUNICATION AWARE SCHEDULING FOR PARALLEL JOBS IN CLUSTERS A.Neela madheswari 1 and R.S.D.Wahida Banu 2 1 Department of Information Technology, KMEA Engineering College,
More informationBig Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
More informationImplementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya Ojah.chandana@gmail.com
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33 Outline
More informationAsking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
More informationBig Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
More informationPerformance Evaluation of Improved Web Search Algorithms
Performance Evaluation of Improved Web Search Algorithms Esteban Feuerstein 1, Veronica Gil-Costa 2, Michel Mizrahi 1, Mauricio Marin 3 1 Universidad de Buenos Aires, Buenos Aires, Argentina 2 Universidad
More informationBENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationOutline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013
MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing Iftekhar Naim Outline Motivation MapReduce Overview Design Issues & Abstractions Examples and Results Pros and Cons
More informationGraph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
More informationOverlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer
More informationAn NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013
U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise
More information- Behind The Cloud -
- Behind The Cloud - Infrastructure and Technologies used for Cloud Computing Alexander Huemer, 0025380 Johann Taferl, 0320039 Florian Landolt, 0420673 Seminar aus Informatik, University of Salzburg Overview
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationBig Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More informationWrite a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
More informationEfficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationMesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)
UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter
More informationHigh Throughput Computing on P2P Networks. Carlos Pérez Miguel carlos.perezm@ehu.es
High Throughput Computing on P2P Networks Carlos Pérez Miguel carlos.perezm@ehu.es Overview High Throughput Computing Motivation All things distributed: Peer-to-peer Non structured overlays Structured
More informationFPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationPerformance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
More information! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)
! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationBlobSeer: Towards efficient data storage management on large-scale, distributed systems
: Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu
More informationDistributed Computing over Communication Networks: Topology. (with an excursion to P2P)
Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...
More informationIMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications
Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A Fine-Grained Adaptive
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationBig Data looks Tiny from the Stratosphere
Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type
More informationChing-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015
E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing
More informationCharacterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
More informationDistributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationLayer Load Balancing and Flexibility
Periodic Hierarchical Load Balancing for Large Supercomputers Gengbin Zheng, Abhinav Bhatelé, Esteban Meneses and Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana-Champaign,
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationReducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
More informationPerformance And Scalability In Oracle9i And SQL Server 2000
Performance And Scalability In Oracle9i And SQL Server 2000 Presented By : Phathisile Sibanda Supervisor : John Ebden 1 Presentation Overview Project Objectives Motivation -Why performance & Scalability
More informationORACLE DATABASE 10G ENTERPRISE EDITION
ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationGeneral purpose Distributed Computing using a High level Language. Michael Isard
Dryad and DryadLINQ General purpose Distributed Computing using a High level Language Michael Isard Microsoft Research Silicon Valley Distributed Data Parallel Computing Workloads beyond standard SQL,
More informationMapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs. http://mapgraph.io
MapGraph A High Level API for Fast Development of High Performance Graphic Analytics on GPUs http://mapgraph.io Zhisong Fu, Michael Personick and Bryan Thompson SYSTAP, LLC Outline Motivations MapGraph
More informationGraph Analytics in Big Data. John Feo Pacific Northwest National Laboratory
Graph Analytics in Big Data John Feo Pacific Northwest National Laboratory 1 A changing World The breadth of problems requiring graph analytics is growing rapidly Large Network Systems Social Networks
More informationEfficient Monitoring in Actor-based Mobile Hybrid Cloud Framework. Kirill Mechitov, Reza Shiftehfar, and Gul Agha
Efficient Monitoring in Actor-based Mobile Hybrid Cloud Framework Kirill Mechitov, Reza Shiftehfar, and Gul Agha Motivation: mobile cloud Mobile apps Huge variety Many developers/organizations Increasingly
More informationSIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationA Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
More informationSpark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
More informationData Processing in the Era of Big Data
Department of Computer Science and Information Engineering National Taiwan University October 3, 2014 Big Data a New Jargon Importance Importance Big data is a collection of data sets so large and complex
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationGroup Based Load Balancing Algorithm in Cloud Computing Virtualization
Group Based Load Balancing Algorithm in Cloud Computing Virtualization Rishi Bhardwaj, 2 Sangeeta Mittal, Student, 2 Assistant Professor, Department of Computer Science, Jaypee Institute of Information
More informationUsing Map-Reduce for Large Scale Analysis of Graph-Based Data
Using Map-Reduce for Large Scale Analysis of Graph-Based Data NAN GONG KTH Information and Communication Technology Master of Science Thesis Stockholm, Sweden 2011 TRITA-ICT-EX-2011:218 Using Map-Reduce
More informationDistributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
More informationDynamic Load Balancing of Virtual Machines using QEMU-KVM
Dynamic Load Balancing of Virtual Machines using QEMU-KVM Akshay Chandak Krishnakant Jaju Technology, College of Engineering, Pune. Maharashtra, India. Akshay Kanfade Pushkar Lohiya Technology, College
More informationGraph Databases. Prad Nelluru, Bharat Naik, Evan Liu, Bon Koo
Graph Databases Prad Nelluru, Bharat Naik, Evan Liu, Bon Koo 1 Why are graphs important? Modeling chemical and biological data Social networks The web Hierarchical data 2 What is a graph database? A database
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More information1. Comments on reviews a. Need to avoid just summarizing web page asks you for:
1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationInfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
More informationDependency Free Distributed Database Caching for Web Applications and Web Services
Dependency Free Distributed Database Caching for Web Applications and Web Services Hemant Kumar Mehta School of Computer Science and IT, Devi Ahilya University Indore, India Priyesh Kanungo Patel College
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationThe Power of Relationships
The Power of Relationships Opportunities and Challenges in Big Data Intel Labs Cluster Computing Architecture Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationA Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
More information