Big Graph Processing: Some Background




Big Graph Processing: Some Background
Bo Wu, Colorado School of Mines
Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (University of Washington)
Mines CSCI-580, Bo Wu

Graphs are everywhere!
o A graph is a collection of binary relationships, i.e. networks of pairwise interactions, including social networks and digital networks
[figures: part of the Internet; a brain network]

Scale of the first graph
o Nearly 300 years ago, the first graph problem consisted of 4 vertices and 7 edges: the Seven Bridges of Königsberg problem
Is it possible to cross each of the seven bridges exactly once?
Not too hard to fit in memory

Scale of real-world graphs
o Graph scale in the current CS literature is on the order of billions of edges, i.e. tens of gigabytes

Big Data begets Big Graphs
o The increasing volume, velocity, and variety of Big Data pose significant challenges to scalable algorithms
o How will graph applications adapt to Big Data at petabyte scale?
o The ability to store and process Big Graphs impacts typical data structures

Social scale
o 1 billion vertices, 100 billion edges
111 PB adjacency matrix
2.92 TB adjacency list
2.92 TB edge list

Web scale
o 50 billion vertices, 1 trillion edges
271 EB adjacency matrix
29.5 TB adjacency list
29.1 TB edge list

Brain scale
o 100 billion vertices, 100 trillion edges
2.84 PB adjacency list
2.84 PB edge list
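The sizes on the last three slides follow from simple arithmetic. A back-of-envelope check, as a sketch under stated assumptions: one bit per vertex pair for the adjacency matrix, 8-byte vertex IDs, each undirected edge stored in both directions, and binary units (the slides' PB/EB/TB read as PiB/EiB/TiB).

```python
# Sanity-check the storage figures above. Assumptions (not from the
# slides): 1-bit adjacency-matrix entries, 8-byte vertex IDs, undirected
# edges stored in both directions, binary units (PiB/EiB/TiB).
PiB, EiB, TiB = 2.0**50, 2.0**60, 2.0**40

def adjacency_matrix_bytes(n):
    # n x n bit matrix
    return n * n / 8

def edge_list_bytes(m):
    # m edges x 2 directions x 2 endpoints x 8 bytes
    return m * 2 * 2 * 8

social_matrix = adjacency_matrix_bytes(1e9) / PiB    # ~111 (PB)
social_edges  = edge_list_bytes(100e9) / TiB         # ~2.9 (TB)
web_matrix    = adjacency_matrix_bytes(50e9) / EiB   # ~271 (EB)
brain_edges   = edge_list_bytes(100e12) / PiB        # ~2.84 (PB)
print(social_matrix, social_edges, web_matrix, brain_edges)
```

Under these assumptions, every number on the slides is reproduced to within rounding.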

Benchmarking scalability on Big Graphs
o Big Graphs challenge our conventional thinking on both algorithms and computer architecture!
o The new Graph500.org benchmark provides a foundation for conducting experiments on graph datasets

Graph algorithms are challenging
o Difficult to parallelize
irregular data accesses increase latency
skewed data distributions create bottlenecks (e.g. celebrity nodes in social networks)
o Increased size imposes greater storage overhead: an I/O burden!

Problem: How do we store and process Big Graphs?
o The conventional approach is to store and compute in memory
o Shared memory: Parallel Random Access Machine (PRAM)
data in globally-shared memory
implicit communication by updating memory
fast random access
o Distributed memory: Bulk Synchronous Parallel (BSP)
data distributed to local, private memory
explicit communication by sending messages
easier to scale by adding more machines

Memory is fast, but...
o Algorithms must exploit the computer memory hierarchy
designed for spatial and temporal locality: registers, L1/L2/L3 cache, TLB, pages, disk...
great for unit-stride access, common in many scientific codes, e.g. linear algebra
o But common graph algorithm implementations have...
lots of random accesses to memory, causing...
many cache and TLB misses

Poor locality increases latency
o Question: What is the memory throughput with a 90% TLB hit rate and a 0.01% page-fault rate on a miss?
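The slide leaves the question open. One way to work it out, with illustrative latencies that are assumptions rather than slide data (100 ns per access on a TLB hit, an extra 400 ns page-table walk on a miss, 10 ms to service a page fault):

```python
# Hedged worked example for the question above; all latencies are
# assumed for illustration, not taken from the slide.
t_hit, t_walk, t_fault = 100e-9, 400e-9, 10e-3
p_miss, p_fault = 0.10, 1e-4        # 90% TLB hit; 0.01% fault on a miss

# Effective access time: hits are fast, misses add a page walk, and a
# tiny fraction of misses fault to disk.
eat = (1 - p_miss) * t_hit \
    + p_miss * ((1 - p_fault) * (t_hit + t_walk)
                + p_fault * (t_hit + t_walk + t_fault))
throughput = 8 / eat                # bytes/s for random 8-byte accesses
print(eat, throughput)              # ~240 ns per access -> ~33 MB/s
```

The point survives any reasonable choice of constants: the rare page fault dominates the average, so poor locality collapses effective memory throughput by orders of magnitude.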

If it fits...
o Graph problems that fit in memory can leverage excellent advances in architecture and libraries...
Cray XMT2: designed for latency hiding
SGI UV2: designed for large, cache-coherent shared memory
A body of literature and libraries:
Parallel Boost Graph Library (PBGL) -- Indiana University
Multithreaded Graph Library (MTGL) -- Sandia National Labs
GraphCT/STINGER -- Georgia Tech
GraphLab -- Carnegie Mellon University
Giraph -- Apache Software Foundation

We can add more memory, but...
o Memory capacity is limited by...
the number of CPU pins, memory-controller channels, and DIMMs per channel
memory bus width
o Globally-shared memory is limited by...
CPU address space
cache coherence

Larger systems, greater latency
o Increasing memory can increase latency
traverse more memory addresses
larger systems put greater physical distance between machines
Fundamental limitation: the speed of light
o Latency causes significant inefficiency in new CPU architectures

Easier to increase capacity using disks
o Current Intel Xeon E5 architectures:
384 GB max. per CPU (4 channels x 3 DIMMs x 32 GB)
64 TB max. globally-shared memory (46-bit address space)
3881 dual-Xeon E5 motherboards to store the Brain Graph: 98 racks
o Disk capacity is not unlimited, but it is higher than memory
Largest disk on the market: 8 TB; 364 disks can store the Brain Graph, fitting in 5 racks
o Disk alone is not enough: applications will still require memory for processing
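The board and disk counts above can be sanity-checked with the same arithmetic as the earlier scale slides (a sketch; binary units are assumed, so the 2.84 PB Brain Graph is treated as 2.84 PiB):

```python
# Rough check of the board/disk counts on the slide (binary units assumed).
import math

brain = 2.84 * 2**50                    # Brain Graph edge list, bytes
per_board = 2 * 384 * 2**30             # dual-socket board: 2 x 384 GiB
per_disk = 8 * 2**40                    # one 8 TiB drive

boards = math.ceil(brain / per_board)   # ~3.9e3, close to the slide's 3881
disks = math.ceil(brain / per_disk)     # matches the slide's 364
print(boards, disks)
```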

Big Graph Processing Frameworks

Why not just MapReduce?
o Developed by Google
o Excellent for embarrassingly parallel computations
no communication needed
many machine learning algorithms fall into this category
o Not efficient for iterative algorithms that have dependencies
unnecessary I/O traffic

What's the natural way to program graph computation?

Most famous parallel graph processing framework

GAS (Gather, Apply, Scatter)

We still need parallelism

Graph partitioning: not easy at all at scale

Power-law distribution
[log-log plot of vertex count vs. degree, spanning 10^0 to 10^10]
More than 10^8 vertices have one neighbor.
The top 1% of vertices (high-degree vertices) are adjacent to 50% of the edges!

Power-law degree distribution

Random graph partitioning
o Graph-parallel abstractions rely on partitioning:
minimize communication
balance computation and storage
10 machines -> 90% of edges cut
100 machines -> 99% of edges cut!
[figure: Machine 1 | Machine 2]
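The 90%/99% figures follow from a short argument: under random vertex placement, an edge is uncut only when both endpoints land on the same machine, which happens with probability 1/p, so the expected cut fraction is 1 - 1/p. A minimal simulation confirming this:

```python
import random

# Expected cut fraction under random vertex placement is 1 - 1/p:
# both endpoints must land on the same of p machines (probability 1/p)
# for the edge to survive. Simulate to confirm.
def simulated_cut_fraction(num_edges, p, seed=0):
    rng = random.Random(seed)
    cut = 0
    for _ in range(num_edges):
        # endpoints assigned independently, uniformly at random
        if rng.randrange(p) != rng.randrange(p):
            cut += 1
    return cut / num_edges

for p in (10, 100):
    print(p, 1 - 1 / p, simulated_cut_fraction(100_000, p))
```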

Challenges of high-degree vertices
[figure: a high-degree vertex Y with its edges split between Machine 1 and Machine 2]
Data transmitted across the network is O(# cut edges)

Idea of vertex cut

GAS decomposition

Random edge placement
o Randomly assign edges to machines [figure: Machine 1, Machine 2, Machine 3]
o Balanced vertex-cut: vertex Y spans 3 machines, vertex Z spans 2 machines; the edges themselves are not cut!
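The cost of a vertex-cut is measured by the replication factor: the average number of machines holding a copy of a vertex. A minimal sketch of random edge placement (the function and example graph are illustrative, not from the slide):

```python
import random

# Hedged sketch of random edge placement (a vertex-cut): each edge goes
# to one machine chosen uniformly at random, and a vertex is replicated
# on every machine that holds one of its edges.
def replication_factor(edges, num_machines, seed=0):
    rng = random.Random(seed)
    spans = {}  # vertex -> set of machines holding a copy of it
    for u, v in edges:
        m = rng.randrange(num_machines)
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)

# A star graph: one celebrity vertex 'C' with 1000 followers. Even though
# 'C' touches every edge, it is copied on at most 3 machines.
star = [('C', i) for i in range(1000)]
print(replication_factor(star, 3))
```

This is exactly the contrast with edge-cuts on the previous slides: the celebrity vertex costs a few replicas instead of cutting nearly all of its edges.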

Greedy vertex-cuts
o Place edges on machines which already have the vertices in that edge.
[figure: edges such as (A,B) and (B,C) being placed on Machine 1 and Machine 2]
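A hedged sketch of that greedy rule: prefer a machine that already hosts both endpoints, then one hosting either endpoint, then the least-loaded machine. This is an illustration, not PowerGraph's exact implementation:

```python
# Greedy vertex-cut placement sketch (names and tie-breaking are
# assumptions made for this illustration).
def greedy_place(edges, num_machines):
    hosts = {}                   # vertex -> set of machines hosting it
    load = [0] * num_machines    # edges per machine
    placement = []
    for u, v in edges:
        hu = hosts.get(u, set())
        hv = hosts.get(v, set())
        # Prefer machines with both endpoints, then either, then any.
        candidates = (hu & hv) or (hu | hv) or set(range(num_machines))
        m = min(sorted(candidates), key=lambda i: load[i])
        load[m] += 1
        hosts.setdefault(u, set()).add(m)
        hosts.setdefault(v, set()).add(m)
        placement.append(m)
    return placement, hosts

placement, hosts = greedy_place([('A', 'B'), ('B', 'C'), ('A', 'C')], 2)
print(placement, hosts)
```

On the small triangle above, all three edges co-locate on one machine, so no vertex is replicated at all; on large graphs the load term keeps the machines balanced.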

Example
o What's the popularity of this user? [figure: social graph with the vertex in question labeled "Popular?"]

PageRank algorithm

  R[i] = 0.15 + sum_{j in nbrs(i)} w_ji * R[j]

(rank of user i = 0.15 + the weighted sum of the neighbors' ranks)
o Update ranks in parallel
o Iterate until convergence

PageRank in GraphLab

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach (j in in_neighbors(i)):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = 0.15 + total
  // Trigger neighbors to run again
  if R[i] not converged then
    foreach (j in out_neighbors(i)):
      signal vertex-program on j

Gather: information about the neighborhood
Apply: update the vertex
Scatter: signal neighbors & modify edge data
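The vertex program above can be sketched as a runnable synchronous loop. One assumption the slide does not spell out: the damping factor is folded into the edge weights, w_ji = 0.85 / out_degree(j), and every vertex has at least one out-neighbor:

```python
# Runnable, synchronous sketch of the GraphLab-style vertex program.
# Assumptions (not on the slide): damping folded into edge weights,
# w_ji = 0.85 / out_degree(j); every vertex has an out-neighbor.
def pagerank(out_nbrs, iters=50, reset=0.15, damping=0.85):
    in_nbrs = {v: [] for v in out_nbrs}
    for j, outs in out_nbrs.items():
        w_ji = damping / len(outs)
        for i in outs:
            in_nbrs[i].append((j, w_ji))
    R = {v: 1.0 for v in out_nbrs}
    for _ in range(iters):
        # Gather the weighted neighbor ranks, then apply the update.
        # (A real engine would also scatter: signal only vertices whose
        # neighbors changed, instead of sweeping everything.)
        R = {i: reset + sum(w * R[j] for j, w in in_nbrs[i])
             for i in out_nbrs}
    return R

ranks = pagerank({'a': ['b'], 'b': ['a']})
print(ranks)
```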

Triangle counting on Twitter
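The slide's results chart is not reproduced here, but the underlying computation is simple: a triangle through edge (u, v) is exactly a common neighbor of u and v. A minimal sketch (not the framework's implementation):

```python
# Triangle counting by neighbor-set intersection: each edge (u, v)
# contributes |N(u) & N(v)| triangles; every triangle has 3 edges,
# so divide by 3. Illustrative sketch only.
def count_triangles(adj):
    # adj: vertex -> set of neighbors, undirected, comparable vertex IDs
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])
    return total // 3

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(count_triangles(triangle))  # -> 1
```

The per-edge set intersection is what makes this a natural fit for graph-parallel frameworks: each edge's work is independent.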

What if I don't have a cluster?

GraphChi: disk-based GraphLab
o Challenge
random disk accesses!
o Naive solutions
graph clustering
prefetching
o Solution
novel graph representation on disk
parallel sliding windows
minimizes random accesses

Parallel sliding window: layout
A shard easily fits in memory

Parallel sliding window: execution
O(P^2) random accesses per full pass over the graph

Triangle counting on the Twitter graph