A System for Distributed Graph-Parallel Machine Learning
1 A System for Distributed Graph-Parallel Machine Learning. The Team: Joseph Gonzalez (Postdoc, UC Berkeley AMPLab); at CMU: Yucheng Low, Haijie Gu, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Alex Smola, Guy Blelloch.
2 About me: Scalable Machine Learning + Big Graphical Models (BigLearning).
3 Graphs are Essential to Data-Mining and Machine Learning: identify influential people and information, find communities, target ads and products, and model complex data dependencies.
4 Example: Estimating Political Bias. [Figure: a social network where a few users are labeled Liberal or Conservative and the rest, along with their posts, carry question marks; Loopy Belief Propagation on a Conditional Random Field infers the unknown labels.]
5 Collaborative Filtering: Exploiting Dependencies. If a user liked Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, and La Dolce Vita, what do I recommend?
6 Matrix Factorization with Alternating Least Squares (ALS). [Figure: Netflix ratings form a bipartite graph between users and movies, with observed ratings r_ij on the edges.] Iterate: alternately solve for the user factors U (a factor f(i) for each user i) and the movie factors M (a factor f(j) for each movie j).
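The objective did not survive transcription; a standard form of the ALS problem the slide depicts, with a ridge penalty lambda (an assumption here, the slide only names the factors), is

    \min_{U,M} \sum_{(i,j)\ \text{observed}} \left( r_{ij} - f(i)^{\top} f(j) \right)^2 + \lambda \Big( \sum_i \|f(i)\|^2 + \sum_j \|f(j)\|^2 \Big)

Fixing M makes each f(i) an ordinary least-squares solve; fixing U does the same for each f(j). That pair of solves is what the "Iterate" on the slide alternates between.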
7 PageRank. The rank of user i is a weighted sum of their neighbors' ranks. Everyone starts with equal rank; update ranks in parallel and iterate until convergence.
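The update equation itself was lost in transcription; the weighted form used by the vertex programs later in the deck, with the 0.15 damping constant carried over from the GraphLab example code (an assumption, not something this slide states), is

    R(i) = 0.15 + \sum_{j \in \text{Nbrs}(i)} w_{ji} \, R(j)

where w_{ji} is the weight of the edge from j to i (1/outdegree(j) for plain unweighted PageRank).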
8 How should we program graph-parallel algorithms? "Low-level tools like MPI and Pthreads?" - Me, during my first years of grad school
9 Threads, Locks, and MPI. ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, tune it for a single parallel platform, and six months later the conference paper contains "We implemented ___ in parallel." The resulting code is difficult to maintain and extend, and couples the learning model to its implementation.
10 How should we program graph-parallel algorithms? "High-level abstractions!" - Me, now
11 The Graph-Parallel Abstraction. A user-defined Vertex-Program runs on each vertex, and the graph constrains interaction along edges: using messages (e.g., Pregel [PODC 09, SIGMOD 10]) or through shared state (e.g., GraphLab [UAI 10, VLDB 12]). "Think like a Vertex." - Malewicz et al. [SIGMOD 10]. Parallelism: run multiple vertex programs simultaneously.
12 Which graph-parallel abstraction is better for Machine Learning? Messaging (Pregel) is synchronous; shared state (GraphLab) allows dynamic, asynchronous execution.
13 The GraphLab Vertex Program. Vertex Programs directly access adjacent vertices and edges.
GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = total
  // Trigger neighbors to run again
  priority = |R[i] - old_R[i]|
  if R[i] not converged then
    signal neighbors_of(i) with priority
14 Benefit of Dynamic PageRank. [Chart: number of vertices vs. number of updates needed to converge; a large share of the vertices is updated only once.]
15 GraphLab Asynchronous Execution. The scheduler determines the order in which vertices are executed. [Figure: CPU 1 and CPU 2 pull vertices from a shared scheduler queue.] The scheduler can prioritize vertices.
16 Asynchronous Belief Propagation. Challenge: boundaries. [Figure: a synthetic noisy image and its graphical model; the cumulative vertex-update counts show many updates near region boundaries and few elsewhere.] The algorithm identifies and focuses on the hidden sequential structure.
17 GraphLab Ensures a Serializable Execution. This enables Gauss-Seidel iterations, Gibbs sampling, graph coloring, and more.
18 Never Ending Learner Project (CoEM). Language modeling: named entity recognition. Hadoop (BSP): 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min (6x fewer CPUs, 15x faster). Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time).
19 Thus far, GraphLab provided a powerful new abstraction. But we couldn't scale up to the AltaVista Webgraph from 2002: 1.4B vertices, 6.6B edges.
20 Natural Graphs: graphs derived from natural phenomena.
21 Properties of Natural Graphs. [Figure: a regular mesh vs. a natural graph.] Natural graphs have a power-law degree distribution.
22 Power-Law Degree Distribution. [Log-log plot: number of vertices vs. degree for the AltaVista WebGraph, 1.4B vertices, 6.6B edges.] More than 10^8 vertices have one neighbor, while the top 1% of vertices (the high-degree vertices) are adjacent to 50% of the edges!
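The distribution sketched in the plot is the standard power law,

    P(\text{degree} = d) \propto d^{-\alpha}

where a smaller exponent alpha means a denser graph with more very-high-degree vertices; the experiments near the end of the talk vary alpha on synthetic graphs for exactly this reason.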
23 Power-Law Degree Distribution: the star-like motif (e.g., President Obama and his followers).
24 Challenges of High-Degree Vertices: edges are processed sequentially; a vertex sends many messages (Pregel) or touches a large fraction of the graph (GraphLab); edge meta-data is too large for a single machine; asynchronous execution requires heavy locking (GraphLab); synchronous execution is prone to stragglers (Pregel).
25 Graph Partitioning. Graph-parallel abstractions rely on partitioning to minimize communication and balance computation and storage. Communication cost is O(# edges cut). [Figure: a graph cut between Machine 1 and Machine 2.]
26 Power-Law Graphs are Difficult to Partition. Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04], and traditional graph-partitioning algorithms perform poorly on them [Abou-Rjeili et al. 06].
27 Random Partitioning. GraphLab resorts to random (hashed) partitioning on natural graphs: with 10 machines, 90% of edges are cut; with 100 machines, 99% of edges are cut!
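Those percentages follow from a one-line calculation: if vertices are hashed uniformly onto p machines, an edge's two endpoints land on the same machine with probability 1/p, so the expected fraction of cut edges is

    1 - \frac{1}{p}

which is 90% for p = 10 and 99% for p = 100.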
28 Program for this (a single vertex program); run on this (a cluster). [Figure: a high-degree vertex split between Machine 1 and Machine 2.] The new abstraction splits high-degree vertices while maintaining equivalence on the split vertices.
29 A Common Pattern for Vertex-Programs.
GraphLab_PageRank(i)
  // Gather information about the neighborhood
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji
  // Update the vertex
  R[i] = total
  // Signal neighbors & modify edge data
  priority = |R[i] - old_R[i]|
  if R[i] not converged then
    signal neighbors(i) with priority
30 Formal GraphLab2 Semantics.
Gather(SrcV, Edge, DstV) -> A : collect information from a neighbor.
Sum(A, A) -> A : commutative, associative combiner.
Apply(V, A) -> V : update the vertex.
Scatter(SrcV, Edge, DstV) -> (Edge, signal) : update edges and signal neighbors.
31 PageRank in GraphLab2.
GraphLab2_PageRank(i)
  Gather(j -> i) : return w_ji * R[j]
  Sum(a, b) : return a + b
  Apply(i, Σ) : R[i] = Σ
  Scatter(i -> j) : if R[i] changed then trigger j to be recomputed
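To make the pattern concrete, here is a minimal single-machine Python sketch of the GAS loop above. It is an illustration, not the GraphLab API: plain dictionaries stand in for the distributed graph, w_ji is taken to be 1/outdegree(j), and the 0.15 damping and 0.01 tolerance are assumptions carried over from the example code near the end of the deck.

def pagerank_gas(in_nbrs, out_nbrs, tol=0.01, damping=0.15):
    # in_nbrs / out_nbrs: dicts mapping vertex -> list of neighbors.
    out_deg = {v: max(len(nbrs), 1) for v, nbrs in out_nbrs.items()}
    rank = {v: 1.0 for v in in_nbrs}       # everyone starts with equal rank
    active = set(in_nbrs)                  # vertices signaled to run
    while active:
        v = active.pop()
        # Gather + Sum: combine contributions from in-neighbors.
        total = sum(rank[j] / out_deg[j] for j in in_nbrs[v])
        # Apply: update the vertex.
        new_rank = damping + (1.0 - damping) * total
        # Scatter: if the rank changed, signal the out-neighbors.
        if abs(new_rank - rank[v]) > tol:
            active.update(out_nbrs[v])
        rank[v] = new_rank
    return rank

Run on a small pair of in/out adjacency maps, this reproduces the dynamic-scheduling story from slide 14: most vertices drop out of the active set after one or two updates.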
32 GAS Decomposition. [Figure: vertex Y is split across four machines, with a master copy on one machine and mirrors on the others. Gather runs locally on each machine, producing partial sums Σ1..Σ4; the partial sums are combined at the master, Apply updates the master copy, the new value is shipped to the mirrors, and Scatter runs locally on each machine.]
33 Minimizing Communication in PowerGraph. Communication is linear in the number of machines each vertex spans, and a vertex-cut minimizes the number of machines each vertex spans. New Theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage. Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al. 2000]
34 Constructing Vertex-Cuts. Goals: evenly assign edges to machines; minimize the number of machines spanned by each vertex; assign each edge as it is loaded, touching each edge only once. Two distributed approaches are proposed: random vertex cuts and greedy vertex cuts (both on the following slides).
35 Random Vertex-Cut: randomly assign edges to machines. [Figure: a balanced vertex-cut across Machines 1-3; vertex Y spans 3 machines, vertex Z spans 2 machines, and one of Z's edges is not cut at all.]
36 Random Vertex-Cuts vs. Edge-Cuts. [Chart: expected reduction in communication and storage from vertex-cuts vs. number of machines; vertex-cuts give an order-of-magnitude improvement.]
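The expected improvement has a closed form via a short balls-in-bins argument: placing the D[v] edges of a vertex v independently and uniformly on p machines, the number of machines v spans is, in expectation,

    \mathbb{E}\left[ |A(v)| \right] = p \left( 1 - \left(1 - \frac{1}{p}\right)^{D[v]} \right)

which saturates at p only for very high degrees. Comparing this replication cost against the O(# cut edges) communication of a random edge-cut yields the order-of-magnitude curve on the slide. (This is the standard analysis from the PowerGraph paper, restated here.)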
37 Streaming Greedy Vertex-Cuts: place each edge on a machine that already has one of the vertices in that edge. [Figure: edge (A,B) goes to Machine 1, which already holds A and B; edge (B,C) goes to a machine that already holds B.]
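A minimal Python sketch of the streaming heuristic follows. The tie-breaking rule (prefer the least-loaded machine among candidates) and the fallback cases are assumptions for illustration; the real PowerGraph heuristic distinguishes a few more cases.

from collections import defaultdict

def greedy_vertex_cut(edge_stream, num_machines):
    # Assign each edge to one machine as it is loaded, touching it once.
    placed = defaultdict(set)        # vertex -> machines already holding it
    load = [0] * num_machines        # number of edges per machine
    assignment = {}
    for (u, v) in edge_stream:
        # Prefer machines holding both endpoints, then either, then any.
        candidates = (placed[u] & placed[v]) \
                     or (placed[u] | placed[v]) \
                     or set(range(num_machines))
        m = min(candidates, key=lambda i: load[i])   # least-loaded tie-break
        assignment[(u, v)] = m
        placed[u].add(m)
        placed[v].add(m)
        load[m] += 1
    return assignment

Reusing machines that already hold an endpoint is exactly what keeps the number of machines spanned by each vertex, and hence the replication factor, low.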
38 Greedy Vertex-Cuts Improve Performance. [Chart: runtime relative to random partitioning for PageRank, collaborative filtering, and shortest path.] Greedy partitioning improves computation performance.
39 System Design. The PowerGraph (GraphLab2) system sits on MPI/TCP-IP, PThreads, and HDFS, running on EC2 HPC nodes. Implemented as a C++ API; uses HDFS for graph input and output; fault tolerance is achieved by check-pointing (snapshot time < 5 seconds for the Twitter network).
40 Implemented Many Algorithms. Collaborative filtering: alternating least squares, stochastic gradient descent, SVD, non-negative MF. Statistical inference: loopy belief propagation, max-product linear programs, Gibbs sampling. Graph analytics: PageRank, triangle counting, shortest path, graph coloring, k-core decomposition. Computer vision: image stitching. Language modeling: LDA.
41 PageRank on the Twitter Follower Graph (natural graph with 40M users, 1.4 billion links), on 32 nodes x 8 cores (EC2 HPC cc1.4x). [Charts: total network traffic (GB) and runtime (seconds) for GraphLab, Pregel (Piccolo), and PowerGraph.] PowerGraph reduces communication and runs faster.
42 PageRank on the Twitter Follower Graph (natural graph with 40M users, 1.4 billion links). [Chart: runtime per iteration for Hadoop, Twister, Piccolo, GraphLab, and PowerGraph.] PowerGraph gains an order of magnitude by exploiting the properties of natural graphs. Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10].
43 GraphLab2 is Scalable. Yahoo AltaVista Web Graph (2002), one of the largest publicly available web graphs: 1.4 billion webpages, 6.6 billion links. 7 seconds per iteration, i.e. 1B links processed per second, on 64 HPC nodes (1024 cores, 2048 HT) with 30 lines of user code.
44 Topic Modeling. English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens; a computationally intensive algorithm. [Chart: million tokens per second.] Smola et al.: 100 Yahoo! machines, specifically engineered for this task. PowerGraph: 64 cc2.8xlarge EC2 nodes, 200 lines of code and 4 human hours.
45 Triangle Counting. For each vertex in the graph, count the number of triangles containing it. This measures both the popularity of the vertex and the cohesiveness of the vertex's community: fewer triangles means a weaker community; more triangles means a stronger community.
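A minimal Python sketch of the per-vertex count (a set-intersection formulation on an undirected adjacency map; an illustration, not the distributed GAS implementation the talk benchmarks, and vertex ids are assumed orderable):

def count_triangles(adj):
    # adj: dict mapping vertex -> set of neighbors (undirected).
    triangles = {v: 0 for v in adj}
    for u in adj:
        for v in adj[u]:
            if u < v:                        # visit each edge once
                for w in adj[u] & adj[v]:    # common neighbors close a triangle
                    if v < w:                # count each triangle once (u < v < w)
                        triangles[u] += 1
                        triangles[v] += 1
                        triangles[w] += 1
    return triangles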
46 Triangle Counting on the Twitter Graph: identify individuals with strong communities. Counted 34.8 billion triangles. Hadoop [WWW 11]: 1536 machines, 423 minutes. PowerGraph: 64 machines, 1.5 minutes: 282x faster. Why? The wrong abstraction: Hadoop broadcasts O(degree^2) messages per vertex. (S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer", WWW 11.)
47 Machine Learning and Data-Mining Toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, and collaborative filtering, all built on the GraphLab2 system (MPI/TCP-IP, PThreads, HDFS, EC2 HPC nodes). Apache 2 License.
48 GraphChi: Going small with GraphLab Solve huge problems on small or embedded devices? Key: Exploit non-volatile memory (starting with SSDs and HDs)
49 GraphChi: disk-based GraphLab. A novel Parallel Sliding Windows algorithm; single-machine, parallel, asynchronous execution. Solves big problems that are normally solved in the cloud. Efficiently exploits disks: optimized for streaming access, efficient on both SSDs and hard drives.
50 Triangle Counting on the Twitter Graph (40M users, 1.2B edges; total 34.8 billion triangles). Hadoop: 1536 machines, 423 minutes. GraphChi: 59 minutes on 1 Mac Mini! PowerGraph: 64 machines (1024 cores), 1.5 minutes. Hadoop results from [Suri & Vassilvitskii '11].
51 Apache 2 License: documentation, code, and tutorials (more on the way).
52 Active Work: cross-language support (Python/Java); support for incremental graph computation; integration with graph databases; declarative representations of the GAS decomposition, e.g. my.pr := nbrs.in.map(x => x.pr).reduce((a,b) => a + b). Contact: Joseph E. Gonzalez, Postdoc, UC Berkeley ([email protected], [email protected]).
53 Why not use Map-Reduce for Graph Parallel algorithms?
54 Data Dependencies are Difficult. MapReduce assumes independent data records, so dependent data is difficult to express: it requires substantial data transformations, user-managed graph structure, and costly data replication.
55 Iterative Computation is Difficult. The system is not optimized for iteration: every pass re-launches the job and re-reads the data, so each iteration pays a startup penalty and a disk penalty. [Figure: data flowing through CPUs across repeated iterations, with startup and disk penalties at every step.]
56 The Pregel Abstraction. Vertex-Programs interact by sending messages.
Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg
  // Update the rank of this vertex
  R[i] = total
  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]) to vertex j
Malewicz et al. [PODC 09, SIGMOD 10]
57 Pregel Synchronous Execution: compute, communicate, barrier.
58 Communication Overhead for High-Degree Vertices: fan-in vs. fan-out.
59 Pregel Message Combiners on Fan-In. [Figure: messages from A, B, and C on Machine 1 are summed by a combiner before being sent to D on Machine 2.] A user-defined commutative, associative (+) message operation reduces fan-in traffic.
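A sketch of the combiner idea in Python (sender-side combining over an outbox; a hypothetical structure for illustration, not Pregel's actual API):

def combine_outbox(outbox, combine=lambda a, b: a + b):
    # outbox: list of (dest_machine, dest_vertex, value) messages.
    # Combine everything headed to the same vertex on the same machine,
    # so only one message per destination crosses the network.
    combined = {}
    for machine, vertex, value in outbox:
        key = (machine, vertex)
        combined[key] = combine(combined[key], value) if key in combined else value
    return combined

This collapses fan-in traffic, but it does nothing for fan-out: a broadcast still produces one distinct message per recipient, which is the problem on the next slide.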
60 Pregel Struggles with Fan-Out. [Figure: D on Machine 2 broadcasts to A, B, and C on Machine 1.] Broadcast sends many copies of the same message to the same machine!
61 Fan-In and Fan-Out Performance: PageRank on synthetic power-law graphs; Piccolo was used to simulate Pregel with combiners. [Chart: total communication (GB) vs. power-law constant α; smaller α means more high-degree vertices.]
62 GraphLab Ghosting. [Figure: vertices A, B, C, and D replicated as ghosts across Machine 1 and Machine 2.] Changes to a master vertex are synced to its ghosts.
63 GraphLab Ghosting. [Figure: the same ghosting layout.] Changes to the neighbors of high-degree vertices create substantial network traffic.
64 Fan-In and Fan-Out Performance: PageRank on synthetic power-law graphs. GraphLab is undirected, so fan-in and fan-out behave the same. [Chart: total communication (GB) vs. power-law constant α; smaller α means more high-degree vertices.]
65 Comparison with GraphLab & Pregel: PageRank on synthetic power-law graphs. [Charts: total network (GB) and runtime (seconds) vs. power-law constant α for Pregel (Piccolo), GraphLab, and GraphLab2.] GraphLab2 is robust to high-degree vertices.
66 GraphLab on Spark. The same PageRank written twice: the full C++ GraphLab program, and a short, interactive Spark prototype.

The C++ GraphLab version:

#include <graphlab.hpp>

struct vertex_data : public graphlab::IS_POD_TYPE {
  float rank;
  vertex_data() : rank(1) { }
};
typedef graphlab::empty edge_data;
typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type;

class pagerank :
    public graphlab::ivertex_program<graph_type, float>,
    public graphlab::IS_POD_TYPE {
  float last_change;
public:
  float gather(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    return edge.source().data().rank / edge.source().num_out_edges();
  }
  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& total) {
    const double newval = 0.15 + 0.85 * total;
    last_change = std::fabs(newval - vertex.data().rank);
    vertex.data().rank = newval;
  }
  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    if (last_change > TOLERANCE) context.signal(edge.target());
  }
};

struct pagerank_writer {
  std::string save_vertex(graph_type::vertex_type v) {
    std::stringstream strm;
    strm << v.id() << "\t" << v.data().rank << "\n";
    return strm.str();
  }
  std::string save_edge(graph_type::edge_type e) { return ""; }
};

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  graphlab::command_line_options clopts("PageRank algorithm.");
  graph_type graph(dc, clopts);
  graph.load_format("biggraph.tsv", "tsv");
  graphlab::omni_engine<pagerank> engine(dc, graph, clopts);
  engine.signal_all();
  engine.start();
  graph.save(saveprefix, pagerank_writer(), false, true, false);
  graphlab::mpi_tools::finalize();
  return EXIT_SUCCESS;
}

The Spark prototype (interactive!):

import spark.graphlab._

val sc = spark.SparkContext(master, "pagerank")
val graph = Graph.textFile("biggraph.tsv")
val vertices = graph.outDegree().mapValues((_, 1.0, 1.0))
val pr = Graph(vertices, graph.edges).iterate(
  (meId, e) => e.source.data._2 / e.source.data._1,              // gather
  (a: Double, b: Double) => a + b,                               // sum
  (v, accum) => (v.data._1, 0.15 + 0.85 * accum, v.data._2),     // apply
  (meId, e) => abs(e.source.data._2 - e.source.data._3) > 0.01)  // scatter
pr.vertices.saveAsTextFile("results")