How To Create A Graphlab Framework For Parallel Machine Learning



Similar documents
Spark and the Big Data Library

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Unified Big Data Processing with Apache Spark. Matei

Presto/Blockus: Towards Scalable R Data Analysis

Linux Performance Optimizations for Big Data Environments

Estimating PageRank Values of Wikipedia Articles using MapReduce

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Big Data looks Tiny from the Stratosphere

Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013

Machine Learning over Big Data

Spark: Making Big Data Interactive & Real-Time

Challenges for Data Driven Systems

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

Apache Hadoop. Alexandru Costan

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Xiaoming Gao Hui Li Thilina Gunarathne

Scaling Up HBase, Hive, Pegasus

Hadoop Parallel Data Processing

HIGH PERFORMANCE BIG DATA ANALYTICS

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

Big Data and Scripting Systems build on top of Hadoop

Big Graph Processing: Some Background

Big Data Analytics with Cassandra, Spark & MLLib

Unified Big Data Analytics Pipeline. 连 城

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Big Data Too Big To Ignore

Mammoth Scale Machine Learning!

Apache Hama Design Document v0.6

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Graph Processing and Social Networks

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Distributed R for Big Data

Chase Wu New Jersey Ins0tute of Technology

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Hadoop & Spark Using Amazon EMR

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

A Brief Introduction to Apache Tez

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Spark: Cluster Computing with Working Sets

Deal with big data in R using bigmemory package

Reference Architecture, Requirements, Gaps, Roles

RevoScaleR Speed and Scalability

Amazon EC2 Product Details Page 1 of 5

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Data processing goes big

Hadoop Installation MapReduce Examples Jake Karnes

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Parallelization: Binary Tree Traversal

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Data Mining with Hadoop at TACC

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Ali Ghodsi Head of PM and Engineering Databricks

ratings.dat ( UserID::MovieID::Rating::Timestamp ) users.dat ( UserID::Gender::Age::Occupation::Zip code ) movies.dat ( MovieID::Title::Genres )

Report: Declarative Machine Learning on MapReduce (SystemML)

MLlib: Scalable Machine Learning on Spark

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Virtual Systems with qemu

Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Bringing Big Data Modelling into the Hands of Domain Experts

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB)

Big Data and Scripting Systems beyond Hadoop

Google Cloud Data Platform & Services. Gregor Hohpe

FREE computing using Amazon EC2

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Origins, Evolution, and Future Directions of MATLAB Loren Shure

Large-Scale Data Processing

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Distributed R for Big Data

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Assignment # 1 (Cloud Computing Security)

Introduction to Engineering Using Robotics Experiments Lecture 18 Cloud Computing

Running Knn Spark on EC2 Documentation

HiBench Introduction. Carson Wang Software & Services Group

Apache Flink Next-gen data analysis. Kostas

Transcription:

This work is supported by NSF grants CCF-1149252, CCF-1337215, and STARnet, a Semiconductor Research Corporation Program, sponsored by MARCO and DARPA. Datacenter Simulation Methodologies: GraphLab Tamara Silbergleit Lehman, Qiuyun Wang, Seyed Majid Zahedi and Benjamin C. Lee

Tutorial Schedule Time Topic 09:00-09:15 Introduction 09:15-10:30 Setting up MARSSx86 and DRAMSim2 10:30-11:00 Break 11:00-12:00 Spark simulation 12:00-13:00 Lunch 13:00-13:30 Spark continued 13:30-14:30 GraphLab simulation 14:30-15:00 Break 15:00-16:15 Web search simulation 16:15-17:00 Case studies 2 / 36

Agenda Objectives be able to deploy graph analytics framework be able to simulate GraphLab engine, tasks Outline Learn GraphLab for recommender, clustering Instrument GraphLab for simulation Create checkpoints Simulate from checkpoints 3 / 36

Types of Graph Analysis Iterative, batch processing over entire graph dataset Clustering PageRank Pattern Mining Real-time processing over fraction of the entire graph Reachability Shortest-path Graph pattern matching 4 / 36

Parallel Graph Algorithms Common Properties Sparse data dependencies Local computations Iterative updates Difficult programming models Race conditions, deadlocks Shared memory synchronization GraphLab: A New Framework for Parallel Machine Learning by Low et al. 5 / 36

Graph Computation Challenges Poor memory locality I/O intensive Limited data parallelism Limited scalability http://infolab.stanford.edu Lumsdaine et. al. [Parallel Processing Letters 07] 6 / 36

MapReduce for Graphs MapReduce performs poorly for parallel graph analysis Graph is re-loaded, re-processed iteratively MapReduce writes intermediate results to disk between iterations 7 / 36

GraphLab, An Alternative Approach Captures data dependencies Performs iterative analysis Updates data asynchronously Enables parallel execution models Multiprocessor Distributed machines www.select.cs.cmu.edu/code/ graphlab Y. Low et. al., Distributed GraphLab, VLDB 12 8 / 36

The GraphLab Framework Represent data as graph Specify update functions, user computation Choose consistency model Choose task scheduler www.cs.cmu.edu/~pavlo/courses/fall2013/ static/slides/graphlab.pdf 9 / 36

Represent Data as Graph Data graph associates data to each vertex and edge C. Guestrin. A distributed abstraction for large-scale machine learning. 10 / 36

Update Functions and Scope Computation with stateless Scheduler prioritizes computation Scope determines affected edges and vertices http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/graphlab.pdf 11 / 36

Scheduling Tasks and Updates Update scheduler orders update functions Static scheduling Models: synchronous, round-robin Task scheduler adds, reorders tasks Dynamic scheduling Algorithms: FIFO, priority, etc. 12 / 36

GraphLab Software Stack Collaborative filtering recommendation system Clustering Kmeans++ http://img.blog.csdn.net/ 13 / 36

Questions? More information: http://graphlab.com/ 14 / 36

This work is supported by NSF grants CCF-1149252, CCF-1337215, and STARnet, a Semiconductor Research Corporation Program, sponsored by MARCO and DARPA. Datacenter Simulation Methodologies Getting Started with GraphLab Tamara Silbergleit Lehman, Qiuyun Wang, Seyed Majid Zahedi and Benjamin C. Lee

Agenda Objectives be able to deploy graph analytics framework be able to simulate GraphLab engine, tasks Outline Learn GraphLab for recommender, clustering Instrument GraphLab for simulation Create checkpoints Simulate from checkpoints 16 / 36

GraphLab Setup Get a product key from: http://graphlab.com/products/create/ quick-start-guide.html Launch QEMU emulator: $ qemu - system - x86_64 -m 4G -drive file = micro2014. qcow2, cache = unsafe - nographic In QEMU, install required tools and GraphLab-create python package # apt - get install python - pip python - dev build - essential gcc # pip install graphlab - create ==1. 1 17 / 36

GraphLab Setup Register product with generated key by opening file /root/.graphlab/config and editing it as follows [ Product ] product_key = < generated_key > Create a directory for GraphLab # mkdir graphlab # cd graphlab Create a directory for the dataset # mkdir dataset # cd dataset 18 / 36

Downloading a Dataset Download the dataset: 10 million movie ratings by 72,000 users on 10,000 movies # wget files. grouplens. org / datasets / movielens /ml -10 m. zip # unzip ml -10 m. zip # sed s /::/,/g ml -10 M100K / ratings. dat > ratings. csv Open the file and add column names on the first line: userid,moveid,rating,timestamp 19 / 36

GraphLab Recommender Toolkit We will create a factorization recommender program. Create a new python file called recommender.py import graphlab as gl data = gl. SFrame. read_csv ( / root / graphlab / datasets / ratings. csv, column_type_hints ={ rating : int }, header = True ) model = gl. recommender. create (data, user_id = userid, item_id = movieid, target = rating ) results = model. recommend ( users =None,k =5) print results The gl.recommender.create(args) command chooses a recommendation model based on the input dataset format, which is the factorization recommender in this case. 20 / 36

GraphLab Recommender Toolkit The user can specify recommendation model item similarity recommender, factorization recommender, ranking factorization recommender, popularity-based recommender. When user specifies model explicitly, she can also specify number of latent factors, number of maximum iterations, etc. model.recommend(args) returns the k-highest scored items for each user. When users parameter is None, it returns recommendation for all users. 21 / 36

Setup for Creating Checkpoints Copy file ptlcalls.h from marss.dramsim directory # scp user01@sail03. egr. duke. edu :/ home / user01 / marss. dramsim / ptlsim / tools / ptlcalls.h. Create libptlcalls.cpp file (next slide) 22 / 36

libptlcalls.cpp # include < iostream > # include " ptlcalls.h" # include <stdlib.h> extern " C" void create_ checkpoint (){ char * ch_name = getenv (" CHECKPOINT_NAME "); if( ch_name!= NULL ) { printf (" creating checkpoint %s\n",ch_name ); ptlcall_checkpoint_and_shutdown ( ch_name ); } } extern " C" void stop_ simulation (){ printf (" Stopping simulation \n"); ptlcall_kill (); } 23 / 36

Compile libptlcalls.cpp Compile C++ code # g ++ - c - fpic libptlcalls. cpp - o libptlcalls.o Create shared library for Python # g++ -shared -Wl,- soname, libptlcalls.so - o libptlcalls. so libptlcalls. o 24 / 36

Setup for Creating Checkpoints Include the library in recommender.py source code from ctypes import cdll lib = cdll. LoadLibrary (./ libptlcalls.so ) Call function to create checkpoint before the recommender is created. Stop the simulation after recommend function. lib.create checkpoint() model = gl. recommender. create (data, user_id = userid, item_id = movieid, target = rating ) lib.stop simulation() results = model. recommend ( users =None, k =20) 25 / 36

Creating Checkpoints Shutdown QEMU emulator # poweroff Once the emulator is shut down change into the marss.dramsim directory $ cd marss. dramsim Run MARSSx86 QEMU emulator $./ qemu /qemu - system - x86_64 -m 4G -drive file =/ hometemp / userxx / micro2014. qcow2, cache = unsafe - nographic Export CHECKPOINT NAME # export CHECKPOINT_ NAME = graphlab Run recommender.py # python graphlab / recommender. py 26 / 36

Running from Checkpoints Add -simconfig micro2014.simcfg to specify the simulation configuration Add -loadvm option to load from newly created checkpoint Add -snapshot to prevent the simulation from modifying disk image >./ qemu /qemu - system - x86_64 -m 4G -drive file =/ hometemp / userxx / micro2014. qcow2, cache = unsafe - nographic - simconfig micro2014. simcfg - loadvm graphlab - snapshot 27 / 36

GraphLab Clustering Toolkit We will now perform k-means++ clustering We will use airline ontime information for 2008 Download dataset from Statistical Computing web site. Decompress it # wget stat - computing. org / dataexpo /2009/2008. csv. bz2 # bzip2 -d 2008. csv. bz2 28 / 36

GraphLab Clustering Toolkit Create a new python file called clustering.py import graphlab as gl from math import sqrt data_url = 2008. csv data = gl. SFrame. read_csv ( data_url ) # remove empty rows data_ good, data_ bad = data. dropna_ split () # determine the number of rows in the dataset n = len ( data_good ) # compute the number of clusters to create k = int ( sqrt ( n / 2.0) ) print " Starting k- means with % d clusters " % k model = gl. kmeans. create ( data_good, num_clusters =k) ## print some information on clusters created model [ cluster_info ][[ cluster_id, within_distance, size ]] 29 / 36

Setup for Creating Checkpoints Include the library in clustering.py source code from ctypes import cdll lib = cdll. LoadLibrary (./ libptlcalls.so ) Call the function to create checkpoint before k-means clustering model is created. print " Starting k- means with % d clusters " % k lib.create checkpoint() model = gl. kmeans. create ( data_good, num_clusters =k) lib.stop simulation() 30 / 36

Creating Checkpoints Shutdown QEMU emulator # poweroff Once the emulator is shut down change into the marss.dramsim directory $ cd marss. dramsim Run MARSSx86 QEMU emulator $./ qemu /qemu - system - x86_64 -m 4G -drive file =/ hometemp / userxx / micro2014. qcow2, cache = unsafe - nographic Export CHECKPOINT NAME # export CHECKPOINT_ NAME = kmeans 31 / 36

Running from Checkpoints Run clustering.py # python graphlab / clustering. py The checkpoint will be created. Then the VM will shutdown Once the VM shuts down, update micro2014.simcfg to specify number of instructions to simulate -stopinsns 1B Run MARSSx86 from the checkpoint $./ qemu /qemu - system - x86_64 -m 4G -drive file =/ hometemp / userxx / micro2014. qcow2, cache = unsafe - nographic - simconfig micro2014. simcfg - loadvm kmeans - snapshot 32 / 36

Datasets Problem Domain Contributor Link Misc Amazon Web Services public datasets dataset Social Graphs Stanford Large Network Dataset dataset (SNAP) Social Graphs Laboratory for Web Algorithms dataset Collaborative Filtering Million Song dataset dataset Collaborative Filtering Movielens dataset GroupLens dataset Collaborative Filtering KDD Cup 2012 by Tencent, Inc. dataset Collaborative Filtering University of Florida sparse matrix collection dataset (matrix factorization based methods) Classification Airline on time performance dataset Classification SF restaurants dataset dataset GraphLab Resources: http://graphlab.org/resources/datasets.html 33 / 36

Agenda Objectives be able to deploy graph analytics framework be able to simulate GraphLab engine, tasks Outline Learn GraphLab for recommender, clustering Instrument GraphLab for simulation Create checkpoints Simulate from checkpoints 34 / 36

GraphLab For more information visit www.graphlab.org 35 / 36

Tutorial Schedule Time Topic 09:00-09:15 Introduction 09:15-10:30 Setting up MARSSx86 and DRAMSim2 10:30-11:00 Break 11:00-12:00 Spark simulation 12:00-13:00 Lunch 13:00-13:30 Spark continued 13:30-14:30 GraphLab simulation 14:30-15:00 Break 15:00-16:15 Web search simulation 16:15-17:00 Case studies 36 / 36