PARADIGMS FOR REALIZING MACHINE LEARNING ALGORITHMS




REVIEW

Vijay Srinivas Agneeswaran, PhD, Pranay Tonpay, and Jayati Tiwary, BE
Impetus Infotech (India) Private Limited, Bangalore, Karnataka, India

DOI: 10.1089/big.2013.0006

Abstract

This article explains the three generations of machine learning algorithms, all of which attempt to operate on big data. The first-generation tools are SAS, SPSS, etc.; second-generation realizations include Mahout and RapidMiner (which work over Hadoop); and the third-generation paradigms include Spark and GraphLab, among others. The essence of the article is that for a number of machine learning algorithms, it is important to look beyond Hadoop's Map-Reduce paradigm in order to make them work on big data. A number of promising contenders have emerged in the third generation that can be exploited to realize deep analytics on big data.

Introduction

Google's seminal paper on Map-Reduce (MR)1 was the trigger that led to a lot of developments in the big data space. Though the Map-Reduce paradigm was known in the functional programming literature, the paper provided scalable implementations of the paradigm on a cluster of nodes. The paper, along with Apache Hadoop, the open source implementation of the MR paradigm, enabled end users to process large data sets on a cluster of nodes: a usability paradigm shift. Hadoop, which comprises the MR implementation along with the Hadoop Distributed File System (HDFS), has now become the de facto standard for data processing, with a lot of industrial game-changers such as Disney, Sears, Walmart, and AT&T having their own Hadoop cluster installations.

The big data puzzle can be understood better by looking at venture capitalist (VC) funding information, which we have generated from various VC sources on the web. Some of the interesting areas of funding, along with the corresponding start-ups, are given in Table 1. It captures a futuristic landscape of the big data space: our perspective of the important players, start-ups, and software of big data. It is evident from the table that a number of companies are focusing on building analytics on top of the Hadoop framework, leading to the emergence of the term "big data analytics." By big data analytics, we refer to the ability to ask questions on large data sets and answer them appropriately, possibly using machine learning techniques as the foundation. The emerging focus of big data analytics is to make traditional techniques, such as market basket analysis, scale and work on large data sets; this is reflected in the approach of SAS and other traditional vendors to build Hadoop connectors. The other emerging approach to analytics focuses on new algorithms, including machine learning and data mining techniques, for solving complex analytical problems, including those in video and real-time analytics.

The goal of this article is to review the literature and practices in this emerging subject area and to explore the fundamental paradigms that are useful for realizing big data analytics. Our perspective is that Hadoop is just one such paradigm; a whole new set of others is emerging, including Bulk Synchronous Parallel (BSP)-based paradigms and graph processing paradigms, which are better suited to realizing iterative machine learning algorithms.
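Before proceeding, a reminder of the functional-programming roots of Map-Reduce mentioned above: the map and reduce phases can be sketched with plain Scala collections. This is a single-node illustration with a toy input, not Hadoop code.

```scala
// Single-node Scala illustration of the map and reduce phases (not Hadoop code;
// the input collection is a toy example).
object WordCount extends App {
  val documents = Seq("big data on clusters", "data on big clusters")

  val counts = documents
    .flatMap(_.split(" ").map(word => (word, 1)))              // map: emit (key, value)
    .groupBy(_._1)                                             // shuffle: group by key
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reduce: sum per key

  counts.foreach(println)
}
```

Hadoop distributes exactly this structure: the map phase runs on splits of the input across the cluster, the framework shuffles by key, and the reduce phase aggregates each key's values.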

Table 1. Important Players, Start-Ups, and Software of Big Data

Analytics
- Analytical Apps and App Dev: Digital Reasoning, Klouddata, JackBe, Accretive, Tibco Spotfire, Pentaho, ParStream, Concurrent, Birst, SAS, ClearStory Data, Terracotta, MailChimp, WibiData, Palantir, Quibole
- Analytics & Tools: Karmasphere, Pervasive (Actian), Zettaset, Datameer, Splunk, Alpine Data Labs, Knime, HStreaming, Skytree, Bloomreach, Versium, Alteryx, Zemetis, Gauvus, MuSigma, RevolutionR
- Analytic DBs & Platforms: Teradata/Asterdata, IBM/Netezza, HP/Vertica, Pivotal/Greenplum DB, ParAccel/Actian (and Actian Vectorwise), 1010data
- Video Analytics: OpenCV, Ooyala, TubeMoghul, 3VR, Video Breakouts

Computing paradigms
- Map-Reduce: Hadoop, Spark, HaLoop, Twister MR
- Real-time/CEP: Storm, Kafka, S4, Akka
- Graph Processing: GraphLab, Pregel, Apache Giraph, Stanford GPS, GoldenOrb, GraphX, Graph Search (Facebook)
- Interactive Query: Dremel, Apache Drill
- Distributed SQL: Cloudera Impala, Shark

Hadoop distributions
- Cloudera, HW, GP/Pivotal, MapR, IBM, DataStax, AWS (EMR), Intel, WanDisco

Storage
- NoSQL/NewSQL: DataStax (Cassandra), 10gen (MongoDB), VoltDB, Couchbase, TinyDB, NuoDB, Phoenix, FoundationDB, SciDB, Neo4J, OrientDB, GraphDB, HBase, Redis, etc.
- File/Storage: Appistry, AmpliStor, HDFS, RainStor, EMC Isilon, Lustre, QFS

Visualization
- Datameer, Pentaho, Actuate, Jaspersoft, QlikTech, Tableau, Platfora, chart.io, Cirro, Cognos, Dataspora, Intellicus, Ayasdi, Zoomdata, SiSense, Centrifuge Systems, JQ Plot, D3.js, etc.

Miscellaneous
- Software-defined networking: Nicira (VMware), Contrail (Juniper), Arista
- Data Munging: Dataspora, Trifacta
- Big-data Governance: Druid (Metamarkets), Infochimps, DataMarket, Timetric, Intel Rhino, Zettaset

The rest of the article is organized as follows. The next section, "Big Data Analytics," gives a bird's-eye view of the three generations of realizations of machine learning algorithms. The subsequent two sections explain two third-generation paradigms, namely Spark and GraphLab. The final section provides concluding remarks.

Big Data Analytics: Three Generations of Machine Learning Realizations

We will explain the different paradigms available for implementing machine learning (ML) algorithms, both from the literature and from the open source community. First of all, we would like to furnish a view of the three generations of ML tools available today:

1. The traditional ML tools for machine learning and statistical analysis, including SAS, SPSS, Weka, and the R language, allow deep analysis on smaller data sets that can fit in the memory of the node on which the tool runs.
2. Second-generation ML tools such as Mahout, Pentaho, or RapidMiner allow what we call a shallow analysis of big data.
3. Third-generation tools such as Spark, Twister, HaLoop, Hama, and GraphLab facilitate deeper analysis of big data.

First-generation ML tools/paradigms

The first-generation ML tools can facilitate deep analytics, as they have a wide set of ML algorithms. However, not all of them can work on large data sets, such as terabytes or petabytes of data, due to scalability limitations (limited by the nondistributed nature of the tool).
In other words, they are vertically scalable (you can increase the processing power of the node on which the tool runs) but not horizontally scalable (not all of them can run on a cluster). These limitations are being addressed by building Hadoop connectors as well as by providing clustering options, meaning that vendors have made efforts to reengineer tools such as R and SAS to scale horizontally. Such efforts fall under the second/third-generation tools covered subsequently.

Second-generation ML tools/paradigms

The second-generation tools (we can now term the traditional ML tools such as SAS first-generation tools) such as Mahout (http://mahout.apache.org), RapidMiner, or Pentaho provide the ability to scale to large data sets by implementing the algorithms over Hadoop, the open source MR implementation. These tools are maturing fast and are open source (especially Mahout). Mahout has a set of algorithms for clustering and classification, as well as a very good recommendation algorithm.2 Mahout can thus be said to work on big data, with a number of production use cases, mainly for recommendation systems.
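As an illustration of the recommendation facility, the sketch below drives Mahout's user-based (Taste) recommender from Scala. The ratings file, the neighborhood size of 25, and the Pearson similarity are illustrative assumptions on our part, not a prescription.

```scala
import java.io.File
import scala.collection.JavaConverters._
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity

// Minimal Mahout (Taste) user-based recommender; "ratings.csv" holds lines of
// userID,itemID,rating. Parameters here are illustrative.
object RecommenderSketch {
  def main(args: Array[String]): Unit = {
    val model = new FileDataModel(new File("ratings.csv"))
    val similarity = new PearsonCorrelationSimilarity(model)
    val neighborhood = new NearestNUserNeighborhood(25, similarity, model)
    val recommender = new GenericUserBasedRecommender(model, neighborhood, similarity)

    // Top-5 item recommendations for user 42.
    recommender.recommend(42L, 5).asScala.foreach { item =>
      println(s"item ${item.getItemID} score ${item.getValue}")
    }
  }
}
```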

We have also used Mahout in production systems for realizing recommendation algorithms in the financial domain and found it to be scalable, though not without issues (we had to tweak the source significantly). One observation about Mahout is that it implements only a small subset of ML algorithms over Hadoop: only 25 algorithms are production quality, with only 8 or 9 usable over Hadoop, meaning scalable over large data sets. These include linear regression, linear support vector machines (SVMs), k-means clustering, etc. Mahout does provide a fast sequential implementation of logistic regression with parallelized training. However, as several others have also noted (see Quora, for instance), it does not have implementations of nonlinear SVMs or multivariate logistic regression (otherwise known as the discrete choice model).

This article is not intended as Mahout bashing; our point is that it is quite hard to implement certain ML algorithms, including the kernel SVM and conjugate gradient descent (note that Mahout has an implementation of stochastic gradient descent), over Hadoop. This has been pointed out by several others as well; for instance, see the paper by Srirama et al.3 That paper makes detailed comparisons between Hadoop and Twister MR4 with respect to iterative algorithms such as conjugate gradient descent (CGD) and shows that the overheads can be significant for Hadoop.

What do we mean by iterative? A set of entities that perform a certain computation, wait for results from neighbors or other entities, and then start the next iteration. CGD is a perfect example of an iterative ML algorithm: each CGD iteration can be broken down into the primitives daxpy, ddot, and matmul. daxpy takes a vector x, multiplies it by a constant k, and adds another vector y to it; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and produces a vector output. (A sketch of these primitives is given after the list below.) This means one MR job per primitive invocation, leading to six MRs per CGD iteration and eventually hundreds of MRs per CG computation, as well as a few GBs of communication, even for small matrices. In essence, the setup cost per iteration (which includes reading from HDFS into memory) overwhelms the computation for that iteration, leading to performance degradation in Hadoop MR. In contrast, Twister distinguishes between static and variable data, allowing data to stay in memory across MR iterations, and adds a combine phase for collecting all reduce outputs; hence it performs significantly better.

Hadoop is good for embarrassingly parallel applications, but certain hindrances remain for enterprise adoption, including the following:

1. Lack of Open Database Connectivity (ODBC). A lot of business intelligence (BI) tools are forced to build separate Hadoop connectors.
2. Data splits. If data splits are interrelated, or the computation needs to access data across splits, this may involve joins and may not run efficiently over Hadoop.
3. Iterative computations. Hadoop MR is not well suited, for two reasons: one is the overhead of fetching data from HDFS for each iteration (which can be amortized by a distributed caching layer), and the other is the lack of long-lived MR jobs in Hadoop. This implies that for each iteration, new MR jobs need to be initialized; the overhead of initialization can overwhelm the computation for the iteration and cause significant performance hits.
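Returning to the CGD primitives named above, here is a minimal single-node Scala sketch. The names follow the text; the array representation is ours, purely for illustration. In a Hadoop MR realization, each invocation of such a primitive becomes at least one full MR job.

```scala
// Single-node sketch of the three CGD primitives described in the text.
object CgdPrimitives {
  // daxpy: k * x + y, elementwise
  def daxpy(k: Double, x: Array[Double], y: Array[Double]): Array[Double] =
    x.zip(y).map { case (xi, yi) => k * xi + yi }

  // ddot: dot product of x and y
  def ddot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (xi, yi) => xi * yi }.sum

  // matmul: matrix (given as rows) times vector
  def matmul(a: Array[Array[Double]], x: Array[Double]): Array[Double] =
    a.map(row => ddot(row, x))
}
```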
The other second-generation tools are traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics to scale R over Hadoop and the work to implement a scalable runtime over Hadoop for R programs.5 SAS in-memory analytics, part of the High Performance Analytics toolkit from SAS, is another attempt at scaling a traditional tool by using a Hadoop cluster; however, the recently released version works over Greenplum/Teradata in addition to Hadoop, which could be seen as a third-generation approach. Other interesting work comes from a small start-up known as Concurrent Systems, which provides a Predictive Model Markup Language (PMML) runtime over Hadoop. PMML is like the XML of modeling, allowing models to be saved in descriptive language files; traditional tools such as R and SAS allow models to be saved as PMML files. The runtime over Hadoop allows these model files to be scaled over a Hadoop cluster, so this also falls under our second-generation tools/paradigms.

Third-generation ML tools/paradigms

The limitations of Hadoop and its lack of suitability for certain classes of applications have motivated some researchers to come up with alternatives. Researchers at the University of California, Berkeley have proposed Spark6 as one such alternative; in other words, Spark could be seen as the next-generation data processing alternative to Hadoop in the big data space. The key idea distinguishing Spark is its in-memory computation, allowing data to be cached in memory across iterations/interactions. The Berkeley researchers have proposed the Berkeley Data Analytics Stack (BDAS) as a collection of technologies that help in running data analytics tasks across a cluster of nodes. The lowest-level component of the BDAS is Mesos, the cluster manager that handles task allocation and resource management for the cluster. The second component is the Tachyon file system, built on top of Mesos. Tachyon provides a distributed file system abstraction and interfaces for file operations across the cluster.

In Spark, the computation paradigm is realized over Tachyon and Mesos in one specific embodiment, though it could be realized without Tachyon and even without Mesos for clustering. Shark, which is realized over Spark, provides an SQL abstraction over a cluster, similar to the abstraction Hive provides over Hadoop. This article explores Spark, which is the main ingredient over which ML algorithms can be built.

The main motivation for Spark was that the commonly used MR paradigm, while suitable for applications that can be expressed as acyclic data flows, was not suitable for other applications, such as those that need to reuse working sets across iterations. The Berkeley researchers therefore proposed a new paradigm for cluster computing that can provide fault-tolerance guarantees similar to those of MR while also being suitable for iterative and interactive applications. The HaLoop work7 also extends Hadoop for iterative ML algorithms: HaLoop not only provides a programming abstraction for expressing iterative applications, it also uses the notion of caching to share data across iterations and for fixpoint verification (termination of the iteration), thereby improving efficiency. Twister (http://iterativemapreduce.org) is another effort similar to HaLoop.

The other important tool that has looked beyond Hadoop MR comes from Google: the Pregel framework for realizing graph computations.8 Computations in Pregel comprise a series of iterations known as supersteps. Each vertex in the graph is associated with a user-defined compute function; at each superstep, Pregel ensures that the compute function is invoked in parallel on each vertex. Vertices can send messages through the edges and exchange values with other vertices. There is also a global barrier, which moves forward once all compute functions have terminated. Readers familiar with bulk synchronous parallel (BSP) processing will recognize why Pregel is a perfect example of BSP: a set of entities computing in parallel with global synchronization and able to exchange messages.

Apache Hama9 is the open source equivalent of Pregel, being an implementation of the BSP paradigm. Hama realizes BSP over HDFS, as well as over the Dryad engine from Microsoft; it may be that the Hama developers do not want to be seen as different from the Hadoop community. The important point is that BSP is an inherently well-suited paradigm for iterative computations, and Hama has parallel implementations of CGD, which, as we noted, is not easy to realize over Hadoop. It must be noted that the BSP engine in Hama is realized over the message passing interface (MPI), the father (and mother) of the parallel programming literature (www.mcs.anl.gov/research/projects/mpi/). Other projects similar to Pregel are Apache Giraph, GoldenOrb, and Stanford GPS. Google has yet to open-source Pregel, to the best of the authors' knowledge.
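To illustrate the superstep model, below is a hypothetical Pregel-style vertex program in Scala. The interface is ours, for illustration only (Pregel itself is an unreleased C++ system with a different API); the compute function is a simplified PageRank.

```scala
// Hypothetical Pregel-style interface; names are illustrative, not Pregel's API.
case class Message(rank: Double)

trait Vertex {
  var value: Double
  def neighbors: Seq[Long]             // targets of outgoing edges
  def send(to: Long, m: Message): Unit // delivered at the next superstep
  def voteToHalt(): Unit               // the global barrier advances when all halt
}

object PageRankCompute {
  // Invoked on every vertex at every superstep, in parallel.
  def compute(v: Vertex, incoming: Seq[Message], superstep: Int): Unit = {
    if (superstep > 0)
      v.value = 0.15 + 0.85 * incoming.map(_.rank).sum
    if (superstep < 30)
      v.neighbors.foreach(n => v.send(n, Message(v.value / v.neighbors.size)))
    else
      v.voteToHalt()
  }
}
```

Third-Generation ML Tool: Spark

The key notion in Spark is the Resilient Distributed Dataset (RDD), which can be cached in memory on different nodes and used across iterations, thereby improving performance significantly.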
Spark also addresses fault tolerance for the RDDs: if a node crashes, there is enough information in the other RDDs to recreate or reconstruct the lost RDD partition, through what the authors call lineage. Spark is implemented in Scala,10 a high-level language that is similar to Java and gaining popularity. The main difference between Java and Scala is that Scala is purely object oriented, much like the classical object-oriented language Smalltalk. Scala also unifies functional programming (of the Lisp kind) with object-oriented programming; this means that some classes may inherit from functions [e.g., the array type of Scala inherits from functions and is written as A(i)].

Spark offers several alternatives for building RDDs:

1. RDDs can be built from a file in HDFS.
2. They can be built by parallelizing a Scala collection (slicing it like an array).
3. RDDs can be built by transforming existing RDDs (by specifying operations such as map, filter, or join).
4. RDDs can be built by changing the persistence or saving options of existing RDDs (save allows an RDD to be written to HDFS, while the cache construct keeps it in memory).

It must be noted that the cache construct is only a hint: if there is not enough memory across the cluster, Spark will reconstruct the RDD on the fly. This implies that Spark can scale (handle increasing data sizes, with reduced performance) and remain fault tolerant.

RDDs can be used in actions, operations that return a value or export data to storage; examples of such actions include count, collect, save, and reduce. Parallelism in RDDs is facilitated by constructs such as foreach, reduce, and collect, together with user-defined functions, which are first-class objects in Scala. reduce is an associative function that combines the data set elements to produce a result at the driver program. The collect construct collects all elements of the data set at the driver program. An array can be updated through parallelize, map, and collect operations. The foreach construct allows a user-provided function to be executed on each element of an RDD in parallel.
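To ground these constructs, here is a minimal sketch using the RDD API just described. The master URL "local[4]" and the HDFS path are illustrative placeholders for a real cluster and data set.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Minimal illustration of the RDD constructs described above.
object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "rdd-basics")

    // (1) Build an RDD from a file in HDFS and (4) cache it in memory (a hint only).
    val lines = sc.textFile("hdfs://namenode/data/events.log").cache()

    // (2) Build an RDD by parallelizing a Scala collection.
    val numbers = sc.parallelize(1 to 1000)

    // (3) Build RDDs by transforming existing RDDs (map, filter, ...).
    val errors = lines.filter(_.contains("ERROR"))

    // Actions return a value to the driver or export data to storage.
    println("error lines: " + errors.count())
    val total = numbers.map(_ * 2).reduce(_ + _) // associative combine at the driver
    println("total: " + total)
  }
}
```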

Programmers can pass functions or closures to invoke the map, filter, and reduce operations in Spark. Normally, when Spark runs these functions on worker nodes, the local variables within the scope of the function are copied. Spark has the notion of shared variables for emulating globals: broadcast variables and accumulators. Broadcast variables are used by the programmer to copy read-only data once to all the workers (static matrices in CGD-type algorithms are good candidates). Accumulators are variables that can only be added to by the workers and read by the driver program, so parallel sums can be realized fault-tolerantly. (The k-means sketch below broadcasts the cluster centers in this way.) It must be noted that the programming interface of Spark is similar to that of DryadLINQ;11 however, DryadLINQ has no concept of RDDs nor any way for data to be shared across iterations, which is the main differentiator of Spark.

Spark is emerging as a promising Hadoop alternative in the big data analytics space, with a number of production use cases, mainly from start-ups and small and medium enterprises. We have conducted performance studies of Spark compared with second-generation paradigms. In this direction, we have implemented several ML algorithms over Spark, including k-means clustering, a commonly used algorithm for clustering data sets. The performance comparison of Mahout's k-means algorithm realized over Hadoop versus the k-means algorithm realized over Spark is given in Table 2. The performance studies were done over a three-node cluster, each node being a 32-core Xeon with 32 GB RAM and 32 GB swap. They tell us that the Spark implementation can be significantly faster and would scale better over a cluster of nodes than the Mahout implementation. The main reason for the slow performance of Mahout is its use of sequence files and the consequent disk accesses, whereas Spark performs in-memory computation with the RDDs.

Table 2. Performance Comparison of Mahout's k-Means Algorithm Realized over Hadoop Versus the k-Means Algorithm Realized over Spark

Points clustered (millions) | Spark (s) | Mahout (s)
0.1 | 4 | 363
0.2 | 6 | 368
0.3 | 7 | 374
0.4 | 9 | 391
0.5 | 11 | 407
0.6 | 13 | 425
0.7 | 16 | 428
0.8 | 19 | 433
0.9 | 21 | 435
1.0 | 24 | 437
2.5 | 91 | 570
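As context for the numbers above, here is a hypothetical minimal k-means over Spark RDDs. It is not the exact implementation we benchmarked; the input path, k = 10, and 20 iterations are placeholders. It shows why caching the points and broadcasting the centers makes each iteration cheap compared with re-reading HDFS.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Illustrative k-means over Spark RDDs (not the benchmarked implementation).
object KMeansSketch {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closest(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => dist2(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "kmeans-sketch")
    // Parse and cache the points once: later iterations read memory, not HDFS.
    val points = sc.textFile("hdfs://namenode/data/points.txt")
                   .map(_.split(' ').map(_.toDouble))
                   .cache()
    var centers = points.takeSample(false, 10, 42)

    for (_ <- 1 to 20) {
      val bc = sc.broadcast(centers) // ship read-only centers once per iteration
      centers = points
        .map(p => (closest(p, bc.value), (p, 1)))              // assign nearest center
        .reduceByKey { case ((s1, n1), (s2, n2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2) } // per-cluster sums
        .map { case (_, (sum, n)) => sum.map(_ / n) }          // new means
        .collect()
    }
    centers.foreach(c => println(c.mkString(" ")))
  }
}
```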
User-defined update functions live on each vertex and transform data in the scope of the vertex. It can choose to trigger neighbor updates (for example, can be triggered only if the rank changes drastically in a page-rank kind of algorithm) and can run without global synchronization. Importantly, while Pregel lives with sequential dependencies on the graphs, Graph- Lab allows parallelism, which is important in certain ML algorithms, including collaborative filtering. GraphLab has implemented several ML algorithms including Alternating Least Squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc. The paper also shows significant speed-up compared with Hadoop for an expectation maximization algorithm. GraphLab2 13 is a new asynchronous shared memory abstraction in which the user-defined vertex programs share a distributed graph, which has data associated with both vertices and edges. Each vertex program can access data associated with the vertex or incident edges or neighboring vertices. It can also schedule computation to be run on the neighboring vertices in the future. Serializability is ensured, as GraphLab does not allow neighboring vertex programs to run simultaneously. It characterizes a graph computation as comprising three phases: gather (where values from neighbors are gathered by the vertex program), apply (these values are applied to the current vertex sum or union of data on this and neighboring vertices and edges can be specified), and scatter (neighboring vertex programs can be scheduled to MARY ANN LIEBERT, INC. VOL. 1 NO. 4 DECEMBER 2013 BIG DATA BD211

The gather and scatter phases can determine the appropriate fan-in and fan-out of the vertices. By separating these out, GraphLab can be efficient on high fan-in or high fan-out edges, which occur in power-law graphs. The main challenges of natural graphs include the asymmetric fan-in/fan-out distribution, which leads to work imbalance in systems (such as Pregel) that treat vertices uniformly. The partitioning mechanism is another challenge: since Pregel and GraphLab1 use hash-based random partitioning, they are not well suited for natural graphs. Handling natural graphs also requires a parallelism abstraction within individual vertices, which systems such as Pregel do not provide.

GraphLab2 provides new ways of partitioning a power-law graph (where high-degree vertices could limit parallelism) by defining an abstraction known as PowerGraph. Vertices are associated with nodes, while edges can span multiple nodes. PowerGraph gets the best of both Pregel (the associative, commutative gather concept) and GraphLab1 (the shared-memory abstraction). PowerGraph introduces novel vertex cuts, which allow it to represent the graph better in a distributed system than GraphLab1 or Pregel can. PowerGraph factors the vertex program into gather, sum, apply, and scatter functions and can hence distribute the vertex program across the nodes of a distributed system. The gather function is invoked on the edges adjacent to the current vertex, depending on whether the gather_nbrs parameter is set to none, all, in, or out. The sum is a commutative and associative operator. The apply function computes a new value for the vertex. The scatter function is invoked on the neighbors (again based on the scatter_nbrs parameter).

PowerGraph has both a synchronous execution mode (resembling Pregel) and an asynchronous one (resembling Piccolo). However, asynchronous execution is nondeterministic in Piccolo, which can lead to instability or even nonconvergence for certain ML algorithms such as statistical simulations.14 In contrast, PowerGraph enforces serializability, which implies that every parallel execution has a corresponding serial execution order. While GraphLab1 provided the same guarantee, it was inefficient and unfair to high-degree vertices due to a sequential locking protocol. PowerGraph introduces a new parallel locking protocol that is fair to high-degree vertices, allowing parallelization of the vertex program across a cluster. The PowerGraph paper also shows significant performance benefits for PageRank and triangle counting on the Twitter graph.

The different graph processing paradigms are compared and contrasted in Table 3 for quick reference. Table 4 carries a comparison of the various paradigms across different nonfunctional features such as scalability, fault tolerance, and the algorithms that have been implemented.

Table 3. Comparison of Graph Processing Paradigms

GraphLab1
- Computation: asynchronous
- Determinism and its effects: deterministic (serializable computation)
- Efficiency vs. asynchrony: uses inefficient locking protocols
- Efficiency on power-law graphs: inefficient; the locking protocol is unfair to high-degree vertices

Pregel
- Computation: synchronous (Bulk Synchronous Parallel based)
- Determinism and its effects: deterministic (serial computation)
- Efficiency vs. asynchrony: not applicable
- Efficiency on power-law graphs: inefficient; curse of slow jobs

Piccolo
- Computation: asynchronous
- Determinism and its effects: nondeterministic (nonserializable computation)
- Efficiency vs. asynchrony: efficient, but may lead to incorrectness
- Efficiency on power-law graphs: may be efficient

GraphLab2 (PowerGraph)
- Computation: asynchronous
- Determinism and its effects: deterministic (serializable computation)
- Efficiency vs. asynchrony: uses parallel locking for efficient, serializable, asynchronous computations
- Efficiency on power-law graphs: efficient; parallel locking and other optimizations for natural graphs

Table 4. Comparison of the Various Paradigms Across Different Nonfunctional Features

First generation
- Examples: SAS, R, Weka, SPSS in native form
- Scalability: vertical
- Algorithms available: huge collection of algorithms
- Algorithms not available: practically nothing
- Fault tolerance: single point of failure

Second generation
- Examples: Mahout, Pentaho, Revolution R, SAS In-Memory Analytics (Hadoop)
- Scalability: horizontal (over Hadoop)
- Algorithms available: small subset, such as sequential logistic regression, linear SVMs, stochastic gradient descent, k-means clustering, random forests, etc.
- Algorithms not available: a vast number, such as kernel SVMs, multivariate logistic regression, conjugate gradient descent, ALS, etc.
- Fault tolerance: most tools are FT, as they are built on top of Hadoop

Third generation
- Examples: Spark, HaLoop, GraphLab, Pregel, SAS In-Memory Analytics (Greenplum/Teradata), Giraph, GoldenOrb, Stanford GPS, ML over Storm
- Scalability: horizontal (beyond Hadoop)
- Algorithms available: much wider, including conjugate gradient descent (CGD), alternating least squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc.
- Algorithms not available: multivariate logistic regression in general form, k-means clustering, etc.; work is in progress to expand the set of available algorithms
- Fault tolerance: FT (HaLoop, Spark); not FT (Pregel, GraphLab, Giraph)

FT, fault tolerant; GPS, graph processing system; ML, machine learning; SVM, support vector machine.

It can be inferred that while the traditional tools work on only a single node, may not scale horizontally, and may have a single point of failure, recent reengineering efforts have moved them across generations. The other point to be noted is that most of the graph processing paradigms are not fault tolerant, while Spark and HaLoop are among the third-generation tools that do provide fault tolerance.

Conclusions

This article has provided a comprehensive review of the three generations of tools/paradigms that realize machine learning algorithms for big data. The first-generation tools, which include SAS and SPSS, can help in deep analytics but may only scale vertically. The second-generation tools, such as Mahout and RapidMiner, can scale horizontally but may be limited by the Hadoop MR paradigm over which they are realized (in terms of the number of algorithms that can be implemented nonserially). The third-generation realizations, such as Spark and GraphLab, among others, promise the most in terms of horizontal scalability and the large number of ML algorithms that can potentially be implemented. However, it will take plenty of effort, time, and widespread adoption and industrial use to make sure that these third-generation tools realize their true potential and allow deep analytics of big data. It must be noted that only Spark has production use cases, with the other tools just starting to be used for testing/development. This article has also presented performance studies of ML algorithms realized over Spark and of Mahout over Hadoop, illustrating the power of the third-generation paradigms.

Acknowledgments

The authors wish to thank the Impetus management, including Pankaj Mittal and Vineet Tyagi, for providing encouragement and support. We also wish to thank Dr. Nitin Agarwal and Joydeb Mukherjee from the Data Sciences Practice team at Impetus, and to express our gratitude to Gurvinder Arora, our technical writer, for reviewing the document and improving the language.

Author Disclosure Statement

No financial conflicts exist.

References
1. Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 2008; 51:107-113.
2. Ekstrand MD, Ludwig M, Konstan JA, Riedl JT. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In: Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). New York: ACM, 2011, pp. 133-140.
3. Srirama SN, Jakovits P, Vainikko E. Adapting scientific computing problems to clouds using MapReduce. Future Generation Computer Systems 2012; 28:184-192.
4. Ekanayake J, Li H, Zhang B, et al. Twister: A runtime for iterative MapReduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). Chicago, IL: 2010.

5. Venkataraman S, Roy I, AuYoung A, Schreiber RS. Using R for iterative and incremental processing. In: Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing (HotCloud '12). Berkeley, CA: USENIX Association, 2012, p. 11.
6. Zaharia M, Chowdhury M, Franklin MJ, et al. Spark: Cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10). Berkeley, CA: USENIX Association, 2010, p. 10.
7. Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment 2010; 3:285-296.
8. Malewicz G, Austern MH, Bik AJC, et al. Pregel: A system for large-scale graph processing. In: Proceedings of the SIGMOD International Conference on Management of Data (SIGMOD '10). New York: ACM, 2010, pp. 135-146.
9. Seo S, Yoon EJ, Kim J, et al. HAMA: An efficient matrix computation with the MapReduce framework. In: Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom '10). Washington, DC: IEEE Computer Society, 2010, pp. 721-726.
10. Odersky M, Spoon L, Venners B. Programming in Scala, 2nd ed. Walnut Creek, CA: Artima Publishers, 2010.
11. Yu Y, Isard M, Fetterly D, et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). Berkeley, CA: USENIX Association, 2008, pp. 1-14.
12. Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 2012; 5:716-727.
13. Gonzalez J, Low Y, Gretton A, Guestrin C. Parallel Gibbs sampling: From colored fields to thin junction trees. AISTATS 2011; 15:324-332.
14. Gonzalez JE, Low Y, Gu H, et al. PowerGraph: Distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). Berkeley, CA: USENIX Association, 2012, pp. 17-30.

Address correspondence to:
Vijay Srinivas Agneeswaran, PhD
Innovation Labs
Impetus Infotech (India) Private Limited
Pritech Park SEZ
Bellandur Outer Ring Road
Bangalore, Karnataka 560103
India

E-mail: vijay.sa@impetus.co.in

This work is licensed under a Creative Commons Attribution 3.0 United States License. You are free to copy, distribute, transmit and adapt this work, but you must attribute this work as "Big Data. Copyright 2013 Mary Ann Liebert, Inc. http://liebertpub.com/big, used under a Creative Commons Attribution License: http://creativecommons.org/licenses/by/3.0/us/"