PARADIGMS FOR REALIZING MACHINE LEARNING ALGORITHMS
REVIEW

Vijay Srinivas Agneeswaran, PhD, Pranay Tonpay, and Jayati Tiwary, BE
Impetus Infotech India Private Limited, Bangalore, Karnataka, India

Abstract

The article explains the three generations of machine learning algorithms, all three of which aim to operate on big data. The first-generation tools are SAS, SPSS, etc.; second-generation realizations include Mahout and RapidMiner (which work over Hadoop); and third-generation paradigms include Spark and GraphLab, among others. The essence of the article is that for a number of machine learning algorithms, it is important to look beyond Hadoop's Map-Reduce paradigm in order to make them work on big data. A number of promising contenders have emerged in the third generation that can be exploited to realize deep analytics on big data.

Introduction

Google's seminal paper on Map-Reduce (MR) [1] was the trigger that led to many developments in the big data space. Though the Map-Reduce paradigm was known in the functional programming literature, the paper provided scalable implementations of the paradigm on a cluster of nodes. The paper, along with Apache Hadoop, the open source implementation of the MR paradigm, enabled end users to process large data sets on a cluster of nodes: a usability paradigm shift. Hadoop, which comprises the MR implementation along with the Hadoop Distributed File System (HDFS), has now become the de facto standard for data processing, with many industrial game-changers such as Disney, Sears, Walmart, and AT&T having their own Hadoop cluster installations. The big data puzzle can be understood better by looking at venture capitalist (VC) funding information, which has been gathered from various VC sources on the web. Some of the interesting areas of funding, along with the corresponding start-ups, are given in Table 1.
It captures a futuristic landscape of the big data space: our perspective of the important players, start-ups, and software of big data. It is evident from the table that a number of companies are focusing on building analytics on top of the Hadoop framework, leading to the emergence of the term big data analytics. By big data analytics, we refer to the ability to ask questions on large data sets and answer them appropriately, possibly by using machine learning techniques as the foundation. One emerging focus of big data analytics is to make traditional techniques, such as market basket analysis, scale and work on large data sets; this is reflected in the approach of SAS and other traditional vendors of building Hadoop connectors. The other emerging approach to analytics focuses on new algorithms, including machine learning and data mining techniques, for solving complex analytical problems, including those in video and real-time analytics. The goal of this article is to review the literature and practices in this emerging subject area and to explore the fundamental paradigms that are useful for realizing big data analytics. Our perspective is that Hadoop is just one such paradigm; a whole new set of others is emerging, including Bulk Synchronous Parallel (BSP)-based paradigms and graph processing paradigms, which are better suited to realizing iterative machine learning algorithms.

DOI: /big MARY ANN LIEBERT, INC. VOL. 1 NO. 4 DECEMBER 2013 BIG DATA BD207
Table 1. Important Players, Start-Ups, and Software of Big Data

Analytics:
  Analytical apps and app dev: Digital Reasoning, Klouddata, JackBe, Accretive, Tibco Spotfire, Pentaho, ParStream, Concurrent, Birst, SAS, ClearStory Data, Terracotta, MailChimp, WibiData, Palantir, Quibole
  Analytics & tools: Karmasphere, Pervasive (Actian), Zettaset, Datameer, Splunk, Alpine Data Labs, Knime, HStreaming, Skytree, Bloomreach, Versium, Alteryx, Zemetis, Gauvus, MuSigma, RevolutionR
  Analytic DBs & platforms: Teradata/Asterdata, IBM/Netezza, HP/Vertica, Pivotal/Greenplum DB, ParAccel/Actian (and Actian Vectorwise), 1010data
  Video analytics: OpenCV, Ooyala, TubeMoghul, 3VR, Video Breakouts

Computing paradigms:
  Map-Reduce: Hadoop, Spark, HaLoop, Twister MR
  Real-time/CEP: Storm, Kafka, S4, Akka
  Graph processing: GraphLab, Pregel, Apache Giraph, Stanford GPS, GoldenOrb, GraphX, Graph Search (Facebook)
  Interactive query: Dremel, Apache Drill
  Distributed SQL: Cloudera Impala, Shark

Hadoop distributions:
  Cloudera, HW, GP/Pivotal, MapR, IBM, DataStax, AWS (EMR), Intel, WanDisco

Storage:
  NoSQL/NewSQL: DataStax (Cassandra), 10gen (MongoDB), VoltDB, Couchbase, TinyDB, NuoDB, Phoenix, FoundationDB, SciDB, Neo4J, OrientDB, GraphDB, HBase, Redis, etc.
  File/storage: Appistry, AmpliStor, HDFS, RainStor, EMC Isilon, Lustre, QFS

Visualization:
  Datameer, Pentaho, Actuate, Jaspersoft, QlikTech, Tableau, Platfora, chart.io, Cirro, Cognos, Dataspora, Intellicus, Ayasdi, Zoomdata, SiSense, Centrifuge Systems, JQ Plot, D3.js, etc.

Miscellaneous:
  Software-defined networking: Nicira (VMware), Contrail (Juniper), Arista
  Data munging: Dataspora, Trifacta
  Big-data governance: Druid (Metamarkets), Infochimps, DataMarket, Timetric, Intel Rhino, Zettaset

The rest of the article is organized as follows: The next section, Big Data Analytics, gives a bird's-eye view of the three generations of realizations of machine learning algorithms.
The subsequent two sections explain two third-generation paradigms, namely Spark and GraphLab. The final section provides concluding remarks.

Big Data Analytics: Three Generations of Machine Learning Realizations

We will explain the different paradigms available for implementing machine learning (ML) algorithms, both from the literature and from the open source community. First, we would like to furnish a view of the three generations of ML tools available today:

1. The traditional ML tools for machine learning and statistical analysis, including SAS, SPSS, Weka, and the R language, allow deep analysis on smaller data sets: those that can fit in the memory of the node on which the tool runs.
2. Second-generation ML tools such as Mahout, Pentaho, and RapidMiner allow what we call a shallow analysis of big data.
3. Third-generation tools such as Spark, Twister, HaLoop, Hama, and GraphLab facilitate deeper analysis of big data.

First-generation ML tools/paradigms

The first-generation ML tools can facilitate deep analytics, as they have a wide set of ML algorithms. However, not all of them can work on large data sets, of the order of terabytes or petabytes, due to scalability limitations (limited by the nondistributed nature of the tool). In other words, they are vertically scalable (you can increase the processing power of the node on which the tool runs) but not horizontally scalable (not all of them can run on a cluster). These limitations are being addressed by building Hadoop connectors as well as by providing clustering options, meaning that vendors have made efforts to reengineer tools such as R and SAS to scale horizontally. This work would fall under the second/third-generation tools covered subsequently.
Second-generation ML tools/paradigms

The second-generation tools (we can now term the traditional ML tools such as SAS first-generation tools) such as Mahout (mahout.apache.org), RapidMiner, and Pentaho provide the ability to scale to large data sets by implementing the algorithms over Hadoop, the open source MR implementation. These tools are maturing fast and are open source (especially Mahout). Mahout has a set of algorithms for clustering and classification, as well as a very good recommendation algorithm [2]. Mahout can thus be said to work on big data, with a number of production use cases,
mainly for recommendation systems. We have also used Mahout in production systems for realizing recommendation algorithms in the financial domain and found it to be scalable, though not without issues (we had to tweak the source significantly). One observation about Mahout is that it implements only a small subset of ML algorithms over Hadoop: only 25 algorithms are production quality, with only 8 or 9 usable over Hadoop, meaning scalable over large data sets. These include linear regression, the linear support vector machine (SVM), k-means clustering, etc. It does provide a fast sequential implementation of logistic regression, with parallelized training. However, as several others have also noted (see Quora, for instance), it does not have implementations of nonlinear SVMs or multivariate logistic regression (otherwise known as the discrete choice model). Overall, this article is not intended as Mahout bashing; our point is that it is quite hard to implement certain ML algorithms, including the kernel SVM and conjugate gradient descent (note that Mahout has an implementation of stochastic gradient descent), over Hadoop. This has been pointed out by several others as well; for instance, see the paper by Srirama et al. [3]. That paper makes detailed comparisons between Hadoop and Twister Map-Reduce [4] with respect to iterative algorithms such as conjugate gradient descent (CGD) and shows that the overheads can be significant for Hadoop. What do we mean by iterative? A set of entities that perform a certain computation, wait for results from neighbors or other entities, and then start the next iteration. CGD is a perfect example of an iterative ML algorithm: each CGD iteration can be broken down into daxpy, ddot, and matmul as the primitives.
We now explain these three primitives: daxpy takes a vector x, multiplies it by a constant k, and adds another vector y to it; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and produces a vector output. This means 1 MR job per primitive, leading to 6 MR jobs per iteration and eventually hundreds of MR jobs per CGD computation, as well as a few GBs of communication, even for small matrices. In essence, the setup cost per iteration (which includes reading from HDFS into memory) overwhelms the computation for that iteration, leading to performance degradation in Hadoop MR. In contrast, Twister distinguishes between static and variable data, allowing data to stay in memory across MR iterations, and provides a combine phase for collecting all reduce outputs; hence it performs significantly better. We would also note that while Hadoop is good for embarrassingly parallel applications, certain hindrances remain for enterprise adoption, including the following:

1. Lack of Open Database Connectivity (ODBC). Many business intelligence (BI) tools are forced to build separate Hadoop connectors.
2. Data splits. If data splits are interrelated, or computation needs to access data across splits, this may involve joins and may not run efficiently over Hadoop.
3. Iterative computations. Hadoop MR is not well suited, for two reasons: one is the overhead of fetching data from HDFS for each iteration (which can be amortized by a distributed caching layer), and the second is the lack of long-lived MR jobs in Hadoop. This implies that for each iteration, new MR jobs need to be initialized; the overhead of initialization could overwhelm the computation for the iteration and cause significant performance hits.
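As an illustration, the three primitives and a CGD-style iteration built from them can be sketched in plain Python. This is a single-node sketch for exposition only (the function names follow the primitives above, not any particular library); in the Hadoop setting, each primitive call would become a separate MR job.

```python
def daxpy(k, x, y):
    """Return k*x + y, elementwise."""
    return [k * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """Dot product of vectors x and y."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matmul(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [ddot(row, x) for row in A]

def conjugate_gradient(A, b, iterations=25):
    """Solve A x = b (A symmetric positive definite) using only the
    three primitives; on Hadoop each loop body maps to ~6 MR jobs."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A*x, with x = 0
    p = list(r)                       # search direction
    rs = ddot(r, r)
    for _ in range(iterations):
        if rs < 1e-12:                # converged
            break
        Ap = matmul(A, p)             # matmul
        alpha = rs / ddot(p, Ap)      # ddot
        x = daxpy(alpha, p, x)        # daxpy
        r = daxpy(-alpha, Ap, r)      # daxpy
        rs_new = ddot(r, r)           # ddot
        p = daxpy(rs_new / rs, p, r)  # daxpy: p = beta*p + r
        rs = rs_new
    return x
```

The point of the sketch is that every iteration depends on the result of the previous one, which is exactly the pattern that forces repeated HDFS reads and job setup under Hadoop MR.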
The other second-generation tools are the traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics to scale R over Hadoop and the work to implement a scalable runtime over Hadoop for R programs [5]. SAS in-memory analytics, part of the High Performance Analytics toolkit from SAS, is another attempt at scaling a traditional tool by using a Hadoop cluster. However, the recently released version also works over Greenplum/Teradata in addition to Hadoop, and could then be seen as a third-generation approach. Other interesting work comes from a small start-up known as Concurrent Systems, which provides a Predictive Model Markup Language (PMML) runtime over Hadoop. PMML is like the XML of modeling, allowing models to be saved in descriptive language files; traditional tools such as R and SAS allow models to be saved as PMML files. The runtime over Hadoop allows these model files to be scaled over a Hadoop cluster, so this also falls under our second-generation tools/paradigms.

Third-generation ML tools/paradigms

The limitations of Hadoop and its lack of suitability for certain classes of applications have motivated some researchers to come up with alternatives. Researchers at the University of California, Berkeley have proposed Spark [6] as one such alternative; in other words, Spark could be seen as the next-generation data processing alternative to Hadoop in the big data space. The key idea distinguishing Spark is its in-memory computation, allowing data to be cached in memory across iterations/interactions. The Berkeley researchers have proposed the Berkeley Data Analytics Stack (BDAS) as a collection of technologies that help in running data analytics tasks across a cluster of nodes. The lowest-level component of the BDAS is Mesos, the cluster manager that helps with task allocation and resource management for the cluster.
The second component is the Tachyon file system, built on top of Mesos. Tachyon provides a distributed file system abstraction and interfaces for file operations across the cluster. In Spark, the computation paradigm is realized over Tachyon and Mesos in one specific embodiment, though it could be realized without Tachyon and even without Mesos for clustering. Shark, which is realized over Spark, provides an SQL abstraction over a cluster, similar to the abstraction Hive provides over Hadoop. This article explores Spark, which is the main ingredient over which ML algorithms can be built. The main motivation for Spark was that the commonly used MR paradigm, while suitable for applications that can be expressed as acyclic data flows, was not suitable for others, such as those that need to reuse working sets across iterations. So the Berkeley researchers proposed a new paradigm for cluster computing that can provide fault-tolerance guarantees similar to those of MR but is also suitable for iterative and interactive applications. The HaLoop work [7] also extends Hadoop for iterative ML algorithms. HaLoop not only provides a programming abstraction for expressing iterative applications, it also uses the notion of caching to share data across iterations and for fixpoint verification (termination of the iteration), thereby improving efficiency. Twister is another effort similar to HaLoop. The other important tool that has looked beyond Hadoop MR comes from Google: the Pregel framework for realizing graph computations [8]. Computations in Pregel comprise a series of iterations known as supersteps. Each vertex in the graph is associated with a user-defined compute function; Pregel ensures at each superstep that the user-defined compute function is invoked in parallel on each vertex. Vertices can send messages through the edges and exchange values with other vertices.
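The superstep model can be illustrated with a small single-process sketch (our own toy example, not Pregel's API): at each superstep, every vertex runs a compute step that reads the messages delivered at the previous superstep and sends new messages along its out-edges, and the computation halts once every vertex votes to halt. Here the compute function propagates the maximum vertex value through the graph, a standard Pregel example.

```python
def pregel_max(init_values, out_edges, max_supersteps=30):
    """Toy synchronous BSP loop in the style of Pregel's supersteps.
    init_values: vertex id -> initial value; out_edges: vertex id -> out-neighbors."""
    value = dict(init_values)
    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = {v: [] for v in value}
    for v in value:
        for dst in out_edges.get(v, []):
            inbox[dst].append(value[v])
    for _ in range(max_supersteps):
        outbox = {v: [] for v in value}
        active = False
        for v in value:                         # compute phase (conceptually parallel)
            m = max(inbox[v], default=value[v])
            if m > value[v]:
                value[v] = m
                active = True
                for dst in out_edges.get(v, []):
                    outbox[dst].append(m)
        inbox = outbox                          # global barrier: deliver messages
        if not active:                          # every vertex votes to halt
            break
    return value
```

On a chain 1 -> 2 -> 3 -> 4 with initial values {1: 5, 2: 1, 3: 2, 4: 0}, the value 5 reaches vertex 4 after three supersteps, one hop per superstep, which is the lock-step flavor of BSP.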
There is also a global barrier, which moves forward once all compute functions have terminated. Readers familiar with bulk synchronous parallel (BSP) computing can see why Pregel is a perfect example of BSP: a set of entities computing in parallel with global synchronization and able to exchange messages. Apache Hama [9] is an open source equivalent of Pregel, being an implementation of BSP. Hama realizes BSP over HDFS, as well as over the Dryad engine from Microsoft; it may be that its developers do not want to be seen as different from the Hadoop community. But the important point is that BSP is an inherently well-suited paradigm for iterative computations, and Hama has a parallel implementation of CGD, which, as we said, is not easy to realize over Hadoop. It must be noted that the BSP engine in Hama is realized over the Message Passing Interface (MPI), the father (and mother) of the parallel programming literature. The other projects similar to Pregel are Apache Giraph, GoldenOrb, and Stanford GPS. Google is yet to open-source Pregel, to the best of the authors' knowledge.

Third-Generation ML Tool: Spark

The key notion in Spark is the Resilient Distributed Dataset (RDD), which can be cached in memory on different nodes and used across iterations, thereby improving performance significantly. Spark also addresses fault tolerance for the RDDs: if a node crashes, there is enough information in the other RDDs to recreate or reconstruct the lost RDD partition through what is called a lineage. Spark is implemented in Scala [10], a high-level language that is similar to Java and gaining popularity. One main difference between Java and Scala is that Scala is purely object oriented, much like the classical object-oriented language Smalltalk.
Scala also unifies functional programming (of the Lisp kind) with object-oriented programming. This means that some classes may inherit from functions [e.g., the array type of Scala inherits from functions and is written as A(i)]. Spark offers several alternatives for building RDDs:

1. RDDs can be built from a file in HDFS.
2. They can be built by parallelizing a Scala collection (slicing it, as one would an array).
3. RDDs can be built by transforming existing RDDs (specifying operations such as map, filter, or join).
4. RDDs can also be built by changing the persistence or saving options of existing RDDs (save allows an RDD to be saved to HDFS; cache is a caching construct).

It must be noted that the cache construct is only a hint: if there is not enough memory across the cluster, Spark will reconstruct the RDD on the fly. This implies that Spark can scale (handle increasing data sizes with reduced performance) and is fault tolerant. RDDs can be used in actions, operations that return a value or export data to storage; examples of such actions include count, collect, save, and reduce. Parallelism on RDDs is facilitated by constructs like foreach, reduce, and collect together with user-defined functions, which are first-class objects in Scala. Reduce is an associative function that combines the data set elements to produce a result at the driver program. The collect construct collects all elements of the data set at the driver program. An array can be updated by parallelize, map, and collect operations. The foreach construct allows a user-provided function to be executed on each element of an RDD in parallel.
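The flavor of RDDs — lazy transformations recorded as a lineage, actions forcing evaluation, and caching as a hint rather than a guarantee — can be mimicked in a few lines of single-node Python (the class and method names below are ours, not Spark's API):

```python
from functools import reduce as _reduce

class ToyRDD:
    """Single-node imitation of an RDD: each transformation records how to
    recompute its data from the parent (the lineage), so a 'lost' data set
    can always be rebuilt by re-running the chain of closures."""

    def __init__(self, recompute):
        self._recompute = recompute     # lineage: closure that rebuilds the data
        self._cache = None

    @staticmethod
    def parallelize(data):
        items = list(data)
        return ToyRDD(lambda: list(items))

    # --- transformations (lazy: nothing runs until an action) ---
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, p):
        return ToyRDD(lambda: [x for x in self._materialize() if p(x)])

    def cache(self):
        self._cache = self._recompute()  # a hint: could be dropped under memory pressure
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._recompute()

    # --- actions (force evaluation) ---
    def collect(self):
        return self._materialize()

    def count(self):
        return len(self._materialize())

    def reduce(self, f):
        return _reduce(f, self._materialize())
```

For instance, `ToyRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)` performs no work until `collect()` is called, and dropping the cache simply forces recomputation through the lineage, which is the intuition behind Spark's fault tolerance.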
Programmers can pass functions or closures to invoke the map, filter, and reduce operations in Spark. Normally, when Spark runs these functions on worker nodes, the local variables within the scope of the function are copied. Spark has the notion of shared variables for emulating globals, using broadcast variables and accumulators. Broadcast variables are used by the programmer to copy read-only data once to all the workers (static matrices in CGD-like algorithms can be broadcast variables). Accumulators are variables that can only be added to by the workers and read by the driver program: parallel sums can thus be realized fault-tolerantly. It must be noted that the programming interface of Spark is similar to that of DryadLINQ [11]. However, DryadLINQ does not have the concept of RDDs or any way for data to be shared across iterations, the main differentiator of Spark. Spark is emerging as a promising Hadoop alternative in the big data analytics space, with a number of production use cases, mainly from start-ups and small and medium enterprises. We have conducted performance studies of Spark compared with second-generation paradigms. In this direction, we have implemented several ML algorithms over Spark, including k-means clustering, a commonly used algorithm for clustering data sets. The performance comparison of Mahout's k-means algorithm realized over Hadoop versus the k-means algorithm realized over Spark is given in Table 2. The performance studies were done over a three-node cluster, each node being a 32-core Xeon with 32 GB RAM and 32 GB swap. They tell us that the Spark implementation can be significantly faster and would scale better over a cluster of nodes than the Mahout implementation. The main reason for the slow performance of Mahout is its use of sequence files and the consequent disk accesses, whereas Spark performs in-memory computation with the RDDs.

Table 2. Performance Comparison of Mahout's k-means Algorithm Realized over Hadoop Versus the k-means Algorithm Realized over Spark (columns: number of points clustered, in millions; time taken by Spark and by Mahout, in seconds; values not reproduced)

Figure 1 shows the end-to-end performance of the logistic regression algorithm in Mahout and Spark on the same three-node cluster. This comparison is in line with the comparison made by the AMPLab team on the Spark website for logistic regression.

Third-Generation ML Tool: GraphLab

While Pregel is good at graph-parallel abstraction, is easy to reason with, and ensures deterministic computation, it leaves it to the user to architect the movement of data. Further, like all BSP systems, it suffers from the curse of the slow jobs, meaning that even a single slow job (which could be due to load fluctuations or other reasons) can slow down the whole computation. To alleviate some of these issues, GraphLab has been proposed; the paper appeared in the Proceedings of the VLDB Endowment [12]. The main motivation for GraphLab was to build an asynchronous graph processing paradigm, one that is not affected by the curse of the slow job (asynchrony implies that barrier synchronization is not needed). It also allows dynamic iterative computations. User-defined update functions live on each vertex and transform data in the scope of the vertex. An update function can choose to trigger neighbor updates (for example, it can be triggered only if the rank changes drastically in a PageRank-like algorithm) and can run without global synchronization. Importantly, while Pregel lives with sequential dependencies on the graphs, GraphLab allows parallelism, which is important in certain ML algorithms, including collaborative filtering.
GraphLab has implemented several ML algorithms, including alternating least squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc. The paper also shows significant speed-ups compared with Hadoop for an expectation maximization algorithm. GraphLab2 [14] is a new asynchronous shared-memory abstraction in which the user-defined vertex programs share a distributed graph, which has data associated with both vertices and edges. Each vertex program can access data associated with the vertex, its incident edges, or neighboring vertices. It can also schedule computation to be run on the neighboring vertices in the future. Serializability is ensured, as GraphLab does not allow neighboring vertex programs to run simultaneously. It characterizes a graph computation as comprising three phases: gather (where values from neighbors are gathered by the vertex program), apply (where these values are applied to the current vertex; the sum or union of data on this and neighboring vertices and edges can be specified), and scatter (where neighboring vertex programs can be scheduled to
take the new value of this vertex).

[FIG. 1. Logistic regression comparison of Spark versus Mahout.]

The gather and scatter phases can determine appropriate fan-in and fan-out of the vertices. By separating these out, GraphLab can be efficient on high fan-in or high fan-out edges, which occur in power-law graphs. The main challenges of natural graphs include the asymmetric fan-in/fan-out distribution, which leads to work imbalance in systems (such as Pregel) that treat vertices uniformly. The partitioning mechanism is another challenge: since Pregel and GraphLab1 use hash-based random partitioning, they are not well suited to natural graphs. Handling natural graphs also requires a parallelism abstraction within individual vertices, which is not provided in systems such as Pregel. GraphLab2 provides new ways of partitioning a power-law graph (where high-degree vertices could limit parallelism) by defining an abstraction known as PowerGraph. Vertices are associated with nodes, while edges can span multiple nodes. PowerGraph gets the best of both Pregel (the associative, commutative gather concept) and GraphLab1 (the shared-memory abstraction). PowerGraph introduces novel vertex cuts, which allow it to represent the graph better in a distributed system than GraphLab1 or Pregel can. PowerGraph factors the vertex program into gather, sum, apply, and scatter functions and can hence distribute the vertex program across the nodes of a distributed system. The gather function is invoked on the edges adjacent to the current vertex, depending on whether the gather_nbrs parameter is set to none, all, in, or out. The sum is a commutative and associative operator. The apply function computes a new value of the vertex.
The scatter function is invoked on the neighbors (again based on the scatter_nbrs parameter). PowerGraph has both a synchronous execution mode (resembling Pregel) and an asynchronous execution mode (resembling Piccolo). However, asynchronous execution is nondeterministic in Piccolo, which can lead to instability or nonconvergence for certain ML algorithms, such as statistical simulations [13]. In contrast, PowerGraph enforces serializability, which implies that every parallel execution has a corresponding serial execution order. While GraphLab1 provided the same guarantee, it was inefficient and unfair to high-degree vertices due to a sequential locking protocol. PowerGraph introduces a new parallel locking protocol that is fair to high-degree vertices, allowing parallelization of the vertex program across a cluster. The PowerGraph paper also shows significant performance benefits for the PageRank and triangle counting problems on the Twitter graph. The different graph processing paradigms are compared and contrasted in Table 3 for quick reference. Table 4 carries a comparison of the various paradigms across different nonfunctional features such as scalability,

Table 3. Comparison of Graph Processing Paradigms

GraphLab1
  Computation: asynchronous
  Determinism and its effects: deterministic (serializable computation)
  Efficiency vs. asynchrony: uses inefficient locking protocols
  Efficiency of processing power-law graphs: inefficient; the locking protocol is unfair to high-degree vertices

Pregel
  Computation: synchronous (Bulk Synchronous Parallel based)
  Determinism and its effects: deterministic (serial computation)
  Efficiency vs. asynchrony: NA
  Efficiency of processing power-law graphs: inefficient; curse of slow jobs

Piccolo
  Computation: asynchronous
  Determinism and its effects: nondeterministic (nonserializable computation)
  Efficiency vs. asynchrony: efficient, but may lead to incorrectness
  Efficiency of processing power-law graphs: may be efficient

GraphLab2 (PowerGraph)
  Computation: asynchronous
  Determinism and its effects: deterministic (serializable computation)
  Efficiency vs. asynchrony: uses parallel locking for efficient, serializable, and asynchronous computations
  Efficiency of processing power-law graphs: efficient; parallel locking and other optimizations for processing natural graphs
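To make the gather/sum/apply/scatter factoring concrete, here is a small single-node PageRank sketch organized around those phases. This is our own illustrative code; PowerGraph's actual API, its scatter-driven scheduling, and its distributed vertex cuts are not reproduced.

```python
def gas_pagerank(out_edges, iterations=30, d=0.85):
    """PageRank written as per-vertex gather/apply phases.
    out_edges maps each vertex to its list of out-neighbors."""
    vertices = list(out_edges)
    in_edges = {v: [] for v in vertices}
    for u, nbrs in out_edges.items():
        for v in nbrs:
            in_edges[v].append(u)
    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # gather + sum: combine in-neighbor contributions with a commutative,
        # associative sum -- the property that lets PowerGraph split the
        # gather of a high-degree vertex across machines
        acc = {v: sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
               for v in vertices}
        # apply: compute the new value of each vertex
        rank = {v: (1 - d) + d * acc[v] for v in vertices}
        # scatter: would re-activate only neighbors whose value changed enough;
        # omitted here, since we simply run a fixed number of iterations
    return rank
```

On a directed 3-cycle the ranks settle at 1.0 each, while in a star-shaped graph the hub accumulates a higher rank than the leaves, illustrating the high fan-in vertices that make the commutative/associative gather worth distributing.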
fault tolerance, and the algorithms that have been implemented.

Table 4. Comparison of the Various Paradigms Across Different Nonfunctional Features

First generation
  Examples: SAS, R, Weka, SPSS in native form
  Scalability: vertical
  Algorithms available: huge collection of algorithms
  Algorithms not available: practically nothing
  Fault tolerance: single point of failure

Second generation
  Examples: Mahout, Pentaho, Revolution R, SAS In-memory Analytics (Hadoop)
  Scalability: horizontal (over Hadoop)
  Algorithms available: small subset: sequential logistic regression, linear SVMs, stochastic gradient descent, k-means clustering, random forests, etc.
  Algorithms not available: a vast number: kernel SVMs, multivariate logistic regression, conjugate gradient descent, ALS, etc.
  Fault tolerance: most tools are FT, as they are built on top of Hadoop

Third generation
  Examples: Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, GoldenOrb, Stanford GPS, ML over Storm
  Scalability: horizontal (beyond Hadoop)
  Algorithms available: much wider: including conjugate gradient descent (CGD), alternating least squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc.
  Algorithms not available: multivariate logistic regression in general form, k-means clustering, etc.; work is in progress to expand the set of algorithms available
  Fault tolerance: FT: HaLoop, Spark; not FT: Pregel, GraphLab, Giraph

FT, fault-tolerant; GPS, graph processing system; ML, machine learning; SVM, support vector machine.

It can be inferred that while the traditional tools work on only a single node, may not scale horizontally, and may have a single point of failure, recent reengineering efforts have moved them across generations. The other point to be noted is that most of the graph processing paradigms are not fault tolerant, while Spark and HaLoop are among the third-generation tools that do provide fault tolerance.
Conclusions

This article has provided a comprehensive review of the three generations of tools/paradigms that realize machine learning algorithms for big data. The first-generation tools, which include SAS and SPSS, can help in deep analytics but may only scale vertically. The second-generation tools such as Mahout and RapidMiner can scale horizontally but may be limited by the Hadoop MR paradigm over which they are realized (in terms of the number of algorithms that can be implemented nonserially). The third-generation realizations such as Spark and GraphLab, among others, promise the most in terms of horizontal scalability and the large number of ML algorithms that can potentially be implemented. However, it will take plenty of effort, time, and widespread adoption and industrial use to make sure that these third-generation tools realize their true potential and allow deep analytics of big data. It must be noted that only Spark has production use cases, with the other tools starting to be used for testing/development. This article has also presented performance studies of ML algorithms realized over Spark and of Mahout over Hadoop, illustrating the power of the third-generation paradigm.

Acknowledgments

The authors wish to thank Impetus management, including Pankaj Mittal and Vineet Tyagi, for providing encouragement and support. We also wish to thank Dr. Nitin Agarwal and Joydeb Mukherjee from the Data Sciences Practice team at Impetus. We also wish to express our gratitude to Gurvinder Arora, our technical writer, for reviewing the document and improving the language.

Author Disclosure Statement

No financial conflicts exist.

References

1. Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 2008; 51.
2. Ekstrand MD, Ludwig M, Konstan JA, Riedl JT. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). New York: ACM, 2011.
3. Srirama SN, Jakovits P, Vainikko E. Adapting scientific computing problems to clouds using MapReduce. Future Gener Comput Syst 2012; 28.
4. Ekanayake J, Li H, Zhang B, et al. Twister: A runtime for iterative MapReduce. Proceedings of the 19th ACM International
8 PARADIGMS ML ALGORITHMS Symposium on High Performance Distributed Computing. Chicago, IL: Venkataraman S, Roy I, Young AA, Schreiber RS. Using R for iterative and incremental processing. Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing (HotCloud 12). Berkeley, CA: USENIX Association, 2012, p Zaharia M, Chowdhury M, Franklin MJ, et al. Spark: cluster computing with working sets. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud 10). Berkeley, CA: USENIX Association, 2010, p Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: Efficient iterative data processing on large clusters. Proceedings VLDB Endowment 2010; 3: Malewicz G, Malewicz G, Austern MH, et al. Pregel: A system for large-scale graph processing. Proceedings of SIGMOD International Conference on Management of Data (SIGMOD 10). New York: Association for Computing Machinery (ACM), 2010, pp Seo S, Yoon EJ, Kim J, et al. HAMA: An efficient matrix computation with the MapReduce framework. Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CLOUDCOM 10). Washington, DC: IEEE Computer Society, 2010, pp Odersky M, Spoon L, Venners B. Programming in Scala, 2nd ed. Walnut Creek, CA: Artima Publishers, Yu Y, Isard M, Fetterly D, et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 08). Berkeley, CA: USENIX Association, 2008, pp Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 2012; 5: Gonzalez J, Low Y, Gretton A, Guestrin C. Parallel gibbs sampling: From colored fields to thin junction trees. AISTATS 2011; 15: Gonzalez JE, Low Y, Gu H, et al. PowerGraph: Distributed graph-parallel computation on natural graphs. 
Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). Berkeley, CA: USENIX Association, 2012.

Address correspondence to:

Vijay Srinivas Agneeswaran, PhD
Innovation Labs
Impetus Infotech India Private Limited
Pritech Park SEZ
Bellandur Outer Ring Road
Bangalore, Karnataka, India
[email protected]

This work is licensed under a Creative Commons Attribution 3.0 United States License. You are free to copy, distribute, transmit, and adapt this work, but you must attribute this work as Big Data, Vol. 1, No. 4, December 2013, Copyright 2013 Mary Ann Liebert, Inc., used under a Creative Commons Attribution License: by/3.0/us/
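The iterative-algorithm limitation summarized in the Conclusions (second-generation tools bounded by Hadoop MR relaunching a job and rescanning the input on every iteration, while third-generation engines such as Spark cache the working set in memory) can be sketched without a cluster. The plain-Python toy below is our own illustration, not code from any of the cited systems: it runs 1-D k-means as one simulated MR job per iteration and counts how often the input is (re)read from "disk" with and without Spark-style caching.

```python
from functools import reduce

def map_reduce(records, mapper, reducer):
    # One simulated MR "job": map each record to (key, value) pairs,
    # shuffle (group by key), then reduce each group pairwise.
    groups = {}
    for rec in records:
        for k, v in mapper(rec):
            groups.setdefault(k, []).append(v)
    return {k: reduce(reducer, vs) for k, vs in groups.items()}

def read_input(data, io_counter):
    # Stand-in for an HDFS scan; io_counter tracks how often the
    # full data set is (re)read from "disk".
    io_counter[0] += 1
    return list(data)

def kmeans(data, centers, iters, cached=False):
    # Toy 1-D Lloyd's k-means, one MR job per iteration.
    # cached=False re-reads the input every iteration (Hadoop MR style);
    # cached=True reads it once and keeps it in memory (Spark RDD style).
    io = [0]
    in_memory = read_input(data, io) if cached else None
    for _ in range(iters):
        records = in_memory if cached else read_input(data, io)

        def mapper(x, centers=centers):
            # Emit (nearest-center-index, (point, 1)) for the reduce side.
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            yield nearest, (x, 1)

        totals = map_reduce(records, mapper,
                            lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # New center = mean of assigned points; empty clusters keep their center.
        centers = [s / n for s, n in
                   (totals.get(i, (centers[i], 1)) for i in range(len(centers)))]
    return centers, io[0]

points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
mr_centers, mr_reads = kmeans(points, [0.0, 5.0], iters=10)
spark_centers, spark_reads = kmeans(points, [0.0, 5.0], iters=10, cached=True)
print(mr_centers, mr_reads)        # same centers, 10 input scans
print(spark_centers, spark_reads)  # same centers, 1 input scan
```

Both runs converge to the same centers; the only difference is the I/O count, which is exactly the gap the third-generation paradigms close for iterative ML workloads.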