
Functional Programming and Big Data
by Glenn Engstrand (September 2014)
http://glennengstrand.info/analytics/fp

What is Functional Programming? It is a style of programming that emphasizes immutable state, higher order functions, the processing of data as streams, lambda expressions, and concurrency through Software Transactional Memory.

What is Clojure? It is a Functional Programming language, similar to Lisp, that runs in the Java Virtual Machine. Additional features include: dynamic programming, lazy sequences, persistent data structures, multimethod dispatch, deep Java interoperation, atoms, agents, and references.

In a previous blog, I shared a learning adventure where I wrote a news feed service in Clojure and tested it, under load, on AWS. One of the biggest impressions that Clojure made on me was how its collection oriented functions, such as union, intersection, difference, map, and reduce, made it well suited for data processing. I wondered how well that data processing suitability would translate to a distributed computing environment. http://clojure.org/

Recently, I attended OSCON 2014, a tech conference that focuses on open source software. One of the tracks was on Functional Programming. Two of the presentations that I attended talked about big data and mentioned open source projects that integrate Clojure with Hadoop. This blog explores functional programming and Hadoop by evaluating those two open source projects.

What is Big Data? Systems whose data flows at high volume and fast velocity, comes in a wide variety, and whose veracity is significant are attributed as Big Data. Usually this translates to petabytes per day. Big data must be processed in a distributed manner, and big data clusters usually scale horizontally on commodity hardware. The Apache Hadoop project is the most popular enabling technology. A typical Big Data project is built with either Cloudera or Hortonworks offerings. Usually OLAP serves as the front end where the processed data is analyzed. http://glennengstrand.info/software/architecture/oss/clojure

In one of those presentations, entitled "Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries," two engineers at Netflix talked about many of the open source projects that they use in their big data pipeline. One of those projects is called Pig Pen. For anyone with experience with report writers or SQL, Pig Pen will look pretty easy to understand. Commands like load tsv, map, filter, group by, and store seem intuitive at first glance. https://github.com/netflix/pigpen/

From an architectural perspective, Pig Pen is an evil hack. Your Clojure code gets interpreted as a User Defined Function in a Pig script that loads and iterates through the specified input file. First, you have to run your code to generate the script. Then, you have to remove the part of the project specification that identifies the main procedure, compile your code to create an uberjar, and copy that jar to a place where the Pig script can load it. That, and the fact that this was the slowest running and worst documented project that I evaluated, makes Pig Pen pretty much a non-starter as far as I am concerned.

Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries http://www.oscon.com/oscon2014/public/schedule/detail/34159
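Stripped of the Hadoop machinery, those Pig Pen commands (load tsv, map, filter, group by, store) correspond to ordinary collection operations. A rough analogue in plain Python, where the event format and field layout are made up for illustration and this is not Pig Pen code:

```python
from collections import defaultdict

def pipeline(lines):
    """load tsv -> map -> filter -> group by -> store, as plain collection operations."""
    rows = [line.split("\t") for line in lines]           # load tsv
    actions = [(row[0], row[1]) for row in rows]          # map: (action, user)
    posts = [a for a in actions if a[0] == "post"]        # filter
    groups = defaultdict(int)
    for action, user in posts:                            # group by user, count
        groups[user] += 1
    return dict(groups)                                   # store (here: just return)

# Hypothetical input resembling a news feed event log.
events = ["post\talice", "like\tbob", "post\talice", "post\tbob"]
print(pipeline(events))  # {'alice': 2, 'bob': 1}
```

In Pig Pen, each of those stages becomes part of a generated Pig script rather than an in-memory list, but the mental model is the same.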

Cascalog was the other recommended open source project that integrates Clojure with Hadoop. Cascalog is built on top of Cascading, another open source project for managing ETL within Hadoop. http://cascalog.org/

Cascalog takes advantage of Clojure's homoiconicity: you declare your query, which is run later in a multi-pass process. Many things, such as grouping, are not explicitly specified; the system just figures them out. The order of statements doesn't matter either. http://www.cascading.org/

At first glance, Cascalog looks like idiomatic Clojure, but it really isn't. It is declarative instead of imperative. It looks to be pure Functional Programming but, under the covers, it isn't. More on this in a future blog.

Data Workflows for Machine Learning http://www.oscon.com/oscon2014/public/schedule/detail/34913

Apache Spark is a relatively new project that is gaining a lot of traction in the Big Data world. It can integrate with Hadoop but can also stand alone. The only Functional Programming language supported by Spark is Scala. http://www.scala-lang.org/

If you are coming to Scala from Java, or just have little FP experience, then what you might notice most about Scala is lambda expressions, streams, and for comprehensions. If you are already familiar with Clojure, then what you might notice most about Scala is strong typing. https://spark.apache.org/

Spark has three main flavors: map reduce, streaming, and machine learning. What I am covering here is the map reduce functionality. I used the Spark shell, which is more about learning and interactively exploring locally stored data. Spark was, by far, the fastest technology and took the fewest lines of code to run the same query. Like Pig Pen, map reduce Spark was pretty limited in reduce side aggregation. All I could do was maintain counters in order to generate histograms. Cascalog had much more functional and capable reduce side functionality.

What is reduce side aggregation? There are two parts to the map reduce approach to distributed computing: map side processing and reduce side processing. The map side is concerned with taking the input data in its raw format and mapping that data to one or more key value pairs. The infrastructure performs the grouping functionality, where the values are grouped together by key. The reduce side then does whatever processing it needs to do on that per key grouping of values.

Why is map reduce Spark more limited in its reduce side aggregation capability than Cascalog? In the map reduce flavor of Apache Spark, the reducer function takes two values and is expected to return one value, which is the reduced aggregation of the two input values. The infrastructure then keeps calling your reducer function until only one value is left (or one value per key, depending on which reduce method you use). The reducer function in Cascalog gets called only once for each key, but the input parameter is a sequence of tuples, each tuple representing the mapped data that is a value associated with that key.
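The difference between the two reducer contracts can be sketched in plain Python. The function names here are illustrative, not the actual Spark or Cascalog APIs:

```python
from functools import reduce
from itertools import groupby

# Mapped key/value pairs, after the infrastructure has grouped them by key.
pairs = [("post", 1), ("like", 1), ("post", 1), ("post", 1)]
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(pairs), key=lambda kv: kv[0])}

# Spark-style contract: a pairwise reducer, called repeatedly per key
# until only one value remains. It must combine exactly two values.
def pairwise_reducer(a, b):
    return a + b

spark_style = {k: reduce(pairwise_reducer, vs) for k, vs in grouped.items()}

# Cascalog-style contract: the reducer is called once per key with the
# whole sequence of values, so it can compute anything over that sequence.
def sequence_reducer(values):
    return {"count": len(values), "sum": sum(values)}

cascalog_style = {k: sequence_reducer(vs) for k, vs in grouped.items()}
print(spark_style)  # {'like': 1, 'post': 3}
```

The pairwise contract forces the aggregation to be expressible as an associative two-argument function, which is why maintaining counters was about all that was practical there; the sequence contract permits arbitrary per-key computation.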

I wrote the same job, generating the same report, in all three technologies, but Spark and Scala were the fastest. Is that a fair comparison? You have to run both the Pig Pen job and the Cascalog job in Hadoop. For my evaluation, I was using single node mode. Spark is different. You don't run it inside of Hadoop, and I am pretty sure that it isn't doing much, or anything, with Hadoop locally. Large Spark map reduce jobs would have to integrate with Hadoop, so this may not be a fair comparison.

But the real test of any technology isn't how you can use it but rather how you can extend its usefulness. For Cascalog, that comes in the form of custom reduce side functions. In this area, Pig Pen was completely broken. In the next blog, I will review my findings from reproducing the News Feed Performance map reduce job, used to load the Clojure news feed performance data into OLAP, with a custom function written in Cascalog. In that blog, I will also cover how Spark streaming can be made to reproduce the News Feed Performance map reduce job in real time.

In this article, I explored Functional Programming in Big Data by evaluating three different FP technologies for Big Data. I did this by writing the same Big Data report job in those technologies, then comparing and contrasting my results.

Hive is still the most popular way to aggregate Big Data in Hadoop. What would the job used to evaluate those technologies have looked like in Hive? First, you would have had to load the raw text data into HDFS.

bin/hadoop dfs -put perfpostgres.csv /user/hive/warehouse/perfpostgres

Then you would have had to define that raw text data to Hive.
create external table news_feed_perf_raw (year string, month string, day string, hour string, minute string, entity string, action string, duration string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/user/hive/warehouse/perfpostgres';

Here is the map part of the processing.

insert overwrite local directory 'postactions' select concat(year, ',', month, ',', day, ',', hour, ',', minute) as ts from news_feed_perf_raw where action = 'post';

Here is the reduce part of the processing.

insert overwrite local directory 'perminutepostactioncounts' select ts, count(1) from postactions group by ts;

http://glennengstrand.info/analytics/oss
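For comparison, the same per-minute post count can be sketched in a few lines of plain Python, assuming the CSV column layout declared in the Hive table above. This is a local sketch for illustration, not the code evaluated in this article:

```python
from collections import Counter

def per_minute_post_counts(csv_lines):
    """Map: keep the timestamp of each 'post' row. Reduce: count rows per minute."""
    counts = Counter()
    for line in csv_lines:
        year, month, day, hour, minute, entity, action, duration = line.split(",")
        if action == "post":
            counts[",".join((year, month, day, hour, minute))] += 1
    return counts

# Hypothetical rows matching the news_feed_perf_raw layout.
rows = [
    "2014,9,1,10,5,feed,post,12",
    "2014,9,1,10,5,feed,post,9",
    "2014,9,1,10,6,feed,like,3",
]
print(per_minute_post_counts(rows))  # Counter({'2014,9,1,10,5': 2})
```

The map step and the reduce step here mirror the two Hive insert statements: one projects a timestamp from the filtered rows, the other groups and counts by that timestamp.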