Big Data and Scripting: Systems built on top of Hadoop

1, Big Data and Scripting: Systems built on top of Hadoop

2, Pig/Latin: high-level MapReduce programming platform
- Pig is the name of the system, Pig Latin is the provided programming language
- Pig Latin is similar to query languages like SQL, but still procedural [1] (in contrast to SQL)
- extendable using various languages
- originally developed at Yahoo, moved to the Apache Software Foundation in 2007
- pig.apache.org

[1] commands describe actions to execute, not the desired result

3, Pig/Latin overview
- execute commands that can be run on a Hadoop cluster
- simple, easy-to-learn language
- enables rapid prototyping of MapReduce applications
- uses the MapReduce cluster similar to a database system
- interactive or batch mode
- commands are translated into Hadoop jobs and executed on the Hadoop system

4, Pig/Latin concepts
- commands (operators) act on relations; a relation is typically a CSV file from HDFS
- example: assign a relation with named fields to variable A
  A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
- further operators then transform relations into other relations
- example: group items by field age
  B = GROUP A BY age;
- lazy evaluation [2]
- relations can have schemas (column names and types); schemas can be used to ensure type safety

[2] nothing is executed until needed

5, extending Pig/Latin
- new functions can be added by implementing them in various languages: Java, Python, JavaScript, Ruby
- most extensive support is provided in Java: extend class org.apache.pig.EvalFunc, then register in Pig: REGISTER myudfs.jar;
- Java programs can run natively on Hadoop
- special interfaces allow more efficient integration of special types of functions
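To make the Java route concrete, here is a minimal UDF sketch; the class name Upper and the jar name myudfs.jar are illustrative choices, not part of the slides:

```java
// Minimal Pig UDF sketch: upper-cases a chararray field.
// Class name (Upper) and jar name (myudfs.jar) are illustrative.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty or null input tuples.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

After packaging the class into myudfs.jar, it could be called from Pig Latin as: REGISTER myudfs.jar; B = FOREACH A GENERATE Upper(name);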

6, Apache Hive
- distributed data warehouse allowing queries and transformations
- uses various file systems as backend (HDFS, Amazon S3, ...)
- SQL-like query language HiveQL
- execution by translation into MapReduce jobs
- indexing to accelerate queries
- command line interface Hive CLI
- originally developed by Facebook, turned into an Apache project
- hive.apache.org

7, HiveQL examples
- create a table with two columns:
  hive> CREATE TABLE student (sid STRING, sname STRING)
      > ROW FORMAT DELIMITED
      > FIELDS TERMINATED BY ',';
- tables correspond to directories in the underlying file system, stored as CSV files
- load some data:
  hive> LOAD DATA INPATH '/tmp/students.txt' INTO TABLE student;
- imports the file content into Hive's storage
- dropping the table deletes data and index from Hive storage, but does not affect external data
- query the table: SELECT * FROM student;

8, HiveQL examples
- multiple tables can be joined (only equality joins):
  SELECT * FROM student JOIN scores ON (student.sid = scores.sid);
- many standard SQL statements are available, e.g. GROUP BY:
  INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
- grouping and aggregation; the result is written into a new table
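Besides the Hive CLI, such statements can also be issued programmatically; a minimal sketch using the HiveServer2 JDBC driver (the host/port, empty credentials and the student table from above are assumptions):

```java
// Minimal HiveServer2 JDBC sketch; server address and credentials are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver must be on the classpath.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT sid, sname FROM student")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```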

9, Hive data organization
- top-level organization: databases containing tables
- tables correspond to (top-level) directories
- tables are divided into partitions: subdirectories of the table directory
- tables can be partitioned by an arbitrary column:
  CREATE TABLE tbl (col1 INT, col2 STRING) PARTITIONED BY (col3 DATE);
- partitions are divided into buckets: further breakdowns of partitions that allow better organization with respect to MapReduce
- the actual data is stored in flat files; arbitrary formats can be used, described via regular expressions

10, Hive summary
- brings together SQL functionality and the scaling features of Hadoop
- supports a subset of the table operations specified by SQL
- no low-latency queries; optimized for scalability
- storage in flat files on a distributed file system
- querying/processing by translation into MapReduce jobs
- storage extended by indexing
- due to distributed storage: no individual updates

11, HBase
- sparse, distributed, persistent, multidimensional sorted map
- implementation of Google's Bigtable [3] idea
- uses HDFS and Hadoop for storage and execution, plus ZooKeeper
- implements servers for administration and storage/computation
- maps keys to values: keys are structured, values are arbitrary
- implements random read/write access on top of HDFS
- provides consistency (on certain levels)
- accessible via shell or Java API
- hbase.apache.org

[3] research.google.com/archive/bigtable.html

12, HBase structure
- keys are stored sorted, allowing range queries
- keys are highly structured into: rowkey, column family, column, timestamp
- tables are stored sparsely: missing values are not encoded, simply not stored; every value has to be stored with its full address
- data is distributed, load balancing is automated
- consistency is guaranteed at row-key level: all changes within one rowkey are atomic, and all data of one rowkey is stored on a single machine

13, HBase storage and access
- data is partitioned by keys
- column families define storage properties; columns are only labels for the corresponding values
- principal operations:
  - put: insert/update a value
  - delete: delete a value
  - get: retrieve a single value
  - scan: retrieve a collection of values (sequential reading)
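A minimal sketch of these operations in the HBase Java client API; the table name, column family and row values are made up for the example, and a reachable HBase cluster is assumed:

```java
// Minimal HBase Java client sketch for put/get/scan/delete.
// Table "student" and column family "info" are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("student"))) {

            // put: insert/update a value under (rowkey, column family, column)
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // get: retrieve a single row
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            // scan: sequential reading over the (sorted) key range
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }

            // delete: remove the row again
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```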

14, HBase guarantees
- atomicity:
  - mutations are atomic within a row; the operation result is reported
  - not atomic over multiple rows (parts may fail, others succeed)
- consistency and isolation:
  - returned rows consist of complete rows; the contained data may have changed in between
  - the data returned refers to a single point in the past [4]
  - scans are not consistent over multiple rows: different rows may refer to different points in time

[4] HBase keeps data for various time points

15, HBase guarantees
- visibility:
  - after successful writing, data is immediately visible to all clients
  - versions of rows strictly increase
- durability:
  - refers to data being stored on disk
  - data that has been read from a cell is guaranteed to be durable
  - successful operations are durable, failed operations are not
- visibility and durability may be tuned for performance:
  - individual reads without visibility guarantees
  - instead of durability, only periodic writing of data

16, comparison
- Pig Latin: allows viewing data as tables, provides ad hoc queries, extendable to arbitrary MapReduce jobs
- Hive: tries to provide SQL functionality; slow, large-scale queries; structured query language, query planner
- HBase: more like a NoSQL database or key/value store; no SQL operations, only storage and retrieval; guarantees for operations; optimized for random, real-time access
- note: Pig and Hive can access data from HBase directly
- note: Cassandra is a database system similar to HBase, optimized for security

17, Mahout
- scalable implementations of data mining/machine learning algorithms
- provides a library for easy access to machine learning implementations
- provides algorithms for the most common problems, e.g.: clustering, classification, frequent pattern mining, ...
- optimized for practical (e.g. business) usage
- language: Java
- mahout.apache.org
- note: Mahout is currently switching from MapReduce to Spark

18, Mahout overview
- provides a large API of Java classes
- integration into other applications
- execution on top of a distributed cluster [5]
- implementations can be adapted to specific problems: provide individual I/O classes, individual similarity/distance functions, ...
- integration of Apache Lucene (document search engine)

[5] e.g. Hadoop or Spark

19, Systems beyond Hadoop

20, ZooKeeper
- distributed coordination service
- many problems/functions are shared among distributed systems; ZooKeeper provides a single implementation of these, avoiding repeated implementation of the same services
- provides primitives for synchronization, configuration maintenance, naming
- optimized for failure tolerance, reliability and performance
- used in other projects as a sub-service
- another Apache top-level project (zookeeper.apache.org)

21, ZooKeeper
- provides tree-like information storage
- update guarantees:
  - sequential consistency (update order is kept)
  - atomicity
  - single system image (one state for all views)
  - reliability (applied updates persist)
  - timeliness (time bounds for updates)
- extremely simple interface:
  - create/delete
  - test node existence
  - get/set data
  - get children
  - sync (wait for updates to propagate)
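A minimal sketch of this interface in the ZooKeeper Java client; the server address and znode path are assumptions, and a real client would wait for the connection event before issuing operations:

```java
// Minimal ZooKeeper Java client sketch; address and paths are illustrative.
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkOps {
    public static void main(String[] args) throws Exception {
        // Connect with a 3 s session timeout and an empty watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // create: add a node holding some configuration data
        zk.create("/demo", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // test node existence, then get/set data
        Stat stat = zk.exists("/demo", false);
        byte[] data = zk.getData("/demo", false, stat);
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // get children of the root node
        List<String> children = zk.getChildren("/", false);
        System.out.println(children);

        // delete (version -1 matches any version)
        zk.delete("/demo", -1);
        zk.close();
    }
}
```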

22, Mesos [6]
- Hadoop: uses one (physical) cluster of machines exclusively
- Mesos: shares a physical cluster between multiple distributed systems
- implements an intermediate layer between distributed frameworks and hardware:
  - administrates physical resources
  - distributes them to the involved frameworks
  - improves cluster utilization
  - implements prioritization and failure tolerance

[6] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Hindman et al., 2011

23, Mesos example scenarios
- multiple Hadoop systems on the same (physical) set of machines:
  - a production system that takes priority
  - testing implementations, or analyses that are of general interest but should not disturb the production system
  - testing new versions of Hadoop
  - all involved Hadoop instances use the same data as input
- different distributed frameworks on the same cluster:
  - different tasks benefit from different optimization approaches; the MapReduce approach is not optimal in every situation
  - still, the different frameworks might work on the same base data

24, Mesos dividing tasks
- scheduling:
  - distributes tasks to available resources
  - considers data locality: sends tasks to nodes that already store the involved data
  - depends on the framework (optimization strategy), the job (algorithm) and the task order
  - should be implemented by the framework
- resource distribution:
  - distributes available resources to frameworks
  - keeps track of system usage
  - ensures priorities between different frameworks
  - should be implemented by the intermediate layer (Mesos)

25, Mesos architecture
- centralized master-slave system; frameworks run tasks on slave nodes
- the master implements sharing using resource offers: lists of free resources on various slaves
- the master decides which (and how many) resources are offered to which framework; this implements the organizational policy
- frameworks have two parts:
  - scheduler: accepts or declines offers from Mesos
  - executor process: started on computing nodes, executes the framework's tasks
- the framework decides which task is solved on a particular resource
- tasks are executed by sending a task description to the master

26, Mesos summary/overview
- a framework/library/set of servers
- allows running several distributed frameworks on top of a single cluster of machines
- administrates and distributes resources with respect to configurable priorities
- is an actually implemented and used system: mesos.apache.org
- made it into the Apache Incubator; started at the UC Berkeley AMP Lab
- uses ZooKeeper

27, Spark [7]
- MapReduce is not optimal for all problems
- many algorithms iterate a number of times over the source data
- example: gradient descent; each iteration uses the source data to compute a new gradient
- in Hadoop, every iteration reads all source data completely from disk, computes a single step and writes the result
- approach in Spark: create resilient distributed datasets (RDDs), if possible cached in the memory of the involved machines

[7] Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010

28, Spark: overview
- cluster computing system, comparable to Hadoop
- provides primitives for in-memory cluster computing: data types are distributed in the cluster, and the parts on the individual machines are kept in memory
- speedup in comparison to Hadoop for certain (iterating) algorithms (e.g. logistic regression)
- built on top of Mesos
- provides APIs for Scala, Java, Python
- originally developed for iterative algorithms (iterations using the same source data) and interactive data mining
- spark.apache.org
- often seen as the successor of Hadoop

29, Spark programming model
- Spark applications consist of a driver program that implements the global, high-level control flow and launches operations that are executed in parallel
- distribution and parallelization are achieved with:
  - resilient distributed datasets (RDDs)
  - parallel operations working on RDDs
  - shared variables
- RDDs are read-only, distributed collections, constructed from input data or by transformation, and held in memory (if possible)

30, Spark RDDs: resilient distributed datasets
- lazy evaluation: creating a handle only describes the derivation; the derivation is executed only when necessary
- ephemeral: not guaranteed to stay in memory, recreated on demand
- this state can be changed using cache and save:
  - cache: still not evaluated; after the first evaluation the RDD is kept in memory if possible
  - save: triggers evaluation and writes to distributed storage; the handle of a saved RDD points to the persistently stored object
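A small sketch of these semantics in Spark's Java API (the input path, the "ERROR" filter and the local master are made-up illustration values):

```java
// Sketch of lazy evaluation and caching with Spark's Java API.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Nothing is read yet: textFile and filter only describe a derivation.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");
        JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();

        // The first action triggers evaluation; cache() keeps the result in memory.
        long n1 = errors.count();
        // A second action reuses the cached partitions instead of rereading HDFS.
        long n2 = errors.filter(l -> l.contains("timeout")).count();
        System.out.println(n1 + " errors, " + n2 + " timeouts");
        sc.stop();
    }
}
```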

31, Spark parallel transformations
- RDDs are transformed by parallel transformations; the result is always a new RDD
- emulating MapReduce is simple:
  - flatMap(function): applies the function to each element of the RDD and produces a new RDD from the results (multiple results per call)
  - reduceByKey(function): called on collections of (K,V) key/value pairs; groups by key and aggregates with the function
- other transformations include:
  - union() (of two RDDs), distinct() (distinct elements)
  - sort(), groupByKey()
  - join() (equi-join on key), cartesian()
  - cogroup(): maps (K,V), (K,W) to (K, Seq(V), Seq(W))
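Putting flatMap and reduceByKey together gives the classic word count; a sketch in the Spark 2.x Java API (input and output paths are assumptions):

```java
// Word count as a sketch of flatMap + reduceByKey in the Java API.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        // flatMap: one line yields many words (the "map" phase)
        JavaRDD<String> words =
                lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator());
        // reduceByKey: group by word and sum the counts (the "reduce" phase)
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///data/counts");
        sc.stop();
    }
}
```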

32, Spark actions
- actions extract data from the RDD and transport it back into the context of the driver program:
  - collect(): retrieve all elements of an RDD
  - first(), take(n): retrieve the first / the first n elements
  - reduce(func): use a commutative, associative func for parallel reduction and retrieve the final result
  - foreach(func): run a function over all elements (e.g. for statistics)
  - count(): get the number of elements
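Continuing the Java sketches above, a few of these actions on a small RDD (the values are made up):

```java
// Sketch of Spark actions on a small RDD.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ActionsDemo").setMaster("local[*]"));
        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(4, 1, 3, 2, 5));

        long n = nums.count();                 // number of elements
        List<Integer> firstTwo = nums.take(2); // first two elements
        int sum = nums.reduce(Integer::sum);   // parallel reduction with an associative function

        System.out.println(n + " elements, first two " + firstTwo + ", sum " + sum);
        sc.stop();
    }
}
```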

33, Spark shared variables
- parallel functions transport variables from their original context to the node they are executed on
- these have to be transported every time a function is sent over the network
- Spark supports two additional forms for special use cases:
  - broadcast variables: transported only once to all involved nodes; read-only in parallel functions
  - accumulators: parallel functions can add to accumulators (adding is some associative operation); they can be read only by the driver program
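A sketch of both forms in the Spark 2.x Java API; the stop-word set and input values are made up for the example:

```java
// Sketch of broadcast variables and accumulators in the Java API.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVars {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SharedVars").setMaster("local[*]"));

        // Broadcast variable: shipped once per node, read-only in parallel functions.
        Broadcast<Set<String>> stopWords =
                sc.broadcast(new HashSet<>(Arrays.asList("a", "the", "of")));

        // Accumulator: workers only add to it; the driver reads the result.
        LongAccumulator dropped = sc.sc().longAccumulator("dropped");

        long kept = sc.parallelize(Arrays.asList("a", "spark", "the", "hadoop"))
                .filter(w -> {
                    if (stopWords.value().contains(w)) {
                        dropped.add(1);
                        return false;
                    }
                    return true;
                })
                .count();

        System.out.println("kept " + kept + ", dropped " + dropped.value());
        sc.stop();
    }
}
```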

34, Pregel [8]
- solve large-scale problems on graphs/networks, example: PageRank
- a distributed system designed for graph computations
- assumption: many graph algorithms traverse the graph via edges and access data very locally, e.g. computations for a node involve the values of its neighbors
- Pregel implements such a system, but is not public/open source
- Giraph is an open-source framework implementing the same idea: giraph.apache.org

[8] Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, 2010

35, Pregel overview
- basic unit of computation: a node with a unique id and its incident edges
- nodes perform computations in parallel and communicate with each other via messages
- a superstep is one round in which each node computes; in each superstep a node:
  - receives the messages from the last round
  - updates/computes/sends messages to be received in the next round
- each node can vote for stopping, which turns the node inactive; it gets reactivated by a received message
- the computation stops when all nodes vote for stopping

36, example: connected components
- assume node ids are totally ordered
- each node initializes minid with its own id and sends its id to all neighbors
- in each following round:
  - collect all received ids and update the minimum
  - if minid changed: send the new minid to the neighbors
  - else: vote for stop
- result: all nodes in a component have the same minid; nodes in different components have different minids
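A sketch of this algorithm in Giraph's vertex-centric Java API; integer ids/values, the class name and the use of BasicComputation are illustrative choices for the example:

```java
// Connected-components sketch in Giraph's vertex-centric API.
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class MinIdComputation
        extends BasicComputation<IntWritable, IntWritable, NullWritable, IntWritable> {

    @Override
    public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
                        Iterable<IntWritable> messages) throws IOException {
        int minId = vertex.getId().get();
        if (getSuperstep() > 0) {
            // Later rounds: fold the received ids into the current minimum.
            minId = vertex.getValue().get();
            for (IntWritable msg : messages) {
                minId = Math.min(minId, msg.get());
            }
        }
        // Send only in the first round or when the minimum changed.
        if (getSuperstep() == 0 || minId < vertex.getValue().get()) {
            vertex.setValue(new IntWritable(minId));
            sendMessageToAllEdges(vertex, new IntWritable(minId));
        }
        // Vote for stop; a received message reactivates the vertex.
        vertex.voteToHalt();
    }
}
```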

37, Pregel: master-slave system
- master node:
  - determines the end of the algorithm
  - takes care of node failures
  - synchronizes node communication
- the basic idea can be extended: nodes can mutate the graph (create/delete nodes/edges)