Big Data and Scripting
Systems built on top of Hadoop
Pig/Latin
- high-level map reduce programming platform
- Pig is the name of the system, Pig Latin is the provided programming language
- Pig Latin is similar to query languages like SQL
- still procedural (in contrast to SQL): commands describe actions to execute, not the desired result
- extendable using various languages
- originally developed at Yahoo, moved to the Apache Software Foundation in 2007
- pig.apache.org
Pig/Latin overview
- execute commands that can be run on a Hadoop cluster
- simple, easy-to-learn language
- enables rapid prototyping of map reduce applications
- uses a map/reduce cluster similar to a database system
- interactive or batch mode
- commands are translated into Hadoop jobs and executed on the Hadoop system
Pig/Latin concepts
- commands (operators) act on relations
- a relation is typically a CSV file from HDFS
- example: assign a relation with named fields to variable A
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
- further operators then transform relations into other relations
- example: group items by field age
    B = GROUP A BY age;
- relations can have schemas (column names and types); schemas can be used to ensure type safety
- lazy evaluation: nothing is executed until needed
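What the GROUP operator produces can be sketched in plain Python. This is a local toy with hypothetical sample data, not Pig's distributed implementation:

```python
from collections import defaultdict

# hypothetical sample relation with schema (name, age, gpa)
students = [
    ("alice", 20, 3.5),
    ("bob", 21, 3.1),
    ("carol", 20, 3.9),
]

def group_by(relation, key_index):
    """Emulate GROUP: map each key value to the list of matching tuples."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

by_age = group_by(students, key_index=1)
# by_age[20] contains alice's and carol's rows
```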
extending Pig/Latin
- new functions can be added by implementing them in various languages: Java, Python, JavaScript, Ruby
- most extensive support is provided in Java: extend class org.apache.pig.EvalFunc
- register in Pig: REGISTER myudfs.jar;
- Java programs can run natively on Hadoop
- special interfaces allow more efficient integration of special types of functions
Apache Hive
- distributed data warehouse allowing queries and transformations
- uses various file systems as backend (HDFS, Amazon S3, ...)
- SQL-like query language: HiveQL
- execution by translation into map reduce jobs
- indexing to accelerate queries
- command line interface: Hive CLI
- originally developed by Facebook, turned into an Apache project
- hive.apache.org
HiveQL examples
- create a table with two columns:
    hive> CREATE TABLE student (sid STRING, sname STRING)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY ',';
- tables correspond to directories in the underlying file system, stored as CSV files
- load some data:
    hive> LOAD DATA INPATH '/tmp/students.txt' INTO TABLE student;
- imports the file content into Hive's storage
- dropping the table deletes data and index from Hive's storage; it does not affect external data
- query the table: SELECT * FROM student;
HiveQL examples
- multiple tables can be joined (only equality joins):
    SELECT * FROM student JOIN scores ON (student.sid = scores.sid);
- many standard SQL statements are available, e.g. GROUP BY:
    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
- grouping and aggregation; the result is written into a new table
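The two operations above, an equality join and a group-by aggregation, can be sketched in plain Python. The tables are hypothetical stand-ins for the slide's student/scores example:

```python
from collections import defaultdict

# hypothetical tables mirroring the student/scores example
students = [("s1", "alice"), ("s2", "bob")]
scores = [("s1", 90), ("s1", 85), ("s2", 70)]

def equi_join(left, right, lkey=0, rkey=0):
    """Hash join on equal keys -- the only kind of join HiveQL supports."""
    index = defaultdict(list)
    for row in right:
        index[row[rkey]].append(row)
    return [l + r for l in left for r in index[l[lkey]]]

joined = equi_join(students, scores)
# joined -> [("s1", "alice", "s1", 90), ("s1", "alice", "s1", 85),
#            ("s2", "bob", "s2", 70)]
```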
Hive data organization
- top-level organization: databases containing tables
- tables correspond to (top-level) directories
- tables are divided into partitions: subdirectories of the table directory
- tables can be partitioned by an arbitrary column:
    CREATE TABLE table (col1 INT, col2 STRING) PARTITIONED BY (col3 DATE);
- partitions are divided into buckets: further breakdowns of partitions, allowing better organization with respect to map reduce
- storage of the actual data in flat files; arbitrary formats can be used, described via regular expressions
Hive summary
- brings together SQL functionality and the scaling features of Hadoop
- subset of the table operations specified by SQL
- no low-latency queries; optimized for scalability
- storage in flat files on a distributed file system
- querying/processing by translation into map reduce jobs
- extending storage by indexing
- due to distributed storage: no individual updates
HBase
- sparse, distributed, persistent multidimensional sorted map
- implementation of Google's Bigtable idea (research.google.com/archive/bigtable.html)
- uses HDFS, Hadoop and ZooKeeper for storage and execution
- implements servers for administration and storage/computation
- maps keys to values; keys are structured, values are arbitrary
- implements random read/write access on top of HDFS
- provides consistency (on certain levels)
- accessible via shell or Java API
- hbase.apache.org
HBase structure
- keys are stored sorted, allowing range queries
- keys are highly structured into: rowkey, column family, column, timestamp
- tables are stored sparsely: missing values are not encoded but simply not stored; every value has to be stored with its full address
- data is distributed, load balancing is automated
- consistency is guaranteed on the rowkey level: all changes within one rowkey are atomic; all data of one rowkey is stored on a single machine
HBase storage and access
- data is partitioned by keys
- column families define storage properties
- columns are only labels for the corresponding values
- principal operations:
  - put: insert/update a value
  - delete: delete a value
  - get: retrieve a single value
  - scan: retrieve a collection of values (sequential reading)
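The sorted map over structured keys and the four basic operations can be sketched as an in-memory toy in plain Python. This mimics the data model, not the HBase client API:

```python
import bisect

class TinyHBase:
    """Toy sorted map keyed by (rowkey, column family, column, timestamp)."""
    def __init__(self):
        self._keys = []   # kept sorted, so range scans are possible
        self._data = {}   # key tuple -> value

    def put(self, row, family, column, ts, value):
        key = (row, family, column, ts)
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, row, family, column, ts):
        return self._data.get((row, family, column, ts))

    def delete(self, row, family, column, ts):
        key = (row, family, column, ts)
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def scan(self, start_row, stop_row):
        """Range query over rowkeys, enabled by the sorted key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        return [(k, self._data[k]) for k in self._keys[lo:hi]]
```

Because keys sort first by rowkey, a scan is just a slice of the sorted key list, which is the reason HBase can serve range queries cheaply.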
HBase guarantees
- atomicity:
  - mutations are atomic within a row; the operation result is reported
  - not atomic over multiple rows (parts may fail while others succeed)
- consistency and isolation:
  - returned rows consist of complete rows; the contained data may have changed in between
  - the data returned refers to a single point in the past (HBase keeps data for various time points)
  - scans are not consistent over multiple rows: different rows may refer to different points in time
HBase guarantees
- visibility: after successful writing, data is immediately visible to all clients; versions of rows strictly increase
- durability refers to data being stored on disk:
  - data that has been read from a cell is guaranteed to be durable
  - successful operations are durable, failed operations are not
- visibility and durability may be tuned for performance: individual reads without visibility guarantees; instead of durability, only periodic writing of data
comparison
- Pig Latin: allows viewing data as tables; provides ad hoc queries; extendable to arbitrary map reduce jobs
- Hive: tries to provide SQL functionality; slow, large-scale queries; structured query language, query planner
- HBase: more like a NoSQL database or key/value store; no SQL operations, only storage and retrieval; guarantees for operations; optimized for random, real-time access
- note: Pig and Hive can access data from HBase directly
- note: Cassandra is a database system similar to HBase, optimized for security
Mahout
- scalable implementations of data mining/learning algorithms
- provides a library for easy access to machine learning implementations
- provides algorithms for the most common problems, e.g.: clustering, classification, frequent pattern mining, ...
- optimized for practical (e.g. business) usage
- language: Java
- mahout.apache.org
- note: Mahout is currently switching from map/reduce to Spark
Mahout overview
- provides a large API of Java classes
- integration into other applications
- execution on top of a distributed cluster (e.g. Hadoop or Spark)
- implementations can be adapted to specific problems: provide individual I/O classes, individual similarity/distance functions, ...
- integration of Apache Lucene (document search engine)
Systems beyond Hadoop
ZooKeeper
- distributed coordination service
- many problems/functions are shared among distributed systems; ZooKeeper provides a single implementation of these, avoiding repeated implementation of the same services
- provides primitives for synchronization, configuration maintenance, naming
- optimized for failure tolerance, reliability and performance
- used in other projects as a sub-service
- another Apache top-level project (zookeeper.apache.org)
ZooKeeper
- provides tree-like information storage
- update guarantees:
  - sequential consistency (update order is kept)
  - atomicity
  - single system image (one state for all views)
  - reliability (applied updates persist)
  - timeliness (time bounds for updates)
- extremely simple interface: create/delete, test node existence, get/set data, get children, sync (wait for updates to propagate)
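The tree-like storage and the simple interface can be sketched as a toy in-memory znode tree in plain Python. This illustrates the data model only; it has no networking, replication, or watches:

```python
class ZNodeTree:
    """Toy znode tree with ZooKeeper-like primitives (in-memory only)."""
    def __init__(self):
        self._nodes = {"/": b""}   # path -> data; root always exists

    def _parent(self, path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, data=b""):
        if self._parent(path) not in self._nodes:
            raise KeyError("parent does not exist: " + self._parent(path))
        self._nodes[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError("node has children")
        del self._nodes[path]

    def exists(self, path):
        return path in self._nodes

    def get_data(self, path):
        return self._nodes[path]

    def set_data(self, path, data):
        self._nodes[path] = data

    def get_children(self, path):
        return sorted(p.rsplit("/", 1)[1] for p in self._nodes
                      if p != "/" and self._parent(p) == path)
```

Coordination primitives such as locks are typically built on top of exactly these operations, e.g. by creating a node to claim a lock and deleting it to release.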
Mesos
- Hadoop: uses one (physical) cluster of machines exclusively
- Mesos: shares a physical cluster between multiple distributed systems
- implements an intermediate layer between distributed frameworks and the hardware
- administrates physical resources and distributes them to the involved frameworks
- improves cluster utilization
- implements prioritization and failure tolerance
- reference: Mesos: A platform for fine-grained resource sharing in the data center, Hindman et al., 2011
Mesos example scenarios
- multiple Hadoop systems on the same (physical) set of machines:
  - a production system that takes priority
  - testing implementations, or analyses that are of general interest but should not disturb the production system
  - testing new versions of Hadoop
  - all involved Hadoop instances use the same data as input
- different distributed frameworks on the same cluster:
  - different tasks benefit from different optimization approaches
  - the map reduce approach is not optimal in every situation
  - still, the different frameworks might work on the same base data
Mesos: dividing tasks
- scheduling: distribute tasks to available resources
  - consider data locality: send tasks to nodes that already store the involved data
  - depends on the framework (optimization strategy), the job (algorithm) and the task order
  - should be implemented by the framework
- resource distribution: distribute available resources to frameworks
  - keep track of system usage
  - ensure priorities between different frameworks
  - should be implemented by the intermediate layer (Mesos)
Mesos architecture
- centralized master-slave system; frameworks run tasks on slave nodes
- the master implements sharing using resource offers: lists of free resources on various slaves
- the master decides which (and how many) resources are offered to which framework; this implements the organizational policy
- frameworks have two parts:
  - scheduler: accepts or declines offers from Mesos
  - executor process: started on computing nodes, executes tasks
- the framework decides which task is solved on a particular resource; tasks are executed by sending the task description to the master
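The resource-offer cycle can be sketched in a few lines of plain Python. All names and the list-based offer policy here are illustrative assumptions; the real Mesos API is callback-based:

```python
# toy resource-offer loop: the master offers free slave resources,
# a framework scheduler binds its pending tasks to accepted offers
def make_offers(free_resources, framework_share):
    """Master side: offer a slice of the free resources (the policy)."""
    return free_resources[:framework_share]

def scheduler(offers, pending_tasks):
    """Framework side: accept offers one per task; the rest is declined."""
    return [(task, offer["slave"]) for offer, task in zip(offers, pending_tasks)]

free = [{"slave": "s1", "cpus": 4}, {"slave": "s2", "cpus": 8}, {"slave": "s3", "cpus": 2}]
tasks = ["map-1", "map-2"]
offers = make_offers(free, framework_share=2)   # policy: offer two slaves
assignments = scheduler(offers, tasks)
# assignments -> [("map-1", "s1"), ("map-2", "s2")]
```

The key design point survives even in this sketch: the master decides *what to offer* (policy), while the framework decides *which task runs where* (scheduling).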
Mesos summary/overview
- a framework/library/set of servers
- allows running several distributed frameworks on top of a single cluster of machines
- administrates and distributes resources with respect to configurable priorities
- is an actually implemented and used system: mesos.apache.org
- made it into the Apache incubator
- started at the UC Berkeley AMP Lab
- uses ZooKeeper
Spark
- map reduce is not optimal for all problems
- many algorithms iterate a number of times over the source data
- example: gradient descent, where each iteration uses the source data to compute a new gradient
- in Hadoop, every iteration reads all source data completely from disk, computes a single step and writes the result
- approach in Spark: create resilient distributed datasets (RDDs), if possible cached in the memory of the involved machines
- reference: Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010
Spark: overview
- cluster computing system, comparable to Hadoop
- provides primitives for in-memory cluster computing: data is distributed in the cluster, and the parts on the individual machines are kept in memory
- speedup in comparison to Hadoop for certain (iterating) algorithms (e.g. logistic regression)
- built on top of Mesos
- provides APIs for Scala, Java, Python
- originally developed for iterative algorithms (iterations using the same source data) and interactive data mining
- spark.apache.org
- often seen as the successor of Hadoop
Spark programming model
- Spark applications consist of a driver program: it implements the global, high-level control flow and launches operations that are executed in parallel
- distribution and parallelization are achieved with: resilient distributed datasets (RDDs), parallel operations working on RDDs, shared variables
- RDDs are read-only, distributed collections: constructed from input data or by transformation; held in memory (if possible)
Spark RDDs: resilient distributed datasets
- lazy evaluation: creating a handle only describes the derivation; the derivation is executed only when necessary
- ephemeral: not guaranteed to stay in memory, recreated on demand
- this state can be changed using cache and save:
  - cache: still not evaluated; after the first evaluation the result is kept in memory if possible
  - save: triggers evaluation and writes to distributed storage; the handle of a saved RDD points to the persistently stored object
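Lazy evaluation and cache can be illustrated with a toy RDD class in plain Python. This is a single-machine sketch whose method names mimic, but are not, the Spark API:

```python
class ToyRDD:
    """Toy lazy collection: records a derivation, evaluates only on demand."""
    def __init__(self, source, transform=None):
        self._source = source          # input list, or parent ToyRDD
        self._transform = transform    # function applied lazily, or None
        self._cached = None
        self._cache_enabled = False

    def map(self, fn):
        # nothing is computed here -- only the derivation is recorded
        return ToyRDD(self, fn)

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        data = self._source if isinstance(self._source, list) else self._source.collect()
        if self._transform is not None:
            data = [self._transform(x) for x in data]
        if self._cache_enabled:
            self._cached = data        # kept for later collect() calls
        return data

doubled = ToyRDD([1, 2, 3]).map(lambda x: 2 * x).cache()
# no work has happened yet; the first collect() evaluates and caches
```

Because each ToyRDD keeps a reference to its parent and its transform, a lost (uncached) result could be recomputed from the source, which is the idea behind "resilient".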
Spark parallel transformations
- RDDs are transformed by parallel transformations; the result is always a new RDD
- emulating map reduce is simple:
  - flatMap(function): apply function to each element of the RDD, produce a new RDD from the results (multiple results per call)
  - reduceByKey(function): called on collections of (K,V) key/value pairs; groups by key, aggregates with function
- other transformations include: union() (of two RDDs), distinct() (distinct elements), sort(), groupByKey(), join() (equi-join on key), cartesian(), cogroup() (maps (K,V), (K,W) to (K, Seq(V), Seq(W)))
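The map reduce emulation above can be demonstrated with a word count, using local stand-ins for flatMap and reduceByKey in plain Python (not PySpark):

```python
from collections import defaultdict

def flat_map(fn, data):
    """Apply fn to each element and flatten the per-element result lists."""
    return [y for x in data for y in fn(x)]

def reduce_by_key(fn, pairs):
    """Group (K, V) pairs by key, then fold each group's values with fn."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        acc = vs[0]
        for v in vs[1:]:
            acc = fn(acc, v)
        out[k] = acc
    return out

lines = ["big data", "big scripting"]
pairs = flat_map(lambda line: [(w, 1) for w in line.split()], lines)
counts = reduce_by_key(lambda a, b: a + b, pairs)
# counts -> {"big": 2, "data": 1, "scripting": 1}
```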
Spark actions
- actions extract data from an RDD and transport it back into the context of the driver program:
  - collect(): retrieve all elements of an RDD
  - first(), take(n): retrieve the first / the first n elements
  - reduce(func): use a commutative, associative func for parallel reduction and retrieve the final result
  - foreach(func): run a function over all elements (e.g. for statistics)
  - count(): get the number of elements
Spark shared variables
- parallel functions transport variables from their original context to the node they are executed on
- these have to be transported every time a function is sent over the network
- Spark supports two additional forms for special use cases:
  - broadcast variables: transported only once to all involved nodes; read-only in parallel functions
  - accumulators: parallel functions can add to accumulators (adding is some associative operation); they can be read only by the driver program
Pregel
- solves large-scale problems on graphs/networks; example: PageRank
- distributed system designed for graph computations
- assumption: many graph algorithms traverse the graph via edges and access data very locally, e.g. computations for a node involve the values of its neighbors
- Pregel implements such a system, but is not public/open source
- Giraph is an open-source framework implementing the same idea: giraph.apache.org
- reference: Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, 2010
Pregel overview
- basic unit of computation: a node with a unique id and its incident edges
- nodes perform computations in parallel and communicate with each other via messages
- a superstep is one round in which each node computes; in each superstep a node: receives the messages from the last round; updates, computes, and sends messages to be received in the next round
- each node can vote for stopping: this turns the node inactive; it is reactivated by a received message
- the computation stops when all nodes vote for stop
example: connected components
- assume the node ids are totally ordered
- each node initializes minid with its own id and sends its id to all neighbors
- in each following round: collect all received ids and update the minimum
  - if minid changed: send the new minid to the neighbors
  - else: vote for stop
- result: all nodes in a component have the same minid; nodes in different components have different minids
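The superstep loop above can be simulated sequentially in plain Python. This is a single-machine sketch of the message-passing scheme, not Pregel or Giraph code:

```python
def connected_components(neighbors):
    """neighbors: dict node id -> list of adjacent ids (undirected graph).
    Returns dict node id -> smallest id in its component."""
    # superstep 0: every node adopts its own id and sends it to all neighbors
    minid = {v: v for v in neighbors}
    messages = {v: [minid[u] for u in neighbors[v]] for v in neighbors}
    active = set(neighbors)
    while active:
        new_messages = {v: [] for v in neighbors}
        for v in active:
            m = min(messages[v], default=minid[v])
            if m < minid[v]:                 # minid changed: send update
                minid[v] = m
                for u in neighbors[v]:
                    new_messages[u].append(m)
            # else: vote for stop (node becomes inactive)
        messages = new_messages
        # nodes that received a message are reactivated for the next superstep
        active = {v for v in neighbors if new_messages[v]}
    return minid
```

On the graph 1-2-3 plus an isolated node 4, the minimum id 1 propagates through the component in a few supersteps, while node 4 votes for stop immediately.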
master-slave system
- master node: determines the end of the algorithm; takes care of node failures; synchronizes the node communication
- the basic idea can be extended: nodes can mutate the graph (create/delete nodes and edges)