Big Data and Scripting Systems beyond Hadoop

2, ZooKeeper
- distributed coordination service
- many problems are shared among distributed systems
- ZooKeeper provides an implementation that solves these
  - avoid repeated implementation of the same services
- provide primitives for
  - synchronization
  - configuration maintenance
  - naming
- optimized for failure tolerance, reliability and performance
- used in other projects as sub service
- another Apache top-level project

3, ZooKeeper
- provides tree-like information storage
- update guarantees
  - sequential consistency (keep update order)
  - atomicity
  - single system image (one state for all views)
  - reliability (applied updates persist)
  - timeliness (time bounds for updates)
- extremely simple interface
  - create/delete
  - test node existence
  - get/set data
  - get children
  - sync (wait for update to propagate)
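The interface listed above can be illustrated with a toy, single-process tree store. This is only a sketch of the idea: the class `TreeStore` and its method names are hypothetical and are not the ZooKeeper client API, and none of the update guarantees (ordering, replication, sync) are modeled.

```python
class TreeStore:
    """Toy in-memory tree of named nodes (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        # Maps an absolute path such as "/app/config" to the data stored there.
        self._data = {"/": b""}

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if path in self._data:
            raise KeyError(f"node exists: {path}")
        if self._parent(path) not in self._data:
            raise KeyError(f"missing parent of: {path}")
        self._data[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError(f"node has children: {path}")
        del self._data[path]

    def exists(self, path):
        return path in self._data

    def get(self, path):
        return self._data[path]

    def set(self, path, data):
        if path not in self._data:
            raise KeyError(path)
        self._data[path] = data

    def get_children(self, path):
        # Direct children only: names below the prefix without a further "/".
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._data
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


store = TreeStore()
store.create("/app")
store.create("/app/config", b"v1")
store.set("/app/config", b"v2")
```

Keeping everything in one dictionary mirrors the "memory based" nature of the service in the simplest possible way; a real deployment replicates this state across servers.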

4, ZooKeeper environment
- run standalone or replicated/distributed
- interfaces for Java and C
- independent of (but often used together with) Hadoop/HDFS
- memory based (stored data is kept in memory and therefore limited)
- prefers heavy reading over heavy writing

5, Mesos [1]
- Hadoop: use one (physical) cluster of machines exclusively
- Mesos: enable sharing of a cluster between multiple distributed computing systems
  - implement intermediate layer between distributed framework (e.g. Hadoop) and hardware
  - administrate physical resources
  - distribute these to involved frameworks
  - improve cluster utilization
  - implement prioritization and failure tolerance
[1] Mesos: A platform for fine-grained resource sharing in the data center, Hindman et al., 2011

6, example scenarios
- multiple Hadoop systems on the same (physical) set of machines
  - production system, takes priority
  - testing implementations or execute analyses that are of general interest but should not disturb the production system
  - test new versions of Hadoop
  - all involved Hadoop instances use the same data as input
- different distributed frameworks on the same cluster
  - different tasks benefit from different optimization approaches
  - the map reduce approach is not optimal in every situation
  - still, the different frameworks might work on the same base data

7, dividing tasks
- scheduling
  - distribute tasks to available resources
  - consider data locality: send tasks to nodes that already store the involved data
  - depends on framework (optimization strategy), job (algorithm) and task order
  - should be implemented by the framework
- resource distribution
  - distribute available resources to frameworks
  - keep track of system usage
  - ensure priorities between different frameworks
  - should be implemented by intermediate layer (Mesos)

8, architecture
- centralized master-slave system
- frameworks run tasks on slave nodes
- master implements sharing using resource offers: list of free resources on various slaves
- master decides which (and how many) resources are offered to which framework
  - implements organizational policy
- frameworks have two parts:
  - scheduler: accepts or declines offers from Mesos
  - executor process: started on computing nodes, executes tasks
- framework decides which task is solved on a particular resource
- tasks are executed by sending task description to master
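The offer mechanism can be sketched as a small simulation. All names here (`Offer`, `Task`, `Scheduler.resource_offer`, `master_round`) are hypothetical and chosen for illustration; they are not the Mesos API. The sketch assumes a single framework and a greedy first-fit choice inside the scheduler.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    slave: str
    cpus: int
    mem_gb: int


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


class Scheduler:
    """Framework side: decides which pending tasks to place on an offer."""

    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offer(self, offer):
        """Return the tasks to launch; an empty list declines the offer."""
        launched = []
        cpus, mem = offer.cpus, offer.mem_gb
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem_gb <= mem:
                launched.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem_gb
        return launched


def master_round(offers, scheduler):
    """Master side: offer each free resource bundle, record accepted tasks."""
    placement = {}
    for offer in offers:
        tasks = scheduler.resource_offer(offer)
        if tasks:
            placement[offer.slave] = [t.name for t in tasks]
    return placement


offers = [Offer("s1", cpus=2, mem_gb=4), Offer("s2", cpus=1, mem_gb=1)]
sched = Scheduler([Task("a", 2, 2), Task("b", 1, 2), Task("c", 1, 1)])
placement = master_round(offers, sched)
```

The point of the split is visible in the code: the master never inspects tasks, and the scheduler never sees resources it was not offered.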

9, design: resource allocation
- each framework gets a guaranteed amount of resources corresponding to its share
- when resources are available, framework receives additional offers
  - can cause conflicts (another framework starts using its share)
  - overusage of the first framework is resolved by revoking (i.e. killing) tasks
- frameworks can indicate demand, otherwise free resources are offered freely
- when used resources are below share, no task is revoked
- when usage is above share, any task can be revoked
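One possible reading of the revocation rule, as a sketch: the function `revoke_for` and the unit-based accounting are assumptions made for illustration, not the policy implemented in Mesos. The sketch revokes only the excess above a framework's share, starting with the framework that exceeds its share the most; frameworks at or below their share are never touched.

```python
def revoke_for(claimant, needed, share, usage):
    """Choose resource units to revoke so `claimant` can grow by `needed`.

    share, usage: dicts mapping framework name -> resource units.
    Returns a dict framework -> units revoked (hypothetical policy).
    """
    revoked = {}
    # Visit frameworks in order of decreasing overuse.
    by_overuse = sorted(usage, key=lambda f: usage[f] - share[f], reverse=True)
    for fw in by_overuse:
        if needed <= 0:
            break
        if fw == claimant:
            continue
        excess = usage[fw] - share[fw]
        if excess <= 0:
            continue  # at or below its share: no task is revoked
        take = min(excess, needed)
        revoked[fw] = take
        needed -= take
    return revoked


share = {"prod": 6, "test": 2, "adhoc": 2}
usage = {"prod": 3, "test": 5, "adhoc": 2}
revoked = revoke_for("prod", needed=3, share=share, usage=usage)
```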

10, design: robustness and fault tolerance
- frameworks are treated as unreliable:
  - resources offered to a framework count into its usage
  - unanswered offers are interpreted as rejection
  - the corresponding resources are offered to another framework
- single master could pose a single point of failure
- master has soft state:
  - replacement master can reconstruct state from information held by slaves and framework schedulers
  - recover: active slaves, active frameworks, running tasks
  - inactive standby replacement masters elect new leader using ZooKeeper
- frameworks can also register replacement schedulers

11, summary/overview
- a framework/library/set of servers
- allows running several distributed frameworks on top of a single cluster of machines
- administrates and distributes resources with respect to configurable priorities
- is an actually implemented and used system: mesos.apache.org
- made it into the Apache incubator
- started at UC Berkeley AMP Lab

12, Spark: an alternative distributed framework [2]
- map reduce is not the optimal framework for all algorithms
- problem: many algorithms iterate a number of times over their source data
  - example: gradient descent, each iteration uses source data to compute new gradient
  - in map reduce/hadoop every iteration reads all source data completely from disk and computes a single step
- approach in Spark: create resilient distributed datasets (RDDs), if possible cached in memory of the involved machines
[2] Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010
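To make the access pattern concrete, here is a minimal single-machine sketch of gradient descent on a one-parameter least-squares problem. The data set, step count and learning rate are made up for illustration. Every iteration makes one full pass over the same source data, which is why keeping that data in memory pays off.

```python
def gradient(w, data):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def gradient_descent(data, steps=200, rate=0.05):
    w = 0.0
    for _ in range(steps):
        # One step = one complete pass over the source data.
        w -= rate * gradient(w, data)
    return w


# Toy data generated from y = 2 * x, so the fitted slope approaches 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = gradient_descent(data)
```

In a map reduce setting each of the 200 passes would be a separate job reading `data` from disk; here the list simply stays in memory between iterations.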

13, Spark: overview
- cluster computing system, comparable to Hadoop
- provides primitives for in-memory cluster computing
  - data types that are distributed in parts across the individual machines
  - kept in memory
- in applications that benefit from this approach, Spark tends to be much faster than Hadoop
- provides APIs for Scala, Java, Python
- built on top of Mesos
- originally developed for
  - iterative algorithms (iterations using the same source data)
  - interactive data mining
- spark-project.org
- not (yet) an Apache project, but open-source (BSD-license)

14, intermission: Scala
- integrates object-oriented and functional language features
- Java-based (in part)
  - similar syntax
  - compiles to Java bytecode (runs on standard JRE)
- type safe
- allows
  - functional programming
  - interactive usage with console
  - full object-oriented compilable programming (including GUI)
- free:

15, Spark programming model
- Spark applications consist of a driver program
  - implements the global, high-level control flow
  - launches operations that are executed in parallel
- distribution and parallelization is achieved with
  - resilient distributed datasets (RDDs)
  - parallel operations working on RDDs
  - shared variables

16, resilient distributed datasets
- RDDs are
  - read-only
  - partitioned across the compute nodes
  - not necessarily on physical storage: described by handle
- handle contains information to infer RDD from reliable storage
  - parts can be rebuilt in case of data loss
- constructed:
  - from files (e.g. in HDFS)
  - from local collection (e.g. distribute array across cluster)
  - by transformation from existing RDD
  - by changing persistence of existing RDD

17, resilient distributed datasets
- lazy evaluation:
  - creating handle describes derivation
  - derivation is only executed when necessary
- ephemeral:
  - not guaranteed to stay in memory
  - recreated on demand
- state can be changed using cache and save
  - cache:
    - still not evaluated
    - after first evaluation kept in memory, if possible
  - save:
    - triggers evaluation
    - writes to distributed storage
    - handle of saved RDD points to persistently stored object
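Lazy evaluation and caching can be mimicked on a single machine. The class `LazyDataset` below is a hypothetical stand-in, not Spark's RDD implementation: a handle stores only the derivation, nothing runs until `collect()` is called, and `cache()` keeps the result after the first evaluation. Unlike Spark, `cache()` here mutates the handle instead of returning a new one.

```python
class LazyDataset:
    """Handle that describes how to derive a list (hypothetical sketch)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing a list
        self._cached = False
        self._memo = None

    @classmethod
    def from_list(cls, items):
        items = list(items)
        return cls(lambda: list(items))

    def map(self, func):
        # Only records the derivation; func is not called here.
        return LazyDataset(lambda: [func(x) for x in self._evaluate()])

    def cache(self):
        self._cached = True  # still not evaluated
        return self

    def _evaluate(self):
        if self._memo is not None:
            return self._memo
        result = self._compute()
        if self._cached:
            self._memo = result  # kept in memory after first evaluation
        return result

    def collect(self):
        return list(self._evaluate())


calls = []


def slow_square(x):
    calls.append(x)  # record every real evaluation
    return x * x


squares = LazyDataset.from_list([1, 2, 3]).map(slow_square).cache()
calls_before_collect = len(calls)
first = squares.collect()
second = squares.collect()
```

Without the `cache()` call the second `collect()` would rerun `slow_square` on every element, which corresponds to the "recreated on demand" behavior of an ephemeral dataset.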

18, parallel transformations
- emulating map reduce is simple:
  - flatMap(function): apply function to each element of the RDD, produce new RDD from results (multiple results per call)
  - reduceByKey(function): called on collections of (K,V) key/value pairs, groups by key, aggregates with function
- other transformations include
  - map(function): apply function to all elements
  - filter(), sample()
  - union() (of two RDDs), distinct() (distinct elements)
  - sort(), groupByKey()
  - join() (equi-join on key), cartesian()
  - cogroup(): maps (K,V), (K,W) -> (K, Seq(V), Seq(W))
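The map reduce emulation can be traced with a word count on plain Python lists. The helpers `flat_map` and `reduce_by_key` are local stand-ins written for this sketch; they imitate the semantics of the two transformations on a single machine and are not Spark calls.

```python
def flat_map(func, items):
    """Apply func to each element and concatenate the resulting lists."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and combine the values with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["to be or", "not to be"]
# "map" phase: emit one (word, 1) pair per word.
pairs = flat_map(lambda line: [(word, 1) for word in line.split()], lines)
# "reduce" phase: sum the counts per word.
counts = reduce_by_key(lambda a, b: a + b, pairs)
```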

19, actions to access distributed data
- actions extract data from the RDD and transport it back into the context of the driver program:
  - collect(): retrieve all elements of an RDD
  - first(), take(n): retrieve first/first n elements
  - reduce(func): use commutative, associative func for parallel reduction and retrieve final result
  - foreach(func): run function over all elements (e.g. for statistics)
  - count(): get number of elements
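Why reduce(func) asks for a commutative, associative function becomes visible once the data is split into partitions. In the sketch below, `parallel_reduce` is a hypothetical helper that reduces each partition separately and then combines the partial results, which is how a parallel reduction can be organized; the partitions are processed sequentially here for simplicity.

```python
from functools import reduce


def parallel_reduce(func, partitions):
    """Reduce every non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


partitions = [[1, 2, 3], [4, 5], [6]]
flat = [x for part in partitions for x in part]

# Addition is associative and commutative: partitioning does not matter.
total = parallel_reduce(lambda a, b: a + b, partitions)
total_sequential = reduce(lambda a, b: a + b, flat)

# Subtraction is not: the result depends on how the data was split.
diff = parallel_reduce(lambda a, b: a - b, partitions)
diff_sequential = reduce(lambda a, b: a - b, flat)
```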

20, shared variables
- parallel functions transport variables from their original context to the node they are executed at
- these have to be transported every time a function is sent over the network
- Spark supports 2 additional forms for special use cases
  - broadcast variables:
    - transported only once to all involved nodes
    - read only in parallel functions
  - accumulators:
    - parallel functions can add to accumulators
    - adding is some associative operation
    - can be read only by driver program
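The two forms can be imitated in a single process. In this sketch the class `Accumulator` and the function `run_partition` are hypothetical, a `frozenset` plays the role of a read-only broadcast variable, and nothing is actually sent over a network. The restriction that only the driver may read the accumulator is a convention in this toy version, not something the code enforces.

```python
class Accumulator:
    """Add-only counter (hypothetical sketch, not Spark's accumulator)."""

    def __init__(self, zero=0):
        self._value = zero

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        # By convention read only by the driver program.
        return self._value


def run_partition(records, stopwords, blank_records):
    """Parallel function: filters records, counts blank ones on the side."""
    kept = []
    for record in records:
        if not record.strip():
            blank_records.add(1)       # workers only add
        elif record not in stopwords:  # broadcast value is only read
            kept.append(record)
    return kept


stopwords = frozenset({"a", "the"})  # stands in for a broadcast variable
blank = Accumulator()
partitions = [["a", "spark", ""], ["the", " ", "mesos"]]
results = [run_partition(p, stopwords, blank) for p in partitions]
```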

21, Spark: summary
- system for distributed computation
- master-slave architecture
- main difference to map reduce: in-memory computations with RDDs

22, Pregel [3]
- many large-scale problems involve graphs (graph: nodes connected by edges)
  - example: PageRank
- distributed system designed for graph computations
- assumption: many graph algorithms
  - traverse nodes via edges
  - access data very locally, e.g. computations for a node involve values of its neighbors
- Pregel implements such a system, but is not public/open source
- Giraph is an open source framework implementing the same idea: giraph.apache.org
[3] Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, 2010

23, Pregel: idea
- basic unit of computation: node with unique id and its incident edges
- node can perform computations
- nodes communicate with each other via messages
- a superstep is one round where each node computes
- in each superstep, a node
  - receives messages from last round
  - updates/computes
  - sends messages to be received in next round
- each node can vote for stopping
  - turns node inactive
  - gets reactivated by received message
- computation stops when all nodes vote for stop

24, example: connected components
- assume: node ids are totally ordered
- node
  - initializes minid with its ID
  - sends its ID to all neighbors
- in each following round:
  - collect all received IDs
  - update minimum
  - if minid changed: send new minid to neighbors
  - else: vote for stop
- result:
  - all nodes in a component have same minid
  - nodes in different components have different minid
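The connected components example can be simulated sequentially. This sketch runs the supersteps in a plain loop on one machine; the function name and the inbox/outbox dictionaries are illustrative choices, not the Pregel or Giraph API. A node that receives no messages stays inactive, and the loop ends when no messages are in flight, which corresponds to all nodes having voted for stop.

```python
def connected_components(nodes, edges):
    """Return a dict node -> smallest node id in its component."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # First superstep: initialize minid and send own id to all neighbors.
    minid = {n: n for n in nodes}
    inbox = {n: [] for n in nodes}
    for n in nodes:
        for m in neighbors[n]:
            inbox[m].append(n)

    # Following supersteps: run while any message is waiting.
    while any(inbox.values()):
        outbox = {n: [] for n in nodes}
        for n, received in inbox.items():
            if not received:
                continue  # inactive node, nothing reactivated it
            candidate = min(received)
            if candidate < minid[n]:
                minid[n] = candidate
                for m in neighbors[n]:
                    outbox[m].append(candidate)
            # else: vote for stop by sending nothing
        inbox = outbox
    return minid


components = connected_components(
    nodes=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Messages produced in one superstep are only delivered in the next one, which is why the loop swaps `inbox` and `outbox` at the end of each round instead of updating neighbors directly.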

25, master slave system
- master node:
  - determines end of algorithm
  - takes care of node failure
  - synchronizes node communication
- basic idea can be extended: nodes can mutate the graph (create/delete nodes/edges)