Hadoop Masterclass. Part 4 of 4: Analyzing Big Data. Lars George EMEA Chief Architect Cloudera


1 Hadoop Masterclass Part 4 of 4: Analyzing Big Data Lars George, EMEA Chief Architect, Cloudera

2 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

3 Access to Data There are many ways to access data. For Hadoop there are at least the following two major categories: Batch Access This is using, for example, MapReduce or an abstraction on top of it, for example Pig and Hive. Near and Real-time Access Data is read directly from HDFS or memory, for example using Impala or Solr.

4 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

5 Pig Pig was developed within Yahoo! to widen the access to MapReduce across the organization. Programming jobs in Java is tedious and takes time, while SQL is allegedly too restrictive. Pig defines a query language called Pig Latin, an imperative language (as opposed to the declarative SQL), which follows many concepts found in such languages. Pig is extensible using User-Defined Functions (UDFs) and supplies a command-line shell called Grunt.

6 Pig Execution When executing a Pig script, it is translated into one or more MapReduce jobs. These then execute the supplied Pig JAR file, which is responsible for reading the input file(s) and performing the defined actions, resulting in output file(s). For a developer Pig should be easy to read since it reflects the processing steps one after the other, very similar to BASIC or Bash shell scripting. The schema is applied at runtime and within the script.

7 Pig Data Model There are Relations, Bags and Tuples, as well as simple data types (Fields). Relations are similar to Tables and contain Bags, which in turn contain Tuples. The latter contain the actual Fields. In other words, the Tuples are the Rows in a Table. This is also very similar to the relational algebra forming the basis for RDBMSs. Pig supplies classes to convert data from files into the above concepts.
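To make the nesting concrete, here is a minimal sketch in the Grunt shell (the file name and fields are made up for illustration); DESCRIBE prints the schema of a grouped relation, showing a Bag of Tuples nested inside the Relation:
grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int);
grunt> grouped_records = GROUP records BY year;
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, records: {(year: chararray, temperature: int)}}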

8 Pig There are two execution modes, local and distributed:
/* local mode */
$ pig -x local ...
/* mapreduce mode */
$ pig ...
or
$ pig -x mapreduce ...

9 Pig Example
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

10 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

11 Hive Facebook had a similar problem to Yahoo!, i.e. it wanted a larger audience for the data stored in Hadoop. They had many analysts that were already familiar with SQL, so they set out to develop an SQL front-end to Hadoop. The syntax loosely follows that of MySQL, because that is what was already widely in use at FB. HiveQL (Hive's SQL) has a few extensions, but also some restrictions compared to the full SQL standard.

12 Hive Overview Apache Hive is a high-level abstraction on top of MapReduce Uses a SQL-like language called HiveQL Generates MapReduce jobs that run on the Hadoop cluster Originally developed by Facebook for data warehousing Now an open-source Apache project
SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

13 Hive Overview Hive itself runs on the client machine, i.e. it is a tool that takes HiveQL statements from a file or interactively and translates them into a series of MapReduce jobs that run on the actual Hadoop cluster.

14 Why use Hive? More productive than writing MapReduce directly Five lines of HiveQL might be equivalent to 100 lines or more of Java Brings large-scale data analysis to a broader audience No software development experience required Leverage existing knowledge of SQL Offers interoperability with other systems Extensible through Java and external scripts Many business intelligence (BI) tools support Hive

15 Data Access Hive's queries operate on tables, just like in an RDBMS A table is simply an HDFS directory containing one or more files Default path: /user/hive/warehouse/<table_name> Hive supports many formats for data storage and retrieval How does Hive know the structure and location of tables? These are specified when tables are created This metadata is stored in Hive's metastore Contained in an RDBMS such as MySQL
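To illustrate how a table definition ties the metastore schema to files in an HDFS directory, a minimal sketch (table and column names are assumptions, reusing the customers example from the earlier query):
CREATE TABLE customers (
  cust_id INT,
  fname   STRING,
  lname   STRING,
  zipcode STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- the data files would then live under /user/hive/warehouse/customers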

16 Data Access Hive consults the metastore to determine data format and location. The query itself operates on data stored on a file system (typically HDFS).

17 Hive vs. RDBMS Client-server database management systems have many strengths Very fast response time Support for transactions Allows modification of existing records Can serve thousands of simultaneous clients Hive does not turn your Hadoop cluster into an RDBMS It simply produces MapReduce jobs from HiveQL queries Limitations of HDFS and MapReduce still apply

18 Hive vs. RDBMS

19 Hive as a Service Initially the Hive project offered a Thrift-based server, aptly called Hive Server, that could run queries. It has serious limitations, especially around security, and has been replaced (still work-in-progress) with Hive Server 2. The Hive Server in general offers two major services: it acts as a gateway for local and remote clients, and can handle API as well as JDBC/ODBC calls from the dedicated implementations.

20 Hive Server 2

21 Hive CLIs As with the Hive Server, there are two implementations of the command-line tools: the Hive Shell and Beeline. The former bypasses any control and talks directly to the metastore and submits MapReduce jobs. The latter, Beeline, is for HiveServer2 and uses it to handle all communication with the cluster internally. This includes the metastore as well as the MapReduce jobs themselves. The following diagram shows the difference between the two.

22 Hive CLI and HiveServer2

23 Hive Shell You can execute HiveQL statements in the Hive Shell This interactive tool is similar to the MySQL shell Run the hive command to start the Hive shell The Hive shell will display its hive> prompt Each statement must be terminated with a semicolon Use the quit command to exit the Hive shell
$ hive
hive> SELECT cust_id, fname, lname FROM customers WHERE zipcode=20525;
Quentin Shepard
Brandon Louis
Marilyn Ham
hive> quit;
$

24 Hive Shell You can also execute a file containing HiveQL code using the -f option
$ hive -f myquery.hql
Or use HiveQL directly from the command line using the -e option
$ hive -e 'SELECT * FROM users'
Use the -S (silent) option to suppress informational messages Can also be used with the -e or -f options
$ hive -S

25 Beeline To connect to HiveServer2, use Hue or the Beeline CLI You cannot use the Hive shell For secure deployments, supply your user ID and password Example: starting Beeline and connecting to HiveServer2
analyst]$ beeline
Beeline version cdh4.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 training mypwd org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version )
Driver: Hive (version cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
beeline>

26 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

27 Impala While all previously discussed systems essentially run MapReduce batch jobs, Impala is more like a true MPP (massively parallel processing) system with properties of a distributed database. Just like Hive, Impala supports HiveQL and the Hive Metastore natively and acts like Hive from the outside. Internally it is completely different though, as it has processes running permanently and reads data directly off the lower storage layers (HDFS or HBase). There is no MapReduce involved at all, and hence it does not share the latency issues noticeable there.
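As an illustration, a minimal sketch of querying an existing Hive table through the impala-shell client (the host name and table are assumptions; 21000 is the usual impalad client port):
$ impala-shell -i impalad-host:21000
[impalad-host:21000] > SELECT zipcode, COUNT(*) FROM customers GROUP BY zipcode;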

28 Impala Architecture Two binaries: impalad and statestored Impala daemon (impalad) Handles client requests and all internal requests related to query execution Exports Thrift services for these two roles State store daemon (statestored) Provides name service and metadata distribution Also exports a Thrift service

29 Impala Architecture Query execution phases Request arrives via ODBC/Beeswax Thrift API Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalads During execution Intermediate results are streamed between executors Query results are streamed back to client Subject to limitations imposed by blocking operators (top-n, aggregation)

30 Impala Architecture: Planner Two-phase planning process: Single-node plan: left-deep tree of plan operators Plan partitioning: partition single-node plan to maximize scan locality, minimize data movement Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange Distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node Join order = FROM clause order

31 Query Planner Example: Query with JOIN and Aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
[Plan diagram: HDFS and HBase scans feed the hash join and pre-aggregation at the DataNodes and region servers; exchanges deliver partial results to the coordinator, which performs the merge aggregation and final top-n]

32 Impala Architecture: Query Execution Request arrives via ODBC/Beeswax Thrift API
[Diagram: a SQL application sends the SQL request via ODBC to the Query Planner of one impalad; Hive Metastore, HDFS NameNode and Statestore provide metadata; every node runs Query Planner, Query Coordinator and Query Executor alongside the HDFS DataNode and HBase]

33 Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalads
[Diagram: the coordinating impalad distributes plan fragments to the Query Executors of the other impalads, each co-located with an HDFS DataNode and HBase]

34 Impala Architecture: Query Execution Intermediate results are streamed between impalads Query results are streamed back to client
[Diagram: executors stream intermediate results to each other; the coordinating impalad streams the query results back to the SQL application]

35 Comparing Impala to Hive Hive: MapReduce as an execution engine High latency, low throughput queries Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results Java runtime allows for easy late-binding of functionality: file formats and UDFs Extensive layering imposes high runtime overhead Impala: Direct, process-to-process data exchange No fault tolerance An execution engine designed for low runtime overhead

36 Comparing Impala to Hive Impala's performance advantage over Hive: no hard numbers, but Impala can get full disk throughput (~100 MB/sec/disk); I/O-bound workloads often faster by 3-4x Queries that require multiple MapReduce phases in Hive see a higher speedup Queries that run against in-memory data see a higher speedup (observed up to 100x)

37 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

38 Apache Lucene Besides Hadoop (and Avro), Doug Cutting founded another project (actually the first one, spawning the others): Lucene. It offers an open-source implementation of text search, with many interesting features, e.g. Boolean logic (AND, OR etc.), term boosting, and fuzzy matching. Lucene is great at doing a full-text search across datasets and is a natural complement to the big data space and Hadoop.

39 Apache Solr Lucene itself though is built for a single-machine setup and to be embedded into another process. Scaling it requires a framework that can handle many distributed search indices that are maintained centrally. This functionality is provided by Solr, a wrapper around Lucene. Solr adds a server component that hosts the Lucene libraries and exports a REST API that can be used from clients. On top of Solr there is also SolrCloud, an extension that adds the distribution part, so that many Solr servers can form an ensemble of machines in a cluster.
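As a sketch of what such a REST call could look like (host, collection name and query term are assumptions; 8983 is Solr's usual default port):
$ curl 'http://solr-host:8983/solr/collection1/select?q=hadoop&wt=json&rows=10'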

40 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

41 Data Pipelines Processing data is often a task with many smaller steps, each of which performs a partial computation. For example, one subtask could perform an initial data cleansing, another pre-compute statistics or machine learning models which then serve as input to subsequent steps, or even other pipelines. In practice these steps are combined into so-called data pipelines, which are then, after thorough testing, deployed on a production cluster to run automatically.

42 Data Pipelines The data pipelines can be categorized into (at least) two major classes: macro and micro pipelines. Macro pipelines are typically handled by process schedulers, where there are dedicated ones like Apache Oozie or Azkaban, or generic ones like Quartz. Micro pipelines are inline processing helpers that can be used to repeatedly transform entities either one by one or as a larger code construct. Examples of the latter are Crunch, Cascading, and Morphlines.

43 Macro Pipelines The scheduler systems are usually independent applications that have a server component with an API and a UI. They allow the construction of data processing pipelines by chaining smaller processing jobs together. The following will explain the schedulers using Oozie as an example. The other systems mentioned are Azkaban, developed by LinkedIn (see http://azkaban.github.io/azkaban2/), or Luigi by Spotify (see https://github.com/spotify/luigi).

44 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

45 Apache Oozie As mentioned, in practice it is very common to combine multiple subtasks into one larger workflow, and those eventually into even larger data pipelines. Apache Oozie is a server-based coordination system (a scheduler) which can handle such workflows. There are triggers that can start workflows based on time ("every hour", "once per week") and also based on dependent data sources ("start when all the data from the previous step has arrived"). Workflows are defined as graphs.

46 Oozie Workflows Oozie workflows are a combination of actions, e.g. MapReduce, Hive/Pig jobs, which are defined in a control-dependent Directed Acyclic Graph (DAG). Control-dependent here means that a second action on the same path in the graph can only execute when the previous one has completed. The language used to define these DAGs is hPDL, an XML-based Process Definition Language (PDL).

47 Oozie Workflows A workflow can contain control as well as action nodes. The former define the start and end of the workflow and allow the user to influence the execution path (make decisions, split the flow, and join it again). In addition, a workflow can be parameterized, which makes it possible to reuse the same workflow with, for example, different input or output files. This builds a form of template library of essential workflows used throughout the data pipelines. The parameters are very powerful and include macros too.

48 Oozie Actions The actions of a workflow are always executed out of band from Oozie, usually in a dedicated cluster. When an action has completed it sends a callback to Oozie, which in turn starts the next action. In other words, they are the smallest unit of work in Oozie and handle the actual task execution. There is a list of actions already included in Oozie, but they can also be extended by custom actions as needed.

49 Oozie Actions The supplied actions are, for example: MapReduce Pig and Hive HDFS operations SSH HTTP Oozie sub-workflows

50 Example: Oozie Workflow Note: round means control node, square means action node

51 Example: Oozie Workflow Notes: There is always only one start and one end node, i.e. the graph has to be defined in that way or it is invalid. All parameters that are used must be defined, for example using a properties file. If that is not the case, then the workflow will not run at all.

52 Example: Oozie Workflow
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>

53 Example: Oozie Workflow
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputdir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputdir}</value>
        </property>
      </configuration>
    </map-reduce>

54 Example: Oozie Workflow
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>

55 Example: Oozie Workflow #2

56 Workflow Server Oozie includes a REST-based server (using Apache Tomcat) with a simple web-based UI. The UI can only read the current state and display it, but does not allow any other access. If a workflow needs interaction, then the REST API must be used. The supplied Oozie command-line client allows sending commands from the CLI to the server in a convenient way. Another way to interact with Oozie is Hue (installed in the VM). The server executes the workflows and monitors their status. The details are stored in a relational database.

57 Workflow Application There are three parts needed to form a workflow: 1. File workflow.xml Contains the workflow definition in hPDL 2. Libraries Optional directory lib/ for JAR and SO files 3. Properties file Contains the parameters for the workflow (for the workflow.xml) and must at least set oozie.wf.application.path

58 Workflow Application Example for a properties file:
namenode=hdfs://localhost:8020
jobtracker=localhost:8021
queuename=default
inputdir=${namenode}/data.in
outputdir=${namenode}/out
user.name=training
oozie.wf.application.path=${namenode}/user/${user.name}/oozie/workflow/wordcount/

59 Workflow Application The required files are copied into the path in HDFS specified with oozie.wf.application.path and then submitted through the CLI. Once this is done the workflow can be queried through the CLI or the UI (or with custom code) and monitored. There are additional options which allow for an active callback from Oozie once a workflow is complete. This can be used, for example, to trigger outside actions in external applications.
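For example, a minimal sketch of deploying the files with the HDFS shell, assuming the paths from the properties file above:
$ hadoop fs -mkdir -p /user/training/oozie/workflow/wordcount
$ hadoop fs -put workflow.xml lib/ /user/training/oozie/workflow/wordcount/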

60 Submit a Workflow Submit a workflow:
$ oozie job -run -config job.properties \
    -oozie
Workflow ID: oozie-wrkf-W
Check the status of a workflow:
$ oozie job -info oozie-wrkf-W \
    -oozie
Workflow Name: test-wf
App Path: hdfs://localhost:11000/user/your_id/oozie/
Workflow job status [RUNNING]

61 Resume a Workflow In case an action fails while processing a workflow, the failed operations can be retried after, for example, fixing the problem. First though the flow is aborted as per the DAG definition. Then the flow can be resumed with either oozie.wf.rerun.failnodes=true or oozie.wf.rerun.skip.nodes=<k1>,... and the parameter rerun. Only one of those command-line options can be used at a time; they are mutually exclusive.
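A hedged sketch of such a rerun, setting the property in the job properties file first (the workflow ID is a placeholder):
$ echo "oozie.wf.rerun.failnodes=true" >> job.properties
$ oozie job -rerun <workflow-id> -config job.properties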

62 Oozie Coordinators Another concept provided by Oozie are the coordinators. They are the mechanism allowing the use of triggers that start workflows based on specific conditions, i.e. time- or data-based. They require an additional XML file that is stored in the workflow directory and contains the coordinator definition. This is useful to define automated and repetitive execution of workflows, for instance to chain workflows together into data pipelines.

63 Oozie Coordinators Coordinators are special workflows, yet are submitted the same way as normal workflows, using the CLI client. They are essentially long-running tasks executed inside the Oozie server, which then starts the contained workflow as defined in the XML. The same commands apply to coordinators as do for workflows.

64 Time-based Coordinators When a workflow needs to run at specific intervals or, as added very recently, at specific times (like CRON), a time-based coordinator definition can be used. The interval-based approach uses frequencies, like every day or every 2 hours. The CRON-like approach uses a very specific format that can handle fine-grained time control. All coordinator definitions have to be stored in a file called coordinator.xml located in the main directory of the workflow in HDFS.

65 Time-based Coordinators Example:
<coordinator-app name="coordinator1" frequency="${frequency}"
    start="${starttime}" end="${endtime}" timezone="${timezone}"
    xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${workflowpath}</app-path>
    </workflow>
  </action>
</coordinator-app>

66 Time-based Coordinators Just like the workflow.xml, the coordinator.xml file can be parameterized as well. For example (frequency in minutes):
frequency=60
starttime= t20\:20z
endtime= t20\:20z
timezone=GMT+0530
workflowpath=${namenode}/user/${user.name}/oozie/workflow/wordcount/

67 Data/File-based Coordinators When adding a data (aka file) based element to the coordinator.xml, it can be defined to start workflows automatically when specific data (in files) arrives. With it the output of another workflow can serve as an input to the next one. Or, data written by Flume into an input directory can be processed once it is complete. A special semaphore allows flagging the latter, i.e. when data is fully written and a workflow can safely start.

68 Data/File-based Coordinators Example definition:
...
<datasets>
  <dataset name="input1" frequency="${datasetfrequency}"
      initial-instance="${datasetinitialinstance}"
      timezone="${datasettimezone}">
    <uri-template>${dataseturitemplate}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    <done-flag></done-flag>
  </dataset>
</datasets>
<input-events>
  <data-in name="coordinput1" dataset="input1">
    <start-instance>${inputeventstartinstance}</start-instance>
    <end-instance>${inputeventendinstance}</end-instance>
  </data-in>
</input-events>
...

69 Data/File-based Coordinators Example properties (frequency in minutes):
...
datasetfrequency=15
datasetinitialinstance= t15:30z
datasettimezone=UTC
dataseturitemplate=${namenode}/srvs/s0001/in
inputeventstartinstance=${coord:current(0)}
inputeventendinstance=${coord:current(0)}
workflowpath=${namenode}/user/${user.name}/oozie/workflow/wordcount/
...
inputdir=${coord:dataIn('coordinput1')}
outputdir=${namenode}/out
oozie.coord.application.path=${namenode}/user/${user.name}/coordoozie/coordinatorfilebased

70 Submit Coordinators Submit a coordinator-based workflow:
$ oozie job -run -config job.properties \
    -oozie
job: oozie-hado-C
Suspend a coordinator-based workflow:
$ oozie job -suspend oozie-hado-C \
    -oozie
Resume a suspended coordinator-based workflow:
$ oozie job -resume oozie-hado-C \
    -oozie
Kill a coordinator-based workflow:
$ oozie job -kill oozie-hado-C \
    -oozie

71 Oozie Bundles The newest addition to Oozie are the so-called Bundles, which allow building the sophisticated data pipelines discussed before without the need to submit them separately as workflows. Bundles combine multiple coordinator-based workflows into a single package. The definition also allows defining dependencies between the workflows, so that they can be honored by Oozie during execution.

72 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

73 Information Architecture After looking at the tools to define and build, as well as reliably execute, data pipelines, there is another topic that should be discussed, which is how data is organized. Especially in shared, multi-tenant setups it is vital to define upfront how data is stored and laid out in HDFS. Otherwise it will be difficult for users to find their own data, as well as the shared data from other departments. This is called the information architecture.

74 Information Architecture For solutions based on Hadoop this can be covered by using services. For that, all use-cases are mapped into user groups, both for regular users and service administrators. In HDFS, for example, all of the files are stored in specific directories and with specific access rights, each reflecting one service or user group. Examples: /srvs/s1/ or /system/etl/sales

75 Information Architecture Within each group/service level there are further well-defined directories created for incoming, complete, failed, and currently being processed data, the last called working. In the working directory another level is created with a directory per workflow that has a unique ID (usually timestamp based), so that multiple jobs of the same kind can run in parallel without any overlap or causing issues with already existing files.

76 Information Architecture Complete Example:
/system
/system/etl/<group>/<process>
    /incoming
    /complete
    /failed
    /working
        /<epoch_idx>
            /incoming
            /complete
/system/data/<dataset>

77 Information Architecture Files in these directories are then assigned the owner and group rights for the service they belong to, e.g. s1admin, s1user, s2admin, s2user,... If now, for instance, a user from service 2 (s2) wants to access the files from service 1 (s1), then all that needs to be done is add the user in question to the user group of service 1, e.g. s1user. This concept can be extended to build further hierarchies that handle more complex setups.
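As an illustration, a sketch of how such ownership and rights could be applied with the HDFS shell (the group names and path follow the examples above):
$ hadoop fs -chown -R s1admin:s1user /srvs/s1
$ hadoop fs -chmod -R 770 /srvs/s1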

78 Information Architecture The user groups can also be used to grant access to the job queues in MapReduce or YARN. This allows specifying which service or group of users can use the available resources for their data. Limits can be applied and fine-grained control is available to ensure each use-case is able to deliver its final product. Final words: Hadoop does not imply or enforce any rules on its own; the information architecture has to do that.

79 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

80 Batch vs. Real-time With increasing use of batch-oriented solutions, for example MapReduce, within organizations, users eventually ask for more timely answers. It becomes more and more important to extend the batch platform with near- or real-time components. There are a few choices to do that, revolving around the storage location of the data. It could be stored on persistent storage first and read back, or processed directly in memory.

81 Persisted Data We have already seen this kind of approach, for example with Impala. First the data is written to persistent storage (e.g. disks) and then subsequently processed. This causes a slight delay in how current the data is. The main advantage is the persistency and scalability, because off-memory storage is much more affordable in comparison. And you can still use faster media, like SSDs or even PCI Flash drives.

82 In-Memory Data The other approach is to keep data in memory, where it can be processed with tremendous speed. Some examples here are SAP HANA or Oracle Exalytics. The obvious drawback is the cost of such a solution, as memory is still much more expensive compared to the slower, but larger media like disks. In general it is questionable how such systems might scale.

83 In-Memory Data Another topic is complex event processing (CEP) or stream processing. In this case only relevant information is kept in memory and is available for querying. This is useful to, e.g., track trending topics, or sums over a window of time. Examples for this kind of technology are Storm by Twitter (previously BackType), or Spark Streaming. Even Flume can be coerced into similar functionality using interceptors, though that is not its strong point.

84 Hybrid Systems: Spark Talking about Spark, it is more of a hybrid system, as it does not just cover one of the previously discussed processing styles, but spans both. Spark can cache data in memory for iterative processing without loading the data again. But it can also work directly off disks if needed. In addition, it has a much more flexible data processing model, defining the processing steps in a DAG akin to the ones found in Oozie.

85 Spark RDDs
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()  # 74
linesWithSpark.first()  # '# Apache Spark'
[Diagram: transformations build a chain of RDDs; an action finally produces a value]

86 Hybrid Systems: Spark Workflow definitions can be written in Java, Python or Scala. Similar to Pig Latin, there are transformations and actions available. The latter are the ones that trigger the execution of the processing graph up to the current location. Any intermediate data can be cached so that repetitive access does not require extra load operations.
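A minimal sketch in the Python shell (the input path is an assumption) showing how caching avoids re-reading the input when an RDD is used by more than one action:
> lines = sc.textFile("hdfs://namenode:9000/path/file")
> errors = lines.filter(lambda line: "ERROR" in line)
> errors.cache()   # keep the filtered RDD in memory once computed
> errors.count()   # first action: reads from HDFS, then caches
> errors.count()   # second action: served from the in-memory copy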

87 Create RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

88 Simple Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)  # {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)  # {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
# => {0, 0, 1, 0, 1, 2}

89 Simple Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()  # => [1, 2, 3]
# Return first K elements
> nums.take(2)  # => [1, 2]
# Count number of elements
> nums.count()  # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

90 Spark Example: PageRank Spark is very suitable for mathematical algorithms that need iterations, for example PageRank as used by Google to build its search index. The approach is to start with some generic settings and then iterate over the computation until the results converge. Basic idea: Links from many pages -> high rank Link from a high-rank page -> high rank

91 PageRank Algorithm 1. Every page starts with a rank of 1 2. On each iteration, have page p contribute rank(p) / #neighbors(p) to its neighbors 3. Set each page's rank to 0.15 + 0.85 × contribs


96 PageRank Algorithm 1. Every page starts with a rank of 1 2. On each iteration, have page p contribute rank(p) / #neighbors(p) to its neighbors 3. Set each page's rank to 0.15 + 0.85 × contribs Final result: [Diagram: the converged page ranks]

97 Spark Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

98 Spark PageRank Performance [Chart: iteration time (s) over the number of machines, comparing Hadoop and Spark]

99 Hybrid Systems: Spark Spark can not only handle iterative, but also linear algorithms well, including MapReduce. Initial numbers show a tremendous improvement in how long processing takes. Often Spark jobs are many times faster compared to traditional MapReduce. In addition, Spark also has a streaming component. Summarizing, Spark is one example of how Hadoop keeps developing into a more and more capable platform for generic yet fast data processing.

100 Sources
Flume Project: http://flume.apache.org/
Presentation: http:// hug
Oozie Project: http://oozie.apache.org/
Spark Summit Exercises: http://spark-summit.org/2013/exercises/
Hadoop Operations, Eric Sammer, O'Reilly


Real Time Data Processing using Spark Streaming Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

BIG DATA HADOOP TRAINING

BIG DATA HADOOP TRAINING BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects

More information

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

SQL on NoSQL (and all of the data) With Apache Drill

SQL on NoSQL (and all of the data) With Apache Drill SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016 Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible

More information

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information