Hadoop Masterclass. Part 4 of 4: Analyzing Big Data. Lars George EMEA Chief Architect Cloudera


1 Hadoop Masterclass Part 4 of 4: Analyzing Big Data Lars George, EMEA Chief Architect, Cloudera

2 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

3 Access to Data There are many ways to access data. For Hadoop there are at least the following two major categories: Batch Access This is using, for example, MapReduce or an abstraction on top of it, for example Pig and Hive. Near and Real-time Access Data is read directly from HDFS or memory, for example using Impala or Solr.

4 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

5 Pig Pig was developed within Yahoo! to widen the access to MapReduce across the organization. Programming jobs in Java is tedious and takes time, while SQL is allegedly too restrictive. Pig defines a query language called Pig Latin, an imperative language (as opposed to the declarative SQL), which follows many concepts found in such languages. Pig is extensible using User-Defined Functions (UDFs) and supplies a command-line shell called Grunt.

6 Pig Execution When executing a Pig script, it is translated into one or more MapReduce jobs. These then execute the supplied Pig JAR file, which is responsible for reading the input file(s) and performing the defined actions, resulting in output file(s). For a developer Pig should be easy to read since it reflects the processing steps one after the other, very similar to BASIC or Bash shell scripting. The schema is applied at runtime and within the script.

7 Pig Data Model There are Relations, Bags and Tuples, as well as simple data types (Fields). Relations are similar to Tables and contain Bags, which in turn contain Tuples. The latter contain the actual Fields. In other words, the Tuples are the Rows in a Table. This is also very similar to the relational algebra forming the basis for RDBMSs. Pig supplies classes to convert data from files into the above concepts.
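To make the nesting concrete, here is a minimal sketch in the Grunt shell (the file name and fields are made up for illustration); DESCRIBE prints the schema of a grouped relation, showing a Bag of Tuples nested inside the Relation:
grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int);
grunt> grouped_records = GROUP records BY year;
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, records: {(year: chararray, temperature: int)}}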

8 Pig There are two execution modes, local and distributed:
/* local mode */
$ pig -x local ...
/* mapreduce mode */
$ pig ...
or
$ pig -x mapreduce ...

9 Pig Example
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

10 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

11 Hive Facebook had a similar problem to Yahoo!, i.e. it wanted a larger audience for the data stored in Hadoop. They had many analysts that were already familiar with SQL, so they set out to develop an SQL front-end to Hadoop. The syntax loosely follows that of MySQL, because that is what was already widely in use at FB. HiveQL (Hive's SQL) has a few extensions, but also some restrictions compared to the full SQL standard.

12 Hive Overview Apache Hive is a high-level abstraction on top of MapReduce Uses a SQL-like language called HiveQL Generates MapReduce jobs that run on the Hadoop cluster Originally developed by Facebook for data warehousing Now an open-source Apache project
SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

13 Hive Overview Hive itself runs on the client machine, i.e. it is a tool that takes HiveQL statements from a file or interactively and translates them into a series of MapReduce jobs that run on the actual Hadoop cluster.

14 Why use Hive? More productive than writing MapReduce directly Five lines of HiveQL might be equivalent to 100 lines or more of Java Brings large-scale data analysis to a broader audience No software development experience required Leverage existing knowledge of SQL Offers interoperability with other systems Extensible through Java and external scripts Many business intelligence (BI) tools support Hive

15 Data Access Hive's queries operate on tables, just like in an RDBMS A table is simply an HDFS directory containing one or more files Default path: /user/hive/warehouse/<table_name> Hive supports many formats for data storage and retrieval How does Hive know the structure and location of tables? These are specified when tables are created This metadata is stored in Hive's metastore Contained in an RDBMS such as MySQL
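To illustrate how a table definition ties the metastore schema to files in an HDFS directory, a minimal sketch (table and column names are assumptions, reusing the customers example from the earlier query):
CREATE TABLE customers (
  cust_id INT,
  fname   STRING,
  lname   STRING,
  zipcode STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- the data files would then live under /user/hive/warehouse/customers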

16 Data Access Hive consults the metastore to determine data format and location. The query itself operates on data stored on a file system (typically HDFS).

17 Hive vs. RDBMS Client-server database management systems have many strengths Very fast response time Support for transactions Allows modification of existing records Can serve thousands of simultaneous clients Hive does not turn your Hadoop cluster into an RDBMS It simply produces MapReduce jobs from HiveQL queries Limitations of HDFS and MapReduce still apply

18 Hive vs. RDBMS

19 Hive as a Service Initially the Hive project offered a Thrift-based server, aptly called Hive Server, that could run queries. It has serious limitations, especially around security, and has been replaced (still work-in-progress) with Hive Server 2. The Hive Server in general offers two major services: it acts as a gateway for local and remote clients, and can handle API as well as JDBC/ODBC calls from the dedicated implementations.

20 Hive Server 2

21 Hive CLIs As with the Hive Server, there are two implementations of the command-line tools: the Hive Shell and Beeline. The former bypasses any control and talks directly to the metastore and submits MapReduce jobs. The latter, Beeline, is for HiveServer2 and uses it to handle all communication with the cluster internally. This includes the metastore as well as the MapReduce jobs themselves. The following diagram shows the difference between the two.

22 Hive CLI and HiveServer2

23 Hive Shell You can execute HiveQL statements in the Hive Shell This interactive tool is similar to the MySQL shell Run the hive command to start the Hive shell The Hive shell will display its hive> prompt Each statement must be terminated with a semicolon Use the quit command to exit the Hive shell
$ hive
hive> SELECT cust_id, fname, lname FROM customers WHERE zipcode=20525;
Quentin Shepard
Brandon Louis
Marilyn Ham
hive> quit;
$

24 Hive Shell You can also execute a file containing HiveQL code using the -f option
$ hive -f myquery.hql
Or use HiveQL directly from the command line using the -e option
$ hive -e 'SELECT * FROM users'
Use the -S (silent) option to suppress informational messages Can also be used with the -e or -f options
$ hive -S

25 Beeline To connect to HiveServer2, use Hue or the Beeline CLI You cannot use the Hive shell For secure deployments, supply your user ID and password Example: starting Beeline and connecting to HiveServer2
analyst]$ beeline
Beeline version cdh4.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 training mypwd org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version )
Driver: Hive (version cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
beeline>

26 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

27 Impala While all previously discussed systems essentially run MapReduce batch jobs, Impala is more like a true MPP (massively parallel processing) system with properties of a distributed database. Just like Hive, Impala supports HiveQL and the Hive Metastore natively and acts like Hive from the outside. Internally it is completely different though, as it has processes running permanently and reads data directly off the lower storage layers (HDFS or HBase). There is no MapReduce involved at all, and hence it does not share the latency issues noticeable there.
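As an illustration, a minimal sketch of querying an existing Hive table through the impala-shell client (the host name and table are assumptions; 21000 is the usual impalad client port):
$ impala-shell -i impalad-host:21000
[impalad-host:21000] > SELECT zipcode, COUNT(*) FROM customers GROUP BY zipcode;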

28 Impala Architecture Two binaries: impalad and statestored Impala daemon (impalad) Handles client requests and all internal requests related to query execution Exports Thrift services for these two roles State store daemon (statestored) Provides name service and metadata distribution Also exports a Thrift service

29 Impala Architecture Query execution phases Request arrives via ODBC/Beeswax Thrift API Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalads During execution Intermediate results are streamed between executors Query results are streamed back to client Subject to limitations imposed by blocking operators (top-n, aggregation)

30 Impala Architecture: Planner Two-phase planning process: Single-node plan: left-deep tree of plan operators Plan partitioning: partition single-node plan to maximize scan locality, minimize data movement Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange Distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node Join order = FROM clause order

31 Query Planner Example: Query with JOIN and Aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
[Plan diagram: HDFS and HBase scans feed the hash join and pre-aggregation at the DataNodes and region servers; exchanges deliver partial results to the coordinator, which performs the merge aggregation and final top-n]

32 Impala Architecture: Query Execution Request arrives via ODBC/Beeswax Thrift API
[Diagram: a SQL application sends the SQL request via ODBC to the Query Planner of one impalad; Hive Metastore, HDFS NameNode and Statestore provide metadata; every node runs Query Planner, Query Coordinator and Query Executor alongside the HDFS DataNode and HBase]

33 Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalads
[Diagram: the coordinating impalad distributes plan fragments to the Query Executors of the other impalads, each co-located with an HDFS DataNode and HBase]

34 Impala Architecture: Query Execution Intermediate results are streamed between impalads Query results are streamed back to client
[Diagram: executors stream intermediate results to each other; the coordinating impalad streams the query results back to the SQL application]

35 Comparing Impala to Hive Hive: MapReduce as an execution engine High latency, low throughput queries Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results Java runtime allows for easy late-binding of functionality: file formats and UDFs Extensive layering imposes high runtime overhead Impala: Direct, process-to-process data exchange No fault tolerance An execution engine designed for low runtime overhead

36 Comparing Impala to Hive Impala's performance advantage over Hive: no hard numbers, but Impala can get full disk throughput (~100 MB/sec/disk); I/O-bound workloads often faster by 3-4x Queries that require multiple MapReduce phases in Hive see a higher speedup Queries that run against in-memory data see a higher speedup (observed up to 100x)

37 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

38 Apache Lucene Besides Hadoop (and Avro), Doug Cutting founded another project (actually the first one, spawning the others): Lucene. It offers an open-source implementation of text search, with many interesting features, e.g. Boolean logic (AND, OR etc.), term boosting, and fuzzy matching. Lucene is great at doing a full-text search across datasets and is a natural complement to the big data space and Hadoop.

39 Apache Solr Lucene itself though is built for a single-machine setup and to be embedded into another process. Scaling it requires a framework that can handle many distributed search indices that are maintained centrally. This functionality is provided by Solr, a wrapper around Lucene. Solr adds a server component that hosts the Lucene libraries and exports a REST API that can be used from clients. On top of Solr there is also SolrCloud, an extension that adds the distribution part, so that many Solr servers can form an ensemble of machines in a cluster.
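As a sketch of what such a REST call could look like (host, collection name and query term are assumptions; 8983 is Solr's usual default port):
$ curl 'http://solr-host:8983/solr/collection1/select?q=hadoop&wt=json&rows=10'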

40 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

41 Data Pipelines Processing data is often a task with many smaller steps, each of which performs a partial computation. For example, one subtask could perform an initial data cleansing, another pre-compute statistics or machine learning models which then serve as input to subsequent steps, or even other pipelines. In practice these steps are combined into so-called data pipelines, which are then, after thorough testing, deployed on a production cluster to run automatically.

42 Data Pipelines The data pipelines can be categorized into (at least) two major classes: macro and micro pipelines. Macro pipelines are typically handled by process schedulers, where there are dedicated ones like Apache Oozie or Azkaban, or generic ones like Quartz. Micro pipelines are inline processing helpers that can be used to repeatedly transform entities either one by one or as a larger code construct. Examples of the latter are Crunch, Cascading, and Morphlines.

43 Macro Pipelines The scheduler systems are usually independent applications that have a server component with an API and a UI. They allow the construction of data processing pipelines by chaining smaller processing jobs together. The following will explain the schedulers using Oozie as an example. The other systems mentioned are Azkaban, developed by LinkedIn (see http://azkaban.github.io/azkaban2/), or Luigi by Spotify (see https://github.com/spotify/luigi).

44 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

45 Apache Oozie As mentioned, in practice it is very common to combine multiple subtasks into one larger workflow, and those eventually into even larger data pipelines. Apache Oozie is a server-based coordination system (a scheduler) which can handle such workflows. There are triggers that can start workflows based on time ("every hour", "once per week") and also based on dependent data sources ("start when all the data from the previous step has arrived"). Workflows are defined as graphs.

46 Oozie Workflows Oozie workflows are a combination of actions, e.g. MapReduce, Hive/Pig jobs, which are defined in a control-dependent Directed Acyclic Graph (DAG). Control-dependent here means that a second action on the same path in the graph can only execute when the previous one has completed. The language used to define these DAGs is hPDL, an XML-based Process Definition Language (PDL).

47 Oozie Workflows A workflow can contain control as well as action nodes. The former define the start and end of the workflow and allow the user to influence the execution path (make decisions, split the flow, and join it again). In addition, a workflow can be parameterized, which makes it possible to reuse the same workflow with, for example, different input or output files. This builds a form of template library of essential workflows used throughout the data pipelines. The parameters are very powerful and include macros too.

48 Oozie Actions The actions of a workflow are always executed out of band from Oozie, usually in a dedicated cluster. When an action has completed it sends a callback to Oozie, which in turn starts the next action. In other words, they are the smallest unit of work in Oozie and handle the actual task execution. There is a list of actions already included in Oozie, but they can also be extended by custom actions as needed.

49 Oozie Actions The supplied actions are, for example: MapReduce Pig and Hive HDFS operations SSH HTTP Oozie sub-workflows

50 Example: Oozie Workflow Note: round means control node, square means action node

51 Example: Oozie Workflow Notes: There is always only one start and one end node, i.e. the graph has to be defined in that way or it is invalid. All parameters that are used must be defined, for example using a properties file. If that is not the case, then the workflow will not run at all.

52 Example: Oozie Workflow
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>

53 Example: Oozie Workflow
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputdir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputdir}</value>
        </property>
      </configuration>
    </map-reduce>

54 Example: Oozie Workflow
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>

55 Example: Oozie Workflow #2

56 Workflow Server Oozie includes a REST-based server (using Apache Tomcat) with a simple web-based UI. The UI can only read the current state and display it, but does not allow any other access. If a workflow needs interaction, then the REST API must be used. The supplied Oozie command-line client allows sending commands from the CLI to the server in a convenient way. Another way to interact with Oozie is Hue (installed in the VM). The server executes the workflows and monitors their status. The details are stored in a relational database.

57 Workflow Application There are three parts needed to form a workflow: 1. File workflow.xml Contains the workflow definition in hPDL 2. Libraries Optional directory lib/ for JAR and SO files 3. Properties file Contains the parameters for the workflow (for the workflow.xml) and must at least set oozie.wf.application.path

58 Workflow Application Example for a properties file:
namenode=hdfs://localhost:8020
jobtracker=localhost:8021
queuename=default
inputdir=${namenode}/data.in
outputdir=${namenode}/out
user.name=training
oozie.wf.application.path=${namenode}/user/${user.name}/oozie/workflow/wordcount/

59 Workflow Application The required files are copied into the path in HDFS specified with oozie.wf.application.path and then submitted through the CLI. Once this is done the workflow can be queried through the CLI or the UI (or with custom code) and monitored. There are additional options which allow for an active callback from Oozie once a workflow is complete. This can be used, for example, to trigger outside actions in external applications.
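For example, a minimal sketch of deploying the files with the HDFS shell, assuming the paths from the properties file above:
$ hadoop fs -mkdir -p /user/training/oozie/workflow/wordcount
$ hadoop fs -put workflow.xml lib/ /user/training/oozie/workflow/wordcount/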

60 Submit a Workflow Submit a workflow:
$ oozie job -run -config job.properties \
    -oozie
Workflow ID: oozie-wrkf-W
Check the status of a workflow:
$ oozie job -info oozie-wrkf-W \
    -oozie
Workflow Name: test-wf
App Path: hdfs://localhost:11000/user/your_id/oozie/
Workflow job status [RUNNING]

61 Resume a Workflow In case an action fails while processing a workflow, the failed operations can be retried after, for example, fixing the problem. First though the flow is aborted as per the DAG definition. Then the flow can be resumed with either oozie.wf.rerun.failnodes=true or oozie.wf.rerun.skip.nodes=<k1>,... and the parameter rerun. Only one of those command-line options can be used at a time; they are mutually exclusive.
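A hedged sketch of such a rerun, setting the property in the job properties file first (the workflow ID is a placeholder):
$ echo "oozie.wf.rerun.failnodes=true" >> job.properties
$ oozie job -rerun <workflow-id> -config job.properties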

62 Oozie Coordinators Another concept provided by Oozie are the coordinators. They are the mechanism allowing the use of triggers that start workflows based on specific conditions, i.e. time- or data-based. They require an additional XML file that is stored in the workflow directory and contains the coordinator definition. This is useful to define automated and repetitive execution of workflows, for instance to chain workflows together into data pipelines.

63 Oozie Coordinators Coordinators are special workflows, yet are submitted the same way as normal workflows, using the CLI client. They are essentially long-running tasks executed inside the Oozie server, which then starts the contained workflow as defined in the XML. The same commands apply to coordinators as do for workflows.

64 Time-based Coordinators When a workflow needs to run at specific intervals or, as added very recently, at specific times (like CRON), a time-based coordinator definition can be used. The interval-based approach uses frequencies, like every day or every 2 hours. The CRON-like approach uses a very specific format that can handle fine-grained time control. All coordinator definitions have to be stored in a file called coordinator.xml located in the main directory of the workflow in HDFS.

65 Time-based Coordinators Example:
<coordinator-app name="coordinator1" frequency="${frequency}"
    start="${starttime}" end="${endtime}" timezone="${timezone}"
    xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${workflowpath}</app-path>
    </workflow>
  </action>
</coordinator-app>

66 Time-based Coordinators Just like the workflow.xml, the coordinator.xml file can be parameterized as well. For example (frequency in minutes):
frequency=60
starttime= t20\:20z
endtime= t20\:20z
timezone=GMT+0530
workflowpath=${namenode}/user/${user.name}/oozie/workflow/wordcount/

67 Data/File-based Coordinators When adding a data (aka file) based element to the coordinator.xml, it can be defined to start workflows automatically when specific data (in files) arrives. With it the output of another workflow can serve as an input to the next one. Or, data written by Flume into an input directory can be processed once it is complete. A special semaphore allows flagging the latter, i.e. when data is fully written and a workflow can safely start.

68 Data/File-based Coordinators Example definition:
...
<datasets>
  <dataset name="input1" frequency="${datasetfrequency}"
      initial-instance="${datasetinitialinstance}"
      timezone="${datasettimezone}">
    <uri-template>${dataseturitemplate}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    <done-flag></done-flag>
  </dataset>
</datasets>
<input-events>
  <data-in name="coordinput1" dataset="input1">
    <start-instance>${inputeventstartinstance}</start-instance>
    <end-instance>${inputeventendinstance}</end-instance>
  </data-in>
</input-events>
...

69 Data/File-based Coordinators Example properties (frequency in minutes):
...
datasetfrequency=15
datasetinitialinstance= t15:30z
datasettimezone=UTC
dataseturitemplate=${namenode}/srvs/s0001/in
inputeventstartinstance=${coord:current(0)}
inputeventendinstance=${coord:current(0)}
workflowpath=${namenode}/user/${user.name}/oozie/workflow/wordcount/
...
inputdir=${coord:dataIn('coordinput1')}
outputdir=${namenode}/out
oozie.coord.application.path=${namenode}/user/${user.name}/coordoozie/coordinatorfilebased

70 Submit Coordinators Submit a coordinator-based workflow:
$ oozie job -run -config job.properties \
    -oozie
job: oozie-hado-C
Suspend a coordinator-based workflow:
$ oozie job -suspend oozie-hado-C \
    -oozie
Resume a suspended coordinator-based workflow:
$ oozie job -resume oozie-hado-C \
    -oozie
Kill a coordinator-based workflow:
$ oozie job -kill oozie-hado-C \
    -oozie

71 Oozie Bundles The newest addition to Oozie are the so-called Bundles, which allow building the sophisticated data pipelines discussed before without the need to submit them separately as workflows. Bundles combine multiple coordinator-based workflows into a single package. The definition also allows defining dependencies between the workflows, so that they can be honored by Oozie during execution.

72 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

73 Information Architecture After looking at the tools to define and build, as well as reliably execute, data pipelines, there is another topic that should be discussed, which is how data is organized. Especially in shared, multi-tenant setups it is vital to define upfront how data is stored and laid out in HDFS. Otherwise it will be difficult for users to find their own data, as well as the shared data from other departments. This is called the information architecture.

74 Information Architecture For solutions based on Hadoop this can be covered by using services. For that, all use-cases are mapped into user groups, both for regular users and service administrators. In HDFS, for example, all of the files are stored in specific directories and with specific access rights, each reflecting one service or user group. Examples: /srvs/s1/ or /system/etl/sales

75 Information Architecture Within each group/service level there are further well-defined directories created for incoming, complete, failed, and currently being processed data, the last called working. In the working directory another level is created with a directory per workflow that has a unique ID (usually timestamp based), so that multiple jobs of the same kind can run in parallel without any overlap or causing issues with already existing files.

76 Information Architecture Complete Example:
/system
/system/etl/<group>/<process>
    /incoming
    /complete
    /failed
    /working
        /<epoch_idx>
            /incoming
            /complete
/system/data/<dataset>

77 Information Architecture Files in these directories are then assigned the owner and group rights for the service they belong to, e.g. s1admin, s1user, s2admin, s2user,... If now, for instance, a user from service 2 (s2) wants to access the files from service 1 (s1), then all that needs to be done is add the user in question to the user group of service 1, e.g. s1user. This concept can be extended to build further hierarchies that handle more complex setups.
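As an illustration, a sketch of how such ownership and rights could be applied with the HDFS shell (the group names and path follow the examples above):
$ hadoop fs -chown -R s1admin:s1user /srvs/s1
$ hadoop fs -chmod -R 770 /srvs/s1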

78 Information Architecture The user groups can also be used to grant access to the job queues in MapReduce or YARN. This allows specifying which service or group of users can use the available resources for their data. Limits can be applied and fine-grained control is available to ensure each use-case is able to deliver its final product. Final words: Hadoop does not imply or enforce any rules on its own; the information architecture has to do that.

79 Part 4: Analyzing Big Data Pig Hive Impala Search Data Pipelines Oozie Information Architecture Spark

80 Batch vs. Real-time With increasing use of batch-oriented solutions, for example MapReduce, within organizations, users eventually ask for more timely answers. It becomes more and more important to extend the batch platform with near- or real-time components. There are a few choices to do that, revolving around the storage location of the data. It could be stored on persistent storage first and read back, or processed directly in memory.

81 Persisted Data We have already seen this kind of approach, for example with Impala. First the data is written to persistent storage (e.g. disks) and then subsequently processed. This causes a slight delay in how current the data is. The main advantage is the persistency and scalability, because off-memory storage is much more affordable in comparison. And you can still use faster media, like SSDs or even PCI Flash drives.

82 In-Memory Data The other approach is to keep data in memory, where it can be processed with tremendous speed. Some examples here are SAP HANA or Oracle Exalytics. The obvious drawback is the cost of such a solution, as memory is still much more expensive compared to the slower, but larger media like disks. In general it is questionable how such systems might scale.

83 In-Memory Data Another topic is complex event processing (CEP) or stream processing. In this case only relevant information is kept in memory and is available for querying. This is useful to, e.g., track trending topics, or sums over a window of time. Examples for this kind of technology are Storm by Twitter (previously BackType), or Spark Streaming. Even Flume can be coerced into similar functionality using interceptors, though that is not its strong point.

84 Hybrid Systems: Spark Talking about Spark, it is more of a hybrid system, as it does not just cover one of the previously discussed processing styles, but spans both. Spark can cache data in memory for iterative processing without loading the data again. But it can also work directly off disks if needed. In addition, it has a much more flexible data processing model, defining the processing steps in a DAG akin to the ones found in Oozie.

85 Spark RDDs
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()  # 74
linesWithSpark.first()  # '# Apache Spark'
[Diagram: transformations build a chain of RDDs; an action finally produces a value]

86 Hybrid Systems: Spark Workflow definitions can be written in Java, Python or Scala. Similar to Pig Latin, there are transformations and actions available. The latter are the ones that trigger the execution of the processing graph up to the current location. Any intermediate data can be cached so that repetitive access does not require extra load operations.
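A minimal sketch in the Python shell (the input path is an assumption) showing how caching avoids re-reading the input when an RDD is used by more than one action:
> lines = sc.textFile("hdfs://namenode:9000/path/file")
> errors = lines.filter(lambda line: "ERROR" in line)
> errors.cache()   # keep the filtered RDD in memory once computed
> errors.count()   # first action: reads from HDFS, then caches
> errors.count()   # second action: served from the in-memory copy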

87 Create RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

88 Simple Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)  # {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)  # {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
# => {0, 0, 1, 0, 1, 2}

89 Simple Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()  # => [1, 2, 3]
# Return first K elements
> nums.take(2)  # => [1, 2]
# Count number of elements
> nums.count()  # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

90 Spark Example: PageRank Spark is very suitable for mathematical algorithms that need iterations, for example PageRank as used by Google to build its search index. The approach is to start with some generic settings and then iterate over the computation until the results converge. Basic idea: Links from many pages -> high rank Link from a high-rank page -> high rank

91 PageRank Algorithm 1. Every page starts with a rank of 1 2. On each iteration, have page p contribute rank(p) / #neighbors(p) to its neighbors 3. Set each page's rank to 0.15 + 0.85 × contribs


96 PageRank Algorithm 1. Every page starts with a rank of 1 2. On each iteration, have page p contribute rank(p) / #neighbors(p) to its neighbors 3. Set each page's rank to 0.15 + 0.85 × contribs Final result: [Diagram: the converged page ranks]

97 Spark Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

98 Spark PageRank Performance [Chart: iteration time (s) over the number of machines, comparing Hadoop and Spark]

99 Hybrid Systems: Spark Spark can not only handle iterative, but also linear algorithms well, including MapReduce. Initial numbers show a tremendous improvement in how long processing takes. Often Spark jobs are many times faster compared to traditional MapReduce. In addition, Spark also has a streaming component. Summarizing, Spark is one example of how Hadoop keeps developing into a more and more capable platform for generic yet fast data processing.

100 Sources
Flume Project: http://flume.apache.org/
Presentation: http:// hug
Oozie Project: http://oozie.apache.org/
Spark Summit Exercises: http://spark-summit.org/2013/exercises/
Hadoop Operations, Eric Sammer, O'Reilly


Real Time Data Processing using Spark Streaming Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

BIG DATA HADOOP TRAINING

BIG DATA HADOOP TRAINING BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects

More information

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

SQL on NoSQL (and all of the data) With Apache Drill

SQL on NoSQL (and all of the data) With Apache Drill SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016 Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible

More information

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information