INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

1 INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

2 AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary

4 BIG DATA FACTS
In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? 2 days
90% of the world's data was created in the last (how many years)? 2 years
What is 1024 petabytes also known as? 1 exabyte

5 DATA IS GETTING BIGGER
Rapid growth of global data: from 1 to 35 zettabytes
70% of the data is generated by individuals (1)
The number of mobile-connected devices exceeded the world's population in billion
Every minute in the Internet: Twitter tweets, shared Facebook content
Global mobile data traffic will surpass 10 exabytes in 2016 (2)
Sources: (1) CSC Report, big data growth infographic; (2) Cisco Visual Networking Index; (3) Intel

6 DATA EXPLOSION COMPOUNDS CHALLENGES
80% of the effort involved in dealing with data is cleaning it up in the first place (1)
Source: (1) O'Reilly Media

8 BIG DATA INCLUDES ALL TYPES OF DATA
Structured: pre-defined schema. Example: relational database systems
Semi-structured: inconsistent structure; cannot be stored in rows and tables in a typical database. Examples: logs, tweets, sensor feeds
Unstructured: lacks structure, or part of it lacks structure. Examples: free-form text, reports, customer feedback forms

9 EXAMPLE: PREDICTING TRENDS AND PREPARING FOR FUTURE DEMANDS
Big data analytics combines enterprise data with other relevant information (gaming industry, advertising buys, social media sentiments, web browsing patterns, movie releases) to create a predictive model of trends.

11 A NEW SOLUTION
Hadoop = HDFS + Map/Reduce
HDFS provides storage
MapReduce provides analysis

12 THE HADOOP APPROACH
Distribute large amounts of data across thousands of commodity hardware nodes
Process data in parallel
Replicate data across the cluster for reliability
Analysis moved to data
Avoids data copy
Scanning of data
Avoids random seeks
Easiest way to process

13 A NEW PARADIGM
Process data locally
Reduce dependence on bandwidth
Expect failure
Handle failover elegantly
Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
Reduce elapsed seek time
Place no conditions on the structure of the data

14 HADOOP OVERVIEW
[Diagram of the Hadoop stack: ZooKeeper, Hive, Chukwa, Pig, MapReduce, HDFS, HBase, and Sqoop, on top of the disks]

15 HADOOP COMPONENTS (1/2)
Essentials:
HDFS - a scalable, high-performance distributed file system.
MapReduce - a Java-based framework providing job tracking, node management, and an application container for mappers and reducers.
Frameworks:
Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
Hive - a structured data warehousing infrastructure that provides mechanisms for storage; data extraction, transformation, and loading (ETL); and a SQL-like language for querying and analysis.
HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.

16 HADOOP COMPONENTS (2/2)
Utilities:
Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
Sqoop - a tool for importing data from relational databases into Hadoop or Hive, and exporting it back, using MapReduce tools and standard JDBC drivers.
ZooKeeper - a distributed application management tool used for managing the nodes in a Hadoop computational network.

17 HDFS HADOOP DISTRIBUTED FILE SYSTEM

19 WHAT IS HDFS?
A scalable, high-performance distributed file system
Primary storage system for Hadoop
Fast and reliable
Designed for consistency
Presents a single view of multiple physical disks or file systems
Deployed only on Linux

20 HDFS CHARACTERISTICS
Persistent
Replicated
Linearly scalable
Applications stream reads sequentially, often from very large files
Optimized for read performance: avoids random disk seeks
Write once and read many times
Data stored in blocks, distributed over many nodes
Block sizes often range from 128 MB to 1 GB
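To make the block model above concrete, here is a small Python sketch that estimates how many blocks a file occupies and how much raw disk its replicas consume. The 128 MB default block size comes from the slide; the replication factor of 3 and the function name are assumptions chosen for illustration, not values read from a real cluster.

```python
import math

MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Assumptions for this sketch: fixed-size blocks, a partially filled
    last block that only occupies the bytes it actually holds, and every
    block replicated the same number of times.
    """
    if file_size_bytes <= 0:
        return 0, 0
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw = block_layout(1024 * MB)
    print(f"1 GB file: {blocks} blocks, {raw // MB} MB of raw storage")
```

Under these assumptions, larger blocks mean fewer entries for the NameNode to track, which is one reason block sizes are so much bigger than in an ordinary file system.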

21 HDFS ARCHITECTURE
[Diagram: a NameNode (block map, metadata) and a Secondary NameNode, with several DataNodes each holding copies of blocks such as BL1, BL2, BL3, BL6, BL7, BL8, BL9]

22 HDFS COMPONENTS
NameNode: manages DataNodes; keeps metadata for all nodes and blocks
DataNodes: manage block reads/writes for HDFS; manage block replication; live on racks (rack-aware data organization)
Client: talks directly to the NameNode, then to the DataNodes
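The division of labour between NameNode, DataNodes, and client can be sketched with a toy in-memory model in Python. This is not the Hadoop API: the class and function names are hypothetical, placement is plain round-robin rather than rack-aware, and a "failure" is just a flag on a node.

```python
from collections import defaultdict


class ToyNameNode:
    """Holds metadata only: which DataNodes store each block of a file."""

    def __init__(self, datanode_names, replication=3):
        self.datanode_names = list(datanode_names)
        self.replication = min(replication, len(self.datanode_names))
        self.block_map = defaultdict(list)  # filename -> [(block_id, [node names])]
        self._cursor = 0

    def allocate(self, filename, n_blocks):
        """Choose replica locations for each new block (round-robin, an assumption)."""
        total = len(self.datanode_names)
        for i in range(n_blocks):
            targets = [self.datanode_names[(self._cursor + k) % total]
                       for k in range(self.replication)]
            self._cursor += 1
            self.block_map[filename].append((f"{filename}#blk{i}", targets))
        return list(self.block_map[filename])

    def locate(self, filename):
        return list(self.block_map[filename])


class ToyDataNode:
    """Stores the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.alive = True


def write_file(namenode, datanodes, filename, data, block_size):
    """Client write: ask the NameNode where to put blocks, then send bytes to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for chunk, (block_id, targets) in zip(chunks, namenode.allocate(filename, len(chunks))):
        for name in targets:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, filename):
    """Client read: get block locations, then fetch each block from a live replica."""
    parts = []
    for block_id, targets in namenode.locate(filename):
        replica = next(datanodes[n] for n in targets if datanodes[n].alive)
        parts.append(replica.blocks[block_id])
    return b"".join(parts)


if __name__ == "__main__":
    nodes = {n: ToyDataNode(n) for n in ("dn1", "dn2", "dn3", "dn4")}
    nn = ToyNameNode(nodes, replication=3)
    write_file(nn, nodes, "demo.txt", b"abcdefghij", block_size=4)
    nodes["dn1"].alive = False  # simulate a failed node
    print(read_file(nn, nodes, "demo.txt"))
```

The point of the sketch is that file contents never pass through the NameNode; it only answers "where are the blocks?", and a read still succeeds when one replica holder is down.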

23 HDFS VS HBASE
HDFS: a distributed file system that is well suited for the storage of large files. It is NOT a general-purpose file system! HDFS does not work well with fewer than 5 DataNodes.
HBase: built on top of HDFS. Suitable for hundreds of millions or billions of rows; should not be used for tables with only a few thousand or a few million rows. More a data store than a database. RDBMS apps cannot be "ported" to HBase by simply changing a JDBC driver!

24 MAP/REDUCE HOW DOES IT WORK?

26 WHAT IS MAP/REDUCE? (1/2)
A framework written in Java
Big Data analytics and processing
Node-local computation
Parallel processes
Handles node fail-over
It all started when Google needed a way to determine which web sites to provide for searches and to do page ranking

27 WHAT IS MAP/REDUCE? (2/2)
Map applies to all the members of the dataset and returns a list of results
Reduce collates and resolves the results from one or more mapping operations executed in parallel
Very large datasets are split into large subsets called splits
Separates business logic from multi-processing logic
MapReduce framework developers focus on process dispatching, locking, and logic flow
App developers focus on implementing the business logic without worrying about infrastructure or scalability issues
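The map and reduce roles described above can be illustrated with a single-process word count in Python. This is only a sketch of the programming model, not Hadoop's Java API: the function names and the three input splits are made up for the example, and the "shuffle" step that groups values by key is done in memory.

```python
import re
from collections import defaultdict


def map_words(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in re.findall(r"[a-z']+", split.lower())]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # In a real cluster each split would be mapped on a different node in parallel.
    mapped = [pair for split in splits for pair in map_words(split)]
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input splits, invented for this illustration.
SPLITS = ["John was here", "Hi, John!", "John likes big data"]

if __name__ == "__main__":
    print(word_count(SPLITS)["john"])
```

Only `map_words` and `reduce_counts` contain business logic; everything else stands in for what the framework would provide.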

28 HOW MAP/REDUCE WORKS

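29 [Diagram: word-count illustration. Big Data input text such as "John was.." and "Hi, John!" goes through Map, which emits ( John, 1); Reduce produces the result ( John, 3)]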

30 MAP/REDUCE EXAMPLE (1/2)
Find the maximum temperature for each city out of 5 files. One of the files contains:
Toronto, 20
Dubna, 25
Geneva, 22
Rome, 32
Toronto, 4
Rome, 38
Geneva, 18
Mapper task result: (Toronto, 20) (Dubna, 25) (Geneva, 22) (Rome, 38)
Let's assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Dubna, 27) (Geneva, 32) (Rome, 37)
(Toronto, 32) (Dubna, 20) (Geneva, 20) (Rome, 33)
(Toronto, 22) (Dubna, 19) (Geneva, 33) (Rome, 31)
(Toronto, 31) (Dubna, 22) (Geneva, 19) (Rome, 30)

31 MAP/REDUCE EXAMPLE (2/2) All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city Final Result: (Toronto, 32) (Dubna, 27) (Geneva, 33) (Rome, 38)
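The temperature example can be replayed in a few lines of Python using the numbers from the slides. As on the slide, the mapper here already reduces its own file to one maximum per city; the function names are hypothetical and everything runs in one process instead of on a cluster.

```python
def map_max(records):
    """Mapper: reduce one file's (city, temperature) records to a local maximum per city."""
    best = {}
    for city, temp in records:
        if city not in best or temp > best[city]:
            best[city] = temp
    return list(best.items())


def reduce_max(mapper_outputs):
    """Reducer: combine all mapper outputs into a single maximum per city."""
    final = {}
    for output in mapper_outputs:
        for city, temp in output:
            final[city] = max(temp, final.get(city, temp))
    return final


# The one input file shown on the slide.
FILE_1 = [("Toronto", 20), ("Dubna", 25), ("Geneva", 22), ("Rome", 32),
          ("Toronto", 4), ("Rome", 38), ("Geneva", 18)]

# Intermediate results of the other four mappers, as given on the slide.
OTHER_MAPPERS = [
    [("Toronto", 18), ("Dubna", 27), ("Geneva", 32), ("Rome", 37)],
    [("Toronto", 32), ("Dubna", 20), ("Geneva", 20), ("Rome", 33)],
    [("Toronto", 22), ("Dubna", 19), ("Geneva", 33), ("Rome", 31)],
    [("Toronto", 31), ("Dubna", 22), ("Geneva", 19), ("Rome", 30)],
]

if __name__ == "__main__":
    print(reduce_max([map_max(FILE_1)] + OTHER_MAPPERS))
```

Because taking a maximum gives the same answer whether it is applied per file or over everything at once, the mappers can shrink their output before it crosses the network.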

32 PIG A HADOOP SCRIPTING LANGUAGE

34 WHAT IS PIG?
A high-level data-flow language (Pig Latin) and execution framework for parallel computation
Pig is made of two main components:
A SQL-like data processing language called Pig Latin
A compiler that compiles and runs Pig Latin scripts
Pig Latin provides:
Ease of programming. Trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks.
Optimization opportunities. Permits the system to optimize execution of tasks automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions.

35 THE ORIGINS Pig was created by Yahoo! to make it easier to analyze the data in HDFS without the complexities of writing a traditional MapReduce program. With Pig, it is possible to develop MapReduce jobs with a few lines of Pig Latin

36 PIG IN THE ECOSYSTEM
[Diagram: Pig on top of MapReduce, HBase, and HDFS]
Pig runs on Hadoop, utilizing both HDFS and MapReduce
By default, Pig reads and writes files from HDFS
Pig stores intermediate data among MapReduce jobs

37 RUNNING PIG
A Pig Latin script executes in three modes:
1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (default mode)
2. Local: the code executes locally in a single JVM using a local text file (for development purposes)
3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell

38 PIG EXAMPLES: UNION
grunt> a = LOAD 'A' USING PigStorage(',') AS (a1:int, a2:int, a3:int);
grunt> b = LOAD 'B' USING PigStorage(',') AS (b1:int, b2:int, b3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)
grunt> DUMP b;
(0,5,2)
(1,7,8)
grunt> c = UNION a, b AS (c1:int, c2:int, c3:int);
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)

39 PIG EXAMPLES: SPLIT
grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)

40 PIG EXAMPLES: FOREACH
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> mult = FOREACH c GENERATE c2, c2 * c3;
grunt> DUMP mult;
(1,2)
(5,10)
(3,12)
(7,56)
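For readers more at home in a general-purpose language, the three Pig operators shown in these examples (UNION, SPLIT, FOREACH ... GENERATE) correspond roughly to the following Python over lists of tuples. This is only an analogy: Pig runs these steps as distributed jobs and does not guarantee output order, so the sort below is an assumption added to match the order printed on the slides.

```python
# Relations a and b as loaded on the UNION slide.
a = [(0, 1, 2), (1, 3, 4)]
b = [(0, 5, 2), (1, 7, 8)]

# UNION: concatenate the two relations (sorted only to mirror the slide's DUMP output).
c = sorted(a + b)

# SPLIT c INTO d IF $0 == 0, e IF $0 == 1: route each tuple by its first field.
d = [row for row in c if row[0] == 0]
e = [row for row in c if row[0] == 1]

# FOREACH c GENERATE c2, c2 * c3: project and compute a new tuple per input tuple.
mult = [(c2, c2 * c3) for (_c1, c2, c3) in c]

if __name__ == "__main__":
    print(c)
    print(d, e)
    print(mult)
```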

41 EXAMPLE OF A PIG SCRIPT
Find the top 10 URLs for users between 18 and 25:
Users = LOAD 'users' AS (name, age);
FilteredUsers = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
JoinResult = JOIN FilteredUsers BY name, Pages BY user;
Grouped = GROUP JoinResult BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(JoinResult) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top10 = LIMIT Sorted 10;
STORE Top10 INTO 'top10sites';
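The same pipeline (filter, join, group, count, order, limit) can be written out in plain Python to show what each Pig statement contributes. The sample users and page visits are invented for the illustration, the function name is hypothetical, and the sketch assumes user names are unique, so the join reduces to a membership test.

```python
from collections import Counter


def top_urls(users, pages, n=10, min_age=18, max_age=25):
    """Return the n most visited URLs among users aged min_age..max_age."""
    # FILTER Users BY age >= 18 AND age <= 25
    wanted = {name for name, age in users if min_age <= age <= max_age}
    # JOIN ... BY name / user, then GROUP BY url and COUNT
    clicks = Counter(url for user, url in pages if user in wanted)
    # ORDER BY clicks DESC, then LIMIT n
    return clicks.most_common(n)


# Hypothetical sample data.
USERS = [("alice", 22), ("bob", 30), ("carol", 18), ("dave", 25)]
PAGES = [("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
         ("carol", "a.com"), ("dave", "b.com"), ("carol", "c.com"),
         ("dave", "a.com")]

if __name__ == "__main__":
    for url, count in top_urls(USERS, PAGES, n=2):
        print(url, count)
```

On a single machine this is short either way; the appeal of the Pig version is that the same few lines run unchanged over data far too large for one process.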

42 HIVE A DATA WAREHOUSE SYSTEM FOR HADOOP

44 WHAT IS HIVE?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

45 HIVEQL EXAMPLE
The underlying table www_access consists of three fields: ip, url, and time.
Number of records:
SELECT COUNT(1) FROM www_access;
Number of unique IPs that accessed the top page:
SELECT COUNT(distinct v['ip']) FROM www_access WHERE v['url']='/';
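To try queries of this shape without a Hadoop cluster, the sketch below runs equivalent SQL against an in-memory SQLite database from Python. SQLite is not Hive: the fields are modelled as ordinary columns instead of the v['...'] map syntax used on the slide, and the sample rows and function name are invented for the example.

```python
import sqlite3

# Hypothetical access-log rows: (ip, url, time).
SAMPLE_ROWS = [
    ("10.0.0.1", "/", 1),
    ("10.0.0.1", "/", 2),
    ("10.0.0.2", "/", 3),
    ("10.0.0.3", "/about", 4),
]


def run_queries(rows):
    """Return (total record count, number of distinct IPs that hit the top page)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE www_access (ip TEXT, url TEXT, time INTEGER)")
        conn.executemany("INSERT INTO www_access VALUES (?, ?, ?)", rows)
        total = conn.execute("SELECT COUNT(1) FROM www_access").fetchone()[0]
        unique_ips = conn.execute(
            "SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/'"
        ).fetchone()[0]
        return total, unique_ips
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_queries(SAMPLE_ROWS))
```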

46 SUMMARY WHAT DID WE LEARN?

47 TO TAKE AWAY Data is getting bigger and more complex to handle Hadoop = HDFS + Map/Reduce Will Hadoop replace relational databases? No!

48 QUESTIONS? THANK YOU FOR YOUR ATTENTION!