HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP


1 HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

2 AGENDA
- Introduction
- What is Hadoop and the rationale behind it
- Hadoop Distributed File System (HDFS) and MapReduce
- Common Hadoop use cases
- How Hadoop integrates with other systems such as relational databases and data warehouses
- The other components in a typical Hadoop stack, such as Hive, Pig, HBase, Sqoop, Flume and Oozie
- Conclusion

3 ABOUT TRIFORCE Triforce provides critical, reliable IT infrastructure solutions and services to Australian and New Zealand listed corporations and government agencies. Triforce has qualified and experienced technical and sales consultants and demonstrated experience in designing and delivering enterprise Apache Hadoop solutions.

4 TRIFORCE BIG DATA PARTNERSHIP
- NetApp: The NetApp Open Solution for Hadoop provides customers with flexible choices for delivering enterprise-class Hadoop.
- Cloudera: Cloudera is the market leader in Hadoop enterprise solutions. Cloudera's 100% open-source distribution including Apache Hadoop (CDH), combined with Cloudera Enterprise, comprises the most reliable and complete Hadoop solution available.

5 WHAT IS HADOOP?
- A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.

6 THE RATIONALE FOR HADOOP
- Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big.
- Hadoop processes petabytes of unstructured data in parallel across potentially thousands of commodity boxes, using an open source filesystem and related tools.
- Hadoop has been all about innovative ways to process, store, and eventually analyse huge volumes of multi-structured data.

7 EXAMPLES
- 2.7 zettabytes of data exist in the digital universe today (gigabyte, terabyte, petabyte, exabyte, zettabyte).
- Facebook stores, accesses, and analyses 30+ petabytes of user-generated data.
- Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
- YouTube users upload 48 hours of new video every minute of the day.
- 100 terabytes of data are uploaded daily to Facebook.

8 HADOOP
- Handles all types of data: structured, unstructured, log files, pictures, audio files, communications records
- No prior need for a schema: you don't need to know how you intend to query your data before you store it
- Makes all of your data useable: by making all of your data useable, not just what's in your databases, Hadoop lets you see relationships that were hidden before and reveal answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches and look at complete data sets, not just samples.
- Two parts to Hadoop:
  - MapReduce
  - Hadoop Distributed File System (HDFS)

9 What is this Big Elephant? HADOOP Geever Paul Pulikkottil BigData Solutions Architect (CCAH,CCDH)

10 CASE FOR BIGDATA
- Databases have been here for more than 20 years and continue to store structured transactional data
- Large server(s): multiple CPUs, huge memory buffers, SAN disks
- Relatively low-latency queries over indexed data

11 CASE FOR BIGDATA: TYPICAL DATABASE WORKLOADS
OLTP (online transaction processing)
- Typical use: e-commerce, banking
- Nature: user-facing, real-time, low-latency, highly concurrent
- Job: relatively small set of standard transactional queries
- Data access pattern: random reads, updates, writes (relatively small data)
OLAP (online analytical processing)
- Typical use: BI, data mining
- Nature: back-end processing, batch workloads
- Job: complex analytical queries, often ad hoc
- Data access pattern: table scans, large queries

12 CASE FOR BIGDATA
Data warehouse: consolidated database loaded from CRM, ERP, OLTP
- Process: staging, cleansing, loading
- Purpose: BI reporting, forecasts, quarterly reporting
- Size: larger server, multiple CPUs, SAN disks, many TBs
Challenge:
- As the data grows over time, things get slower
- Each batch must fit within the daily or weekly loading cycle
- Relatively expensive to license, store and manage

13 CASE FOR BIGDATA
New objective: businesses want to connect with the customer
- We are generating lots of data, and most of it is discarded
- Likes and dislikes: Facebook, Twitter, LinkedIn
- Outcomes are predictable when you know the customer
- React quickly: time missed = opportunity lost!
Question: can the DW provide that?
- Where can you store TBs or PBs of unstructured data more economically?
- How can you scale out easily, rather than with forklift upgrades?
- How can I finish batch jobs when the data grows beyond TBs?
Need: a scalable, distributed system that can store and process large amounts of data

14 CASE FOR BIGDATA
Distributed systems are not new:
- Common frameworks include MPI and PVM
- Focus on distributing the processing workload
- Powerful compute nodes, with separate systems for data storage
- Fast network connections (InfiniBand)
- Typical processing pattern:
  - Step 1: Copy input data from storage to compute node
  - Step 2: Perform necessary processing
  - Step 3: Copy output data back to storage
- Often hundreds to thousands of nodes with GPUs

15 CASE FOR BIGDATA
Distributed HPC:
- Works with relatively small amounts of data, but doesn't scale with large amounts of data
- More time is spent copying data than actually processing it
- Getting data to the processors is the bottleneck
- It gets worse as more compute nodes are added: each node competes for the same bandwidth, and compute nodes become starved for data
- Distributed systems pay for compute scalability by adding complexity (CUDA Fortran, PGI programming?)

16 BIGDATA SOLUTION: HADOOP
What is Hadoop?
- Open source distributed computing platform, based on Google's GFS (Google File System) and MapReduce
- Commodity hardware: no SAN, no InfiniBand
- Scales up from single servers to thousands of machines, each offering local computation and storage
- Designed to detect and handle failures at the application layer
- Adding more nodes increases performance and capacity with no penalty
- Commodity hardware is prone to failures, and Hadoop knows that!
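The contrast between the shared-storage HPC pattern and Hadoop's local storage can be sketched as simple arithmetic. This is only an illustration: the function names and every number used with them are hypothetical, and real clusters are affected by many factors (network topology, replication, skew) that are ignored here.

```python
def shared_storage_copy_seconds(data_mb: int, storage_mb_per_s: int) -> float:
    """Time to pull a data set off shared storage.

    The storage link is shared by all compute nodes, so adding compute
    nodes does not shorten this step.
    """
    return data_mb / storage_mb_per_s


def per_node_share_mb_per_s(storage_mb_per_s: int, nodes: int) -> float:
    """Bandwidth each compute node gets when all nodes read at once."""
    return storage_mb_per_s / nodes


def local_scan_seconds(data_mb: int, nodes: int, disk_mb_per_s: int) -> float:
    """Time to scan the data when it is spread evenly over local disks.

    Assumes (hypothetically) perfect distribution: every node reads only
    its own share from its own disk, so read bandwidth grows with nodes.
    """
    return data_mb / (nodes * disk_mb_per_s)


if __name__ == "__main__":
    data_mb = 1_000_000  # roughly 1 TB, a made-up figure
    print(per_node_share_mb_per_s(10_000, 100))    # share per node, 100 nodes
    print(per_node_share_mb_per_s(10_000, 1_000))  # share per node, 1000 nodes
    print(local_scan_seconds(data_mb, 100, 100))   # local disks, 100 nodes
```

In this model the per-node share of the shared link shrinks as nodes are added, while the aggregate bandwidth of local disks grows with the node count.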

17 HADOOP CLUSTER STACK
Typical setup:
Master Nodes (1st rack)
- Name Node
- Standby Name Node
- Job Tracker
Slave Nodes (all racks)
- Data Nodes with direct-attached large-capacity disks (SATA)
Plus:
- Management or Admin Node
- Hadoop Client Node(s)

18 MAPREDUCE PROGRAMMING
Hadoop is great for large-data processing!
- MapReduce code requires you to write Java classes and driver code
- It's complicated to write MapReduce jobs, so we need a simpler method
- Develop a higher-level language to facilitate large-data processing
- Hive: SQL-like language for Hadoop, called HQL
- Pig: Pig Latin is a scripting language, a bit like Perl
- Both translate into and run a series of map-only or MapReduce jobs
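To make the map, shuffle and reduce steps concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all: the function names are hypothetical, and the shuffle that the framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["the elephant in the server room", "the elephant likes data"]
    print(word_count(sample))
```

On a real cluster the mapper and reducer run on many nodes in parallel; tools such as Hive and Pig generate the equivalent jobs so that this code does not have to be written by hand.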

19 ECOSYSTEM TOOLS: HIVE AND PIG
Hive:
- Data warehousing application in Hadoop
- Query language is HQL, a variant of SQL
- Tables stored on HDFS as flat files
- Developed by Facebook, now open source
Pig:
- Large-scale data processing system
- Scripts are written in Pig Latin, a dataflow language
- Developed by Yahoo!, now open source
Objective:
- Higher-level language to facilitate large-data processing
- Higher-level language compiles down to Hadoop jobs

20 HIVE AND PIG EXAMPLE CODE Hive example: Pig example:

21 ECOSYSTEM TOOLS: SQOOP
- Imports data from an RDBMS into Hadoop: individual tables, portions of tables (WHERE clause) or entire databases
- Data is stored in HDFS as delimited text files or SequenceFiles
- Can import from SQL databases straight into your Hive data warehouse
- Uses JDBC to connect to the RDBMS; additional connectors are available for BI/DW systems
- Automatically generates a Java class to import data into Hadoop
- Provides an incremental import mode
- Exports tables from Hadoop to an RDBMS
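The idea behind an incremental import can be sketched without Sqoop: remember the highest value of a check column seen so far and fetch only rows beyond it on the next run. The sketch below uses Python's built-in sqlite3 module as a stand-in source database; the table, the column names and the helper functions are all hypothetical, and this is not how Sqoop itself is implemented.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory source table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")]
    )
    return conn


def incremental_import(
    conn: sqlite3.Connection, last_value: int
) -> Tuple[List[tuple], int]:
    """Fetch only rows whose id is greater than last_value.

    Returns the new rows and the updated high-water mark to store for
    the next run.
    """
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


if __name__ == "__main__":
    db = make_demo_db()
    first_batch, mark = incremental_import(db, 0)
    db.execute("INSERT INTO users VALUES (3, 'cy')")
    second_batch, mark = incremental_import(db, mark)
    print(first_batch, second_batch, mark)
```

The second call returns only the row added after the first import, which is what keeps repeated imports of a growing table cheap.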

22 SQOOP IMPORT EXAMPLES
> Importing data into HDFS as a Hive table using Sqoop:
    sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local \
        --hive-import
This would connect to the MySQL database on this server and import the USERS table into HDFS. The --local option instructs Sqoop to take advantage of a local MySQL connection. With the --hive-import option, after reading the data into HDFS, Sqoop will connect to the Hive metastore, create a table named USERS with the same columns and types (translated into their closest analogues in Hive), and load the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).
> Importing data into HDFS as compressed sequence files (no Hive) using Sqoop:
    user@dbserver$> sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
        --as-sequencefile
> Importing data into HBase using Sqoop:
    $ sqoop import --connect jdbc:mysql://localhost/acmedb \
        --table ORDERS --username test --password **** \
        --hbase-create-table --hbase-table ORDERS --column-family mysql
> Exporting data to an RDBMS using Sqoop:
    $ sqoop export --connect jdbc:mysql://localhost/acmedb \
        --table ORDERS --username test --password **** \
        --export-dir /user/arvind/orders
Note: the first two commands use the early Sqoop syntax. Sqoop 1.x releases expect a tool name after sqoop (for example, sqoop import), as in the last two commands.

23 SQOOP CUSTOM CONNECTORS
Sqoop works with standard JDBC connections to common databases. Faster, custom-tuned connectors are also available:
- Cloudera Connector for Teradata
- Cloudera Connector for Netezza
- Cloudera Connector for MicroStrategy
- Cloudera Connector for Tableau
- Quest Data Connector for Oracle and Hadoop

24 ECOSYSTEM TOOLS: FLUME
Flume gathers data and logs from multiple systems, inserting them into HDFS as they are generated. It is typically used to ingest log files from real-time systems such as web servers, firewalls and mail servers into HDFS.
Each Flume agent has a source, a channel and a sink:
- Source: tells the node where to receive data from
- Sink: tells the node where to send data to
- Channel: a queue between the source and the sink; can be in memory only or durable. Durable channels will not lose data if power is lost
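The source, channel and sink roles can be illustrated with a small in-process sketch. This is not Flume code or Flume's API: the class and function names are hypothetical, the channel is an in-memory queue (so it models the non-durable case), and the "HDFS" destination is just a Python list.

```python
import queue
from typing import Iterable, List


class MemoryChannel:
    """In-memory queue between a source and a sink (not durable)."""

    def __init__(self, capacity: int = 100) -> None:
        self._events: "queue.Queue[str]" = queue.Queue(maxsize=capacity)

    def put(self, event: str) -> None:
        """Accept one event from the source; raises queue.Full if full."""
        self._events.put_nowait(event)

    def take(self, batch_size: int) -> List[str]:
        """Hand up to batch_size queued events to the sink, oldest first."""
        batch: List[str] = []
        while len(batch) < batch_size and not self._events.empty():
            batch.append(self._events.get_nowait())
        return batch


def run_source(lines: Iterable[str], channel: MemoryChannel) -> None:
    """Source: read log lines and place them on the channel."""
    for line in lines:
        channel.put(line.rstrip("\n"))


def run_sink(channel: MemoryChannel, destination: List[str],
             batch_size: int = 2) -> int:
    """Sink: drain the channel in batches into the destination."""
    delivered = 0
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(batch)
        delivered += len(batch)
    return delivered


if __name__ == "__main__":
    channel = MemoryChannel()
    stored: List[str] = []
    run_source(["GET /index.html\n", "GET /about.html\n"], channel)
    print(run_sink(channel, stored), stored)
```

Because this channel lives only in memory, any events still queued would be lost on a crash; a durable channel would persist them to disk first.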

25 ECOSYSTEM TOOLS: FUSE
FUSE: Filesystem in Userspace
- Allows HDFS to be mounted as a UNIX file system
- Users can run 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard POSIX library calls such as open, write, read and close
- You can export a FUSE mount using NFS

26 ECOSYSTEM TOOLS: OOZIE
- Oozie is a workflow engine
- Runs workflows of Hadoop jobs, including Pig, Hive and Sqoop jobs
- Jobs can be run at specific times, one-off or recurring
- Jobs can also be run when data is present in a directory

27 ECOSYSTEM TOOLS: MAHOUT
- Mahout is a machine learning library
- Contains many pre-written ML algorithms
- R is another open source toolset used by data scientists

28 ECOSYSTEM TOOLS: IMPALA <CDH4.1>
- Brings real-time, ad hoc queries to Hadoop
- Queries data stored in HDFS or HBase
- SELECT, JOIN, and aggregate functions in real time
- Uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Hive, plus the Impala shell
- Released 26th Oct 2012 with CDH4.1

29 HBASE: REAL-TIME DATA WITH UPDATE
- HBase is a distributed, sparse, column-oriented data store
- Real-time read/write access to data on HDFS
- Modeled after Google's Bigtable data store
- Designed to use multiple machines to store and serve data
- Leverages HDFS to store data
- Each row may or may not have values for all columns
- Data is stored grouped by column, rather than by row
- Columns are grouped into column families, which define what columns are physically stored together
- Scales to provide very high write throughput: hundreds of thousands of inserts per second
- Has a constrained access model: no SQL
  - Insert a row, retrieve a row, do a full or partial table scan
  - Only one column (the row key) is indexed
- Based on a key/value store: [rowkey, column family, column qualifier, timestamp] -> cell value
  [TheRealMT, info, password, ] -> abc123
  [TheRealMT, info, password, ] -> newpass123
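The [rowkey, column family, column qualifier, timestamp] -> cell value model can be sketched as a small in-memory store that keeps every version of a cell and returns the newest on read. This is an illustration only, not the HBase client API: the class and method names are hypothetical, and the timestamps 1 and 2 are made-up placeholders, since the slide does not show the real values.

```python
from typing import Dict, List, Optional, Tuple

# (row key, column family, column qualifier)
CellKey = Tuple[str, str, str]


class TinyCellStore:
    """Keeps all timestamped versions of each cell, newest first."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, List[Tuple[int, str]]] = {}

    def put(self, row: str, family: str, qualifier: str,
            timestamp: int, value: str) -> None:
        """Add a new version; an update never overwrites older versions."""
        versions = self._cells.setdefault((row, family, qualifier), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row: str, family: str, qualifier: str) -> Optional[str]:
        """Return the value with the highest timestamp, or None."""
        versions = self._cells.get((row, family, qualifier))
        return versions[0][1] if versions else None

    def get_versions(self, row: str, family: str,
                     qualifier: str) -> List[Tuple[int, str]]:
        """Return every stored (timestamp, value) pair, newest first."""
        return list(self._cells.get((row, family, qualifier), []))

    def scan(self, start_row: str, stop_row: str) -> List[str]:
        """Partial table scan: row keys in [start_row, stop_row), sorted."""
        rows = {k[0] for k in self._cells if start_row <= k[0] < stop_row}
        return sorted(rows)


if __name__ == "__main__":
    store = TinyCellStore()
    store.put("TheRealMT", "info", "password", 1, "abc123")      # hypothetical ts
    store.put("TheRealMT", "info", "password", 2, "newpass123")  # hypothetical ts
    print(store.get("TheRealMT", "info", "password"))
    print(store.get_versions("TheRealMT", "info", "password"))
```

The second put does not replace the first; both versions stay addressable by timestamp, and a plain read returns the latest one.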

30 HBASE
- Indexed by [rowkey + column qualifier + timestamp]
- HBase is not a relational database
  - No SQL query language (GET/PUT/SCAN instead)
  - No joins, no secondary indexing, no transactions
- A table is split into Regions
- Regions are served by Region Servers
- Region Servers are Java processes running on DataNodes
- Two special tables: -ROOT- and .META.
- MemStore and HFiles: every MemStore flush creates one HFile per column family
- Compactions (major/minor) consolidate HFiles, reducing their number
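The interplay of MemStore flushes and compactions can be sketched with a toy store for a single column family. The names and the flush threshold below are hypothetical, and real HBase differs in many ways (write-ahead log, size-based flushes, separate minor and major compaction policies); the sketch only shows why flushes create files and compactions merge them.

```python
from typing import Dict, List, Optional


class TinyRegionStore:
    """Toy write path for one column family: MemStore plus flushed files."""

    def __init__(self, flush_threshold: int = 2) -> None:
        self.flush_threshold = flush_threshold
        self.memstore: Dict[str, str] = {}
        self.hfiles: List[Dict[str, str]] = []  # oldest file first

    def put(self, key: str, value: str) -> None:
        """Buffer a write in memory; flush once the threshold is reached."""
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the MemStore out as one new sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key: str) -> Optional[str]:
        """Check the MemStore first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None

    def compact(self) -> None:
        """Merge all files into one; newer values win over older ones."""
        merged: Dict[str, str] = {}
        for hfile in self.hfiles:  # oldest first, so later files overwrite
            merged.update(hfile)
        self.hfiles = [dict(sorted(merged.items()))] if merged else []


if __name__ == "__main__":
    store = TinyRegionStore(flush_threshold=2)
    for key, value in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
        store.put(key, value)
    print(len(store.hfiles))
    store.compact()
    print(store.hfiles)
```

Reads have to consult every file until a compaction runs, which is why consolidating files matters for read performance.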

31 DATA HAS CHANGED

32 HADOOP USE CASES: What do we know today?
- We love to be connected and to collaborate
- We love to share emotions: likes and dislikes
- Digital marketing has shifted its focus towards social media
- Get more insights across collections of data
- Need to store and analyse all sorts of data
- Real-time recommendation engines
- Predictive modelling with data science

33 COMMON HADOOP USE CASES
Financial Services
- Consumer & market risk modelling
- Personalization & recommendations
- Fraud detection & anti-money laundering
- Portfolio valuations

34 COMMON HADOOP USE CASES Government Cyber security & fraud detection, Geospatial image & video processing

35 COMMON HADOOP USE CASES Media & Entertainment Search & recommendation optimization, User engagement & digital content analysis, Ad/offer targeting, Sentiment & social media analysis

36 HADOOP USE CASES: DATA STORES
- OLTP database for user-facing transactions: retains records
- Extract-Transform-Load (ETL): periodic (e.g., nightly)
  - Extract records from the source
  - Transform: clean data, check integrity, aggregate, etc.
  - Load into the OLAP database
- OLAP database for Data Warehousing (DW): Business Intelligence (reporting, ad hoc queries, data mining)
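The extract, transform and load steps can be sketched as three small functions. The record layout (customer, amount), the cleaning rules and the dictionary standing in for the OLAP database are all hypothetical choices made for illustration; a real pipeline would read from and write to actual databases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# One OLTP record: (customer, amount). The layout is made up for this sketch.
Record = Tuple[Optional[str], int]


def extract(source: Iterable[Record]) -> List[Record]:
    """Extract: pull the day's records from the source system."""
    return list(source)


def transform(records: Iterable[Record]) -> Dict[str, int]:
    """Transform: drop invalid rows, then aggregate amounts per customer."""
    totals: Dict[str, int] = defaultdict(int)
    for customer, amount in records:
        if not customer or amount < 0:  # integrity check
            continue
        totals[customer] += amount
    return dict(totals)


def load(warehouse: Dict[str, int], totals: Dict[str, int]) -> None:
    """Load: add the aggregated totals to the warehouse table."""
    for customer, amount in totals.items():
        warehouse[customer] = warehouse.get(customer, 0) + amount


def nightly_etl(source: Iterable[Record], warehouse: Dict[str, int]) -> None:
    """Run one periodic ETL cycle."""
    load(warehouse, transform(extract(source)))


if __name__ == "__main__":
    warehouse: Dict[str, int] = {}
    nightly_etl([("ann", 10), ("ann", 5), (None, 3), ("bob", -1)], warehouse)
    print(warehouse)
```

Each stage is a batch step over the whole day's data, which is the property that makes ETL a natural fit for a batch-oriented system.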

37 HADOOP USE CASES: REPLACE DW?
- Reporting is often a nightly task
- ETL is often slow and runs after the end of the day
- What happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is perfect:
- Most likely, you already have some DW
- Ingest is limited by the speed of HDFS
- Scales out with more nodes
- Massively parallel
- Ability to use any processing tool
- Much cheaper than parallel databases
- ETL is a batch process anyway!
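The question of a batch that outgrows its window can be put into rough numbers. The sketch below assumes, as a simplification, that work divides evenly across nodes up to a scaling efficiency factor; the function name, the efficiency parameter and all the figures are hypothetical.

```python
import math


def nodes_needed(single_node_hours: float, window_hours: float,
                 efficiency: float = 1.0) -> int:
    """Estimate how many nodes fit a batch job into its time window.

    single_node_hours: how long the job would take on one node.
    window_hours: the time available (for example, a nightly window).
    efficiency: fraction of ideal linear speed-up actually achieved
                (1.0 means perfect scaling, an optimistic assumption).
    """
    if window_hours <= 0 or not 0 < efficiency <= 1:
        raise ValueError("window must be positive and 0 < efficiency <= 1")
    return max(1, math.ceil(single_node_hours / (window_hours * efficiency)))


if __name__ == "__main__":
    # A job that takes 30 hours on one node cannot finish in a 24-hour day.
    print(nodes_needed(30, 24))
    # Fitting the same job into a 6-hour night with 80% scaling efficiency.
    print(nodes_needed(30, 6, efficiency=0.8))
```

Under these assumptions, a job that no longer fits its window is addressed by adding nodes rather than by replacing the server.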

38 CLOUDERA DISTRIBUTION HADOOP 4.1 Cloudera Enterprise Subscription Options: Cloudera Enterprise Core Cloudera Enterprise RTD (Real-Time Delivery) Cloudera Enterprise RTQ (Real-Time Query)

39 WHERE TO FROM HERE?
- Understand use cases
- Build a business case
- Design a solution
- Deploy Hadoop infrastructure
- Confirm data sources
- Use Hadoop to answer questions

40 CONTACT TRIFORCE Call View our Big Data Resources page at Follow us on LinkedIN