MySQL and Hadoop. Percona Live 2014 Chris Schneider
|
|
- Hubert Newman
- 8 years ago
- Views:
Transcription
1 MySQL and Hadoop Percona Live 2014 Chris Schneider
2 About Me Chris Schneider, Database Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for the past ~3 years chschneider@groupon.com
3 What we ll cover Apache Hadoop CDH Cloudera Distribution for Apache Hadoop Use cases for Hadoop Simple Map Reduce Overview Scoop Hive Impala Tungsten Replicator MySQL -> HDFS
4 What is Hadoop? An open-source framework for storing and processing data on a cluster of servers Based on Google s whitepapers of the Google File System (GFS) and MapReduce Scales linearly Designed for batch processing Optimized for streaming reads
5 Where do I start? Image by: Aryan Nava - Hadoop Ecosystem
6 Hadoop Distribution Cloudera Provides a distribution for Apache Hadoop Along with many other components in the Hadoop ecosystem What is a distribution Repositories Documentation Bug fixes Tested Cloudera Manager There are other companies who provide distributions
7 Why Hadoop Volume Use Hadoop when you cannot or should not use traditional RDBMS Image Source:
8 Why Hadoop Velocity Can ingest terabytes of data per day Image Source:
9 Why Hadoop Variety You can have structured or unstructured data Image Source:
10 Use cases for Hadoop Recommendation engine Netflix recommends movies Ad targeting, log processing, search optimization Glam Media, ebay, Orbitz Machine learning and classification Yahoo Mail s spam detection Financial Institutions Identity theft and credit risk Social Graph Facebook, Linkedin and eharmony connections
11 Some Details about Hadoop Two Main Pieces of Hadoop Hadoop Distributed File System (HDFS) Distributed and redundant data storage using many nodes Hardware will inevitably fail Read and process data with MapReduce Processing is sent to the data Many map tasks each work on a slice of the data Failed tasks are automatically restarted on another node/replica
12 Hadoop Ecosystem Management - Oozie (Workflow) - Chukwa (Monitoring) - Flume (Data Ingest) - ZooKeeper (Management) - Cloudera Manager (Management) Data Access - Hive (SQL) - Pig (Data Flow) - Mahout (Machine Learning) - Avro (RPC, Serialization) - Sqoop (RDBMS Connector) Data Processing - MapReduce - MRv2 or YARN Data Storage - HDFS (Distributed File System) - Hbase (Column Data Store)
13 Block Storage Replication Master Nodes Namenode(s) Jobtracker(s) Slave Nodes Datanode Tasktracker Datanode Tasktracker Datanode Tasktracker Blocks
14 Map is used for Searching 64, big data is totally cool and big Foreach word MAP Intermediate Output (on local disk): big, 1 data, 1 is, 1 totally, 1 cool, 1 and, 1 big, 1
15 Reduce is used to aggregate Hadoop aggregates the keys and calls a reduce for each unique key e.g. GROUP BY, ORDER BY reduce (key, list) sum the list output (key, sum) big, (1,1) data, (1) is, (1) totally, (1) cool, (1) and, (1) Reduce big, 2 data, 1 is, 1 totally, 1 cool, 1 and, 1
16 Map/Reduce CompA,2013,35.25 CompB,2013,5.25 CompA,2013,15.00 CompC,2013,25.00 MAP CompA,2013,50.25 CompB,2013,5.25 CompC,2013,25.00 Reduce CompA,2013,60.25 CompB,2013,41.00 CompC,2013,35.25 CompB,2014,20.75 CompC,2014,10.25 CompA,2014,10.00 CompB,2014,15.00 MAP CompA,2013,10.00 CompB,2013,35.75 CompC,2013,10.25
17 Why do MR jobs take so long? Source: Hadoop the Definitive Guide, Figure 6-4 Shuffle and Sort
18 Where does Hadoop fit in? Think of Hadoop as an augmentation of your traditional RDBMS system You want to store years of data or you just have a lot of data You need to aggregate all of the data over many years time You want ALL your data stored and accessible not forgotten or deleted You need this to be free software running on commodity hardware
19 Where does Hadoop fit in? http http http Tableau: Business Analytics Hive Impala Pig MySQL MySQL MySQL MySQL Oozie/ Sqoop/ETL Sqoop MySQL MySQL NameNode NameNode2 DataNode DataNode Hadoop (CDH4) DataNode DataNode JobTracker JobTracker DataNode DataNode DataNode DataNode
20 Where does Tungsten fit in? http http http Tableau: Business Analytics Hive Impala Pig MySQL MySQL MySQL MySQL MySQL MySQL NameNode NameNode2 Hadoop (CDH4) JobTracker JobTracker Tungsten 3.0 DataNode DataNode DataNode DataNode Single Path Loading DataNode DataNode DataNode DataNode
21 Simple Data Flow MySQL is used for OLTP data processing ETL process moves data from MySQL to Hadoop Oozie Sqoop Oozie Custom ETL Tungsten Replicator 3.0+ Use MapReduce to transform data, run batch analysis, join data, etc Export transformed results to OLAP or back to OLTP, for example, a dashboard of aggregated data or report You can also read this data out of Impala directly to save time in export out of HDFS
22 MySQL Data Capacity Depends, (TB)+ PB+ Data per query/ MR Read/Write Depends, (MB -> GB) Random read/ write Hadoop PB+ Sequential scans, Append-only Query Language SQL MapReduce, Scripted Streaming, HiveQL, Pig Latin Transactions Yes No Indexes Yes No Latency Sub-second Minutes to hours Data structure Relational Both structured and un-structured Enterprise and Community Support Yes Yes
23 About Sqoop Open Source and stands for SQL-to-Hadoop Parallel import and export between Hadoop and various RDBMS Default implementation is JDBC Optimized for MySQL but not for performance Integrated with connectors for Oracle, Netezza, Teradata, MicroStrategy, Tableau
24 Sqoop Features You can choose specific tables or columns to import with the --where flag Controlled parallelism Parallel mappers/connections (--num-mappers) Specify the column to split on (--split-by) Incremental loads with TIMESTAMP and AUTO INCREMENT Integration with Hive (HDFS) and Hbase (Column Store)
25 How Sqoop Import Works 1. Client calls a sqoop import 2. Sqoop submits a map only job to the Jobtracker 3. The Map tasks connect to the database server via JDBC and execute a select 1. Happens on the Tasktrackers 4. Mysqldump on the Datanodes will connect to MySQL (in this case) or other RDBMS systems 1. Select WHERE PK >=0 and PK < = N 2. Select WHERE PK >=N and PK < = N 3. Select WHERE PK >=N and PK < = N 4. Select WHERE PK >=N and PK < = N
26 Sqoop Data Into Hadoop $ sqoop import --connect jdbc:mysql://example.com/world \ --username <database_username> \ --tables City \ --fields-terminated-by \t \ --lines-terminated-by \n This command will submit a Hadoop job that queries your MySQL server and reads all the rows from world.city The resulting TSV file(s) will be stored in HDFS
27 Sqoop Export $ sqoop export --connect jdbc:mysql://example.com/world \ --username <database_username> \ --tables City \ --export-dir /path/in/hdfs The City table needs to exist within MySQL Default CSV formatted Can use staging table (--staging-table)
28 Sqoop Gotcha 1 You NEED to make sure to install Connector/J Latest version: Then you must place the.jar file the Sqoop lib directory: Default: /usr/lib/sqoop/lib
29 Sqoop Gotcha 2 attempt_ _0002_m_000000_0: log4j:warn Please initialize the log4j system properly.java.io.ioexception: Cannot run program "mysqldump": java.io.ioexception: error=2, No such file or directory at java.lang.processbuilder.start(processbuilder.java: 460) at java.lang.runtime.exec(runtime.java:593) at java.lang.runtime.exec(runtime.java:466)at com.cloudera.sqoop.mapreduce.mysqldumpma pper.map(mysqldumpmapper.java:396) at com.cloudera.sqoop.mapreduce.mysqldumpma pper.map(mysqldumpmapper.java:49)
30 Gotcha 3 Full table scan Try not to sqoop from a master MySQL instance Try not to sqoop from a slave actively taking reads May degrade performance as sqoop selected data page into memory while pushing out expected data
31 About Hive Offers a way around the complexities of MapReduce/JAVA Hive is an open-source project managed by the Apache Software Foundation Non-JAVA to be able to access data Language based on SQL (ANSI-SQL 92) Easy to lean and use Data is available to many more people
32 More About Hive Hive is NOT a replacement for RDBMS Not all SQL works Hive is only an interpreter that converts HiveQL to MapReduce more specifically procedural java code HiveQL queries can take many seconds or minutes to produce a result set Hive has a metastore within MySQL that contains details about table representation on HDFS
33 RDBMS vs Hive RDBMS Hive Language SQL Subset of SQL along with Hive extensions Transactions Yes No ACID Yes No Latency Updates? Sub-second (Indexed Data) Yes, INSERT [IGNORE], UPDATE, DELETE, REPLACE Many seconds to minutes (Non Index Data) INSERT OVERWRITE
34 Sqoop and Hive $ sqoop import --connect jdbc:mysql://example.com/world \ --tables City \ --hive-import Alternatively, you can create table(s) within the Hive CLI and run an hadoop fs -put with an exported CSV file on the local file system
35 Impala Brings processing to the data nodes and avoids network bottlenecks Uses a shared metastore Hue integration in CDH 4.2+ Based off of Google s Dremel
36 Impala High Level Image Source: cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
37 Advantages of Impala 100% Open Source. Local processing on data nodes help avoid network bottlenecks Saves time with minimal data movement A single, open, and unified metadata store can be utilized. Costly data format conversion is unnecessary and thus no overhead is incurred. All data is immediately query-able, with no delays for ETL. All hardware is utilized for Impala queries as well as for MapReduce. Only a single machine pool is needed to scale. ANSI-92 SQL supported with UDFs Supports common Hadoop file formats: text, SequenceFiles, Avro, RPCFile, LZO and Parquet
38 Impala Info Factors that help make Impala faster Hardware Configuration Complexity of the Query Availability of main memory Does it replace MapReduce or Hive? Limitations? All joins are done in a memory space no larger than the smallest node
39 Tungsten Replicator 3.0 MySQL to HDFS Why this is needed Sqoop issues Time, Stale Data, Performance issues Real Time data load Idempotent failover If you have multiple slaves and log_slave_updates active All row changes are stored
40 Current Scope 3.0 Release Extract from MySQL via binlog (RBR) Move data into current popular distributions, Cloudera, Hortonworks Initial provion with Sqoop or parallel extractor (currently Oracle only but MySQL coming soon Asynchronous replicated changes Data transformation into preferred HDFS formats Schema generation of HIVE tables Tools for generating of materialized views
41 High Level Overview MySQL RDBMS Master Replicator THL GTID + Metadata Binlogs Slave Replicator THL GTID + Metadata HDFS
42 Specific Overview MySQL Hadoop.js CSV Tungsten Master Replicator Tungsten Slave Replicator HIVE/HDFS Master Filtering: - Drop/Modify Columns (PII) - Fill in primary key - Fill in column names - Add DBMS source - Select subset of tables to replicate Slave Loader: - CSV files are created (1 per table) - 100K transactions or ~1 min data collected - Parallel push into HDFS
43 Process into HIVE Hadoop.js Load CSV into hadoop Staging table creation Contains all changed data along with Tungsten Specific row information Create Base Table Equivalent table to MySQL in HIVE Create Materialized View Check for table match between MySQL and HIVE
44 Automate the process $ git clone \ $ cd continuent-tools.hadoop $ bin/load-reduce-check \ -U jdbc:mysql:thin://sourcemysqlserver:3306/database \ -s database --verbose
45 DEMO hive -e drop table city Sqoop Import City table sqoop import --connect jdbc:mysql:// localhost.localdomain/world --username root -- create-hive-table --hive-table City --hive-import -- direct --table City --warehouse-dir /user/hive/ warehouse HiveQL Saved in Hue Impala Saved in Hue
46 References Hive Impala +Documentation Cloudera Tungsten Replicator 3.0 Download - index.html Documentation MySQL > HDFS - Tungsten Hadoop Tools
Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop for MySQL DBAs + 1 About me Sarah Sproehnle, Director of Educational Services @ Cloudera Spent 5 years at MySQL At Cloudera for the past 2 years sarah@cloudera.com 2 What is Hadoop? An open-source
More informationQsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationSession: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
More informationHOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP
HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP AGENDA Introduction What is Hadoop and the rationale behind it Hadoop Distributed File System (HDFS) and MapReduce Common Hadoop
More information<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationComplete Java Classes Hadoop Syllabus Contact No: 8888022204
1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationGoogle Bing Daytona Microsoft Research
Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large
More informationBig Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
More informationSOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationDeploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
More informationApache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah
Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated
More informationHadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationCertified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
More informationIntroduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationbrief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationCloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here
Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here JusIn Erickson Senior Product Manager, Cloudera Speaker Name or Subhead Goes Here May 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12 Agenda
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationSQL on NoSQL (and all of the data) With Apache Drill
SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of
More informationMySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering
MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationBig Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce
Big Data and Hadoop Module 1: Introduction to Big Data and Hadoop Learn about Big Data and the shortcomings of the prevailing solutions for Big Data issues. You will also get to know, how Hadoop eradicates
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationFrom Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian
From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationFrom Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten
From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten MC Brown, Director of Documentation Linas Virbalas, Senior Software Engineer. About Tungsten Replicator Open source drop-in
More informationBIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION
FACT SHEET BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION BIGDATA & HADOOP CLASS ROOM SESSION GreyCampus provides Classroom sessions for Big Data & Hadoop Developer Certification. This course will
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationHadoop and MySQL for Big Data
Hadoop and MySQL for Big Data Alexander Rubin October 9, 2013 About Me Alexander Rubin, Principal Consultant, Percona Working with MySQL for over 10 years Started at MySQL AB, Sun Microsystems, Oracle
More informationChase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
More informationHadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationUnlocking Hadoop for Your Rela4onal DB. Kathleen Ting @kate_ting Technical Account Manager, Cloudera Sqoop PMC Member BigData.
Unlocking Hadoop for Your Rela4onal DB Kathleen Ting @kate_ting Technical Account Manager, Cloudera Sqoop PMC Member BigData.be April 4, 2014 Who Am I? Started 3 yr ago as 1 st Cloudera Support Eng Now
More informationCommunicating with the Elephant in the Data Center
Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationIntegrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationImportant Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.
Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationPractical Hadoop by Example
Practical Hadoop by Example for relational database professioanals Alex Gorbachev 12-Mar-2013 New York, NY Alex Gorbachev Chief Technology Officer at Pythian Blogger OakTable Network member Oracle ACE
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationSelf-service BI for big data applications using Apache Drill
Self-service BI for big data applications using Apache Drill 2015 MapR Technologies 2015 MapR Technologies 1 Data Is Doubling Every Two Years Unstructured data will account for more than 80% of the data
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationCURSO: ADMINISTRADOR PARA APACHE HADOOP
CURSO: ADMINISTRADOR PARA APACHE HADOOP TEST DE EJEMPLO DEL EXÁMEN DE CERTIFICACIÓN www.formacionhadoop.com 1 Question: 1 A developer has submitted a long running MapReduce job with wrong data sets. You
More informationReplicating to everything
Replicating to everything Featuring Tungsten Replicator A Giuseppe Maxia, QA Architect Vmware About me Giuseppe Maxia, a.k.a. "The Data Charmer" QA Architect at VMware Previously at AB / Sun / 3 times
More informationLecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationLecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
More informationDominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationApache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com
Apache Sentry Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More information