An Introduction to Hadoop
Mohammad Reza Karimi Dastjerdi
Spring 2015

Table of Contents
Introduction
Problems with RDBMSs
What is Hadoop?
Who uses Hadoop?
Job Positions
History
Hadoop Distributions
Hadoop Ecosystem
HDFS
MapReduce
Map
Reduce
Example: Word Count
Hive
HBase
Pig
Mahout
ZooKeeper
Flume
Sqoop
How to Get Hadoop
Cloudera
Resources

Introduction
Increasing data:
  2011: 1.8 zettabytes
  2012: 2.8 zettabytes
  2020: 40 zettabytes (projected)
Social networks

Problems with RDBMSs
Inflexible schemas
Designed for structured data, but most data is semi-structured
Designed for steady data retention, but data volumes are growing rapidly

What is Hadoop?
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines.
It is designed to detect and handle failures at the application layer, rather than rely on hardware to deliver high availability.
Hadoop processes run in separate JVMs.

Who uses Hadoop?
Facebook
  A 1100-machine cluster with 8800 cores and about 12 PB raw storage
  A 300-machine cluster with 2400 cores and about 3 PB raw storage
Yahoo!
  More than 100,000 CPUs in more than 40,000 computers running Hadoop
  Biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
Spotify
  1300-node cluster: 15,600 physical cores, ~70 TB RAM, ~60 PB storage
eBay
  532-node cluster (8 * 532 cores, 5.3 PB)
Others: Amazon, Twitter, LinkedIn

Job Positions
Resource: http://www.indeed.com/jobtrends/hadoop.html

History
Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch.
Hadoop was inspired by the Google File System (GFS).
Hadoop, originally called the Nutch Distributed File System (NDFS), split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.
Yahoo! has been one of the significant driving forces behind Hadoop. In 2008 it announced that its web search engine index was being generated by a 10,000-core Hadoop cluster.

Hadoop Distributions
Open source: Apache Hadoop
Commercial: Cloudera, Hortonworks, MapR
Cloud-based: AWS, Windows Azure

Hadoop Ecosystem

HDFS
Hadoop Distributed File System
Stores files as big chunks (blocks) of data
Two deployment modes:
  Fully distributed: three replicas of each block by default, stored on different machines
  Pseudo-distributed: one replica, with all daemons running on a single machine
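
As a rough illustration of how block splitting and replication add up, the sketch below computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size is an assumption (the slide does not give one, and it is configurable per cluster), and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; configurable per cluster


def hdfs_footprint(file_size_mb, replication=3, block_size_mb=BLOCK_SIZE_MB):
    """Return (number_of_blocks, raw_storage_mb) for one file.

    The last block is not padded, so raw storage is simply the
    file size multiplied by the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


# Fully distributed: three replicas of every block.
print(hdfs_footprint(1000, replication=3))  # (8, 3000)
# Pseudo-distributed: a single replica.
print(hdfs_footprint(1000, replication=1))  # (8, 1000)
```

The same file needs three times the raw disk space in a fully distributed setup, which is why cluster sizes are usually quoted as raw storage.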

MapReduce
Programming paradigm
Created by Google
Motivating question: how to index data?
Two parts:
  Map
  Reduce

Map
Executes the Map() function on the data
Runs on each node
Outputs <key, value> pairs on each node

Reduce
Executes the Reduce() function on the data
Runs on some nodes
Aggregates sets of <key, value> pairs on some nodes
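
One way to picture how map output reaches "some nodes" for reduction: each key is assigned to a reducer by hashing it modulo the number of reducers, which is the idea behind Hadoop's default hash partitioner. The sketch below imitates that in plain Python. It uses zlib.crc32 as a stand-in hash (an assumption, not Hadoop's actual hash function), and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Group <key, value> pairs by reducer, then by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# Pairs as they might be emitted by mappers on different nodes.
pairs = [("hadoop", 1), ("hive", 1), ("hadoop", 1), ("pig", 1)]
for index, groups in enumerate(shuffle(pairs, num_reducers=2)):
    print(index, dict(groups))
```

Because the reducer is chosen from the key alone, every value for a given key ends up on the same reducer, no matter which node produced it.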

Example: Word Count
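
The word count example can be sketched in a few lines of plain Python. This is a single-process simulation of the map, shuffle, and reduce steps, not Hadoop code; the function names are made up for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a <word, 1> pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: sum all counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    # Map: run the mapper over every input line.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Shuffle: group values by key, as the framework would do.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    # Reduce: aggregate each group into a final count.
    return dict(reduce_counts(w, c) for w, c in grouped.items())


print(word_count(["Hadoop is big", "big data is big"]))
# {'hadoop': 1, 'is': 2, 'big': 3, 'data': 1}
```

On a real cluster the map calls would run in parallel on the nodes holding the input blocks, and the grouping would happen across the network rather than in one dictionary.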

Hive
SQL-like query language that generates MapReduce code
Developed at Facebook
Batch, not interactive
Good for processing a subset of the data
Can be used with HBase

HBase
Wide-column NoSQL database
Stores its tables on top of HDFS
Table definitions for Hive queries over HBase are kept in the Hive metastore database
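
To make "wide-column" concrete, the sketch below models a table as nested dictionaries: row key, then column family, then column qualifier, then value. It mirrors HBase's logical data model only loosely (it ignores cell versions and on-disk storage), and the table contents and helper names are made up for illustration.

```python
from collections import defaultdict


def new_table():
    """row key -> column family -> column qualifier -> value."""
    return defaultdict(lambda: defaultdict(dict))


def put(table, row, family, qualifier, value):
    table[row][family][qualifier] = value


def get(table, row, family, qualifier):
    """Return the cell value, or None if the cell was never written."""
    return table.get(row, {}).get(family, {}).get(qualifier)


users = new_table()
# Rows need not share the same columns: that is the "wide-column" part.
put(users, "user1", "info", "name", "Ada")
put(users, "user1", "info", "email", "ada@example.com")
put(users, "user2", "info", "name", "Linus")

print(get(users, "user1", "info", "email"))  # ada@example.com
print(get(users, "user2", "info", "email"))  # None
```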

Pig
ETL library for Hadoop
Generates MapReduce jobs
Developed at Yahoo!
Uses the Pig Latin language
Good for processing the whole data set

Mahout
Library for common machine learning algorithms
Many data-mining algorithms:
  Recommendation (Spotify)
  Classification (spam identification)
  Clustering (Google News)
Mahout is designed for Hadoop scale

ZooKeeper
Centralized service for Hadoop configuration information
Used where data synchronization matters
Distributed in-memory computation
Example: ad serving in online games

Flume
Library for working with log data
Uses streaming data flows
Data sources for Flume include:
  HTTP
  Twitter
Complex and powerful!

Sqoop
Command-line utility for transferring data between RDBMSs and Hadoop
Connectors for Oracle, SQL Server, and others
Sqoop1 has more features than Sqoop2!

How to Get Hadoop
https://hadoop.apache.org
www.cloudera.com
www.hortonworks.com
www.mapr.com
http://aws.amazon.com/
http://azure.microsoft.com/en-us/services/hdinsight/

Resources
Hadoop Fundamentals with Lynn Langit (Lynda.com)
Wikipedia
https://hadoop.apache.org/

Enjoy Hadooping!