An Introduction to Hadoop
MOHAMMAD REZA KARIMI DASTJERDI
SPRING 2015
Table of Contents
  - Introduction
  - Problems with RDBMSs
  - What is Hadoop?
  - Who uses Hadoop?
  - Job Positions
  - History
  - Hadoop Distributions
  - Hadoop Ecosystem
  - HDFS
  - MapReduce
  - Map
  - Reduce
  - Example: Word Count
  - Hive
  - HBase
  - Pig
  - Mahout
  - Zookeeper
  - Flume
  - Sqoop
  - How to Get Hadoop
  - Cloudera
  - Resources
Introduction
  - Increasing data
    - 2011: 1.8 zettabytes
    - 2012: 2.8 zettabytes
    - 2020: 40 zettabytes (projected)
  - Social networks
Problems with RDBMSs
  - Inflexible schemas
  - Designed for structured data
    - But most data is semi-structured
  - Designed for steady data retention
    - But we need to handle rapid growth
What is Hadoop?
  - A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  - It is designed to scale up from single servers to thousands of machines.
  - It is designed to detect and handle failures at the application layer, rather than rely on hardware to deliver high availability.
  - Hadoop processes run in separate JVMs.
Who uses Hadoop?
  - Facebook
    - A 1100-machine cluster with 8800 cores and about 12 PB raw storage
    - A 300-machine cluster with 2400 cores and about 3 PB raw storage
  - Yahoo
    - More than 100,000 CPUs in more than 40,000 computers running Hadoop
    - Biggest cluster: 4500 nodes (2*4 CPU boxes with 4*1 TB disks and 16 GB RAM)
  - Spotify
    - 1300-node cluster: 15,600 physical cores, ~70 TB RAM, ~60 PB storage
  - eBay
    - 532-node cluster (8 * 532 cores, 5.3 PB)
  - Others: Amazon, Twitter, LinkedIn
Job Positions
  - Source: http://www.indeed.com/jobtrends/hadoop.html
History
  - Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch.
  - Hadoop was inspired by the Google File System (GFS).
  - Originally called the Nutch Distributed File System (NDFS), it split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.
  - Yahoo! has been one of the significant driving forces behind Hadoop. In 2008 it announced that its web search engine index was being generated by a 10,000-core Hadoop cluster.
Hadoop Distributions
  - Open source: Apache Hadoop
  - Commercial: Cloudera, Hortonworks, MapR
  - Cloud-based: AWS, Windows Azure
Hadoop Ecosystem
HDFS
  - Hadoop Distributed File System
  - Stores data in big chunks (blocks)
  - Two deployment modes:
    - Fully distributed: three replicas of each block by default, on different machines
    - Pseudo-distributed: one replica, with all daemons on a single machine
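The chunking and replication idea on this slide can be sketched in plain Python. This is only an illustration, not how HDFS is implemented: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for the example (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign every block to `replication` distinct nodes.

    Assumption: simple round-robin placement, chosen for readability only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with a 4-byte block size gives blocks of 4, 4 and 2 bytes.
blocks = split_into_blocks(b"x" * 10, block_size=4)

# Distributed setup: three replicas of each block on different (hypothetical) nodes.
distributed = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

# Pseudo-distributed setup: a single machine, so only one replica is possible.
pseudo = place_replicas(len(blocks), ["localhost"], replication=1)

print(distributed)
print(pseudo)
```

With three replicas, losing any single node still leaves two copies of every block, which is why the distributed setup tolerates hardware failure and the single-machine setup does not.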
MapReduce
  - Programming paradigm
  - Created by Google
  - How to index data?
  - Two parts:
    - Map
    - Reduce
Map
  - Executes the Map() function on the data
  - Runs on each node
  - Outputs <key, value> pairs on each node
Reduce
  - Executes the Reduce() function on the data
  - Runs on some nodes
  - Aggregates sets of <key, value> pairs on some nodes
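The two phases can be imitated in a single Python process to show how data flows from Map() to Reduce(). This is a minimal sketch, not Hadoop's API: `run_mapreduce`, `map_reading`, `reduce_max` and the sample readings are hypothetical names and made-up data, and the grouping step in the middle stands in for the shuffle that Hadoop performs between the phases.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in memory on a list of records."""
    # Map phase: every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle stand-in: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate the values of each key into one result.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Made-up input lines of the form "year,temperature".
readings = ["2013,21", "2014,25", "2013,30", "2014,19"]


def map_reading(line):
    """Emit one (year, temperature) pair per input line."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_max(year, temps):
    """Keep the highest temperature seen for a year."""
    return max(temps)


result = run_mapreduce(readings, map_reading, reduce_max)
print(result)  # {'2013': 30, '2014': 25}
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and only the grouped pairs travel over the network to the reducers.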
Example
  - Word Count
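Word count could be written as follows in the style of Hadoop Streaming, where the mapper and the reducer exchange tab-separated key/value lines and the reducer receives its input sorted by key. This is a sketch under assumptions: the sample sentences are invented, and calling `sorted()` locally stands in for the sort and shuffle that Hadoop would do between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; input must already be sorted by word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Invented sample input.
text = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = sorted(mapper(text))   # local stand-in for Hadoop's sort/shuffle
counts = dict(reducer(mapped))
print(counts)
```

Because the reducer only ever looks at consecutive lines with the same key, it needs memory for one word at a time, which is what lets the same logic scale to inputs far larger than a single machine's RAM.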
Hive
  - SQL-like query language that generates MapReduce code
  - Developed at Facebook
  - Batch, not interactive
  - Good for processing part of the data
  - Used with HBase
HBase
  - Wide-column NoSQL database
  - Creates tables over HDFS data
  - Manages the metastore database
Pig
  - ETL library for Hadoop
  - Generates MapReduce jobs
  - Developed at Yahoo!
  - Uses the Pig Latin language
  - Good for processing all of the data
Mahout
  - Library for common machine learning algorithms
  - Many data-mining algorithms:
    - Recommendation (Spotify)
    - Classification (spam identification)
    - Clustering (Google News)
  - Designed for Hadoop scale
Zookeeper
  - Centralized service for Hadoop configuration information
  - Used where data synchronization matters
  - Distributed in-memory computation
  - Example: ad serving in an online game
Flume
  - Library for working with log data
  - Uses streaming data flows
  - Data sinks for Flume:
    - HTTP
    - Twitter
  - Complex and powerful!
Sqoop
  - Command-line utility for transferring data between RDBMSs and Hadoop
  - Connectors for Oracle, SQL Server and others
  - Sqoop1 has more features than Sqoop2!
How to Get Hadoop
  - https://hadoop.apache.org
  - www.cloudera.com
  - www.hortonworks.com
  - www.mapr.com
  - http://aws.amazon.com/
  - http://azure.microsoft.com/en-us/services/hdinsight/
Resources
  - Hadoop Fundamentals with Lynn Langit, Lynda.com
  - Wikipedia
  - https://hadoop.apache.org/
Enjoy Hadooping!