Lecture 32: Big Data

1. The Big Data problem
2. Why the excitement about big data
3. What is MapReduce
4. What is Hadoop
5. Get started with Hadoop
Big Data Problems

Data explosion:
- Data from users on social networks
- Data from mobile devices
- Data from the Internet of Things

What is big data? Data in large quantities (terabytes), captured at a rapid rate, structured or unstructured, stored or held on various machines and locations, or some combination of the above.

Problems? It is difficult or expensive to capture, store, manage, process, and mine such data using traditional methods.
Why All the Excitement

Several factors contribute to the hype around Big Data:
- The challenges of the problems themselves
- The appearance of cost-effective, practical solutions
- Expectations around the Internet of Things
- Bringing compute and storage together on commodity hardware, i.e. cloud computing
- Price/performance: the Hadoop big data technology provides significant cost savings together with significant performance improvements
- Linear scalability: every parallel technology makes claims about scale-up
- Full access to unstructured data: a highly scalable data store with a good parallel programming model (MapReduce) had been a challenge for the industry for some time, until MapReduce systems like Hadoop appeared
MapReduce

A programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of two core procedures: Map() and Reduce()
- Map() performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
- Reduce() performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

A "MapReduce system" (also called "infrastructure" or "framework") runs the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing redundancy and fault tolerance.

A well-established open-source MapReduce system is Apache Hadoop.
Solving the Word Count Example with MapReduce

Word count problem:
- Input: several text files, or one big file
- Output: the words and their frequencies

E.g.
  file00: Hello World Bye World
  file01: Hello Hadoop Goodbye Hadoop

MapReduce scheme (coordinated by the master):
  Map() on file00 emits:  <Bye, 1> <Hello, 1> <World, 2>
  Map() on file01 emits:  <Goodbye, 1> <Hadoop, 2> <Hello, 1>
  Reduce() merges into:   <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
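The scheme above can be simulated in a few lines of ordinary Python; this is only an in-process sketch of the data flow (function names are ours, not Hadoop's), with the map phase pre-aggregating per file, like a combiner, to match the per-file counts shown on the slide:

```python
from collections import defaultdict

def map_phase(text):
    """Map(): count words in one input file and emit sorted <word, count> pairs."""
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return sorted(counts.items())

def reduce_phase(mapped_outputs):
    """Reduce(): merge the intermediate pairs from all mappers, summing per word."""
    totals = defaultdict(int)
    for pairs in mapped_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(sorted(totals.items()))

# The two input files from the slide
files = {
    "file00": "Hello World Bye World",
    "file01": "Hello Hadoop Goodbye Hadoop",
}

mapped = [map_phase(text) for text in files.values()]
print(reduce_phase(mapped))
# {'Bye': 1, 'Goodbye': 1, 'Hadoop': 2, 'Hello': 2, 'World': 2}
```

In a real cluster each `map_phase` call would run on a different node against its own input chunk, and the framework, not our list comprehension, would route the intermediate pairs to the reducers.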
Hadoop architecture
MapReduce Layer

- JobTracker manages MapReduce jobs: it hands out tasks to the slave nodes, schedules tasks, monitors them, and re-executes failed tasks. There is exactly one JobTracker in each cluster.
- TaskTracker is a slave that carries out map and reduce tasks, usually co-located with a DataNode.
HDFS Layer

- NameNode manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
- DataNode holds application input/output data files and map and reduce programs.
- Client is an application launcher that creates a MapReduce job with the provided application-specific input data files and map and reduce programs.

To launch an application from a client program, Hadoop splits the data file into input chunks and assigns the chunks and the programs to DataNodes.
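The splitting and placement step can be illustrated with a toy sketch. This is not the HDFS API, just an assumption-laden model: real HDFS blocks default to tens of megabytes and placement is rack-aware, while here we use tiny chunks and round-robin assignment:

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Split a file's contents into fixed-size chunks, the way HDFS
    divides a file into blocks (real blocks are 64/128 MB by default)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign_to_datanodes(chunks, datanodes, replication=3):
    """Assign each chunk to `replication` DataNodes round-robin.
    Real HDFS placement is rack-aware; this is only an illustration."""
    placement = {}
    for idx, _ in enumerate(chunks):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(min(replication, len(datanodes)))]
    return placement

chunks = split_into_chunks(b"x" * 10, chunk_size=4)
print([len(c) for c in chunks])                      # [4, 4, 2]
print(assign_to_datanodes(chunks, ["dn1", "dn2", "dn3"]))
```

The point of replication is the fault tolerance mentioned earlier: if one DataNode fails, every chunk it held still exists on other nodes.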
Hadoop ecosystem
Components in the Hadoop Ecosystem

The Apache Hadoop project has two core components:
1. the file store, called the Hadoop Distributed File System (HDFS)
2. the programming framework, called MapReduce

Other components:
1. Hadoop Streaming: a utility that enables MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.
2. Hive and Hue: Hive converts SQL into MapReduce jobs. Hue gives a browser-based graphical interface for doing Hive work.
3. Pig: a higher-level programming environment for MapReduce coding. The Pig language is called Pig Latin.
4. Sqoop: provides bi-directional data transfer between Hadoop and relational databases.
5. Oozie: manages Hadoop workflows.
6. HBase: a super-scalable key-value store. It works much like a persistent hash map.
7. FlumeNG: a real-time loader for streaming data into Hadoop.
8. Whirr: cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
9. Mahout: machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
10. FUSE: makes HDFS look like a regular file system, so you can use ls, rm, cd, and other commands on HDFS data.
11. ZooKeeper: used to manage synchronization for the cluster.
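Hadoop Streaming (item 1 above) runs any program that reads records from stdin and writes them to stdout, so the word-count mapper and reducer can be plain Python. The sketch below follows the Streaming conventions (tab-separated "word<TAB>1" records; the reducer assumes its input arrives sorted by key, which the framework's shuffle guarantees); the function names and the local pipeline at the bottom are ours, for demonstration only:

```python
def mapper(lines):
    """Streaming-style mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input is sorted by key, so equal words are
    adjacent; sum the counts over each run of identical words."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Chain the phases locally the way Streaming chains processes
# (conceptually: cat inputs | mapper | sort | reducer):
intermediate = sorted(mapper(["Hello World Bye World",
                              "Hello Hadoop Goodbye Hadoop"]))
for record in reducer(intermediate):
    print(record)
```

On a real cluster the two functions would live in separate scripts reading `sys.stdin`, passed to the streaming jar as the mapper and reducer programs; the framework supplies the sort between them.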
Get Started with Hadoop

Cloud computing and big data lab

Lab tasks:
1. Create a private cloud with Ubuntu virtual machines
2. Install and test Hadoop

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
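Once Hadoop is installed, testing it typically follows the pattern in the tutorial linked above: copy inputs into HDFS, run a job jar, and read the output back. The paths and jar name below are placeholders; substitute the ones from your own build and the tutorial's WordCount class:

```shell
# Copy local input files into HDFS (paths are examples, not fixed names)
bin/hadoop fs -mkdir /user/lab/wordcount/input
bin/hadoop fs -put file00 file01 /user/lab/wordcount/input

# Run the compiled WordCount job (jar/class names follow the tutorial)
bin/hadoop jar wordcount.jar org.myorg.WordCount \
    /user/lab/wordcount/input /user/lab/wordcount/output

# Inspect the word counts written by the reducers
bin/hadoop fs -cat /user/lab/wordcount/output/part-00000
```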
Hadoop for the Business

- For executives: Hadoop is an Apache open-source software project for getting value from the incredible volume/velocity/variety of data about your organization. Use the data instead of throwing most of it away.
- For technical managers: an open-source suite of software that mines the structured and unstructured Big Data about your company. It integrates with your existing Business Intelligence ecosystem.
- Legal: an open-source suite of software that is packaged and supported by multiple suppliers. Please see the Resources section regarding IP indemnification.
- Engineering: a massively parallel, shared-nothing, Java-based MapReduce execution environment. Think hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
- Security: a Kerberos-secured software suite.