+ BIG DATA USING HADOOP
Breakaway Session by Johnson Iyilade, Ph.D., University of Saskatchewan, Canada, 23 July 2015
+ Outline
- Framing the Problem Hadoop Solves
- Meet Hadoop
- Storage with HDFS
- Data Processing using MapReduce
- Hortonworks Data Platform
+ Big Data Architecture (diagram): Data Analytics, Data Processing, and Data Storage layers, with the FOCUS marked on the diagram
+ Framing the Problem Hadoop Solves
- We are in the days of Big Data: everything around us generates data, directly or indirectly
- The bad news: we are struggling to store, process, and analyze it
- THE PROBLEM: even though the storage capacities of hard drives have increased massively over the years, ACCESS SPEEDS have not kept up
- ACCESS SPEED = the rate at which data can be read from a drive
+ Framing the Problem
- Hard drive capacity is growing, the volume of online data is increasing, and processor speed/performance is improving, but the problem is READ/WRITE ACCESS to the DATA
- Moving data onto and off the disk is the major bottleneck
- Parallel data access is essential to meeting the challenge of BIG DATA (parallel => portions of the data are accessed at the same time)
- The way to reduce read time is to read from multiple disks at once
+ Framing the Problem
- Challenges of reading/writing data in parallel to and from many disks:
- Hardware failure: with many disks, the chance that one of them will fail is high, so the data must be replicated
- Combining data: results are scattered across multiple disks and have to be combined in some way, which is a key challenge
+ Meet HADOOP
- HADOOP is a FRAMEWORK of OPEN SOURCE tools, libraries, and methodologies for BIG DATA ANALYSIS
+ Main HADOOP Characteristics
- Open source (Apache License)
- Can handle large unstructured data sets (petabytes)
- Simple programming model, running on GNU/Linux
- Scalable from a single server to thousands of machines
- Runs on commodity hardware and in the cloud
- Application-level fault tolerance
- Multiple tools and libraries integrated
+ Brief HADOOP History
- Developed by Doug Cutting and Michael J. Cafarella
- Based on Google's MapReduce technology
- Designed to handle large amounts of data and to be robust
- Donated to the Apache Software Foundation in 2006 by Yahoo
+ Main Areas where HADOOP is Used
- Social media, e.g. Facebook, Twitter
- Retail, e.g. Alibaba, Amazon
- Financial services
- Web search and recommendation
- Government
- Everywhere else where there are large amounts of unstructured data to be stored and processed
+ Prominent Hadoop Users
+ CORE COMPONENTS OF HADOOP
- HDFS: Hadoop Distributed File System (storage)
- MapReduce: simple programming model for data processing
+ HDFS: Hadoop Distributed File System
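To make the HDFS section concrete, below is a minimal sketch, not taken from the slides, of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:8020), the path /user/demo/hello.txt, and the class name HdfsExample are placeholder assumptions you would adapt to your own cluster or sandbox.

    // Minimal HDFS sketch: write a small file into HDFS and read it back.
    // All addresses and paths below are placeholders for illustration only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:8020"); // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");      // hypothetical HDFS path

            // Write a small file into HDFS (overwrite if it exists)
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the file back and print its contents
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }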
+ MAPREDUCE
+ MAP REDUCE PROGRAMMING MODEL
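As a concrete sketch of the programming model, here is the classic word-count example in Java (my own illustrative version, not code from the slides): a Mapper that emits (word, 1) pairs and a Reducer that sums the counts for each word. Class and variable names are illustrative only; in practice these are often written as static nested classes of the driver class.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: (K1 = byte offset, V1 = line of text) -> list of (K2 = word, V2 = 1)
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reducer: (K2 = word, list(V2) = counts) -> (K3 = word, V3 = total count)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }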
+ MAPREDUCE ILLUSTRATION: Word Count
- map: (K1, V1) -> list(K2, V2)
- reduce: (K2, list(V2)) -> list(K3, V3)
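To run the Mapper and Reducer sketched above, a small driver class configures and submits the Job; again a hedged sketch with hypothetical input/output paths rather than code from the deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input directory, args[1] = HDFS output directory (must not exist yet)
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // optional local aggregation before the shuffle
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, such a job is typically launched with the hadoop jar command against input already loaded into HDFS. In terms of the slide's notation, (K1, V1) is the byte offset and line of input, (K2, V2) is (word, 1), and (K3, V3) is (word, total count).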
+ HADOOP PLATFORM BEYOND THE CORE
+ HADOOP PLATFORM DISTRIBUTIONS
+ HORTONWORKS DATA PLATFORM (HDP): visit Hortonworks.com for downloads and tutorials
+ FULL STACK OF TOOLS AND TECHNOLOGIES FOR ENTERPRISE BIG DATA ANALYTICS
+ GETTING STARTED: HDP SANDBOX
- The easiest way to get started with BIG DATA and HADOOP is through the SANDBOX, a virtual machine that allows you to work with HDP on localhost
- Download it free from HORTONWORKS.COM
+ HADOOP HANDS-ON SESSION WITH HDP SANDBOX
+ Outline
- Installation of the Oracle VirtualBox environment
- Installation of the HDP Sandbox
- Tour of the HDP Sandbox web interface
- Setting up Eclipse for Hadoop
- A simple MapReduce application in Eclipse
- Loading data into the HDP Sandbox
- Running the simple MapReduce job in the Sandbox
+ 1. Install Oracle VirtualBox
- Download and configure a virtual machine on your PC using the instructions at https://www.virtualbox.org/
- Note: you can also use other virtual machine software such as VMware, but for this hands-on I am using VirtualBox
- Note: you need enough RAM (a minimum of about 8 GB) to get the best results
+ 2. Install the HDP Sandbox
- Download and configure the latest version of the Hortonworks Data Platform (HDP) Sandbox on your PC using the instructions at http://hortonworks.com/products/hortonworks-sandbox/#install
- Note: I am using the HDP Sandbox; other providers such as Cloudera offer Hadoop distributions that can alternatively be configured
- Note: you need enough RAM (a minimum of about 8 GB) to get the best results
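For the "Loading data into the HDP Sandbox" step of the hands-on outline, copying local files into the sandbox's HDFS can also be done from Java. The sketch below assumes the sandbox's NameNode is reachable at 127.0.0.1:8020 (a common port-forwarding setup, but an assumption here) and uses hypothetical file paths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadIntoSandbox {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed sandbox NameNode address; check your sandbox's port-forwarding settings
            conf.set("fs.defaultFS", "hdfs://127.0.0.1:8020");
            FileSystem fs = FileSystem.get(conf);
            // Copy a local sample file into an HDFS input directory (both paths are hypothetical)
            fs.copyFromLocalFile(new Path("data/sample.txt"), new Path("/user/demo/input/sample.txt"));
            fs.close();
        }
    }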
+ Note
- This slide deck is not complete; I will send the concluding parts by email
- For details, contact me: johnson.iyilade@glomacssolutions.com