+ BIG DATA USING HADOOP
Breakaway Session by Johnson Iyilade, Ph.D., University of Saskatchewan, Canada, 23 July 2015
+ Outline
- Framing the Problem Hadoop Solves
- Meet Hadoop
- Storage with HDFS
- Data Processing using MapReduce
- Hortonworks Data Platform
+ Big Data Architecture (diagram): Data Analytics, Data Processing, and Data Storage layers, with the FOCUS marked on the diagram
+ Framing the Problem Hadoop Solves
- We are in the days of Big Data: everything around us generates data, directly or indirectly
- The bad news: we are struggling to store, process, and analyze it
- THE PROBLEM: even though the storage capacities of hard drives have increased massively over the years, ACCESS SPEEDS have not kept up
- ACCESS SPEED = the rate at which data can be read from a drive
+ Framing the Problem
- Hard drive capacity is growing, the volume of online data is increasing, and processor speed/performance is improving, but the problem is READ/WRITE ACCESS to the DATA
- Moving data onto and off the disk is the major bottleneck
- Parallel data access is essential to meeting the challenge of BIG DATA (parallel => portions of the data are accessed at the same time)
- The way to reduce read time is to read from multiple disks at once
+ Framing the Problem
- Challenges of reading/writing data in parallel to and from many disks:
- Hardware failure: with many disks, the chance that one of them will fail is high, so the data must be replicated
- Combining data: results are scattered across multiple disks and have to be combined in some way, which is a key challenge
+ Meet HADOOP
- HADOOP is a FRAMEWORK of OPEN SOURCE tools, libraries, and methodologies for BIG DATA ANALYSIS
+ Main HADOOP Characteristics
- Open source (Apache License)
- Can handle large unstructured data sets (petabytes)
- Simple programming model, running on GNU/Linux
- Scalable from a single server to thousands of machines
- Runs on commodity hardware and in the cloud
- Application-level fault tolerance
- Multiple tools and libraries integrated
+ Brief HADOOP History
- Developed by Doug Cutting and Michael J. Cafarella
- Based on Google's MapReduce technology
- Designed to handle large amounts of data and to be robust
- Donated to the Apache Software Foundation in 2006 by Yahoo
+ Main Areas where HADOOP is Used
- Social media, e.g. Facebook, Twitter
- Retail, e.g. Alibaba, Amazon
- Financial services
- Web search and recommendation
- Government
- Everywhere else where there are large amounts of unstructured data to be stored and processed
+ Prominent Hadoop Users
+ CORE COMPONENTS OF HADOOP
- HDFS: Hadoop Distributed File System (storage)
- MapReduce: simple programming model for data processing
+ HDFS: Hadoop Distributed File System
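To make the HDFS section concrete, below is a minimal sketch, not taken from the slides, of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:8020), the path /user/demo/hello.txt, and the class name HdfsExample are placeholder assumptions you would adapt to your own cluster or sandbox.

    // Minimal HDFS sketch: write a small file into HDFS and read it back.
    // All addresses and paths below are placeholders for illustration only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:8020"); // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");      // hypothetical HDFS path

            // Write a small file into HDFS (overwrite if it exists)
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the file back and print its contents
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }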
+ MAPREDUCE
+ MAP REDUCE PROGRAMMING MODEL
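As a concrete sketch of the programming model, here is the classic word-count example in Java (my own illustrative version, not code from the slides): a Mapper that emits (word, 1) pairs and a Reducer that sums the counts for each word. Class and variable names are illustrative only; in practice these are often written as static nested classes of the driver class.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: (K1 = byte offset, V1 = line of text) -> list of (K2 = word, V2 = 1)
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reducer: (K2 = word, list(V2) = counts) -> (K3 = word, V3 = total count)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }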
+ MAPREDUCE ILLUSTRATION: Word Count
- map: (K1, V1) -> list(K2, V2)
- reduce: (K2, list(V2)) -> list(K3, V3)
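To run the Mapper and Reducer sketched above, a small driver class configures and submits the Job; again a hedged sketch with hypothetical input/output paths rather than code from the deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input directory, args[1] = HDFS output directory (must not exist yet)
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // optional local aggregation before the shuffle
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, such a job is typically launched with the hadoop jar command against input already loaded into HDFS. In terms of the slide's notation, (K1, V1) is the byte offset and line of input, (K2, V2) is (word, 1), and (K3, V3) is (word, total count).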
+ HADOOP PLATFORM BEYOND THE CORE
+ HADOOP PLATFORM DISTRIBUTIONS
+ HORTONWORKS DATA PLATFORM (HDP): visit Hortonworks.com for downloads and tutorials
+ FULL STACK OF TOOLS AND TECHNOLOGIES FOR ENTERPRISE BIG DATA ANALYTICS
+ GETTING STARTED: HDP SANDBOX
- The easiest way to get started with BIG DATA and HADOOP is through the SANDBOX, a virtual machine that allows you to work with HDP on localhost
- Download it free from HORTONWORKS.COM
+ HADOOP HANDS-ON SESSION WITH HDP SANDBOX
+ Outline
- Installation of the Oracle VirtualBox environment
- Installation of the HDP Sandbox
- Tour of the HDP Sandbox web interface
- Setting up Eclipse for Hadoop
- A simple MapReduce application in Eclipse
- Loading data into the HDP Sandbox
- Running the simple MapReduce job in the Sandbox
+ 1. Install Oracle VirtualBox
- Download and configure a virtual machine on your PC using the instructions at https://www.virtualbox.org/
- Note: you can also use other virtual machine software such as VMware, but for this hands-on I am using VirtualBox
- Note: you need enough RAM (a minimum of about 8 GB) to get the best results
+ 2. Install the HDP Sandbox
- Download and configure the latest version of the Hortonworks Data Platform (HDP) Sandbox on your PC using the instructions at http://hortonworks.com/products/hortonworks-sandbox/#install
- Note: I am using the HDP Sandbox; other providers such as Cloudera offer Hadoop distributions that can alternatively be configured
- Note: you need enough RAM (a minimum of about 8 GB) to get the best results
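For the "Loading data into the HDP Sandbox" step of the hands-on outline, copying local files into the sandbox's HDFS can also be done from Java. The sketch below assumes the sandbox's NameNode is reachable at 127.0.0.1:8020 (a common port-forwarding setup, but an assumption here) and uses hypothetical file paths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadIntoSandbox {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed sandbox NameNode address; check your sandbox's port-forwarding settings
            conf.set("fs.defaultFS", "hdfs://127.0.0.1:8020");
            FileSystem fs = FileSystem.get(conf);
            // Copy a local sample file into an HDFS input directory (both paths are hypothetical)
            fs.copyFromLocalFile(new Path("data/sample.txt"), new Path("/user/demo/input/sample.txt"));
            fs.close();
        }
    }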
+ Note
- This slide deck is not complete; I will send the concluding parts by email
- For details, contact me: johnson.iyilade@glomacssolutions.com