BIG DATA TECHNOLOGY. Hadoop Ecosystem

Agenda
- Background
- What is Big Data
- Solution Objective
- Introduction to Hadoop
- Hadoop Ecosystem
- Hybrid EDW Model
- Predictive Analysis using Hadoop
- Conclusion

What is Big Data?
"Data everywhere but nowhere" (Dr. Michio Kaku)

Data Everywhere

Data growth
- An estimated 90% of the world's data has been created over the past two years.
- Data is doubling every two years, and global annual data creation is set to leap from 1.2 zettabytes in 2012 to 35 zettabytes in 2020.
- Every day, we create 2.5 quintillion bytes of data.
- Unstructured information is growing at 15 times the rate of structured information.
- Operational data is extremely small compared to other data sources.
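The growth figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide and assumes decimal (SI) units, where 1 zettabyte is 10^21 bytes. The variable names are illustrative.

```python
import math

ZETTABYTE = 10**21  # bytes, decimal (SI) definition (assumption)
EXABYTE = 10**18    # bytes

# Figures quoted on the slide.
start_zb, end_zb = 1.2, 35.0
start_year, end_year = 2012, 2020
daily_bytes = 2.5e18  # 2.5 quintillion bytes per day

# How many doublings does the quoted growth imply, and how long does each take?
years = end_year - start_year
growth_factor = end_zb / start_zb
doublings = math.log2(growth_factor)
doubling_time_years = years / doublings

# Convert the daily figure into larger units.
daily_exabytes = daily_bytes / EXABYTE
yearly_zettabytes = daily_bytes * 365 / ZETTABYTE

print(f"growth factor over {years} years: {growth_factor:.1f}x")
print(f"implied doubling time: {doubling_time_years:.2f} years")
print(f"daily creation: {daily_exabytes:.1f} EB, or {yearly_zettabytes:.2f} ZB per year")
```

Under these assumptions, the 2012 and 2020 figures imply a doubling time somewhat shorter than two years, so the quoted numbers are best read as order-of-magnitude estimates.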

Definition of Big Data
- Volume
  - Processing many terabytes to petabytes of data
  - Data arrives in large bursts
  - Sift through the noise to identify the right data to improve business insight
- Velocity
  - Analyze more data in less time to facilitate faster and more responsive business decision making
  - New data acquisition and very rapid creation of data
  - Batch, near-time and real-time data feeds
- Variety
  - Data is in many formats, including unstructured, semi-structured, complex documents and rich media
  - Data format is constantly changing
  - Changing data context

Every Day Examples of Big Data

Category     | Data                              | Big Data
Descriptive  | Age, Gender, Income, Demographics | Attitudes, Psychographics
Social       | User Defined                      | Influence, Peers
Location     | Home Address                      | Real Time
Interaction  | Who is available next             | Who is best to serve the personality of the consumer
Relationship | Transactional                     | Patterns, experience, internal and external data

Big Data Solution Objectives
- Enable scalable, accurate and powerful analysis.
- Process data fast and cost-effectively.
- Connect high-volume and volatile data to enable organizations to take effective business decisions.
- Plan future success using insights from big data to increase the value of predictive analytics.
- Data mining can be done using a variety of techniques.
- Build the ability to experiment, discover and rationalize.

Processing Challenge
- Storage capacities of hard drives have increased, but transfer rates have not kept up.
- Hardware failure.
- Most analysis tasks need to be able to combine the data in some way.
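To make the transfer-rate bottleneck concrete, here is a minimal sketch of the arithmetic. The drive size (1 TB), the sustained transfer rate (100 MB/s) and the number of disks (100) are illustrative assumptions, not figures from the slides.

```python
TB = 10**12  # bytes (decimal units, an assumption)
MB = 10**6   # bytes


def read_time_seconds(data_bytes, transfer_rate_bytes_per_s, num_disks=1):
    """Time to read data that is split evenly over disks read in parallel."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return data_bytes / num_disks / transfer_rate_bytes_per_s


# Hypothetical drive: 1 TB capacity, 100 MB/s sustained transfer rate.
single_disk = read_time_seconds(1 * TB, 100 * MB)
hundred_disks = read_time_seconds(1 * TB, 100 * MB, num_disks=100)

print(f"one disk:  {single_disk / 3600:.2f} hours")
print(f"100 disks: {hundred_disks / 60:.2f} minutes")
```

Spreading the same data over many disks cuts the read time roughly in proportion to the number of disks. It also multiplies the number of components that can fail and leaves the partial results to be combined, which are the other two challenges on the slide.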

Why Hadoop
- The ability to read and write data in parallel to or from multiple disks
- Enables applications to work with thousands of nodes and petabytes of data with automatic failover
- A reliable shared storage and analysis system
- Open source architecture for large-scale computation and data processing on a network of commodity hardware
- Ability to work with a variety of data mining capabilities
- Allows innovation and brings top talent to the core

Hadoop Concept
- Distribute the data in its original form
- Process the data where it is stored
- Combine the results from different nodes
- HDFS: distributed file system, scalable to accommodate any size of data, tolerant of failures due to built-in replication and regeneration
- MapReduce: processes multiple data sources into structured data (map), performs optional aggregation on results (reduce)
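The map and reduce steps above can be sketched in plain Python as a word count. This is a single-process illustration of the programming model only and does not use the Hadoop API. The function names and the sample input splits are hypothetical.

```python
from collections import defaultdict


def map_phase(split):
    """Map: turn one input split into (key, value) pairs."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(splits):
    # On a real cluster each split would be mapped on the node that stores it.
    mapped = (pair for split in splits for pair in map_phase(split))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already divided into three splits.
splits = ["big data big insight", "data everywhere", "big cluster"]
counts = word_count(splits)
print(counts)
```

In Hadoop the map calls run in parallel next to the data and the framework performs the shuffle over the network. Here everything runs in one process to keep the three phases visible.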

History of Hadoop
Created by Doug Cutting
- 2002: Apache Nutch, open source web search engine
- 2003: Google publishes a paper describing the architecture of their distributed filesystem, GFS
- 2004: Nutch Distributed Filesystem (NDFS)
- 2004: Google publishes a paper on MapReduce
- 2005: Nutch MapReduce implementation
- 2006: Hadoop is created; Cutting joins Yahoo!
- 2008: Yahoo! demonstrates Hadoop capabilities
- 2008: Broke the world record for fastest sort
- 2013: Continued innovation and adoption

Hadoop Ecosystem

Hadoop Ecosystem - Hadoop Base Platform

Hadoop Ecosystem - HDFS
- Hadoop Distributed File System
- Files split into 128 MB blocks
- Blocks replicated across several DataNodes (usually 3)
- Single NameNode stores metadata (file names, block locations, etc.)
- Optimized for large files
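The block size and replication factor above translate directly into block counts and raw storage. The sketch below works through that arithmetic using the 128 MB block size and replication factor of 3 from the slide. The file sizes and function names are illustrative assumptions.

```python
MIB = 1024**2
BLOCK_SIZE = 128 * MIB  # block size from the slide
REPLICATION = 3         # usual replication factor from the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size)


def raw_storage_bytes(file_size_bytes, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size_bytes * replication


# One hypothetical 1000 MiB file versus 1000 hypothetical files of 1 MiB each.
one_large_file = block_count(1000 * MIB)
many_small_files = 1000 * block_count(1 * MIB)

print(f"1000 MiB file:      {one_large_file} blocks")
print(f"1000 x 1 MiB files: {many_small_files} blocks")
print(f"raw storage either way: {raw_storage_bytes(1000 * MIB) // MIB} MiB")
```

Both layouts hold the same data and use the same raw storage, but the small files produce far more blocks for the NameNode to track, which is one way to read "optimized for large files".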

Hadoop Map Reduce Example

Hadoop Ecosystem - HBase
- HBase is a distributed column-oriented database built on top of HDFS
- NoSQL, highly available database
- Used as input and/or output with Hadoop
- Used when you require real-time read/write random access to very large datasets
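As a rough picture of the data model behind this kind of random access, the toy class below keeps rows keyed by a row key, with "family:qualifier" columns and timestamped versions. It is an in-memory illustration only and is not the HBase client API. The class name, row keys and column names are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """In-memory toy: row key -> "family:qualifier" -> [(timestamp, value)]."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Write one cell version."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Random read: latest version of one cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_key, stop_key):
        """Range read over row keys in sorted order, stop_key exclusive."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key


table = ToyTable()
table.put("user#001", "info:city", "Pune", timestamp=1)
table.put("user#001", "info:city", "Delhi", timestamp=2)
table.put("user#002", "info:city", "Mumbai", timestamp=1)
table.put("visit#001", "info:page", "/home", timestamp=1)

latest_city = table.get("user#001", "info:city")
user_rows = list(table.scan("user#", "user$"))
print(latest_city, user_rows)
```

Designing the row key so that related rows sort next to each other is what makes range scans like the one above cheap.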

Hadoop Ecosystem - Hive
- Developed at Facebook
- Maintains list of table schemas
- SQL-like query language (HQL)
- Can call Hadoop Streaming scripts from HQL
- Supports table partitioning, clustering, complex data types, some optimizations
- Translates SQL into MapReduce jobs
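Table partitioning is the easiest of these features to picture. The sketch below models a table whose rows are stored in one group per partition value, so that a query filtering on the partition column only has to read the matching group. It is plain Python, not Hive. The table name `sales`, the partition column `dt` and the rows are hypothetical.

```python
# Hypothetical table "sales" partitioned by "dt": one group of rows per value.
# Each row is (region, amount).
partitions = {
    "dt=2013-01-01": [("north", 120), ("south", 80)],
    "dt=2013-01-02": [("north", 200)],
    "dt=2013-01-03": [("south", 50), ("north", 30)],
}


def total_sales(partitions, dt):
    """Rough equivalent of: SELECT SUM(amount) FROM sales WHERE dt = '<dt>'."""
    wanted = f"dt={dt}"
    # Partition pruning: only the partition that matches the filter is read.
    scanned = [name for name in partitions if name == wanted]
    total = sum(amount for name in scanned for _, amount in partitions[name])
    return total, scanned


total, scanned = total_sales(partitions, "2013-01-03")
print(f"total={total}, partitions read={scanned}")
```

Without the partition filter, the same query would have to read every group, which on a large table means a correspondingly larger MapReduce job.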

Hadoop Ecosystem - Additional Components
- ZooKeeper: configuration storage and synchronization system for Hadoop
- Pig: data flow language (Pig Latin) for creating MapReduce jobs
- Sqoop: data import tool to bring structured data from an RDBMS into Hadoop (HDFS, Hive or HBase)
- Avro: serialization framework for persistent data and communication between Hadoop nodes
- Apache Oozie: an open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!

Traditional & Big Data Approaches
- Traditional Approach: Structured & Repeatable Analysis
  - Business users determine what question to ask
  - IT structures the data to answer that question
  - Examples: monthly sales reports, profitability analysis, customer surveys
- Big Data Approach: Iterative & Exploratory Analysis
  - IT delivers a platform to enable creative discovery
  - Business explores what questions could be asked
  - Examples: brand sentiment, predictive analytics, maximum asset utilization

Bring the Balance

Analytics Platform Differentiator Conventional + Big Data

Hybrid Enterprise Data Warehouse
- Data Sources: emerging and raw data streams; existing and operational data sources
- Cleansing, Modeling: cleansing tools, metadata, legal, compliance
- Hybrid Platform
- Presentation Tier

Big Data and Analytics

Analytics
- Descriptive analytics: using historical data to describe the business. This is usually associated with Business Intelligence or visibility systems. In the supply chain, you use descriptive analytics to understand historical demand patterns, how product flows through your supply chain, and when a shipment might be late.
- Predictive analytics: using data to predict trends and patterns. This is commonly associated with statistics. In the supply chain, you use predictive analytics to forecast future demand or to forecast the price of fuel.
- Prescriptive analytics: using data to suggest the optimal solution. This is commonly associated with optimization. In the supply chain, you use prescriptive analytics to set your inventory levels, schedule your plants, or route your trucks.
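The three kinds of analytics can be contrasted on one small supply chain example. The sketch below is illustrative only: the demand history, safety stock and on-hand inventory are made-up numbers, and a simple order-up-to rule stands in for a real optimizer in the prescriptive step.

```python
from statistics import mean

# Hypothetical monthly demand history, in units.
demand = [100, 110, 120, 130, 140, 150]

# Descriptive: summarize what has already happened.
average_demand = mean(demand)


def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    xs = range(len(ys))
    x_mean, y_mean = mean(xs), mean(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return slope, y_mean - slope * x_mean


# Predictive: extrapolate the trend one month ahead.
slope, intercept = fit_line(demand)
forecast = intercept + slope * len(demand)

# Prescriptive: turn the forecast into a decision (simple order-up-to rule).
safety_stock = 20  # assumed buffer, in units
on_hand = 45       # assumed current inventory, in units
order_quantity = max(0, forecast + safety_stock - on_hand)

print(f"average demand so far: {average_demand}")
print(f"forecast for next month: {forecast:.0f}")
print(f"suggested order quantity: {order_quantity:.0f}")
```

Each step consumes the output of the previous one: the history feeds the forecast, and the forecast feeds the ordering decision.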

Predictive Analytics and Crime Mitigation
- Look at past data
- Create patterns based on where crime events happened, situations, seasons and socio-economic factors
- Sniff data from multiple sources
  - Travel data
  - Changes in socio-economic data
  - Crime hot spots
  - Social media data: sensitive comments
  - Financial data: USPA, AML
  - Images and documents
- Why it is a Big Data problem
  - Data is coming from a variety of sources in different forms
  - At a very dynamic rate, in different volumes
- Solution
  - Identify and process data from identified sources -> create a mathematical model -> process the model offline and in real time

Business Intelligence & Predictive Analysis
- Transportation: identifying traffic patterns and predicting traffic conditions
- Big Data in health care
- Conservation of natural resources
- Waste management
- Operationally and economically efficient education system
- Resident services
- Global warming
- Revenue opportunities
- Tourism and tax patterns

Conclusion
- Identify the problem statement.
- Align Hadoop with business strategy.
- Hadoop is not Big Data but part of the ecosystem.
- A Big Data solution is both art and science.
- Rationalizing the approach is extremely critical.
- Hadoop and the conventional EDW need to co-exist.
- Attract top talent by using innovative and current technology.
- Operational, execution and business efficiency should be the success criteria.
- Data privacy, legal and access regulations have to be an integral part of design and metadata.

Appendix

Storage Capacity Terms

Big Data Example by Type

Variety: varying types of data, used in combination (structured, semi-structured, unstructured).
Example fragment: ... Time= &customer= &product=...