While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot. Remember, it stands front and center in any discussion of how to implement a big data strategy. The early adopters were Google, Facebook, LinkedIn and Yahoo, but it is now going mainstream. So what is it?

First, some context. Who here has big data? So, Big Data: what does it mean? Most often big data refers to size; if it is data and it is big, it must be BIG DATA. However, there are other elements of big data not necessarily associated with size. The way I like to describe it is that big data is any attribute that challenges the constraints of a system's capability. For example: a 20 MB PowerPoint you can't send via email could be big data; a 100 GB X-ray image you can't accurately display on a remote screen in real time for a consultation could be big data; a 1 TB movie whose edits you can't render within the time constraints of the business could be big data. With the explosion of data in our environments, the likelihood that you have big data is higher than you think. So, who here now has big data? As you can see, in 2000 we generated 2 exabytes, or 2,000 PB, of new information over the whole year.

Fast forward to 2011: the amount of information we generated was 2 exabytes every day! That is huge, and most of it could be a result of big data.

This unstructured, file-based data comes from many sources, such as email, audio, video, images, Word documents, machine-generated logs, and so on. This type of file-based data is growing at an especially accelerated rate and is expected to grow 50x over the next 10 years. So we have gone from 2 exabytes in a year in 2000, to 2 exabytes a day in 2011, and then 50x that over the next 10 years. To me that means we need to do things differently, otherwise we won't be able to store that data, let alone extract any type of business value from it. And yet the signs of that value are everywhere:
- Unprompted, an airline or a telephone company offers you a spiff, perhaps just a few days after you've had an unpleasant experience on one of their flights. How'd they do that?
- A retailer seems to offer you products closer and closer to the ones you'd actually like to buy, and at prices better than you had seen in the store or on your favorite website.
- Your physician is increasingly able to predict how you, individually, will respond to a particular course of treatment.
- You're seeing fewer and fewer dropped calls on your mobile.
- Your power company is now delivering exact assessments of how energy-efficient your dwelling is.

We arrive at what many have called the next great productivity revolution: the age in which research and the scientific process become automated, in the sense that so much data is available for analysis that we no longer need to speculate about what may be taking place; the trends, the hypotheses, are already produced by the amount of data available. (See the article here: http://www.wired.com/wired/issue/16-07) A story to illustrate: the Australian government recently announced funding for a National Mental Health Council. Their job is to collect, annually, all the metrics on mental health from around Australia (this is currently done on an ad-hoc basis by researchers, but not systematically). So they will gather data from the states on suicide rates, on hospital admissions, on substance abuse, on medically reported rates of depression, and so on, all manually collated by emailing and asking for spreadsheets: a huge undertaking. The old way. Consider now that you can install an iPhone app that uses voice pattern recognition to determine your mood; that is, your mental health can be roughly determined by your smartphone listening to your intonation. So in the age of big data, all this manual assembly of data becomes redundant; the process can be automated. And not only do we leave behind the manual assembly of data, we can leave behind the process of speculation, hypothesis, and then testing the hypothesis. The mental health of all Australians who have a smartphone could be collected (this would need to be voluntary, obviously), and then researchers could dig into patterns of mood on a daily basis, across demographics and so on: an outbreak of short-term depression across Queensland the day after the Maroons got beaten by the Blues, for example. The point is, there's a revolution in research productivity here. That's why Wired magazine calls it the end of science.

Big Data is massive new data volumes. This is a typical Australian electricity bill. How often do I get one of these? (Guess.) Every 3 months. Why? Because a person has to physically walk up to the meter and read it; very manual, and it can only be done a few times a year. Now, the electricity company has a data warehouse that captures all their billing data. They use it to analyse usage patterns across parts of the network (and not much else). Their data warehouse might be 3 terabytes. Not huge. This is a smart meter. The smart meter provides readings directly to the utility, via wireless or mobile phone networks, every 5 minutes. So instead of one reading per customer every three months, we can access a record per customer every 5 minutes. The data has just grown 3,000 times. (That's big data.) So suddenly the utility, to retain the same level of customer data, needs a 9-petabyte data warehouse. Most of it becomes exhaust data, but there is tremendous value in this data if you can keep it and analyse it:
- Network load analysis over time
- Better decisions on where to increase network capacity
- Real-time alerts when a particular power node is approaching saturation
- You can also provide the information back to your customers

This is the SilverSpring website, in a screenshot taken about two weeks ago. Notice the promise here to the consumer, "See your energy use in real time": you only want to make that promise if your database can perform, handling huge numbers of ad-hoc queries from consumers accessing your website 24x7. SilverSpring use Greenplum to capture all the readings from their smart meters into a database, and they make this data available to their customers.

An example of what consumer-facing real-time electricity usage looks like.

But it's not just for the consumers; the real value for the utility is what they can do with the data themselves:
- Predictive maintenance
- Usage trends over time, down to the suburb and street level
- Geo-spatial mappings over streets, looking for weather-related incidents, maintenance cost anomalies, and so on
I worked with a utility in Sydney where, using their data warehouse, they were able to identify some motors and pumps that were starting and stopping several times an hour, and others that only cycled once or twice a day or week. So instead of blindly sending around a maintenance crew every 3 months, they could maintain some pumps every month and other pumps only twice a year: savings of several million dollars.

And here are some examples of the kind of analytics that can be run against different types of data, and the kind of insights you can expect to gain. The point is that traditional data warehousing architectures don't cater for these types of analytics.

First, let's cover the name. Hadoop was created by Doug Cutting, who named his new project after his son's favorite toy, a stuffed elephant for which the boy had made up a name. The elephant's name was Hadoop! So it is not an acronym, and it doesn't have a special hidden meaning; it is simply the name of a toy elephant.

I am sure we have all heard the hype. Hadoop is increasingly attracting the attention of the press and the IT industry as a transformative technology, one that can be used to gain competitive advantage; if you read some of these quotes, you would be asking, "Where can I get one?"

On the flip side, when you dig a little deeper you find some hidden facts. Although Hadoop does have real momentum in the market today, as the slide says, only a few enterprise vendors have adopted it. Companies may say they have it on their fast track, or are evaluating it, but as we shall soon find out, traditional Hadoop is difficult, and it lacks some of the features, functionality, and deployment options needed for easy adoption by enterprises.

While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot. Remember, it stands front and center in any discussion of how to implement a big data strategy. The early adopters were Google, Facebook, LinkedIn and Yahoo, but it is now going mainstream. So what is it?

Trying to understand Hadoop concepts in the context of the wrong architecture design can actually hurt our brains. Hadoop was not designed with the typical enterprise environment in mind. Hadoop was designed for a cluster architecture built out of commodity x86 hardware with direct-attached storage (DAS), using open source software based on papers published by Google. Why base Hadoop on a cluster architecture?
- Linearly and horizontally scalable
- Open source software based on papers published by Google
- Able to store and query massive amounts of unstructured data
- Fault-tolerant, reliable storage
- Inexpensive to build and maintain
- Hides complexity behind a common interface
These are the elements that were a critical part of the Hadoop design.

Here it is; this is Hadoop in all its glory. Hadoop harnesses the cluster architecture by addressing two key points: it lays the data out across the cluster, ensuring that the data is evenly distributed; and it then presents that evenly distributed data back to the applications so they can benefit from data locality. That brings us to Hadoop's two main mechanisms: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. HDFS is the distributed file system that lets Hadoop scale across commodity servers. That part is easy: we all know what filesystems are, and HDFS is no different, except that it stores data across all the nodes that participate in a Hadoop cluster. MapReduce is the parallel-processing engine that allows Hadoop to churn through that data, quickly. These are the two must-have components of any Hadoop cluster, and together they take care of all the complexity of harnessing the parallel processing power of a cluster. You now know what Hadoop is at a high level; let's look at the detail.
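
To make the two mechanisms concrete, here is the classic word-count job written against Hadoop's Java MapReduce API. This is a minimal sketch rather than anything from the slides: the HDFS input and output paths arrive as command-line arguments, the map tasks run on the nodes that hold the input blocks (data locality) and emit (word, 1) pairs, and the reduce tasks sum the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs where the HDFS blocks live; emits (word, 1) per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```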

In a traditional Hadoop architecture, we can now visualise the major components:
1. MapReduce (the JobTracker and the TaskTrackers)
2. The NameNode and Secondary NameNode (the HDFS NameNode stores the edit log and the filesystem image)
3. The DataNodes (which run on the slave nodes)
In a traditional architecture, the compute nodes and the storage nodes can be one and the same.
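
As a rough sketch of how clients find those components, this is the kind of minimal site configuration a classic Hadoop 1.x-era cluster uses; the property names are the real 1.x ones, but the hostnames and ports here are placeholders, not taken from the slides. Note dfs.replication, which is the 3x mirroring discussed on the next slide.

```xml
<!-- core-site.xml: where clients find the HDFS NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml: every block is kept as three copies across DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- mapred-site.xml: where clients submit MapReduce jobs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```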

So now that we are clear on what Hadoop is and how it fits within an environment, note that it does come with some challenges:
- NameNode failover is not seamless
- Poor utilization of storage and CPU resources in Hadoop clusters
- Inefficient data staging and loading processes
- Backup and disaster recovery are missing
- Servers with direct-attached storage (DAS) are islands of storage; remember the 90s?
- Data protection: a 3x mirror of all data
- Data ingest: difficult and tool-dependent (no common protocol access)
- Scaling: add more servers with DAS
- Single point of failure: the NameNode
- Replication: none
- Data recovery: recreate data from other sources
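
The ingest point is worth a concrete illustration. Because traditional HDFS is not exposed over common protocols such as NFS or CIFS, getting data in means going through Hadoop-specific tools or the HDFS Java API. Here is a minimal sketch; the NameNode host and the file paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster's NameNode (hypothetical host).
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; its blocks are then mirrored
    // three times across the DataNodes by default.
    fs.copyFromLocalFile(new Path("/local/data/meter-readings.csv"),
                         new Path("/user/demo/meter-readings.csv"));
    fs.close();
  }
}
```

Every loader has to speak this API, or shell out to the hadoop fs commands, which is what makes ingest tool-dependent compared with simply writing to a network share.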
