What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani

# whoami
- Security Researcher, Malware Reverse Engineer, Developer
- GIT > George Mason > UC Berkeley > FireEye > On Stage
- Building data-driven network security products at Informant Networks
- Me = Cyber Security + Big Data + Machine Learning
- Hiring!

Agenda
- What is Big Data?
- What do we mean by a distributed system?
- How does HDFS work?
- What is MapReduce?
- When is MapReduce supposed to be used?
- Spark

SIGINT for questions

What is Big Data?
- A very broad term
- Usually means that the computation code has become minuscule compared to the error-handling and synchronisation code
- Data is heterogeneous
- 100 GB is not yet Big Data; 10+ TB probably is
- MapReduce won the sorting challenge: 102 TB in 72 minutes with 2100 nodes

Why so much data?
- One tweet is an anecdote, but a million tweets may signal important trends
- One person's product review is an opinion, but a million reviews might uncover a design flaw
- One person's diagnosis is an isolated case, but a million medical records could lead to a cure

The failure problem
- 5 commodity machines are cheaper than one large machine
- If the large machine is down one day per year (about 99.7% availability), then 5 such machines will together be down 5 machine-days per year
- Systems need to incorporate this downtime
- Imagine no Google for 5 days a year
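As a rough check of the arithmetic on this slide, the sketch below converts between availability and downtime. It assumes a 365-day year and that downtime simply adds up across machines; the function names are illustrative only. Note that 99.9% availability works out to under nine hours of downtime per year, while one full day per year corresponds to roughly 99.7%.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours per year for an availability in [0, 1]."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_for_downtime(days_down_per_year: float) -> float:
    """Availability implied by a given number of down days per year."""
    return 1.0 - days_down_per_year / 365


def fleet_machine_days_down(n_machines: int, days_down_per_machine: float) -> float:
    """Total machine-days of downtime per year across a fleet (simple sum)."""
    return n_machines * days_down_per_machine


if __name__ == "__main__":
    print(downtime_hours_per_year(0.999))    # hours per year at 99.9%
    print(availability_for_downtime(1))      # availability for 1 day down
    print(fleet_machine_days_down(5, 1))     # 5 machines, 1 day each
```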

The problem of scale
- "Anything that can go wrong will go wrong" (Murphy)
- At 10,000+ machines, even if only one fails each day, failures are happening all the time
- At large scale, small error rates manifest themselves
- This is the amplification problem
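The amplification effect can be made concrete with a small calculation. This is a sketch under the assumption that machines fail independently with the same daily probability p; the value p = 1/10,000 is chosen only so that 10,000 machines average one failure per day, as on the slide.

```python
def expected_failures_per_day(n_machines: int, p_fail_per_day: float) -> float:
    """Average number of machine failures per day in the fleet."""
    return n_machines * p_fail_per_day


def prob_at_least_one_failure(n_machines: int, p_fail_per_day: float) -> float:
    """Probability that at least one machine fails on a given day,
    assuming independent failures."""
    return 1.0 - (1.0 - p_fail_per_day) ** n_machines


if __name__ == "__main__":
    p = 1 / 10_000  # hypothetical: each machine fails about once in 27 years
    for n in (1, 100, 10_000):
        print(n, expected_failures_per_day(n, p), prob_at_least_one_failure(n, p))
```

Even with machines this reliable, a fleet of 10,000 sees a failure on most days, so failure handling has to be part of normal operation.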

The problems to be solved
- How to store all that data at reasonable cost
- How to process all the stored data

The solutions
- How to store all that data at reasonable cost: the Hadoop Distributed File System (HDFS)
- How to process all the stored data: the Hadoop MapReduce framework

GFS and MapReduce
- Developed at Google by Jeff Dean, Sanjay Ghemawat and others
- GFS: Google File System
- Yahoo! open sourced Hadoop
- Hadoop = HDFS (modelled on GFS) + MapReduce
- Comes under the Apache License
- Hadoop was the name of a toy elephant belonging to the child of Doug Cutting (Hadoop's inventor)
- GFS paper: Proceedings of the 19th ACM Symposium on Operating Systems Principles; over 5000 citations

What is distributed?
- Nothing is more important than anything else
- No master, no slave; everyone is equal
- A colony of ants
- Parallel and distributed are different things (in some ways)
- Like workers lifting rocks: expendable!
- "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox." (Grace Hopper)

Key ideas
- Files are stored as chunks of 64 MB each (configurable)
- Reliability through replication: each chunk is replicated across 2+ chunkservers
- A single master coordinates access and keeps metadata: simple, centralised management
- No data caching: it brings little benefit because of large datasets and streaming reads
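The chunking and replication ideas above can be sketched in a few lines. This is a toy model, not the GFS or HDFS implementation: the round-robin placement, the server names and the replication factor of 3 are assumptions made for illustration. The dictionary returned by `assign_replicas` stands in for the metadata that the single master keeps.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide (configurable)


def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Return a list of (offset, length) pairs covering the whole file."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def assign_replicas(num_chunks: int, servers: list, replication: int = 3):
    """Map each chunk id to `replication` distinct servers.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    return {
        chunk_id: [servers[(chunk_id + r) % len(servers)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


if __name__ == "__main__":
    chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
    metadata = assign_replicas(len(chunks), ["s1", "s2", "s3", "s4"])
    for chunk_id, (offset, length) in enumerate(chunks):
        print(chunk_id, offset, length, metadata[chunk_id])
```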

In a nutshell

More
- Replication is rack aware, hence fault tolerance is extremely high
- Supports a scale of 10,000+ machines
- Allows for building higher levels of abstraction
- Yahoo!: 4000 nodes; Facebook: 2300 nodes
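The slide does not spell out what "rack aware" means in practice. The sketch below follows the default HDFS policy for three replicas as commonly documented: the first replica goes on the writer's node and the other two on two different nodes of one remote rack, so the loss of a whole rack never removes every copy. The topology dictionary and node names are hypothetical, and real HDFS picks nodes at random rather than taking the first candidates.

```python
def rack_of(node: str, topology: dict) -> str:
    """Return the rack that contains `node`."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(writer: str, topology: dict) -> list:
    """Choose three replica locations spanning two racks.

    Deterministic for clarity: takes the first suitable remote rack
    and its first two nodes.
    """
    local_rack = rack_of(writer, topology)
    for rack, nodes in topology.items():
        if rack != local_rack and len(nodes) >= 2:
            return [writer, nodes[0], nodes[1]]
    raise ValueError("need a second rack with at least two nodes")


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", topology))
```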

MapReduce
- Expects a fault-tolerant file system underneath
- Applies computation to the data at the node
- No need to bring data to one place: send the computation to where the data resides
- Exploits associativity: partial results can be combined in any grouping
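A minimal illustration of why associativity matters: each node can reduce its own block of data locally, and only the small partial results need to travel over the network. The blocks below are made-up data standing in for file chunks on three nodes.

```python
from functools import reduce
import operator


def local_sum(block):
    """Computation that runs where the data lives (one node, one block)."""
    return sum(block)


def combine(partials):
    """Merge partial results; valid because addition is associative."""
    return reduce(operator.add, partials, 0)


if __name__ == "__main__":
    blocks = [[1, 2, 3], [4, 5], [6]]  # hypothetical data on three nodes
    partials = [local_sum(b) for b in blocks]
    print(partials, combine(partials))
```

The combined result equals the sum over all the data in one place, whichever way the blocks happen to be grouped.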

MapReduce
- Map is a one-to-one operation: for each x, apply a function f(x) = y
- Reduce collects all the y's to give a result, e.g. count(y) = n
- All maps are done in parallel; reduce is done in parallel over similar data items
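The map and reduce steps described above can be sketched on a single machine as follows. This is a sequential toy version, not Hadoop's API: on a cluster each `f(x)` call and each per-key count would run in parallel on different nodes. The function names and the sample input are illustrative.

```python
from collections import defaultdict


def map_phase(items, f):
    """One-to-one: apply f to every input item x, giving y = f(x)."""
    return [f(x) for x in items]


def group_by_value(ys):
    """Bring similar items together so each group can be reduced separately."""
    groups = defaultdict(list)
    for y in ys:
        groups[y].append(y)
    return groups


def reduce_phase(groups):
    """Reduce each group to a single result: count(y) = n."""
    return {y: len(members) for y, members in groups.items()}


def map_reduce(items, f):
    return reduce_phase(group_by_value(map_phase(items, f)))


if __name__ == "__main__":
    words = ["Tweet", "tweet", "Review", "TWEET"]
    print(map_reduce(words, str.lower))
```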

Illustration

What happens when things break?
- Failed computations are re-run
- If one computer is slow, its operations are given to others
- If a hard disk fails, replace the disk and the data will be copied back
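The first point, re-running failed computations, can be sketched as a simple retry loop. This is an assumption-laden toy: real schedulers reassign the task to a different node, whereas here the "task" is a plain function and `flaky_task` is a hypothetical stand-in that fails on its first two attempts.

```python
def run_with_retries(task, max_attempts: int = 3):
    """Run `task(attempt)` until it succeeds or attempts run out.

    Returns (result, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError as err:  # simulated node or task failure
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(attempt: int) -> str:
    """Hypothetical task that only succeeds on the third attempt."""
    if attempt < 3:
        raise RuntimeError("simulated node failure")
    return "done"


if __name__ == "__main__":
    print(run_with_retries(flaky_task))
```

Re-running is only safe because map and reduce tasks are deterministic functions of their input, so a repeated task produces the same output.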

Next generation
- Hundreds of frameworks are built on top of HDFS
- Cloudera offers a one-click solution for $$$$; Hortonworks is open source
- Spark is the most promising up-and-coming project in the Apache family of projects
- Remember that MapReduce sorted 102 TB in 72 minutes using 2100 machines?

Where things are going