Big Data Workshop. dattamsha.com




Big Data Workshop

About Praveen
Has more than 15 years of experience working on various technologies.
Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with a 95% score, and has passed the first part of the Cloudera Certified Data Scientist certification (DS-200).
Is a Sun Certified Java Professional.
Has been active in the Big Data/Hadoop community: blog (http://www.thecloudavenue.com/ and http://www./blog/) and Stack Overflow (http://stackoverflow.com/users/614157/praveensripati).
Has extensive experience in consulting and in conducting training and workshops around Big Data.

Agenda
Job market around Big Data
What is Big Data?
Big Data use cases
Big Data ecosystem
What is HDFS?
What is MapReduce?
Pig, Hive, etc.
NoSQL databases

Why are we here? http://www.indeed.com/jobtrends

Why are we here? IDC predicts that the market for big data will reach $16.1 billion in 2014, growing 6 times faster than the overall IT market. http://www.nasscom.in/big-data-next-big-thing

Big Data Certifications http://university.cloudera.com/certification.html http://hortonworks.com/hadoop-training/

Expectations from the workshop
To familiarize each of you with the different aspects of Big Data, which will help you explore the various opportunities around it.
To motivate each of you to work on Big Data in the future.
The workshop will not make you an expert in Big Data, given the time frame.

Scale of data http://en.wikipedia.org/wiki/metric_prefix

Some interesting facts about data
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 PB of data.
Twitter generates 12 TB of data every day.
An Airbus A380 generates 10 TB every 30 minutes of flight.
NYSE generates a TB of data every month.
What do we do with so much data? Ignore it or use it.

Let's jump into Big Data

The model has changed
Old model: only a few companies were generating data (like news outlets); all others were consuming data.
New model: all of us are generating data, and all of us are consuming data.

Who's generating data?
Scientific instruments (collecting all sorts of data)
Social media and networks (all of us are generating data)
Mobile devices (tracking all objects all the time)
Sensor technology (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

MapMyRide iPhone app and cycling

How much data is getting generated? http://www.nasscom.in/big-data-next-big-thing

What's Big Data? In information technology, Big Data is a collection of data sets so large and complex that it becomes difficult to process and store using on-hand existing tools and technologies.

Converting raw data into information
(Diagram: raw data from various sources is stored and processed with machine learning, data mining and analytics to produce information and then action.)
Goal: increase the bottom line or the profits of the company.

Interesting fact about Neo4j (a graph database)
This is a social graph: me, friends (depth 2), friends of friends (depth 3).

Interesting fact about Neo4j (a graph database)

Depth   Execution time, MySQL (s)   Execution time, Neo4j (s)   How much faster
2       0.016                       0.010                       1.6
3       30.267                      0.168                       180.2
4       1543.505                    1.359                       1135.8
5       Not finished in 1 hr        2.132                       NA

The test finds friends of friends (fof) at different depths in a database of 1,000,000 users, and the results are striking. Execution time is in seconds, for 1,000 users.
http://www.neotechnology.com/how-much-faster-is-a-graph-database-really/
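The depth-based lookup measured above can be pictured as a breadth-first traversal. The following is only a minimal sketch in plain Python, not Neo4j or MySQL code: the graph, the names in it and the `friends_at_depth` helper are all hypothetical, and depth here is counted in hops from the starting person (1 = direct friends).

```python
from collections import deque

def friends_at_depth(graph, start, depth):
    """Return the people whose shortest path from `start` is exactly `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        person, dist = frontier.popleft()
        if dist == depth:
            result.add(person)
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    return result

# Hypothetical social graph, stored as adjacency lists
graph = {
    "me": ["ann", "bob"],
    "ann": ["me", "cid"],
    "bob": ["me", "cid", "dee"],
    "cid": ["ann", "bob", "eve"],
    "dee": ["bob"],
    "eve": ["cid"],
}

print(sorted(friends_at_depth(graph, "me", 1)))  # direct friends
print(sorted(friends_at_depth(graph, "me", 2)))  # friends of friends
```

A graph database stores these adjacency links directly, so each extra hop only touches the neighbours of the current frontier, whereas a relational database typically needs one more self-join per level.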

1-Scale (Volume) http://www.nasscom.in/big-data-next-big-thing

2-Complexity (Variety)
Various formats, types, and structures
Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
Static data vs. streaming data
A single application can be generating/collecting many types of data

3-Speed (Velocity)
Data is being generated fast and needs to be processed fast. Late decisions lead to missed opportunities.
E-promotions: send promotions right now for the store next to the customer, based on the customer's current location, purchase history and preferences.
Healthcare monitoring: any abnormal sensor measurement in the body requires an immediate reaction.
Fraud detection: a credit card company should detect fraud as soon as possible and take immediate action.

Some make it 4 V's

General use cases of Big Data
Recommendation
Customer segmentation and targeted promotions
Fraud detection
Network intrusion detection
Reporting and analytics
Sentiment analysis
http://www.bigdata-madesimple.com/category/sectors/

Predict the future

Use Case - Recommendations
This is called collaborative filtering, a type of machine learning algorithm.
(Diagram: users U1, U2 and U3 and the items B1, B2, B3, B4, B6, B7 and B8 associated with them.)
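To make the idea of collaborative filtering concrete, here is a minimal user-based sketch in plain Python. It is an illustration only: the baskets loosely reuse the U1-U3 and B1-B8 labels from the slide but are an assumed reading of the diagram, and the `jaccard` and `recommend` helpers are hypothetical names, not part of any library.

```python
def jaccard(a, b):
    """Similarity between two item sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(baskets, user, top_n=3):
    """Score items the user does not have by the similarity of the users who do."""
    own = baskets[user]
    scores = {}
    for other, items in baskets.items():
        if other == user:
            continue
        sim = jaccard(own, items)
        if sim == 0:
            continue  # users with nothing in common carry no signal
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable result
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]

# Assumed baskets (hypothetical data)
baskets = {
    "U1": {"B1", "B2", "B3"},
    "U2": {"B1", "B2", "B4"},
    "U3": {"B6", "B7", "B8"},
}

print(recommend(baskets, "U1"))
```

U1 and U2 share two items, so the item only U2 has is suggested to U1; U3 shares nothing with the others and contributes no recommendations. Production systems usually rely on matrix factorization or similar models instead of this direct comparison.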

Use Case: Sentiment Analysis
Sources: Twitter, Facebook, blogs. Technique: Natural Language Processing.
Based on the feedback in the different social media outlets, identify whether the feedback has been positive, negative or neutral. This can be used to improve the service/product or how it has been marketed.

Use Case: Reporting & Predictive Analytics
How many smartphones have been sold in each of the stores this year and the previous year?
How many customers have booked a hotel room in the past and how many will book in the next few months?
Did providing an incentive/offer to the customer increase the sales of a particular gadget?
How much of a particular flu medicine should be manufactured in the next 6 months?

Use Case: Classification
Classify an email as spam or non-spam automatically, as in the case of Gmail.
Given a particular medical report, automatically classify whether the report is positive or negative for a particular symptom.
Based on the different features of a customer, separate the potential customers who would be interested in a particular service from the rest.
Classify whether a taxi booking will be cancelled by the customer or not.

Use Case: Customer segmentation & targeted advertising
Separate the customers into different segments based on features like age, salary, gender, location and spending habits.
Target each customer by sending a customized email with some offers. This will increase the prospects of the customer opting for a particular service.
This is much better than sending the same email in bulk to all the customers.

Use Case: Fraud detection
Based on the nature of the transaction (time, location, spending habits, etc.), identify whether a particular credit card transaction is genuine or not.
Identifying fraudulent transactions as soon as possible helps in mitigating or minimizing the losses.

Use Case: Flu Trends
Identify the trends in the search terms for flu-related topics, and then manufacture the appropriate amount of flu-related drugs.
Manufacturing too many flu-related drugs results in excess inventory. Manufacturing too few means the demand won't be met. So, the appropriate amount of flu-related drugs has to be manufactured.
http://www.google.org/flutrends/

Use Case: Wiki Analysis
Download the entire Wikipedia dump and identify the importance of the different pages based on the PageRank algorithm.
Create an index for the entire Wikipedia, similar to the index at the end of a book.
By combining the above two data sets, a search engine can be created on top of Wikipedia, similar to Google.
http://en.wikipedia.org/wiki/pagerank http://en.wikipedia.org/wiki/wikipedia:database_download
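As a rough illustration of the PageRank step, the sketch below runs a simple power iteration over a tiny link graph in plain Python. The page names and links are hypothetical, the damping factor of 0.85 is the commonly used value and is an assumption here, and a real Wikipedia run would use a distributed framework, not a single process.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links.

    `links` maps every page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline rank (the "random jump")
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without links spreads its rank over all pages
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page wiki
links = {
    "Main": ["Hadoop", "Spark"],
    "Hadoop": ["Main"],
    "Spark": ["Main", "Hadoop"],
    "Stub": ["Main"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages that many other pages link to end up with the highest rank, while a page that nothing links to keeps only the baseline.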

Challenges with Big Data
Big Data technology is like a moving target.
Security has been an afterthought.
Privacy of the data
Lack of skills
Sharing of data across silos
Risks of storing huge data sets within a single system
Regulations in the different domains (HIPAA for health care)

Let's jump into the technology aspects of Big Data

What is ASF?
The Apache Software Foundation (ASF) is a non-profit organization.
It provides a platform or an environment for the development of different software.
The development happens in an open way. The code can be obtained from http://svn.apache.org/repos/asf/.
Different individuals and companies can work in a collaborative fashion.
There are a lot of Big Data and non-Big Data related projects.
The software is free to use with very few or no restrictions.

Different software under ASF
Non-Big Data software: Ant, Maven, Tomcat, Log4J, JUnit, Apache HTTP Server, Torque (ORM tool)
Big Data software: Hadoop, Hive, Pig, HBase, Cassandra, Sqoop, Oozie
https://projects.apache.org/indexes/alpha.html

Ecosystem around Big Data

How do they fit together?

Parallels between Linux and Big Data (Linux)
Base: GNU/Linux
Derivatives: Fedora, CentOS, RHEL, Ubuntu
Companies: Red Hat, Canonical

Parallels between Linux and Big Data (Big Data)
Base: Apache
Derivatives: CDH, HDP, M3/M5/M7, HDInsight
Companies: Cloudera, Hortonworks, MapR, Microsoft

About the Big Data Virtual Machine
Hadoop, Hive and Pig run on Ubuntu, inside Oracle VirtualBox, on a Windows, Mac or Linux laptop/desktop.
This is what has been shared in the DVD along with the installation instructions.
http://www.thecloudavenue.com/2013/01/virtual-machine-for-learning-hadoop.html

Getting started with the Big Data Virtual Machine (VM)
Install VirtualBox.
Configure the VM image in VirtualBox.
Use the different frameworks in the VM.

Ubuntu running on Windows 8

Introduction to Linux http://www.ee.surrey.ac.uk/teaching/unix/

Vertical & horizontal scaling of Big Data systems
Traditional systems (a high-end machine): expensive to scale and by design difficult to distribute.
Big Data systems (commodity, desktop-grade machines): by adding more and more commodity machines to the cluster, more and more of the data is stored and processed in parallel.

What does a datacenter look like?

Google leading the Big Data revolution http://www.thecloudavenue.com/2012/10/google-driving-big-data-space.html

What is Hadoop?
Framework/software for storage and analysis at large scale
Can process and store structured as well as unstructured data
Addresses many of the challenges of distributed processing and storage
HDFS for storage and MapReduce for analysis
HDFS and MapReduce are based on Google's papers.
Is free and is developed at the Apache Software Foundation
Commercial support is available from vendors like Cloudera, Hortonworks, Amazon and others
Production deployments are in the range of thousands of nodes

Hadoop handles the different failure scenarios
A rack/node going down.
A process dying.
A particular process running slow.
Problems in the network.
A hard disk drive crashing.
It also handles any changes to the network topology, like adding a new rack or a new node.

What is HDFS?
Distributed file system for redundant storage, based on Google's GFS paper.
Designed to reliably store data on commodity hardware.
Built to expect hardware failures.
Performs best with a modest number of large files.
Sits on top of the native file system.
http://research.google.com/archive/gfs-sosp2003.pdf

How does HDFS store the data?
The original file is split into blocks: b1, b2, b3, b4.

Machine   Blocks
m/c1      b1, b2
m/c2      b2, b3
m/c3      b3, b4
m/c4      b4, b1

The blocks are placed on different machines with redundancy to address fault tolerance.
http://research.google.com/archive/gfs-sosp2003.pdf
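The block layout on this slide can be reproduced with a few lines of plain Python. This is a toy sketch, not HDFS code: the `split_into_blocks`, `place_blocks` and `surviving_blocks` helpers are hypothetical, the tiny block size is for illustration, and the round-robin rule is an assumption chosen only because it yields the same layout as the slide (real HDFS uses its own rack-aware placement policy).

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, machines, replication=2):
    """Assumed round-robin rule: replica r of block i goes to machine (i - r) mod n."""
    if replication > len(machines):
        raise ValueError("replication cannot exceed the number of machines")
    placement = {machine: [] for machine in machines}
    for i in range(num_blocks):
        for r in range(replication):
            machine = machines[(i - r) % len(machines)]
            placement[machine].append(f"b{i + 1}")
    return placement

def surviving_blocks(placement, failed_machine):
    """Blocks that are still readable after one machine is lost."""
    return {block
            for machine, blocks in placement.items()
            if machine != failed_machine
            for block in blocks}

blocks = split_into_blocks("abcdefghij", 3)   # 4 blocks
machines = ["m/c1", "m/c2", "m/c3", "m/c4"]
placement = place_blocks(len(blocks), machines)

for machine in machines:
    print(machine, sorted(placement[machine]))
```

Because every block lives on two machines, `surviving_blocks` still returns all four blocks whichever single machine fails, which is the fault tolerance the slide describes.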

Processing of the data on different machines

Machine   Blocks    Process
m/c1      b1, b2    p1
m/c2      b2, b3    p2
m/c3      b3, b4    p3
m/c4      b4, b1    p4

Once the blocks are placed on different machines, they can be processed in parallel by different processes.

Different models for processing of the data

Software implementing the model   Distributed computing model
Hadoop                            MapReduce
Spark                             RDD
Open MPI                          MPI
Hama                              BSP
Giraph                            Graph

All of these sit on top of HDFS.

What is MapReduce?
Programming model for distributed computations at a massive scale.
A method for distributing a task across multiple nodes.
Execution framework for organizing and performing such computations.
Each node processes data stored on that node.
Provides:
Automatic parallelization and distribution
Fault tolerance
Status and monitoring tools
A clean abstraction for programmers
http://research.google.com/archive/mapreduce.html

MapReduce data flow
Input line "Hadoop is easy" -> mapper m1 -> (Hadoop,1) (is,1) (easy,1)
Input line "Hadoop is cool" -> mapper m2 -> (Hadoop,1) (is,1) (cool,1)
Reducer r -> (Hadoop,2) (is,2) (easy,1) (cool,1)
The mapper does the transformation by splitting the line into words.
The reducer does the aggregation by counting the number of occurrences of a particular word.
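The same data flow can be simulated in a single Python process to see what each phase contributes. This is a sketch of the programming model only, not Hadoop code: `mapper`, `shuffle` and `reducer` are hypothetical helper names, and the grouping step that Hadoop performs between map and reduce is written out explicitly.

```python
from collections import defaultdict

def mapper(line):
    """Transformation: split a line into (word, 1) pairs."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Aggregation: add up the occurrences of one word."""
    return (word, sum(counts))

lines = ["Hadoop is easy", "Hadoop is cool"]

mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(word, counts) for word, counts in shuffle(mapped).items())

print(result)  # {'Hadoop': 2, 'is': 2, 'easy': 1, 'cool': 1}
```

In a real cluster each mapper runs on the node holding its block and several reducers run in parallel, each receiving a disjoint subset of the keys.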

WordCount in Java (Mapper code)
MapReduce code can be written in Java and non-Java languages.
http://wiki.apache.org/hadoop/wordcount

WordCount in Java (Reducer code)

Apache Spark
Spark was developed at the AMPLab at the University of California, Berkeley. http://spark.apache.org/

WordCount in Python (Spark code)
Spark code is not only faster, but is also much easier to write.

    from pyspark import SparkContext

    # Input directory on HDFS and the Spark master of the workshop VM
    logfile = "hdfs://localhost:9000/user/bigdatavm/input"
    sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

    textfile = sc.textFile(logfile)

    # Split each line into words, emit (word, 1), then sum the counts per word
    wordcounts = (textfile.flatMap(lambda line: line.split())
                          .map(lambda word: (word, 1))
                          .reduceByKey(lambda a, b: a + b))

    wordcounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

http://spark.apache.org/

Putting it all together

Apache Pig
High-level data flow language
Made of two components:
Data processing language: Pig Latin
Compiler to translate Pig Latin to MapReduce
Abstracts you from specific details and allows you to focus on data processing.

Apache Pig (find top 10 urls visited by users)

    users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
    pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
    filteredusers = FILTER users BY age >= 18 AND age <= 50;
    joinresult = JOIN filteredusers BY name, pages BY user;
    grouped = GROUP joinresult BY url;
    -- aliases are case-sensitive, so the bag is referenced as joinresult
    summed = FOREACH grouped GENERATE group, COUNT(joinresult) AS clicks;
    sorted = ORDER summed BY clicks DESC;
    top10 = LIMIT sorted 10;
    STORE top10 INTO 'top10sites';
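For readers new to Pig Latin, the logic of the script above (filter, join, group, count, order, limit) can be traced in plain Python on a handful of rows. This is a sketch of the logic only: the `top_urls` helper and the sample rows are hypothetical, and it assumes user names are unique, so the join reduces to a membership test.

```python
from collections import Counter

def top_urls(users, pages, min_age=18, max_age=50, limit=10):
    """Return the most visited urls among users in the given age range."""
    # FILTER users BY age
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # JOIN on user name, GROUP BY url, COUNT the clicks
    clicks = Counter(url for user, url in pages if user in eligible)
    # ORDER BY clicks DESC (ties broken by url), then LIMIT
    ranked = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

# Hypothetical sample data in the same shape as users.txt and pages.txt
users = [("amy", 25), ("raj", 17), ("lee", 40), ("kim", 62)]
pages = [
    ("amy", "a.com"), ("lee", "a.com"), ("amy", "b.com"),
    ("raj", "b.com"), ("kim", "c.com"), ("lee", "a.com"),
]

print(top_urls(users, pages))
```

Only the clicks made by amy and lee are counted, because raj and kim fall outside the 18 to 50 age range.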

Same thing in MapReduce
MapReduce requires a lot more coding when compared to Pig.

Apache Hive (find top 10 urls visited by users)
Data warehousing layer on top of Hadoop
Allows analysis and queries using a SQL-like language
Hive is best for data analysts familiar with SQL who need to do dynamic queries, summarization and data analysis.

Apache Hive (find top 10 urls visited by users)

    CREATE TABLE users(name STRING, age INT);
    CREATE TABLE pages(user STRING, url STRING);

    -- table names are identifiers, not quoted strings
    LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
    LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

    -- ORDER BY gives a total ordering; SORT BY only orders rows within each reducer
    SELECT pages.url, count(*) AS clicks
    FROM users JOIN pages ON (users.name = pages.user)
    WHERE users.age >= 18 AND users.age <= 50
    GROUP BY pages.url
    ORDER BY clicks DESC
    LIMIT 10;

What is NoSQL?

Different types of RDBMS

RDBMS have scalability problems
There are 125+ NoSQL databases.

RDBMS                        NoSQL
ACID compliant               BASE compliant
Codd's rules                 Brewer's CAP theorem
Designed 15-20 years back    Schema-less
                             Horizontally scalable

Different types of NoSQL databases

Building online applications
(Diagram: an online application backed by an RDBMS and a NoSQL database.)

Demo of the Cloudera Cluster

Books
Hadoop: The Definitive Guide (3rd Edition) http://www.amazon.com/hadoop-definitive-guide-tom-White/dp/1449311520/
HBase: The Definitive Guide http://www.amazon.com/hbase-definitive-guide-lars-George/dp/1449396100/
NoSQL Distilled http://www.amazon.com/nosql-distilled-emerging-polyglot-Persistence/dp/0321826620/
Seven Databases in Seven Weeks http://www.amazon.com/seven-databases-weeks-modern-Movement/dp/1934356921/
Books get old quickly because the technology is moving fast.

Blogs
http://blog.cloudera.com/blog/
http://hortonworks.com/blog/
https://www.mapr.com/blog
https://databricks.com/blog
http://www.datastax.com/blog
http://planetbigdata.com/
Use an RSS aggregator like feedly.com to keep updated.

There is data everywhere around us.
The data needs to be analyzed for better things: customer satisfaction, better health, better sales.
Big Data will be the next big thing.
We need to scale up with proper knowledge of Big Data to grab the opportunities: proper training, practice, reading books.
