Big Data Explained. An introduction to Big Data Science.




Presentation Agenda
- What is Big Data
- Why learn Big Data
- Who is it for
- How to start learning Big Data
- When to learn it
- Objective and benefits of Big Data

What is Big Data
- Large-scale data management
- Data science and analytics
- Managing very large amounts of data and extracting value and knowledge from it

Introduction to Big Data
- What is Big Data?
- What makes data Big Data?

Big Data Definition
- There is no single standard definition
- Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it
- Examples: Google, Wikipedia, Amazon, Facebook, eBay, and other corporate enterprises

Data explosion

Data generation
- Web data, e-commerce
- Purchases at department and grocery stores
- Bank and credit card transactions
- Social networks
- Health care records
- Satellite imagery and weather modeling

Approximate Data Volumes
- Google processes 20 PB a day (2008)
- The Wayback Machine holds 3 PB and grows by 100 TB/month (3/2009)
- Facebook holds 2.5 PB of user data and adds 15 TB/day (4/2009)
- eBay holds 6.5 PB of user data and adds 50 TB/day (5/2009)
- CERN's Large Hadron Collider (LHC) generates 15 PB a year

Characteristics of Big Data: 1 - Scale (Volume)
- Data volume is projected to grow 44x from 2009 to 2020, from 0.8 zettabytes (ZB) to 35 ZB
- The volume of collected and generated data is increasing exponentially
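
The two endpoints quoted on this slide imply a compound annual growth rate, which makes "exponential" concrete. The following is a minimal Python sketch of that arithmetic, assuming smooth year-on-year growth between the 2009 and 2020 figures; the function name is illustrative only.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures from the slide: 0.8 ZB in 2009 and 35 ZB in 2020.
START_ZB, END_ZB = 0.8, 35.0
YEARS = 2020 - 2009

factor = END_ZB / START_ZB  # overall multiplier, roughly 44x
rate = annual_growth_rate(START_ZB, END_ZB, YEARS)

print(f"overall growth: {factor:.1f}x")
print(f"implied annual growth: {rate:.0%}")
```

Under that assumption the data volume grows by roughly 41% per year, which means it doubles about every two years.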

Characteristics of Big Data: 2 - Complexity (Variety)
- Various formats, types, and structures
- Text, numerical data, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
- Static data vs. streaming data
- A single application can generate or collect many types of data
- To extract knowledge, all these types of data need to be linked together

Characteristics of Big Data: 3 - Speed (Velocity)
- Data is generated fast and needs to be processed fast
- Online data analytics
- Late decisions mean missed opportunities
- Examples:
  - E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  - Healthcare monitoring: sensors monitor your activities and body, and any abnormal measurement requires an immediate reaction
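
The healthcare monitoring example can be sketched as a streaming check that looks at each reading as it arrives instead of waiting for a complete dataset. This is a minimal Python illustration, not a clinical method: the window size, threshold, `min_sigma` floor, and the sample heart-rate values are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(readings, window=5, threshold=3.0, min_sigma=1.0):
    """Yield (index, value) for readings far from the recent average.

    Each reading is compared with the previous `window` readings as soon
    as it arrives, so the full stream never has to be held in memory.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            # Floor the spread so a perfectly flat window does not
            # turn every tiny change into an alert.
            sigma = max(pstdev(recent), min_sigma)
            if abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)

# Hypothetical heart-rate stream in beats per minute.
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 72]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because `detect_anomalies` is a generator, it works the same way on an unbounded sensor feed as on this short list.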

Big Data: The 3 V's

Some Make It 4 V's

Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (data warehousing)
- RTAP: Real-Time Analytics Processing (Big Data architecture and technology)

Who's Generating Big Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
- Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable fashion

Why learn Big Data: The Model Has Changed
- The model of generating and consuming data has changed
- Old model: a few companies generate data, and all others consume it
- New model: all of us generate data, and all of us consume data

Big Data Types

What's driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time processing
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Who is it for: The Value of Big Data Analytics
- Big data is more real-time in nature than traditional data warehouse (DW) applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well suited to big data applications
- Shared-nothing, massively parallel processing (MPP), scale-out architectures are well suited to big data applications
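
The shared-nothing idea can be illustrated in a few lines: each record is routed to exactly one node by hashing its key, every node aggregates only its own partition, and the partial results are merged at the end. The sketch below is a single-process Python simulation under that assumption; the function names and the sample sales records are hypothetical, and in a real cluster each `local_sum` call would run on a separate machine.

```python
from collections import Counter
from zlib import crc32

def partition(records, num_nodes):
    """Assign each (key, value) record to exactly one node by key hash."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        # crc32 is stable across runs, unlike the built-in hash() for str.
        nodes[crc32(key.encode("utf-8")) % num_nodes].append((key, value))
    return nodes

def local_sum(node_records):
    """Work done independently on one node, using only its own data."""
    totals = Counter()
    for key, value in node_records:
        totals[key] += value
    return totals

def scale_out_sum(records, num_nodes):
    """Partition, aggregate per node, then merge the partial results."""
    partials = [local_sum(n) for n in partition(records, num_nodes)]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the counts key by key
    return dict(merged)

sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3)]
print(scale_out_sum(sales, num_nodes=3))
```

Adding nodes changes only `num_nodes`; the result is the same because no node ever needs another node's data.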

Big Data Market

Challenges in Handling Big Data
- The bottleneck is in technology: new architectures, algorithms, and techniques are needed
- The bottleneck is also in technical skills: experts are needed who can use the new technology and deal with big data

How does Big Data work?
- What technologies do we have for Big Data?

Big Data Landscape

Big Data Technology

How to get started
- Learn the platform (how it is designed and how it works)
  - How big data is managed in a scalable, efficient way
- Learn to write Hadoop jobs in different languages
  - Programming languages: Java, C, Python
  - High-level languages: Apache Pig, Hive
- Learn advanced analytics tools on top of Hadoop
  - RHadoop: statistical tools for managing big data
  - Mahout: data mining and machine learning tools over big data
- Learn state-of-the-art technology from recent research papers
  - Optimizations, indexing techniques, and other extensions to Hadoop
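
As a first taste of writing a Hadoop job in Python, the sketch below follows the Hadoop Streaming convention: a mapper emits tab-separated key/value lines, the framework sorts them by key, and a reducer sums the values for each key. It is a minimal word-count illustration run locally, where `run_locally` is a hypothetical stand-in for the framework's sort-and-shuffle step and not part of Hadoop.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word; the input must be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

text = ["big data is big", "data is everywhere"]
for out in run_locally(text):
    print(out)
```

In an actual streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output, with Hadoop performing the sort between them.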

Some popular vendors

When to learn Big Data
- IT companies have a long-standing practice of recruiting engineering graduates
- Graduates in ECE, EEE, civil, mechanical, and other branches are mostly unable to apply what they learned in their degree to that work
- The time before graduation is the right time to learn Big Data technologies

Learning Big Data (continued)
- If you start now, you can master it within the next two years
- Big Data involves not just several tools but numerous technologies, methodologies, and mathematical and statistical concepts
- These need to be thought through, developed, and applied appropriately to reach a given goal
- Algorithms and computing languages are required to turn Big Data into applied intelligence in practice

Why learn Big Data now?
- It is a culmination of several technologies
- The sooner, the better
- A flexible mind that starts learning Hadoop and related technologies now can be well positioned for the right job in a few years
- Comparable to a 3-year IT diploma course

Resources & Books
- No specific syllabus: Big Data is a relatively new topic with no fixed syllabus
- The field is developing in an evolutionary way and is still being standardized
- Where to learn Big Data: university, the Cloudera CDH VM, and many more vendors
- Related books: Hadoop: The Definitive Guide, and several others

Resources on the net
- Vast information on Big Data is available on the internet
- Tutorials, YouTube videos, articles, and vendor white papers
- Most of it is open source and available to everyone
- What one needs is time, interest, energy, and a bit of foresight to work on ambitious projects

Learning curve
- Some Big Data courses available in the market:
  - Cloudera Certified Administrator for Apache Hadoop
  - Cloudera Certified Developer for Apache Hadoop
- Several perceptions and perspectives
- Several tools exist for Big Data technology
- Students need the right direction to get started
- Who learns what is more important
- Clear goals and learning curves for administrators and developers
- A combination of the above is the right mix for young minds

Starting with Big Data
- A virtual machine environment is best suited to start
- Any supported or popular Linux distribution
  - Preferred: RHEL, SUSE, CentOS, Ubuntu, or Fedora
- Hadoop platform
  - Single-node first, then clustered with high availability
- Cloudera QuickStart VM (CDH 4.4)
  - Cloudera is one of the pioneers in Big Data technologies
  - CDH, the Cloudera Distribution for Hadoop, is available as a VM
  - Downloadable from the Cloudera website
- Other needed software packages

Introduction to Hadoop
- The name is sometimes expanded as "High Availability Distributed Object Oriented Platform", but Hadoop is not an acronym; it was named after a toy elephant belonging to the son of co-creator Doug Cutting
- Developed under the Apache Software Foundation (http://apache.org)
- Google started in the 1990s; the 2000s brought data management complexities
- In 2004, Google published a paper on MapReduce, a framework that provides a parallel processing model

Hadoop (continued)
Google's technologies:
1. GFS (Google File System): a distributed file system
2. MapReduce: a framework for parallel processing
3. Bigtable: a data storage system
Open-source counterparts, reimplemented from Google's published designs as Apache Software Foundation projects:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. Apache HBase
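
To make "a data storage system" more concrete: the Bigtable design, which HBase follows, treats a table as a sparse sorted map from (row key, column, timestamp) to a value. The toy Python class below sketches only that data model, held in memory on one machine. The class and method names are hypothetical and are not the HBase client API.

```python
import bisect

class ToyTable:
    """In-memory toy of a Bigtable/HBase-style table (not the real API)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value)
        self._rows = []   # row keys kept sorted to allow range scans

    def put(self, row, column, value, timestamp):
        if row not in self._rows:
            bisect.insort(self._rows, row)
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the newest value of a cell, or None if it is absent."""
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Return the row keys in [start_row, stop_row), in sorted order."""
        lo = bisect.bisect_left(self._rows, start_row)
        hi = bisect.bisect_left(self._rows, stop_row)
        return self._rows[lo:hi]

table = ToyTable()
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example/about", "anchor:home", "About us", timestamp=1)
table.put("org.apache/hadoop", "contents:html", "<html>hadoop</html>", timestamp=1)

print(table.get("com.example/index", "contents:html"))
print(table.scan("com", "org"))
```

Keeping rows sorted by key is what makes range scans cheap, which is why row keys in such systems are often designed so that related rows sort next to each other.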

Real-world scenarios
- Imagine your boss comes to you and says: "Here are 50 GB of log files. Find a way to improve our business!"
- What would you do? Where would you start? And what would you do next?
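
One possible first step with the 50 GB of log files is a single streaming pass that counts which pages fail most often, reading one line at a time so the data never has to fit in memory. The Python sketch below assumes a hypothetical four-field log format (timestamp, method, path, status); real logs would need their own parser.

```python
from collections import Counter

def error_counts(lines):
    """Count server errors (status >= 500) per URL path, line by line."""
    errors = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines rather than fail the whole run
        _timestamp, _method, path, status = parts
        if status.isdigit() and int(status) >= 500:
            errors[path] += 1
    return errors

# Hypothetical sample in the assumed format.
sample = [
    "2014-09-03T10:15:02 GET /checkout 500",
    "2014-09-03T10:15:03 GET /home 200",
    "2014-09-03T10:15:07 POST /checkout 503",
    "2014-09-03T10:15:09 GET /search 500",
    "corrupted line",
]
counts = error_counts(sample)
print(counts.most_common(1))
```

An open file object can be passed in place of `sample`, since Python iterates over files lazily. The same per-line logic is also what a mapper would do once the logs are too large for one machine.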

Cloud and Big Data
- Most traditional IT skills are moving towards the Cloud and Big Data
- Some related fields:
  - Artificial intelligence
  - Distributed computing / supercomputing
  - Business analytics / business intelligence
  - Data analytics / data mining

Companies using Hadoop

Hadoop: Types of business problems

How does MapReduce help

Hadoop and MapReduce Architecture

A Sample Hadoop Cluster Configuration

RDBMS vs. Hadoop

History of Databases

Object Databases

Relational Dominance

Bigtable, Dynamo and HBase

NoSQL = Not Only SQL

Database ecosystem

Past, Present and Future of IT
- Information technology, or IT
- The term IT first appeared in 1958
- IT acts as a catalyst for other areas of science and technology
- A movement from an IT-driven industry to an open information society
- Today we are part of a global village via the internet, which is now a commodity and a common consumer service
- Fast innovation and invention will continue

Big Data development
- Big Data environment
  - Use of virtual machines
  - Java runtime environment
  - Hadoop and related software, installed on a single node or clustered
  - Running Cloudera CDH, IBM BigInsights, etc.
- Prerequisites for learning Big Data
  - Working knowledge of computers
  - Basic knowledge of Linux, C, Java, etc.
  - Awareness of virtualization and cloud trends

Thank You