Big Data Explained
An introduction to Big Data Science.
Presentation Agenda
- What is Big Data
- Why learn Big Data
- Who is it for
- How to start learning Big Data
- When to learn it
- Objective and benefits of Big Data
What is Big Data
- Large-scale data management
- Data science and analytics
- Managing very large amounts of data and extracting value and knowledge from it
Introduction to Big Data
- What is Big Data?
- What makes data "Big Data"?
Big Data Definition
- No single standard definition
- Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it
- Examples: Google, Wikipedia, Amazon, Facebook, eBay, and other corporate enterprises
Data explosion
Data generation
- Web data, e-commerce
- Purchases at department and grocery stores
- Bank and credit card transactions
- Social networks
- Health care records
- Satellite imagery and weather modeling
Data Approximation
- Google processes 20 PB a day (2008)
- The Wayback Machine holds 3 PB and grows by 100 TB/month (3/2009)
- Facebook holds 2.5 PB of user data and adds 15 TB/day (4/2009)
- eBay holds 6.5 PB of user data and adds 50 TB/day (5/2009)
- CERN's Large Hadron Collider (LHC) generates 15 PB a year
Characteristics of Big Data: 1-Scale (Volume)
- Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes (ZB) to 35 ZB
- Collected and generated data is increasing exponentially
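The growth figures on the volume slide can be checked with a few lines of arithmetic. The sketch below takes the slide's numbers (0.8 ZB in 2009, 35 ZB in 2020) and derives the growth factor and the compound annual growth rate they imply. The function name `annual_growth_rate` is illustrative, and steady year-over-year growth is an assumption.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by growing from start to end."""
    return (end / start) ** (1 / years) - 1


# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
START_ZB, END_ZB = 0.8, 35.0
YEARS = 2020 - 2009  # 11 yearly steps

growth_factor = END_ZB / START_ZB                    # close to the quoted 44x
rate = annual_growth_rate(START_ZB, END_ZB, YEARS)   # assumes steady growth

print(f"Growth factor: {growth_factor:.1f}x")
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, data volume grows by roughly 40% a year, which means it doubles about every two years.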
Characteristics of Big Data: 2-Complexity (Variety)
- Various formats, types, and structures
- Text, numerical data, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
- Static data vs. streaming data
- A single application can generate or collect many types of data
- To extract knowledge, all these types of data need to be linked together
Characteristics of Big Data: 3-Speed (Velocity)
- Data is generated fast and needs to be processed fast
- Online data analytics
- Late decisions mean missed opportunities
- Example (e-promotions): based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
- Example (healthcare monitoring): sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
Big Data: 3 Vs
Some Make It 4 Vs
Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (data warehousing)
- RTAP: Real-Time Analytics Processing (Big Data architecture and technology)
Who's Generating Big Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
- Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner
Why learn Big Data - The Model Has Changed
- The model of generating and consuming data has changed
- Old model: a few companies generate data; all others consume it
- New model: all of us generate data, and all of us consume data
Big Data Types
What's driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time in nature
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Who is it for - Value of Big Data Analytics
- Big Data is more real-time in nature than traditional data warehouse (DW) applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well suited to Big Data applications
- Shared-nothing, massively parallel processing, scale-out architectures are well suited to Big Data applications
Big Data Market
Challenges in Handling Big Data
- The bottleneck is in technology: new architectures, algorithms, and techniques are needed
- It is also in technical skills: experts are needed who can use the new technology and deal with Big Data
How does Big Data work?
- What technologies do we have for Big Data?
Big Data Landscape
Big Data Technology
How to get started
- Learn the platform: how it is designed and works, and how Big Data is managed in a scalable, efficient way
- Learn to write Hadoop jobs in different languages: programming languages (Java, C, Python) and high-level languages (Apache Pig, Hive)
- Learn advanced analytics tools on top of Hadoop: RHadoop (statistical tools for managing Big Data) and Mahout (data mining and machine learning tools for Big Data)
- Learn state-of-the-art technology from recent research papers: optimizations, indexing techniques, and other extensions to Hadoop
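As an illustration of writing a Hadoop job in Python, the sketch below follows the Hadoop Streaming convention, in which the mapper and reducer are plain programs that read lines and emit tab-separated key/value pairs, and the framework sorts the mapper output by key before the reducer sees it. The function names `mapper` and `reducer` and the word-count task are illustrative choices, and the `sorted` call only simulates locally what the cluster's sort phase would do.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    # Local stand-in for the sort step between map and reduce.
    for result in reducer(sorted(mapper(sample))):
        print(result)
```

On a cluster, the two functions would typically live in separate scripts that read from standard input and write to standard output; how those scripts are submitted depends on the Hadoop installation in use.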
Some popular vendors
When to learn Big Data
- IT companies have a proven historical trend of recruiting engineering graduates
- Those studying ECE, EEE, Civil, Mechanical, and other branches are mostly unable to apply their core skills or work on what they learnt in their degree
- Before graduating is the right time to learn Big Data technologies
Learning Big Data continued
- If you start now, you can master it within the next two years
- Big Data involves not just several tools but numerous technologies, methodologies, and mathematical and statistical concepts
- These need to be thought through, developed, and applied appropriately to reach a given goal
- Algorithms and computing languages are required to turn Big Data into applied intelligence in practice
Why learn Big Data now?
- It is a culmination of several technologies
- The sooner, the better
- A flexible mind that starts learning Hadoop and related technologies now can be well positioned for the right job after a few years
- Comparable to 3-year IT diploma courses
Resources & Books
- No specific syllabus: Big Data is a relatively new topic, still evolving and being standardized
- Where to learn: Big Data University, the Cloudera CDH VM, and many more vendors
- Related books: Hadoop: The Definitive Guide, and several others
Resources on the net
- Vast information on Big Data is available on the internet
- Tutorials, YouTube videos, articles, and vendor white papers
- Most of it is open source and available to everyone
- What one needs is time, interest, energy, and a bit of foresight to work on ambitious projects
Learning curve
- Some Big Data courses available in the market: Cloudera Certified Administrator for Apache Hadoop, Cloudera Certified Developer for Apache Hadoop
- There are several perceptions and perspectives
- Several tools exist for Big Data technology
- Students need the right direction to get started
- Who learns what is more important
- Clear goals and learning curves for administrators and developers
- A combination of the above is the right mix for young minds
Starting with Big Data
- A virtual machine environment is best suited to start
- Any supported or popular Linux distribution; preferred: RHEL, SUSE, CentOS, Ubuntu, or Fedora
- Hadoop platform: single-node first, then clustered with high availability
- Cloudera QuickStart VM (CDH 4.4)
- Cloudera is one of the pioneers in Big Data technologies
- CDH, Cloudera's distribution of Hadoop, is available as a VM, downloadable from the Cloudera website
- Other needed software packages
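Introduction to Hadoop
- "High Availability Distributed Object Oriented Platform" is a popular expansion of the name, but Hadoop is not an acronym; it is named after a toy elephant that belonged to the son of co-creator Doug Cutting
- Developed by the Apache Software Foundation (http://apache.org)
- Google started in the 1990s; the 2000s brought data management complexities
- In 2004, Google published a white paper on MapReduce, a framework that provides a parallel processing model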
Hadoop contd.
Google's technologies, namely:
1. GFS (Google File System): a distributed file system
2. MapReduce: a framework for parallel processing
3. Bigtable: a data storage system
These were re-implemented as open-source projects at the Apache Software Foundation, based on the designs Google published, and are called:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. Apache HBase
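To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases: map, shuffle (group by key), and reduce. It is not Hadoop's API; `run_mapreduce`, the sample records, and the per-user byte totals are hypothetical, chosen only to show how the phases fit together.

```python
from collections import defaultdict
from typing import Callable, Iterable


def run_mapreduce(records: Iterable, map_fn: Callable, reduce_fn: Callable) -> dict:
    """Run map, shuffle, and reduce in one process (illustrative only)."""
    # Map: each record yields zero or more (key, value) pairs.
    # Shuffle: the emitted values are grouped by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce: each key's values are combined into a single result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def bytes_by_user(record):
    """Map step: emit (user, size) for one hypothetical transfer record."""
    user, size = record
    yield user, size


def total(key, values):
    """Reduce step: sum all values seen for a key."""
    return sum(values)


# Hypothetical input: (user, bytes transferred) pairs.
transfers = [("alice", 120), ("bob", 300), ("alice", 80)]
print(run_mapreduce(transfers, bytes_by_user, total))
```

In a real cluster the map and reduce calls run in parallel on many nodes and the grouping happens over the network; the part a developer usually writes corresponds to the two small functions passed in here.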
Real-world scenarios
- Imagine your boss comes to you and says: "Here are 50 GB of log files. Find a way to improve our business!"
- What would you do?
- Where would you start?
- And what would you do next?
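One possible first step with the log files is a streaming pass that summarizes them without loading 50 GB into memory. The sketch below assumes a web server access log in a Common Log Format-like layout; the regular expression, the field names, and the sample lines are assumptions, since the slide does not say what the logs contain.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def summarize(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count status codes and requested paths, one line at a time."""
    statuses, paths = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        statuses[match["status"]] += 1
        paths[match["path"]] += 1
    return statuses, paths


# Made-up sample lines standing in for a real log file.
sample_lines = [
    '10.0.0.1 - - [01/Jan/2014:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2014:10:00:01 +0000] "GET /cart HTTP/1.1" 500 128',
    '10.0.0.1 - - [01/Jan/2014:10:00:02 +0000] "GET /home HTTP/1.1" 200 512',
    'not a log line',
]
statuses, paths = summarize(sample_lines)
print(statuses.most_common())
print(paths.most_common(1))
```

With a real file, passing the open file object to `summarize` keeps memory use flat because lines are read lazily. Frequent error codes or unusually popular paths are then candidates for a closer look, and the same counting logic maps naturally onto a MapReduce job once one machine is no longer enough.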
Cloud and Big Data
- Most traditional IT skills are moving towards the Cloud and Big Data
- Some related fields: artificial intelligence; distributed computing and supercomputing; business analytics and business intelligence; data analytics and data mining
Companies using Hadoop
Hadoop - Business problem types
How does MapReduce help
Hadoop and MapReduce Architecture
A Sample Hadoop Cluster Configuration
RDBMS vs. Hadoop
History of Databases
Object Databases
Relational Dominance
Bigtable, Dynamo and HBase
NoSQL = Not Only SQL
Database ecosystem
Past, Present and Future of IT
- Information technology, or IT
- The term IT first appeared in 1958
- IT as a catalyst for other areas of science and technology
- A movement from an IT-driven industry to an open information society
- Today we are part of a global village via the internet, which is now a commodity and a common consumer service
- Fast innovations and inventions will continue
Big data development
- Big Data environment: use of virtual machines; Java runtime environment; Hadoop and related software, installed on a single node or clustered; running Cloudera CDH, IBM BigInsights, etc.
- Prerequisites for learning Big Data: working knowledge of computers; basic knowledge of Linux, C, Java, etc.; awareness of virtualization and cloud trends
Thank You