BIG DATA TECHNOLOGY Hadoop Ecosystem
Agenda
  Background
  What is Big Data
  Solution Objective
  Introduction to Hadoop
  Hadoop Ecosystem
  Hybrid EDW Model
  Predictive Analysis using Hadoop
  Conclusion
What is Big Data?
  "DATA EVERYWHERE BUT NOWHERE" - Dr. Michio Kaku
Data Everywhere
Data growth
  An estimated 90% of the world's data has been created over the past two years.
  Data is doubling every two years, and global annual data creation is set to leap from 1.2 zettabytes in 2012 to 35 zettabytes in 2020.
  Every day, we create 2.5 quintillion bytes of data.
  Unstructured information is growing at 15 times the rate of structured information.
  Operational data is extremely small compared with the other data sources around it.
Definition of Big Data
  Volume
    Processing many terabytes to petabytes of data.
    Data arrives in large bursts.
    Sift through the noise to identify the right data to improve business insight.
  Velocity
    Analyze more data in less time to facilitate faster and more responsive business decision making.
    New data acquisition and very rapid creation of data.
    Batch, near-time, and real-time data feeds.
  Variety
    Data is in many formats, including unstructured, semi-structured, complex documents, and rich media.
    Data format is constantly changing.
    Changing data context.
Everyday Examples of Big Data
  Category     | Data                               | Big Data
  Descriptive  | Age, gender, income, demographics  | Attitudes, psychographics
  Social       | User defined                       | Influence, peers
  Location     | Home address                       | Real time
  Interaction  | Who is available next              | Who is best to serve the personality of the consumer
  Relationship | Transactional                      | Patterns, experience, internal and external data
Big Data Solution Objectives
  Enable scalable, accurate, and powerful analysis.
  Process data quickly and cost-effectively.
  Connect high-volume, volatile data so that organizations can make effective business decisions.
  Plan future success by using insights from big data to increase the value of predictive analytics.
  Data mining can be done using a variety of techniques.
  Build the ability to experiment, discover, and rationalize.
Processing Challenge
  Storage capacities of hard drives have increased, but transfer rates have not kept up.
  Hardware failure.
  Most analysis tasks need to be able to combine the data in some way.
Why Hadoop
  The ability to read and write data in parallel to or from multiple disks.
  Enables applications to work with thousands of nodes and petabytes of data, with automatic failover.
  A reliable shared storage and analysis system.
  Open-source architecture for large-scale computation and data processing on a network of commodity hardware.
  Ability to work with a variety of data mining capabilities.
  Allows innovation and brings top talent to the core.
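The gap between disk capacity and transfer rate, and the benefit of reading from many disks in parallel, can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1 TB dataset, the 100 MB/s per-disk transfer rate, and the 100-disk cluster are assumed round numbers, not figures from the slides, and the function name is hypothetical.

```python
TB = 10**12  # bytes, decimal terabyte
MB = 10**6   # bytes, decimal megabyte


def read_time_seconds(dataset_bytes, rate_bytes_per_s, disks=1):
    """Time to scan a dataset that is spread evenly over `disks` drives,
    with all drives read in parallel at the same transfer rate."""
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return dataset_bytes / (rate_bytes_per_s * disks)


# Assumed, illustrative figures: 1 TB of data, 100 MB/s per disk.
single_disk = read_time_seconds(1 * TB, 100 * MB)            # one drive
hundred_disks = read_time_seconds(1 * TB, 100 * MB, disks=100)  # 100 drives

print(f"1 disk:    {single_disk / 3600:.2f} hours")
print(f"100 disks: {hundred_disks / 60:.2f} minutes")
```

Under these assumptions a full scan drops from a few hours to a couple of minutes, which is the motivation for spreading data across many disks and reading them in parallel.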
Hadoop Concept
  Distribute the data in its original form.
  Process the data where it is stored.
  Combine the results from different nodes.
  HDFS: a distributed file system, scalable to accommodate any size of data and tolerant of failures thanks to built-in replication and regeneration.
  MapReduce: processes multiple data sources into structured data (map) and performs optional aggregation on the results (reduce).
History of Hadoop
  Created by Doug Cutting
  2002  Apache Nutch, an open-source web search engine
  2003  Google publishes a paper describing the architecture of its distributed filesystem, GFS
  2004  Nutch Distributed Filesystem (NDFS)
  2004  Google publishes a paper on MapReduce
  2005  Nutch MapReduce implementation
  2006  Hadoop is created; Cutting joins Yahoo!
  2008  Yahoo! demonstrates Hadoop capabilities
  2008  Hadoop breaks the world record for fastest sort
  2013  Continued innovation and adoption
Hadoop Ecosystem
Hadoop Ecosystem - Hadoop Base Platform
Hadoop Ecosystem - HDFS
  Hadoop Distributed File System
  Files are split into 128 MB blocks.
  Blocks are replicated across several DataNodes (usually 3).
  A single NameNode stores metadata (file names, block locations, etc.).
  Optimized for large files.
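The block and replication figures above translate directly into a storage footprint. The following is a rough sketch of that arithmetic, not HDFS code: the function name and the example file sizes are hypothetical, and it assumes the final block of a file only occupies as much disk space as the data it actually holds.

```python
BLOCK_SIZE = 128 * 1024**2  # 128 MB block size from the slide, in bytes
REPLICATION = 3             # usual replication factor from the slide


def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_bytes) for one file.

    blocks         -- number of blocks the file is split into
    block_replicas -- block copies stored across the DataNodes
    raw_bytes      -- total disk space consumed by all replicas
    """
    if file_bytes < 0:
        raise ValueError("file_bytes must be non-negative")
    blocks = -(-file_bytes // block_size)  # integer ceiling division
    return blocks, blocks * replication, file_bytes * replication


# Hypothetical example: a 1 GiB file.
blocks, replicas, raw = hdfs_footprint(1024**3)
print(blocks, replicas, raw / 1024**3)
```

A file that is one byte larger than a multiple of the block size needs one extra block, which is one reason the NameNode's metadata grows with the number of files and blocks rather than with total bytes, and why HDFS favours a small number of large files.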
Hadoop Map Reduce Example
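A minimal sketch of the map, shuffle, and reduce steps, assuming a word-count task and plain in-memory Python rather than the Hadoop API. All function names here are hypothetical; in a real cluster the map and reduce functions would run on the nodes that hold the data, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big insight", "data everywhere"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many nodes without changing the result.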
Hadoop Ecosystem - HBase
  HBase is a distributed, column-oriented database built on top of HDFS.
  A highly available NoSQL database.
  Used as input and/or output with Hadoop.
  Used when you require real-time, random-access reads and writes on very large datasets.
Hadoop Ecosystem - Hive
  Developed at Facebook.
  Maintains a list of table schemas.
  SQL-like query language (HQL).
  Can call Hadoop Streaming scripts from HQL.
  Supports table partitioning, clustering, complex data types, and some optimizations.
  Translates SQL into MapReduce jobs.
Hadoop Ecosystem - Additional Components
  ZooKeeper: configuration storage and synchronization system for Hadoop.
  Pig: data flow language (Pig Latin) for creating MapReduce jobs.
  Sqoop: data import tool to bring structured data from an RDBMS into Hadoop (HDFS, Hive, or HBase).
  Avro: data serialization framework for persistent data and communication between Hadoop nodes.
  Apache Oozie: an open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo.
Traditional & Big Data Approaches
  Traditional Approach: Structured & Repeatable Analysis
    Business users determine what question to ask.
    IT structures the data to answer that question.
    Examples: monthly sales reports, profitability analysis, customer surveys.
  Big Data Approach: Iterative & Exploratory Analysis
    IT delivers a platform to enable creative discovery.
    Business explores what questions could be asked.
    Examples: brand sentiment, predictive analytics, maximum asset utilization.
Bring the Balance
Analytics Platform Differentiator Conventional + Big Data
Hybrid Enterprise Data Warehouse
  Data Sources
    Emerging and raw data streams
    Existing and operational data sources
  Cleansing, Modeling
    Cleansing tools, metadata, legal, compliance
  Hybrid Platform
  Presentation Tier
Big Data and Analytics
Analytics
  Descriptive analytics: using historical data to describe the business. This is usually associated with Business Intelligence or visibility systems. In the supply chain, you use descriptive analytics to understand historical demand patterns, how product flows through your supply chain, and when a shipment might be late.
  Predictive analytics: using data to predict trends and patterns. This is commonly associated with statistics. In the supply chain, you use predictive analytics to forecast future demand or the price of fuel.
  Prescriptive analytics: using data to suggest the optimal solution. This is commonly associated with optimization. In the supply chain, you use prescriptive analytics to set your inventory levels, schedule your plants, or route your trucks.
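The three kinds of analytics can be contrasted on a single toy demand series. This is only a sketch under stated assumptions: the monthly demand values, the candidate order quantities, and the per-unit holding and shortage costs are all invented for illustration, the forecast is a simple least-squares trend line, and the function names are hypothetical.

```python
def describe(history):
    """Descriptive: summarise what already happened (average demand)."""
    return sum(history) / len(history)


def predict_next(history):
    """Predictive: fit a least-squares trend line and extend it one period."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def prescribe(forecast, candidates, holding_cost=1.0, shortage_cost=5.0):
    """Prescriptive: pick the order quantity with the lowest expected cost."""
    def cost(quantity):
        surplus = max(quantity - forecast, 0)
        shortfall = max(forecast - quantity, 0)
        return holding_cost * surplus + shortage_cost * shortfall
    return min(candidates, key=cost)


demand = [100, 110, 120, 130]           # assumed monthly demand history
average = describe(demand)               # what happened
forecast = predict_next(demand)          # what is likely to happen next
order = prescribe(forecast, [120, 140, 160])  # what to do about it
print(average, forecast, order)
```

Each step feeds the next: the description summarises the history, the prediction extends it, and the prescription turns the prediction into a decision.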
Predictive Analytics and Crime Mitigation
  Look at past data
    Create patterns based on where crime events happened, situations, seasons, and socioeconomic factors.
  Sniff data from multiple sources
    Travel data
    Changes in socioeconomic data
    Crime hot spots
    Social media data: sensitive comments
    Financial data: USPA, AML
    Images and documents
  Why it is a Big Data problem
    Data comes from a variety of sources in different forms.
    It arrives at a very dynamic rate and in different volumes.
  Solution
    Identify and process data from the identified sources -> create a mathematical model -> process the model offline and in real time.
Business Intelligence & Predictive Analysis
  Transportation: identifying traffic patterns and predicting traffic conditions
  Big Data in health care
  Conservation of natural resources
  Waste management
  Operationally and economically efficient education system
  Resident services
  Global warming
  Revenue opportunities
  Tourism and tax patterns
Conclusion
  Identify the problem statement.
  Align Hadoop with business strategy.
  Hadoop is not Big Data, but part of the ecosystem.
  A Big Data solution is both art and science.
  Rationalizing the approach is extremely critical.
  Hadoop and the conventional EDW need to coexist.
  Attract top talent by using innovative, up-to-date technology.
  Operational, execution, and business efficiency should be the success criteria.
  Data privacy, legal, and access regulations have to be an integral part of design and metadata.
Appendix
Storage Capacity Terms
Big Data Example by Type
Variety: varying types of data (structured, semi-structured, unstructured), used in combination.
... Time= &customer= &product=...