Big Data: Big Data/Data Analytics & Software Development
Danairat T., danairat@gmail.com, 081-559-1446
Agenda
- Big Data Overview
- Business Cases and Benefits
- Hadoop Technology Architecture
- Big Data Development Process
- Summary
Big Data Overview
What is Big Data?
Big data analytics is concerned with the analysis of large volumes of transaction/event data and behavioral analysis of human/human and human/system interactions. (Gartner)
Big data represents the collection of technologies that handle large data volumes well beyond that inflection point and for which, at least in theory, hardware resources required to manage data by volume track to an almost straight line rather than a curve. (IDC)
Structured .. Non Structured
Level: Example
- Structured: Relational database
- Semi-structured: XML data files
- Quasi-structured: Text documents
- Unstructured: Images and video
A new class of problems has emerged which demands an ability to accept and manage data without advance knowledge of its structure or format.
Unstructured Data Growth Trends
Big Data, An Integrated Architecture
Stages: Capture, Store/Process, Integrate, Organize, Analyze, Govern
Components shown (figure): Structured (Master & Ref Data, Transaction Data); Unstructured and Semi-structured (Machine Generated, Social Media, Text, Image, Video, Audio); DBMS (OLTP); Hadoop Cluster; MapReduce; Key-Value Data Store; DB Replication; ETL/ELT; ChangeDC; Real-Time; Message-Based; ODS; Data Warehouse; Data Marts; Streaming (CEP Engine); Reporting & Dashboards; Alerting; EPM; BI Applications; Big Data Text Analytics and Search; In-Database Analytics; Advanced Analytics; Visual Discovery; Management, Security, Governance
Source: oracle.com
What is Big Data?
Sources shown (figure): Social, Blog, Smart Meter
Volume, Velocity, Variety, Value
Source: oracle.com
Business Cases and Benefits
Hadoop Use Cases
- CRM, Customer Analysis, Social Marketing
- Telco: Network Analysis, Quality of Service
- Public Service: Crime Analysis, Flooding Alert, City Planning
- Healthcare: Patient Safety, EMR, Next Best Action (NBA)
- Retail: Everyday Low Price, Offer better quality products, Next Best Action (NBA)
- Finance: Risk Management, Loan Origination, Credit Line, Wealth Management
- HCM, Talent Management, Social Analytics, etc.
What does a Big Data World look like? Utilities
What they collect:
- Smart Metering: monitors power usage
How they use it:
- Better demand planning
- Better targeted marketing
- Better targeted products based on individuals' power needs
Big Data means:
- The ability to predict demand at household level
- Reduce exposure to spot market
Personal Health Improvement Cloud (figure)
1. Blood Pressure, Sleep, Heart rate, Coach Tracking (Integrated Medical Devices)
2. Government creates/revises National Health Program
3. Public/Private Hospital executes Health Program integrated EHR/EMR Systems
4. Health Tracking, Health check up records
Also shown: Government Officer, Doctor, Nurse, Officer; Big Data; Suggested Health Improvement (Secured Personal Access)
Public Healthcare Management of Outbreak Through Early Detection of Clusters
Hadoop Technology
What is Hadoop?
A scalable fault-tolerant distributed system for data storage and processing.
Its scalability comes from the marriage of:
- HDFS: Self-Healing High-Bandwidth Clustered Storage
- MapReduce: Fault-Tolerant Distributed Processing
Operates on structured and complex data.
A large and active ecosystem (many developers and additions like HBase, Hive, Pig).
Open source under the Apache License.
http://wiki.apache.org/hadoop/
Source: apache.org/hadoop/
Hadoop History
- 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
- 2003-2004: Google publishes GFS and MapReduce papers
- 2004: Cutting adds DFS & MapReduce support to Nutch
- 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
- 2007: NY Times converts 4 TB of archives over 100 EC2s
- 2008: Web-scale deployments at Y!, Facebook, Last.fm
- April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
- May 2009: Yahoo does fastest sort of a TB, 62 secs over 1,460 nodes; Yahoo sorts a PB in 16.25 hours over 3,658 nodes
- June 2009, Oct 2009: Hadoop Summit, Hadoop World
- September 2009: Doug Cutting joins Cloudera
Basic Architecture Client
Basic Architecture HDFS Client Name Node
Basic Architecture HDFS Client Name Node Data Node Data Node Data Node
Basic Architecture Client HDFS Map Reduce Name Node Job Tracker Data Node Task Tracker Data Node Task Tracker Data Node Task Tracker
Basic Architecture (scaled out): Client; HDFS (Name Node) and Map Reduce (Job Tracker); nine worker nodes shown, each pairing a Data Node with a Task Tracker
HDFS: Hadoop Distributed File System
Block Size = 64 MB
Replication Factor = 3
Cost/GB is a few cents/month vs. dollars/month
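To make the block size and replication figures concrete, the sketch below works out how many blocks a file occupies and how much raw disk it consumes across the cluster. Only the 64 MB block size and the replication factor of 3 come from the slide; the helper name `hdfs_footprint` and the 1 GB example file are assumptions for illustration and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64       # block size stated on the slide
REPLICATION_FACTOR = 3   # replication factor stated on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    if file_size_mb < 0:
        raise ValueError("file size must be non-negative")
    # A file is cut into fixed-size blocks; the last one may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different Data Nodes.
    block_replicas = blocks * replication
    # A partly filled block only uses the space it needs, so raw usage
    # is the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


# Assumed example: a 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
```

With these defaults, tripling the stored bytes is the price paid for surviving the loss of individual Data Nodes.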
MapReduce: Distributed Processing
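The map, shuffle, and reduce steps can be sketched in a single process as a word count. This is a plain Python illustration of the programming model, not Hadoop code: the function names and sample lines are assumptions, and a real job would run the mappers and reducers on separate Task Trackers.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Assumed sample input for illustration.
sample = ["big data big ideas", "data beats ideas"]
print(word_count(sample))
```

Splitting on whitespace is an assumption that suits space-delimited text; languages written without spaces between words would need a word segmenter in the mapper.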
Hadoop MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System)
Hadoop Ecosystem: Hive, Sqoop, ZooKeeper, MapReduce (Job Scheduling/Execution System), HBase, HDFS (Hadoop Distributed File System), Flume
Use The Right Tool For The Right Job
Relational Databases: when to use?
- Interactive Reporting (<1 sec)
- Multistep Transactions
- Lots of Inserts/Updates/Deletes
Hadoop: when to use?
- Affordable Storage/Compute
- Structured or Not (Agility)
- Resilient Auto Scalability
Example Thai Content Analytics Thai WordCount Input Content
Example Thai Content Analytics Thai WordCount The results
Example - Hive
Hive enables Hadoop to support SQL for non-Java developers.

    hive (default)> select * from test_tbl;
    OK
    1       USA
    62      Indonesia
    63      Philippines
    65      Singapore
    66      Thailand
    Time taken: 0.287 seconds
    hive (default)>

Note: Supports only text content with a column separator.
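As a rough picture of what "text content with a column separator" means, the sketch below parses delimited text lines into rows, much as the query above reads them back from a plain text file. The slide shows only the query output, so the tab separator, the column names `code` and `country`, and the helper `read_rows` are assumptions for illustration.

```python
import csv
import io

# Assumed raw file contents; the tab separator is a guess for illustration.
RAW_TEXT = "1\tUSA\n62\tIndonesia\n63\tPhilippines\n65\tSingapore\n66\tThailand\n"


def read_rows(text, delimiter="\t"):
    """Split each line on the column separator and return (code, country) rows.

    The column names are hypothetical; the slide does not name them.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(int(code), country) for code, country in reader]


for code, country in read_rows(RAW_TEXT):
    print(code, country)
```

The structure is applied when the text is read, not when it is stored, which is why the separator has to be declared up front.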
Big Data Development Process
Big Data Development Process Guideline
Architecture Planning
- Targeted Users
- Target Opportunities
- Data Scientist
- Data Source/Type
- Data Capturing Approach
- Data Processing and Visualize Planning
- Technology Architecture
- Big Data EcoSystem (Hadoop Ecosystem)
- Sizing
- Integration
- Security
- Administration and Operation Planning
Big Data Development
- Develop Use Cases
- Set up Big Data Pseudo-distribution Mode
- Set up HDFS
- Develop Data Capturing System
- Develop Data Analytic: Map Reduce, Hive, R, etc.
- Integrate result to Enterprise Analytic System
Operation and Support
- Set up Big Data Cluster Mode
- Monitor HDFS utilization and capacity planning
- Monitor Job Tracker availability
- Monitor Data Capturing System
- Upgrade or Patch Big Data Hadoop ecosystem
- System admin. Training
- Helpdesk Training
- End-User Training (Analytic Results)
System Evaluation
- Adoption Rates for each analytics results
- No. of Missing Analytic Results
- No. of Missing Data
- Lost hours per month
- Avg. of each Analytic Result Response Time
- No. of Technology System Failure per month
Summary
Big Data, An Integrated Architecture (figure repeated from the overview)
Stages: Capture, Store/Process, Integrate, Organize, Analyze, Govern
Source: oracle.com
Thank you very much