Big Data and Data Science Grows Up
Ron Bodkin
Founder & CEO, Think Big Analytics
ron.bodkin@thinkbiganalytics.com
[Chart omitted. Source: IDC]
Hadoop
- Open source distributed cluster software
  - Distributed file system
  - Java-based MapReduce
  - Resource manager
- Started in the Nutch project (open source crawler)
- Inspired by Google's MapReduce and GFS
Hadoop Components
[Architecture diagram; italics in the original mark processes, e.g. MR jobs]
- Primary master server: Job Tracker, Name Node
- Secondary master server: Secondary Name Node
- Slave servers: a Task Tracker and Data Node on each node, with local disks
- Client servers: Hive, Pig, ..., cron+bash, Azkaban, Sqoop, Scribe, monitoring, management
- Ingest and outgest services connecting SQL stores and logs to the cluster
Data Processing Models
MapReduce 101
Hadoop uses MapReduce: there is a Map phase and there is a Reduce phase, with a sort/shuffle in between.
[Diagram: word-count data flow from input through mappers, sort/shuffle, and reducers to output]
- Input documents, e.g. (doc1, "Hadoop uses MapReduce"), (doc2, "There is a Map phase"), (doc3, "There is a Reduce phase")
- Mappers emit (word, 1) pairs: (hadoop, 1), (mapreduce, 1), (uses, 1), (there, 1), (is, 1), (a, 1), ...
- Sort/shuffle groups pairs by key and partitions them across reducers (e.g. keys 0-9 and a-l, m-q, r-z)
- Reducers sum each key's list: (a, [1,1]) -> a 2; (hadoop, [1]) -> hadoop 1; (phase, [1,1]) -> phase 2; (there, [1,1]) -> there 2; ...
Word Count: Mapper
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit (word, 1) for each token
    }
  }
}
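Word Count: Reducer
The wiring on the next slide also references an IntSumReducer, which is not shown on the slides; a minimal sketch of that class, following the standard Hadoop word-count example, looks like this:

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {  // sum the 1s emitted by the mapper (pre-summed by the combiner)
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);  // emit (word, total count)
  }
}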
MapReduce Wiring
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hive Overview
- A SQL-based tool for data warehousing on Hadoop clusters
- Lowers the barrier to Hadoop adoption for existing SQL apps and users
- Translates SQL to MapReduce
- Provides an optimizer
- Extensible data types & UDFs
- The first popular metadata service for Hadoop
Word Count in Hive
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC, word;
Pig Overview
- Pig Latin is a higher-level map/reduce language
- A simple data flow language designed for productivity; not Turing complete (yet!)
- Built-in support for joins, filters, etc.
- Provides an optimizer that translates into Hadoop MapReduce job steps
- Allows user-defined functions
- With HCatalog, will share metadata with Hive
Sample Pig Script
lines  = LOAD 'docs/*' USING TextLoader();
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE($0));
groups = GROUP words BY $0;
counts = FOREACH groups GENERATE $0, COUNT($1);
sorted = ORDER counts BY $1 DESC, $0;
STORE sorted INTO 'output/wc' USING PigStorage('\t');
MapReduce Frameworks
- Cascading: Java-based optimizer & relational operators
- Crunch: abstract collections and an optimizer
- Streaming, Pipes: non-Java integration (Perl, Python, Ruby, C/C++, ...)
- Tap: simplifies time series processing and the use of diverse tools and data formats
Tap Mapper
public static class WordCountMapper extends TapMapper {
  @Override
  public void map(String in, Pipe<CountRec> out) {
    StringTokenizer tokenizer = new StringTokenizer(in);
    while (tokenizer.hasMoreTokens()) {
      out.put(CountRec.newBuilder()
          .setWord(tokenizer.nextToken())
          .setCount(1)
          .build());
    }
  }
}
Tap Wiring
public static void main(String[] args) throws Exception {
  CommandOptions o = new CommandOptions(args);
  Tap tap = new Tap(o);
  tap.newPhase()
     .reads(o.input)
     .map(WordCountMapper.class)
     .combine(WordCountReducer.class)
     .groupBy("word")
     .writes(o.output)
     .reduce(WordCountReducer.class);
  tap.make();
}
Integration
Reference Architecture
- Additive data processing power for flexibility
- A Big Data strategy integrated with HBase, relational, and existing BI and data warehouse technology
- Provides the capability to create a data science discipline using full data sets
- Analysis capability over all internal data, with the ability to add external data at will
HBase
- Tables for Hadoop, inspired by Google's BigTable
- Supports both batch and random access
  - Ad hoc lookup
  - Website serving queries
- High consistency
- Maturing rapidly (e.g., reducing latency variance)
- Still a performance tax vs. DFS
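To make the random-access model concrete, here is a minimal sketch using the HBase Java client API of that era; the table name "metrics" and column family "d" are hypothetical, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");  // hypothetical table

    // Random write: one cell in column family "d"
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(1L));
    table.put(put);

    // Ad hoc lookup by row key
    Result result = table.get(new Get(Bytes.toBytes("row-42")));
    long count = Bytes.toLong(
        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
    System.out.println("count = " + count);
    table.close();
  }
}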
Unstructured Data Ingestion
- Batch log shipping: no distributed management and monitoring
- Syslog forwarding: no distributed management and monitoring
- Apache Kafka
  - Written in Scala (JVM-based)
  - Distributed message routing
  - Distributed monitoring and management (agents)
- Apache Flume
  - Written in Java
  - Pluggable sources, adapters, sinks
  - Distributed monitoring and management (agents)
- Other messaging/streaming frameworks: Scribe, Chukwa, Honu
Hadoop High Availability
- Name Node & Job Tracker traditionally a SPOF
- Durability excellent (3x replicas, careful ops)
- MapR offers an HA filesystem and auto-recovery from Job Tracker crashes
- Hadoop 0.23 promises HA Name Node & Job Tracker
Disaster Recovery
- Active/Active option for strict SLAs: parallel ingest of data
- DR: not all or nothing
- Backup options
- Recovery: parallel processing
- DistCp (e.g., hadoop distcp hdfs://nn1/path hdfs://nn2/path)
- HBase: snapshots and replication
Version Compatibility
- People upgrade Hadoop by installing a new cluster
- Traditionally: stop and upgrade everything
- Rolling upgrades in MapR, and planned for future Hadoop releases
- Protocols not backward compatible (until 0.24+)
- Only one version of the processing API (until 0.23)
Hosting Options
- Cloud-based: Amazon EMR and S3 notable; popular with startups & POCs
- Private data center or colocation
  - Commodity hardware w/ DAS: has been the main approach for enterprise IT
  - Appliance
  - Commodity hardware w/ SAS or SAN
Common Workloads
- Batch processing: ETL, model training, model scoring
- Fast analytics
- Search
- Lookup
Sweet-Spot Machine Configuration
- 8-12 cores
- 8-12 JBOD 2 TB spindles
- 32-64 GB of RAM
- 10 GigE
- Changing quickly (more density, disk drive shortage)
Common Uses
[Overview diagram of common uses]
- IT log & security forensics & analytics
- Automated device data analytics: find new signal, predict events, react in real time, failure analysis, proactive fixes, product planning
- 100% capture, data governance, shared services
- Customer analytics: cross-sell/upsell, monetize data, advertising analytics, attribution, customer value, segmentation, insights, optimization, social media
- Big Data Warehouse analytics (Hadoop + MPP + EDW): cost reduction, ad hoc insight, flexibility, predictive analytics
Automated Device Support Case Study
- The Enterprise Data Warehouse has been a foundation of analysis for 20 years; it shines for structured analysis, reporting, etc.
- Enterprises are dealing with new needs: new data types, at new scale, and the need to build analytic models
- The Big Data Warehouse integrates Hadoop with an Enterprise Data Warehouse
Why?
Challenges:
- Cost to store unstructured data
- Poor response time to changing BI needs
- Data warehouse access for departments
Goals:
- Integrate unstructured data with the data warehouse
- Predictive analytics based on data science
- Comprehensive cluster access for all users
Hadoop's Role
- Support semi-structured and unstructured data
- Large-scale storage
  - Transaction-level detail (e.g., clickstreams)
  - Archival
  - Integrated data: multiple warehouses, new data sources, ...
- Powerful processing capacity
  - Perform large-scale analyses/studies
  - Drill to detail in large fact tables
  - Query without structure: agility to analyze data without preprocessing
  - Transformation to build dimensional models, aggregates, and summaries
  - Build predictive models
Data Agility
Classic Warehouse:
- ETL: pre-parse all data
- Normalize up front
- Feed data marts
- New ideas need IT projects
Big Data Warehouse:
- Store raw data
- Parse only when proven; approximate parse on demand
- Capacity for analysis on demand
- Prove ideas before projects, to optimize
Common Data Sets
- User activity logs: website, ad server, mobile
- Social data: graph, activity, profiles
- Sensor data: hardware & software phone-home, IT logs, cell phones, smart grid, energy
- Databases: joined with less structured data, handling evolving schemas
- Time series, text, scientific data
Data Value Chain
- Integrating data multiplies its value
- Data Provider -> Data Integrator -> Data Consumer (internal products)
- Data marketplaces: Amazon, Microsoft, Infochimps, BuzzData, Quantbench
Data Science
What is Data Science?
- aka machine learning, data mining
- Exploratory modeling: identifying trends and exceptions, detecting signal
- Confirmatory modeling: building models to capture signal, proving at scale
- Working with data bottom-up
- Norvig: more data beats better algorithms
People: A New Role Exists, the Data Scientist
- One part scientist/statistician
- Two parts sleuth/artist
- One part programmer
- One part entrepreneur
- Focused on data, not models
Techniques
Supervised:
- Decision trees, random forests
- Logistic regression
- Back-propagating neural networks
- Support vector machines
Unsupervised (probabilistic and clustering models):
- Principal Component Analysis
- K-means clustering (see the sketch below)
- Singular Value Decomposition
- Bayesian networks
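To make one of the unsupervised techniques concrete, here is a tiny self-contained k-means sketch in plain Java (k = 2, one dimension); the toy data and initial centers are illustrative, not from the slides:

import java.util.Arrays;

public class KMeansSketch {
  public static void main(String[] args) {
    double[] points = { 1.0, 1.2, 0.8, 8.0, 8.3, 7.9 };  // toy 1-D data
    double[] centers = { 0.0, 10.0 };                    // initial guesses
    int[] assign = new int[points.length];

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: each point goes to its nearest center
      for (int i = 0; i < points.length; i++) {
        assign[i] = Math.abs(points[i] - centers[0])
                 <= Math.abs(points[i] - centers[1]) ? 0 : 1;
      }
      // Update step: each center moves to the mean of its assigned points
      for (int c = 0; c < 2; c++) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < points.length; i++) {
          if (assign[i] == c) { sum += points[i]; n++; }
        }
        if (n > 0) centers[c] = sum / n;
      }
    }
    System.out.println("centers = " + Arrays.toString(centers));  // converges to ~[1.0, 8.07]
  }
}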
Technologies
Open source libraries:
- Mahout (http://mahout.apache.org/)
- Mallet (http://mallet.cs.umass.edu/)
- Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- OpenNLP (http://incubator.apache.org/opennlp/)
- RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
Tools:
- Karmasphere Analyst: visualization
- Greenplum Chorus: collaboration and annotation
Process Best Practices
- Data science with a Big Data Warehouse: structured + unstructured data
- Make 100s of mistakes with little cost
- Find promising small-signal detection; quickly move to production for testing
- Establish signal detection capabilities
- Discipline to learn and experiment
- Innovate with new data sources
- Retain lessons learned & "voodoo" IP
- Monetize your data science investments!
Real-Time Big Data
Hadoop Processing
- Today: batch processing
  - Time to spin up JVM instances
  - HDFS optimized for disk scans
  - Designed for reliability: tolerate failures
- Not yet suitable for real-time event processing: Storm, Kafka, message queues, ...
- Futures: shared storage and cluster management, with multiple processing models
Edge Serving Needs
- Scale-out
- Simple operations
- Fast parallel export (for profiles, scores)
- DSS analytics feed (fast parallel import)
- Fast analytics (operational reporting)
Edge Serving Options
- Application-clustered SQL: MySQL, Oracle, etc.
- NoSQL clusters: MongoDB, Couchbase, Cassandra, HBase, Citrusleaf
- Scalable RDBMS for OLTP: Oracle RAC?, MySQL Cluster, VoltDB, Clustrix
NoSQL Database Types
- Key-value stores: distributed hash table, single index
- Tabular/columnar stores: BigTable-style column families, scans, MapReduce
- Document stores: JSON-style self-contained structures, secondary indexes
- Object stores: document store + foreign keys
- Graph stores: links among nodes and traversal options
HBase Architecture
- Uses HDFS to handle replication: gives us replication and resiliency
- Consists of a Master node and Region nodes
- 3-level hierarchy to reach the node where data is stored: Client -> Master -> Region (metadata) -> Region (data table)
- The data table location is cached in the client after lookup, to speed access
Cassandra Architecture
- Uses an Amazon Dynamo-style model: distributed hash table
- All nodes are homogeneous (no master, no SPOF)
- Nodes are organized in a ring; a client can connect to any node, since the nodes communicate over the ring
MongoDB
- Simpler document database model: access by keys, simple filter queries, no joins
- Secondary indexes, including geo indexing (see the sketch below)
- Eventual consistency model (allows CAP tradeoffs)
- Updates normally replicated to a slave; defers disk writes for a major performance boost
- Focus on a simple API
- Mongo-Hadoop integration actively developed
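As a sketch of that simple API, here is what key access and a secondary (geo) index look like with the MongoDB Java driver of that era; the database, collection, and field names are hypothetical:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class MongoSketch {
  public static void main(String[] args) throws Exception {
    Mongo mongo = new Mongo("localhost", 27017);
    DB db = mongo.getDB("edge");                        // hypothetical database
    DBCollection places = db.getCollection("places");   // hypothetical collection

    // Insert a self-contained JSON-style document (no joins)
    places.insert(new BasicDBObject("name", "store-1")
        .append("loc", new double[] { -122.4, 37.8 }));

    // Secondary geo index, then a simple $near filter query
    places.ensureIndex(new BasicDBObject("loc", "2d"));
    DBObject query = new BasicDBObject("loc",
        new BasicDBObject("$near", new double[] { -122.4, 37.8 }));
    for (DBObject doc : places.find(query)) {
      System.out.println(doc);
    }
    mongo.close();
  }
}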
Streaming Big Data
- Responding to incoming events at scale
- Requires keeping state
- Simple cases: application logic with a scale-out database
- SQL-style: SQLstream, InfoSphere Streams
- MapReduce-style emerging: Kafka, S4, Storm, FlumeBase (see the Storm sketch below)
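As an illustration of the MapReduce-style approach, here is a minimal Storm topology wiring sketch (Storm 0.x, backtype.storm API); EventSpout and WordCountBolt are assumed user-defined spout/bolt classes, not shown:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class StreamingSketch {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // EventSpout (hypothetical) emits tuples with a "word" field
    builder.setSpout("events", new EventSpout(), 4);
    // WordCountBolt (hypothetical) keeps running counts: the "state" noted above;
    // fieldsGrouping routes equal words to the same bolt task, like a shuffle
    builder.setBolt("count", new WordCountBolt(), 8)
           .fieldsGrouping("events", new Fields("word"));

    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();  // in-process cluster for testing
    cluster.submitTopology("word-count", conf, builder.createTopology());
  }
}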
Futures
Computing Trends
- The growth of storage density has well outpaced the growth of data transfer rates
[Chart: storage density vs. transfer rate, 1985-2010]
Computing Trends, cont'd.
- In 1990, you could read all the data from a typical drive in about 5 minutes
- Today, it would take over 2 hours (see the arithmetic below)
- And seek times have improved even more slowly than data transfer rates (SSDs improve this)
- Network speeds in the data center have improved at a comparable rate (60%/yr)
- So clusters of commodity servers allow throughput, and clusters of servers allow RAM density
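A back-of-the-envelope check of those read times; the drive specs (a 1990 drive of roughly 1,370 MB at 4.4 MB/s, and a 2012 drive of 1 TB at about 100 MB/s) are illustrative assumptions, not from the slides:

public class DriveReadTime {
  public static void main(String[] args) {
    // Assumed (illustrative) drive specs
    double mb1990 = 1370, rate1990 = 4.4;  // capacity in MB, transfer in MB/s
    double mb2012 = 1e6,  rate2012 = 100;  // 1 TB in MB, MB/s

    System.out.printf("1990: %.1f minutes%n", mb1990 / rate1990 / 60);   // ~5.2 minutes
    System.out.printf("2012: %.1f hours%n",   mb2012 / rate2012 / 3600); // ~2.8 hours
  }
}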
Commodity Hardware in 2017?
- 512 GB of RAM
- 64 cores
- 15 TB spinning disks
- 1 TB SSDs for caching
- 100 Gigabit networking (InfiniBand?)
Trends in Big Data for 2012
- Hadoop 0.23 (2.0?)
- Explosion in new application models
- HBase prominence
- Data science practices, tools, technologies
- Integration
ron.bodkin@thinkbiganalytics.com