Big Data and Data Science Grows Up
Ron Bodkin
Founder & CEO, Think Big Analytics
ron.bodkin@thinkbiganalytics.com
[Chart omitted. Source: IDC]
Hadoop
- Open source distributed cluster software
  - Distributed file system
  - Java-based MapReduce
  - Resource manager
- Started in the Nutch project (open source crawler)
- Inspired by Google's MapReduce and GFS
Hadoop Components
[Architecture diagram; italics in the original mark processes, e.g. MR jobs]
- Primary master server: Job Tracker, Name Node
- Secondary master server: Secondary Name Node
- Slave servers: a Task Tracker and Data Node on each node, with local disks
- Client servers: Hive, Pig, ..., cron+bash, Azkaban, Sqoop, Scribe, monitoring, management
- Ingest and outgest services connecting SQL stores and logs to the cluster
Data Processing Models
MapReduce 101
Hadoop uses MapReduce: there is a Map phase and there is a Reduce phase, with a sort/shuffle in between.
[Diagram: word-count data flow from input through mappers, sort/shuffle, and reducers to output]
- Input documents, e.g. (doc1, "Hadoop uses MapReduce"), (doc2, "There is a Map phase"), (doc3, "There is a Reduce phase")
- Mappers emit (word, 1) pairs: (hadoop, 1), (mapreduce, 1), (uses, 1), (there, 1), (is, 1), (a, 1), ...
- Sort/shuffle groups pairs by key and partitions them across reducers (e.g. keys 0-9 and a-l, m-q, r-z)
- Reducers sum each key's list: (a, [1,1]) -> a 2; (hadoop, [1]) -> hadoop 1; (phase, [1,1]) -> phase 2; (there, [1,1]) -> there 2; ...
Word Count: Mapper
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit (word, 1) for each token
    }
  }
}
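Word Count: Reducer
The wiring on the next slide also references an IntSumReducer, which is not shown on the slides; a minimal sketch of that class, following the standard Hadoop word-count example, looks like this:

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {  // sum the 1s emitted by the mapper (pre-summed by the combiner)
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);  // emit (word, total count)
  }
}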
MapReduce Wiring
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hive Overview
- A SQL-based tool for data warehousing on Hadoop clusters
- Lowers the barrier to Hadoop adoption for existing SQL apps and users
- Translates SQL to MapReduce
- Provides an optimizer
- Extensible data types & UDFs
- The first popular metadata service for Hadoop
Word Count in Hive
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC, word;
Pig Overview
- Pig Latin is a higher-level map/reduce language
- A simple data flow language designed for productivity; not Turing complete (yet!)
- Built-in support for joins, filters, etc.
- Provides an optimizer that translates into Hadoop MapReduce job steps
- Allows user-defined functions
- With HCatalog, will share metadata with Hive
Sample Pig Script
lines  = LOAD 'docs/*' USING TextLoader();
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE($0));
groups = GROUP words BY $0;
counts = FOREACH groups GENERATE $0, COUNT($1);
sorted = ORDER counts BY $1 DESC, $0;
STORE sorted INTO 'output/wc' USING PigStorage('\t');
MapReduce Frameworks
- Cascading: Java-based optimizer & relational operators
- Crunch: abstract collections and an optimizer
- Streaming, Pipes: non-Java integration (Perl, Python, Ruby, C/C++, ...)
- Tap: simplifies time series processing and the use of diverse tools and data formats
Tap Mapper
public static class WordCountMapper extends TapMapper {
  @Override
  public void map(String in, Pipe<CountRec> out) {
    StringTokenizer tokenizer = new StringTokenizer(in);
    while (tokenizer.hasMoreTokens()) {
      out.put(CountRec.newBuilder()
          .setWord(tokenizer.nextToken())
          .setCount(1)
          .build());
    }
  }
}
Tap Wiring
public static void main(String[] args) throws Exception {
  CommandOptions o = new CommandOptions(args);
  Tap tap = new Tap(o);
  tap.newPhase()
     .reads(o.input)
     .map(WordCountMapper.class)
     .combine(WordCountReducer.class)
     .groupBy("word")
     .writes(o.output)
     .reduce(WordCountReducer.class);
  tap.make();
}
Integration
Reference Architecture
- Additive data processing power for flexibility
- A Big Data strategy integrated with HBase, relational, and existing BI and data warehouse technology
- Provides the capability to create a data science discipline using full data sets
- Analysis capability over all internal data, with the ability to add external data at will
HBase
- Tables for Hadoop, inspired by Google's BigTable
- Supports both batch and random access
  - Ad hoc lookup
  - Website serving queries
- High consistency
- Maturing rapidly (e.g., reducing latency variance)
- Still a performance tax vs. DFS
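To make the random-access model concrete, here is a minimal sketch using the HBase Java client API of that era; the table name "metrics" and column family "d" are hypothetical, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");  // hypothetical table

    // Random write: one cell in column family "d"
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(1L));
    table.put(put);

    // Ad hoc lookup by row key
    Result result = table.get(new Get(Bytes.toBytes("row-42")));
    long count = Bytes.toLong(
        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
    System.out.println("count = " + count);
    table.close();
  }
}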
Unstructured Data Ingestion
- Batch log shipping: no distributed management and monitoring
- Syslog forwarding: no distributed management and monitoring
- Apache Kafka
  - Written in Scala (JVM-based)
  - Distributed message routing
  - Distributed monitoring and management (agents)
- Apache Flume
  - Written in Java
  - Pluggable sources, adapters, sinks
  - Distributed monitoring and management (agents)
- Other messaging/streaming frameworks: Scribe, Chukwa, Honu
Hadoop High Availability
- Name Node & Job Tracker traditionally a SPOF
- Durability excellent (3x replicas, careful ops)
- MapR offers an HA filesystem and auto-recovery from Job Tracker crashes
- Hadoop 0.23 promises HA Name Node & Job Tracker
Disaster Recovery
- Active/Active option for strict SLAs: parallel ingest of data
- DR: not all or nothing
- Backup options
- Recovery: parallel processing
- DistCp (e.g., hadoop distcp hdfs://nn1/path hdfs://nn2/path)
- HBase: snapshots and replication
Version Compatibility
- People upgrade Hadoop by installing a new cluster
- Traditionally: stop and upgrade everything
- Rolling upgrades in MapR, and planned for future Hadoop releases
- Protocols not backward compatible (until 0.24+)
- Only one version of the processing API (until 0.23)
Hosting Options
- Cloud-based: Amazon EMR and S3 notable; popular with startups & POCs
- Private data center or colocation
  - Commodity hardware w/ DAS: has been the main approach for enterprise IT
  - Appliance
  - Commodity hardware w/ SAS or SAN
Common Workloads
- Batch processing: ETL, model training, model scoring
- Fast analytics
- Search
- Lookup
Sweet-Spot Machine Configuration
- 8-12 cores
- 8-12 JBOD 2 TB spindles
- 32-64 GB of RAM
- 10 GigE
- Changing quickly (more density, disk drive shortage)
Common Uses
[Overview diagram of common uses]
- IT log & security forensics & analytics
- Automated device data analytics: find new signal, predict events, react in real time, failure analysis, proactive fixes, product planning
- 100% capture, data governance, shared services
- Customer analytics: cross-sell/upsell, monetize data, advertising analytics, attribution, customer value, segmentation, insights, optimization, social media
- Big Data Warehouse analytics (Hadoop + MPP + EDW): cost reduction, ad hoc insight, flexibility, predictive analytics
Automated Device Support Case Study
- The Enterprise Data Warehouse has been a foundation of analysis for 20 years; it shines for structured analysis, reporting, etc.
- Enterprises are dealing with new needs: new data types, at new scale, and the need to build analytic models
- The Big Data Warehouse integrates Hadoop with an Enterprise Data Warehouse
Why?
Challenges:
- Cost to store unstructured data
- Poor response time to changing BI needs
- Data warehouse access for departments
Goals:
- Integrate unstructured data with the data warehouse
- Predictive analytics based on data science
- Comprehensive cluster access for all users
Hadoop's Role
- Support semi-structured and unstructured data
- Large-scale storage
  - Transaction-level detail (e.g., clickstreams)
  - Archival
  - Integrated data: multiple warehouses, new data sources, ...
- Powerful processing capacity
  - Perform large-scale analyses/studies
  - Drill to detail in large fact tables
  - Query without structure: agility to analyze data without preprocessing
  - Transformation to build dimensional models, aggregates, and summaries
  - Build predictive models
Data Agility
Classic Warehouse:
- ETL: pre-parse all data
- Normalize up front
- Feed data marts
- New ideas need IT projects
Big Data Warehouse:
- Store raw data
- Parse only when proven; approximate parse on demand
- Capacity for analysis on demand
- Prove ideas before projects, to optimize
Common Data Sets
- User activity logs: website, ad server, mobile
- Social data: graph, activity, profiles
- Sensor data: hardware & software phone-home, IT logs, cell phones, smart grid, energy
- Databases: joined with less structured data, handling evolving schemas
- Time series, text, scientific data
Data Value Chain
- Integrating data multiplies its value
- Data Provider -> Data Integrator -> Data Consumer (internal products)
- Data marketplaces: Amazon, Microsoft, Infochimps, BuzzData, Quantbench
Data Science
What is Data Science?
- aka machine learning, data mining
- Exploratory modeling: identifying trends and exceptions, detecting signal
- Confirmatory modeling: building models to capture signal, proving at scale
- Working with data bottom-up
- Norvig: more data beats better algorithms
People: A New Role Exists, the Data Scientist
- One part scientist/statistician
- Two parts sleuth/artist
- One part programmer
- One part entrepreneur
- Focused on data, not models
Techniques
Supervised:
- Decision trees, random forests
- Logistic regression
- Back-propagating neural networks
- Support vector machines
Unsupervised (probabilistic and clustering models):
- Principal Component Analysis
- K-means clustering (see the sketch below)
- Singular Value Decomposition
- Bayesian networks
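To make one of the unsupervised techniques concrete, here is a tiny self-contained k-means sketch in plain Java (k = 2, one dimension); the toy data and initial centers are illustrative, not from the slides:

import java.util.Arrays;

public class KMeansSketch {
  public static void main(String[] args) {
    double[] points = { 1.0, 1.2, 0.8, 8.0, 8.3, 7.9 };  // toy 1-D data
    double[] centers = { 0.0, 10.0 };                    // initial guesses
    int[] assign = new int[points.length];

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: each point goes to its nearest center
      for (int i = 0; i < points.length; i++) {
        assign[i] = Math.abs(points[i] - centers[0])
                 <= Math.abs(points[i] - centers[1]) ? 0 : 1;
      }
      // Update step: each center moves to the mean of its assigned points
      for (int c = 0; c < 2; c++) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < points.length; i++) {
          if (assign[i] == c) { sum += points[i]; n++; }
        }
        if (n > 0) centers[c] = sum / n;
      }
    }
    System.out.println("centers = " + Arrays.toString(centers));  // converges to ~[1.0, 8.07]
  }
}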
Technologies
Open source libraries:
- Mahout (http://mahout.apache.org/)
- Mallet (http://mallet.cs.umass.edu/)
- Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- OpenNLP (http://incubator.apache.org/opennlp/)
- RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
Tools:
- Karmasphere Analyst: visualization
- Greenplum Chorus: collaboration and annotation
Process Best Practices
- Data science with a Big Data Warehouse: structured + unstructured data
- Make 100s of mistakes with little cost
- Find promising small-signal detection; quickly move to production for testing
- Establish signal detection capabilities
- Discipline to learn and experiment
- Innovate with new data sources
- Retain lessons learned & "voodoo" IP
- Monetize your data science investments!
Real-Time Big Data
Hadoop Processing
- Today: batch processing
  - Time to spin up JVM instances
  - HDFS optimized for disk scans
  - Designed for reliability: tolerate failures
- Not yet suitable for real-time event processing: Storm, Kafka, message queues, ...
- Futures: shared storage and cluster management, with multiple processing models
Edge Serving Needs
- Scale-out
- Simple operations
- Fast parallel export (for profiles, scores)
- DSS analytics feed (fast parallel import)
- Fast analytics (operational reporting)
Edge Serving Options
- Application-clustered SQL: MySQL, Oracle, etc.
- NoSQL clusters: MongoDB, Couchbase, Cassandra, HBase, Citrusleaf
- Scalable RDBMS for OLTP: Oracle RAC?, MySQL Cluster, VoltDB, Clustrix
NoSQL Database Types
- Key-value stores: distributed hash table, single index
- Tabular/columnar stores: BigTable-style column families, scans, MapReduce
- Document stores: JSON-style self-contained structures, secondary indexes
- Object stores: document store + foreign keys
- Graph stores: links among nodes and traversal options
HBase Architecture
- Uses HDFS to handle replication: gives us replication and resiliency
- Consists of a Master node and Region nodes
- 3-level hierarchy to reach the node where data is stored: Client -> Master -> Region (metadata) -> Region (data table)
- The data table location is cached in the client after lookup, to speed access
Cassandra Architecture
- Uses an Amazon Dynamo-style model: distributed hash table
- All nodes are homogeneous (no master, no SPOF)
- Nodes are organized in a ring; a client can connect to any node, since the nodes communicate over the ring
MongoDB
- Simpler document database model: access by keys, simple filter queries, no joins
- Secondary indexes, including geo indexing (see the sketch below)
- Eventual consistency model (allows CAP tradeoffs)
- Updates normally replicated to a slave; defers disk writes for a major performance boost
- Focus on a simple API
- Mongo-Hadoop integration actively developed
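As a sketch of that simple API, here is what key access and a secondary (geo) index look like with the MongoDB Java driver of that era; the database, collection, and field names are hypothetical:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class MongoSketch {
  public static void main(String[] args) throws Exception {
    Mongo mongo = new Mongo("localhost", 27017);
    DB db = mongo.getDB("edge");                        // hypothetical database
    DBCollection places = db.getCollection("places");   // hypothetical collection

    // Insert a self-contained JSON-style document (no joins)
    places.insert(new BasicDBObject("name", "store-1")
        .append("loc", new double[] { -122.4, 37.8 }));

    // Secondary geo index, then a simple $near filter query
    places.ensureIndex(new BasicDBObject("loc", "2d"));
    DBObject query = new BasicDBObject("loc",
        new BasicDBObject("$near", new double[] { -122.4, 37.8 }));
    for (DBObject doc : places.find(query)) {
      System.out.println(doc);
    }
    mongo.close();
  }
}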
Streaming Big Data
- Responding to incoming events at scale
- Requires keeping state
- Simple cases: application logic with a scale-out database
- SQL-style: SQLstream, InfoSphere Streams
- MapReduce-style emerging: Kafka, S4, Storm, FlumeBase (see the Storm sketch below)
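As an illustration of the MapReduce-style approach, here is a minimal Storm topology wiring sketch (Storm 0.x, backtype.storm API); EventSpout and WordCountBolt are assumed user-defined spout/bolt classes, not shown:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class StreamingSketch {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // EventSpout (hypothetical) emits tuples with a "word" field
    builder.setSpout("events", new EventSpout(), 4);
    // WordCountBolt (hypothetical) keeps running counts: the "state" noted above;
    // fieldsGrouping routes equal words to the same bolt task, like a shuffle
    builder.setBolt("count", new WordCountBolt(), 8)
           .fieldsGrouping("events", new Fields("word"));

    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();  // in-process cluster for testing
    cluster.submitTopology("word-count", conf, builder.createTopology());
  }
}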
Futures
Computing Trends
- The growth of storage density has well outpaced the growth of data transfer rates
[Chart: storage density vs. transfer rate, 1985-2010]
Computing Trends, cont'd.
- In 1990, you could read all the data from a typical drive in about 5 minutes
- Today, it would take over 2 hours (see the arithmetic below)
- And seek times have improved even more slowly than data transfer rates (SSDs improve this)
- Network speeds in the data center have improved at a comparable rate (60%/yr)
- So clusters of commodity servers allow throughput, and clusters of servers allow RAM density
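A back-of-the-envelope check of those read times; the drive specs (a 1990 drive of roughly 1,370 MB at 4.4 MB/s, and a 2012 drive of 1 TB at about 100 MB/s) are illustrative assumptions, not from the slides:

public class DriveReadTime {
  public static void main(String[] args) {
    // Assumed (illustrative) drive specs
    double mb1990 = 1370, rate1990 = 4.4;  // capacity in MB, transfer in MB/s
    double mb2012 = 1e6,  rate2012 = 100;  // 1 TB in MB, MB/s

    System.out.printf("1990: %.1f minutes%n", mb1990 / rate1990 / 60);   // ~5.2 minutes
    System.out.printf("2012: %.1f hours%n",   mb2012 / rate2012 / 3600); // ~2.8 hours
  }
}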
Commodity Hardware in 2017?
- 512 GB of RAM
- 64 cores
- 15 TB spinning disks
- 1 TB SSDs for caching
- 100 Gigabit networking (InfiniBand?)
Trends in Big Data for 2012
- Hadoop 0.23 (2.0?)
- Explosion in new application models
- HBase prominence
- Data science practices, tools, technologies
- Integration
ron.bodkin@thinkbiganalytics.com