Big Data and Solr. Ken Krugler. Confidential and Proprietary 2014/15

2 Big Data and Solr Ken Krugler Confidential and Proprietary 2014/15

3 Welcome to Big Data and Solr This class is about using Solr with big data Topics covered include Hadoop & batch processing workflows Storm & continuous data processing Cassandra & NoSQL Scalable Solr indexing This is not a Solr/SolrCloud tutorial! 3

4 Meet Your Instructor Ken Krugler - direct from Nevada City, California President of Scale Unlimited Apache Software Foundation Member Hadoop/Cascading/Mahout/Cassandra developer and trainer Solr developer and trainer 4

5 Day 1 in a Nutshell Big data overview Hadoop fundamentals Eco-system, other pieces Not-so-gentle intro to workflows Foundation for the "and Solr" piece tomorrow Show of hands for Hadoop background Heard the name Used it some Elbows deep 5

6 Real-world example Real-world solution Hadoop overview Counting words Hadoop map-reduce Hadoop streaming Hadoop distributed file system 6

7 Map-Reduce lab Hadoop Eco-system Cascading & workflows Cassandra & NoSQL Storm & continuous processing Workflow lab 7

8 Real World Example Big Data and Solr 8 Confidential and Proprietary 2014/15

9 Adbeat It's an analytics web site for display advertising You can find out who's advertising where Company has been active for about 3 years now 9

10 What's an Analytics Web Site? Let the user ask questions about data 10

11 Including Sexy Dashboards All driven by slices of the data 11

12 Old system architecture Web crawl running in Amazon cloud Crawl state saved in NoSQL database (MongoDB) Python + queues to fetch & analyze millions of pages/day Data pushed into Amazon storage (S3) Custom database used to store results Supported queries and search Also count/sum of query results 12

13 Back end database Each view or filter change causes queries to be executed sum ad impact for all advertisers on all networks, sort by sum, limit 10 sum ad impact by ad type for advertiser oracle.com For many millions of records Fast, accurate, cheap - pick any two 13

14 Combinatorial Explosion Too many possibilities to pre-calculate everything more than 100,000 publishers more than 1,000,000 advertisers 40 ad networks, 5 date ranges, etc So there are trillions of possible combinations Caching of DB query results isn't very useful 14

15 Trouble in UI Land UI refresh took seconds Well outside of target range of about a second or so 0.1 second: instantaneous 1.0 second: I'm still in the flow 10 seconds: I'm bored 15

16 Trouble in the back office Beefy hardware for multiple DBs was expensive AWS cost > USD$10,000 per month Couldn't process data fast enough The data sets needed to grow significantly (time, depth, breadth) Constant schema changes meant painful data reloading Extract, load, transform (inside of DB) Re-indexing of DB fields 16

17 Was it a Big Data problem? Sometimes hard to describe - think about the three V's Volume - data on 2 billion ads Velocity - millions of new ads each day Variety - ad data, web page crawl results, partner data 17

18 What were key problems? Scalability - data size growing 5x year-by-year Flexibility - how to add new functionality quickly Reliability - don't lose data, regular data updates All of these have impact on Cost, Performance, Quality 18

19 Doing some things right Using NoSQL database (MongoDB) to store web crawl metadata Scalable, flexible schema Using Amazon Web Services for scalable infrastructure Web crawling, storage, some data processing 19

20 Key Problems DB-centric design often has trouble with scaling, flexibility Data volume, velocity and variety were all problems You're only as good as the weakest link 20

21 Real World Solution Big Data and Solr 21 Confidential and Proprietary 2014/15

22 A New Approach Do analytics off-line using Hadoop Pre-generate as much as possible Use Solr as a NoSQL database (including search) 22

23 Hadoop-based analytics Hadoop is very good at parsing text files Crawl results pulled from Amazon S3 storage, parsed Hadoop is very good at counting & summing lots of things Group on advertiser, count ads during different time periods Hadoop can be used for more advanced analytics Parse ad text to extract all phrases Use machine learning bits from Mahout 23

24 What Solr Gives Us Open source enterprise search system Fast, memory-efficient queries (NoSQL) Pre-computed answers to 300M questions Count the number of documents that match a query Sort results by fields Fast searches Find all Flash ads with the word diet 24
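
To make those queries concrete, here is a minimal SolrJ sketch of the kind of request described above - counting and fetching Flash ads that mention "diet". The server URL, core name (ads) and field names (text, adType) are illustrative assumptions, not Adbeat's actual schema.

// Minimal SolrJ 4.x sketch - counts and returns the top Flash ads containing "diet".
// URL, core name and field names are assumptions for illustration only.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AdQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/ads");
        SolrQuery query = new SolrQuery("text:diet AND adType:flash");
        query.setRows(10);                                   // only need the top 10 hits
        QueryResponse response = solr.query(query);
        System.out.println("Matching ads: " + response.getResults().getNumFound());
        solr.shutdown();
    }
}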

25 Obligatory Architectural Slide Single search server 16 shards per index Optimize response time 300M total documents Solr /select AWS Server cc2.8xlarge Jetty Webapp Container Request Handler Shard #1 Shard #2 Shard N Solr Cores (Indexes) 25

26 How to Connect the Dots We have web crawl data - ads, advertisers, publishers, networks text google DIRECTV For Businesses Save $13/mo We have target Solr schemas with the fields defined <field name="network" type="string" indexed="true" stored="false" required="true" /> <field name="publisher" type="string" indexed="true" stored="false" required="true" /> Data Sources f(data)??? Index 26

27 Hadoop ETL Implement appropriate Extract, Transform, Load Extract is just parsing text files that are stored in Amazon's S3 Load is building the Solr index and deploying it to the search servers What about that pesky Transform part? 27

28 Simplicity Itself 80+ Hadoop Jobs Developed with Cascading Every-other-day run is $60 28

29 Where are they now? < 1 second average response time to most queries 10x more data, and growing Many more features 1/5th the monthly cost Regular updates/no data loss 29

30 Key Solution Points Using AWS gave them flexibility to change architecture Switching to Hadoop gave them scalable data processing Using Solr as a NoSQL DB gave them real-time analytics and search 30

31 Hadoop Overview Big Data and Solr 31 Confidential and Proprietary 2014/15

32 Key Questions Why is something like Hadoop needed? What are the two key architectural components? 32

33 How to Crunch a Petabyte Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry, since network errors happen 33

34 Hadoop to the Rescue Open source project Under Apache Software Foundation Other distributions - Cloudera, MapR, etc. Scalable - many servers with lots of cores and spindles Reliable - detect failures, redundant storage Fault-tolerant - auto-retry, self-healing Simple - use many servers as one really big computer 34

35 Logical Architecture Cluster Execution Storage Logically, Hadoop is simply a computing cluster that provides: a Storage layer, and an Execution layer 35

36 Logical Architecture Users Cluster Client Client Execution job job job Storage Client Where users can submit jobs that: Run on the Execution layer, and Read/write data from the Storage layer 36

37 Storage Layer Hadoop Distributed File System (aka HDFS) Runs on top of regular OS file system, typically Linux ext3 Fixed-size blocks (64MB by default) that are replicated Write once, read many; optimized for streaming in and out Geography Geography Rack Rack Rack Node Node Node Node... block block block block block block block block 37

38 Execution Layer Hadoop Map-Reduce Responsible for running a job in parallel on many servers Handles re-trying a task that fails, validating complete results Jobs consist of special map and reduce operations Geography Geography Rack Rack Rack Node map Node Node Node... map map map map reduce reduce reduce 38

39 Scalable Cluster Rack Rack Rack Node Node Node Node... cpu Execution disk Storage Virtual execution & storage layers span many nodes (servers) Scales linearly (sort of) with cores and disks. 39

40 Reliable Each block is replicated, typically three times. Each block is checksummed. Each task must succeed, or the job fails. All intermediate data copies are validated. 40

41 Fault-tolerant Failed tasks are automatically retried. Failed data transfers are automatically retried. Servers can join and leave the cluster at any time. 41

42 Simple Node Node Node Node... cpu Execution disk Storage Reduces complexity Conceptual operating system that spans many CPUs & disks 42

43 Typical Hadoop Cluster Has one master server - high quality box NameNode process - manages file system JobTracker process - manages tasks Has multiple slave servers - commodity hardware DataNode process - manages file system blocks on local drives TaskTracker process - runs tasks on server Uses high speed network between all servers 43

44 Architectural Components Solid boxes are unique applications Dashed boxes are child JVM instances (on same node as parent) Dotted boxes are blocks of managed files (on same node as parent) NameNode DataNode DataNode DataNode DataNode data block Client jobs JobTracker tasks TaskTracker mapper child mapper jvm child mapper jvm child jvm reducer child reducer jvm child reducer jvm child jvm Master Slaves 44

45 Q & A Why is something like Hadoop needed? What are the two key architectural components? 45

46 HDFS Big Data and Solr 46 Confidential and Proprietary 2014/15

47 Key Questions Why does Hadoop keep multiple copies of the data? What are the two processes that manage the file system? What are key limitations to the file system? 47

48 Virtual File System Treats many disks on many servers as one huge, logical volume Files have paths, same as a regular file system For example: /user/kkrugler/data/big-file.txt Data is stored in 1...n blocks Blocks in file all have same max size (typically 64MB) The DataNode process manages blocks of data on a slave. The NameNode process keeps track of file metadata on the master. 48

49 Replication Each block is stored on several different disks (default is 3) Hadoop tries to copy blocks to different servers and racks Protects data against disk, server, rack failures Reduces the need to move data to code 49

50 Error Recovery Slaves constantly check in with the master. Aka the heartbeat Data is automatically replicated if a disk or server goes away. Data is checksummed to protect against bad hardware. Per-block, multiple CRC values Verified during writes and reads Background verification (block scanner) 50

51 Performance / Scaling Optimized for streaming data in/out Data rates 30% - 50% of max raw disk rate No random access during writes (streaming out) Write once, read many Limit to scaling is NameNode Memory, CPU horsepower But 10K servers are probably more than you'll ever need 51

52 NameNode Runs on master node Was a single point of failure (Hadoop 2.0 adds HA) - before that there were no built-in hot failover mechanisms Maintains the filesystem namespace - the files and hierarchical directories Executes filesystem namespace operations - opening, closing, and renaming Maintains the mapping from data blocks to DataNodes - aka the Block Map 52

53 DataNodes DataNode DataNode DataNode DataNode data block Files stored on HDFS are stored as blocks on DataNodes Manages storage attached to the node that it runs on Performs block creation, deletion, replication as directed by the NameNode Typically one instance per node in the cluster Data never flows through the NameNode, only DataNodes Heartbeats sent to the NameNode determine availability 53

54 Getting Data into HDFS Using Hadoop command line tool hadoop fs -copyFromLocal <local path> <HDFS path> Using Hadoop DFSClient Java class Create output stream, write to it From external data sources, via API requests HTTP requests to web servers JDBC requests to databases 54
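
For the programmatic route, client code usually goes through the Hadoop FileSystem API (which wraps DFSClient); a minimal sketch of creating an output stream and writing to it, re-using the example path from the earlier HDFS slide (the content written is arbitrary):

// Minimal sketch of writing a file to HDFS via the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml settings
        FileSystem fs = FileSystem.get(conf);       // HDFS when the default FS points at it
        FSDataOutputStream out = fs.create(new Path("/user/kkrugler/data/big-file.txt"));
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}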

55 Q & A Why does Hadoop keep multiple copies of the data? What are the two processes that manage the file system? What are some limitations to the Hadoop file system? 55

56 Counting Words Big Data and Solr 56 Confidential and Proprietary 2014/15

57 How Long to Count a Page of Words? Many different approaches Take the first word, count all occurrences in the document, repeat Or keep a list of words with counts Clearly doesn't scale well What happens if you have one million pages? 57

58 How Can a Room of People Count Words? Some people are "mappers" They cut up sentences into individual words Some people are "reducers" They count the total number of words Each "reducer" gets all words that end with a letter in their range e.g. "obvious" ends with "s", so it goes to the O - T reducer Let's give that a try... 58

59 Making it Faster The "mappers" could sort their words So then the "reducers" could merge-sort the lists from each mapper The "mappers" could combine word counts If they have two words that are the same, write "2" on one slip This reduces the number of words that have to be given to the "reducers" The "mappers" could run to the "reducers" That's like having a faster network The "reducers" could have more fair ranges of letters U-Z has it easy, O-T has to work much harder 59

60 Map-Reduce Big Data and Solr 60 Confidential and Proprietary 2014/15

61 Key Questions What is the fundamental format for all data? Why might one reduce task take much longer than the others? 61

62 Definitions Key Value Pair -> two units of data, exchanged between Map & Reduce Map -> The map function in the MapReduce algorithm user defined converts each input Key Value Pair to 0...n output Key Value Pairs Reduce -> The reduce function in the MapReduce algorithm user defined converts each input Key + all Values to 0...n output Key Value Pairs Group -> A built-in operation that happens between Map and Reduce ensures each Key passed to Reduce includes all Values 62

63 Definitions Mapper Map Reducer Reduce Mapper -> A process that executes Map for each input Key Value Pair Reducer -> A process that executes Reduce for each unique map Key Shuffling -> The magic between the Mapper and Reducer Job -> A single Map and Reduce implementation that are submitted together and that execute on many Mappers and Reducers on the Cluster 63

64 [Diagram: multiple Mappers feeding multiple Reducers through the Shuffle] 64

65 Recap MapReduce is conceptually Map->Group->Reduce Map and Reduce are user defined functions that execute on Key Value Pairs A Job pushes one Map and one Reduce implementation onto the cluster to be executed by distributed Mappers and Reducers The Group function is inherent to the algorithm and is executed during Shuffling 65

66 How MapReduce Works Map translates input to keys and values to new keys and values [K1,V1] Map [K2,V2] System Groups each unique key with all its values [K2,V2] Group [K2,{V2,V2,...}] Reduce translates the values of each unique key to new keys and values [K2,{V2,V2,...}] Reduce [K3,V3] 66

67 Things to consider Map: is required in a Job [K1,V1] Map [K2,V2] may emit 0 or more Key Value Pairs [K2,V2] Reduce: is optional in a Job [K2,{V2,V2,...}] Reduce [K3,V3] sees K2 keys in sort order but K2 keys will be randomly distributed across Reducers the collection of V2 values are unordered may emit 0 or more Key Value Pairs [K3,V3] 67

68 Canonical Example - Word Count Word Count Read a document, parse out the words, count the frequency of each word Specifically, in MapReduce With a document consisting of lines of text Translate each line of text into key = word and value = 1 e.g. <the,1> <quick,1> <brown,1> <fox,1> <jumped,1> Sum the values (ones) for each unique word 68
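
For comparison with the streaming version used later in the labs, a minimal word count Mapper and Reducer sketch against Hadoop's classic org.apache.hadoop.mapred API - an illustration of the key/value flow described above, not part of the course code:

// Word count sketch: map emits <word, 1>, reduce sums the 1s per word.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // [K1,V1] = <line offset, line of text>  ->  [K2,V2] = <word, 1>
    public static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, ONE);
                }
            }
        }
    }

    // [K2,{V2,V2,...}] = <word, {1,1,...}>  ->  [K3,V3] = <word, sum>
    public static class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}

A JobConf/JobClient driver (see the JobTracker slide later) would bind these two classes to input and output paths and submit the job; a combiner (often the reducer class itself) would do the map-side pre-summing of counts described on the previous slide.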

69 [1, When in the Course of human events, it becomes] [2, necessary for one people to dissolve the political bands] [3, which have connected them with another, and to assume] [n, line of text] [K1,V1] Map (User Defined) [K2,V2] [When,1] [in,1] [the,1] [Course,1] [k,1] Group [K2,{V2,V2,...}] [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] [k,{v,v,...}] Reduce (User Defined) [K3,V3] [When,4] [people,6] [dissolve,3] [connected,5] [k,sum(v,v,v,..)] 69

70 Divide & Conquer (splitting data) Because The Map function only cares about the current key and value, and The Reduce function only cares about the current key and its values Then A Mapper can invoke Map on an arbitrary number of input keys and values or just some fraction of the input data set A Reducer can invoke Reduce on an arbitrary number of the unique keys but all the values for that key 70

71 [Diagram: a file is divided into splits (split1, split2, split3, split4, ...); each split is processed by a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups [K2,{V2,V2,...}] for the Reducers, which write [K3,V3] as part-000n files into the output directory] Mappers must complete before Reducers can begin 71

72 Divide & Conquer (parallelizable) Because Each Mapper is independent and processes part of the whole, and Each Reducer is independent and processes part of the whole Then Any number of Mappers can run on each node, and Any number of Reducers can run on each node, and The cluster can contain any number of nodes 72

73 JobTracker Client jobs JobTracker Is a single point of failure Determines # Mapper Tasks from file splits via InputFormat Uses predefined value for # Reducer Tasks Client applications use JobClient to submit jobs and query status Command line use hadoop job <commands> Web status console via the JobTracker web UI 73
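
As a sketch of what "use JobClient to submit jobs" looks like in code (classic org.apache.hadoop.mapred API; the identity mapper/reducer and the /input and /output paths are placeholders, not from the course labs):

// Configure and submit a job via JobClient; runJob() blocks until the JobTracker
// reports completion. Identity classes simply pass input key/value pairs through.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJobExample.class);
        conf.setJobName("identity-example");
        conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat keys
        conf.setOutputValueClass(Text.class);         // TextInputFormat values
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(4);                    // the predefined reducer count mentioned above
        FileInputFormat.setInputPaths(conf, new Path("/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        JobClient.runJob(conf);
    }
}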

74 TaskTracker mapper child mapper jvm child mapper jvm child jvm TaskTracker reducer child reducer jvm child reducer jvm child jvm Spawns each Task as a new child JVM Max # mapper and reducer tasks set independently Can pass child JVM opts via mapred.child.java.opts Can re-use JVM to avoid overhead of task initialization 74

75 Q & A What is the fundamental format for all data? Why might one reduce task take much longer than the others? 75

76 Hadoop Streaming Big Data and Solr 76 Confidential and Proprietary 2014/15

77 Key Questions Why can't the key contain tab characters? Why do you have to keep track of the key in your reducer? 77

78 Streaming Overview Use any executable or script as the Mapper and/or Reducer awk, sed, grep, bash, etc Python, Ruby, Perl Mapper or Reducer may also be a Java class Uses Unix standard input and output streams Essentially a pre-created Hadoop job that knows how to... run code separate from the TaskTracker JVM send values via stdin, and get results via stdout 78

79 Streaming Text vs Binary Typically streaming uses text (UTF-8) for everything Map just gets the value (offset into file is the key, and is tossed) Reduce gets key/value pairs as <key><tab><value><eol> Though TypedBytes lets you send binary data in, get it back out Needs language-specific implementation e.g. see Dumbo project for using this with Python 79

80 Streaming Map Function Each line read from stdin has one value to be processed Output is written as one key/value pair per line of text, to stdout
#!/usr/bin/env python
import sys
for line in sys.stdin:
    ...  # Split the line of text
    ...  # do some processing
    print "%s\t%s" % (key, count)
80

81 Streaming Reduce Function Each line read from stdin has one key/value pair to be processed For a group, you're called multiple times with the same key There's no iterator Output is written as one key/value pair per line of text, to stdout 81

82
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input file:/users/user/sandbox/input/ncdc/sample.txt \
  -output file:/users/user/sandbox/output \
  -mapper /Users/user/sandbox/src/main/map.rb \
  -reducer /Users/user/sandbox/src/main/reduce.rb
Many optional parameters to control... Fields in keys Partitioning etc. etc. 82

83 Q & A Why can't the key contain tab characters? Why do you have to keep track of the key in your reducer? 83

84 emailcount Lab Big Data and Solr 84 Confidential and Proprietary 2014/15

85 Key Questions Can you get the emailcount example to build & run? What was challenging about the exercises? 85

86 emailcount Overview Goal is to count emails by author (email address) Input is text file or directory of text files with pre-processed emails E.g. src/test/resources/emails.tsv One email per line, with tab-separated values msgid, author, email, subject, date, replyid, content line feeds in content are replaced by the text "\n", tabs by "\t" <msg1>!author1!email1!subject1! T14:07:23Z!<msg101>!content1 86

87 emailcount Map Map task extracts the email address from the input line Uses FieldUtils.safeSplit(string) to correctly handle empty fields Emits key = <email address>, value = <1> for key-value pairs <msgid> <author> <email> <subject> <date> <replyid> <content> -> <email> 1 87

88 emailcount Reduce Reduce task sums the counts of each unique email address Emits key = <email address>, value = <summed count> e.g. a.karpenko@oxseed.com 5 <email1> 1 <email1> 1 <email1> 1 <email2> 1 -> <email1> 3 <email2> 1 88

89 Streaming-like Java API public interface IStreamingMapper { void map(BufferedReader in, PrintWriter out) throws Exception; } public interface IStreamingReducer { void reduce(BufferedReader in, PrintWriter out) throws Exception; } 89

90 IStreamingMapper for emailcount
public interface IStreamingMapper {
    void map(BufferedReader in, PrintWriter out) throws Exception;
}

public class CountJob1Mapper implements IStreamingMapper {
    public void map(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // We have a line of text with a bunch of tab-separated fields.
            // msgid, author, email, subject, date, replyid, content
            String fields[] = FieldUtils.safeSplit(inputLine);

            // We want to emit the email address and a count of 1
            out.println(fields[2].trim().toLowerCase() + "\t" + 1);
        }
    }
}
90

91 IStreamingReducer for emailcount
public interface IStreamingReducer {
    void reduce(BufferedReader in, PrintWriter out) throws Exception;
}

public class CountJob1Reducer implements IStreamingReducer {
    public void reduce(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine, currentEmail = null;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            // We have a line of text with two tab-separated fields (email, count)
            String fields[] = FieldUtils.safeSplit(inputLine);
            String email = fields[0];
            int emailCount = Integer.parseInt(fields[1]);

            if (currentEmail == null) {
                currentEmail = email;
            }

            if (email.equals(currentEmail)) {
                count += emailCount;
            } else {
                out.println(currentEmail + "\t" + count);
                currentEmail = email;
                count = emailCount;
            }
        }

        // Emit the final group
        if (currentEmail != null) {
            out.println(currentEmail + "\t" + count);
        }
    }
}
91

92 Test code will run StreamingTool In src/test/java/com/scaleunlimited/labs/emailcount/ test() method inside of the lab's test class
public void test() throws Exception {
    String args[] = {
        "-input", "src/test/resources/emails.tsv",
        "-output", "build/test/EmailCountLabTest/",
        "-mapper", CountMapper.class.getCanonicalName(),
        "-reducer", CountReducer.class.getCanonicalName(),
    };
    StreamingTool.main(args);

    assertEquals(40, TestUtils.getNumLines("build/test/EmailCountLabTest/"));
}
92

93 You can "chain" Jobs
String args1[] = {
    "-input", INPUT_FILENAME,
    "-output", JOB_1_OUTPUT_DIRNAME,
    "-mapper", CountExercise2Job1Mapper.class.getCanonicalName(),
    "-reducer", CountExercise2Job1Reducer.class.getCanonicalName(),
};
StreamingTool.main(args1);

// We'll use the output of the first job as the input for this second job.
String args2[] = {
    "-input", JOB_1_OUTPUT_DIRNAME,
    "-output", JOB_2_OUTPUT_DIRNAME,
    "-mapper", CountExercise3Job2Mapper.class.getCanonicalName(),
    "-reducer", CountExercise3Job2Reducer.class.getCanonicalName(),
};
StreamingTool.main(args2);
93

94 Additional options -debug prints every record to the console -mappertimeout & -reducertimeout are useful when debugging You can hit a breakpoint and look at variables without thread termination
String args1[] = {
    "-input", INPUT_FILENAME,
    "-output", JOB_1_OUTPUT_DIRNAME,
    "-mapper", CountExercise2Job1Mapper.class.getCanonicalName(),
    "-reducer", CountExercise2Job1Reducer.class.getCanonicalName(),
    "-debug",
    "-mappertimeout", "100000",
    "-reducertimeout", "100000",
};
94

95 Lab Details Follow steps in bigdata-solr/emailcount/readme-emailcount Use "ant test" to build and test the code Modify emailcount as per the README exercises Use "ant test1", "ant test2", etc to test solutions Exercise 1 in all labs should be doable in the time you've got Subsequent exercises are harder If you finish the "challenge" exercise, I'll be very impressed :) Solutions are in bigdata-solr/emailcount/solutions We'll do Q & A in one hour 95

96 Q & A Can you get the emailcount example to build & run? What was challenging about the exercises? 96

97 Hadoop Eco-system Big Data and Solr 97 Confidential and Proprietary 2014/15

98 Key Questions What's the key difference between Hive and Drill? What is one common problem with using Pig? 98

99 Hive Hive is a data warehouse built on top of flat files in HDFS Developed by Facebook Now a top-level Apache project Very active developer/user community 99

100 Hive Data Organization into Tables with logical and hash partitioning A Metastore to store metadata about Tables/Partitions/Buckets Partitions are how the data is stored (collections of related rows) Buckets further divide Partitions based on a hash of a column metadata is actually stored in an RDBMS A SQL-like query language over object data stored in Tables Custom types, structs of primitive types User defined functions 100

101 Hive - Creating Tables CREATE TABLE page_view(viewtime DATETIME, userid MEDIUMINT, page_url STRING, referrer_url STRING, friends ARRAY<BIGINT>, properties MAP<STRING, STRING>, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(date DATETIME, country STRING) BUCKETED ON (userid) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY \001 COLLECTION ITEMS TERMINATED BY \002 MAP KEYS TERMINATED BY \003 LINES TERMINATED BY \012 STORED AS COMPRESSED 101

102 Hive - Loading Data CREATE EXTERNAL TABLE page_view_stg(viewtime DATETIME, userid MEDIUMINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User', country STRING COMMENT 'country of origination') COMMENT 'This is the staging page view table' ROW FORMAT DELIMITED FIELDS TERMINATED BY \054 LINES TERMINATED BY \012 LOCATION '/user/facebook/staging/page_view'; hadoop dfs -put /tmp/pv_ txt /user/facebook/staging/page_view FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(date= , country='us') SELECT pvs.viewtime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'; 102

103 Hive - Queries Queries always write results into a table FROM user INSERT OVERWRITE TABLE user_active SELECT user.* WHERE user.active = true; FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt' SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY pv_users.age; 103

104 Pig Pig is a data flow programming environment for processing large files PigLatin is the text language it provides for defining data flows Developed by researchers in Yahoo! and widely used internally 104

105 PigLatin - Example W = LOAD 'filename' AS (url, outlink); G = GROUP W by url; R = FOREACH G { FW = FILTER W BY outlink eq ' PW = FW.outlink; DW = DISTINCT PW; GENERATE group, COUNT(DW); } Adding new functionality requires a user defined function aka the dreaded "UDF" Brings back bad memories of stored procedures 105

106 Drill Distributed system for interactive analysis of large-scale datasets Based on Google's paper about their Dremel system Apache Incubator project Very hot area right now Impala (Cloudera) Druid (Metamarkets) etc, etc, etc 106

107 Interactive Analysis Systems designed to... Support very large datasets (petabytes) on many servers (1000s) Quickly answer SQL-ish queries, e.g. in 10s to 100s of milliseconds Easily handle "nested" data formats, versus requiring flattening Uses "columnar" format for storing data ORCFile, Parquet, Trevni, etc. 107

108 Q & A What's the key difference between Hive and Drill? What is one common problem with using Pig? 108

109 Hadoop Summary Big Data and Solr 109 Confidential and Proprietary 2014/15

110 Good Use Cases for Hadoop Data doesn't fit on one server Data can't be processed fast enough on one server Batch processing latency is an acceptable trade-off 110

111 Bad Use Cases for Hadoop Problems that aren't great for solving with Hadoop Small data problems Real-time data processing 111

112 Success Factors The standard list - clear definition of victory, timetable, etc. Where is your data coming from, and going to? Often the biggest chunk of project time What's the simplest Hadoop solution that might work? Streaming, for non-Java solutions Hive, for query-centric problems Cascading, for workflows 112

113 Beware the Hadoopaphile Hadoop is a big hammer But not every problem is a nail And it's not magic pixie dust 113

114 Resources Hadoop mailing lists Users groups - e.g. Hadoop API Books Hadoop: The Definitive Guide, 2nd edition by Tom White 114

115 Workflows & Cascading Big Data and Solr 115 Confidential and Proprietary 2014/15

116 Key Questions How is Cascading similar to Hive and Pig? What is one challenging Hadoop operation that's easy with Cascading? 116

117 Workflow Definition Complex processing of (semi) structured data at scale Complex: not something easily handled via map-reduce Processing: conversion of one or more input data streams Structured: data typically has defined fields with specific meanings At scale: beyond what you can do on one server 117

118 Cascading An API for defining data processing workflows Open source project, Apache Public License First public release Jan 2008 Current releases support Hadoop through 2.5.x 118

119 Cascading Implements all standard data processing operations function, filter, group by, co-group, aggregator, etc Complex applications are built by chaining operations with pipes Pipe assemblies are bound to input and output data And are planned into the minimum number of MapReduce jobs Similar to Hive, Pig and other higher-level abstractions 119

120 Cascading Code
Pipe pipe = new Pipe("wordcount");
pipe = new Each(pipe, new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"));
pipe = new CountBy(pipe, new Fields("word"), new Fields("count"));
pipe = new GroupBy(pipe, new Fields("count"), true);

Tap source = new Hfs(new TextLine(), "wordcount.txt");
Tap sink = new Hfs(new TextLine(), "output", SinkMode.REPLACE);

Flow f = new HadoopFlowConnector().connect(source, sink, pipe);
f.complete();
120

121 Hfs['TextLine[['line']->[ALL]]']['src/test/resources/wordcount.txt']'] Each('wordcount')[RegexSplitGenerator[decl:'word'][args:1]] Each('wordcount')[CompositeFunction[decl:'word', 'count']] GroupBy('wordcount')[by:['word']] Every('wordcount')[Sum[decl:'count'][args:1]] TempHfs['SequenceFile[['word', 'count']]'][ /wordcount/] GroupBy('wordcount')[by:['count']] Hfs['TextLine[['offset', 'line']->[all]]']['build/test/minwordcounttest']'] 121

122 Joining Streams of Data Common requirement to join two data sets, using a common key E.g. log file data and country data, by IP address 'Log IP', 'Status' 'Data IP', 'Country' Join Tuple streams using 'Log IP' and 'Data IP' 'Log IP', 'Status', 'Data IP', 'Country'
Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe
    new Fields("Log IP"),     // left-side field for joining
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side field for joining
    new LeftJoin());          // type of join to do
122

123 [Generated Cascading flow plan for a larger workflow: several source Taps feeding chains of Each and CoGroup steps, linked by TempHfs sequence files, with GroupBy/Every aggregations before the final sinks - the field names in this diagram did not survive transcription] 123

124 Workflow Design Simplicity itself :) Sketch as operations & groups connected with pipes 124

125 emailcount Lab Exercise 1 Count total emails and total characters, by author First step is to parse input text to create Tuples with fields Use the handy RegexSplitter function
Pipe pipe = new Pipe("emails");
Fields f = new Fields("msg_id", "author", "email", "subject", "date", "reply_id", "content");
pipe = new Each(pipe, new Fields("line"), new RegexSplitter(f));
125

126 emailcount Lab Exercise 1 Next step is to calculate the length of each email Use the handy ExpressionFunction
Fields lengthField = new Fields("content_length");
Function func = new ExpressionFunction(lengthField, "content.length()", String.class);
pipe = new Each(pipe, new Fields("content"), func, Fields.ALL);
126

127 emailcount Lab Exercise 1 Finally group on the email address, and count emails/sum length AggregateBy does map-side pre-aggregations for efficiency But wait, we'll also group on the summed length, to sort high to low
AggregateBy count = new CountBy(new Fields("num_emails"));
AggregateBy sum = new SumBy(lengthField, new Fields("total"), Integer.class);
pipe = new AggregateBy(pipe, new Fields("email"), count, sum);
pipe = new GroupBy(pipe, new Fields("total"), true);
127
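
To actually run the exercise, the assembled pipe gets bound to taps and planned into a Flow, following the same pattern as the word count code on slide 120; a sketch with placeholder input/output paths (the lab's test harness does the equivalent for you):

// Bind the exercise pipe to source/sink taps and run it as a Hadoop Flow.
Tap source = new Hfs(new TextLine(), "src/test/resources/emails.tsv");
Tap sink = new Hfs(new TextLine(), "build/test/emailcount", SinkMode.REPLACE);
Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
flow.complete();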

128 Q & A How is Cascading similar to Hive and Pig? What is one challenging Hadoop operation that's easy with Cascading? 128

129 NoSQL & Cassandra Big Data and Solr 129 Confidential and Proprietary 2014/15

130 Key Questions Where do you start, when designing a NoSQL schema? What are wide rows, and why should you care? 130

131 What is NoSQL? High availability Linear scaling Flexible data model 131

132 When to switch to NoSQL? If you have to shard your DB If you put memcached in front of your DB If you process your data with Hadoop Don't do this for: Highly-transactional data (make sure you need this) Highly-relational data Small data 132

133 Cassandra Distributed, fault-tolerant NoSQL database system Open source project, Apache Public License Based on aspects of both Dynamo and BigTable Started out at Facebook, now an Apache project Main corporate sponsor is DataStax 133

134 Cassandra Performance Assume > 50GB of typical data MySQL Average write time 300ms Average read time 350ms Cassandra Average write time 0.12ms Average read time 15ms Writes/second 10K per node (4 striped disks, faster w/ssds) Numbers from Netflix & Avinash Lakshman 134

135 Cassandra Columns Columns in a row are actually a sorted map Column name is the key You control sorting of column names Every datum in Cassandra has row key column name value timestamp e.g. row key kkrugler, column name email, value ken@scaleunlimited.com, plus a timestamp 135

136 Cassandra Static Table Use a fixed set of column names, for every row Very similar to typical table in an RDBMS e.g. columns "user id", email, age with a row kkrugler, ken@scaleunlimited.com, <age> 136

137 Cassandra Dynamic Table Often called a wide row - up to 2 billion columns per row The column name contains data (often a timestamp + something) Ordering of columns means you can do a slice of the row data e.g. row kkrugler with post/status columns whose values are "How tall is Aconcagua? I want...", "Heading to BA", "Teaching class" 137

138 Cassandra Key Limitations You must specify the (unique) row key for queries So typical SQL queries on indexed fields don't work The work-around is to have additional tables Don't be afraid of duplicating data - disk space is cheap Think about queries you need, then design your table(s) There's no join support Doing this client-side is a Bad Idea The work-around is to have de-normalized tables Each row contains a complete result 138

139 Cassandra Clustering Every node plays the same role No master, slaves No single point of failure Data distributed by hash of row key 139

140 Cassandra Replication Data replicated between nodes Number of replicas is controllable New nodes are inserted into ring Losing N-1 nodes is OK 140

141 Cassandra Consistency Write consistency options (replication = N) One copy (W=1) Quorum (W = N/2 + 1) All (W=N) Read consistency Eventually consistent Pick 1...N replicas to read from Picking W+R > N means strongly consistent e.g. with N=3, quorum writes (W=2) and quorum reads (R=2) give W+R = 4 > 3 141

142 Cassandra Terminology keyspace = database table (column family) = table row key = unique table key column = column 142

143 Cassandra Goodness Client API: clients in Java, Python, Ruby, PHP, etc. Secondary index: Faster access by non-row key data CQL3: SQL-like query language via command line tool counters: Avoids read-change-update performance hit DataStax Enterprise Hadoop integration: Cassandra as HDFS replacement Solr integration: Easy search of DB contents 143

144 Q & A Where do you start, when designing a NoSQL schema? What are wide rows, and why should you care? 144

145 Continuous & Storm Big Data and Solr 145 Confidential and Proprietary 2014/15

146 Key Questions What key software component is used by both Storm and SolrCloud? Why is grouping "interesting" in Storm? 146

147 What exactly is continuous? For data processing, basically means not batch Streaming data processing and/or fast analytics Closer to "near real time" Canonical example is processing tweet stream But could be log file analysis, feeds, etc And output could be updating a Solr index 147

148 Storm Distributed, fault-tolerant real time computation distributed: scales up based on number of servers fault tolerant: handles failure without data loss real time: continuous, not batch Open sourced by Twitter in 2011 Based on the tweet-processing system from BackType Lead developer is Nathan Marz 148

149 Before Storm Scaling is painful Poor fault-tolerance Coding is tedious 149

150 Storm Cluster Nimbus is the master Zookeeper coordinates Supervisors run tasks Supervisor Zookeeper Supervisor Nimbus Zookeeper Supervisor Zookeeper Supervisor Supervisor 150

151 Storm Concepts Streams - Cascading Pipes Spouts - Cascading Source Taps Bolts - Cascading Functions Topologies - Cascading Flows 151

152 Storm Stream Unbounded sequence of Tuples Tuple is just a list of values (like a row in a CSV file) Each stream has defined field names for each value position Stream tweets has fields user, time, text 152

153 Storm Spout Source of Streams Typically reading from queue Kafka, Kestrel, JMS Or other stream API Twitter firehose RSS feed 153

154 Storm Bolt Process input stream(s) Generate output stream(s) Functions, filters, aggregations Might be endpoint e.g. write to Cassandra 154

155 Storm Topology Network of Spouts & Bolts Connected by Streams Topology defined in code Submitted to Nimbus Run as multiple tasks 155
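
A minimal sketch of defining a topology in code and submitting it, using the pre-Apache backtype.storm API; TestWordSpout ships with Storm for testing, while UppercaseBolt, the component names, and the parallelism numbers are placeholders for illustration:

// Minimal Storm topology: a test spout feeding one bolt, run in a local cluster.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExampleTopology {

    // A trivial bolt: reads the "word" field and emits it upper-cased.
    public static class UppercaseBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getStringByField("word").toUpperCase()));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);                          // 2 spout tasks
        builder.setBolt("upper", new UppercaseBolt(), 4).shuffleGrouping("words");  // 4 bolt tasks

        LocalCluster cluster = new LocalCluster();   // in-process cluster for testing
        cluster.submitTopology("example", new Config(), builder.createTopology());
        Thread.sleep(10000);                         // let it run briefly, then shut down
        cluster.shutdown();
    }
}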

156 Storm Tasks User-controlled parallelism Per Spout & Bolt Tasks run in Supervisors 156

157 Storm - Key Points Does require fairly deep infrastructure stack Zookeeper Zeromq Integration with data source(s) & data sink(s) storm-yarn project helps with this Great for continuous streams of data at scale Can be a better choice than Hadoop Often Hadoop is used because you can "make it work" Batch processing of time-stamped files/directories 157

158 Trident Built on top of Storm Provides more Cascading-like functionality Grouping, Aggregating - processes batches of Tuples as a "stream" Uses persistence layer to maintain state for exactly-once requirements Cassandra, memcached, etc. Can do parallel queries (similar to Drill) against NoSQL databases 158

159 Q & A What key software component is used by both Storm and SolrCloud? Why is grouping "interesting" in Storm? 159

160 helpful Lab Big Data and Solr 160 Confidential and Proprietary 2014/15

161 Key Questions Can you get the helpful example to build & run? Could you find the "most helpful" person? Did anyone solve the reply-to-a-reply challenge? 161

162 helpful Overview Goal: Find the most helpful person on a mailing list Processing mailing list emails, same as for the emailcount lab But this time we want to: Score the original email based on whether the reply indicated it was helpful Assign scores to the original email's author Sum and sort authors by these scores 162

163 helpful Overview Exercise 1 is easy... Only emit a record from the map if it's a reply to another email Then sum counts, so we know the number of replies for each email But this isn't very useful <02A7D1D0-AC15-4F53-A240-1ABEE315ED40@gmail.com> 2 163

164 helpful Overview Let's assume we have a magic function... It calculates whether an email is expressing gratitude (a nonzero score) So for any email that's a reply, and shows gratitude... We can emit <replyid><tab><gratitude score> Then we need some way to connect scores to authors Each <replyid> should be a <msgid> in one of our emails And so we'll have a <msgid><tab><email address> record So the map has to always emit <msgid><tab><email address> 164

165 helpful Job #1 - Map Phase The first job's map needs to assign scores to emails that are: replies to an email (so replyid is not null) where the content indicates gratitude :) It needs to output records for both scores and email addresses <replyid><tab>s<tab><score> <msgid><tab>e<tab><email> 165

166 helpful Map Phase <msgid> <author> <email> <subject> <date> <replyid> <content> -> <msgid> 'e' <email> and <replyid> 's' <score> e.g. 'id-1' <author> <email> <subject> <date> null <content> 'id-2' <author> <email> <subject> <date> 'id-1' <content> -> 'id-1' 'e' <email> 'id-2' 'e' <email> 'id-1' 's' 5 166
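
One possible sketch of this mapper, written against the IStreamingMapper interface from the emailcount lab. The gratitude check here is a trivial stand-in for the "magic function" (it just looks for "thank"), and the class name is illustrative rather than one of the lab skeletons:

// Sketch of Job #1's map: always emit <msgid> 'e' <email>; for grateful replies,
// also emit <replyid> 's' <score>.
import java.io.BufferedReader;
import java.io.PrintWriter;

public class HelpfulJob1MapperSketch implements IStreamingMapper {
    public void map(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // msgid, author, email, subject, date, replyid, content
            String fields[] = FieldUtils.safeSplit(inputLine);
            String msgId = fields[0];
            String email = fields[2].trim().toLowerCase();
            String replyId = fields[5];
            String content = fields[6];

            // Always emit the msgid -> email record, so scores can be joined to authors.
            out.println(msgId + "\t" + "e" + "\t" + email);

            // Original emails have an empty (or "null") replyid field - assumption here.
            boolean isReply = (replyId != null) && !replyId.isEmpty() && !replyId.equals("null");
            if (isReply) {
                // Placeholder gratitude scoring, standing in for the "magic function".
                int score = content.toLowerCase().contains("thank") ? 5 : 0;
                if (score > 0) {
                    out.println(replyId + "\t" + "s" + "\t" + score);
                }
            }
        }
    }
}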

167 helpful Job #1 - Reduce Phase The first job's reduce outputs email address + summed score but only when there's one or more 's' (score) records 'id-1' 'e' 'ken@su.com' 'id-1' 's' 5 'id-1' 's' 2 -> 'ken@su.com' 7 167
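
And a matching reducer sketch: for each message id it remembers the 'e' (email) record and sums any 's' (score) records, emitting email + score only when the score is non-zero. Again a sketch against the lab's IStreamingReducer interface, not the official solution:

// Sketch of Job #1's reduce: input lines are <id> <tab> <'e' or 's'> <tab> <email or score>,
// grouped and sorted by id.
import java.io.BufferedReader;
import java.io.PrintWriter;

public class HelpfulJob1ReducerSketch implements IStreamingReducer {
    public void reduce(BufferedReader in, PrintWriter out) throws Exception {
        String currentId = null;
        String currentEmail = null;
        int score = 0;

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            String fields[] = FieldUtils.safeSplit(inputLine);
            String id = fields[0];

            // New group - flush the previous one if it had any score records.
            if (!id.equals(currentId)) {
                if ((currentEmail != null) && (score > 0)) {
                    out.println(currentEmail + "\t" + score);
                }
                currentId = id;
                currentEmail = null;
                score = 0;
            }

            if (fields[1].equals("e")) {
                currentEmail = fields[2];
            } else {
                score += Integer.parseInt(fields[2]);
            }
        }

        // Don't forget the final group.
        if ((currentEmail != null) && (score > 0)) {
            out.println(currentEmail + "\t" + score);
        }
    }
}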

168 helpful Job #2 The mapper here does nothing except re-emit <email><tab><score> The reducer has to sum the scores, and emit <email><tab><sum_score> 168

169 helpful tests Already set up tests for exercises, called test1, test2, test3, test4 So just run "ant test1" to test your solution to exercise 1, etc. All tests will fail, because the map & reduce functions are skeletons These skeletons are the HelpfulExerciseXJobYMapper & HelpfulExerciseXJobYReducer classes Tests often re-use mapper & reducer classes from previous exercises, if the code is the same 169

170 Lab Details Follow steps in bigdata-solr/helpful/readme-helpful Use "ant testX" to build and test your solution for exercise X Modify helpful as per the README exercises We'll do Q & A in one hour 170

171 Q & A Can you get the helpful example to build & run? Could you find the "most helpful" person? Did anyone solve the reply-to-a-reply challenge? 171

172 Day 1 Wrap-Up Big Data and Solr 172 Confidential and Proprietary 2014/15

173 What We Covered Fundamentals of Hadoop + Lab Hadoop Ecosystem Data processing Workflows + Lab NoSQL storage Continuous data processing 173

174 Tomorrow's Agenda Why use Hadoop with Solr? Workflow Design Scalable Solr Indexing + Lab Augmented Search Solr as NoSQL Solr-based Analytics + Lab Solr Index Optimization + Lab 174


More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Sentimental Analysis using Hadoop Phase 2: Week 2

Sentimental Analysis using Hadoop Phase 2: Week 2 Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular

More information

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00 Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

How To Use Big Data For Telco (For A Telco)

How To Use Big Data For Telco (For A Telco) ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with

More information