Big Data and Solr. Ken Krugler. Confidential and Proprietary 2014/15

2 Big Data and Solr Ken Krugler Confidential and Proprietary 2014/15

3 Welcome to Big Data and Solr This class is about using Solr with big data Topics covered include Hadoop & batch processing workflows Storm & continuous data processing Cassandra & NoSQL Scalable Solr indexing This is not a Solr/SolrCloud tutorial! 3

4 Meet Your Instructor Ken Krugler - direct from Nevada City, California President of Scale Unlimited Apache Software Foundation Member Hadoop/Cascading/Mahout/Cassandra developer and trainer Solr developer and trainer 4

5 Day 1 in a Nutshell Big data overview Hadoop fundamentals Eco-system, other pieces Not-so-gentle intro to workflows Foundation for the "and Solr" piece tomorrow Show of hands for Hadoop background Heard the name Used it some Elbows deep 5

6 Real-world example Real-world solution Hadoop overview Counting words Hadoop map-reduce Hadoop streaming Hadoop distributed file system 6

7 Map-Reduce lab Hadoop Eco-system Cascading & workflows Cassandra & NoSQL Storm & continuous processing Workflow lab 7

8 Real World Example Big Data and Solr 8 Confidential and Proprietary 2014/15

9 Adbeat It's an analytics web site for display advertising You can find out who's advertising where Company has been active for about 3 years now 9

10 What's an Analytics Web Site? Let the user ask questions about data 10

11 Including Sexy Dashboards All driven by slices of the data 11

12 Old system architecture Web crawl running in Amazon cloud Crawl state saved in NoSQL database (MongoDB) Python + queues to fetch & analyze millions of pages/day Data pushed into Amazon storage (S3) Custom database used to store results Supported queries and search Also count/sum of query results 12

13 Back end database Each view or filter change causes queries to be executed sum ad impact for all advertisers on all networks, sort by sum, limit 10 sum ad impact by ad type for advertiser oracle.com For many millions of records Fast, accurate, cheap - pick any two 13

14 Combinatorial Explosion Too many possibilities to pre-calculate everything more than 100,000 publishers more than 1,000,000 advertisers 40 ad networks, 5 date ranges, etc So there are trillions of possible combinations Caching of DB query results isn't very useful 14

15 Trouble in UI Land UI refresh took seconds Well outside of target range of about a second or so 0.1 second: instantaneous 1.0 second: I'm still in the flow 10 seconds: I'm bored 15

16 Trouble in the back office Beefy hardware for multiple DBs was expensive AWS cost > USD$10,000 per month Couldn't process data fast enough The data sets needed to grow significantly (time, depth, breadth) Constant schema changes meant painful data reloading Extract, load, transform (inside of DB) Re-indexing of DB fields 16

17 Was it a Big Data problem? Sometimes hard to describe - think about the three V's Volume - data on 2 billion ads Velocity - millions of new ads each day Variety - ad data, web page crawl results, partner data 17

18 What were key problems? Scalability - data size growing 5x year-by-year Flexibility - how to add new functionality quickly Reliability - don't lose data, regular data updates All of these have impact on Cost, Performance, Quality 18

19 Doing some things right Using NoSQL database (MongoDB) to store web crawl metadata Scalable, flexible schema Using Amazon Web Services for scalable infrastructure Web crawling, storage, some data processing 19

20 Key Problems DB-centric design often has trouble with scaling, flexibility Data volume, velocity and variety were all problems You're only as good as the weakest link 20

21 Real World Solution Big Data and Solr 21 Confidential and Proprietary 2014/15

22 A New Approach Do analytics off-line using Hadoop Pre-generate as much as possible Use Solr as a NoSQL database (including search) 22

23 Hadoop-based analytics Hadoop is very good at parsing text files Crawl results pulled from Amazon S3 storage, parsed Hadoop is very good at counting & summing lots of things Group on advertiser, count ads during different time periods Hadoop can be used for more advanced analytics Parse ad text to extract all phrases Use machine learning bits from Mahout 23

24 What Solr Gives Us Open source enterprise search system Fast, memory-efficient queries (NoSQL) Pre-computed answers to 300M questions Count the number of documents that match a query Sort results by fields Fast searches Find all Flash ads with the word diet 24
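
To make those queries concrete, here is a minimal SolrJ sketch of the kind of request described above - counting and fetching Flash ads that mention "diet". The server URL, core name (ads) and field names (text, adType) are illustrative assumptions, not Adbeat's actual schema.

// Minimal SolrJ 4.x sketch - counts and returns the top Flash ads containing "diet".
// URL, core name and field names are assumptions for illustration only.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AdQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/ads");
        SolrQuery query = new SolrQuery("text:diet AND adType:flash");
        query.setRows(10);                                   // only need the top 10 hits
        QueryResponse response = solr.query(query);
        System.out.println("Matching ads: " + response.getResults().getNumFound());
        solr.shutdown();
    }
}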

25 Obligatory Architectural Slide Single search server 16 shards per index Optimize response time 300M total documents Solr /select AWS Server cc2.8xlarge Jetty Webapp Container Request Handler Shard #1 Shard #2 Shard N Solr Cores (Indexes) 25

26 How to Connect the Dots We have web crawl data - ads, advertisers, publishers, networks text google DIRECTV For Businesses Save $13/mo We have target Solr schemas with the fields defined <field name="network" type="string" indexed="true" stored="false" required="true" /> <field name="publisher" type="string" indexed="true" stored="false" required="true" /> Data Sources f(data)??? Index 26

27 Hadoop ETL Implement appropriate Extract, Transform, Load Extract is just parsing text files that are stored in Amazon's S3 Load is building the Solr index and deploying it to the search servers What about that pesky Transform part? 27

28 Simplicity Itself 80+ Hadoop Jobs Developed with Cascading Every-other-day run is $60 28

29 Where are they now? < 1 second average response time to most queries 10x more data, and growing Many more features 1/5th the monthly cost Regular updates/no data loss 29

30 Key Solution Points Using AWS gave them flexibility to change architecture Switching to Hadoop gave them scalable data processing Using Solr as a NoSQL DB gave them real-time analytics and search 30

31 Hadoop Overview Big Data and Solr 31 Confidential and Proprietary 2014/15

32 Key Questions Why is something like Hadoop needed? What are the two key architectural components? 32

33 How to Crunch a Petabyte Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry, since network errors happen 33

34 Hadoop to the Rescue Open source project Under Apache Software Foundation Other distributions - Cloudera, MapR, etc. Scalable - many servers with lots of cores and spindles Reliable - detect failures, redundant storage Fault-tolerant - auto-retry, self-healing Simple - use many servers as one really big computer 34

35 Logical Architecture Cluster Execution Storage Logically, Hadoop is simply a computing cluster that provides: a Storage layer, and an Execution layer 35

36 Logical Architecture Users Cluster Client Client Execution job job job Storage Client Where users can submit jobs that: Run on the Execution layer, and Read/write data from the Storage layer 36

37 Storage Layer Hadoop Distributed File System (aka HDFS) Runs on top of regular OS file system, typically Linux ext3 Fixed-size blocks (64MB by default) that are replicated Write once, read many; optimized for streaming in and out Geography Geography Rack Rack Rack Node Node Node Node... block block block block block block block block 37

38 Execution Layer Hadoop Map-Reduce Responsible for running a job in parallel on many servers Handles re-trying a task that fails, validating complete results Jobs consist of special map and reduce operations Geography Geography Rack Rack Rack Node map Node Node Node... map map map map reduce reduce reduce 38

39 Scalable Cluster Rack Rack Rack Node Node Node Node... cpu Execution disk Storage Virtual execution & storage layers span many nodes (servers) Scales linearly (sort of) with cores and disks. 39

40 Reliable Each block is replicated, typically three times. Each block is checksummed. Each task must succeed, or the job fails. All intermediate data copies are validated. 40

41 Fault-tolerant Failed tasks are automatically retried. Failed data transfers are automatically retried. Servers can join and leave the cluster at any time. 41

42 Simple Node Node Node Node... cpu Execution disk Storage Reduces complexity Conceptual operating system that spans many CPUs & disks 42

43 Typical Hadoop Cluster Has one master server - high quality box NameNode process - manages file system JobTracker process - manages tasks Has multiple slave servers - commodity hardware DataNode process - manages file system blocks on local drives TaskTracker process - runs tasks on server Uses high speed network between all servers 43

44 Architectural Components Solid boxes are unique applications Dashed boxes are child JVM instances (on same node as parent) Dotted boxes are blocks of managed files (on same node as parent) NameNode DataNode DataNode DataNode DataNode data block Client jobs JobTracker tasks TaskTracker mapper child mapper jvm child mapper jvm child jvm reducer child reducer jvm child reducer jvm child jvm Master Slaves 44

45 Q & A Why is something like Hadoop needed? What are the two key architectural components? 45

46 HDFS Big Data and Solr 46 Confidential and Proprietary 2014/15

47 Key Questions Why does Hadoop keep multiple copies of the data? What are the two processes that manage the file system? What are key limitations to the file system? 47

48 Virtual File System Treats many disks on many servers as one huge, logical volume Files have paths, same as a regular file system For example: /user/kkrugler/data/big-file.txt Data is stored in 1...n blocks Blocks in file all have same max size (typically 64MB) The DataNode process manages blocks of data on a slave. The NameNode process keeps track of file metadata on the master. 48

49 Replication Each block is stored on several different disks (default is 3) Hadoop tries to copy blocks to different servers and racks Protects data against disk, server, rack failures Reduces the need to move data to code 49

50 Error Recovery Slaves constantly check in with the master. Aka the heartbeat Data is automatically replicated if a disk or server goes away. Data is checksummed to protect against bad hardware. Per-block, multiple CRC values Verified during writes and reads Background verification (block scanner) 50

51 Performance / Scaling Optimized for streaming data in/out Data rates 30% - 50% of max raw disk rate No random access during writes (streaming out) Write once, read many Limit to scaling is NameNode Memory, CPU horsepower But 10K servers are probably more than you'll ever need 51

52 NameNode Runs on master node Was a single point of failure (Hadoop 2.0 adds HA) - before that there were no built-in hot failover mechanisms Maintains the filesystem namespace - the files and hierarchical directories Executes filesystem namespace operations - opening, closing, and renaming Maintains the mapping from data blocks to DataNodes - aka the Block Map 52

53 DataNodes DataNode DataNode DataNode DataNode data block Files stored on HDFS are stored as blocks on DataNodes Manages storage attached to the node that it runs on Performs block creation, deletion, replication as directed by the NameNode Typically one instance per node in the cluster Data never flows through the NameNode, only DataNodes Heartbeats sent to the NameNode determine availability 53

54 Getting Data into HDFS Using Hadoop command line tool hadoop fs -copyFromLocal <local path> <HDFS path> Using Hadoop DFSClient Java class Create output stream, write to it From external data sources, via API requests HTTP requests to web servers JDBC requests to databases 54
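
For the programmatic route, client code usually goes through the Hadoop FileSystem API (which wraps DFSClient); a minimal sketch of creating an output stream and writing to it, re-using the example path from the earlier HDFS slide (the content written is arbitrary):

// Minimal sketch of writing a file to HDFS via the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml settings
        FileSystem fs = FileSystem.get(conf);       // HDFS when the default FS points at it
        FSDataOutputStream out = fs.create(new Path("/user/kkrugler/data/big-file.txt"));
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}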

55 Q & A Why does Hadoop keep multiple copies of the data? What are the two processes that manage the file system? What are some limitations to the Hadoop file system? 55

56 Counting Words Big Data and Solr 56 Confidential and Proprietary 2014/15

57 How Long to Count a Page of Words? Many different approaches Take the first word, count all occurrences in the document, repeat Or keep a list of words with counts Clearly doesn't scale well What happens if you have one million pages? 57

58 How Can a Room of People Count Words? Some people are "mappers" They cut up sentences into individual words Some people are "reducers" They count the total number of words Each "reducer" gets all words that end with a letter in their range e.g. "obvious" ends with "s", so it goes to the O - T reducer Let's give that a try... 58

59 Making it Faster The "mappers" could sort their words So then the "reducers" could merge-sort the lists from each mapper The "mappers" could combine word counts If they have two words that are the same, write "2" on one slip This reduces the number of words that have to be given to the "reducers" The "mappers" could run to the "reducers" That's like having a faster network The "reducers" could have more fair ranges of letters U-Z has it easy, O-T has to work much harder 59

60 Map-Reduce Big Data and Solr 60 Confidential and Proprietary 2014/15

61 Key Questions What is the fundamental format for all data? Why might one reduce task take much longer than the others? 61

62 Definitions Key Value Pair -> two units of data, exchanged between Map & Reduce Map -> The map function in the MapReduce algorithm user defined converts each input Key Value Pair to 0...n output Key Value Pairs Reduce -> The reduce function in the MapReduce algorithm user defined converts each input Key + all Values to 0...n output Key Value Pairs Group -> A built-in operation that happens between Map and Reduce ensures each Key passed to Reduce includes all Values 62

63 Definitions Mapper Map Reducer Reduce Mapper -> A process that executes Map for each input Key Value Pair Reducer -> A process that executes Reduce for each unique map Key Shuffling -> The magic between the Mapper and Reducer Job -> A single Map and Reduce implementation that are submitted together and that execute on many Mappers and Reducers on the Cluster 63

64 [Diagram: multiple Mappers feeding multiple Reducers through the Shuffle] 64

65 Recap MapReduce is conceptually Map->Group->Reduce Map and Reduce are user defined functions that execute on Key Value Pairs A Job pushes one Map and one Reduce implementation onto the cluster to be executed by distributed Mappers and Reducers The Group function is inherent to the algorithm and is executed during Shuffling 65

66 How MapReduce Works Map translates input to keys and values to new keys and values [K1,V1] Map [K2,V2] System Groups each unique key with all its values [K2,V2] Group [K2,{V2,V2,...}] Reduce translates the values of each unique key to new keys and values [K2,{V2,V2,...}] Reduce [K3,V3] 66

67 Things to consider Map: is required in a Job [K1,V1] Map [K2,V2] may emit 0 or more Key Value Pairs [K2,V2] Reduce: is optional in a Job [K2,{V2,V2,...}] Reduce [K3,V3] sees K2 keys in sort order but K2 keys will be randomly distributed across Reducers the collection of V2 values are unordered may emit 0 or more Key Value Pairs [K3,V3] 67

68 Canonical Example - Word Count Word Count Read a document, parse out the words, count the frequency of each word Specifically, in MapReduce With a document consisting of lines of text Translate each line of text into key = word and value = 1 e.g. <the,1> <quick,1> <brown,1> <fox,1> <jumped,1> Sum the values (ones) for each unique word 68
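
For comparison with the streaming version used later in the labs, a minimal word count Mapper and Reducer sketch against Hadoop's classic org.apache.hadoop.mapred API - an illustration of the key/value flow described above, not part of the course code:

// Word count sketch: map emits <word, 1>, reduce sums the 1s per word.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // [K1,V1] = <line offset, line of text>  ->  [K2,V2] = <word, 1>
    public static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, ONE);
                }
            }
        }
    }

    // [K2,{V2,V2,...}] = <word, {1,1,...}>  ->  [K3,V3] = <word, sum>
    public static class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}

A JobConf/JobClient driver (see the JobTracker slide later) would bind these two classes to input and output paths and submit the job; a combiner (often the reducer class itself) would do the map-side pre-summing of counts described on the previous slide.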

69 [1, When in the Course of human events, it becomes] [2, necessary for one people to dissolve the political bands] [3, which have connected them with another, and to assume] [n, line of text] [K1,V1] Map (User Defined) [K2,V2] [When,1] [in,1] [the,1] [Course,1] [k,1] Group [K2,{V2,V2,...}] [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] [k,{v,v,...}] Reduce (User Defined) [K3,V3] [When,4] [people,6] [dissolve,3] [connected,5] [k,sum(v,v,v,..)] 69

70 Divide & Conquer (splitting data) Because The Map function only cares about the current key and value, and The Reduce function only cares about the current key and its values Then A Mapper can invoke Map on an arbitrary number of input keys and values or just some fraction of the input data set A Reducer can invoke Reduce on an arbitrary number of the unique keys but all the values for that key 70

71 [Diagram: a file is divided into splits (split1, split2, split3, split4, ...); each split is processed by a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups [K2,{V2,V2,...}] for the Reducers, which write [K3,V3] as part-000n files into the output directory] Mappers must complete before Reducers can begin 71

72 Divide & Conquer (parallelizable) Because Each Mapper is independent and processes part of the whole, and Each Reducer is independent and processes part of the whole Then Any number of Mappers can run on each node, and Any number of Reducers can run on each node, and The cluster can contain any number of nodes 72

73 JobTracker Client jobs JobTracker Is a single point of failure Determines # Mapper Tasks from file splits via InputFormat Uses predefined value for # Reducer Tasks Client applications use JobClient to submit jobs and query status Command line use hadoop job <commands> Web status console via the JobTracker web UI 73
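
As a sketch of what "use JobClient to submit jobs" looks like in code (classic org.apache.hadoop.mapred API; the identity mapper/reducer and the /input and /output paths are placeholders, not from the course labs):

// Configure and submit a job via JobClient; runJob() blocks until the JobTracker
// reports completion. Identity classes simply pass input key/value pairs through.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJobExample.class);
        conf.setJobName("identity-example");
        conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat keys
        conf.setOutputValueClass(Text.class);         // TextInputFormat values
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(4);                    // the predefined reducer count mentioned above
        FileInputFormat.setInputPaths(conf, new Path("/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        JobClient.runJob(conf);
    }
}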

74 TaskTracker mapper child mapper jvm child mapper jvm child jvm TaskTracker reducer child reducer jvm child reducer jvm child jvm Spawns each Task as a new child JVM Max # mapper and reducer tasks set independently Can pass child JVM opts via mapred.child.java.opts Can re-use JVM to avoid overhead of task initialization 74

75 Q & A What is the fundamental format for all data? Why might one reduce task take much longer than the others? 75

76 Hadoop Streaming Big Data and Solr 76 Confidential and Proprietary 2014/15

77 Key Questions Why can't the key contain tab characters? Why do you have to keep track of the key in your reducer? 77

78 Streaming Overview Use any executable or script as the Mapper and/or Reducer awk, sed, grep, bash, etc Python, Ruby, Perl Mapper or Reducer may also be a Java class Uses Unix standard input and output streams Essentially a pre-created Hadoop job that knows how to... run code separate from the TaskTracker JVM send values via stdin, and get results via stdout 78

79 Streaming Text vs Binary Typically streaming uses text (UTF-8) for everything Map just gets the value (offset into file is the key, and is tossed) Reduce gets key/value pairs as <key><tab><value><eol> Though TypedBytes lets you send binary data in, get it back out Needs language-specific implementation e.g. see Dumbo project for using this with Python 79

80 Streaming Map Function Each line read from stdin has one value to be processed Output is written as one key/value pair per line of text, to stdout
#!/usr/bin/env python
import sys
for line in sys.stdin:
    ...  # Split the line of text
    ...  # do some processing
    print "%s\t%s" % (key, count)
80

81 Streaming Reduce Function Each line read from stdin has one key/value pair to be processed For a group, you're called multiple times with the same key There's no iterator Output is written as one key/value pair per line of text, to stdout 81

82
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input file:/users/user/sandbox/input/ncdc/sample.txt \
  -output file:/users/user/sandbox/output \
  -mapper /Users/user/sandbox/src/main/map.rb \
  -reducer /Users/user/sandbox/src/main/reduce.rb
Many optional parameters to control... Fields in keys Partitioning etc. etc. 82

83 Q & A Why can't the key contain tab characters? Why do you have to keep track of the key in your reducer? 83

84 emailcount Lab Big Data and Solr 84 Confidential and Proprietary 2014/15

85 Key Questions Can you get the emailcount example to build & run? What was challenging about the exercises? 85

86 emailcount Overview Goal is to count emails by author (email address) Input is text file or directory of text files with pre-processed emails E.g. src/test/resources/emails.tsv One email per line, with tab-separated values msgid, author, email, subject, date, replyid, content line feeds in content are replaced by the text "\n", tabs by "\t" <msg1>!author1!email1!subject1! T14:07:23Z!<msg101>!content1 86

87 emailcount Map Map task extracts the email address from the input line Uses FieldUtils.safeSplit(string) to correctly handle empty fields Emits key = <email address>, value = <1> for key-value pairs <msgid> <author> <email> <subject> <date> <replyid> <content> -> <email> 1 87

88 emailcount Reduce Reduce task sums the counts of each unique email address Emits key = <email address>, value = <summed count> e.g. a.karpenko@oxseed.com 5 <email1> 1 <email1> 1 <email1> 1 <email2> 1 -> <email1> 3 <email2> 1 88

89 Streaming-like Java API public interface IStreamingMapper { void map(BufferedReader in, PrintWriter out) throws Exception; } public interface IStreamingReducer { void reduce(BufferedReader in, PrintWriter out) throws Exception; } 89

90 IStreamingMapper for emailcount
public interface IStreamingMapper {
    void map(BufferedReader in, PrintWriter out) throws Exception;
}

public class CountJob1Mapper implements IStreamingMapper {
    public void map(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // We have a line of text with a bunch of tab-separated fields.
            // msgid, author, email, subject, date, replyid, content
            String fields[] = FieldUtils.safeSplit(inputLine);

            // We want to emit the email address and a count of 1
            out.println(fields[2].trim().toLowerCase() + "\t" + 1);
        }
    }
}
90

91 IStreamingReducer for emailcount
public interface IStreamingReducer {
    void reduce(BufferedReader in, PrintWriter out) throws Exception;
}

public class CountJob1Reducer implements IStreamingReducer {
    public void reduce(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine, currentEmail = null;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            // We have a line of text with two tab-separated fields (email, count)
            String fields[] = FieldUtils.safeSplit(inputLine);
            String email = fields[0];
            int emailCount = Integer.parseInt(fields[1]);

            if (currentEmail == null) {
                currentEmail = email;
            }

            if (email.equals(currentEmail)) {
                count += emailCount;
            } else {
                out.println(currentEmail + "\t" + count);
                currentEmail = email;
                count = emailCount;
            }
        }

        // Emit the final group
        if (currentEmail != null) {
            out.println(currentEmail + "\t" + count);
        }
    }
}
91

92 Test code will run StreamingTool In src/test/java/com/scaleunlimited/labs/emailcount/ test() method inside of the lab's test class
public void test() throws Exception {
    String args[] = {
        "-input", "src/test/resources/emails.tsv",
        "-output", "build/test/EmailCountLabTest/",
        "-mapper", CountMapper.class.getCanonicalName(),
        "-reducer", CountReducer.class.getCanonicalName(),
    };
    StreamingTool.main(args);

    assertEquals(40, TestUtils.getNumLines("build/test/EmailCountLabTest/"));
}
92

93 You can "chain" Jobs
String args1[] = {
    "-input", INPUT_FILENAME,
    "-output", JOB_1_OUTPUT_DIRNAME,
    "-mapper", CountExercise2Job1Mapper.class.getCanonicalName(),
    "-reducer", CountExercise2Job1Reducer.class.getCanonicalName(),
};
StreamingTool.main(args1);

// We'll use the output of the first job as the input for this second job.
String args2[] = {
    "-input", JOB_1_OUTPUT_DIRNAME,
    "-output", JOB_2_OUTPUT_DIRNAME,
    "-mapper", CountExercise3Job2Mapper.class.getCanonicalName(),
    "-reducer", CountExercise3Job2Reducer.class.getCanonicalName(),
};
StreamingTool.main(args2);
93

94 Additional options -debug prints every record to the console -mappertimeout & -reducertimeout are useful when debugging You can hit a breakpoint and look at variables without thread termination
String args1[] = {
    "-input", INPUT_FILENAME,
    "-output", JOB_1_OUTPUT_DIRNAME,
    "-mapper", CountExercise2Job1Mapper.class.getCanonicalName(),
    "-reducer", CountExercise2Job1Reducer.class.getCanonicalName(),
    "-debug",
    "-mappertimeout", "100000",
    "-reducertimeout", "100000",
};
94

95 Lab Details Follow steps in bigdata-solr/emailcount/readme-emailcount Use "ant test" to build and test the code Modify emailcount as per the README exercises Use "ant test1", "ant test2", etc to test solutions Exercise 1 in all labs should be doable in the time you've got Subsequent exercises are harder If you finish the "challenge" exercise, I'll be very impressed :) Solutions are in bigdata-solr/emailcount/solutions We'll do Q & A in one hour 95

96 Q & A Can you get the emailcount example to build & run? What was challenging about the exercises? 96

97 Hadoop Eco-system Big Data and Solr 97 Confidential and Proprietary 2014/15

98 Key Questions What's the key difference between Hive and Drill? What is one common problem with using Pig? 98

99 Hive Hive is a data warehouse built on top of flat files in HDFS Developed by Facebook Now a top-level Apache project Very active developer/user community 99

100 Hive Data Organization into Tables with logical and hash partitioning A Metastore to store metadata about Tables/Partitions/Buckets Partitions are how the data is stored (collections of related rows) Buckets further divide Partitions based on a hash of a column metadata is actually stored in an RDBMS A SQL-like query language over object data stored in Tables Custom types, structs of primitive types User defined functions 100

101 Hive - Creating Tables CREATE TABLE page_view(viewtime DATETIME, userid MEDIUMINT, page_url STRING, referrer_url STRING, friends ARRAY<BIGINT>, properties MAP<STRING, STRING>, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(date DATETIME, country STRING) BUCKETED ON (userid) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY \001 COLLECTION ITEMS TERMINATED BY \002 MAP KEYS TERMINATED BY \003 LINES TERMINATED BY \012 STORED AS COMPRESSED 101

102 Hive - Loading Data CREATE EXTERNAL TABLE page_view_stg(viewtime DATETIME, userid MEDIUMINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User', country STRING COMMENT 'country of origination') COMMENT 'This is the staging page view table' ROW FORMAT DELIMITED FIELDS TERMINATED BY \054 LINES TERMINATED BY \012 LOCATION '/user/facebook/staging/page_view'; hadoop dfs -put /tmp/pv_ txt /user/facebook/staging/page_view FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(date= , country='us') SELECT pvs.viewtime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'; 102

103 Hive - Queries Queries always write results into a table FROM user INSERT OVERWRITE TABLE user_active SELECT user.* WHERE user.active = true; FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt' SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY pv_users.age; 103

104 Pig Pig is a data flow programming environment for processing large files PigLatin is the text language it provides for defining data flows Developed by researchers in Yahoo! and widely used internally 104

105 PigLatin - Example W = LOAD 'filename' AS (url, outlink); G = GROUP W by url; R = FOREACH G { FW = FILTER W BY outlink eq ' PW = FW.outlink; DW = DISTINCT PW; GENERATE group, COUNT(DW); } Adding new functionality requires a user defined function aka the dreaded "UDF" Brings back bad memories of stored procedures 105

106 Drill Distributed system for interactive analysis of large-scale datasets Based on Google's paper about their Dremel system Apache Incubator project Very hot area right now Impala (Cloudera) Druid (Metamarkets) etc, etc, etc 106

107 Interactive Analysis Systems designed to... Support very large datasets (petabytes) on many servers (1000s) Quickly answer SQL-ish queries, e.g. in 10s to 100s of milliseconds Easily handle "nested" data formats, versus requiring flattening Uses "columnar" format for storing data ORCFile, Parquet, Trevni, etc. 107

108 Q & A What's the key difference between Hive and Drill? What is one common problem with using Pig? 108

109 Hadoop Summary Big Data and Solr 109 Confidential and Proprietary 2014/15

110 Good Use Cases for Hadoop Data doesn't fit on one server Data can't be processed fast enough on one server Batch processing latency is an acceptable trade-off 110

111 Bad Use Cases for Hadoop Problems that aren't great for solving with Hadoop Small data problems Real-time data processing 111

112 Success Factors The standard list - clear definition of victory, timetable, etc. Where is your data coming from, and going to? Often the biggest chunk of project time What's the simplest Hadoop solution that might work? Streaming, for non-Java solutions Hive, for query-centric problems Cascading, for workflows 112

113 Beware the Hadoopaphile Hadoop is a big hammer But not every problem is a nail And it's not magic pixie dust 113

114 Resources Hadoop mailing lists Users groups - e.g. Hadoop API Books Hadoop: The Definitive Guide, 2nd edition by Tom White 114

115 Workflows & Cascading Big Data and Solr 115 Confidential and Proprietary 2014/15

116 Key Questions How is Cascading similar to Hive and Pig? What is one challenging Hadoop operation that's easy with Cascading? 116

117 Workflow Definition Complex processing of (semi) structured data at scale Complex: not something easily handled via map-reduce Processing: conversion of one or more input data streams Structured: data typically has defined fields with specific meanings At scale: beyond what you can do on one server 117

118 Cascading An API for defining data processing workflows Open source project, Apache Public License First public release Jan 2008 Current releases support Hadoop through 2.5.x 118

119 Cascading Implements all standard data processing operations function, filter, group by, co-group, aggregator, etc Complex applications are built by chaining operations with pipes Pipe assemblies are bound to input and output data And are planned into the minimum number of MapReduce jobs Similar to Hive, Pig and other higher-level abstractions 119

120 Cascading Code
Pipe pipe = new Pipe("wordcount");
pipe = new Each(pipe, new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"));
pipe = new CountBy(pipe, new Fields("word"), new Fields("count"));
pipe = new GroupBy(pipe, new Fields("count"), true);

Tap source = new Hfs(new TextLine(), "wordcount.txt");
Tap sink = new Hfs(new TextLine(), "output", SinkMode.REPLACE);

Flow f = new HadoopFlowConnector().connect(source, sink, pipe);
f.complete();
120

121 Hfs['TextLine[['line']->[ALL]]']['src/test/resources/wordcount.txt']'] Each('wordcount')[RegexSplitGenerator[decl:'word'][args:1]] Each('wordcount')[CompositeFunction[decl:'word', 'count']] GroupBy('wordcount')[by:['word']] Every('wordcount')[Sum[decl:'count'][args:1]] TempHfs['SequenceFile[['word', 'count']]'][ /wordcount/] GroupBy('wordcount')[by:['count']] Hfs['TextLine[['offset', 'line']->[all]]']['build/test/minwordcounttest']'] 121

122 Joining Streams of Data Common requirement to join two data sets, using a common key E.g. log file data and country data, by IP address 'Log IP', 'Status' 'Data IP', 'Country' Join Tuple streams using 'Log IP' and 'Data IP' 'Log IP', 'Status', 'Data IP', 'Country'
Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe
    new Fields("Log IP"),     // left-side field for joining
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side field for joining
    new LeftJoin());          // type of join to do
122

123 [Generated Cascading flow plan for a larger workflow: several source Taps feeding chains of Each and CoGroup steps, linked by TempHfs sequence files, with GroupBy/Every aggregations before the final sinks - the field names in this diagram did not survive transcription] 123

124 Workflow Design Simplicity itself :) Sketch as operations & groups connected with pipes 124

125 emailcount Lab Exercise 1 Count total emails and total characters, by author First step is to parse input text to create Tuples with fields Use the handy RegexSplitter function
Pipe pipe = new Pipe("emails");
Fields f = new Fields("msg_id", "author", "email", "subject", "date", "reply_id", "content");
pipe = new Each(pipe, new Fields("line"), new RegexSplitter(f));
125

126 emailcount Lab Exercise 1 Next step is to calculate the length of each email Use the handy ExpressionFunction
Fields lengthField = new Fields("content_length");
Function func = new ExpressionFunction(lengthField, "content.length()", String.class);
pipe = new Each(pipe, new Fields("content"), func, Fields.ALL);
126

127 emailcount Lab Exercise 1 Finally group on the email address, and count emails/sum length AggregateBy does map-side pre-aggregations for efficiency But wait, we'll also group on the summed length, to sort high to low
AggregateBy count = new CountBy(new Fields("num_emails"));
AggregateBy sum = new SumBy(lengthField, new Fields("total"), Integer.class);
pipe = new AggregateBy(pipe, new Fields("email"), count, sum);
pipe = new GroupBy(pipe, new Fields("total"), true);
127
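
To actually run the exercise, the assembled pipe gets bound to taps and planned into a Flow, following the same pattern as the word count code on slide 120; a sketch with placeholder input/output paths (the lab's test harness does the equivalent for you):

// Bind the exercise pipe to source/sink taps and run it as a Hadoop Flow.
Tap source = new Hfs(new TextLine(), "src/test/resources/emails.tsv");
Tap sink = new Hfs(new TextLine(), "build/test/emailcount", SinkMode.REPLACE);
Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
flow.complete();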

128 Q & A How is Cascading similar to Hive and Pig? What is one challenging Hadoop operation that's easy with Cascading? 128

129 NoSQL & Cassandra Big Data and Solr 129 Confidential and Proprietary 2014/15

130 Key Questions Where do you start, when designing a NoSQL schema? What are wide rows, and why should you care? 130

131 What is NoSQL? High availability Linear scaling Flexible data model 131

132 When to switch to NoSQL? If you have to shard your DB If you put memcached in front of your DB If you process your data with Hadoop Don't do this for: Highly-transactional data (make sure you need this) Highly-relational data Small data 132

133 Cassandra Distributed, fault-tolerant NoSQL database system Open source project, Apache Public License Based on aspects of both Dynamo and BigTable Started out at Facebook, now an Apache project Main corporate sponsor is DataStax 133

134 Cassandra Performance Assume > 50GB of typical data MySQL Average write time 300ms Average read time 350ms Cassandra Average write time 0.12ms Average read time 15ms Writes/second 10K per node (4 striped disks, faster w/ssds) Numbers from Netflix & Avinash Lakshman 134

135 Cassandra Columns Columns in a row are actually a sorted map Column name is the key You control sorting of column names Every datum in Cassandra has row key column name value timestamp e.g. row key kkrugler, column name email, value ken@scaleunlimited.com, plus a timestamp 135

136 Cassandra Static Table Use a fixed set of column names, for every row Very similar to typical table in an RDBMS e.g. columns "user id", email, age with a row kkrugler, ken@scaleunlimited.com, <age> 136

137 Cassandra Dynamic Table Often called a wide row - up to 2 billion columns per row The column name contains data (often a timestamp + something) Ordering of columns means you can do a slice of the row data e.g. row kkrugler with post/status columns whose values are "How tall is Aconcagua? I want...", "Heading to BA", "Teaching class" 137

138 Cassandra Key Limitations You must specify the (unique) row key for queries So typical SQL queries on indexed fields don't work The work-around is to have additional tables Don't be afraid of duplicating data - disk space is cheap Think about queries you need, then design your table(s) There's no join support Doing this client-side is a Bad Idea The work-around is to have de-normalized tables Each row contains a complete result 138

139 Cassandra Clustering Every node plays the same role No master, slaves No single point of failure Data distributed by hash of row key 139

140 Cassandra Replication Data replicated between nodes Number of replicas is controllable New nodes are inserted into ring Losing N-1 nodes is OK 140

141 Cassandra Consistency Write consistency options (replication = N) One copy (W=1) Quorum (W = N/2 + 1) All (W=N) Read consistency Eventually consistent Pick 1...N replicas to read from Picking W+R > N means strongly consistent e.g. with N=3, quorum writes (W=2) and quorum reads (R=2) give W+R = 4 > 3 141

142 Cassandra Terminology keyspace = database table (column family) = table row key = unique table key column = column 142

143 Cassandra Goodness Client API: clients in Java, Python, Ruby, PHP, etc. Secondary index: Faster access by non-row key data CQL3: SQL-like query language via command line tool counters: Avoids read-change-update performance hit DataStax Enterprise Hadoop integration: Cassandra as HDFS replacement Solr integration: Easy search of DB contents 143

144 Q & A Where do you start, when designing a NoSQL schema? What are wide rows, and why should you care? 144

145 Continuous & Storm Big Data and Solr 145 Confidential and Proprietary 2014/15

146 Key Questions What key software component is used by both Storm and SolrCloud? Why is grouping "interesting" in Storm? 146

147 What exactly is continuous? For data processing, basically means not batch Streaming data processing and/or fast analytics Closer to "near real time" Canonical example is processing tweet stream But could be log file analysis, feeds, etc And output could be updating a Solr index 147

148 Storm Distributed, fault-tolerant real time computation distributed: scales up based on number of servers fault tolerant: handles failure without data loss real time: continuous, not batch Open sourced by Twitter in 2011 Based on the tweet-processing system from BackType Lead developer is Nathan Marz 148

149 Before Storm Scaling is painful Poor fault-tolerance Coding is tedious 149

150 Storm Cluster Nimbus is the master Zookeeper coordinates Supervisors run tasks Supervisor Zookeeper Supervisor Nimbus Zookeeper Supervisor Zookeeper Supervisor Supervisor 150

151 Storm Concepts Streams - Cascading Pipes Spouts - Cascading Source Taps Bolts - Cascading Functions Topologies - Cascading Flows 151

152 Storm Stream Unbounded sequence of Tuples Tuple is just a list of values (like a row in a CSV file) Each stream has defined field names for each value position Stream tweets has fields user, time, text 152

153 Storm Spout Source of Streams Typically reading from queue Kafka, Kestrel, JMS Or other stream API Twitter firehose RSS feed 153

154 Storm Bolt Process input stream(s) Generate output stream(s) Functions, filters, aggregations Might be endpoint e.g. write to Cassandra 154

155 Storm Topology Network of Spouts & Bolts Connected by Streams Topology defined in code Submitted to Nimbus Run as multiple tasks 155
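
A minimal sketch of defining a topology in code and submitting it, using the pre-Apache backtype.storm API; TestWordSpout ships with Storm for testing, while UppercaseBolt, the component names, and the parallelism numbers are placeholders for illustration:

// Minimal Storm topology: a test spout feeding one bolt, run in a local cluster.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExampleTopology {

    // A trivial bolt: reads the "word" field and emits it upper-cased.
    public static class UppercaseBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getStringByField("word").toUpperCase()));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);                          // 2 spout tasks
        builder.setBolt("upper", new UppercaseBolt(), 4).shuffleGrouping("words");  // 4 bolt tasks

        LocalCluster cluster = new LocalCluster();   // in-process cluster for testing
        cluster.submitTopology("example", new Config(), builder.createTopology());
        Thread.sleep(10000);                         // let it run briefly, then shut down
        cluster.shutdown();
    }
}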

156 Storm Tasks User-controlled parallelism Per Spout & Bolt Tasks run in Supervisors 156

157 Storm - Key Points Does require fairly deep infrastructure stack Zookeeper Zeromq Integration with data source(s) & data sink(s) storm-yarn project helps with this Great for continuous streams of data at scale Can be a better choice than Hadoop Often Hadoop is used because you can "make it work" Batch processing of time-stamped files/directories 157

158 Trident Built on top of Storm Provides more Cascading-like functionality Grouping, Aggregating - processes batches of Tuples as a "stream" Uses persistence layer to maintain state for exactly-once requirements Cassandra, memcached, etc. Can do parallel queries (similar to Drill) against NoSQL databases 158

159 Q & A What key software component is used by both Storm and SolrCloud? Why is grouping "interesting" in Storm? 159

160 helpful Lab Big Data and Solr 160 Confidential and Proprietary 2014/15

161 Key Questions Can you get the helpful example to build & run? Could you find the "most helpful" person? Did anyone solve the reply-to-a-reply challenge? 161

162 helpful Overview Goal: Find the most helpful person on a mailing list Processing mailing list emails, same as for the emailcount lab But this time we want to: Score the original email based on whether the reply indicated it was helpful Assign scores to the original email's author Sum and sort authors by these scores 162

163 helpful Overview Exercise 1 is easy... Only emit a record from the map if it's a reply to another email Then sum counts, so we know the number of replies for each email But this isn't very useful <02A7D1D0-AC15-4F53-A240-1ABEE315ED40@gmail.com> 2 163

164 helpful Overview Let's assume we have a magic function... It calculates whether an email is expressing gratitude (a nonzero score) So for any email that's a reply, and shows gratitude... We can emit <replyid><tab><gratitude score> Then we need some way to connect scores to authors Each <replyid> should be a <msgid> in one of our emails And so we'll have a <msgid><tab><email address> record So the map has to always emit <msgid><tab><email address> 164

165 helpful Job #1 - Map Phase The first job's map needs to assign scores to emails that are: replies to an email (so replyid is not null) where the content indicates gratitude :) It needs to output records for both scores and email addresses <replyid><tab>s<tab><score> <msgid><tab>e<tab><email> 165

166 helpful Map Phase <msgid> <author> <email> <subject> <date> <replyid> <content> -> <msgid> 'e' <email> and <replyid> 's' <score> e.g. 'id-1' <author> <email> <subject> <date> null <content> 'id-2' <author> <email> <subject> <date> 'id-1' <content> -> 'id-1' 'e' <email> 'id-2' 'e' <email> 'id-1' 's' 5 166
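
One possible sketch of this mapper, written against the IStreamingMapper interface from the emailcount lab. The gratitude check here is a trivial stand-in for the "magic function" (it just looks for "thank"), and the class name is illustrative rather than one of the lab skeletons:

// Sketch of Job #1's map: always emit <msgid> 'e' <email>; for grateful replies,
// also emit <replyid> 's' <score>.
import java.io.BufferedReader;
import java.io.PrintWriter;

public class HelpfulJob1MapperSketch implements IStreamingMapper {
    public void map(BufferedReader in, PrintWriter out) throws Exception {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // msgid, author, email, subject, date, replyid, content
            String fields[] = FieldUtils.safeSplit(inputLine);
            String msgId = fields[0];
            String email = fields[2].trim().toLowerCase();
            String replyId = fields[5];
            String content = fields[6];

            // Always emit the msgid -> email record, so scores can be joined to authors.
            out.println(msgId + "\t" + "e" + "\t" + email);

            // Original emails have an empty (or "null") replyid field - assumption here.
            boolean isReply = (replyId != null) && !replyId.isEmpty() && !replyId.equals("null");
            if (isReply) {
                // Placeholder gratitude scoring, standing in for the "magic function".
                int score = content.toLowerCase().contains("thank") ? 5 : 0;
                if (score > 0) {
                    out.println(replyId + "\t" + "s" + "\t" + score);
                }
            }
        }
    }
}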

167 helpful Job #1 - Reduce Phase The first job's reduce outputs email address + summed score but only when there's one or more 's' (score) records 'id-1' 'e' 'ken@su.com' 'id-1' 's' 5 'id-1' 's' 2 -> 'ken@su.com' 7 167
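
And a matching reducer sketch: for each message id it remembers the 'e' (email) record and sums any 's' (score) records, emitting email + score only when the score is non-zero. Again a sketch against the lab's IStreamingReducer interface, not the official solution:

// Sketch of Job #1's reduce: input lines are <id> <tab> <'e' or 's'> <tab> <email or score>,
// grouped and sorted by id.
import java.io.BufferedReader;
import java.io.PrintWriter;

public class HelpfulJob1ReducerSketch implements IStreamingReducer {
    public void reduce(BufferedReader in, PrintWriter out) throws Exception {
        String currentId = null;
        String currentEmail = null;
        int score = 0;

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            String fields[] = FieldUtils.safeSplit(inputLine);
            String id = fields[0];

            // New group - flush the previous one if it had any score records.
            if (!id.equals(currentId)) {
                if ((currentEmail != null) && (score > 0)) {
                    out.println(currentEmail + "\t" + score);
                }
                currentId = id;
                currentEmail = null;
                score = 0;
            }

            if (fields[1].equals("e")) {
                currentEmail = fields[2];
            } else {
                score += Integer.parseInt(fields[2]);
            }
        }

        // Don't forget the final group.
        if ((currentEmail != null) && (score > 0)) {
            out.println(currentEmail + "\t" + score);
        }
    }
}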

168 helpful Job #2 The mapper here does nothing except re-emit <email><tab><score> The reducer has to sum the scores, and emit <email><tab><sum_score> 168

169 helpful tests Already set up tests for exercises, called test1, test2, test3, test4 So just run "ant test1" to test your solution to exercise 1, etc. All tests will fail, because the map & reduce functions are skeletons These skeletons are the HelpfulExerciseXJobYMapper & HelpfulExerciseXJobYReducer classes Tests often re-use mapper & reducer classes from previous exercises, if the code is the same 169

170 Lab Details Follow steps in bigdata-solr/helpful/readme-helpful Use "ant testX" to build and test your solution for exercise X Modify helpful as per the README exercises We'll do Q & A in one hour 170

171 Q & A Can you get the helpful example to build & run? Could you find the "most helpful" person? Did anyone solve the reply-to-a-reply challenge? 171

172 Day 1 Wrap-Up Big Data and Solr 172 Confidential and Proprietary 2014/15

173 What We Covered Fundamentals of Hadoop + Lab Hadoop Ecosystem Data processing Workflows + Lab NoSQL storage Continuous data processing 173

174 Tomorrow's Agenda Why use Hadoop with Solr? Workflow Design Scalable Solr Indexing + Lab Augmented Search Solr as NoSQL Solr-based Analytics + Lab Solr Index Optimization + Lab 174


More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Sentimental Analysis using Hadoop Phase 2: Week 2

Sentimental Analysis using Hadoop Phase 2: Week 2 Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular

More information

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00 Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

How To Use Big Data For Telco (For A Telco)

How To Use Big Data For Telco (For A Telco) ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with

More information