Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban

1 Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster Dan Şerban

2 Agenda
:: Introduction
- Real-world uses of MapReduce
- The origins of Hadoop
- Hadoop facts and architecture
:: Part 1 - Deploying Hadoop
:: Part 2 - MapReduce is machine learning
:: Q&A

3 Why shopping cart analysis is useful to amazon.com

4 LinkedIn and Google Reader

5 The origins of Hadoop
:: Hadoop got its start in Nutch
:: A few enthusiastic developers were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers
:: Once Google published their GoogleFS and MapReduce whitepapers, the way forward became clear
:: Google had devised systems to solve precisely the problems the Nutch project was facing
:: Thus, Hadoop was born

6 Hadoop facts
:: Hadoop is a distributed computing platform for processing extremely large amounts of data
:: Hadoop is divided into two main components:
- the MapReduce runtime
- the Hadoop Distributed File System (HDFS)
:: The MapReduce runtime allows the user to submit MapReduce jobs
:: HDFS is a distributed file system that provides a logical interface for persistent and redundant storage of large data sets
:: Hadoop also provides the HadoopStreaming library, which leverages STDIN and STDOUT so you can write mappers and reducers in your programming language of choice
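The STDIN/STDOUT contract mentioned above is line-oriented: by default, Hadoop Streaming treats the text up to the first tab on a line as the key and the rest as the value, and a line without a tab is all key. A minimal sketch of that line format follows; the helper names `format_pair` and `parse_pair` are illustrative assumptions, not part of Hadoop.

```python
def format_pair(key, value):
    """Build one output line in the default streaming layout: key<TAB>value."""
    return '%s\t%s' % (key, value)


def parse_pair(line):
    """Split one input line at the first tab into (key, value).

    A line without a tab is treated as a key with no value (None here).
    """
    key, sep, value = line.rstrip('\n').partition('\t')
    return key, (value if sep else None)


# Round trip: what a mapper writes is what a reducer later reads.
line = format_pair('17_42', 1)
key, value = parse_pair(line)
```

In a real streaming script these helpers would be applied to `sys.stdin` and `sys.stdout`. Note that values arrive as strings and must be converted before arithmetic.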

7 Hadoop facts
:: Hadoop is based on the principle of moving computation to where the data is
:: Data stored on the Hadoop Distributed File System is broken up into chunks and replicated across the cluster, providing fault-tolerant parallel processing and redundancy for both the data and the jobs
:: Computation takes the form of a job which consists of a map phase and a reduce phase
:: Data is initially processed by map functions which run in parallel across the cluster
:: Map output is in the form of key-value pairs
:: The reduce phase then aggregates the map results
:: The reduce phase typically happens in multiple consecutive waves until the job is complete
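The map and reduce phases described above can be imitated in a single process to see how the data flows. This is only a sketch: real Hadoop runs the map calls in parallel on the nodes that hold the data. The function names and the sample carts are assumptions made for illustration.

```python
from collections import defaultdict


def map_cart(cart_line):
    """Map function: emit one (product_id, 1) pair per product in a cart."""
    for product_id in cart_line.split():
        yield product_id, 1


def reduce_counts(key, values):
    """Reduce function: aggregate all values that share a key."""
    return key, sum(values)


def run_job(cart_lines):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for line in cart_lines:                 # map phase
        for key, value in map_cart(line):
            grouped[key].append(value)      # grouping of map output by key
    return dict(reduce_counts(k, v) for k, v in grouped.items())  # reduce phase


counts = run_job(["p1 p2 p3", "p2 p3", "p3"])
```

Here `counts` holds how many carts contain each product. The grouping step in the middle is what the framework does between the two phases on a real cluster.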

8 Hadoop architecture

9 Part 1: Configuring and deploying the Hadoop cluster

10 Hands-on with Hadoop

11 core-site.xml - before

12 core-site.xml - after

13 hdfs-site.xml - before

14 hdfs-site.xml - after

15 mapred-site.xml - before

16 mapred-site.xml - after

17 Setting up SSH
:: needs to be able to ssh* into:
- hadoop@hadoop-master
- hadoop@chunkserver-a
- hadoop@chunkserver-b
- hadoop@chunkserver-c
:: hadoop@job-tracker needs to be able to ssh* into:
- hadoop@job-tracker
- hadoop@chunkserver-a
- hadoop@chunkserver-b
- hadoop@chunkserver-c
* Without a password and without a passphrase

18 Hands-on with Hadoop

19 Hands-on with Hadoop

20 Hands-on with Hadoop

21 Hands-on with Hadoop

22 Hands-on with Hadoop

23 Hands-on with Hadoop

24 Hands-on with Hadoop

25 Part 2: MapReduce is machine learning

26 Rolling your own self-hosted alternative to...

27 Hands-on with MapReduce

28 Hands-on with MapReduce

29 Hands-on with MapReduce

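30 mapper.py

    #!/usr/bin/python
    import sys

    # Each input line is one shopping cart: product IDs separated by whitespace.
    for line in sys.stdin:
        line = line.strip()
        IDs = line.split()
        # Emit each pair of distinct IDs once, smaller ID (as a string) first.
        for firstid in IDs:
            for secondid in IDs:
                if secondid > firstid:
                    # Parenthesized form runs under both Python 2 and Python 3.
                    print('%s_%s\t%s' % (firstid, secondid, 1))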
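To see what the mapper emits for a single cart, its pairing logic can be wrapped in a function and called directly. This is a sketch that mirrors the loop on the slide; the name `cart_pairs` and the sample product IDs are assumptions.

```python
def cart_pairs(line):
    """Return the mapper's output lines for one cart (one line of input)."""
    ids = line.strip().split()
    # Same rule as the slide: keep a pair only when the second ID sorts
    # after the first, so each unordered pair appears once.
    return ['%s_%s\t%s' % (a, b, 1) for a in ids for b in ids if b > a]


pairs = cart_pairs("17 42 99")
# pairs == ['17_42\t1', '17_99\t1', '42_99\t1']

# IDs are compared as strings, so '10' sorts before '9'. The key is still
# the same whichever order the two products appear in the cart.
same_key = cart_pairs("9 10") == cart_pairs("10 9") == ['10_9\t1']
```

Because the comparison is on strings, the pair keys are not in numeric order, but they are consistent across carts, which is all the counting step needs. A cart that lists the same product twice will emit its pairs twice.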

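31 reducer.py

    #!/usr/bin/python
    import sys

    # Running total per pair key, e.g. "17_42" -> number of carts with both.
    subtotals = {}
    for line in sys.stdin:
        line = line.strip()
        # Each input line is "key<TAB>count" as written by the mapper.
        word = line.split('\t')[0]
        count = int(line.split('\t')[1])
        subtotals[word] = subtotals.get(word, 0) + count
    # Emit one "key<TAB>total" line per pair.
    for k, v in subtotals.items():
        # Parenthesized form runs under both Python 2 and Python 3.
        print("%s\t%s" % (k, v))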
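The reducer on the slide keeps every key in a dictionary until the input ends. Hadoop Streaming hands the reducer its input sorted by key, so an alternative is to total each run of identical keys as it passes. The sketch below shows that variant; it is not the approach used in the talk, and the name `reduce_sorted` is an assumption.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (key, total) for each run of equal keys in key-sorted input."""
    parsed = (line.rstrip('\n').split('\t') for line in lines if line.strip())
    # groupby only merges adjacent equal keys, hence the need for sorted input.
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(kv[1]) for kv in group)


# sorted() stands in for the sort Hadoop performs between map and reduce.
mapper_output = ["42_99\t1\n", "17_42\t1\n", "17_42\t1\n"]
totals = list(reduce_sorted(sorted(mapper_output)))
```

This variant holds only one key's running total in memory at a time. Its results are wrong if the input is not sorted by key, so it should not be fed unsorted data when testing locally.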

32 Hands-on with MapReduce

33 Hands-on with MapReduce

34 Hands-on with MapReduce

35 Hands-on with MapReduce

36 Hands-on with MapReduce

37 Other MapReduce use cases
:: Google Suggest
:: Video recommendations (YouTube)
:: Clickstream analysis (large web properties)
:: Spam filtering and contextual advertising (Yahoo)
:: Fraud detection (eBay, credit card companies)
:: Firewall log analysis to discover exfiltration and other undesirable (possibly malware-related) activity
:: Finding patterns in social data, analyzing likes and building a search engine on top of them (Facebook)
:: Discovering microblogging trends and opinion leaders, analyzing who follows whom (Twitter)
:: Plain old supermarket shopping basket analysis
:: The semantic web

38 Questions / Feedback

39 Bonus slide: Making of SQLite DB
