Hadoop: challenge accepted!

Transcription

1 Hadoop: challenge accepted! Arkadiusz Osiński Robert Mroczkowski

2 ToC - Hadoop basics - Gather data - Process your data - Learn from your data - Visualize your data

6 BigData - Petabytes of (un)structured data - 12% of data is analyzed - a lot of data is not gathered - how to gain knowledge?

7 Big Data: Data Lake, Power, Scalability, Petabytes, Commodity, MapReduce

14 HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD - Access control - No SPOF

19 YARN - Distributed computing layer - Operations run where the data lives - MapReduce - and other applications - Resource management

20 Let's squeeze our data to get some juice!

21 Gather data
    flume-twitter.sources.twitter.type = com.cloudera.flume.source.TwitterSource
    flume-twitter.sources.twitter.channels = MemChannel
    flume-twitter.sources.twitter.consumerKey = ( )
    flume-twitter.sources.twitter.consumerSecret = ( )
    flume-twitter.sources.twitter.accessToken = ( )
    flume-twitter.sources.twitter.accessTokenSecret = ( )
    flume-twitter.sources.twitter.keywords = hadoop, big data, nosql

24 Process your data - Hadoop Streaming! - No need to write code in Java - You can use Python, Perl or Awk

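25 Process your data
    #!/usr/bin/python
    # map.py: emit "<YYYY-MM-DD>\t1" for every tweet that mentions the keyword.
    import sys
    import json
    import datetime as dt

    keyword = 'hadoop'

    for line in sys.stdin:
        data = json.loads(line.strip())
        if keyword in data['text'].lower():
            # Twitter's created_at looks like "Wed Apr 02 10:15:00 +0000 2014".
            # The result goes into its own name so the dt module is not shadowed.
            day = dt.datetime.strptime(
                data['created_at'], '%a %b %d %H:%M:%S +0000 %Y'
            ).strftime('%Y-%m-%d')
            print('{0}\t1'.format(day))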

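26 Process your data
    #!/usr/bin/python
    # reduce.py: count the records per date; input arrives sorted by key.
    import sys

    (counter, datekey) = (0, '')

    for line in sys.stdin:
        line = line.strip().split("\t")
        if datekey != line[0]:
            # Key changed: flush the finished date before starting the next one.
            if datekey:
                print("{0}\t{1}".format(datekey, counter))
            datekey = line[0]
            counter = 1
        else:
            counter += 1

    # Flush the last date (skipped when there was no input at all).
    if datekey:
        print("{0}\t{1}".format(datekey, counter))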

27 Process your data
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files ./map.py,./reduce.py \
        -mapper ./map.py \
        -reducer ./reduce.py \
        -input /tweets/2014/04/*/*/* \
        -input /tweets/2014/05/*/*/* \
        -output /tweet_keyword
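
To see what the streaming job does without a cluster, the map, sort and reduce phases can be imitated in plain Python. This is only a minimal sketch: the function names and the sample tweets are made up for illustration, and it assumes Twitter's usual `created_at` layout with a `+0000` offset.

```python
import json
from datetime import datetime
from itertools import groupby


def map_tweets(lines, keyword="hadoop"):
    """Map phase: yield (day, 1) for each tweet whose text contains the keyword."""
    for line in lines:
        tweet = json.loads(line)
        if keyword in tweet["text"].lower():
            # Assumed layout: "Wed Apr 02 10:15:00 +0000 2014".
            created = datetime.strptime(
                tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y"
            )
            yield created.strftime("%Y-%m-%d"), 1


def reduce_counts(pairs):
    """Reduce phase: sum the values of each run of equal keys (input must be sorted)."""
    for day, group in groupby(pairs, key=lambda kv: kv[0]):
        yield day, sum(count for _, count in group)


def run_job(lines, keyword="hadoop"):
    """Map, sort by key (standing in for the shuffle), then reduce."""
    return list(reduce_counts(sorted(map_tweets(lines, keyword))))


# Hypothetical sample input, one JSON document per line.
sample = [
    '{"text": "Hadoop rocks", "created_at": "Wed Apr 02 10:15:00 +0000 2014"}',
    '{"text": "I love NoSQL", "created_at": "Wed Apr 02 11:00:00 +0000 2014"}',
    '{"text": "hadoop again", "created_at": "Wed Apr 02 12:30:00 +0000 2014"}',
    '{"text": "More HADOOP", "created_at": "Thu Apr 03 09:00:00 +0000 2014"}',
]
print(run_job(sample))
```

The `sorted` call plays the role of the shuffle: Hadoop Streaming hands the reducer its input sorted by key, which is why the reducer only needs to watch for the key changing.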

28 Process your data (.) (.)

29 Process your data

30 Recommendations We've got users' historical interactions with items. Which product will a client want?

31

32 Simple Example
    > apt-get install mahout
    > cat simple_example.csv
    1,101
    1,102
    1,103
    2,101
Let's just do Mahout - it's easy!
    > hdfs dfs -put simple_example.csv
    > mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
        -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
        -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
        -Dmapred.job.queue.name=atmosphere_prod

33 Simple Example Tadadam!
    > hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r snappy
    1 [105:1.0,104:1.0]
    2 [106:1.0,105:1.0]
    3 [103:1.0,102:1.0]
    4 [105:1.0,102:1.0]
    5 [107:1.0,106:1.0]
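
The idea behind item-based recommendation can be sketched in a few lines of Python. Note the swap: Mahout was run with log-likelihood similarity, while this sketch scores candidates by plain co-occurrence counts, which is simpler but not the same measure. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def recommend(pairs, top_n=2):
    """Recommend unseen items per user, scored by co-occurrence with the user's items."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    # cooc[a][b] = number of users who interacted with both a and b.
    cooc = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            cooc[a][b] += 1

    result = {}
    for user, items in items_by_user.items():
        scores = defaultdict(int)
        for seen in items:
            for other, count in cooc[seen].items():
                if other not in items:
                    scores[other] += count
        # Highest score first; ties broken by item id for a stable order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user] = [item for item, _ in ranked[:top_n]]
    return result


# Hypothetical toy data in the same "user,item" layout as simple_example.csv.
pairs = [(1, 101), (1, 102), (1, 103), (2, 101), (2, 104), (3, 102), (3, 104)]
print(recommend(pairs))
```

Co-occurrence counting favours popular items; a log-likelihood measure is one way to discount that effect.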

34 Wiki Case
Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access
We've got links between Wikipedia articles, and want to propose new links between articles.

35 Wiki Case

36 Wiki Case
http://users.on.net/%7ehenry/pagerank/links-simple-sorted.zip
    #!/usr/bin/awk -f
    # Turn "source: target1 target2 ..." into one "source,target" line per link.
    BEGIN { OFS=","; }
    {
        gsub(":","",$1);
        for (i=2;i<=NF;i++) {
            print $1,$i
        }
    }
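
For readers more at home in Python than Awk, the same reshaping can be sketched as follows. It assumes each input line has the form `source: target1 target2 ...`; the function name and sample lines are illustrative only.

```python
def adjacency_to_pairs(lines):
    """Turn 'src: dst1 dst2 ...' lines into 'src,dst' strings, one per link."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = fields[0].replace(":", "")
        for target in fields[1:]:
            yield "{0},{1}".format(source, target)


# Hypothetical sample in the assumed adjacency-list layout.
sample = ["1: 10 20 30", "2: 10"]
print(list(adjacency_to_pairs(sample)))
```

The resulting `source,target` pairs have the same two-column shape as the `user,item` file from the simple example, so each article plays the role of a user and each outgoing link the role of an item.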

37 Wiki Case
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapreduce.job.max.split.locations=24 \
        -Dmapreduce.job.queuename=hadoop_prod \
        -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -Dmapred.text.key.comparator.options=-n \
        -Dmapred.output.compress=false \
        -files ./mahout/mapper.awk \
        -mapper ./mapper.awk \
        -input /mahout/input/wikilinks/links-simple-sorted.txt \
        -output /mahout/output/wikilinks/fixedinput

38 Wiki Case The Mahout library computed the similarity matrix and gave recommendations for 824 articles. Importantly, we didn't gather any knowledge a priori; we just ran the algorithms out of the box.

39 Wiki Case Acadèmia_Valenciana_de_la_Llengua FIFA October_1 Prehistoric_Iberia Ceuta Roussillon Sweden Valencia Calendar Link appears recently Spain City at the north coast of Africa Part of France by the border with Spain Turís municipality in the Valencian Community Vulgar_Latin Western_Italo-Western_languages Àngel_Guimerà Language article Language article Spanish writer

40 Wiki Case

41 Tweets Let's find groups of: tags, users

42 Tweets Our data is not random. We've picked specific keywords. We'll do the analysis in two orthogonal directions.

43 Tweets { "filter_level":"medium", "contributors":null, "text":"promoción MES DE MAYO. con...", "geo":null, "retweeted":false, "lang":"es", "entities":{ "urls":[ { "expanded_url":" "indices":[ 69, 91 ], "display_url":"agmuriel.com/#!-/c1gz", "url":" } ] } ( )

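44 Tweets
    #!/usr/bin/python
    # twitter_map.py: emit "<hashtag>\t<tweet text>" for English tweets.
    import json, sys

    for line in sys.stdin:
        line = line.strip()
        if '"lang":"en"' in line:
            tweet = json.loads(line)
            try:
                # Collapse tabs and newlines so the text stays in one field.
                text = " ".join(tweet['text'].lower().split())
                if text:
                    tags = tweet["entities"]["hashtags"]
                    for tag in tags:
                        print(tag["text"] + "\t" + text)
            except KeyError:
                continue

    #!/usr/bin/python
    # twitter_reduce.py: join all texts of one hashtag into a single document.
    import sys

    (lastkey, text) = (None, "")

    for line in sys.stdin:
        (key, value) = line.strip().split("\t", 1)
        if lastkey and lastkey != key:
            print(lastkey + "\t" + text)
            (lastkey, text) = (key, value)
        else:
            (lastkey, text) = (key, text + " " + value)

    # Flush the last hashtag, which the loop never prints.
    if lastkey:
        print(lastkey + "\t" + text)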

45 Tweets Get SequenceFile with proper mapping
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapreduce.job.queuename=atmosphere_time \
        -Dmapred.output.compress=false \
        -Dmapreduce.job.max.split.locations=24 \
        -Dmapred.reduce.tasks=20 \
        -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
        -mapper ./twitter_map.py \
        -reducer ./twitter_reduce.py \
        -input /project/atmosphere/tweets/2014/04/*/* \
        -output /project/atmosphere/tweets/output \
        -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

46 Tweets Calculate vector representation for text
    mahout seq2sparse \
        -i /project/atmosphere/tweets/output \
        -o /project/atmosphere/tweets/vectorized -ow \
        -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

    {10: ,14: }
    {10: ,14: }
    {3: ,14: ,19: ,22: }
    {17:1.0}
    {3: ,14: ,19: ,22: }
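
What the sparse `{term_id: weight}` vectors represent can be illustrated with a small TF-IDF sketch. It uses the textbook weighting (term frequency times log of N over document frequency) followed by L2 normalisation, which is an assumption made for illustration and not necessarily the exact formula Mahout applies. Function and variable names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (term_id, vectors): one sparse {term_id: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    term_id = {term: i for i, term in enumerate(vocabulary)}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term_id[t]: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items()
        }
        # Terms present in every document get weight 0 and are dropped.
        weights = {i: w for i, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({i: w / norm for i, w in weights.items()} if norm else {})
    return term_id, vectors


# Hypothetical "documents": the joined tweet texts of three hashtags.
docs = ["linux ubuntu patch", "linux windows", "zumba fitness"]
term_id, vectors = tfidf_vectors(docs)
print(term_id)
print(vectors)
```

Normalising every vector to unit length keeps long documents from dominating the distance computations in the clustering step.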

47 Tweets It's time to begin clustering. Let's find 100 clusters.
    mahout kmeans \
        -i /tweets_5/vectorized/tfidf-vectors \
        -c /tweets_5/kmeans/initial-clusters \
        -o /tweets_5/kmeans/output-clusters \
        -cd 1.0 -k 100 -x 10 -cl -ow \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
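
The clustering step itself can be sketched as plain Lloyd's k-means with a squared Euclidean distance. This is an illustrative toy, not Mahout's implementation: it seeds the centroids with the first k points, works on dense tuples, and its `max_iter` and `tol` parameters are only loosely analogous to the `-x` and `-cd` options.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=10, tol=1.0):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        shift = max(squared_euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break  # centroids barely moved: treat as converged
    return centroids, assignment


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

With real TF-IDF vectors the points would be sparse and high-dimensional, but the two alternating steps stay the same.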

48 Tweets Glance at results
    BURN, FAT, WEIGHTLOSS, FITNESS, ZUMBA
    OPEN, SOFTWARE, LINUX, UBUNTU, OPENSUSE, PATCHING
    LEATHER, WALLET, MAN

49 Tweets It was easy because tags are strongly dependent on each other (co-occurrence).

50 Tweets Bigger challenge: user clustering. LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC

51 Tweets Bigger challenge: user clustering. Results show that the dataset is strongly skewed towards mobile and games. The dataset wasn't random: we subscribed to specific keywords. The OS result is great!

52 Tweets HADOOP WORLD
    run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
    google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata

53 Tweets HADOOP WORLD Cloudera wants to do big data in real time. Hortonworks wants to replace Cloudera by research.

54 Visualize data
    ADD JAR hive-serdes-1.0-SNAPSHOT.jar;
    CREATE TABLE tw_data_
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\012'
    STORED AS TEXTFILE
    LOCATION '/tweets/tw_data_'
    AS SELECT v_date, LOWER(hashtags.text) AS tag, lang, COUNT(*) AS total_count
    FROM logs.tweets
    LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
    WHERE v_date LIKE ' %'
    GROUP BY v_date, LOWER(hashtags.text), lang;

55 Visualize data
    ADD JAR elasticsearch-hadoop-hive rc1.jar;
    CREATE EXTERNAL TABLE es_export (
      v_date string,
      tag string,
      lang string,
      total_count int,
      info string
    )
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
      'es.resource' = 'trends/log',
      'es.index.auto.create' = 'true'
    );

56 Visualize data
    INSERT OVERWRITE TABLE es_export
    SELECT DISTINCT may.v_date, may.tag, may.lang, may.total_count, 'nt'
    FROM tw_data_ may
    LEFT OUTER JOIN tw_data_ april ON april.tag = may.tag
    WHERE april.tag IS NULL AND may.total_count > 1;
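
The LEFT OUTER JOIN with the `april.tag IS NULL` filter is an anti-join: it keeps only the May tags that never showed up in April. A minimal Python sketch of the same selection follows; the function name, the row layout and the sample counts are invented for illustration.

```python
def new_tags(may_rows, april_rows, min_count=1):
    """Rows from May whose tag is absent from April and whose count exceeds min_count."""
    april_tags = {row["tag"] for row in april_rows}
    return [
        row
        for row in may_rows
        if row["tag"] not in april_tags and row["total_count"] > min_count
    ]


# Hypothetical per-day tag counts for the two months.
april = [
    {"v_date": "2014-04-10", "tag": "hadoop", "lang": "en", "total_count": 40},
]
may = [
    {"v_date": "2014-05-10", "tag": "hadoop", "lang": "en", "total_count": 55},
    {"v_date": "2014-05-10", "tag": "eurovisiontve", "lang": "es", "total_count": 12},
    {"v_date": "2014-05-11", "tag": "oneoff", "lang": "en", "total_count": 1},
]
print(new_tags(may, april))
```

Filtering out tags that occur only once removes noise, so what reaches the index is a list of tags that are both new and actually used.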

57 Visualize data

58 Visualize data Tag: eurovisiontve

59 Thank you! Questions?