
Hadoop: challenge accepted!

Arkadiusz Osiński - arkadiusz.osinski@allegrogroup.com
Robert Mroczkowski - robert.mroczkowski@allegrogroup.com

ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data

BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- a lot of data is not gathered
- how to gain knowledge?

Big Data: Data Lake, Power, Scalability, Petabytes, Commodity, MapReduce

HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF
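Beyond the native Java client, HDFS also exposes its namespace over the WebHDFS REST API. A minimal sketch of listing a directory, in Python 2 to match the scripts later in the talk; the hostname, the default web port 50070 and the /tweets path are placeholder assumptions:

#!/usr/bin/python
# List an HDFS directory over the WebHDFS REST API.
# 'namenode' and '/tweets' are placeholders - adjust to your cluster.
import json, urllib2

url = 'http://namenode:50070/webhdfs/v1/tweets?op=LISTSTATUS'
listing = json.load(urllib2.urlopen(url))
for f in listing['FileStatuses']['FileStatus']:
    print f['type'], f['pathSuffix'], f['length']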

YARN
- Distributed computing layer
- Operations in place of data
- MapReduce and other applications
- Resource management
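That resource management is observable from the outside: the ResourceManager publishes cluster state over a REST API. A small sketch, assuming the RM web UI on its default port 8088; the hostname is a placeholder:

#!/usr/bin/python
# Query the YARN ResourceManager for cluster capacity.
# 'resourcemanager' is a placeholder hostname - adjust to your cluster.
import json, urllib2

metrics = json.load(urllib2.urlopen(
    'http://resourcemanager:8088/ws/v1/cluster/metrics'))['clusterMetrics']
print 'active nodes:', metrics['activeNodes']
print 'total memory (MB):', metrics['totalMB']
print 'total vcores:', metrics['totalVirtualCores']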

Let's squeeze our data to get the juice!

Gather data

flume-twitter.sources.twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.twitter.channels = MemChannel
flume-twitter.sources.twitter.consumerKey = ( )
flume-twitter.sources.twitter.consumerSecret = ( )
flume-twitter.sources.twitter.accessToken = ( )
flume-twitter.sources.twitter.accessTokenSecret = ( )
flume-twitter.sources.twitter.keywords = hadoop, big data, nosql
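With the configuration saved, the agent is started with the flume-ng launcher, e.g. flume-ng agent -n flume-twitter -f twitter.conf (the config file name here is an assumption; the agent name must match the property prefix above).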

Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk

Process your data

#!/usr/bin/python
import sys
import json
import datetime

keyword = 'hadoop'
for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # normalize Twitter's created_at timestamp to YYYY-MM-DD
        day = datetime.datetime.strptime(data['created_at'],
                                         '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print '{0}\t1'.format(day)

Process your data

#!/usr/bin/python
import sys

(counter, datekey) = (0, '')
for line in sys.stdin:
    line = line.strip().split("\t")
    if datekey != line[0]:
        if datekey:
            print "{0}\t{1}".format(datekey, counter)
        datekey = line[0]
        counter = 1
    else:
        counter += 1
# flush the last key
print "{0}\t{1}".format(datekey, counter)
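Because Hadoop Streaming only speaks stdin/stdout, the pair can be dry-run locally before touching the cluster: a map | sort | reduce pipeline mimics the shuffle. A sketch, assuming the two scripts above are saved as executable map.py and reduce.py and a sample of raw tweets sits in tweets.json (all three names are made up):

#!/usr/bin/python
# Local dry run of the streaming job: map -> sort -> reduce.
# Equivalent to: cat tweets.json | ./map.py | sort | ./reduce.py
import subprocess

with open('tweets.json') as f:
    mapped = subprocess.check_output(['./map.py'], stdin=f)
shuffled = '\n'.join(sorted(mapped.splitlines()))
reducer = subprocess.Popen(['./reduce.py'], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE)
out, _ = reducer.communicate(shuffled)
print out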

Process your data

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files ./map.py,./reduce.py \
    -mapper ./map.py \
    -reducer ./reduce.py \
    -input /tweets/2014/04/*/*/* \
    -input /tweets/2014/05/*/*/* \
    -output /tweet_keyword

Process your data

(...)
2014-04-24  864
2014-04-25  1121
2014-04-26  593
2014-04-27  649
2014-04-28  1084
2014-04-29  1575
2014-04-30  1170
2014-05-01  1164
2014-05-02  1175
2014-05-03  779
2014-05-04  471
(...)

Process your data

Recommendations

We've got historical user interactions with items. Which product will a client desire next?
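The intuition behind item-based recommendation fits in a few lines of plain Python. This toy sketch scores candidate items by how often they co-occur with a user's history; it only illustrates the idea - Mahout's recommenditembased, used below, relies on log-likelihood similarity rather than raw counts, and the preference data here is invented:

#!/usr/bin/python
# Toy item-based recommender over made-up user -> items preferences.
from collections import defaultdict
from itertools import combinations

prefs = {1: {101, 102, 103}, 2: {101, 103, 104}, 3: {104, 105}}

# (item, item) -> number of users holding both
cooc = defaultdict(int)
for items in prefs.values():
    for a, b in combinations(sorted(items), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(user, n=2):
    scores = defaultdict(int)
    for have in prefs[user]:
        for (a, b), c in cooc.items():
            if a == have and b not in prefs[user]:
                scores[b] += c
    return sorted(scores, key=scores.get, reverse=True)[:n]

print recommend(1)   # -> [104]: it co-occurs with two items user 1 already has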

Simple Example

> apt-get install mahout
> cat simple_example.csv
1,101
1,102
1,103
2,101

Let's just do mahout - it's easy!

> hdfs dfs -put simple_example.csv
> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
    -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
    -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
    -Dmapred.job.queue.name=atmosphere_prod

Simple Example

Tadadam!

> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1  [105:1.0,104:1.0]
2  [106:1.0,105:1.0]
3  [103:1.0,102:1.0]
4  [105:1.0,102:1.0]
5  [107:1.0,106:1.0]
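Each output line is a user ID followed by the top recommended item IDs with their estimated preference scores.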

Wiki Case

"Wikipedia (/ˌwɪkɪˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access (...)"

We've got links between Wikipedia articles, and want to propose new links between articles.

Wiki Case

Wiki Case

http://users.on.net/%7ehenry/pagerank/links-simple-sorted.zip

The dump's format is "source: target target ...", so a small awk mapper flattens each line into the "source,target" pairs Mahout expects:

#!/usr/bin/awk -f
BEGIN { OFS=","; }
{
    gsub(":", "", $1);
    for (i = 2; i <= NF; i++) {
        print $1, $i
    }
}

Wiki Case

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapreduce.job.queuename=hadoop_prod \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -Dmapred.text.key.comparator.options=-n \
    -Dmapred.output.compress=false \
    -files ./mahout/mapper.awk \
    -mapper ./mapper.awk \
    -input /mahout/input/wikilinks/links-simple-sorted.txt \
    -output /mahout/output/wikilinks/fixedinput

Wiki Case

The Mahout library computed the similarity matrix and gave recommendations for 824 articles. What's important: we didn't gather any knowledge a priori and just ran the algorithm out of the box.

Wiki Case

Recommended links for Acadèmia_Valenciana_de_la_Llengua:

FIFA                              link appears recently
October_1                         calendar
Prehistoric_Iberia                Spain
Ceuta                             city on the north coast of Africa
Roussillon                        part of France by the border with Spain
Sweden
Valencia
Turís                             municipality in the Valencian Community
Vulgar_Latin                      language article
Western_Italo-Western_languages   language article
Àngel_Guimerà                     Spanish writer

Wiki Case

Tweets

Let's find groups of:
- tags
- users

Tweets

- Our data is not random - we've picked specific keywords
- We'll do the analysis in two orthogonal directions

Tweets

{
  "filter_level":"medium",
  "contributors":null,
  "text":"promoción MES DE MAYO. con...",
  "geo":null,
  "retweeted":false,
  "lang":"es",
  "entities":{
    "urls":[
      {
        "expanded_url":"http://www.agmuriel.com",
        "indices":[69, 91],
        "display_url":"agmuriel.com/#!-/c1gz",
        "url":"http://t.co/apppjrrtxn"
      }
    ]
  }
(...)

Tweets

Mapper - emit a (hashtag, text) pair for every English tweet:

#!/usr/bin/python
import json, sys

for line in sys.stdin:
    line = line.strip()
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            text = tweet['text'].lower().strip()
            if text:
                tags = tweet["entities"]["hashtags"]
                for tag in tags:
                    print tag["text"] + "\t" + text
        except KeyError:
            continue

Reducer - concatenate all texts per hashtag:

#!/usr/bin/python
import sys

(lastkey, text) = (None, "")
for line in sys.stdin:
    (key, value) = line.strip().split("\t")
    if lastkey and lastkey != key:
        print lastkey + "\t" + text
        (lastkey, text) = (key, value)
    else:
        (lastkey, text) = (key, text + " " + value)
# flush the last key
if lastkey:
    print lastkey + "\t" + text

Tweets

Get a SequenceFile with the proper mapping:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.queuename=atmosphere_time \
    -Dmapred.output.compress=false \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapred.reduce.tasks=20 \
    -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
    -mapper ./twitter_map.py \
    -reducer ./twitter_reduce.py \
    -input /project/atmosphere/tweets/2014/04/*/* \
    -output /project/atmosphere/tweets/output \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

Tweets

Calculate a TF-IDF vector representation for the text:

mahout seq2sparse \
    -i /project/atmosphere/tweets/output \
    -o /project/atmosphere/tweets/vectorized -ow \
    -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

{10:0.6292275202550768,14:0.7772211575566166}
{10:0.6292275202550768,14:0.7772211575566166}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
{17:1.0}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
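The vectors above map term indices to TF-IDF weights. As a reminder of what that weighting does, here is a toy computation of one common TF-IDF variant (illustration only - seq2sparse additionally builds n-grams, prunes terms by the -s/-md/-x thresholds and normalizes with -n 2, and its exact formula differs in details):

#!/usr/bin/python
# Toy TF-IDF over three tiny 'documents'.
import math
from collections import Counter

docs = [['hadoop', 'streaming'], ['hadoop', 'hive'], ['kibana']]
df = Counter(t for d in docs for t in set(d))   # document frequency per term
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)                           # term frequency
    return dict((t, tf[t] * math.log(float(N) / df[t])) for t in tf)

for d in docs:
    print tfidf(d)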

Tweets

It's time to begin clustering. Let's find 100 clusters:

mahout kmeans \
    -i /tweets_5/vectorized/tfidf-vectors \
    -c /tweets_5/kmeans/initial-clusters \
    -o /tweets_5/kmeans/output-clusters \
    -cd 1.0 -k 100 -x 10 -cl -ow \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
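Under the hood this is Lloyd's k-means iteration, just distributed over MapReduce. A minimal in-memory version on made-up 2-D points (the Mahout job runs the same loop over the sparse TF-IDF vectors):

#!/usr/bin/python
# Minimal k-means (Lloyd's algorithm) on toy 2-D points.
import random

def kmeans(points, k, iters=10):
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # recompute each center as the mean of its cluster
        centers = [tuple(sum(x) / float(len(c)) for x in zip(*c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

print kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2)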

Tweets

A glance at the results - three example tag clusters:

- BURN, FAT, WEIGHTLOSS, FITNESS, ZUMBA
- OPEN, SOFTWARE, LINUX, UBUNTU, OPENSUSE, PATCHING
- LEATHER, WALLET, MAN

Tweets

It was easy because tags are strongly dependent (co-occurrence).

Tweets

Bigger challenge - user clustering. Two example clusters:

- LINUX, UBUNTU, WINDOWS, OS, PATCH, MAC, HACKED, MICROSOFT
- FREE, CSRRACING, WON, RACEYOURFRIENDS, ANDROID, CSRCLASSIC

Tweets

Bigger challenge - user clustering:
- Results show that the dataset is strongly skewed toward mobile and games
- The dataset wasn't random - we subscribed to specific keywords
- The OS result is great!

Tweets - HADOOP WORLD

"run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar"

"rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata"

Tweets - HADOOP WORLD

Cloudera wants to do big data in real time. Hortonworks wants to displace Cloudera through research.

Visualize data

add jar hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tw_data_201404
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE LOCATION '/tweets/tw_data_201404'
AS SELECT v_date, LOWER(hashtags.text), lang, COUNT(*) AS total_count
FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date LIKE '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang;

Visualize data

add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;

CREATE EXTERNAL TABLE es_export (
    v_date string,
    tag string,
    lang string,
    total_count int,
    info string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
    'es.resource' = 'trends/log',
    'es.index.auto.create' = 'true'
);

Visualize data

Export to Elasticsearch the tags that appear in May but not in April (the anti-join via LEFT OUTER JOIN ... IS NULL):

INSERT OVERWRITE TABLE es_export
SELECT DISTINCT may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT OUTER JOIN tw_data_201404 april ON april.tag = may.tag
WHERE april.tag IS NULL AND may.total_count > 1;

Visualize data

Visualize data

Tag: eurovisiontve

Thank you! Questions?