Hadoop: challenge accepted!
|
|
- Josephine Osborne
- 8 years ago
- Views:
Transcription
1 Hadoop: challenge accepted! Arkadiusz Osiński Robert Mroczkowski
2 ToC - Hadoop basics - Gather data - Process your data - Learn from your data - Visualize your data
3 BigData - Petabytes of (un)structured data
4 BigData - Petabytes of (un)structured data - 12% of data is analyzed
5 BigData - Petabytes of (un)structured data - 12% of data is analyzed - a lot of data is not gathered
6 BigData - Petabytes of (un)structured data - 12% of data is analyzed - a lot of data is not gathered - how to gain knowledge?
7 Big Data Data Lake Power Scalability Petabytes Commodity Mapreduce
8 - Storage layer HDFS
9 HDFS - Storage layer - Distributed file system
10 HDFS - Storage layer - Distributed file system - Commodity hardware
11 HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability
12 HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD
13 HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD - Access control
14 HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD - Access control - No SPOF
15 YARN - Distributed computing layer
16 YARN - Distributed computing layer - Operations in place of data
17 YARN - Distributed computing layer - Operations in place of data - MapReduce
18 YARN - Distributed computing layer - Operations in place of data - MapReduce - and others applications
19 YARN - Distributed computing layer - Operations in place of data - MapReduce - and others applications - Resource management
20 Let s squize our data to get a juice!!
21 Gather data flume-twitter.sources.twitter.type = com.cloudera.flume.source.twittersource flume-twitter.sources.twitter.channels = MemChannel flume-twitter.sources.twitter.consumerkey = ( ) flume-twitter.sources.twitter.consumersecret = ( ) flume-twitter.sources.twitter.accesstoken = ( ) flume-twitter.sources.twitter.accesstokensecret = ( ) flume-twitter.sources.twitter.keywords = hadoop, big data, nosql
22 Process your data - Hadoop Streaming!
23 Process your data - Hadoop Streaming! - No need to write code in Java
24 Process your data - Hadoop Streaming! - No need to write code in Java - You can use Python, Perl or Awk
25 Process your data #!/usr/bin/python import sys import json import datetime as dt keyword='hadoop' for line in sys.stdin: data = json.loads(line.strip()) if keyword in data['text'].lower(): dt=dt.datetime.strptime(data['created_at'], '%a %b %d %H:%M:%S %Y').strftime('%Y-%m-%d') print '{0}\t1'.format(str(dt))
26 Process your data #!/usr/bin/python import sys (counter,datekey=(0,'') for line in sys.stdin: line = line.strip().split("\t") if datekey!= line[0]: if datekey: print "{0}\t{1}".format(str(datekey),str(counter)) datekey = line[0] counter = 1 else: counter += 1 print "{0}\t{1}".format(str(datekey),str(counter))
27 Process your data yarn jar /usr/lib/hadoop-mapreduce/hadoopstreaming.jar \ -files./map.py,./reduce.py \ -mapper./map.py \ -reducer./reduce.py \ -input /tweets/2014/04/*/*/* \ -input /tweets/2014/05/*/*/* \ -output /tweet_keyword
28 Process your data (.) (.)
29 Process your data
30 Recommendations We ve got historical users interaction with items. Which product will be desired by client?
31
32 Simple Example > apt-get install mahout > cat simple_example.csv 1,101 1,102 1,103 2,101 Let s just do mahout - it s easy! > hdfs dfs -put simple_example.csv > mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \ -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \ -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \ -Dmapred.job.queue.name=atmosphere_prod
33 Simple Example Tadadam! > hdfs dfs text /mahout/output/wikilinks/simple_example/part-r snappy 1 [105:1.0,104:1.0] 2 [106:1.0,105:1.0] 3 [103:1.0,102:1.0] 4 [105:1.0,102:1.0] 5 [107:1.0,106:1.0]
34 Wiki Case Wikipedia ( i /ˌwɪkɨˈpiːdiəә/ or i /ˌwɪkiˈpiːdiəә/ WIK- i- PEE- dee- əә) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non- profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia'ʹs 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access We ve got links between wikipedia articles, and want to propose new links between articles.
35 Wiki Case
36 Wiki Case hlp://users.on.net/%7ehenry/pagerank/links- simple- sorted.zip #!/usr/bin/awk -f BEGIN { OFS=", ; } { } gsub(":","",$1); for (i=2;i<=nf;i++) { print $1,$i }
37 Wiki Case yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -Dmapreduce.job.max.split.locations=24 \ -Dmapreduce.job.queuename=hadoop_prod \ -Dmapred.output.key.comparator.class=mapred.lib.KeyFieldBasedComparator \ -Dmapred.text.key.comparator.options=-n \ -Dmapred.output.compress=false \ -files./mahout/mapper.awk \ -mapper./mapper.awk \ -input /mahout/input/wikilinks/links-simple-sorted.txt \ -output /mahout/output/wikilinks/fixedinput
38 Wiki Case Mahout lib count s similarity Matrix and gave recommendations for 824 articles. What s important, we didn t gather any knowledge a priori and just ran algorithm s out of box.
39 Wiki Case Acadèmia_Valenciana_de_la_Llengua FIFA October_1 Prehistoric_Iberia Ceuta Roussillon Sweden Valencia Calendar Link appears recently Spain City at the north coast of Africa Part of France by the border with Spain J Turís municipality in the Valencian Community Vulgar_Latin Western_Italo- Western_languages Àngel_Guimerà Language article Language article Spanish wriler
40 Wiki Case
41 Tweets Let s find group of: tags users
42 Tweets Our data is not random We ve picked specific keywords We ll do analysis in two orthogonal directions
43 Tweets { "filter_level":"medium", "contributors":null, "text":"promoción MES DE MAYO. con...", "geo":null, "retweeted":false, "lang":"es", "entities":{ "urls":[ { "expanded_url":" "indices":[ 69, 91 ], "display_url":"agmuriel.com/#!-/c1gz", "url":" } ] } ( )
44 #!/usr/bin/python import json, sys Tweets for line in sys.stdin: line = line.strip() if '"lang":"en"' in line: tweet = json.loads(line) try: text = tweet['text'].lower().strip() if text: tags = tweet[ entities"][ hashtags ] for tag in tags: print tag[ text ]+"\t"+text except KeyError: continue #!/usr/bin/python import sys (lastkey,text) = (None,"") for line in sys.stdin: (key,value) = line.strip().split("\t") if lastkey and lastkey!= key: print lastkey+"\t"+text (lastkey,text) = (key,value) else: (lastkey,text) = (key,text+" "+value)
45 Tweets Get SequenceFile with proper mapping yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -Dmapreduce.job.queuename=atmosphere_time \ -Dmapred.output.compress=false \ -Dmapreduce.job.max.split.locations=24 \ -D-Dmapred.reduce.tasks=20 \ -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \ -mapper./twitter_map.py \ -reducer./twitter_reduce.py \ -input /project/atmosphere/tweets/2014/04/*/* \ -output /project/atmosphere/tweets/output \ -outputformat org.apache.hadoop.mapred.sequencefileoutputformat
46 Tweets Calculate vector representation for text mahout seq2sparse \ -i /project/atmosphere/tweets/output \ -o /project/atmosphere/tweets/vectorized -ow \ -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2 {10: ,14: } {10: ,14: } {3: ,14: ,19: ,22: } {17:1.0} {3: ,14: ,19: ,22: }
47 Tweets I ts time to begin clusterization Let s find 100 clusters mahout kmeans \ -i /tweets_5/vectorized/tfidf-vectors \ -c /tweets_5/kmeans/initial-clusters \ -o /tweets_5/kmeans/output-clusters \ -cd 1.0 -k 100 -x 10 -cl ow \ -dm org.apache.mahout.common.distance.squaredeuclideandistancemeasure
48 Tweets Glance at results BURN OPEN LEATHER FAT SOFTWARE WALLET WEIGHTLOSS LINUX MAN FITNESS UBUNTU ZUMBA OPENSUSE PATCHING
49 Tweets It was easy because tags are very dependent (coocurence).
50 Tweets Bigger challenge user clustering LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC
51 Tweets Bigger challenge user clustering Results show that dataset is strongly curved by mobile and games Dataset wasn t random we subscribed specific keywords OS result is great!
52 Tweets HADOOP WORLD run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... h:p://t.co/gdmqm5g1ar google cloud storage connector for #hadoop: quick start guide now avail h:p://t.co/17hxtvdlir #bigdata
53 Tweets HADOOP WORLD Cloudera wants to do big data in Real Time. Hortonworks wants to replace cloudera by research.
54 Visualize data add jar hive-serdes-1.0-snapshot.jar; create table tw_data_ ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\012 STORED AS TEXTFILE LOCATION /tweets/tw_data_ AS SELECT v_date, LOWER(hashtags.text), lang, COUNT(*) AS total_count FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags WHERE v_date like ' %' GROUP BY v_date,lower(hashtags.text),lang
55 Visualize data add jar elasticsearch-hadoop-hive rc1.jar; CREATE EXTERNAL TABLE es_export ( v_date string, tag string, lang string, total_count int, info string ) STORED BY 'org.elasticsearch.hadoop.hive.esstoragehandler TBLPROPERTIES ( 'es.resource' = 'trends/log', 'es.index.auto.create' = 'true') ;
56 Visualize data INSERT overwrite TABLE es_export SELECT distinct may.v_date,may.tag,may.lang,may.total_count,'nt' FROM tw_data_ may LEFT outer JOIN tw_data_ april ON april.tag = may.tag WHERE april.tag is null AND may.total_count>1;
57 Visualize data
58 Visualize data Tag: eurovisiontve
59 Thank you! Questions?
Hadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May
2012 coreservlets.com and Dima May Hadoop Streaming Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite
More informationBig Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationMapReduce Détails Optimisation de la phase Reduce avec le Combiner
MapReduce Détails Optimisation de la phase Reduce avec le Combiner S'il est présent, le framework insère le Combiner dans la pipeline de traitement sur les noeuds qui viennent de terminer la phase Map.
More informationThe MapReduce Framework
The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
More informationHadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationhttp://glennengstrand.info/analytics/fp
Functional Programming and Big Data by Glenn Engstrand (September 2014) http://glennengstrand.info/analytics/fp What is Functional Programming? It is a style of programming that emphasizes immutable state,
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationMammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
More informationCOSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring 2014. Mahout
COSC 6397 Big Data Analytics Mahout and 3 rd homework assignment Edgar Gabriel Spring 2014 Mahout Scalable machine learning library Built with MapReduce and Hadoop in mind Written in Java Focusing on three
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationHow to Hadoop Without the Worry: Protecting Big Data at Scale
How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance
More informationHDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationBig Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationOpen Source Technologies on Microsoft Azure
Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions
More informationHadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
More informationIntroduction to Apache Hive
Introduction to Apache Hive Pelle Jakovits 1. Oct, 2013, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language Examples Internals Hive vs
More informationCloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
More informationZihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical
More informationMicrosoft Big Data Solutions. Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com;
Microsoft Big Data Solutions Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com; Why/What is Big Data and Why Microsoft? Options of storage and big data processing in Microsoft Azure. Real Impact of Big
More informationHadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
More informationHadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
More informationThe Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale
The Power of Pentaho and Hadoop in Action Demonstrating MapReduce Performance at Scale Introduction Over the last few years, Big Data has gone from a tech buzzword to a value generator for many organizations.
More informationAssignment 2: More MapReduce with Hadoop
Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz
More information! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering
E6893 Big Data Analytics: Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering Aonan Zhang Dept. of Electrical Engineering 1 October 9th, 2014 Mahout Brief Review The Apache
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More information#TalendSandbox for Big Data
Evalua&on von Apache Hadoop mit der #TalendSandbox for Big Data Julien Clarysse @whatdoesdatado @talend 2015 Talend Inc. 1 Connecting the Data-Driven Enterprise 2 Talend Overview Founded in 2006 BRAND
More informationHarnessing the Potential Raj Nair
Linking Structured and Unstructured Data Harnessing the Potential Raj Nair AGENDA Structured and Unstructured Data What s the distinction? The rise of Unstructured Data What s driving this? Big Data Use
More informationIntroduction to Apache Hive
Introduction to Apache Hive Pelle Jakovits 14 Oct, 2015, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language User Defined Functions Hive
More informationWINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies
More informationHadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay
Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay Dipojjwal Ray Sandeep Prasad 1 Introduction In installation manual we listed out the steps for hadoop-1.0.3 and hadoop-
More informationIntroduction To Hive
Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationCase Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo moon@nflabs.com HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
More informationBig Data Workshop. dattamsha.com
Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the
More informationText Clustering Using LucidWorks and Apache Mahout
Text Clustering Using LucidWorks and Apache Mahout (Nov. 17, 2012) 1. Module name Text Clustering Using Lucidworks and Apache Mahout 2. Scope This module introduces algorithms and evaluation metrics for
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationInvestigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein dstrohschein@cga.harvard.edu Stephen Mcdonald stephenmcdonald@cga.harvard.edu Benjamin Lewis blewis@cga.harvard.edu Weihe
More informationIntroduction to Hadoop on the SDSC Gordon Data Intensive Cluster"
Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster" Mahidhar Tatineni SDSC Summer Institute August 06, 2013 Overview "" Hadoop framework extensively used for scalable distributed processing
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationFacebook s Petabyte Scale Data Warehouse using Hive and Hadoop
Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Trends
More informationHadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
More informationApplication and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10
Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs
More informationHadoop/MapReduce Workshop
Hadoop/MapReduce Workshop Dan Mazur daniel.mazur@mcgill.ca Simon Nderitu simon.nderitu@mcgill.ca guillimin@calculquebec.ca August 14, 2015 1 Outline Hadoop introduction and motivation Python review HDFS
More informationSentiment Analysis using Hadoop Sponsored By Atlink Communications Inc
Sentiment Analysis using Hadoop Sponsored By Atlink Communications Inc Instructor : Dr.Sadegh Davari Mentors : Dilhar De Silva, Rishita Khalathkar Team Members : Ankur Uprit Kiranmayi Ganti Pinaki Ranjan
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationBusiness Intelligence for Big Data
Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,
More informationINTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS
INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS Bogdan Oancea "Nicolae Titulescu" University of Bucharest Raluca Mariana Dragoescu The Bucharest University of Economic Studies, BIG DATA The term big data
More informationOpen Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
More informationYou should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
More informationYahoo! Grid Services Where Grid Computing at Yahoo! is Today
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations marco@yahoo-inc.com What is Apache Hadoop? Distributed File System and Map-Reduce programming platform
More informationGetting Started with Hadoop with Amazon s Elastic MapReduce
Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson
More informationTutorial for Assignment 2.0
Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2012 Slides based on last years tutorials by Chris Körner, Philipp Singer 1 Review and Motivation Agenda Assignment Information Introduction
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationH2O on Hadoop. September 30, 2014. www.0xdata.com
H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms
More informationHadoop and Big Data Research
Jive with Hive Allan Mitchell Joint author on 2005/2008 SSIS Book by Wrox Websites www.copperblueconsulting.com Specialise in Data and Process Integration Microsoft SQL Server MVP Twitter: allansqlis E:
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationProject Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science
Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: http://www.acsu.buffalo.edu/~mjalimin/
More informationHow To Write A Mapreduce Program On An Ipad Or Ipad (For Free)
Course NDBI040: Big Data Management and NoSQL Databases Practice 01: MapReduce Martin Svoboda Faculty of Mathematics and Physics, Charles University in Prague MapReduce: Overview MapReduce Programming
More informationUbuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
More informationWhat is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani
What is Big Data? Concepts, Ideas and Principles Hitesh Dharamdasani # whoami Security Researcher, Malware Reversing Engineer, Developer GIT > George Mason > UC Berkeley > FireEye > On Stage Building Data-driven
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationHadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio p.donoriodemeo@cineca.it m.dantonio@cineca.it Big Data Too much information! Big Data Explosive data growth proliferation of data capture
More informationSriram Krishnan, Ph.D. sriram@sdsc.edu
Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHadoop Distributed File System (HDFS) Overview
2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized
More informationMicrosoft Azure Data Technologies: An Overview
David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...
More informationParallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationPlay with Big Data on the Shoulders of Open Source
OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationData Intensive Computing Handout 6 Hadoop
Data Intensive Computing Handout 6 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.
More informationConnecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationDealing with Data Especially Big Data
Dealing with Data Especially Big Data INFO-GB-2346.30 Spring 2016 Very Rough Draft Subject to Change Professor Norman White Background: Most courses spend their time on the concepts and techniques of analyzing
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationMapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
More informationBringing Big Data to People
Bringing Big Data to People Microsoft s modern data platform SQL Server 2014 Analytics Platform System Microsoft Azure HDInsight Data Platform Everyone should have access to the data they need. Process
More informationSecuring Big Data at Rest with Encryption for Hadoop, Cassandra and MongoDB on Red Hat.
Securing Big Data at Rest with Encryption for Hadoop, Cassandra and MongoDB on Red Hat. Alex Gonzalez Software Engineer 1 Content What is Big Data 3 important No-SQL players Who uses Big Data Use Cases
More informationA Study of Data Management Technology for Handling Big Data
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,
More informationCSCI6900 Assignment 2: Naïve Bayes on Hadoop
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 2: Naïve Bayes on Hadoop DUE: Friday, September 18 by 11:59:59pm Out September 4, 2015 1 IMPORTANT NOTES You are expected to use
More information