Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science

Size: px
Start display at page:

Download "Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science"

Transcription

1 Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: Submitted By Mahesh Jaliminche Mohan Upadhyay

2 Table of Contents Abstract Project Objectives Project Approach Installing Hadoop on your own machine Data Collection: Use of twitter Streaming API Data-Intensive Analysis of Twitter Data using Map Reduce Finding most trending words, hash Implementing word-co-occurrence algorithm, both pairs and stripes approach. Clustering: using Map Reduce version of K-means. Applying the MR version of shortest path algorithm to label the edges of the network/graph. Web enabled visualization of project Lessoned Learned

3 1. Abstract This project focus on understanding the Hadoop architecture and implementation of Map Reduce on this architecture. The objective of this project is to aggregate data from twitter using twitter API and apply Map Reduce to find most trending words, Hash Tags, find most co-occurring Hash Tags. To understand and implement the word-co-occurrence algorithm, both pairs and stripes. We will use Map Reduce version of K-means to cluster the tweeters by number of followers they have which is further used to marketing and targeting ads/message. Finally we learn and implement MR version of shortest path algorithm to label the edges of the network/graph. We also analyze the discovered knowledge using the R language. During our implementation, we plan to use different metrics like most trending words, hash hashtags, relative frequency of co-occurring hashtags, 2. Project Objectives At the end of this project, the following objectives will be covered. To understand the components and core technologies related to content retrieval, storage and dataintensive computing analysis. To understand twitter API to collect streaming tweets. To explore designing and implementing data-intensive (Big-data) computing solutions using Map Reduce (MR) programming model. Setup Hadoop 2 for HDFS and MR infrastructure. Thorough analysis of data using R. Extract meaningful results. Come to a conclusion about the dataset in hand. Web enabled visualization of project (

4 3. Project Approach Twitter API: The Streaming API is the real-time sample of the Twitter Firehose. This API is for those developers with data intensive needs. Streaming API allows for large quantities of keywords to be specified and tracked, retrieving geo-tagged tweets from a certain region, or have the public statuses of a user set returned. We used twitter API to get tweet text, date, username, follower count, retweet count. We also used filter query to collect topic specific tweet. Data aggregator: we collect the data from twitter using Streaming API. Once the data is received, we clean the unwanted details from the tweets and save them in a set of regular files in a designated directory. Data: Once the data collected we put this data into Hadoop s /input directory so that map reduce program can read the data from this folder. Data-Intensive Analysis(MR): o Setup Hadoop 2 for HDFS and Map Reduce infrastructure. o Designing and implementing the various MR workflows to extract various information from the data. (I) simple word count (ii) trends (iii) #tag counts (iv)@xyz counts etc. o Implementing word-co-occurrence algorithm, both pairs and stripes approach. o Clustering: using Map Reduce version of K-means discussed in class. o Applying the MR version of shortest path algorithm to label the edges of the network/graph. Discovery: From the MR implementation we discover different knowledge about most trending words, hash tags, most co-occurring hash tags. The output files are converted into csv and visualization on this data is done in R.

5 Visualization: We analyze the discovered knowledge in R. We plot various graph to find the most trending words, hash tags. We analyze the data on daily/ weekly basis to find the current trend. Also uploads the result on website ( 4. Installing Hadoop on your own machine Installing guide URL: We followed the installing guide to install Hadoop on our machine Virtual Machine configuration: o Ubuntu x64 o 2 CPUs, 2GB RAM o 12GB Virtual Hard Disk We learned various component of Hadoop such as: o Map Reduce o HDFS o Yarn Fig: Hadoop Infrastructure From this setup we understood the underlying infrastructure of Hadoop and its power to solve Big-Data problem.

6 5. Data Collection: Use of Twitter Streaming API. We used twitter Streaming API to collect tweets. We used the tweetcollector project given by TA and modified it to extract topic specific tweets. We collected tweet text, date, user name, follower count and retweet count. We used Filter query provided by twitter API to collect topic specific tweet. Before storing the data into text file we cleaned the unwanted data. In cleaning process we removed stop words, punctuation, special characters. Workflow: Source: Twitter Data Format: we collected tweet text, date, retweet count, username, follower count day wise and week wise. Our data format is TweetText,Date,RetweetCount,UserName,FollowerCount Eg: Srinivasan Meyiappan goes CSK starts lose Bring them back guys #CskvsKXIP #KXIPvsCSK, Fri_Apr_18_21:31:30_EDT_2014, 0, Dabbu_k_papa_ji, 60 ; Amount of Data Collected: Tweets

7 6. Data-Intensive Analysis of Twitter Data using Map Reduce a. Finding most trending words, hash Algorithm: Once the data is collected we run Map Reduce program of word count on the collected data to find most trending words, We implemented a custom partition which partition the data and send it to a particular reducer. CLASS DRIVER MAIN METHOD 1) Run Mapper 2) Set NO_OF_REDUCERS 3) Run Practitioner 4) Run Reducer CLASS MAPPER METHOD MAP (Key k, Line l) 1) Emit (token, 1) CLASS PARTITIONER METHOD GET_PARTITION 1) If KEY is HASH_TAG Assign to Reducer1 Else if KEY is USER_NAME Assign to Reducer2 Else Assign to Reducer0 CLASS REDUCER METHOD REDUCE (Word w, count <c1, c2...>) 1) SUM <- 0 2) For count c in <c1, c2...> SUM = SUM + c EMIT (Word w, SUM) Visualization: The output file of map reduce program is converted into csv so that it can be read into R. The command for converting the output into csv: :%s/\s\+/,/g We can open the csv file in MS excel to sort it.

8 R code to find top 10 trending words on 20 April myfile<-read.csv("c:\\users\\mahesh Jaliminche\\Desktop\\wordCount\\20apr0.csv") Flight_query= sqldf("select * from myfile order by Count desc LIMIT 10") Flight_query Word Count Maxwell IPL your APP Catch Vote backing choice chance From this data we can see that the most trending words on 20 April is about IPL and election in India Word cloud for most trending 100 words From this word cloud we can see the most trending words on 20 April is about IPL.

9 Word cloud for most trending 100 hash tags Word cloud for most trending

10 Top 10 trending Hashtags: Top Count

11 b. Implementing Word-Co-occurrence Algorithm (Pairs and Stripes) Algorithm: In this problem we find the co-occurring hash tags in a single tweet. There are 2 approaches to find co-occurrence hash tags, Pairs and Stripes approach. We also found the relative frequency of these co-occurring hashtags. Co-occurrence Stripe count CLASS DRIVER METHOD MAIN 1) Run Mapper 2) Run Reducer CLASS MAPPER METHOD MAP(docid a, doc d) 1) for all term w in doc d do H <- new AssociativeArray 2) for all term u in Neighbors(w) do H{u} <- H{u} + 1 3) Emit(Term w, Stripe H) CLASS REDUCER MRTHOD REDUCE(term w, stripes [H1, H2, H3,...]) 1) Hf <- new AssociativeArray 2) for all stripe H in stripes [H1, H2, H3,...] do Sum(Hf, H) for all terms x in stripe Hf 3) Emit(term w_x, x.value ) Algorithm: Co-occurrence Pair count CLASS DRIVER METHOD MAIN 1) Run Mapper 2) Set NO_OF_REDUCERS 3) Run Practitioner 4) Run Reducer CLASS MAPPER METHOD MAP(docid a, doc d) 1) for all term w in doc d do

12 for all term u in Neighbors(w) do Emit(pair (w, u), count 1) CLASS PARTITIONER METHOD GET_PARTITION 1) If KEY starts with a-e, A-E Assign to Reducer0 Else if KEY starts with f-j, F-J Assign to Reducer1 Else if Key starts with k-p, K-P Assign to Reducer2 Else if Key starts with q-z, Q-Z Assign to Reducer3 Else Assign to Reducer 4 CLASS REDUCER METHOD REDUCE(pair p, counts [c1, c2,...]) 1) s <- 0 2) for all count c in counts [c1, c2,...] do s <- s+c 3) Emit(pair p, count s) Algorithm: Co-occurrence Stripe Relative Frequency CLASS DRIVER METHOD MAIN 1) Run Mapper 2) Run Reducer CLASS MAPPER METHOD MAP(docid a, doc d) 1) for all term w in doc d do H <- new AssociativeArray 2) for all term u in Neighbors(w) do H{u} <- H{u} + 1 3) Emit(Term w, Stripe H) CLASS REDUCER MRTHOD REDUCE(term w, stripes [H1, H2, H3,...]) 1) Hf <- new AssociativeArray 2) Int count<-0; 2) for all stripe H in stripes [H1, H2, H3,...] do { Sum(Hf, H)

13 count=count+h.value } for all terms x in stripe H 3) Emit(term w_x, x.value/count ) Algorithm: Co-occurrence Pair Relative Frequency CLASS DRIVER METHOD MAIN 1) Run Mapper1 2) Run Reducer1 3) Run Mapper2 4) Set NO_OF_REDUCERS 5) Run Practitioner 6) Run Reducer2 CLASS MAPPER1 METHOD MAP(docid a, doc d) 1) for all term w in doc d do for all term u in Neighbors(w) do Emit(pair (w, *), count 1) CLASS MAPPER2 METHOD MAP(docid a, doc d) 1) for all term w in doc d do for all term u in Neighbors(w) do Emit(pair (w, u), count 1) CLASS PARTITIONER METHOD GET_PARTITION 1) If KEY starts with a-e, A-E Assign to Reducer0 Else if KEY starts with f-j, F-J Assign to Reducer1 Else if Key starts with k-p, K-P Assign to Reducer2 Else if Key starts with q-z, Q-Z Assign to Reducer3 Else Assign to Reducer 4 CLASS REDUCER1

14 METHOD REDUCE(pair p, counts [c1, c2,...]) 1) s <- 0 2) for all count c in counts [c1, c2,...] do s <- s+c 3) Emit(pair p, count s) CLASS REDUCER2 METHOD REDUCE(pair p, counts [c1, c2,...]) 1) s <- 0 2) Map H <-output of 1 st reducer 2) for all count c in counts [c1, c2,...] do s <- s+c 3) s<- s/h.get(p[0]) 3) Emit(pair p, count s) Visualization We categorized the relative frequency into 5 category and find the number of co-occurring hash lying in those category. group RF.Length RF.Minimum RF.Mean RF.Maximum 1 (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1] Pie chart representation of category: Inference: The Relative frequency of most of the co-occurring hash tags lie between 0-0.2

15 Top 10 co-occurring hash tags:

16 c. Clustering: using Map Reduce version of K-means The algorithm checks new centroids with centroids generated during previous iterations. Counters are used to track whether the previous centroids are same as the new centroids. If the centroids are the same then convergence is reached and the loop is terminated. The mapper takes user id and follower count as input and finds the centroid nearest to the user. The output of mapper method is centroid id and pairs of user id and follower count. The framework aggregates all values emitted for same centroid id. The reducer takes in centroid id and aggregated pairs of user id and followers, calculates the new centroid and updates the counters. Algorithm: CLASS DRIVER 1) Declare enum CONVERGE {COUNTER, ITERATION} Declare INPUT_FILE_PATH and OUTPUT_FILE_PATH 2) While COUNTER > 0 do Run MAPPER Run REDUCER Fetch COUNTER value Update INPUT_FILE_PATH and OUTPUT_FILE_PATH CLASS MAPPER METHOD MAP (UserId id, FollowerCount followers) 1) Fetch updated CENTROIDS from file 2) For centroid in CENTROIDS do Calculate DISTANCE_CENTROID <- absolute(centroid-followers) 3) Get CENTROID_ID <- min(distance_centroid) 4) EMIT (CENTROID_ID,[UserId id, FollowerCount followers]) CLASS REDUCER CONSTRUCTOR REDUCER 1) Fetch centroids from previous iterations from file Array PREVIOUS_CENTROIDS <- Fetch from file METHOD REDUCER (CentroidId id, <[UserId id1, FolloweCount followers1],[userid id2, FolloweCount followers2]...>) 1) SUM <- 0 COUNT <- 0 NEW_CENTROID <- 0 2) For records R in < [UserId id1, FolloweCount followers1],[userid id2, FolloweCount followers2]...> SUM <- SUM + R.Followers COUNT <- COUNT +1 3) NEW_CENTROID <- SUM/COUNT 4) Append NEW_CENTROID in File 5) ITERATION <- VALUE_OF (CONVERGE.ITERATIONS) If mod (ITERATION,3)==0 Then CONVERGE.COUNTER <- 0

17 K-Means plot If PREVIOUS_CENTROIDS [ITERATION%3]!= NEW_CENTROID CONVERGE.COUNTER <- CONVERGE.COUNTER + 1 6) Emit (UserId id, followers + NEW_CENTROID)

18 d. Applying the MR version of shortest path algorithm to label the edges of the network/graph Algorithm : Explanation: The algorithm uses counters to check for convergence. The mapper takes node id and adjacency list as input and updates the distance of the node from adjacent nodes. The mapper emits nodes from adjacency list and updated distance. The framework aggregates all values for a particular node id. The reducer takes node id and aggregated list of distance for that node and updates the adjacency list for that node. Counters are updated and are checked in the driver class for convergence. CLASS DRIVER 1) Declare enum CONVERGE {COUNTER} Declare InputFilePath and OutputFilePath 2) While COUNTER > 0 Do Run MAPPER Run REDUCER Update InputFilePath and OutputFilePath CLASS MAPPER FUNCTION MAP (NodeId n, Node N ) 1) Read DISTANCE and ADJANCENCY_LIST for the node n DISTANCE <- N.Distance ADJACENCY_LIST <- N.AdjacencyList 2) Emit (NodeId n, N) 3) for nodes m in ADJANCENCY LIST Emit (NodeId m, DISTANCE + 1)

19 CLASS REDUCER Declare FIRST_ITERATION CONSTRUCTOR REDUCER Initialize FIRST_ITERATION=TRUE METHOD REDUCE(NodeId m, [d1,d2...]) 1) Initialize dmin< ) M <- null 3) for all d in counts [d1,d2...] do 4) if IsNode(d) then 5) M <- d DISTANCE<-d.DISTANCE 6) Else if d < dmin then 7) dmin <- d 8) M.Distance <- dmin 10) Emit (nid m; node M) 11) If FIRST_ITERATION == TRUE then Set COUNTER <- 0 Set FIRST_ITERATION <- FALSE 12) If DISTANCE!=dmin Set COUNTER <- COUNTER+1 Dijkstra Graph plot using Gephi.

20 7. Web enabled Visualization of Project We have designed a website and uploaded all the result on the website. Website: 8. Lesson Learned: a. Learned Hadoop infrastructure b. Data analysis using Map Reduce

21 8. Web enabled Visualization of Project

Twitter Data Analysis: Hadoop2 Map Reduce Framework

Twitter Data Analysis: Hadoop2 Map Reduce Framework Data Intensive Computing CSE 487/587 Project Report 2 Twitter Data Analysis: Hadoop2 Map Reduce Framework Masters in Computer Science University at Buffalo Submitted By Prachi Gokhale (prachich@buffalo.edu)

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

BIG DATA - HADOOP PROFESSIONAL amron

BIG DATA - HADOOP PROFESSIONAL amron 0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Client Overview. Engagement Situation. Key Requirements

Client Overview. Engagement Situation. Key Requirements Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

MINING DATA FROM TWITTER. Abhishanga Upadhyay Luis Mao Malavika Goda Krishna

MINING DATA FROM TWITTER. Abhishanga Upadhyay Luis Mao Malavika Goda Krishna MINING DATA FROM TWITTER Abhishanga Upadhyay Luis Mao Malavika Goda Krishna 1 Abstract The purpose of this report is to illustrate how to data mine Twitter to anyone with a computer and Internet access.

More information

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

A Study of Data Management Technology for Handling Big Data

A Study of Data Management Technology for Handling Big Data Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

MapReduce: Algorithm Design Patterns

MapReduce: Algorithm Design Patterns Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 505-510 International Research Publications House http://www. irphouse.com /ijict.htm K-means

More information

Assignment 2: More MapReduce with Hadoop

Assignment 2: More MapReduce with Hadoop Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Lustre * Filesystem for Cloud and Hadoop *

Lustre * Filesystem for Cloud and Hadoop * OpenFabrics Software User Group Workshop Lustre * Filesystem for Cloud and Hadoop * Robert Read, Intel Lustre * for Cloud and Hadoop * Brief Lustre History and Overview Using Lustre with Hadoop Intel Cloud

More information

Hadoop Development & BI- 0 to 100

Hadoop Development & BI- 0 to 100 Development Master the Data Analysis tools like Pig and hive Data Science Hadoop Development & BI- 0 to 100 Build a recommendation engine Hadoop Development - 0 to 100 HADOOP SCHOOL OF TRAINING Basics

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

Big Data and Analytics: Challenges and Opportunities

Big Data and Analytics: Challenges and Opportunities Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Hands-on Exercises with Big Data

Hands-on Exercises with Big Data Hands-on Exercises with Big Data Lab Sheet 1: Getting Started with MapReduce and Hadoop The aim of this exercise is to learn how to begin creating MapReduce programs using the Hadoop Java framework. In

More information

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction

More information

How Companies are! Using Spark

How Companies are! Using Spark How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email n.roy@neu.edu if you have questions or need more clarifications. Nilay

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School

NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE A Thesis Presented to The Faculty of the Graduate School At the University of Missouri In Partial Fulfillment Of

More information

Data Intensive Computing Handout 6 Hadoop

Data Intensive Computing Handout 6 Hadoop Data Intensive Computing Handout 6 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.

More information

Assignment 5: Visualization

Assignment 5: Visualization Assignment 5: Visualization Arash Vahdat March 17, 2015 Readings Depending on how familiar you are with web programming, you are recommended to study concepts related to CSS, HTML, and JavaScript. The

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

More information

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @ The Top 10 7 Hadoop Patterns and Anti-patterns Alex Holmes @ whoami Alex Holmes Software engineer Working on distributed systems for many years Hadoop since 2008 @grep_alex grepalex.com what s hadoop...

More information

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 info@rittmanmead.com www.rittmanmead.com @rittmanmead About the Speaker Mark

More information

Real World Hadoop Use Cases

Real World Hadoop Use Cases Real World Hadoop Use Cases JFokus 2013, Stockholm Eva Andreasson, Cloudera Inc. Lars Sjödin, King.com 1 2012 Cloudera, Inc. Agenda Recap of Big Data and Hadoop Analyzing Twitter feeds with Hadoop Real

More information

VOL. 5, NO. 2, August 2015 ISSN 2225-7217 ARPN Journal of Systems and Software 2009-2015 AJSS Journal. All rights reserved

VOL. 5, NO. 2, August 2015 ISSN 2225-7217 ARPN Journal of Systems and Software 2009-2015 AJSS Journal. All rights reserved Big Data Analysis of Airline Data Set using Hive Nillohit Bhattacharya, 2 Jongwook Woo Grad Student, 2 Prof., Department of Computer Information Systems, California State University Los Angeles nbhatta2

More information

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Keywords social media, internet, data, sentiment analysis, opinion mining, business Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction

More information

Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

DMX-h ETL Use Case Accelerator. Word Count

DMX-h ETL Use Case Accelerator. Word Count DMX-h ETL Use Case Accelerator Word Count Syncsort Incorporated, 2015 All rights reserved. This document contains proprietary and confidential material, and is only for use by licensees of DMExpress. This

More information

Cloud Computing. AWS a practical example. Hugo Pérez UPC. Mayo 2012

Cloud Computing. AWS a practical example. Hugo Pérez UPC. Mayo 2012 Cloud Computing AWS a practical example Mayo 2012 Hugo Pérez UPC -2- Index Introduction Infraestructure Development and Results Conclusions Introduction In order to know deeper about AWS services, mapreduce

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt Big Data Analytics in LinkedIn by Danielle Aring & William Merritt 2 Brief History of LinkedIn - Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/) - 2005: Introduced first business lines

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

University of Maryland. Tuesday, February 2, 2010

University of Maryland. Tuesday, February 2, 2010 Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

Big Data Frameworks: Scala and Spark Tutorial

Big Data Frameworks: Scala and Spark Tutorial Big Data Frameworks: Scala and Spark Tutorial 13.03.2015 Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides: http://is.gd/bigdatascala www.cs.helsinki.fi Functional Programming Functional

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA

A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA An Undergraduate Research Scholars Thesis by JASON M. BOLDEN Submitted to Honors and Undergraduate Research Texas

More information

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast International Conference on Civil, Transportation and Environment (ICCTE 2016) Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast Xiaodong Zhang1, a, Baotian Dong1, b, Weijia Zhang2,

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

MapReduce for Data Warehouses

MapReduce for Data Warehouses MapReduce for Data Warehouses Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Professional Hadoop Solutions

Professional Hadoop Solutions Brochure More information from http://www.researchandmarkets.com/reports/2542488/ Professional Hadoop Solutions Description: The go-to guidebook for deploying Big Data solutions with Hadoop Today's enterprise

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

This exam contains 17 pages (including this cover page) and 21 questions. Check to see if any pages are missing.

This exam contains 17 pages (including this cover page) and 21 questions. Check to see if any pages are missing. Big Data Processing 2015-2016 Q2 January 29, 2016 Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer, continue

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Apigee Insights Increase marketing effectiveness and customer satisfaction with API-driven adaptive apps

Apigee Insights Increase marketing effectiveness and customer satisfaction with API-driven adaptive apps White provides GRASP-powered big data predictive analytics that increases marketing effectiveness and customer satisfaction with API-driven adaptive apps that anticipate, learn, and adapt to deliver contextual,

More information

Hadoop Project for IDEAL in CS5604

Hadoop Project for IDEAL in CS5604 Hadoop Project for IDEAL in CS5604 by Jose Cadena Mengsu Chen Chengyuan Wen {jcadena,mschen,wechyu88@vt.edu Completed as part of the course CS5604: Information storage and retrieval offered by Dr. Edward

More information

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

More information

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm

Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm Goal. In this project you will use Hadoop to build a tool for processing sets of Twitter posts (i.e. tweets) and determining which people, tweets,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Content-Based Discovery of Twitter Influencers

Content-Based Discovery of Twitter Influencers Content-Based Discovery of Twitter Influencers Chiara Francalanci, Irma Metra Department of Electronics, Information and Bioengineering Polytechnic of Milan, Italy irma.metra@mail.polimi.it chiara.francalanci@polimi.it

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411

More information

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to

More information

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Tweets as big data. Rob Procter and Alex Voss. www.analysingsocialmedia.org. rob.procter@manchester.ac.uk alex.voss@st andrews.ac.

Tweets as big data. Rob Procter and Alex Voss. www.analysingsocialmedia.org. rob.procter@manchester.ac.uk alex.voss@st andrews.ac. Tweets as big data Rob Procter and Alex Voss www.analysingsocialmedia.org rob.procter@manchester.ac.uk alex.voss@st andrews.ac.uk SRA, December 10th 2012 1 Overview The qualitative data deluge A small

More information

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information