Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science
Data Intensive Computing CSE 486/586
Project Report
BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS
FOUNDATIONS OF DATA-INTENSIVE COMPUTING
Masters in Computer Science
University at Buffalo
Website:
Submitted By
Mahesh Jaliminche
Mohan Upadhyay
Table of Contents
1. Abstract
2. Project Objectives
3. Project Approach
4. Installing Hadoop on your own machine
5. Data Collection: Use of Twitter Streaming API
6. Data-Intensive Analysis of Twitter Data using Map Reduce
   a. Finding most trending words, hash tags
   b. Implementing word-co-occurrence algorithm, both pairs and stripes approach
   c. Clustering: using Map Reduce version of K-means
   d. Applying the MR version of shortest path algorithm to label the edges of the network/graph
7. Web enabled visualization of project
8. Lessons Learned
1. Abstract
This project focuses on understanding the Hadoop architecture and the implementation of Map Reduce on this architecture. The objective is to aggregate data from Twitter using the Twitter API and to apply Map Reduce to find the most trending words, the most trending hash tags and the most co-occurring hash tags. We study and implement the word-co-occurrence algorithm with both the pairs and the stripes approach. We use the Map Reduce version of K-means to cluster Twitter users by the number of followers they have; the clusters can then be used for marketing and for targeting ads/messages. Finally, we learn and implement the MR version of the shortest path algorithm to label the edges of the network/graph. We also analyze the discovered knowledge using the R language. During our implementation, we plan to use different metrics such as most trending words, hashtags and the relative frequency of co-occurring hashtags.

2. Project Objectives
At the end of this project, the following objectives will be covered:
o To understand the components and core technologies related to content retrieval, storage and data-intensive computing analysis.
o To understand the Twitter API in order to collect streaming tweets.
o To explore designing and implementing data-intensive (Big-data) computing solutions using the Map Reduce (MR) programming model.
o Set up Hadoop 2 for the HDFS and MR infrastructure.
o Thorough analysis of the data using R.
o Extract meaningful results.
o Come to a conclusion about the dataset in hand.
o Web enabled visualization of the project.
3. Project Approach
Twitter API: The Streaming API provides a real-time sample of the Twitter Firehose. This API is meant for developers with data-intensive needs. It allows large quantities of keywords to be specified and tracked, geo-tagged tweets from a certain region to be retrieved, or the public statuses of a set of users to be returned. We used the Twitter API to get the tweet text, date, username, follower count and retweet count. We also used a filter query to collect topic-specific tweets.

Data aggregator: We collect the data from Twitter using the Streaming API. Once the data is received, we clean the unwanted details from the tweets and save them in a set of regular files in a designated directory.

Data: Once the data is collected, we put it into Hadoop's /input directory so that the Map Reduce program can read the data from this folder.

Data-Intensive Analysis (MR):
o Set up Hadoop 2 for the HDFS and Map Reduce infrastructure.
o Design and implement the various MR workflows to extract information from the data: (i) simple word count, (ii) trends, (iii) #tag counts, (iv) @xyz counts, etc.
o Implement the word-co-occurrence algorithm, both pairs and stripes approach.
o Clustering: using the Map Reduce version of K-means discussed in class.
o Apply the MR version of the shortest path algorithm to label the edges of the network/graph.

Discovery: From the MR implementation we discover knowledge about the most trending words, the most trending hash tags and the most co-occurring hash tags. The output files are converted into csv, and the visualization of this data is done in R.
Visualization: We analyze the discovered knowledge in R. We plot various graphs to find the most trending words and hash tags. We analyze the data on a daily/weekly basis to find the current trend. We also upload the results to the website.

4. Installing Hadoop on your own machine
Installing guide URL:
We followed the installation guide to install Hadoop on our machine.
Virtual Machine configuration:
o Ubuntu x64
o 2 CPUs, 2GB RAM
o 12GB Virtual Hard Disk
We learned the various components of Hadoop, such as:
o Map Reduce
o HDFS
o Yarn
Fig: Hadoop Infrastructure
From this setup we understood the underlying infrastructure of Hadoop and its power to solve Big-Data problems.
5. Data Collection: Use of Twitter Streaming API
We used the Twitter Streaming API to collect tweets. We took the tweetcollector project given by the TA and modified it to extract topic-specific tweets, using the filter query provided by the Twitter API. We collected the tweet text, date, user name, follower count and retweet count. Before storing the data in a text file we cleaned it: in the cleaning process we removed stop words, punctuation and special characters.

Workflow:
Source: Twitter

Data Format: We collected the tweet text, date, retweet count, user name and follower count, day wise and week wise. Our data format is
TweetText,Date,RetweetCount,UserName,FollowerCount
Eg: Srinivasan Meyiappan goes CSK starts lose Bring them back guys #CskvsKXIP #KXIPvsCSK, Fri_Apr_18_21:31:30_EDT_2014, 0, Dabbu_k_papa_ji, 60 ;

Amount of Data Collected: Tweets
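The record format and the cleaning step described above can be illustrated with a small Python sketch. It is not the tweetcollector code: the stop-word list, the `Tweet` type and the function names are hypothetical, and the sketch assumes that commas have already been removed from the tweet text by the cleaning step, so the four trailing fields can be split off from the right.

```python
import re
from typing import NamedTuple

# Hypothetical, deliberately tiny stop-word list; the report does not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}


class Tweet(NamedTuple):
    text: str
    date: str
    retweets: int
    user: str
    followers: int


def clean_text(raw: str) -> str:
    """Replace punctuation and special characters (except # and @) with spaces, then drop stop words."""
    tokens = re.sub(r"[^\w#@\s]", " ", raw).split()
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)


def parse_record(line: str) -> Tweet:
    """Parse one 'TweetText,Date,RetweetCount,UserName,FollowerCount' record."""
    # Splitting from the right keeps the tweet text in one piece.
    text, date, retweets, user, followers = (
        part.strip() for part in line.rstrip(" ;\n").rsplit(",", 4)
    )
    return Tweet(text, date, int(retweets), user, int(followers))
```

For the example record shown above, `parse_record` would yield the user name `Dabbu_k_papa_ji` with a follower count of 60 and a retweet count of 0.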
6. Data-Intensive Analysis of Twitter Data using Map Reduce
a. Finding most trending words, hash tags
Algorithm: Once the data is collected, we run a Map Reduce word count program on it to find the most trending words. We implemented a custom partitioner which partitions the keys and sends each key to a particular reducer.

CLASS DRIVER
  MAIN METHOD
    1) Run Mapper
    2) Set NO_OF_REDUCERS
    3) Run Partitioner
    4) Run Reducer

CLASS MAPPER
  METHOD MAP (Key k, Line l)
    1) For each token in Line l
         Emit (token, 1)

CLASS PARTITIONER
  METHOD GET_PARTITION
    1) If KEY is HASH_TAG
         Assign to Reducer1
       Else if KEY is USER_NAME
         Assign to Reducer2
       Else
         Assign to Reducer0

CLASS REDUCER
  METHOD REDUCE (Word w, count <c1, c2...>)
    1) SUM <- 0
    2) For count c in <c1, c2...>
         SUM = SUM + c
       EMIT (Word w, SUM)

Visualization: The output file of the Map Reduce program is converted into csv so that it can be read into R. The command for converting the output into csv:
    :%s/\s\+/,/g
We can also open the csv file in MS Excel to sort it.
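The word-count job with its custom partitioner can be simulated in a few lines of plain Python. This is only a sketch of the data flow, not the Hadoop Java implementation used in the project. It assumes that hash tags start with `#` and user names with `@`, and all function names are hypothetical.

```python
from collections import defaultdict

NUM_REDUCERS = 3  # Reducer0: plain words, Reducer1: hash tags, Reducer2: user names


def map_tokens(line):
    """Mapper: emit (token, 1) for every whitespace-separated token."""
    for token in line.split():
        yield token, 1


def get_partition(key):
    """Partitioner: route hash tags and user names to their own reducers."""
    if key.startswith("#"):
        return 1
    if key.startswith("@"):
        return 2
    return 0


def reduce_counts(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(lines):
    """Simulate map, shuffle (partition + group by key) and reduce in memory."""
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in lines:
        for key, value in map_tokens(line):
            partitions[get_partition(key)][key].append(value)
    # One output dictionary per reducer, like one part file per reducer.
    return [dict(reduce_counts(k, v) for k, v in sorted(p.items())) for p in partitions]
```

Calling `run_word_count` on a list of cleaned tweet texts returns three dictionaries, one per reducer, so words, hash tags and user names can be ranked separately.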
R code to find the top 10 trending words on 20 April:

    library(sqldf)
    myfile <- read.csv("c:\\users\\mahesh Jaliminche\\Desktop\\wordCount\\20apr0.csv")
    Flight_query = sqldf("select * from myfile order by Count desc LIMIT 10")
    Flight_query

    Word      Count
    Maxwell
    IPL
    your
    APP
    Catch
    Vote
    backing
    choice
    chance

From this data we can see that the most trending words on 20 April are about the IPL and the election in India.

Word cloud for most trending 100 words
From this word cloud we can see that the most trending words on 20 April are about the IPL.
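For readers without R, the conversion of the Map Reduce output to csv and the top-10 selection can also be sketched in Python. This is an alternative illustration, not the workflow used in the project. The function names are hypothetical, and the `Word` and `Count` column names are assumed from the query above.

```python
import csv
import io
import re


def output_to_csv(mr_output: str) -> str:
    """Turn whitespace-separated 'word count' lines into csv text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Word", "Count"])
    for line in mr_output.splitlines():
        if line.strip():
            # Same idea as the substitution command: runs of whitespace become commas.
            writer.writerow(re.split(r"\s+", line.strip()))
    return buf.getvalue()


def top_n(csv_text: str, n: int = 10):
    """Return the n rows with the highest Count, like 'order by Count desc LIMIT n'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r["Count"]), reverse=True)
    return [(r["Word"], int(r["Count"])) for r in rows[:n]]
```

Reading a reducer output file into a string and passing it through `output_to_csv` and then `top_n` gives a list of (word, count) tuples sorted by descending count.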
Word cloud for most trending 100 hash tags
Word cloud for most trending
Top 10 trending Hashtags:
b. Implementing Word-Co-occurrence Algorithm (Pairs and Stripes)
In this problem we find the hash tags that co-occur in a single tweet. There are two approaches to finding co-occurring hash tags: the pairs approach and the stripes approach. We also computed the relative frequency of these co-occurring hash tags.

Algorithm: Co-occurrence Stripe count

CLASS DRIVER
  METHOD MAIN
    1) Run Mapper
    2) Run Reducer

CLASS MAPPER
  METHOD MAP(docid a, doc d)
    1) for all term w in doc d do
         H <- new AssociativeArray
    2)   for all term u in Neighbors(w) do
           H{u} <- H{u} + 1
    3)   Emit(Term w, Stripe H)

CLASS REDUCER
  METHOD REDUCE(term w, stripes [H1, H2, H3,...])
    1) Hf <- new AssociativeArray
    2) for all stripe H in stripes [H1, H2, H3,...] do
         Sum(Hf, H)
       for all terms x in stripe Hf
    3)   Emit(term w_x, x.value)

Algorithm: Co-occurrence Pair count

CLASS DRIVER
  METHOD MAIN
    1) Run Mapper
    2) Set NO_OF_REDUCERS
    3) Run Partitioner
    4) Run Reducer

CLASS MAPPER
  METHOD MAP(docid a, doc d)
    1) for all term w in doc d do
         for all term u in Neighbors(w) do
           Emit(pair (w, u), count 1)

CLASS PARTITIONER
  METHOD GET_PARTITION
    1) If KEY starts with a-e, A-E
         Assign to Reducer0
       Else if KEY starts with f-j, F-J
         Assign to Reducer1
       Else if KEY starts with k-p, K-P
         Assign to Reducer2
       Else if KEY starts with q-z, Q-Z
         Assign to Reducer3
       Else
         Assign to Reducer4

CLASS REDUCER
  METHOD REDUCE(pair p, counts [c1, c2,...])
    1) s <- 0
    2) for all count c in counts [c1, c2,...] do
         s <- s + c
    3) Emit(pair p, count s)

Algorithm: Co-occurrence Stripe Relative Frequency

CLASS DRIVER
  METHOD MAIN
    1) Run Mapper
    2) Run Reducer

CLASS MAPPER
  METHOD MAP(docid a, doc d)
    1) for all term w in doc d do
         H <- new AssociativeArray
    2)   for all term u in Neighbors(w) do
           H{u} <- H{u} + 1
    3)   Emit(Term w, Stripe H)

CLASS REDUCER
  METHOD REDUCE(term w, stripes [H1, H2, H3,...])
    1) Hf <- new AssociativeArray
       count <- 0
    2) for all stripe H in stripes [H1, H2, H3,...] do {
         Sum(Hf, H)
         count <- count + (sum of the values in H)
       }
       for all terms x in stripe Hf
    3)   Emit(term w_x, x.value/count)

Algorithm: Co-occurrence Pair Relative Frequency

CLASS DRIVER
  METHOD MAIN
    1) Run Mapper1
    2) Run Reducer1
    3) Run Mapper2
    4) Set NO_OF_REDUCERS
    5) Run Partitioner
    6) Run Reducer2

CLASS MAPPER1
  METHOD MAP(docid a, doc d)
    1) for all term w in doc d do
         for all term u in Neighbors(w) do
           Emit(pair (w, *), count 1)

CLASS MAPPER2
  METHOD MAP(docid a, doc d)
    1) for all term w in doc d do
         for all term u in Neighbors(w) do
           Emit(pair (w, u), count 1)

CLASS PARTITIONER
  METHOD GET_PARTITION
    1) If KEY starts with a-e, A-E
         Assign to Reducer0
       Else if KEY starts with f-j, F-J
         Assign to Reducer1
       Else if KEY starts with k-p, K-P
         Assign to Reducer2
       Else if KEY starts with q-z, Q-Z
         Assign to Reducer3
       Else
         Assign to Reducer4

CLASS REDUCER1
  METHOD REDUCE(pair p, counts [c1, c2,...])
    1) s <- 0
    2) for all count c in counts [c1, c2,...] do
         s <- s + c
    3) Emit(pair p, count s)

CLASS REDUCER2
  METHOD REDUCE(pair p, counts [c1, c2,...])
    1) s <- 0
    2) Map H <- output of the 1st reducer
    3) for all count c in counts [c1, c2,...] do
         s <- s + c
    4) s <- s / H.get(p[0])
    5) Emit(pair p, count s)

Visualization
We categorized the relative frequencies into 5 categories and found the number of co-occurring hash tags lying in each category.

    group        RF.Length  RF.Minimum  RF.Mean  RF.Maximum
    (0,0.2]
    (0.2,0.4]
    (0.4,0.6]
    (0.6,0.8]
    (0.8,1]

Pie chart representation of the categories:

Inference: The relative frequency of most of the co-occurring hash tags lies between 0 and 0.2.
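The stripes approach with relative frequencies can be traced with a small in-memory Python sketch. It is an illustration under assumptions, not the project's Hadoop code: Neighbors(w) is taken to be every other hash tag in the same tweet, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def hashtags(tweet):
    """Return the hash tags of a tweet in order of appearance."""
    return [t for t in tweet.split() if t.startswith("#")]


def map_stripes(tweet):
    """Mapper: for each hash tag w emit one stripe counting its neighbours."""
    tags = hashtags(tweet)
    for i, w in enumerate(tags):
        stripe = Counter(u for j, u in enumerate(tags) if j != i)
        if stripe:
            yield w, stripe


def reduce_stripes(w, stripes):
    """Reducer: merge the stripes of w element-wise, then divide by the marginal."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    marginal = sum(total.values())
    return {(w, u): c / marginal for u, c in total.items()}


def stripes_relative_frequency(tweets):
    """Simulate map, group-by-key and reduce for a list of tweets."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for w, stripe in map_stripes(tweet):
            grouped[w].append(stripe)
    result = {}
    for w, stripes in grouped.items():
        result.update(reduce_stripes(w, stripes))
    return result
```

Because the reducer for a hash tag w sees all of its neighbours at once, the marginal count is available locally and a single job is enough. The relative frequencies of all pairs that share the same first hash tag sum to 1.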
Top 10 co-occurring hash tags:
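The pairs approach to relative frequency in section b uses two jobs: the first counts the marginal (w, *), and the second counts each pair (w, u) and divides by that marginal. The Python sketch below simulates both jobs in memory. It assumes that Neighbors(w) is every other hash tag in the same tweet and that the partitioner looks at the first letter after the `#` sign. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def hashtags(tweet):
    """Return the hash tags of a tweet in order of appearance."""
    return [t for t in tweet.split() if t.startswith("#")]


def map_marginals(tweet):
    """Mapper1: emit ((w, *), 1) once per neighbour of w."""
    for w, _ in permutations(hashtags(tweet), 2):
        yield (w, "*"), 1


def map_pairs(tweet):
    """Mapper2: emit ((w, u), 1) for every ordered pair of co-occurring hash tags."""
    for w, u in permutations(hashtags(tweet), 2):
        yield (w, u), 1


def get_partition(pair):
    """Partitioner: choose a reducer from the first letter of the left hash tag."""
    first = pair[0].lstrip("#")[:1].lower()
    if "a" <= first <= "e":
        return 0
    if "f" <= first <= "j":
        return 1
    if "k" <= first <= "p":
        return 2
    if "q" <= first <= "z":
        return 3
    return 4


def sum_by_key(emitted):
    """Reducer1 and the summing part of Reducer2: add up the counts per key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def pairs_relative_frequency(tweets):
    """Job 1 computes the marginals, job 2 divides each pair count by its marginal."""
    marginals = sum_by_key(kv for t in tweets for kv in map_marginals(t))
    pair_counts = sum_by_key(kv for t in tweets for kv in map_pairs(t))
    return {p: c / marginals[(p[0], "*")] for p, c in pair_counts.items()}
```

Compared with stripes, this variant emits many more small key-value pairs, but each reducer only needs to keep a running sum plus the marginals produced by the first job.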
c. Clustering: using Map Reduce version of K-means
The algorithm compares the new centroids with the centroids generated during the previous iteration. Counters are used to track whether the previous centroids are the same as the new centroids. If the centroids are the same, convergence is reached and the loop is terminated.
The mapper takes a user id and follower count as input and finds the centroid nearest to the user. The output of the mapper is the centroid id together with the pair of user id and follower count. The framework aggregates all values emitted for the same centroid id. The reducer takes a centroid id and the aggregated pairs of user id and follower count, calculates the new centroid and updates the counters.

Algorithm:

CLASS DRIVER
  1) Declare enum CONVERGE {COUNTER, ITERATION}
     Declare INPUT_FILE_PATH and OUTPUT_FILE_PATH
  2) While COUNTER > 0 do
       Run MAPPER
       Run REDUCER
       Fetch COUNTER value
       Update INPUT_FILE_PATH and OUTPUT_FILE_PATH

CLASS MAPPER
  METHOD MAP (UserId id, FollowerCount followers)
    1) Fetch updated CENTROIDS from file
    2) For centroid in CENTROIDS do
         Calculate DISTANCE_CENTROID <- absolute(centroid - followers)
    3) Get CENTROID_ID <- min(DISTANCE_CENTROID)
    4) EMIT (CENTROID_ID, [UserId id, FollowerCount followers])

CLASS REDUCER
  CONSTRUCTOR REDUCER
    1) Fetch centroids from previous iterations from file
       Array PREVIOUS_CENTROIDS <- Fetch from file
  METHOD REDUCE (CentroidId id, <[UserId id1, FollowerCount followers1], [UserId id2, FollowerCount followers2]...>)
    1) SUM <- 0
       COUNT <- 0
       NEW_CENTROID <- 0
    2) For records R in <[UserId id1, FollowerCount followers1], [UserId id2, FollowerCount followers2]...>
         SUM <- SUM + R.Followers
         COUNT <- COUNT + 1
    3) NEW_CENTROID <- SUM/COUNT
    4) Append NEW_CENTROID in File
    5) ITERATION <- VALUE_OF (CONVERGE.ITERATION)
       If mod (ITERATION, 3) == 0 Then
         CONVERGE.COUNTER <- 0
       If PREVIOUS_CENTROIDS [ITERATION%3] != NEW_CENTROID
         CONVERGE.COUNTER <- CONVERGE.COUNTER + 1
    6) Emit (UserId id, followers + NEW_CENTROID)

K-Means plot
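The iterative K-means job on follower counts can be followed with a compact Python sketch. It keeps the structure of the description above (map to the nearest centroid, reduce to the mean, count changed centroids to detect convergence) but runs in memory instead of on Hadoop. The function names and the `max_iterations` safety limit are assumptions that are not part of the report.

```python
from collections import defaultdict


def nearest_centroid(followers, centroids):
    """Mapper logic: index of the centroid with the smallest absolute distance."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - followers))


def kmeans_iteration(users, centroids):
    """One map + reduce round; returns new centroids, the change counter and the clusters."""
    clusters = defaultdict(list)
    for user_id, followers in users:  # map phase
        clusters[nearest_centroid(followers, centroids)].append((user_id, followers))
    new_centroids = list(centroids)
    for cid, members in clusters.items():  # reduce phase: mean of each cluster
        new_centroids[cid] = sum(f for _, f in members) / len(members)
    # Plays the role of CONVERGE.COUNTER: number of centroids that moved.
    changed = sum(1 for old, new in zip(centroids, new_centroids) if old != new)
    return new_centroids, changed, clusters


def kmeans(users, centroids, max_iterations=100):
    """Driver: repeat until no centroid changes (or the assumed safety limit is hit)."""
    clusters = {}
    for _ in range(max_iterations):
        centroids, changed, clusters = kmeans_iteration(users, centroids)
        if changed == 0:
            break
    return centroids, dict(clusters)
```

With one-dimensional data such as follower counts, the distance is just the absolute difference, which is why the mapper step stays so small.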
d. Applying the MR version of shortest path algorithm to label the edges of the network/graph
Explanation: The algorithm uses counters to check for convergence. The mapper takes a node id and its adjacency list as input and updates the distance of the adjacent nodes from the source. The mapper emits the nodes from the adjacency list together with the updated distance. The framework aggregates all values for a particular node id. The reducer takes a node id and the aggregated list of distances for that node and updates the distance stored for that node. Counters are updated and are checked in the driver class for convergence.

Algorithm:

CLASS DRIVER
  1) Declare enum CONVERGE {COUNTER}
     Declare InputFilePath and OutputFilePath
  2) While COUNTER > 0 Do
       Run MAPPER
       Run REDUCER
       Update InputFilePath and OutputFilePath

CLASS MAPPER
  FUNCTION MAP (NodeId n, Node N)
    1) Read DISTANCE and ADJACENCY_LIST for the node n
       DISTANCE <- N.Distance
       ADJACENCY_LIST <- N.AdjacencyList
    2) Emit (NodeId n, N)
    3) for nodes m in ADJACENCY_LIST
         Emit (NodeId m, DISTANCE + 1)
CLASS REDUCER
  Declare FIRST_ITERATION
  CONSTRUCTOR REDUCER
    Initialize FIRST_ITERATION = TRUE
  METHOD REDUCE(NodeId m, [d1,d2...])
    1) dmin <- ∞
    2) M <- null
    3) for all d in [d1,d2...] do
    4)   if IsNode(d) then
    5)     M <- d
           DISTANCE <- d.DISTANCE
    6)   Else if d < dmin then
    7)     dmin <- d
    8) M.Distance <- dmin
    9) Emit (nid m; node M)
    10) If FIRST_ITERATION == TRUE then
          Set COUNTER <- 0
          Set FIRST_ITERATION <- FALSE
    11) If DISTANCE != dmin
          Set COUNTER <- COUNTER + 1

Dijkstra Graph plot using Gephi.
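The iterative shortest path job can be simulated in Python to see how the counter drives convergence. This is a sketch for unit edge weights, not the Hadoop implementation. It deviates from the pseudocode in one labelled place: the reducer keeps the smaller of the stored distance and the received minimum, so a node never loses a distance it already has. The function names and the `("NODE", ...)` / `("DIST", ...)` tagging are hypothetical.

```python
import math
from collections import defaultdict


def map_node(node_id, node):
    """Mapper: pass the node structure along and offer distance + 1 to each neighbour."""
    distance, adjacency = node
    yield node_id, ("NODE", node)
    for m in adjacency:
        yield m, ("DIST", distance + 1)


def reduce_node(node_id, values):
    """Reducer: recover the node structure and keep the smallest known distance."""
    d_min = math.inf
    node = (math.inf, [])  # default for nodes that only appear in adjacency lists
    for kind, payload in values:
        if kind == "NODE":
            node = payload
        elif payload < d_min:
            d_min = payload
    old_distance, adjacency = node
    # Deviation from the pseudocode (assumption): never overwrite a shorter stored distance.
    new_distance = min(old_distance, d_min)
    return (new_distance, adjacency), new_distance != old_distance


def shortest_paths(graph, source):
    """Driver: rerun map + reduce until no distance changed (counter == 0)."""
    nodes = {n: (0 if n == source else math.inf, list(adj)) for n, adj in graph.items()}
    changed = 1
    while changed > 0:
        grouped = defaultdict(list)
        for node_id, node in nodes.items():
            for key, value in map_node(node_id, node):
                grouped[key].append(value)
        changed = 0
        nodes = {}
        for node_id, values in grouped.items():
            node, updated = reduce_node(node_id, values)
            nodes[node_id] = node
            changed += updated
    return {n: d for n, (d, _) in nodes.items()}
```

Each pass of the loop extends the frontier by one hop, so the number of iterations is roughly the largest hop count from the source plus one final pass in which nothing changes.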
7. Web enabled Visualization of Project
We have designed a website and uploaded all the results to it.
Website:

8. Lessons Learned
a. Learned the Hadoop infrastructure
b. Data analysis using Map Reduce
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationMapReduce for Data Warehouses
MapReduce for Data Warehouses Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions
More informationBuilding Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
More informationProfessional Hadoop Solutions
Brochure More information from http://www.researchandmarkets.com/reports/2542488/ Professional Hadoop Solutions Description: The go-to guidebook for deploying Big Data solutions with Hadoop Today's enterprise
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationThis exam contains 17 pages (including this cover page) and 21 questions. Check to see if any pages are missing.
Big Data Processing 2015-2016 Q2 January 29, 2016 Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer, continue
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationApigee Insights Increase marketing effectiveness and customer satisfaction with API-driven adaptive apps
White provides GRASP-powered big data predictive analytics that increases marketing effectiveness and customer satisfaction with API-driven adaptive apps that anticipate, learn, and adapt to deliver contextual,
More informationHadoop Project for IDEAL in CS5604
Hadoop Project for IDEAL in CS5604 by Jose Cadena Mengsu Chen Chengyuan Wen {jcadena,mschen,wechyu88@vt.edu Completed as part of the course CS5604: Information storage and retrieval offered by Dr. Edward
More informationKEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
More informationResearch Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze
Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce
More informationbrief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationProject 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm
Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm Goal. In this project you will use Hadoop to build a tool for processing sets of Twitter posts (i.e. tweets) and determining which people, tweets,
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationBig Data Analytics OverOnline Transactional Data Set
Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,
More informationLecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationReducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationContent-Based Discovery of Twitter Influencers
Content-Based Discovery of Twitter Influencers Chiara Francalanci, Irma Metra Department of Electronics, Information and Bioengineering Polytechnic of Milan, Italy irma.metra@mail.polimi.it chiara.francalanci@polimi.it
More informationCOMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
More informationReport Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to
More informationBIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview
BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop
More informationDistributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
More informationDistributed R for Big Data
Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce
More informationSearch Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
More informationTweets as big data. Rob Procter and Alex Voss. www.analysingsocialmedia.org. rob.procter@manchester.ac.uk alex.voss@st andrews.ac.
Tweets as big data Rob Procter and Alex Voss www.analysingsocialmedia.org rob.procter@manchester.ac.uk alex.voss@st andrews.ac.uk SRA, December 10th 2012 1 Overview The qualitative data deluge A small
More informationEnhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More information