Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses
|
|
- Philippa Conley
- 8 years ago
- Views:
Transcription
1 Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012
2 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and Hadoop Experimental Setup and Analyses Drive Mahout on Hadoop Interesting Communities Center for Web Intelligence, DePaul University, USA
3 Introduction About Me: a recommendation guy My Research: data mining and recommender systems Typical Experimental Research 1) Design or improve an algorithm; 2) Run algorithms and baseline algs on datasets; 3) Compare experimental results; 4) Try different parameters, find reasons and even re-design and improve algorithm itself; 5) Run algorithms and baseline algs on datasets; 6) Compare experimental results; 7) Try different parameters, find reasons and even re-design and improve algorithm itself; 8) And so on Until it approaches expected results.
4 Introduction Sometimes, data is large-scale. e.g. one algorithm may spend days to complete, how about experimental results are not as expected. Then improve algorithms and run it for days again, and again. How can we do previously? (for tasks not that complicated) 1). Paralleling but complicated synchronization and limited resources, such as CPU, memory, etc; 2). Take advantage of PC Labs, let s do it with 10 PCs Nearly all research will ultimately face the large-scale problems, especially in the domain of data mining. But, we have Map-Reduce NOW!
5 Introduction Do not need to distribute data and tasks manually. Instead we just simply generate configurations. Do not need to care about more details, e.g. how data is distributed, when one specific task will be ran on which machine, or how they conduct tasks one by one. Instead, we can pre-define working flow. We can take advantage of the functional contributions from mappers and reducers. More benefits: replication, balancing, robustness, etc
6 Recommender Systems Collaborative Filtering Slope One and Simple Weighted Slope One Slope One in Mahout Distributed Slope One in Mahout Mappers and Reducers Center for Web Intelligence, DePaul University, USA
7 Recommender Systems
8 Collaborative Filtering (CF) One of most popular recommendation algorithms. User-based: User-CF Item-based: Item-CF, Slope One Rating? 4 4 star User Example: User-based Collaborative Filtering
9 Slope One Recommender Reference: Daniel Lemire, Anna Maclachlan, Slope One Predictors for Online Rating-Based Collaborative Filtering, In SIAM Data Mining (SDM'05), April 21-23, User Batman Spiderman U1 3 4 U2 2 4 U3 2? 1). How different two movies were rated? U1 rated Spiderman higher by (4-3) = 1 U2 rated Spiderman higher by (4-2) = 2 On average, Spiderman is rated (1+2)/2 = 1.5 higher 2). Rating difference can tell predictions If we know U3 gave Batman a 2-star, probably he will rated Spiderman by (2+1.5) = 3.5 star
10 Simple Weighted Slope One Usually user rated multiple items User HarryPotter Batman Spiderman U U2? 2 4 U3 4 2? 1). How different the two movies were rated? Diff(Batman, Spiderman) = [(4-3)+(4-2)]/2 = 1.5 Diff(HarryPotter, Spiderman) = (4-5)/1 = -1 2 and 1 here we call them as count. 2). Weighted rating difference can tell predictions We use a simple weighted approach Refer to Batman only, rating = = 3.5 Refer to HarryPotter only, rating = 4-1 = 3 Consider them all, predicted rating = (3.5*2 + 3*1])/ (2+1) = 3.33
11 Simple Weighted Slope One User HarryPotter Batman Spiderman u u2? 2 4 u3 4 2? To calculate the prediction ratings, we need 2 matrices: 1).Difference Matrix Movie1 Question: Online or Offline? Movie2-1.5 Movie1 Movie2 Movie3 Movie4 Movie3 2 1 Movie ). Count Matrix Just number of users co-rated on two items
12 Slope One in Mahout Mahout, an open-source machine learning library. 1). Recommendation algorithms User-based CF, Item-based CF, Slope One, etc 2). Clustering KMeans, Fuzzy KMeans, etc 3). Classification Decision Trees, Naive Bayes, SVM, etc 4). Latent Factor Models LDA, SVD, Matrix Factorization, etc
13 Slope One in Mahout org.apache.mahout.cf.taste.impl.recommender.slopeone.slopeonerecommender Pre-Processing Stage: (class MemoryDiffStorage with Map) for every item i for every other item j for every user u expressing preference for both i and j add the difference in u s preference for i and j to an average Recommendation Stage: for every item i the user u expresses no preference for for every item j that user u expresses a preference for find the average preference difference between j and i add this diff to u s preference value for j add this to a running average return the top items, ranked by these averages Simple weighting: as introduced previously StdDev weighting: item-item rating diffs with lower sd should be weighted highly
14 Distributed Slope One in Mahout Similar to our previous practice, e.g. the matrix factorization Process, what we need is the Difference Matrix. Suppose there are M users rated N items, the matrix requires N(N-1)/2 cells. Also, the density is another aspect how user rated items. If there are several items and the rating matrix is dense, the computational costs will increase accordingly. Question again: Online or Offline? Depends on tasks & data. Large-scale data. Let s do it offline!
15 Distributed Slope One in Mahout package org.apache.mahout.cf.taste.hadoop.slopeone; class SlopeOneAverageDiffsJob class SlopeOnePrefsToDiffsReducer class SlopeOneDiffsToAveragesReducer package org.apache.mahout.cf.taste.hadoop; class ToItemPrefsMapper org.apache.hadoop.mapreduce.mapper Two Mapper-Reducer Stages: 1). Create DiffMatrix for each user 2). Collect AvgDiff info, counts, StdDev Let s see how it works
16 Mapper and Reducer - 1 User HarryPotter Batman Spiderman U U2? 2 4 U3 4 2? Mapper1 (ToItemPrefsMapper) <UserID, Pair<ItemID, Rating>> Reducer1 (PrefsToDiffsReducer) <Pair<Item1,Item2>, Diff> (for all three users) <U1> Potter Bat Spider Potter <U2> Potter Bat Spider Potter Bat -2 Bat NULL Spider -1 1 Spider NULL 2
17 Mapper and Reducer - 2 <U1> Potter Bat Spider <U2> Potter Bat Spider Potter Bat -2 Potter Bat NULL Spider -1 1 Spider NULL 2 Mapper2 (org.apache.hadoop.mapreduce.mapper) Reducer2 (DiffsToAveragesReducer) Average Diffs, Count, StedDev <Aggregate> Potter Bat Spider Potter Bat -2, 1 Spider -1, 1 1.5, 2 Simply, <a,b> pair denotes a=averge diff, b=count Notice: we should use three matrices in practice, here I used 2.
18 Predictions User HarryPotter Batman Spiderman U U2? 2 4 U3 4 2? <Aggregate> Potter Bat Spider Potter Bat -2, 1 Spider -1, 1 1.5, 2 Simply, <a,b> pair denotes a=averge diff, b=count Notice: we should use three matrices in practice, here I used 2. Prediction(U3, Spiderman) = [(4-1)*1 + (2+1.5)*2] / (1+2) =
19 Experiments Data Hadoop Setup Running Performances Center for Web Intelligence, DePaul University, USA
20 Experiment Setup Data: MovieLens-1M ratings # of users: 6,040 # of movies: 3,900 # of ratings: 1,000,209 Density of the ratings: each user has at least 20 ratings obviously, some users have many more ratings Rating format: UserID, ItemID, Rating (scale 1-5) Data Split: 80% training, 20% testing
21 Experiment Setup Hadoop Cluster Setup IBM SmartCloud 1 master node, 7 slave nodes Each node is as SUSE Linux Enterprise Server v11 SP1 Server Configuration: 64 bit (vcpu: 2, RAM: 4 GiB, Disk: 60 GiB) Hadoop v Mahout distribution-0.6 The environment setup follows the typical workflow as: Thanks Scott Young, neat writeup!!
22 Experimental Analyses Stage-1: SlopeOneAverageDiffsJob by Map-Reduce Goal: Build DiffStorage Output: DiffStorage txt file, 1.45GB Running Time: real 13m s user 0m 5.136s sys 0m 1.028s Item1 Item2 Diff Count StdDev Stage-2: Java evaluator to measure MAE on testing set Running Time: Load Testing Set (21K records), 299ms Load Training Set (79K records), 1,771ms Load DiffStorage, 176,352ms = 2.9m Prediction (21K records), 18,182ms = 0.3m MAE =
23 Experimental Experiences 1. Why not MovieLens 10M data? Map-Reduce on 10M data may cost several hrs; Running time depends on cluster and configuration; Also, DiffStorage file will be too large. 2. Java Evaluator Load full DiffStorage file is time-consuming. Also, incur Java heap space and GCOverlimit errors; Those errors can not be fixed by Xmx or other solutions; Two solutions: 1). Just use simple weighting, discard StdDev weighting. 2). Simple Mapper and Reducer, run it on clusters. For MovieLens 1M, it is not that efficient compared with the live SlopeOne recommendation; 10M data may be better, will try MovieLens-10M data later; Slope One is simple but memory-expensive.
24 More Drive Mahout on Hadoop Interesting Communities Center for Web Intelligence, DePaul University, USA
25 Mahout + Hadoop How to put more Mahout algorithms to Hadoop? 1. Pre-set Command in Mahout Let s see bin/mahout help, then it provides a list of available programs such as svd, fkmeans, etc. Some are basic functions, such as splitdataset Some can be executed as Hadoop tasks e.g. Run and evaluate Matrix Factorization on rating dataset bin/mahout parallelals --input inputsource --output outputsource --tempdir tmpfolder --numfeatures 20 --numiterations 10 bin/mahout evaluatefactorization --input inputsource --output outputsource --userfeatures als/out/u/ --itemfeatures als/out/m/ --tempdir tmpfolder
26 Mahout + Hadoop 2. More Algorithms on Hadoop Mahout provides a way to run more Mahount algorithms. Simply, $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core- <version>.jar <Job Class> --recommenderclassname Class <OPTIONS> Which kinds of Jobs it supports? Mahout implemented some versions. Some popular ones: 1).org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob --recommenderclassname ClassName 2).org.apache.mahout.cf.taste.hadoop.item.RecommenderJob 3).org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob 4).org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob
27 Interesting Communities Beyond Hadoop and Mahout official sites 1. Data Mining KDnuggets, Popular community for Data Mining & Analytics. Lots of useful information, such as news, materials, datasets, jobs, etc. 2. Big Data SmartData Collective, Smarter Computing, Big Data Meetup, 3. Recommender Systems ACM Official Site, RecSys Wiki,
28 Thank You! Center for Web Intelligence, DePaul University, USA
Development of a distributed recommender system using the Hadoop Framework
Development of a distributed recommender system using the Hadoop Framework Raja Chiky, Renata Ghisloti, Zakia Kazi Aoul LISITE-ISEP 28 rue Notre Dame Des Champs 75006 Paris firstname.lastname@isep.fr Abstract.
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More information! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II
! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
More informationHadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
More informationMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
More informationDistributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
More informationMammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
More informationAdvanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
More information! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering
E6893 Big Data Analytics: Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering Aonan Zhang Dept. of Electrical Engineering 1 October 9th, 2014 Mahout Brief Review The Apache
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationBUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business
BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business Instructor: Kunpeng Zhang (kzhang@rmsmith.umd.edu) Lecture-Discussions:
More informationCollaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains
More informationIntroduction to Hadoop on the SDSC Gordon Data Intensive Cluster"
Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster" Mahidhar Tatineni SDSC Summer Institute August 06, 2013 Overview "" Hadoop framework extensively used for scalable distributed processing
More informationData Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.
Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework
More informationArchitecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
More informationData Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
More informationHadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
More informationLarge-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationDragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers
Dragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers This section includes system requirements for DMENE Network configurations that utilize virtual
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationWhat s Cooking in KNIME
What s Cooking in KNIME Thomas Gabriel Copyright 2015 KNIME.com AG Agenda Querying NoSQL Databases Database Improvements & Big Data Copyright 2015 KNIME.com AG 2 Querying NoSQL Databases MongoDB & CouchDB
More informationGetting Started with Hadoop with Amazon s Elastic MapReduce
Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson
More informationIntroduction Predictive Analytics Tools: Weka
Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface
More informationLarge-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook
Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationBig-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationAn Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
More informationHiBench Installation. Sunil Raiyani, Jayam Modi
HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................
More informationPerformance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,
More informationUse of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
More informationKNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it
KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationMapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
More informationEvaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationRadoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
More informationOn the Performance of High Dimensional Data Clustering and Classification Algorithms
On the Performance of High Dimensional Data Clustering and Classification Algorithms Abstract Kathleen Ericson and Shrideep Pallickara Computer Science Department Colorado State University Fort Collins,
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationCS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
More informationCloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationUnderstanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationHadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
More informationBig Data Analytics Verizon Lab, Palo Alto
Spark Meetup Big Data Analytics Verizon Lab, Palo Alto July 28th, 2015 Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice.
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce
More informationHow To Create A Graphlab Framework For Parallel Machine Learning
This work is supported by NSF grants CCF-1149252, CCF-1337215, and STARnet, a Semiconductor Research Corporation Program, sponsored by MARCO and DARPA. Datacenter Simulation Methodologies: GraphLab Tamara
More informationAn analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationPREA: Personalized Recommendation Algorithms Toolkit
Journal of Machine Learning Research 13 (2012) 2699-2703 Submitted 7/11; Revised 4/12; Published 9/12 PREA: Personalized Recommendation Algorithms Toolkit Joonseok Lee Mingxuan Sun Guy Lebanon College
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationSQL Server 2005 Features Comparison
Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions
More informationBig Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades
More informationA bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
More informationRECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients
More informationLavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs
1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be
More informationISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
More informationIs a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationPolicy-based Pre-Processing in Hadoop
Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides
More informationAnalysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationPerformance Analysis of Mixed Distributed Filesystem Workloads
Performance Analysis of Mixed Distributed Filesystem Workloads Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent, Scott Brandt Motivation Hadoop-tailored filesystems (e.g. CloudStore)
More informationR at the front end and
Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation
More informationA Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
More informationHadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
More informationHADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More informationBENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationHow To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationProject 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm
Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm Goal. In this project you will use Hadoop to build a tool for processing sets of Twitter posts (i.e. tweets) and determining which people, tweets,
More informationMapReduce Evaluator: User Guide
University of A Coruña Computer Architecture Group MapReduce Evaluator: User Guide Authors: Jorge Veiga, Roberto R. Expósito, Guillermo L. Taboada and Juan Touriño December 9, 2014 Contents 1 Overview
More informationFinal Project Proposal. CSCI.6500 Distributed Computing over the Internet
Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationBenchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside
Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationOPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationIT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
More informationImage Search by MapReduce
Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s
More informationLabStats 5 System Requirements
LabStats Tel: 877-299-6241 255 B St, Suite 201 Fax: 208-473-2989 Idaho Falls, ID 83402 LabStats 5 System Requirements Server Component Virtual Servers: There is a limit to the resources available to virtual
More informationNTT DOCOMO Technical Journal. Large-Scale Data Processing Infrastructure for Mobile Spatial Statistics
Large-scale Distributed Data Processing Big Data Service Platform Mobile Spatial Statistics Supporting Development of Society and Industry Population Estimation Using Mobile Network Statistical Data and
More information