# Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

## Transcription

2 Outline

1. Retrospection
2. Stratosphere Plans
3. Comparison with Hadoop
4. Evaluation
5. Outlook

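3 Retrospection

Matrix factorization: find optimal factors W and H such that W x H approximates the rating matrix V, whose unknown entries are marked with "?". The factors are optimized with Stochastic Gradient Descent.

[Figure: example rating matrix V with known and unknown ("?") entries, next to W, H, and W x H]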
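The factorization idea can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation: the toy ratings, the rank, the step size, the squared-error loss, and the function names `sgd_factorize` and `rmse` are all assumptions made for illustration.

```python
import math
import random

def sgd_factorize(ratings, n_rows, n_cols, rank=2, step=0.01, epochs=2000, seed=0):
    """Fit W (n_rows x rank) and H (rank x n_cols) to the known cells of V."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(rank)]
    cells = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(cells)  # visit the known cells in random order
        for (i, j), v in cells:
            pred = sum(W[i][k] * H[k][j] for k in range(rank))
            err = v - pred
            for k in range(rank):
                w, h = W[i][k], H[k][j]
                # gradient step on the squared error (v - w.h)^2
                W[i][k] = w + step * 2 * err * h
                H[k][j] = h + step * 2 * err * w
    return W, H

def rmse(ratings, W, H):
    """Root mean squared error over the known cells only."""
    rank = len(H)
    total = 0.0
    for (i, j), v in ratings.items():
        pred = sum(W[i][k] * H[k][j] for k in range(rank))
        total += (v - pred) ** 2
    return math.sqrt(total / len(ratings))

# Hypothetical 3 x 3 rating matrix; missing keys are the unknown "?" cells.
V = {(0, 0): 5, (0, 1): 4, (1, 1): 5, (1, 2): 2, (2, 0): 2, (2, 2): 4}
W0, H0 = sgd_factorize(V, 3, 3, epochs=0)  # random starting point
W, H = sgd_factorize(V, 3, 3)              # after training
print(rmse(V, W0, H0), rmse(V, W, H))
```

Only the known cells contribute to the updates, so the product of the fitted factors also yields predictions for the unknown cells.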

4 Retrospection

- Result of SGD step
- Hadoop jobs with judging loss

5 Stratosphere Plans: Multiple Iterations

With MapReduce jobs:

- for each starting point:
  - while loss too high:
    - MapReduce: optimize factors
    - MapReduce: join triples
    - MapReduce: calculate loss (training)
    - calculate loss (judging)

With Stratosphere plans:

- for each starting point:
  - while stop file not empty:
    - Map, Map, Reduce: optimize factors
    - Cross, Match: join triples
    - Map, Reduce, Cross: calculate loss (training) and decide
    - Map, Match, Reduce: calculate loss (judging)

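6 Stratosphere Plans: Subplans in One Iteration

[Diagram: data flow between the four subplans of one iteration]

- Data: triples (e.g. 3,7,b,w,h), training data (e.g. 3,7,b,-,-; keys: row, column), judging data (e.g. Saw,Tom,4), losses (e.g. 0.9; 1.0; 2.4)
- Subplans: optimize factors, join triples, calc loss (training), calc loss (judging)
- Results: stop? (file not empty), judging loss (e.g. 0.96)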

7 Stratosphere Plans: Optimize Factors

- MapReduce job: Triples, Map: SGD, Reduce: average, Factors
- Stratosphere plan (not materialized):
  - Triples, e.g. 3,7,b,w,h and 5,7,b,w,h
  - Map: SGD emits 3,-,-,w,- and -,7,-,-,h for the first triple, 5,-,-,w,- and -,7,-,-,h for the second
  - Map: filter w, then Reduce: average gives Factor W (3,-,-,w,- and 5,-,-,w,-)
  - Map: filter h, then Reduce: average gives Factor H (-,7,-,-,h)
  - Pacts are configured to read fields
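The Map: SGD and Reduce: average steps can be sketched as follows. This is an illustration under assumptions, not the plan's actual code: each triple is simplified to a single cell `(row, col, b, w, h)` with factor vectors `w` and `h`, the update rule is a plain squared-error gradient step, and the names `map_sgd` and `reduce_average` are hypothetical.

```python
from collections import defaultdict

def map_sgd(triple, step=0.1):
    """One SGD step on a triple; emits the updated w and h under separate keys."""
    row, col, b, w, h = triple
    err = b - sum(wk * hk for wk, hk in zip(w, h))
    new_w = tuple(wk + step * 2 * err * hk for wk, hk in zip(w, h))
    new_h = tuple(hk + step * 2 * err * wk for wk, hk in zip(w, h))
    yield ("W", row), new_w   # corresponds to a record like 3,-,-,w,-
    yield ("H", col), new_h   # corresponds to a record like -,7,-,-,h

def reduce_average(records):
    """Average all factor vectors that share a key."""
    groups = defaultdict(list)
    for key, vec in records:
        groups[key].append(vec)
    return {
        key: tuple(sum(component) / len(vecs) for component in zip(*vecs))
        for key, vecs in groups.items()
    }

# Two triples that share column 7, as in the slide's example rows 3 and 5.
triples = [(3, 7, 4.0, (1.0,), (1.0,)), (5, 7, 2.0, (1.0,), (1.0,))]
emitted = [record for t in triples for record in map_sgd(t)]
factors = reduce_average(emitted)
print(factors)
```

Row 3 and row 5 each receive one update, while column 7 receives two updates that the reducer averages.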

8 Stratosphere Plans: Join Triples (1/2)

- MapReduce job: Factors and Training data, Map: replicate factors, keep training data (needs matrix dimensions), Reduce: group, Triples
- Stratosphere plan:
  - Cross of Factor W (3,-,-,w,-) and Factor H (-,7,-,-,h) gives 3,7,-,w,h
  - Match (row,col) with Training data (3,7,b,-,-) gives Triples (3,7,b,w,h)

9 Stratosphere Plans: Join Triples (2/2)

- Variant 1, Cross and Match (compiler hints helpful?):
  - Cross of Factor W (3,-,-,w,-) and Factor H (-,7,-,-,h) gives 3,7,-,w,h
  - Match (row,col) with Training data (3,7,b,-,-) gives Triples (3,7,b,w,h)
- Variant 2, Match and Match:
  - Match (row) of Factor W (3,-,-,w,-) and Training data (3,7,b,-,-) gives 3,7,b,w,-
  - Match (col) with Factor H (-,7,-,-,h) gives Triples (3,7,b,w,h)
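The two join variants can be compared in a small in-memory sketch. The `cross` and `match` functions below only imitate the semantics of the Cross and Match operators on Python lists; they are not Stratosphere APIs, and the record layout `(row, col, b, w, h)` with `None` for "-" is an assumption.

```python
def fill(left, right):
    """Merge two records, taking each field from whichever side has it."""
    return tuple(a if a is not None else b for a, b in zip(left, right))

def cross(left, right):
    """Cartesian product of two inputs, merged field by field."""
    return [fill(l, r) for l in left for r in right]

def match(left, right, key):
    """Equi-join of two inputs on key(record), merged field by field."""
    index = {}
    for r in right:
        index.setdefault(key(r), []).append(r)
    return [fill(l, r) for l in left for r in index.get(key(l), [])]

# Hypothetical inputs with scalar factors for brevity.
factor_w = [(3, None, None, 0.5, None), (5, None, None, 0.7, None)]
factor_h = [(None, 7, None, None, 0.2)]
training = [(3, 7, 4.0, None, None), (5, 7, 2.0, None, None)]

# Variant 1: Cross the factors, then Match on (row, col).
crossed = cross(factor_w, factor_h)
cross_match = match(training, crossed, key=lambda r: (r[0], r[1]))

# Variant 2: Match on row, then Match on col.
with_w = match(training, factor_w, key=lambda r: r[0])
match_match = match(with_w, factor_h, key=lambda r: r[1])

print(sorted(cross_match) == sorted(match_match))
```

Both variants produce the same triples here. The Cross variant first builds every combination of row and column factors, while the Match-Match variant only touches cells that exist in the training data.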

10 Stratosphere Plans: Calculate Loss (Training)

- MapReduce job: Triples, Map: local loss, Reduce: RMSE (e.g. 0.9); the driver class knows the loss history and decides on stopping
- Stratosphere plan:
  - Triples, Map: local loss (record: dummy, loss, #points), Reduce: RMSE
  - Cross: loss history, combining the new loss with Losses (epoch e) into Losses (epoch e+1)
  - OutputFormat: decide stop (stop?)
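A minimal sketch of the training-loss subplan follows, assuming single-cell triples with scalar factors and a squared-error local loss. The stopping rule (stop when the loss improves by less than a tolerance) and the newest-first history order are assumptions; the slides do not state the actual criterion. The names `map_local_loss`, `reduce_rmse`, and `decide_stop` are hypothetical.

```python
import math

DUMMY = 0  # single key so that one reducer sees every record

def map_local_loss(triple):
    """Emit (dummy key, squared error, number of points) for one triple."""
    row, col, b, w, h = triple
    return (DUMMY, (b - w * h) ** 2, 1)

def reduce_rmse(records):
    """Global aggregation over all records that share the dummy key."""
    total = sum(loss for _, loss, _ in records)
    points = sum(n for _, _, n in records)
    return math.sqrt(total / points)

def decide_stop(loss_history, new_loss, tolerance=1e-3):
    """Prepend the new loss and stop once the improvement is below tolerance."""
    stop = bool(loss_history) and loss_history[0] - new_loss < tolerance
    return [new_loss] + loss_history, stop

triples = [(3, 7, 4.0, 2.0, 1.5), (5, 7, 2.0, 1.0, 1.0)]
loss = reduce_rmse([map_local_loss(t) for t in triples])
history, stop = decide_stop([1.0, 2.4], 0.9)
print(loss, history, stop)
```

Carrying the point count next to the summed loss lets the reducer compute the mean without a second pass over the data.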

11 Stratosphere Plans: Calculate Loss (Judging)

- Hadoop: Factors and Netflix judging files; no MapReduce job, the driver class receives the loss directly (e.g. 0.96)
- Stratosphere plan (similar to training loss, Map included in Match):
  - Triples (3,7,b,w,h), Map: emit cells (e.g. Saw,Tom,4.8)
  - Match (mid, uid) with Judging data (e.g. Saw,Tom,4): local loss (e.g. dummy,0.8²,1)
  - Reduce: RMSE (e.g. 0.96)
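The judging-loss subplan can be sketched in the same style. The layout of a triple is an assumption here: it is modelled as a block whose `w` maps movie ids (mid) to factor vectors and whose `h` maps user ids (uid) to factor vectors. The function names are hypothetical, and the example reuses the slide's Saw/Tom cell with made-up factor values that predict 4.8.

```python
import math

def map_emit_cells(triple):
    """Emit one predicted rating per (mid, uid) cell of a block triple."""
    _block_row, _block_col, _block, w, h = triple
    for mid, w_vec in w.items():
        for uid, h_vec in h.items():
            yield (mid, uid), sum(a * b for a, b in zip(w_vec, h_vec))

def match_local_loss(predicted, judging):
    """Join predictions with judging data on (mid, uid); emit local losses."""
    lookup = dict(predicted)
    for mid, uid, rating in judging:
        if (mid, uid) in lookup:
            yield ("dummy", (rating - lookup[(mid, uid)]) ** 2, 1)

def reduce_rmse(records):
    """Aggregate all local losses under the dummy key into one RMSE."""
    records = list(records)
    total = sum(loss for _, loss, _ in records)
    points = sum(n for _, _, n in records)
    return math.sqrt(total / points)

triple = (3, 7, None, {"Saw": (2.4,)}, {"Tom": (2.0,)})
judging = [("Saw", "Tom", 4)]
judging_loss = reduce_rmse(match_local_loss(map_emit_cells(triple), judging))
print(judging_loss)  # close to 0.8 for this single cell
```

With a single judged cell the RMSE equals the absolute error of that cell; the 0.96 on the slide is the loss over the whole judging set.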

12 Comparison: Jobs vs. Plans

- Equal results without random...
  - Files
  - Starting points
  - Training sequence
- Factors not materialized
  - Separate factor files possible
  - Create new file for each iteration
- No efficient serialization between plans (yet)
  - Either parsing text file
  - Or use sequence file single-threaded

13 Comparison: Data Schema

- Hadoop jobs:
  - 3#7 / TripleStorage (Storage class is tagged union)
  - Dummy key / LossStorage
  - Getter, setter, toString()
- Stratosphere plans:
  - 3,7,b,w,h
  - Dummy key, loss, #points
  - Remember key places
  - Reuse pacts with configurations for different keys
  - Composite keys possible

14 Comparison: Preprocessing

- Requires: line format according to parameters, copy to HDFS
- Serialize factors and blocks
- Use Map file to write serialized values to HDFS
- Java process to define lines
- Shell script to move to HDFS
- Extra pacts to parse lines

15 Stratosphere Preprocessing: Define Line Format

[Diagram: preprocessing data flow]

- Files: Netflix files, factorw.txt, factorh.txt, blocks.txt, Netflix judging files
- Steps: Reduce: group cells to blocks; create triples; calc loss (training); SGD step plan

16 Suggestions

- Global aggregation of loss with dummy key:
  - Reducer with no key
  - Reducer with compiler hint nrofkeys = 1
  - No sorting needed
- Provide toString() in PactRecord
- Provide getter, setter in PactRecord to encapsulate field numbers and class types
- Keep configuration options (e.g. reduce: average W or H)
- Sequence files for sinks and sources
- Move log files to master node

17 Evaluation: Parameters for Netflix Data

- starting points: 1
- max. iterations: 1
- degree of parallelism: 1, 2, 5, 10
- step size:
- factor size: 5
- block size: 1000 movies x 1000 users

| data size | 1/8 | 1/4 | 1/2 | 1 |
|---|---|---|---|---|
| users | 63K | 125K | 250K | 500K |
| movies | ~2K | ~4K | 9K | 18K |

Loop executed per run:

- for each starting point:
  - while stop file not empty:
    - Map, Map, Reduce: optimize factors
    - Cross, Match: join triples
    - Map, Reduce, Cross: calculate loss (training) and decide
    - Map, Match, Reduce: calculate loss (judging)

18 Evaluation: Run Time for Variable Data Size

[Figures: run time (0 h 00 min to 2 h 52 min) over data size (1/8, 1/4, 1/2, 1) for block sizes 50 x 50 and 1000 x 1000, split into Init Blocks (reduce cells), Init Join (cross, match), and SGD Step]

- grows by factor 4
- 10 nodes vs. DoP = 8
- Usually iterations
- No sequence file between plans

19 Evaluation: Run Time for Variable Degree of Parallelism

[Figure: run time (0 min to 20 min) per subplan for different degrees of parallelism; data size: 1/8]

- Subplans: optimize factors (Map, Map, Reduce); join triples (Cross, Match) and write triples; calc loss (training) (Map, Reduce, Cross); calc loss (judging) (Map, Match, Reduce)
- Reading and writing always DoP = 1
- Faster with higher DoP
- DoP: performs best, but data is small

20 Outlook

- Join triples with Cross-Match vs. Match-Match
- Degree of parallelism: 10
- Inspect judging outcome: RMSE should be equal
- Evaluate DoP with bigger data

21 Summary

22 References

- Anand Rajaraman and Jeff Ullman. Mining of Massive Datasets. Cambridge University Press.
- Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. IBM Research Report RJ10481, March.

[1] November 2011
