# Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Save this PDF as:

Size: px
Start display at page:

Download "Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014"

## Transcription

1 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014

2 Agenda Overview of topic Analysis of research papers Software design

3 Overview of topic

4 Computational Problem Gradient descent can take a long time to converge with many training examples, and when it does it may not have found the global optimum Random initializations of ϴ Global minimum Local minimum Figure from Vasilis Vryniotis. Accessed Sept 14, 2014.

5 Linear/logistic regression cost function: Algorithm Sequential Algorithm I m min J θ = 1 θ 2m (h θ(x i ) Randomly initialize θ i=1 y i ) 2 Repeat until convergence (for all j=0,1,,n): m θ j = θ j α 1 m (h θ(x i ) i=1 y i )x j i What if m equals 500 million training examples! This term is d dθ j J(θ)

6 Sequential Algorithm II Predicted output is then h θ (x i ) with trained parameters θ Linear Regression: h θ x i Logistic Regression z = n = θ i x i j=1 n j=1 θ i x i h θ x i = 1 (1 + e z )

7 Paper analysis

8 Research Paper 1 Identifying Suspicious URLs: An Application of Large-Scale Online Learning

9 Data collection Created a labeled dataset of malicious Web sites with lexical and host-based features Binary BOW of URL tokens, WHOIS info, and IP data Predict whether site is malicious from these features only, excluding Web content Faster and safer not to touch malicious content Used benign URLs from Yahoo directory for the non-malicious class in the dataset This data is also used in two of the following papers

10 BOW grows fast Instances: Attributes: Example Instance (anonymized data): +1 4: : : :1 54:1 56:1 62:1 64:1 66:1 68:1 70:1 72:1 933:1 4346: : : : : : : : : : : : : : : : : : : : : : :1

11 Results Used stochastic gradient descent to continually adapt to each new URL found 98.4% classification rate Found size and velocity of dataset too large for computational complexity of batch processing URL features changed too fast for the training time required Model became stale if not retrained soon enough A later paper will address this shortfall of batch training

12 Research Paper 2 Evaluating parallel logistic regression models

13 Problem Provide design guidelines to choose most effective combination of parameters for logistic regression on large data sets Compare and evaluate different distributed platforms, parallel algorithms, and sublinear approximation techniques.

14 Approaches Platforms: Hadoop Spark Algorithms: Sequential optimization with stochastic gradient descent using LIBLINEAR (machine learning library) Sublinear algorithms: use sampling of the dataset

15 Results I LIBLINEAR was the most precise and also the fastest Ran on single machine No inter-machine communication overhead and uses memory fully Scalable, but limited by memory size LIBLINEAR was best if data set fits in memory, but couldn t fit entire URL dataset Spark performs better than Hadoop

16 Results II Hadoop vs SPARK Most machine learning algos iterate over the dataset Hadoop works off of disk while SPARK works off memory using RDDs Hadoop has non-negligible overhead for setting up tasks and passing parameters, especially for small datasets To get general sense of running time, we should consider both volume and the sparsity of data Observed drop in speedup with increased number of nodes in the cluster due to communication overhead (poor strong scaling)

17 Uses LIBLINEAR was the best logistic regression technique Our dataset, however, is too big to fit in memory. Has about 3 million features, 2 million instances, and is very sparse Authors recommend parallel gradient descent for this situation, which is exactly what we will do using the tardis cluster Measure speedups with different number of workers

18 Research Paper 3 Efficient Mini-batch Training for Stochastic Optimization

19 Problem Stochastic gradient descent for large-scale optimization problems has significant communication overhead To reduce the communication cost in parallelized SGD, mini-batch training needs to be employed But, when mini-batch size increases then rate of convergence decreases (the problem in Paper 1) Uses the same URL data here from Paper 1 This paper discusses a technique that regularizes the objective function within each mini-batch iteration to prevent the convergence rate from decreasing with increased mini-batch sizes

20 Batching reduces synchronization cost The fraction of synchronization cost as a function of minibatch size when communicating between 12 nodes Note cost for every algorithm decreases with batch size

21 Modified minimization equation Note: w = θ General logistic regression Batch Penalty w has larger changes in beginning and smaller at the end Prevents divergence from previous consensus found Limits dramatic changes of parameter w over batch iterations

22 Results Objective value versus mini-batch size after 10^7 examples are processed Note SGD diverges but efficient minibatch training does not

23 Use Use the technique described for efficient minibatch training without subsequent derease in the convergence rate

24 Research Paper 4 Parallel Large Scale Feature Selection for Logistic Regression

25 Problem Need to try all 2 N feature combinations to find the best model In sparse, high-dimensional data this is infeasible Logistic regression works well in highdimensional data since it is likely linearly separable (this is the premise of kernels)

26 Theory Approximate the best feature to add to an existing model by building a new model in parallel for each feature NOT already in the model Dth iteration, need to train N D models Add the best feature found to the model and rerun full logistic regression, to make sure approximation error does not propagate Have to run full regression N times, or stop short

27 Note: Authors used maximization of loglikelihood but this is easily mapped in minimization of squared error Newton s method of optimization used instead of stochastic gradient descent Parallel Algorithm

28 Results Accurately found the true optimal feature to add on many iterations of feature selection Accuracy decreased as the number of irrelevant features in the dataset increased Larger search space to find the true optimal Accuracy decreased as the number of base features increased Each added feature has smaller marginal effect, making optimal one harder to find Feature rankings stable after looking at only 28% of the data

29 Use To improve our model, can use feature selection to grab only those relevant features of the 3 million inside the URL data Good performance and comparable to results found in re-learning the whole model

30 Software design

31 JOB Training Phase Task Testing Phase Task LRWorker_1 LRWorker_2 put put θ θ take take put put θ Err θ Err Reduction Task Find θ with smallest test error... -Calculate θ (parallelfor) on one node for a random starting initialization across nodes... -Calculate test set error for each θ in Tuple Space Images from BIG CPU, BIG DATA: Solving the World's Toughest Computational Problems with Parallel Computing. Alan Kaminsky, 2014.

32 References Sameer Singh, Jeremy Kubica, Scott Larsen and Daria Sorokina Parallel Large Scale Feature Selection for Logistic Regression. In Proceedings of 2009 Society for Industrial and Applied Mathematics (SIAM) Data Mining, SIAM, Philadelphia, PA, USA, URL: Haoruo Peng, Ding Liang, and C. Choi. Evaluating parallel logistic regression models. In Proceedings of the 2013 IEEE International Conference on Big Data, pp , 6-9 Oct URL: &isnumber= Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14). ACM, New York, NY, USA, URL: M&CFID= &CFTOKEN= Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, URL:

33 Further References Michele Banko and Eric Brill Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (ACL '01). Association for Computational Linguistics, Stroudsburg, PA, USA, URL: Mohammed Javeed Zaki Parallel and Distributed Data Mining: An Introduction. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, Mohammed Javeed Zaki and Ching-Tien Ho (Eds.). Springer-Verlag, London, UK, UK, URL: Andrew Ng. Machine Learning course materials, Coursera. Accessed September 10, 2014.

34 Questions

### Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

### Scalable Machine Learning - or what to do with all that Big Data infrastructure

- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Steven C.H. Hoi School of Computer Engineering Nanyang Technological University Singapore Acknowledgments: Peilin Zhao, Jialei Wang, Hao Xia, Jing Lu, Rong Jin, Pengcheng Wu, Dayong Wang, etc. 2 Agenda

### Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### Increasing Marketing ROI with Optimized Prediction

Increasing Marketing ROI with Optimized Prediction Yottamine s Unique and Powerful Solution Smart marketers are using predictive analytics to make the best offer to the best customer for the least cost.

### COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Journée Thématique Big Data 13/03/2015

Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

### A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

### Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of

### Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

### Response prediction using collaborative filtering with hierarchies and side-information

Response prediction using collaborative filtering with hierarchies and side-information Aditya Krishna Menon 1 Krishna-Prasad Chitrapura 2 Sachin Garg 2 Deepak Agarwal 3 Nagaraj Kota 2 1 UC San Diego 2

### Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

### Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Machine learning challenges for big data

Machine learning challenges for big data Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt - December 2012

### Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

### Big-data Analytics: Challenges and Opportunities

Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)

### Large Scale Learning to Rank

Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

### Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual

### Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

### Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

### Spark and the Big Data Library

Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and

### Hybrid model rating prediction with Linked Open Data for Recommender Systems

Hybrid model rating prediction with Linked Open Data for Recommender Systems Andrés Moreno 12 Christian Ariza-Porras 1, Paula Lago 1, Claudia Jiménez-Guarín 1, Harold Castro 1, and Michel Riveill 2 1 School

### Federated Optimization: Distributed Optimization Beyond the Datacenter

Federated Optimization: Distributed Optimization Beyond the Datacenter Jakub Konečný School of Mathematics University of Edinburgh J.Konecny@sms.ed.ac.uk H. Brendan McMahan Google, Inc. Seattle, WA 98103

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Prediction of Stock Performance Using Analytical Techniques

136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1

### CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus

### Machine Learning over Big Data

Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Mining Interesting Medical Knowledge from Big Data

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### REVIEW OF ENSEMBLE CLASSIFICATION

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

### Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### Big Data Systems CS 5965/6965 FALL 2015

Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### MLlib: Scalable Machine Learning on Spark

MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph

### Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

### Using One-Versus-All classification ensembles to support modeling decisions in data stream mining

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa Patricia.Lutu@up.ac.za

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Parallel Scalable Algorithms- Performance Parameters

www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

### Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

### The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

### Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

### Brave New World: Hadoop vs. Spark

Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,

### Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,

### Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

### REVIEW ON QUERY CLUSTERING ALGORITHMS FOR SEARCH ENGINE OPTIMIZATION

Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

### PeerMon: A Peer-to-Peer Network Monitoring System

PeerMon: A Peer-to-Peer Network Monitoring System Tia Newhall, Janis Libeks, Ross Greenwood, Jeff Knerr Computer Science Department Swarthmore College Swarthmore, PA USA newhall@cs.swarthmore.edu Target:

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

### Data Mining with Hadoop at TACC

Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical

### CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

### Large Linear Classification When Data Cannot Fit In Memory

Large Linear Classification When Data Cannot Fit In Memory ABSTRACT Hsiang-Fu Yu Dept. of Computer Science National Taiwan University Taipei 106, Taiwan b93107@csie.ntu.edu.tw Kai-Wei Chang Dept. of Computer

### Feature Engineering in Machine Learning

Research Fellow Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia August 21, 2015 Outline A Machine Learning Primer Machine Learning and Data Science Bias-Variance Phenomenon

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

### BIG DATA IN SCIENCE & EDUCATION

BIG DATA IN SCIENCE & EDUCATION SURFsara Data & Computing Infrastructure Event, 12 March 2014 Djoerd Hiemstra http://www.cs.utwente.nl/~hiemstra WHY BIG DATA? 2 Source: Jimmy Lin & http://en.wikipedia.org/wiki/mount_everest

### Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Advice for applying Machine Learning

Advice for applying Machine Learning Andrew Ng Stanford University Today s Lecture Advice on how getting learning algorithms to different applications. Most of today s material is not very mathematical.

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### STOCK OPTION PRICE PREDICTION

STOCK OPTION PRICE PREDICTION ABRAHAM ADAM 1. Introduction The main motivation for this project is to develop a better stock options price prediction system, that investors as well as speculators can use

### Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

### How I won the Chess Ratings: Elo vs the rest of the world Competition

How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:

### Karthik Sridharan. 424 Gates Hall Ithaca, E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Contact Information

Karthik Sridharan Contact Information 424 Gates Hall Ithaca, NY 14853-7501 USA E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Research Interests Machine Learning, Statistical Learning

### Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu

Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

### Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

### Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

### MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan

### SEIZE THE DATA. 2015 SEIZE THE DATA. 2015

1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 10-13 Predicting and reducing deforestation

### Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

### Advances in Natural and Applied Sciences

AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

### Jubatus: An Open Source Platform for Distributed Online Machine Learning

Jubatus: An Open Source Platform for Distributed Online Machine Learning Shohei Hido Seiya Tokui Preferred Infrastructure Inc. Tokyo, Japan {hido, tokui}@preferred.jp Satoshi Oda NTT Software Innovation

### Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions

### Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce

Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 erikreed@cmu.edu