Selected Big Data Analytics Methods in Health Research


1 Selected Big Data Analytics Methods in Health Research. Research Field Key Technologies, Jülich Supercomputing Centre, Supercomputing & Big Data. Dr.-Ing. Morris Riedel et al., Adjunct Associate Professor, University of Iceland; Head of Research Group High Productivity Data Processing, Jülich Supercomputing Centre, Germany. Peking, 10th-12th September 2014.

2 Outline

3 Outline
Research Group High Productivity Data Processing: From Big Data Analytics to Smart Data Analytics; Research Data Alliance Activities
LBSN Analytics [Changes 2013 Seed Project Follow-Up NCSA/JSC]: Scientific Case & Visualization; Clustering using Parallel DBSCAN
Big Brain Data Analytics [Collaboration INM/JSC]: Scientific Case; (Supervised) Learning from Data 101; Classification of Brain Big Data
Selected Lessons Learned
References
Note: the Changes 2013 activities include selected contents of the (dropped) talk from Shaowen Wang et al. (Tue).

4 Demand for Generic Data Methods in Science. Advancing user-centered data mining methods & tools (e.g. outlier detection, support vector machines). Algorithms and Data Structures: enhancing/parallelizing generic techniques and algorithmic cores (e.g. indexing, sorting). Visualization: scaling visualization techniques for high-performance graphics and data visualization. Security: efficient security and privacy solutions applied to real-world big data use cases.

5 Research Group High Productivity Data Processing

6 Research Group High Productivity Data Processing. [Diagram labels: Data Science (Data Mining, Applied Statistics, Machine Learning) and Scientific Computing as different approaches. Scientific Applications using Big Data. Traditional Scientific Computing Methods: HPC and HTC Paradigms & Parallelization. Smart Data Analytics Approaches & Techniques: Optimized Data Access & Management; Statistical Data Mining & Machine Learning. Big Data Methods, (Parallel) Technologies (e.g. MLlib), Scientific Applications, Research Data.]

7 Juelich Supercomputing Centre Context. Support data-intensive science and engineering applications. Explore computing that is more intertwined with data analysis. Big data topics: Data Management, Security & Access; Clustering; Classification; Regression.

8 Research Data Alliance Activities. Steering the Big Data Analytics Interest Group. Members: ~60; founded ~03/2013 at the 1st RDA Plenary (Gothenburg). Telcons: 1-2x per month; Secretary: M. Goetz (JSC). Co-chairs: M. Riedel (JSC), K. Kwo-Sen (NASA), P. Baumann (UoBremen). Systematic way: Cross Industry Standard Process for Data Mining (CRISP-DM). Big Data Analytics Group Process: concrete datasets (& source/sensor); algorithms & methods; technologies & resources; scientific data applications. Best practices: community-based practice & recommendations. Reference data analytics for reusability & learning: CRISP-DM report, openly shared datasets, running analytics code. [4] Research Data Alliance. Join us in Amsterdam: 4th RDA Plenary, September 2014, with two analytics sessions!

9 Research Data Alliance Application Example. Big Data Analytics Group Process, following the CRISP-DM Guide [3]. Dataset: satellite data (Quickbird). Method: parallel Support Vector Machines (SVM). Technologies: HPC/MPI, Map-Reduce & GPGPUs (JUDGE, JSC). Application: classification study of land cover types [2]. Best practices: community-based practice (Classification++). Reference data analytics for reusability & learning: CRISP-DM report, openly shared datasets (EUDAT B2SHARE [1]), running analytics code. Also shown: Parallel Brain Data Analytics. [1] EUDAT B2SHARE. [2] G. Cavallaro and M. Riedel, Smart Data Analytics Methods for Remote Sensing Applications, IGARSS 2014. [3] P. Chapman et al., CRISP-DM Guide.

10 Understanding Industry vs [...]. H1N1 virus made headlines: a Nature paper from Google employees [5] explains how Google is able to predict winter flus, not only on a national scale but down to regions, which is possible via logged big data search queries. Big data is not always better data: think causality vs. correlation. 2014, The Parable of Google Flu [6]: large errors in flu prediction & lessons learned: (1) dataset: transparency & replicability impossible; (2) study the algorithm, since it keeps changing; (3) it's not just about the size of the data. [5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009. [6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science, vol. 343, 2014.

11 LBSN Analytics

12 Scientific Case & Realization (Clustering). Towards interactive visual analytics using parallel methods. Goal: answer health-related questions from publicly available data, e.g. estimated emissions per region correlated with measurement stations, or estimated pollution/emissions breathed per person per region. Visualization pipeline design: open data source OpenStreetMap (OSM) for maps/streets. Approach: towards interactive data exploration; click-free visualization, i.e. no GUI applications; support for typical overlays (density maps, polygon creation); all of the above is possible with browser programming APIs (e.g. OpenLayers). Pipeline components: parallel MPI/OpenMP trajectories (statistics) using the OSRM library; parallel MPI/OpenMP clustering (DBSCAN). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

13 Approach: LBSN Trajectory Mining. Scientific domain area: Smart Cities approaches combined with health analytics research. Location-based Social Networks (LBSN) data from open data sources: Twitter (tweets) & Foursquare (check-ins). Data collection and storage: NCSA. Initial computation: JSC. Scientific outcome: traffic density estimation; network emission model. Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

14 Example: Cluster Visualization (Clustering). London clusters on 6/1/2014, 1 h time slice beginning at 18:00 UTC, computed using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).
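The clustering method named on this slide can be illustrated with a minimal serial sketch. This is not the parallel implementation used for the London data; it is a plain textbook-style DBSCAN, and the function names, the coordinates in the usage note, and the parameter values `eps` and `min_pts` are assumptions chosen for illustration only.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1           # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a core point: border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point too: keep expanding
    return labels
```

With the made-up points `[(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]`, `eps=1.5` and `min_pts=3`, the sketch returns two clusters and labels the isolated last point as noise. The quadratic neighbour search is the part that a scalable implementation would replace with spatial indexing and parallelization.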

15 Data Preparation & Visualization Implementation (Clustering). The source code is available in our Bitbucket repository on invite. Changes 2013 outcome: selected work has recently been accepted for inclusion in the Cartography and Geographic Information Science journal special issue (guest editors: Xinyue Ye, Qunying Huang, Wenwen Li). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

16 Scalable Parallel Clustering Implementation (Clustering). Existing parallel implementations did not scale for the problem domain. Compared: PDSDBSCAN, a parallel DBSCAN algorithm [8]. DBSCAN noise is ignored & separately reported. The new implementation, HPDBSCAN, is work in progress but shows promising initial results. [8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012. Slide material and performance comparisons courtesy of Markus Goetz and Christian Bodenstein.

17 Further Use of HPDBSCAN: Data of the Koljoefjord, Sweden. Data from the PANGAEA data collection [9]. Example: one month of measurements (03/2012); in reality, years of data are available. Goal: automatic outlier detection using the parallel DBSCAN algorithm. Better measurement devices produce orders of magnitude more data, so manual quality control becomes impossible and error-prone; the quality control process should be automated (parallelization). DBSCAN noise is ignored & separately reported as outliers; alternatives such as boxplots are not feasible. Data courtesy of Robert Huber, MARUM Center for Marine Environmental Sciences, Bremen.
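The idea of treating DBSCAN noise as outliers can be sketched for a one-dimensional measurement series. The sketch below is an assumption-laden illustration, not HPDBSCAN: it only computes which values DBSCAN would label as noise (values that are neither core points nor within `eps` of a core point), and the function name, the sample readings, and the parameter values are hypothetical.

```python
def dbscan_noise_1d(values, eps, min_pts):
    """Return the indices of values that DBSCAN would label as noise.

    A value is a core point if at least min_pts values (itself included)
    lie within eps of it. A value is noise if it is not a core point and
    no core point lies within eps of it.
    """
    n = len(values)
    # Brute-force neighbourhoods; quadratic in n, fine for a small sketch.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbours]
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbours[i])
    ]
```

For the made-up readings `[7.1, 7.2, 7.0, 7.3, 12.9, 7.2, 7.1]` with `eps=0.5` and `min_pts=3`, only index 4 (the value 12.9) is reported. Unlike a boxplot rule, this criterion depends on local density rather than on global quantiles.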

18 Big Brain Data Analytics

19 JSC/INM Jointly Tackled Case. Slide courtesy of Dr. M. Axer, from the talk "New insights into the fiber architecture of the brain".

20 Big Brain Data Analytics: Scientific Case (Classification). Build a reconstructed brain (one 3D volume) that matches the sections & block images. Understand the sectioning of the brain and support automation of the reconstruction. 1. Some pattern exists ([1] Problem understanding phase): image content classification into class [1] brain part and class [2] not brain part. 2. No exact mathematical formula exists: there is no precise formula for the contour of the brain; the class boundary is non-linear. 3. Data ([2] Data understanding phase): block face images (of frozen tissue), one every 20 micron (cut size); resolution 3272 x 2469; ~14 MB per RGB image; ~8 MB per corresponding mask image (ground truth); ~700 images; ~40 GB dataset. We can apply supervised learning because the data is labelled (class labels [1] and [2]).

21 Supervised Learning: Mathematical Building Blocks (1). Unknown Target Function (ideal function): an element we do not (need to) know exactly. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Two solution tools form the Learning Model: the Learning Algorithm ("train a system"; set of known algorithms), which we derive from our skillset and which can be computationally intensive, and the Hypothesis Set (set of candidate formulas), which we derive from our skillset. Result: Final Hypothesis (final formula).

22 Statistical Learning Theory: Probability Distribution on X. Question: can we really learn a function from data? Learning is only possible in a probabilistic sense, using a probability distribution on X [no restriction]. In-sample data tracks out-of-sample data when both are created from the same distribution: the in-sample frequency is likely close to the out-of-sample frequency. Given large N, there is a probability of picking one point or another; the data points are created independently of each other. This is mathematically established via Hoeffding's inequality, where M is the number of hypotheses. Out of sample: future brains that we don't know yet and that we use the model to predict. Practice: we need enough samples N to learn. Problem: M is infinite, but we can reduce it to ~N (large overlaps). Solution: VC dimension.
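The formula itself did not survive in the transcript. As a hedged reminder, the commonly stated form of Hoeffding's inequality with a union bound over M hypotheses is given below; the symbols E_in (in-sample error), E_out (out-of-sample error), g (final hypothesis) and epsilon (tolerance) follow common textbook notation and are assumptions, not taken from the slide.

```latex
\Pr\bigl[\,\lvert E_{\text{in}}(g) - E_{\text{out}}(g)\rvert > \epsilon\,\bigr]
  \;\le\; 2M\,e^{-2\epsilon^{2}N}
```

As a worked example under these assumptions: for a single hypothesis (M = 1), tolerance epsilon = 0.05 and N = 1000 samples, the bound is 2e^{-5}, roughly 0.0135, so in-sample and out-of-sample error differ by more than 0.05 with probability of at most about 1.35%. The bound becomes vacuous as M grows, which is why the slide points to the VC dimension as the way to replace an infinite M.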

23 Supervised Learning: Mathematical Building Blocks (2). Unknown Target Function (ideal function) and Probability Distribution: elements we do not (need to) know exactly; they are "constants" in learning. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Learning Algorithm ("train a system"; set of known algorithms): derived from our skillset and can be computationally intensive. Hypothesis Set (set of candidate formulas): derived from our skillset. Result: Final Hypothesis (final formula).

24 Statistical Learning Theory: Error Measure & Noisy Targets. Question: how can we learn a function from (noisy) data? Error measures quantify our progress towards the goal. The error measure is often user-defined; if not, the squared error is often used (aka point-wise error measure). (Noisy) target: the target function is not a (deterministic) function, because getting the same y out for the same x is not always given in practice. Problem: noise in the data hinders us from learning. Idea: use a target distribution instead of a target function.
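The formulas on this slide are missing from the transcript. A minimal sketch of what the point-wise squared error and the target distribution usually look like is given below; the symbols h (hypothesis), f (target function), E_in (in-sample error) and P(y | x) follow common textbook notation and are assumptions rather than the slide's own notation.

```latex
% Point-wise squared error between hypothesis h and target f
e\bigl(h(\mathbf{x}), f(\mathbf{x})\bigr) = \bigl(h(\mathbf{x}) - f(\mathbf{x})\bigr)^{2}

% In-sample error: average point-wise error over the N training examples
E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} e\bigl(h(\mathbf{x}_n), y_n\bigr)

% Noisy target: y is drawn from a distribution instead of computed by f
y \sim P(y \mid \mathbf{x}), \qquad
f(\mathbf{x}) = \mathbb{E}[\,y \mid \mathbf{x}\,], \qquad
y = f(\mathbf{x}) + \text{noise}
```

Read this way, a deterministic target is the special case in which P(y | x) puts all its mass on y = f(x).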

25 Supervised Learning: Mathematical Building Blocks (3). Unknown Target Distribution (target function plus noise, replacing the ideal function) and Probability Distribution: elements we do not (need to) know exactly; they are "constants" in learning. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Error Measure. Learning Algorithm ("train a system"; set of known algorithms): derived from our skillset and can be computationally intensive. Hypothesis Set (set of candidate formulas): derived from our skillset. Result: Final Hypothesis (final formula).

26 Supervised Learning: Selected Classification Method (Classification). Initial approach: Support Vector Machines (SVMs) [7], one of the best out-of-the-box, robust classification methods ([4] Modelling). A binary classifier separates two classes: [1] brain; [2] non-brain. Parameters are chosen after cross-validation; radial basis function (RBF) kernel; C-SVC type. Training uses quadratic programming & the Lagrangian method with an N x N matrix of quadratic coefficients. Cross-validation (grid search) is nicely parallel (high-throughput computing). Figures: linear example; maximal margin classifier example; maximizing the margin of the hyperplane turned into an optimization problem (minimization, dual problem); dual problem solved with a quadratic programming method; quadratic coefficients. [7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3).
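The optimization problem shown in the slide figures is not part of the transcript. The block below is a sketch of the standard soft-margin (C-SVC) dual with an RBF kernel, which matches the ingredients the slide names (Lagrangian method, quadratic programming, N x N quadratic coefficients, parameters C and gamma); the symbols alpha_n, y_n, K and Q are conventional notation and are assumptions here.

```latex
% Dual problem of the soft-margin SVM (C-SVC) over the Lagrange multipliers
\max_{\boldsymbol{\alpha}} \;
  \sum_{n=1}^{N} \alpha_n
  - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N}
      y_n y_m \,\alpha_n \alpha_m \, K(\mathbf{x}_n, \mathbf{x}_m)
\quad \text{subject to} \quad
  0 \le \alpha_n \le C, \qquad \sum_{n=1}^{N} \alpha_n y_n = 0

% Radial basis function (RBF) kernel
K(\mathbf{x}, \mathbf{x}') = \exp\bigl(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\bigr)

% Equivalent minimization form passed to a quadratic programming solver
\min_{\boldsymbol{\alpha}} \;
  \frac{1}{2}\,\boldsymbol{\alpha}^{\top} Q \,\boldsymbol{\alpha}
  - \mathbf{1}^{\top}\boldsymbol{\alpha},
\qquad Q_{nm} = y_n y_m \, K(\mathbf{x}_n, \mathbf{x}_m)
```

The matrix Q has N x N entries, one per pair of training samples, which is why memory and training time grow quickly with the number of sampled pixels.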

27 Big Brain Data Analytics: Data Preprocessing (1) (Classification). Feature selection/extraction/reduction ([3] Data Preparation). Labelled dataset with only three features (RGB channels). E.g. principal component analysis (PCA) is not very helpful here, since the RGB channels are orthogonal. [Figure labels: x pixel location [1D]; y pixel location [2D]; RGB (red, green, blue) [3D]; 3D brain [4D]; sum = 3 features.]

28 Big Brain Data Analytics: Data Preprocessing (2) (Classification). Transform images to the LibSVM format ([3] Data Preparation). Labelled dataset with only three features (RGB channels). Each line is a training vector with RGB levels, i.e. each line is one pixel: the class label (black/white mask value) followed by the indexed features red, green and blue, in the layout "label 1:red 2:green 3:blue". Samples are a subset of all instances (we have to pick the samples randomly).
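The conversion described on this slide can be sketched as follows. The helper names and the pixel values are hypothetical; the only thing taken as given is the LibSVM text layout of one sample per line, with the label first and then 1-based `index:value` pairs.

```python
def pixel_to_libsvm(label, rgb):
    """Format one labelled pixel as a LibSVM line: '<label> 1:<r> 2:<g> 3:<b>'."""
    features = " ".join(f"{i}:{v}" for i, v in enumerate(rgb, start=1))
    return f"{label} {features}"


def image_to_libsvm(pixels, mask):
    """Convert parallel sequences of RGB tuples and 0/1 mask values to LibSVM lines."""
    if len(pixels) != len(mask):
        raise ValueError("pixels and mask must have the same length")
    return [pixel_to_libsvm(m, p) for p, m in zip(pixels, mask)]
```

For example, `pixel_to_libsvm(0, (12, 15, 9))` produces the line `0 1:12 2:15 3:9`. The sketch writes all three channels for every pixel and does not scale the values; whether the original pipeline scaled the features is not stated in the slides.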

29 Big Brain Data Analytics: Data Preprocessing (3) (Classification). Smart data sampling ([3] Data Preparation). The data bears the potential of sampling bias (here: many more black than white pixels). Solution: create samples equally per class ([1] brain; [2] non-brain). Create different datasets for training & testing (same structure, but avoid data snooping!). Files: r_msa_ _dxxxx-xx-xx_all_train; r_msa_ _dxxxx-xx-xx_all_test.svm.
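The sampling step above can be sketched as a small helper that draws the same number of samples per class and keeps training and test sets disjoint. The function name, the fixed seed, and the toy data in the usage note are assumptions for illustration; the slides do not describe how the original sampling was implemented.

```python
import random


def balanced_split(samples, labels, n_per_class, seed=0):
    """Draw n_per_class training and n_per_class test samples for every class.

    Each drawn sample goes to exactly one of the two sets, so no test sample
    is seen during training (avoiding one form of data snooping).
    """
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y in sorted(by_class):
        items = by_class[y]
        if len(items) < 2 * n_per_class:
            raise ValueError(f"class {y} has too few samples")
        picked = rng.sample(items, 2 * n_per_class)  # sampling without replacement
        train += [(s, y) for s in picked[:n_per_class]]
        test += [(s, y) for s in picked[n_per_class:]]
    return train, test
```

With 90 samples of class 0 and 10 of class 1, `balanced_split(list(range(100)), [0] * 90 + [1] * 10, 5)` returns 10 training and 10 test pairs, 5 per class in each set, regardless of the 9:1 imbalance in the input.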

30 Big Brain Data Analytics: Approach Overview. Sampling, training and testing. Parameters: C (allowing error) and gamma (RBF kernel).

31 Big Brain Data Analytics: Initial Results (Classification). Approach: cross-validation for model selection ([4] Modelling). Skipped for simplicity: essentially a grid search to obtain the two parameters C and gamma. C bounds the sum of errors, which determines the number and severity of margin violations (using a soft-margin SVM model; the errors are also called slack variables). Approach: serial SVM implementations ([4] Modelling). Data: nearly the full dataset, but with equally balanced classes. Stopped after using three different serial implementations. Big data problem: the plain balanced dataset is too large to be properly processed in serial. Potential solution: create smaller samples from the plain balanced dataset.
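The grid search with cross-validation mentioned above can be sketched independently of any particular SVM library. In the sketch below, `fit_and_score` is a hypothetical callback that the caller supplies (it would train an SVM with the given C and gamma on the training indices and return the validation accuracy); the fold layout and the exponentially spaced candidate values are assumptions, not the settings used in the talk.

```python
import itertools


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds; yield (train_idx, val_idx) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid_search(fit_and_score, n_samples, c_values, gamma_values, k=5):
    """Return (best_mean_score, best_C, best_gamma) over the parameter grid.

    fit_and_score(train_idx, val_idx, C, gamma) must train a model on the
    training indices and return its accuracy on the validation indices.
    """
    best = None
    for C, gamma in itertools.product(c_values, gamma_values):
        scores = [fit_and_score(tr, va, C, gamma)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, C, gamma)
    return best


# Exponentially spaced candidates (illustrative values only).
C_VALUES = [2.0 ** e for e in range(-5, 16, 4)]
GAMMA_VALUES = [2.0 ** e for e in range(-15, 4, 4)]
```

Every (C, gamma, fold) combination is an independent training run, so the loop body could be distributed over many workers; this independence is what makes the grid search a natural fit for high-throughput computing.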

32 Big Brain Data Analytics: Data Preprocessing Revisited (Classification). Smart data sampling ([3] Data Preparation). Create very small samples, still with balanced classes. Approach: 0.01 % of the data per class. Create different datasets for training & testing. Files: r_msa_ _dxxxx-xx-xx_all_train; r_msa_ _dxxxx-xx-xx_all_test.svm.

33 Big Brain Data Analytics: Selected Serial Results (Classification). Approach: serial SVM implementations ([4] Modelling). Data: sampled very small dataset, but with equally balanced classes. Scikit-learn (Python) example ([5] Evaluation): training time ~39 minutes on JUDGE; testing time ~2 min; accuracy ~91%. MATLAB example ([5] Evaluation): training time ~3 hours on a laptop; testing time ~27 min; accuracy ~90.1%. Using the small sample size made training possible with some serial implementations, but we reached a limit: the approach is not scalable to larger quantities of data. Potential improvement: reduce training time while maintaining classification accuracy.

34 Big Brain Data Analytics: Selected Parallel Results (Classification). Approach: parallel SVM implementations ([4] Modelling). Data: sampled very small dataset, but with equally balanced classes. No limit: theoretically scalable, but usefulness depends on the datasets. Incremented the number of datasets, towards a sample size of 0.1%. Twister (iterative Map-Reduce) example ([5] Evaluation): challenge: data distribution across the parallel infrastructure on FutureGrid; training time ~7 minutes on JUDGE; testing time ~7 min; accuracy ~96%. The parallel version with Twister (iterative Map-Reduce) is working with a growing dataset. Work in progress: other parallel implementations (MLlib is shown on the slide).

35 Selected Lessons Learned

36 Big Brain Data Analytics: Parallel SVM Technologies
Tool | Platform | Approach (parallel Support Vector Machine)
Apache Mahout | Java; Apache Hadoop 1.0 (map-reduce); HTC | No strategy for implementation (website); serial SVM in code
Apache Spark/MLlib | Apache Spark; HTC | Only linear SVM; no multi-class implementation
Twister/ParallelSVM | Java; Apache Hadoop 1.0 (map-reduce); Twister (iterations); HTC | Many dependencies on other software: Hadoop, messaging, etc.
Scikit-Learn | Python; HPC/HTC | Multi-class implementations of SVM, but not fully parallelized
piSVM | C code; Message Passing Interface (MPI); HPC | Simple multi-class parallel SVM implementation, outdated (~2011)
GPU accelerated LIBSVM | CUDA language | Multi-class parallel SVM; relatively hard to program; no standard (CUDA)
pSVM | C code; Message Passing Interface (MPI); HPC | Unstable beta; SVM implementation outdated (~2011)
Journal paper in preparation. Lessons across Clustering++, Classification++ and Regression++: implementations of an algorithm are often closed or old source, even after asking the paper authors; implementations of algorithm extensions are available; parallelizations of algorithm extensions are rare and/or not stable.

37 Big Brain Data Analytics: Classifier Challenges (Classification). Sampling, training and testing. Checking out-of-sample performance ([5] Evaluation), e.g. using two different images and comparing with the masks: issue. Check of out-of-sample performance & better data understanding: ~ok. Problem: 2D cut classification on 3D brain color data (i.e. background color). Solution: e.g. use neighbouring methods. Color histograms: e.g. [...] samples of class 0 & only 567 of class 1 (for G).

38 Big Brain Data Analytics: Potential Next Approaches. Approach to the 2D/3D problem: apply Self-Dual Attribute Profiles (SDAP) [9]. They increase the number of dimensions using different threshold values, take advantage of neighbouring pixels, and cut trees at certain thresholds. Good experience in land cover classification: e.g. ~70% to ~90% accuracy. Attributes: area, standard deviation, moment of inertia. Example: standard deviation (channel blue). Approach: increase the number of training samples (no longer serial), since very small sampling may violate the generalization capability of the classifier. Approach: compare with other (parallel) classification methods, e.g. Naive Bayes classifier, decision trees, random forests. [9] G. Cavallaro, M. Dalla Mura, J.A. Benediktsson, L. Bruzzone, A Comparison of Self-Dual Attribute Profiles based on different filter rules for classification, IEEE IGARSS 2014, Quebec, Canada.

39 Big Brain Data Analytics: Much More Interesting Challenges! 3D reconstruction of high-resolution images; data: brain slices (microscopic measurements). Mapping of cell densities and cortical areas (in 3D); data: ~1 PB/brain. Analysing differences of brains evolving over time (longitudinal studies); data: e.g. Kohorte project (in-vivo human studies) and e.g. vervet monkey, 20 brains (in-vivo and post-mortem monkeys). Close collaboration between JSC and INM bears lots of potential for tackling research challenges.

40 Acknowledgements & References

41 Acknowledgements. Gabriele Cavallaro, University of Iceland; Tomas Philipp Runarsson, University of Iceland; Shaowen Wang, National Center for Supercomputing Applications; Junjun Yin, National Center for Supercomputing Applications; Markus Axer, Stefan Köhnen, Tim Hütz, Institute of Neuroscience & Medicine, Juelich. Selected members of the Research Group on High Productivity Data Processing: Ahmed Shiraz Memon, Mohammad Shahbaz Memon, Markus Goetz, Christian Bodenstein, Philipp Glock, Matthias Richerzhagen.

42 References
[1] EUDAT European Data Infrastructure, B2SHARE Tool, online.
[2] G. Cavallaro, M. Riedel et al., Smart Data Analytics Methods for Remote Sensing Applications, IEEE IGARSS 2014, Quebec, Canada.
[3] P. Chapman et al., CRISP-DM Guide.
[4] Research Data Alliance, online.
[5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009.
[6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science, vol. 343, 2014.
[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3).
[8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
[9] PANGAEA data collection (talk available online).

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

High Productivity Data Processing Analytics Methods with Applications

High Productivity Data Processing Analytics Methods with Applications High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research

More information

Classification Techniques in Remote Sensing Research using Smart Data Analytics

Classification Techniques in Remote Sensing Research using Smart Data Analytics Classification Techniques in Remote Sensing Research using Smart Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Morris Riedel Juelich Supercomputing

More information

Scientific Big Data Analytics by HPC - Parallel and Scalable Machine Learning on JURECA

Scientific Big Data Analytics by HPC - Parallel and Scalable Machine Learning on JURECA Scientific Big Data Analytics by HPC - Parallel and Scalable Machine Learning on JURECA Federated Systems and Data Division Research Group High Productivity Data Processing Dr.- Ing. Morris Riedel et al.

More information

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

Selected Parallel and Scalable Methods for Scientific Big Data Analytics Selected Parallel and Scalable Methods for Scientific Big Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group

More information

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

Selected Parallel and Scalable Methods for Scientific Big Data Analytics Selected Parallel and Scalable Methods for Scientific Big Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group

More information

Understanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics

Understanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics Understanding Big Data Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Interest Group Co Chairs are Needed in Big Data-driven Scientific Research The

More information

Introduction to Big Data in HPC, Hadoop and HDFS Part One

Introduction to Big Data in HPC, Hadoop and HDFS Part One Introduction to Big Data in HPC, Hadoop and HDFS Part One Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel Adjunct Associated Professor, University

More information

From Big Data Analytics To Smart Data Analytics With Parallelization Techniques

From Big Data Analytics To Smart Data Analytics With Parallelization Techniques From Big Data Analytics To Smart Data Analytics With Parallelization Techniques Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel et al. Adjunct

More information

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1 On Understanding Big Data Impacts in Remotely Sensed Image Classification Using Support Vector Machine Methods Gabriele

More information

On Establishing Big Data Breakwaters

On Establishing Big Data Breakwaters On Establishing Big Data Breakwaters with Analytics Dr. - Ing. Morris Riedel Head of Research Group High Productivity Data Processing, Juelich Supercomputing Centre, Germany Adjunct Associated Professor,

More information

European Data Infrastructure - EUDAT Data Services & Tools

European Data Infrastructure - EUDAT Data Services & Tools European Data Infrastructure - EUDAT Data Services & Tools Dr. Ing. Morris Riedel Research Group Leader, Juelich Supercomputing Centre Adjunct Associated Professor, University of iceland BDEC2015, 2015-01-28

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics

More information

Analysis Tools and Libraries for BigData

Analysis Tools and Libraries for BigData + Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Introduction to Big Data in HPC, Hadoop and HDFS Part Two

Introduction to Big Data in HPC, Hadoop and HDFS Part Two Introduction to Big Data in HPC, Hadoop and HDFS Part Two Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel Adjunct Associated Professor, University

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Maschinelles Lernen mit MATLAB

Maschinelles Lernen mit MATLAB Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Big Data and Analytics: Challenges and Opportunities

Big Data and Analytics: Challenges and Opportunities Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com>

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com> IC05 Introduction on Networks &Visualization Nov. 2009 Overview 1. Networks Introduction Networks across disciplines Properties Models 2. Visualization InfoVis Data exploration

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

Distributed forests for MapReduce-based machine learning

Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

BIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou

Let's begin with a story Let's explore Yahoo's data! Dora the Data Explorer has a new

Environmental Remote Sensing GEOG 2021

Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

Predicting Flight Delays

Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

Fast Analytics on Big Data with H20

0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

Journée Thématique Big Data 13/03/2015

1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

Java Modules for Time Series Analysis

Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

Decision Trees from large Databases: SLIQ

C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 impractical! SLIQ: Sort the values

Network Intrusion Detection using Semi Supervised Support Vector Machine

Jyoti Haweliya Department of Computer Engineering Institute of Engineering & Technology, Devi Ahilya University Indore, India ABSTRACT

Bayesian networks - Time-series models - Apache Spark & Scala

Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

Example: Credit card default, we may be more interested in predicting the probability of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

Introduction to Data Mining

Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

Map-Reduce for Machine Learning on Multicore

Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

Comparison of machine learning methods for intelligent tutoring systems

Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

Several Views of Support Vector Machines

Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

Introduction to Data Mining

1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS IN COMMERCIAL BANKING CS 229 Project: Final Report Oleksandra Onosova INTRODUCTION Recent innovations in cloud computing and unified communications have made a

Grid Density Clustering Algorithm

Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

Big Data Analytics. Tools and Techniques

Big Data Analytics Basic concepts of analyzing very large amounts of data Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research

Analytics on Big Data

Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Introduction Spatial Big Data Geo-crowdsourcing: OpenStreetMap Remote

Statistical Machine Learning

UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

A Map Reduce based Support Vector Machine for Big Data Classification

pp. 77-98, http://dx.doi.org/10.14257/ijdta.2015.8.5.07 Anushree Priyadarshini and Sonali Agarwal Indian Institute of Information Technology,

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Big Data Big data technologies describe a new generation of technologies that aim

Machine learning for algo trading

An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

Traffic Prediction and Analysis using a Big Data and Visualisation Approach

Declan McHugh 1 1 Department of Computer Science, Institute of Technology Blanchardstown March 10, 2015 Summary This abstract

Introduction to Support Vector Machines. Colin Campbell, Bristol University

1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

Sentiment analysis using emoticons

Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,

Predict the Popularity of YouTube Videos Using Early View Data


DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI-CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;

A Simple Introduction to Support Vector Machines

Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

Bootstrapping Big Data

Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other

Identification algorithms for hybrid systems

Giancarlo Ferrari-Trecate Modeling paradigms Chemistry White box Thermodynamics System Mechanics... Drawbacks: Parameter values of components must be known

Spark: Cluster Computing with Working Sets

Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

Simple and efficient online algorithms for real world applications

Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

Machine Learning in Spam Filtering

A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

Support Vector Machines

Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

Music Mood Classification

CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

Information Processing, Big Data, and the Cloud

James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive

The? Data: Introduction and Future

Husnu Sensoy Global Maksimum Data & Information Technologies The Data Company Massive Data Unstructured Data Insight Information

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

Dr. Daisy Zhe Wang Director of Data Science Research Lab University of Florida, CISE

Data, Measurements, Features

Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

Data Mining Part 5. Prediction

5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

Object Recognition and Template Matching

Template Matching A template is a small image (sub-image) The goal is to find occurrences of this template in a larger image That is, you want to find matches of

The STC for Event Analysis: Scalability Issues

Georg Fuchs Gennady Andrienko http://geoanalytics.net Events Something [significant] happened somewhere, sometime Analysis goal and domain dependent, e.g.

Lecture 2: The SVM classifier

C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

Machine Learning Big Data using Map Reduce

By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

Optimizing content delivery through machine learning. James Schneider Anton DeFrancesco

Obligatory company slide Our Research Areas Machine learning The problem Prioritize import information in low bandwidth

Content-Based Recommendation

Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

Predict Influencers in the Social Network

Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics charlie.berger@oracle.com www.twitter.com/charliedatamine

Intrusion Detection via Machine Learning for SCADA System Protection

S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department