Selected Big Data Analytics Methods in Health Research
Transcription
1 Selected Big Data Analytics Methods in Health Research
Research Field Key Technologies: Supercomputing & Big Data, Jülich Supercomputing Centre
Dr.-Ing. Morris Riedel et al., Adjunct Associate Professor, University of Iceland; Head of Research Group High Productivity Data Processing, Jülich Supercomputing Centre, Germany
Peking, 10th to 12th September 2014
2 Outline
3 Outline
Research Group High Productivity Data Processing
  From Big Data Analytics to Smart Data Analytics
  Research Data Alliance Activities
LBSN Analytics [Changes 2013 Seed Project Follow-Up NCSA/JSC]
  Scientific Case & Visualization
  Clustering using Parallel DBSCAN
Big Brain Data Analytics [Collaboration INM/JSC]
  Scientific Case
  (Supervised) Learning from Data 101
  Classification of Brain Big Data
Selected Lessons Learned
References
Note: the Changes 2013 activities are presented with selected contents of the (dropped) talk from Shaowen Wang et al. (Tue)
4 Demand for Generic Data Methods in Science
Advancing user-centered data mining methods & tools (e.g. outlier detection, support vector machines)
Algorithms and Data Structures: enhancing/parallelizing generic techniques and algorithmic cores (e.g. indexing, sorting)
Visualization: scaling visualization techniques for high-performance graphics and data visualization
Security: efficient security and privacy solutions applied to real-world big data use cases
5 Research Group High Productivity Data Processing
6 Research Group High Productivity Data Processing
[Diagram: data science draws on data mining, applied statistics and machine learning as different approaches. Scientific applications using big data build on traditional scientific computing methods, HPC and HTC paradigms & parallelization, optimized data access & management, and statistical data mining & machine learning. Together these lead to smart data analytics approaches & techniques, using (parallel) big data methods and technologies such as MLlib on research data.]
7 Juelich Supercomputing Centre Context
Support data-intensive science and engineering applications
Explore computing that is more intertwined with data analysis
Big data topics: data management, security & access; clustering; classification; regression
8 Research Data Alliance Activities
Steering the Big Data Analytics Interest Group
Members: ~60; founded ~03/2013 at the 1st RDA Plenary (Goeteborg)
Telcons: 1-2x / month; Secretary: M. Goetz (JSC)
Co-chairs: M. Riedel (JSC), K. Kwo-Sen (NASA), P. Baumann (UoBremen)
Systematic way: Cross Industry Standard Process for Data Mining (CRISP-DM)
Big Data Analytics Group Process: concrete datasets (& source/sensor); algorithms & methods; technologies & resources; scientific data applications
Best practices: community-based practice & recommendations
Reference data analytics for reusability & learning: CRISP-DM report, openly shared datasets, running analytics code
[4] Research Data Alliance
Join us in Amsterdam: 4th RDA Plenary, September 2014. Two analytics sessions!
9 Research Data Alliance Application Example
Big Data Analytics Group Process, following the CRISP-DM guide [3]
Concrete dataset: satellite data (Quickbird)
Algorithms & methods: parallel Support Vector Machines (SVM)
Technologies & resources: HPC/MPI, Map-Reduce & GPGPUs; JUDGE at JSC
Scientific data application: classification study of land cover types [2]
Best practices: community-based practice (Classification++)
Reference data analytics for reusability & learning: CRISP-DM report, openly shared datasets via EUDAT B2SHARE [1], running analytics code
Related case: parallel brain data analytics
[1] EUDAT B2SHARE
[2] G. Cavallaro and M. Riedel, Smart Data Analytics Methods for Remote Sensing Applications, IGARSS 2014
[3] P. Chapman et al., CRISP-DM Guide
10 Understanding Industry vs H1N1 Virus Made Headlines
Nature paper from Google employees [5]
  Explains how Google is able to predict winter flus
  Not only on a national scale, but down to regions
  Possible via logged big data search queries
Big data is not always better data: think causality vs. correlation
2014: The Parable of Google Flu [6]
  Large errors in flu prediction & lessons learned
  (1) Dataset: transparency & replicability impossible
  (2) Study the algorithm, since it keeps changing
  (3) It's not just about the size of the data
[5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009
[6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science Vol (343), 2014
11 LBSN Analytics
12 Scientific Case & Realization (Clustering)
Towards interactive visual analytics using parallel methods
Goal: answer health-related questions from publicly available data
  E.g. estimated emissions per region correlated with measurement stations
  E.g. estimated pollution/emissions breathed in per person per region
Visualization pipeline design:
  Open data source: OpenStreetMap (OSM) maps/streets
  Parallel MPI/OpenMP trajectories (statistics) using the OSRM library
  Parallel MPI/OpenMP clustering (DBSCAN)
Approach: towards interactive data exploration
  Click-free visualization, i.e. no GUI applications
  Support for typical overlays (density maps, polygon creation)
  All of the above is possible with browser programming APIs (e.g. OpenLayers)
Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
13 Approach: LBSN Trajectory Mining
Scientific domain area: Smart Cities approaches combined with health analytics research
Location-based Social Networks (LBSN) data
  Open data sources: Twitter (tweets) & Foursquare (check-ins)
  Data collection and storage: NCSA
  Initial computation: JSC
Scientific outcome: traffic density estimation; network emission model
Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
14 Example: Cluster Visualization (Clustering)
London clusters on 6/1/2014, 1h time slice beginning at 18:00 UTC, computed using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
15 Data Preparation & Visualization Implementation (Clustering)
The source code is available in our Bitbucket repository on invite: Changes2013
Outcome: selected work has recently been accepted for inclusion in the Cartography and Geographic Information Science Journal Special Issue (guest editors: Xinyue Ye, Qunying Huang, Wenwen Li)
Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
16 Scalable Parallel Clustering Implementation (Clustering)
Existing parallel implementations did not scale for the problem domain
PDSDBSCAN: parallel DBSCAN algorithm [8]
HPDBSCAN: the new implementation is work in progress, but shows promising initial results
In the comparisons, DBSCAN noise is ignored & separately reported
[8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, In proceedings of 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
Slide material and performance comparisons courtesy of Markus Goetz and Christian Bodenstein
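For readers unfamiliar with the algorithm being parallelized, the following is a minimal serial sketch of textbook DBSCAN in Python. It is not the PDSDBSCAN or HPDBSCAN implementation discussed on the slide; the function names, the brute-force neighbour search and the parameter values are illustrative assumptions.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Two dense squares and one isolated point (hypothetical toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
noise = [p for p, label in zip(points, labels) if label == NOISE]
```

The brute-force `region_query` costs O(n) per point, so the sketch is quadratic overall; spatial indexing and distribution of that neighbour search are typically where parallel variants invest their effort. Collecting `noise` separately mirrors the slide's remark that noise is reported on its own.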
17 Further Use of HPDBSCAN: Data of Koljoefjord, Sweden
Data from the PANGAEA data collection [9]
  Example: one month of measurements, 03/2012
  Reality: years of data are available
Goal: automatic outlier detection using the parallel DBSCAN algorithm
  Better measurement devices produce orders of magnitude more data
  Manual quality control becomes infeasible and error-prone (e.g. boxplots are not feasible)
  Automate the quality control process (via parallelization)
  DBSCAN noise is ignored & separately reported: noise points are the outlier candidates
[9] PANGAEA data collection
Data courtesy of Robert Huber, MARUM Center for Marine Environmental Sciences, Bremen
18 Big Brain Data Analytics
19 JSC/INM Jointly Tackled Case
Slide courtesy of Dr. M. Axer, from the talk "New insights into the fiber architecture of the brain"
20 Big Brain Data Analytics Scientific Case (Classification)
Goal: build a reconstructed brain (one 3D volume) that matches the section & block images; understand the sectioning of the brain and support automation of the reconstruction
1. Some pattern exists ([1] Problem understanding phase)
  Image content classification: class [1] brain part; class [2] not brain part
2. No exact mathematical formula exists
  No precise formula for the contour of the brain; a non-linear class boundary
3. Data ([2] Data understanding phase)
  Block face images (of frozen tissue), one every 20 micron (cut size)
  Resolution: 3272 x 2469
  ~14 MB per RGB image
  ~8 MB per corresponding mask image (ground truth)
  ~700 images, ~40 GB dataset
Because the data is labelled, we can apply supervised learning
[Figure: block face image and mask image annotated with the class labels [1] and [2].]
21 Supervised Learning Mathematical Building Blocks (1)
Unknown Target Function (ideal function): an element we do not (need to) know exactly
Training Examples (historical records, ground-truth data, examples): elements we must and/or should have and that might raise huge demands for storage
Learning Algorithm ("train a system"; chosen from a set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
Hypothesis Set (set of candidate formulas): an element that we derive from our skillset
Two solution tools: the learning algorithm and the hypothesis set together form the learning model
22 Statistical Learning Theory: Probability Distribution on X
Question: can we really learn a function from data?
Learning is only possible in a probabilistic sense [no restriction on the probability distribution]
  In-sample data tracks out-of-sample data when both are created from the same distribution
  Given large N, there is a probability of picking one point or another
  The data points are created independently of each other
Mathematically established via Hoeffding's inequality: the in-sample frequency is likely close to the out-of-sample frequency
  In sample: the brains we have; out of sample: future brains, which we do not know and want to predict
  M is the number of hypotheses
Practice: we need enough samples N to learn
Problem: M is infinite, but because hypotheses overlap largely it can be reduced to ~N
Solution: VC dimension
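The inequality itself did not survive transcription. As a hedged reconstruction in the standard textbook notation (the symbols E_in, E_out, h, g and ε are assumptions, not taken from the slide), Hoeffding's inequality for a single fixed hypothesis and its union-bound form for M hypotheses read:

```latex
% Single fixed hypothesis h, N independent samples, tolerance \epsilon > 0
\[
  \mathbb{P}\big[\,|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\,\big]
  \;\le\; 2\,e^{-2\epsilon^{2}N}
\]

% Final hypothesis g selected from a set of M hypotheses (union bound)
\[
  \mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big]
  \;\le\; 2M\,e^{-2\epsilon^{2}N}
\]
```

For an infinite hypothesis set the second bound is vacuous. VC theory replaces M by a growth function that is polynomial in N whenever the VC dimension is finite, which is presumably what the slide's "reduce it to ~N" refers to.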
23 Supervised Learning Mathematical Building Blocks (2)
Unknown Target Function (ideal function) and Probability Distribution: elements we do not (need to) know exactly; constants in learning
Training Examples (historical records, ground-truth data, examples): elements we must and/or should have and that might raise huge demands for storage
Learning Algorithm ("train a system"; chosen from a set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
Hypothesis Set (set of candidate formulas): an element that we derive from our skillset
24 Statistical Learning Theory: Error Measure & Noisy Targets
Question: how can we learn a function from (noisy) data?
Error measures quantify our progress towards the learning goal
  Often user-defined; if not, often the squared error
  Also known as a point-wise error measure
A (noisy) target function is not a (deterministic) function
  Putting the same x in does not always give the same y out in practice
Problem: noise in the data hinders us from learning
Idea: use a target distribution instead of a target function
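The formulas on this slide were lost in transcription. A minimal sketch in common textbook notation (the symbols e, h, f, E_in, E_out and P(y | x) are assumptions) of the point-wise squared error, the in-sample and out-of-sample errors built from it, and the noisy-target view:

```latex
% Point-wise squared error between hypothesis h and target f
\[
  e\big(h(x), f(x)\big) = \big(h(x) - f(x)\big)^{2}
\]

% In-sample error (average over N training points) and out-of-sample error
\[
  E_{\mathrm{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} e\big(h(x_n), f(x_n)\big),
  \qquad
  E_{\mathrm{out}}(h) = \mathbb{E}_{x}\Big[e\big(h(x), f(x)\big)\Big]
\]

% Noisy target: y is drawn from a target distribution instead of y = f(x)
\[
  y \sim P(y \mid x), \qquad f(x) = \mathbb{E}[\,y \mid x\,]
\]
```

Under this view the deterministic target function is the special case in which P(y | x) puts all its mass on a single value.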
25 Supervised Learning Mathematical Building Blocks (3)
Unknown Target Distribution (target function plus noise; ideal function) and Probability Distribution: elements we do not (need to) know exactly; constants in learning
Training Examples (historical records, ground-truth data, examples): elements we must and/or should have and that might raise huge demands for storage
Error Measure
Learning Algorithm ("train a system"; chosen from a set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
Hypothesis Set (set of candidate formulas): an element that we derive from our skillset
26 Supervised Learning: Selected Classification Method (Classification)
Initial approach: Support Vector Machines (SVMs) [7] ([4] Modelling)
  One of the best out-of-the-box / robust classification methods
  Binary classifier that separates two classes: [1] brain; [2] non-brain
  Parameters chosen after cross-validation; radial basis function (RBF) kernel; C-SVC type
  Uses quadratic programming & the Lagrangian method with an N x N matrix of quadratic coefficients
  Cross-validation (grid search) is nicely parallel (high-throughput computing)
[Figures: linear example; maximal margin classifier example; maximizing the margin of the hyperplane turned into an optimization problem (minimization, dual problem); dual problem solved using a quadratic programming method.]
[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3)
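The optimization problem shown in the slide's figures is not preserved. As a hedged sketch, the standard soft-margin (C-SVC) dual with an RBF kernel can be written as follows; the notation (α, y, K, γ) is the usual textbook one and is an assumption about what the slide showed.

```latex
% Soft-margin SVM dual: a quadratic program in the N multipliers \alpha_n
\begin{align*}
  \max_{\alpha}\quad & \sum_{n=1}^{N} \alpha_n
    - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N}
      y_n\, y_m\, \alpha_n\, \alpha_m\, K(x_n, x_m) \\
  \text{subject to}\quad & 0 \le \alpha_n \le C \quad (n = 1, \dots, N),
    \qquad \sum_{n=1}^{N} \alpha_n y_n = 0
\end{align*}

% Radial basis function kernel with width parameter \gamma
\[
  K(x, x') = \exp\!\big(-\gamma\, \lVert x - x' \rVert^{2}\big)
\]
```

The N x N matrix with entries y_n y_m K(x_n, x_m) holds the quadratic coefficients; its size grows quadratically with the number of training samples, which is one reason serial solvers struggle on large datasets.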
27 Big Brain Data Analytics Data Preprocessing (1) (Classification)
Feature selection/extraction/reduction ([3] Data Preparation)
Labelled dataset with only three features (RGB channels: red, green, blue)
E.g. principal component analysis (PCA) is not very helpful, since the RGB channels are orthogonal
[Figure: dimensions of the data: x pixel location [1D], y pixel location [2D], RGB [3D] (sum = 3 features), 3D brain [4D].]
28 Big Brain Data Analytics Data Preprocessing (2) (Classification)
Transform images to the LibSVM format ([3] Data Preparation)
Labelled dataset with only three features (RGB channels)
Each line is a training vector for one pixel: the class label (B/W) followed by the red, green and blue levels
Lines have the form: 0 1:<red> 2:<green> 3:<blue>
Only a subset of all instances is used as samples (we have to pick the samples randomly)
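The transformation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' conversion script: the function names, the in-memory image representation (nested lists of RGB tuples) and the convention that white mask pixels (255) get label 1 are all assumptions.

```python
def pixel_to_libsvm(label, rgb):
    """Format one pixel as a LibSVM line: '<label> 1:<R> 2:<G> 3:<B>'."""
    features = " ".join(f"{index}:{value}" for index, value in enumerate(rgb, start=1))
    return f"{label} {features}"


def image_to_libsvm(pixels, mask):
    """Yield one LibSVM line per pixel.

    pixels: rows of (R, G, B) tuples; mask: rows of 0/255 ground-truth values.
    Assumption: a white mask pixel (255) means class 1, anything else class 0.
    """
    for pixel_row, mask_row in zip(pixels, mask):
        for rgb, mask_value in zip(pixel_row, mask_row):
            yield pixel_to_libsvm(1 if mask_value == 255 else 0, rgb)


# Hypothetical 1 x 2 image: one dark background pixel, one bright tissue pixel.
pixels = [[(12, 34, 56), (200, 180, 170)]]
mask = [[0, 255]]
lines = list(image_to_libsvm(pixels, mask))
```

LibSVM's format is sparse, so zero-valued features may be omitted; the sketch writes all three channels on every line for readability.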
29 Big Brain Data Analytics Data Preprocessing (3) (Classification)
Smart data sampling ([3] Data Preparation)
The data bears the potential of sampling bias (here: many more black than white pixels)
Solution: create samples equally per class ([1] brain; [2] non-brain)
Create different datasets for training & testing (same structure, but avoid data snooping!)
Files: r_msa_ _dxxxx-xx-xx_all_train r_msa_ _dxxxx-xx-xx_all_test.svm
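A minimal sketch of class-balanced sampling with disjoint training and test sets, as described on the slide. The function name `balanced_split`, the fixed seed and the toy data are hypothetical; the actual sampling tool used by the authors is not shown in the source.

```python
import random


def balanced_split(samples, labels, n_per_class, seed=0):
    """Draw n_per_class training and n_per_class test samples for every class.

    Training and test samples never overlap, which avoids evaluating on
    data the classifier has already seen (data snooping).
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < 2 * n_per_class:
            raise ValueError(f"class {label!r} has too few samples")
        picked = rng.sample(items, 2 * n_per_class)  # sampling without replacement
        train += [(label, s) for s in picked[:n_per_class]]
        test += [(label, s) for s in picked[n_per_class:]]
    return train, test


# Hypothetical imbalanced data: 900 samples of class 0, 100 of class 1.
samples = list(range(1000))
labels = [0] * 900 + [1] * 100
train, test = balanced_split(samples, labels, n_per_class=20)
```

Sampling the same number of items from each class removes the imbalance described on the slide, at the price of discarding most of the majority class.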
30 Big Brain Data Analytics Approach Overview
Sampling, training and testing
Parameters: C (allowing error) and gamma (RBF kernel)
31 Big Brain Data Analytics Initial Results (Classification)
Approach: cross-validation for model selection ([4] Modelling)
  (Skipped for simplicity: essentially a grid search for the two parameters C and gamma. C controls the sum of errors and thereby the number and severity of margin violations in a soft-margin SVM model; the errors are also called slack variables.)
Approach: serial SVM implementations ([4] Modelling)
  Data: nearly full dataset, but with equally balanced classes
  Stopped after using three different serial implementations
"Big data problem": the plain balanced dataset is too large to be processed properly in serial
Potential solution: create smaller samples from the plain balanced dataset
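The grid search skipped on the slide can be sketched as below. This is a generic illustration under stated assumptions: `evaluate` is a hypothetical caller-supplied function that trains an SVM on the training indices and returns validation accuracy, and the exponentially spaced grids are a common convention rather than the authors' actual settings.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def grid_search(evaluate, n, c_values, gamma_values, k=5):
    """Return (best_mean_accuracy, best_C, best_gamma).

    evaluate(train_idx, val_idx, C, gamma) -> accuracy is supplied by the
    caller (hypothetical interface) and wraps the actual SVM training.
    """
    best = None
    for C, gamma in product(c_values, gamma_values):
        scores = [evaluate(tr, va, C, gamma) for tr, va in k_fold_indices(n, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, C, gamma)
    return best


# Stand-in scoring function that peaks at C = 8, gamma = 0.5 (not a real SVM).
def toy_evaluate(train_idx, val_idx, C, gamma):
    return 1.0 - abs(C - 8) / 100 - abs(gamma - 0.5)


c_grid = [2 ** e for e in range(-1, 6)]      # 0.5 ... 32
gamma_grid = [2 ** e for e in range(-3, 2)]  # 0.125 ... 2
best_score, best_c, best_gamma = grid_search(toy_evaluate, 100, c_grid, gamma_grid)
```

Every (C, gamma) pair and every fold is independent of the others, so the loop body can be farmed out as separate jobs.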
32 Big Brain Data Analytics Data Preprocessing Revisited (Classification)
Smart data sampling ([3] Data Preparation)
Create very small samples, still with balanced classes
Approach: 0.01 % of the data per class
Create different datasets for training & testing
Files: r_msa_ _dxxxx-xx-xx_all_train r_msa_ _dxxxx-xx-xx_all_test.svm
33 Big Brain Data Analytics Selected Serial Results (Classification)
Approach: serial SVM implementations ([4] Modelling)
Data: very small sampled dataset, but with equally balanced classes
[We reached a limit: the approach is not scalable to larger quantities of data]
Scikit-learn (Python) example ([5] Evaluation): training time ~39 minutes on JUDGE; testing time ~2 min; accuracy ~91%
Matlab example ([5] Evaluation): training time ~3 hours on a laptop; testing time ~27 min; accuracy ~90.1%
Using the small sample size made training possible with some serial implementations
Potential improvement: reduce the training time while maintaining classification accuracy
34 Big Brain Data Analytics Selected Parallel Results (Classification)
Approach: parallel SVM implementations ([4] Modelling)
Data: very small sampled dataset, but with equally balanced classes
  No limit: theoretically scalable, but usefulness depends on the datasets
  Incremented the number of datasets, towards a sample size of 0.1%
Twister (iterative Map-Reduce) example ([5] Evaluation)
  Challenge: data distribution across the parallel infrastructure on FutureGrid
  Training time: ~7 minutes on JUDGE; testing time: ~7 min; accuracy ~96%
The parallel version with Twister (iterative map-reduce) works with the growing dataset
Work in progress: other parallel implementations (e.g. MLlib)
35 Selected Lessons Learned
36 Big Brain Data Analytics Parallel SVM Technologies
Tool | Platform | Approach (Parallel Support Vector Machine)
Apache Mahout | Java; Apache Hadoop 1.0 (map-reduce); HTC | No strategy for implementation (website), serial SVM in code
Apache Spark/MLlib | Apache Spark; HTC | Only linear SVM; no multi-class implementation
Twister/ParallelSVM | Java; Apache Hadoop 1.0 (map-reduce); Twister (iterations); HTC | Many dependencies on other software: Hadoop, messaging, etc.
Scikit-Learn | Python; HPC/HTC | Multi-class implementations of SVM, but not fully parallelized
pisvm | C code; Message Passing Interface (MPI); HPC | Simple multi-class parallel SVM implementation, outdated (~2011)
GPU accelerated LIBSVM | CUDA language | Multi-class parallel SVM, relatively hard to program, no std. (CUDA)
psvm | C code; Message Passing Interface (MPI); HPC | Unstable beta, SVM implementation outdated (~2011)
Journal paper in preparation
[Diagram, for Clustering++, Classification++ and Regression++: implementations of an algorithm A are available; implementations of an algorithm extension of A are often closed or old source, even after asking the paper authors; parallelizations of the algorithm extension are rare and/or not stable.]
37 Big Brain Data Analytics Classifier Challenges (Classification)
Sampling, training and testing
Checking out-of-sample performance ([5] Evaluation)
  E.g. using two different images and comparing with the masks: issue
  Check out-of-sample performance & better data understanding: ~ok
Problem: 2D cut classification on 3D brain color data (i.e. background color)
Solution: e.g. use neighbouring methods
Color histograms: e.g. samples of class 0 & only 567 of class 1 (for G)
38 Big Brain Data Analytics Potential Next Approaches
Approach to the 2D/3D problem: apply Self-Dual Attribute Profiles (SDAP) [9]
  Increases the number of dimensions, using different threshold values
  Takes advantage of neighbouring pixels and cuts trees at certain thresholds
  Attributes: area, standard deviation, moment of inertia
  Example: standard deviation (blue channel)
  Good experience in land cover classification: e.g. ~70% to ~90% accuracy
Approach: increase the number of training samples (no more serial)
  Very small sampling may violate the generalization capability of the classifier
Approach: compare with other (parallel) classification methods
  E.g. Naive Bayes classifier, decision trees, random forests, etc.
[9] G. Cavallaro, M. Mura, J.A. Benediktsson, L. Bruzzone, A Comparison of Self-Dual Attribute Profiles based on different filter rules for classification, IEEE IGARSS 2014, Quebec, Canada
39 Big Brain Data Analytics: Many More Interesting Challenges!
3D reconstruction of high-resolution images
  Data: brain slices (microscopic measurements)
Mapping of cell densities and cortical areas (in 3D)
  Data: ~1 PB/brain
Analyse differences of brains evolving over time (longitudinal studies)
  Data: e.g. Kohorte project (in-vivo human studies)
  Data: e.g. Vervet monkey, 20 brains (in-vivo and post-mortem monkeys)
Close collaboration between JSC and INM bears lots of potential for tackling research challenges
40 Acknowledgements & References
41 Acknowledgements
Gabriele Cavallaro, University of Iceland
Tomas Philipp Runarsson, University of Iceland
Shaowen Wang, National Center for Supercomputing Applications
Junjun Yin, National Center for Supercomputing Applications
Markus Axer, Stefan Köhnen, Tim Hütz, Institute of Neuroscience & Medicine, Juelich
Selected members of the Research Group on High Productivity Data Processing: Ahmed Shiraz Memon, Mohammad Shahbaz Memon, Markus Goetz, Christian Bodenstein, Philipp Glock, Matthias Richerzhagen
42 References
[1] EUDAT European Data Infrastructure, B2SHARE Tool, Online:
[2] G. Cavallaro & M. Riedel et al., Smart Data Analytics Methods for Remote Sensing Applications, IEEE IGARSS 2014, Quebec, Canada
[3] P. Chapman et al., CRISP-DM Guide
[4] Research Data Alliance, online:
[5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009
[6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science Vol (343), 2014
[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3)
[8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, In proceedings of 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
[9] PANGAEA data collection, Talk available at
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
More informationHigh Productivity Data Processing Analytics Methods with Applications
High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research
More informationClassification Techniques in Remote Sensing Research using Smart Data Analytics
Classification Techniques in Remote Sensing Research using Smart Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Morris Riedel Juelich Supercomputing
More informationScientific Big Data Analytics by HPC - Parallel and Scalable Machine Learning on JURECA
Scientific Big Data Analytics by HPC - Parallel and Scalable Machine Learning on JURECA Federated Systems and Data Division Research Group High Productivity Data Processing Dr.- Ing. Morris Riedel et al.
More informationSelected Parallel and Scalable Methods for Scientific Big Data Analytics
Selected Parallel and Scalable Methods for Scientific Big Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group
More informationSelected Parallel and Scalable Methods for Scientific Big Data Analytics
Selected Parallel and Scalable Methods for Scientific Big Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group
More informationUnderstanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics
Understanding Big Data Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Interest Group Co Chairs are Needed in Big Data-driven Scientific Research The
More informationIntroduction to Big Data in HPC, Hadoop and HDFS Part One
Introduction to Big Data in HPC, Hadoop and HDFS Part One Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel Adjunct Associated Professor, University
More informationFrom Big Data Analytics To Smart Data Analytics With Parallelization Techniques
From Big Data Analytics To Smart Data Analytics With Parallelization Techniques Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel et al. Adjunct
More informationIEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1 On Understanding Big Data Impacts in Remotely Sensed Image Classification Using Support Vector Machine Methods Gabriele
More informationOn Establishing Big Data Breakwaters
On Establishing Big Data Breakwaters with Analytics Dr. - Ing. Morris Riedel Head of Research Group High Productivity Data Processing, Juelich Supercomputing Centre, Germany Adjunct Associated Professor,
More informationEuropean Data Infrastructure - EUDAT Data Services & Tools
European Data Infrastructure - EUDAT Data Services & Tools Dr. Ing. Morris Riedel Research Group Leader, Juelich Supercomputing Centre Adjunct Associated Professor, University of iceland BDEC2015, 2015-01-28
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationHealthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw
Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
More informationAnalysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationModelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationIntroduction to Big Data in HPC, Hadoop and HDFS Part Two
Introduction to Big Data in HPC, Hadoop and HDFS Part Two Research Field Key Technologies Jülich Supercomputing Centre Supercomputing & Big Data Dr. Ing. Morris Riedel Adjunct Associated Professor, University
More informationBig Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationHPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationMaschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationBig Data and Analytics: Challenges and Opportunities
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
More informationCOMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationIC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com>
IC05 Introduction on Networks &Visualization Nov. 2009 Overview 1. Networks Introduction Networks across disciplines Properties Models 2. Visualization InfoVis Data exploration
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationBIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou
BIG DATA VISUALIZATION Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou Let s begin with a story Let s explore Yahoo s data! Dora the Data Explorer has a new
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationPredicting Flight Delays
Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
More informationFast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
More informationJournée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationNetwork Intrusion Detection using Semi Supervised Support Vector Machine
Network Intrusion Detection using Semi Supervised Support Vector Machine Jyoti Haweliya Department of Computer Engineering Institute of Engineering & Technology, Devi Ahilya University Indore, India ABSTRACT
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationExample: Credit card default, we may be more interested in predicting the probability of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationIntroduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
More informationMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
More informationComparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationIntroduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
More informationMAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS IN COMMERCIAL BANKING CS 229 Project: Final Report Oleksandra Onosova INTRODUCTION Recent innovations in cloud computing and unified communications have made a
More informationGrid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
More informationBig Data Analytics. Tools and Techniques
Big Data Analytics Basic concepts of analyzing very large amounts of data Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research
More informationAnalytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
More informationHigh Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing: OpenStreetMap Remote
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationA Map Reduce based Support Vector Machine for Big Data Classification
pp. 77-98, http://dx.doi.org/10.14257/ijdta.2015.8.5.07 A Map Reduce based Support Vector Machine for Big Data Classification Anushree Priyadarshini and Sonali Agarwal Indian Institute of Information Technology,
More informationSoftware Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo
Software Engineering for Big Data CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo Big Data Big data technologies describe a new generation of technologies that aim
More informationMachine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
More informationTraffic Prediction and Analysis using a Big Data and Visualisation Approach
Traffic Prediction and Analysis using a Big Data and Visualisation Approach Declan McHugh 1 1 Department of Computer Science, Institute of Technology Blanchardstown March 10, 2015 Summary This abstract
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationSentiment analysis using emoticons
Sentiment analysis using emoticons Royden Lewis, Kayhan Moharreri, Steven Ware Department of Computer Science, Ohio State University Problem definition Our aim was
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationText Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
More informationPredict the Popularity of YouTube Videos Using Early View Data
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationClassifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by: Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by: Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
More informationData Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
More informationBootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More informationBig Data Mining Services and Knowledge Discovery Applications on Clouds
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationComparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
More informationDATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights
DATA EXPERTS We accelerate research and transform data to help you create actionable insights WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other
More informationIdentification algorithms for hybrid systems
Identification algorithms for hybrid systems Giancarlo Ferrari-Trecate Modeling paradigms Chemistry White box Thermodynamics System Mechanics... Drawbacks: Parameter values of components must be known
More informationSpark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationSupport Vector Machines
Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric
More informationMusic Mood Classification
Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may
More informationInformation Processing, Big Data, and the Cloud
Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive
More informationThe? Data: Introduction and Future
The? Data: Introduction and Future Husnu Sensoy Global Maksimum Data & Information Technologies Global Maksimum Data & Information Technologies The Data Company Massive Data Unstructured Data Insight Information
More informationChallenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015
Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015 Dr. Daisy Zhe Wang Director of Data Science Research Lab University of Florida, CISE
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationSupport Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationObject Recognition and Template Matching
Object Recognition and Template Matching Template Matching A template is a small image (sub-image) The goal is to find occurrences of this template in a larger image That is, you want to find matches of
More informationThe STC for Event Analysis: Scalability Issues
The STC for Event Analysis: Scalability Issues Georg Fuchs Gennady Andrienko http://geoanalytics.net Events Something [significant] happened somewhere, sometime Analysis goal and domain dependent, e.g.
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
More informationOptimizing content delivery through machine learning. James Schneider Anton DeFrancesco
Optimizing content delivery through machine learning James Schneider Anton DeFrancesco Obligatory company slide Our Research Areas Machine learning The problem Prioritize import information in low bandwidth
More informationContent-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationFrom Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data
From Raw Data to Actionable Insights with MATLAB Analytics 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.
More informationOracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features
Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics charlie.berger@oracle.com www.twitter.com/charliedatamine
More informationIntrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department
More information