Analyzing large and complex data sets in fusion: Modern tools from probability theory and pattern recognition
1 FACULTY OF ENGINEERING AND ARCHITECTURE. Analyzing large and complex data sets in fusion: Modern tools from probability theory and pattern recognition. Geert Verdoolaege, Department of Applied Physics, Ghent University, Ghent, Belgium; Laboratory for Plasma Physics, Royal Military Academy, Brussels, Belgium. PRL Seminar, Canberra, December 18, 2013.
2 Overview. 1 Data science. 2 Bayesian methods and integrated analysis: motivation; Bayesian methodology; integrated data analysis; example applications. 3 Pattern recognition in stochastic data sets: probabilistic manifolds; confinement regime visualization and identification; scaling laws; disruption prediction. 4 Conclusion and outlook.
4 Evolution of science (1). 1000 years ago: experimental science. Last few hundred years: theoretical science.
5 Evolution of science (2). Last few decades: computational science. Today: data science.
6 The data deluge. Big data: high-throughput scientific experiments (telescopes, particle accelerators, fusion devices); sensor networks, satellite surveys; simulations; biology, e.g. the human genome; sociology; digital archives; ...
7 The fourth paradigm (1). Jim Gray, Microsoft Research (Turing Award, 1998), on the fourth paradigm: "The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration."
8 The fourth paradigm (2) The Fourth Paradigm: Data-Intensive Scientific Discovery, T. Hey, S. Tansley, K. Tolle, eds., Microsoft Research, Redmond, WA,
9 Data science. Challenges: large data sets; heterogeneous sources and formats; complex nonlinear dependencies. Activities: data analysis (extracting useful information); data visualization; data archiving and retrieval.
10 What is a data scientist? Opinions vary. Hilary Mason, chief data scientist at bitly. Harvard Business Review, 2012: "The sexiest job of the 21st century".
11 Data analysis. Level: low, intermediate, high. Descriptive statistics for data visualization. Analysis of time series, images, videos and fields (e.g. magnetic): resampling, (motion) correction, ...; Fourier/wavelet analysis. Inverse problems: Abel inversion, tomography, ... Estimation and prediction. Event detection, object recognition, object tracking, shape analysis. Extracting patterns in data spaces: clusters, regression lines/surfaces. Probability theory and statistics for modeling, estimation, hypothesis testing, prediction, error analysis, ...
12 Fusion data characteristics.
Challenge: massive databases; real-time requirements (plasma control). Remedy/opportunity: clever and fast algorithms.
Challenge: substantial and heterogeneous uncertainties, stochasticity. Remedy/opportunity: probability theory; error propagation studies, e.g. Bayesian probability theory.
Challenge: redundancy. Remedy/opportunity: pattern recognition ((nonlinear) relations, cluster structure).
Challenge: high dimensionality. Remedy/opportunity: dimensionality reduction; data visualization.
15 Motivation and objectives. Methods: Bayesian probability theory (BPT); integrated data analysis (IDA). Motivation and objectives: enhance reliability and robustness of physics results; identify and reduce uncertainty sources; study non-Gaussian error propagation; model and quantify systematic uncertainty; integrate heterogeneous data sets (exploit redundancy and interdependencies); diagnostic design. ITER and fusion reactors: reduced space (limited data); reduced access (systematic uncertainty); in situ calibration.
17 Bayesian recipe. Forward model: link the (physical) model with the data $x$, e.g. $x = f(\theta)$. Assume a statistical (random) measurement error, often Gaussian, e.g.
\[ x = f(\theta) + \nu, \quad \nu \sim \mathcal{N}(0, \sigma_\nu^2) \;\Longrightarrow\; p(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_\nu} \exp\left\{ -\frac{[x - f(\theta)]^2}{2\sigma_\nu^2} \right\} \quad \text{(likelihood)}. \]
Propose prior information $p(\theta)$, possibly uninformative (e.g. uniform). Apply Bayes' theorem: solve the inverse problem (also profile reconstruction!).
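18 Bayes' theorem.
\[ p(\theta \mid x, I) = \frac{p(x \mid \theta, I)\, p(\theta \mid I)}{p(x \mid I)} \]
$x$ = data vector, $\theta$ = parameter vector, $I$ = implicit assumptions. Posterior: $p(\theta \mid x, I)$. Likelihood $p(x \mid \theta, I)$: misfit between model and data. Prior $p(\theta \mid I)$: expert or diffuse knowledge. Evidence $p(x \mid I)$: normalization.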
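The recipe and Bayes' theorem can be tried out on a one-parameter toy problem. The sketch below is only an illustration under stated assumptions: the straight-line forward model, the data values, the noise level and the grid are all made up, and the prior is taken uniform over the grid. It evaluates the Gaussian likelihood on a grid of parameter values and normalizes by the evidence.

```python
import math

def forward_model(theta, t):
    """Hypothetical forward model f(theta): a straight line through the origin."""
    return theta * t

def log_likelihood(theta, times, data, sigma_nu):
    """Gaussian log-likelihood ln p(x | theta) for independent measurement errors."""
    total = 0.0
    for t, x in zip(times, data):
        residual = x - forward_model(theta, t)
        total += (-0.5 * math.log(2.0 * math.pi * sigma_nu ** 2)
                  - residual ** 2 / (2.0 * sigma_nu ** 2))
    return total

def grid_posterior(times, data, sigma_nu, theta_grid):
    """Posterior density p(theta | x) on a regular grid, assuming a uniform prior."""
    log_post = [log_likelihood(th, times, data, sigma_nu) for th in theta_grid]
    shift = max(log_post)                      # avoid underflow in exp
    unnormalized = [math.exp(lp - shift) for lp in log_post]
    step = theta_grid[1] - theta_grid[0]
    evidence = sum(unnormalized) * step        # normalization constant (rescaled)
    return [u / evidence for u in unnormalized]

# Made-up data for illustration only.
times = [1.0, 2.0, 3.0, 4.0]
data = [2.1, 3.9, 6.2, 7.8]
theta_grid = [1.0 + 0.001 * k for k in range(2001)]   # theta from 1.0 to 3.0
posterior = grid_posterior(times, data, 0.2, theta_grid)
theta_map = theta_grid[posterior.index(max(posterior))]
print(f"MAP estimate of theta: {theta_map:.3f}")
```

With a uniform prior the posterior maximum coincides with the least-squares estimate; an informative prior would shift it.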
23 Principle of IDA. Integrate heterogeneous (conflicting?) data sets: likelihoods. Model statistical and systematic uncertainties: measured data; calibration; underlying physical model. Estimate model parameters using Bayes' theorem. Sample the posterior: Markov chain Monte Carlo.
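A minimal sketch of the IDA principle, assuming two hypothetical diagnostics that measure the same scalar quantity with different Gaussian errors. The joint likelihood is the product of the two likelihoods, and the posterior is sampled with a plain random-walk Metropolis algorithm. The measurement values, prior range and step size are invented for illustration; this is not the sampler of any particular IDA code.

```python
import math
import random

def log_gauss(x, mean, sigma):
    """Log of a Gaussian density N(x | mean, sigma)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2.0 * sigma ** 2)

def log_posterior(theta, measurements):
    """Sum of diagnostic log-likelihoods plus a uniform prior on [0, 10] (assumed)."""
    if not 0.0 <= theta <= 10.0:
        return -math.inf
    return sum(log_gauss(value, theta, sigma) for value, sigma in measurements)

def metropolis(log_target, start, step, n_samples, seed=0):
    """Random-walk Metropolis sampler with a Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    chain = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain

# Two hypothetical diagnostics: (measured value, standard deviation).
measurements = [(3.2, 0.4), (2.8, 0.2)]
chain = metropolis(lambda th: log_posterior(th, measurements), 5.0, 0.3, 20000)
kept = chain[2000:]                                   # discard burn-in
posterior_mean = sum(kept) / len(kept)
posterior_std = math.sqrt(sum((s - posterior_mean) ** 2 for s in kept) / len(kept))
print(f"posterior mean {posterior_mean:.2f}, std {posterior_std:.2f}")
```

For this Gaussian toy case the exact posterior mean is the precision-weighted average of the two measurements (2.88), and the posterior standard deviation is smaller than that of either diagnostic alone, which is the benefit of exploiting redundancy.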
24 Example applications. $n_e$ from Thomson scattering, soft X-ray and interferometry. $n_e$ profiles from Li-beam and interferometry: factor 400 increase in temporal resolution! Magnetic equilibrium reconstruction: current filaments + Grad-Shafranov. Impurity concentrations from CXS + BES. Antenna design. EM wave propagation via stochastic FDTD: uncertainty in boundary conditions, medium properties, etc.
26 Thomson scattering + soft X-ray + interferometer Fischer et al., Plasma Phys. Control. Fusion, 44, pp ,
27 Bremsstrahlung + CX impurity lines. [Figure: probability densities $p$ for $n_{e,1}$ and $n_{e,2}$ (cm$^{-3}$) and for $Z_{\mathrm{eff}}$.] Verdoolaege et al., IEEE Trans. Plasma Sci.,, pp.,
28 CXS + BES. Helium concentration profile at ITER: 10% error. Spectrometer at TEXTOR and later AUG. BES: emission from beam. Error reduction from 40% to 10%. He concentration profile with 68% credibility bounds (TEXTOR). Planned at AUG.
30 Pattern recognition opportunities. Clustering/classification: grouping of data points. Dimensionality reduction: data visualization, better learning efficiency. Regression: (nonlinear) deterministic relation between variables. Objectives: 1. Contribute to physics studies by extracting patterns, structure and relations from data. 2. Contribute to plasma control through real-time data interpretation. Note: both model-based and purely data-driven approaches are possible; one approach does not exclude the other.
32 A different view on measurement. Traditional measurement: value + error bar. Uncertainty is quantified by probability. Measurement = sample from an underlying (non-Gaussian?) probability distribution. Goal of measurement = probing the underlying distribution. Stochastic component of data: descriptive model = probability density function (PDF). Examples:
\[ y = \underbrace{\beta_0 + \beta_1 x}_{\text{deterministic component}} + \underbrace{\epsilon}_{\text{stochastic component}} \]
\[ y = \beta_0 + \beta_1 (x + \underbrace{\eta}_{\text{error in (independent) variable}}) + \epsilon \]
The PDF contains all information about the measurement!
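The effect of an error in the independent variable can be illustrated with a small simulation. The sketch below assumes made-up values for $\beta_0$, $\beta_1$ and the noise levels. It generates data according to the second example model and fits an ordinary least-squares line, once against the error-free $x$ and once against the noisy $x$.

```python
import random

def ols_fit(xs, ys):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def simulate(n, beta0, beta1, sigma_eps, sigma_eta, seed=1):
    """Draw error-free x, noisy x (x + eta) and y = beta0 + beta1 * x + eps."""
    rng = random.Random(seed)
    x_true = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [beta0 + beta1 * x + rng.gauss(0.0, sigma_eps) for x in x_true]
    x_noisy = [x + rng.gauss(0.0, sigma_eta) for x in x_true]
    return x_true, x_noisy, y

# Illustrative parameter values, not taken from the slides.
x_true, x_noisy, y = simulate(n=5000, beta0=1.0, beta1=2.0, sigma_eps=1.0, sigma_eta=2.0)
slope_clean, _ = ols_fit(x_true, y)
slope_noisy, _ = ols_fit(x_noisy, y)
print(f"slope with error-free x: {slope_clean:.2f}")
print(f"slope with noisy x:      {slope_noisy:.2f}")
```

Ignoring the error $\eta$ biases the fitted slope towards zero, which is one reason to treat each measurement as a full probability distribution rather than a bare value.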
33 The challenge. The probabilistic nature of data: the fundamental object resulting from a measurement is a probability distribution. Any further processing of the data (statistical inference, pattern recognition) should respect this inherent probabilistic nature. Pattern recognition in PDF spaces: pattern recognition is based on geometry, primarily distance. Geometry of probability distributions. Obtain the PDF from: repeated measurements; (Bayesian) probability theory; ...
34 Probability + geometry: a happy marriage. Probabilistic manifold: PDF = point on manifold; coordinates = PDF parameters. Distance between PDFs? Information geometry.
35 Information geometry. Riemannian differential geometry. Parametric probability model: $p(x \mid \theta)$, with $\theta$ an N-dimensional parameter vector. Fisher information = unique metric tensor:
\[ g_{\mu\nu}(\theta) = -E\left[ \frac{\partial^2}{\partial\theta^\mu\, \partial\theta^\nu} \ln p(x \mid \theta) \right], \quad \mu, \nu = 1, \ldots, N. \]
Line element: $ds^2 = g_{\mu\nu}\, d\theta^\mu\, d\theta^\nu$. Minimum-length curve: geodesic. Geodesic distance (GD): natural and theoretically well-motivated distance between PDFs.
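36 Univariate Gaussian distribution. PDF:
\[ p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]. \]
Line element:
\[ ds^2 = \frac{d\mu^2}{\sigma^2} + 2\,\frac{d\sigma^2}{\sigma^2}. \]
Hyperbolic geometry: Poincaré half-plane model.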
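The Gaussian line element follows from the Fisher information with $\theta = (\mu, \sigma)$: $g_{\mu\mu} = 1/\sigma^2$, $g_{\sigma\sigma} = 2/\sigma^2$ and $g_{\mu\sigma} = 0$. The sketch below is a numerical sanity check of that statement, not part of the original slides. It approximates the second derivatives of $\ln p$ by central finite differences and the expectation by a simple Riemann sum; the step size, grid and test point are arbitrary choices.

```python
import math

def log_pdf(x, mu, sigma):
    """Log of the univariate Gaussian PDF."""
    return -math.log(math.sqrt(2.0 * math.pi) * sigma) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def fisher_metric(mu, sigma, h=1e-3, n_grid=4001):
    """Numerical Fisher metric g[a][b] = -E[d^2 ln p / dtheta_a dtheta_b], theta = (mu, sigma)."""
    theta = [mu, sigma]
    lower = mu - 10.0 * sigma
    dx = 20.0 * sigma / (n_grid - 1)
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n_grid):
        x = lower + k * dx
        weight = math.exp(log_pdf(x, mu, sigma)) * dx   # p(x) dx
        for a in range(2):
            for b in range(2):
                def shifted(sign_a, sign_b):
                    t = list(theta)
                    t[a] += sign_a * h
                    t[b] += sign_b * h
                    return log_pdf(x, t[0], t[1])
                # Central difference for the (mixed) second derivative.
                second = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
                g[a][b] -= weight * second
    return g

g = fisher_metric(mu=4.0, sigma=0.7)
print(f"g_mu_mu       = {g[0][0]:.4f}  (1/sigma^2 = {1.0 / 0.7 ** 2:.4f})")
print(f"g_sigma_sigma = {g[1][1]:.4f}  (2/sigma^2 = {2.0 / 0.7 ** 2:.4f})")
print(f"g_mu_sigma    = {g[0][1]:.4f}")
```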
37 Poincaré half-plane. $p_1$: $\mu_1 = 4$, $\sigma_1 = 0.7$; $p_2$: $\mu_2 = 3$, $\sigma_2 =$
40 ITPA Confinement Database. ITPA Global H-Mode Confinement Database (DB3), ITER H-Mode Database Working Group. D.C. McDonald et al., Nucl. Fusion 47, pp , entries from 19 tokamaks. Approximate error estimates: limited information on PDF! Assume standard deviations: Gaussian PDFs (maximum entropy). Different machines, different error estimates: difficult to handle using the classic approach!
41 Confinement regime classification. Distinguish between L- and H-mode: 3845 L and 6207 H. Identify edge localized mode (ELM) behavior. 8 global engineering variables: $I_p$, $B_t$, $n_e$, $P_{\mathrm{loss}}$, $R$, $a$, $M_{\mathrm{eff}}$, $\kappa$. Variables statistically independent: product of Gaussians. Note: this does not exclude the variables being related through a deterministic relation!
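42 Gaussian product manifold. Plasma and machine variables: $x^i$, $i = 1, \ldots, 8$. Distribution parameters: $\mu^i$, $\sigma^i$. Gaussian product distribution:
\[ p(x \mid \mu^1, \ldots, \mu^8, \sigma^1, \ldots, \sigma^8) = \prod_{i=1}^{8} \mathcal{N}(x^i \mid \mu^i, \sigma^i). \]
Measurements A and B: $\mu_{A,B} = (\mu^1_{A,B}, \ldots, \mu^8_{A,B})$, $\sigma_{A,B} = (\sigma^1_{A,B}, \ldots, \sigma^8_{A,B})$. GD in closed form:
\[ \mathrm{GD}(\mu_A, \sigma_A \,\|\, \mu_B, \sigma_B) = \sqrt{2} \left[ \sum_{i=1}^{8} \ln^2 \frac{1 + \delta^i_{AB}}{1 - \delta^i_{AB}} \right]^{1/2}, \quad \delta^i_{AB} = \left[ \frac{(\mu^i_A - \mu^i_B)^2 + 2(\sigma^i_A - \sigma^i_B)^2}{(\mu^i_A - \mu^i_B)^2 + 2(\sigma^i_A + \sigma^i_B)^2} \right]^{1/2}. \]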
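The closed-form geodesic distance for independent Gaussians can be written in a few lines. The sketch below uses the form $\sqrt{2}\,\ln[(1+\delta)/(1-\delta)]$ per variable, which is the distance implied by the univariate Gaussian line element $ds^2 = d\mu^2/\sigma^2 + 2\,d\sigma^2/\sigma^2$. The function names and the example numbers are illustrative assumptions.

```python
import math

def gd_univariate(mu_a, sigma_a, mu_b, sigma_b):
    """Geodesic (Rao) distance between two univariate Gaussians."""
    numerator = (mu_a - mu_b) ** 2 + 2.0 * (sigma_a - sigma_b) ** 2
    denominator = (mu_a - mu_b) ** 2 + 2.0 * (sigma_a + sigma_b) ** 2
    delta = math.sqrt(numerator / denominator)
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

def gd_product(mus_a, sigmas_a, mus_b, sigmas_b):
    """GD on the product manifold: root of the summed squared per-variable distances."""
    squared = [gd_univariate(ma, sa, mb, sb) ** 2
               for ma, sa, mb, sb in zip(mus_a, sigmas_a, mus_b, sigmas_b)]
    return math.sqrt(sum(squared))

# Same difference in the means, but different error bars (made-up numbers).
precise = gd_univariate(0.0, 0.1, 1.0, 0.1)
imprecise = gd_univariate(0.0, 1.0, 1.0, 1.0)
print(f"GD with small error bars: {precise:.2f}")
print(f"GD with large error bars: {imprecise:.2f}")
```

Two measurements with the same central values lie closer together on the manifold when their error bars are large, which a Euclidean distance on the central values alone cannot express.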
43 Dimensionality reduction for visualization. Step 1. Calculate all pairs of GDs: proximity matrix $[D_{ij}]$. Step 2. Plot points arbitrarily in 2D Euclidean space. Step 3. Calculate the Euclidean proximity matrix $[E_{ij}]$. Step 4. Minimize $\sum_{i,j} (D_{ij} - E_{ij})^2$: multidimensional scaling. Step 5. Plot the final configuration.
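Steps 2 to 4 can be sketched as a small gradient-descent loop on the stress $\sum_{i<j} (D_{ij} - E_{ij})^2$. This is a minimal illustration, not the optimizer used for the results in this talk. The step-halving rule, iteration count and random initialization are assumptions, and the target matrix in the example is built from four points on a unit square instead of real geodesic distances.

```python
import math
import random

def euclidean_matrix(points):
    """Step 3: pairwise Euclidean distances E_ij of the 2D configuration."""
    return [[math.dist(p, q) for q in points] for p in points]

def stress(target, points):
    """Sum over pairs i < j of (D_ij - E_ij)^2."""
    e = euclidean_matrix(points)
    n = len(points)
    return sum((target[i][j] - e[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))

def mds_2d(target, n_iter=500, step=0.05, seed=0):
    """Step 4: minimize the stress by gradient descent with step halving."""
    rng = random.Random(seed)
    n = len(target)
    # Step 2: arbitrary initial configuration in the plane.
    points = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    current = stress(target, points)
    for _ in range(n_iter):
        e = euclidean_matrix(points)
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and e[i][j] > 1e-12:
                    coef = -2.0 * (target[i][j] - e[i][j]) / e[i][j]
                    grads[i][0] += coef * (points[i][0] - points[j][0])
                    grads[i][1] += coef * (points[i][1] - points[j][1])
        trial = [[p[0] - step * g[0], p[1] - step * g[1]] for p, g in zip(points, grads)]
        trial_stress = stress(target, trial)
        if trial_stress < current:      # accept only improvements
            points, current = trial, trial_stress
        else:                           # otherwise shrink the step
            step *= 0.5
    return points, current

# Stand-in for the GD proximity matrix of step 1: distances on a unit square.
square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
target = euclidean_matrix(square)
embedding, final_stress = mds_2d(target)
print(f"final stress: {final_stress:.4f}")
```

In the pipeline described above, `target` would hold the geodesic distances between all pairs of measurements, and `embedding` would be plotted in step 5.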
44 Confinement visualization. [Figure panels: tokamaks; confinement regime; Euclidean, no errors; GD with errors.] Verdoolaege et al., Fusion Sci. Technol., 62, pp ,
45 ELM behavior Verdoolaege et al., Rev. Sci. Instrum., 83, art. no. 10D715,
46 k-nearest neighbor classification. Confinement mode identification. Training: 5%, testing: 95%. k = 1: nearest neighbor. Correct classification rates (%) for modes L and H, with columns: Euclidean w/o errors; GD with errors; GD with random errors.
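A nearest-neighbor classifier (k = 1) on the Gaussian product manifold only needs a distance function. The sketch below is a toy illustration with two variables instead of eight and invented training points. It repeats the closed-form univariate GD implied by the Gaussian line element so that the block is self-contained.

```python
import math

def gd_univariate(mu_a, sigma_a, mu_b, sigma_b):
    """Geodesic (Rao) distance between two univariate Gaussians."""
    numerator = (mu_a - mu_b) ** 2 + 2.0 * (sigma_a - sigma_b) ** 2
    denominator = (mu_a - mu_b) ** 2 + 2.0 * (sigma_a + sigma_b) ** 2
    delta = math.sqrt(numerator / denominator)
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

def gd_product(dist_a, dist_b):
    """GD between two measurements, each a list of (mu, sigma) pairs."""
    return math.sqrt(sum(gd_univariate(ma, sa, mb, sb) ** 2
                         for (ma, sa), (mb, sb) in zip(dist_a, dist_b)))

def classify_1nn(query, training_set):
    """Return the label of the training measurement closest to the query."""
    nearest = min(training_set, key=lambda item: gd_product(query, item[0]))
    return nearest[1]

# Invented training measurements: ([(mu, sigma), ...], confinement mode).
training_set = [
    ([(1.0, 0.1), (2.0, 0.3)], "L"),
    ([(1.2, 0.1), (2.1, 0.3)], "L"),
    ([(3.0, 0.2), (4.0, 0.5)], "H"),
    ([(3.1, 0.2), (4.2, 0.5)], "H"),
]
print(classify_1nn([(2.9, 0.2), (3.9, 0.4)], training_set))
```

Swapping `gd_product` for a Euclidean distance on the means alone gives the error-free baseline to compare against.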
48 Geodesic regression. [Figure: distributions $p_y$ and $p_x$.] Minimize the sum of squared GDs. Use the geodesic centroid.
49 Synthetic data. [Figure: y versus x; original data, simple regression, errors in variables, geodesic regression.]
50 Regression results. ITPA global confinement. [Figure: $\tau_E$ (s) versus $\tau_{\mathrm{reg}}$ (s); simple regression and geodesic regression.] $R^2_{\mathrm{OLS}} = 0.44$, $R^2_{\mathrm{EIV}} = 0.52$, $R^2_{\mathrm{GR}} = 0.71$. GR yields full probability distributions. Precise error estimates are not required, but may improve estimates.
52 Wavelet distributions. 1D/2D discrete wavelet transform (DWT). Retain time information. Spectral energy distribution identifies signal/image. Histograms: zero-mean + heavy-tailed. Generalized Gaussian distribution (GGD):
\[ p(x \mid \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\left[ -\left( \frac{|x|}{\alpha} \right)^{\beta} \right], \]
with features $\alpha$ and $\beta$.
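53 Geodesic distance for GGDs. Gaussian ($\beta = 2$): $\mathrm{GD}[\mathcal{N}(0, \sigma_1) \,\|\, \mathcal{N}(0, \sigma_2)] = \sqrt{2}\, \left| \ln(\alpha_2 / \alpha_1) \right|$. Laplace ($\beta = 1$): $\mathrm{GD}[\mathcal{L}(0, \alpha_1) \,\|\, \mathcal{L}(0, \alpha_2)] = \left| \ln(\alpha_2 / \alpha_1) \right|$.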
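A short sketch of the GGD and the two special-case distances, assuming the parametrization $p(x \mid \alpha, \beta) \propto \exp[-(|x|/\alpha)^\beta]$ with a scale $\alpha$ and shape $\beta$. The scale estimator shown is the standard maximum-likelihood estimate for a zero-mean Laplace distribution (the mean absolute value). The coefficient values are invented.

```python
import math

def ggd_pdf(x, alpha, beta):
    """Generalized Gaussian density with scale alpha and shape beta."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))

def gd_zero_mean_gaussian(alpha_1, alpha_2):
    """GD between zero-mean Gaussians (beta = 2)."""
    return math.sqrt(2.0) * abs(math.log(alpha_2 / alpha_1))

def gd_laplace(alpha_1, alpha_2):
    """GD between zero-mean Laplace distributions (beta = 1)."""
    return abs(math.log(alpha_2 / alpha_1))

def laplace_scale_mle(coefficients):
    """Maximum-likelihood Laplace scale for zero-mean data: mean absolute value."""
    return sum(abs(c) for c in coefficients) / len(coefficients)

# Invented detail coefficients from two signals.
coeffs_a = [0.2, -0.1, 0.4, -0.3, 0.05, -0.15]
coeffs_b = [1.1, -0.8, 2.0, -1.5, 0.3, -0.9]
alpha_a = laplace_scale_mle(coeffs_a)
alpha_b = laplace_scale_mle(coeffs_b)
print(f"alpha_a = {alpha_a:.3f}, alpha_b = {alpha_b:.3f}")
print(f"Laplace GD = {gd_laplace(alpha_a, alpha_b):.3f}")
```

Because both distances depend only on the ratio of the scales, they are unchanged when both signals are multiplied by the same factor.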
54 Experiment 1: setup (1). JET campaigns C15-C20. Disruptive shots: 334. Known time of disruption (ToD). Time series for 13 plasma variables. Sliding window: 30 ms. Regular features: until 1 s before ToD. Disruptive features: from 210 ms before ToD. Similar to APODIS: Rattá et al., Nucl. Fusion 50, ,
55 Experiment 1: setup (2). Fourier vs. wavelet. Fourier: power spectrum; standard deviation; Euclidean and GD. Wavelet: detail coefficients; Daubechies 4, 3 scales; Laplace $\alpha$; GD. k-nearest neighbor classifier. Training: 65%, testing: 35%. Repeated 20 times.
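The wavelet feature extraction can be sketched per time window as follows. Note the substitution: to keep the example dependency-free it uses the Haar wavelet instead of the Daubechies 4 wavelet named above, so the numbers will differ from those of the actual setup. The window length in samples and the hop size are hypothetical parameters, since the sampling rate is not given here.

```python
import math

def haar_details(signal, n_scales):
    """Detail coefficients of a Haar DWT (stand-in for Daubechies 4), one list per scale."""
    if len(signal) < 2 ** n_scales:
        raise ValueError("window too short for the requested number of scales")
    approx = list(signal)
    details = []
    for _ in range(n_scales):
        if len(approx) % 2:                # drop a trailing odd sample
            approx = approx[:-1]
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return details

def laplace_features(window, n_scales=3):
    """One Laplace scale alpha per wavelet scale: mean absolute detail coefficient."""
    return [sum(abs(d) for d in level) / len(level)
            for level in haar_details(window, n_scales)]

def sliding_window_features(signal, window_length, hop, n_scales=3):
    """Feature vectors for successive windows of a time series."""
    return [laplace_features(signal[start:start + window_length], n_scales)
            for start in range(0, len(signal) - window_length + 1, hop)]

# Synthetic signal: a slow oscillation followed by a fast, larger one.
signal = ([math.sin(0.1 * k) for k in range(64)]
          + [3.0 * math.sin(2.5 * k) for k in range(64)])
features = sliding_window_features(signal, window_length=32, hop=32)
for index, feature in enumerate(features):
    print(index, [round(alpha, 3) for alpha in feature])
```

Each window is thus represented by one Laplace distribution per scale, and windows can be compared with the Laplace GD $\left| \ln(\alpha_2 / \alpha_1) \right|$ scale by scale.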
56 Experiment 1: results. TPR: true positive rate. FPR: false positive rate. MA: missed alarms. FA: false alarms. SR: success rate (= 1 - (MA + FA)). AVG: average detection time before ToD.
Performance measure | Fourier, Euclidean | Fourier, GD | Wavelet, GD
TPR (%) | 76.5 ± | ± | ± 1.3
FPR (%) | 17.4 ± | ± | ± 0.7
MA (%) | 0.6 ± | ± | ± 0.7
FA (%) | 48.0 ± | ± | ± 2.9
SR (%) | 51.4 ± | ± | ± 2.8
AVG (ms) | 61.0 ± | ± | ± 3.2
Verdoolaege et al., Fusion Sci. Technol., 62, pp , 2012. Verdoolaege et al., SOFT
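The success-rate definition can be checked against the Fourier/Euclidean values that survive in the table (MA = 0.6%, FA = 48.0%, SR = 51.4%). The sketch below works in percent, so SR = 100 - (MA + FA). The helper for TPR and FPR uses the usual confusion-matrix definitions, and its example counts are invented.

```python
def success_rate(missed_alarms_pct, false_alarms_pct):
    """Success rate in percent: SR = 100 - (MA + FA)."""
    return 100.0 - (missed_alarms_pct + false_alarms_pct)

def rates(true_pos, false_pos, true_neg, false_neg):
    """True and false positive rates (in percent) from confusion-matrix counts."""
    tpr = 100.0 * true_pos / (true_pos + false_neg)
    fpr = 100.0 * false_pos / (false_pos + true_neg)
    return tpr, fpr

# Fourier/Euclidean column of the table: MA = 0.6 %, FA = 48.0 %.
print(round(success_rate(0.6, 48.0), 1))   # 51.4, matching the SR entry

# Invented counts, for illustration only.
tpr, fpr = rates(true_pos=80, false_pos=15, true_neg=85, false_neg=20)
print(f"TPR = {tpr:.1f} %, FPR = {fpr:.1f} %")
```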
57 Experiment 2: generalization. Generalize to campaigns C21-C27.
58 Disruptive trajectory. Landmark MDS: real-time projections. Visual tool in control room. JET # disrupted at s due to NTM:
60 Conclusion. Emerging field of data science. Huge potential for probability theory and pattern recognition in fusion: physics studies; plasma control; quantification and reduction of uncertainties. Probability distributions are maximally informative. Full probability structure determines patterns. Patterns reflect physics.
Lecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Cell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel [email protected] 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUMBER OF REFERENCE SYMBOLS
PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUM OF REFERENCE SYMBOLS Benjamin R. Wiederholt The MITRE Corporation Bedford, MA and Mario A. Blanco The MITRE
3F3: Signal and Pattern Processing
3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani [email protected] Department of Engineering University of Cambridge Lent Term Classification We will represent data by
JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions
Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
Big Data, Statistics, and the Internet
Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes
Machine Learning in Statistical Arbitrage
Machine Learning in Statistical Arbitrage Xing Fu, Avinash Patra December 11, 2009 Abstract We apply machine learning methods to obtain an index arbitrage strategy. In particular, we employ linear regression
VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS
VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS Aswin C Sankaranayanan, Qinfen Zheng, Rama Chellappa University of Maryland College Park, MD - 277 {aswch, qinfen, rama}@cfar.umd.edu Volkan Cevher, James
A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization
A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization Ángela Blanco Universidad Pontificia de Salamanca [email protected] Spain Manuel Martín-Merino Universidad
A Uniform Asymptotic Estimate for Discounted Aggregate Claims with Subexponential Tails
12th International Congress on Insurance: Mathematics and Economics July 16-18, 2008 A Uniform Asymptotic Estimate for Discounted Aggregate Claims with Subexponential Tails XUEMIAO HAO (Based on a joint
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)
INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
Tracking Groups of Pedestrians in Video Sequences
Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal
Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification
Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification Presented by Work done with Roland Bürgi and Roger Iles New Views on Extreme Events: Coupled Networks, Dragon
Evaluation of Machine Learning Techniques for Green Energy Prediction
arxiv:1406.3726v1 [cs.lg] 14 Jun 2014 Evaluation of Machine Learning Techniques for Green Energy Prediction 1 Objective Ankur Sahai University of Mainz, Germany We evaluate Machine Learning techniques
Non-Inductive Startup and Flux Compression in the Pegasus Toroidal Experiment
Non-Inductive Startup and Flux Compression in the Pegasus Toroidal Experiment John B. O Bryan University of Wisconsin Madison NIMROD Team Meeting July 31, 2009 Outline 1 Introduction and Motivation 2 Modeling
Neural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm [email protected] Rome, 29
Private Equity Fund Valuation and Systematic Risk
An Equilibrium Approach and Empirical Evidence Axel Buchner 1, Christoph Kaserer 2, Niklas Wagner 3 Santa Clara University, March 3th 29 1 Munich University of Technology 2 Munich University of Technology
Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni
1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed
99.37, 99.38, 99.38, 99.39, 99.39, 99.39, 99.39, 99.40, 99.41, 99.42 cm
Error Analysis and the Gaussian Distribution In experimental science theory lives or dies based on the results of experimental evidence and thus the analysis of this evidence is a critical part of the
NEURAL NETWORKS A Comprehensive Foundation
NEURAL NETWORKS A Comprehensive Foundation Second Edition Simon Haykin McMaster University Hamilton, Ontario, Canada Prentice Hall Prentice Hall Upper Saddle River; New Jersey 07458 Preface xii Acknowledgments
Functional Data Analysis of MALDI TOF Protein Spectra
Functional Data Analysis of MALDI TOF Protein Spectra Dean Billheimer [email protected]. Department of Biostatistics Vanderbilt University Vanderbilt Ingram Cancer Center FDA for MALDI TOF
Probabilistic Latent Semantic Analysis (plsa)
Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg [email protected] www.multimedia-computing.{de,org} References
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
Supplement to Call Centers with Delay Information: Models and Insights
Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290
Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
