Kernel methods with Imbalanced Data and Applications to Weather Prediction


Kernel Methods with Imbalanced Data and Applications to Weather Prediction
Theodore B. Trafalis
Laboratory of Optimization and Intelligent Systems, School of Industrial and Systems Engineering, University of Oklahoma
ttrafalis@ou.edu
INNS Conference on Big Data 2015, San Francisco Airport Marriott Waterfront Congress Center, San Francisco, CA, 8-10 August 2015

Part I: Kernel Methods for Imbalanced Data and Application to Tornado Prediction

Outline
1 Imbalanced Data
  What is an Imbalanced Data Problem?
  Impact of Imbalanced Data on Learning Machines
  State of the Art Techniques for Imbalanced Learning
2 Kernel Methods
  General Description of Kernel Methods
  Support Vector Machines Applied to Imbalanced Data
3 Application to Tornado Prediction
  Description of the Experiments
  Results for the Tornado Data Set

Imbalanced Data Problems and their Importance

The problem of learning from an imbalanced data set arises when the number of samples in one class is significantly greater than in the other. Imbalanced data sets are common in data mining and classification. Examples include:
- Fraudulent credit card transactions.
- Telecommunication equipment failures.
- Oil spills in satellite images.
- Tornado, earthquake and landslide occurrences.
- Cancer and health science data.

Example of the Classification of Imbalanced Data

Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.

Between-Class and Within-Class Imbalances

Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.


Impact of Imbalanced Data Problems in Classification

Classifiers tend to provide an imbalanced degree of accuracy: the majority class reaches close to 100 percent accuracy while the minority class stays close to 0-10 percent. In the tornado data set, for example, a 10 percent accuracy on the minority class suggests that 72 tornadoes would be classified as non-tornadoes.

Illustration of the Impact of Imbalanced Classification

On the left, the accuracy of the minority class is zero percent. On the right, it is 80 percent.


Current Approaches for Imbalanced Learning I

Algorithm level:
- Threshold method.
- Learning only the minority class.
- Cost-sensitive approaches.

Data level:
- Random under-sampling and over-sampling.
- Informed under-sampling (EasyEnsemble, BalanceCascade).
- Synthetic sampling with data generation (SMOTE).
- Adaptive synthetic sampling (Borderline-SMOTE, ADASYN).
- Sampling with data cleaning (OSS, CNN + Tomek links, Neighborhood Cleaning Rule, SMOTE + ENN, SMOTE + Tomek).
- Cluster-based sampling (CBO).
- Integration of sampling and boosting (SMOTEBoost).

Kernel-based approaches:
- Variations of Support Vector Machines (SVM).
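The two basic data-level techniques can be sketched in a few lines. Below is a minimal illustration (not from the talk) of random under-sampling and of SMOTE-style synthetic over-sampling; the interpolation here picks a random partner rather than a k-nearest neighbour, so it is a simplified stand-in for real SMOTE, and all data are synthetic.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop majority-class samples until both classes have equal counts."""
    minority = 1 if (y == 1).sum() < (y == -1).sum() else -1
    X_min, X_maj = X[y == minority], X[y != minority]
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X_bal = np.vstack([X_min, X_maj[idx]])
    y_bal = np.concatenate([np.full(len(X_min), minority),
                            np.full(len(X_min), -minority)])
    return X_bal, y_bal

def smote_like_oversample(X_min, n_new, rng):
    """Generate synthetic minority samples by interpolating between two
    random minority points (simplified; real SMOTE interpolates toward
    one of the k nearest neighbours)."""
    new = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.concatenate([np.full(95, -1), np.full(5, 1)])

X_bal, y_bal = random_undersample(X, y, rng)          # 5 vs 5 samples
X_syn = smote_like_oversample(X[y == 1], n_new=90, rng=rng)
```

Under-sampling discards majority information, while synthetic over-sampling keeps it at the price of inventing minority points; the adaptive variants listed above (Borderline-SMOTE, ADASYN) differ mainly in where they place the synthetic samples.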

Current Approaches for Imbalanced Learning II

Kernel-based approaches (continued):
- Kernel logistic regression.

Evaluation metrics (used to evaluate accuracies):
- Receiver Operating Characteristic (ROC), Precision-Recall (PR) and cost curves.
- Singular assessment metrics based on the confusion or multi-class cost matrix (F-measure, G-mean, etc.).

Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
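The singular assessment metrics mentioned above are simple functions of the binary confusion matrix. A minimal sketch, with illustrative counts chosen by me (positives = minority class):

```python
import numpy as np

def singular_metrics(tp, fn, fp, tn):
    """F-measure and G-mean from a binary confusion matrix."""
    recall      = tp / (tp + fn)        # sensitivity on the minority class
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)        # accuracy on the majority class
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * specificity)
    return f_measure, g_mean

# A classifier can score >90% plain accuracy on these counts while the
# minority-class recall, and hence both metrics below, remain sensitive
# to the 6 missed positives.
f, g = singular_metrics(tp=30, fn=6, fp=10, tn=500)
```

Unlike plain accuracy, both metrics collapse toward zero when the minority class is ignored, which is why they are preferred for imbalanced evaluation.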

Problems with Imbalanced Data Prediction

- Improper evaluation metrics.
- Lack of data: absolute rarity (the number of observations is small in an absolute sense).
- Relative lack of data: relative rarity (rare relative to other events).
- Data fragmentation: absolute lack of data within a single partition.
- Inappropriate inductive bias, such as an assumption of linearity.


Historical Perspective

- Efficient algorithms for detecting linear relations were used in the 1950s and 1960s (perceptron algorithm).
- Handling nonlinear relations was seen as a major research goal at that time, but developing nonlinear algorithms with the same efficiency and stability proved elusive.
- In the mid-1980s, the field of pattern analysis underwent a nonlinear revolution with backpropagation neural networks (NNs) and decision trees (based on heuristics and lacking a firm theoretical foundation; local minima; nonconvexity).
- In the mid-1990s, kernel-based methods were developed that handle nonlinear relations while retaining the guarantees and understanding developed for linear algorithms.

Overview I

- Kernel methods are a class of machine learning algorithms that can operate on very general types of data and detect very general types of relations (e.g., the potential function method; Aizerman et al., 1964; Vapnik, 1982, 1995).
- Correlation, factor, cluster and discriminant analysis are some of the machine learning tasks that can be performed with kernels on data as diverse as sequences, text, images, graphs and vectors.
- Kernel methods also provide a natural way to merge and integrate different types of data.

Overview II

Kernel methods offer a modular framework:
- In a first step, the dataset is processed into a kernel matrix. Data can be of various, possibly heterogeneous, types.
- In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.

Modular Framework

Source: J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis, 2004.

Basic Idea of Kernel Methods

Kernel methods work by:
- Embedding the data in a vector space, called the feature space, using a kernel function.
- Looking for linear relations in that space.

Much of the geometry of the data in the embedding space (relative positions) is contained in the pairwise inner products (information bottleneck). We can work in the feature space by specifying an inner product function k between points in it. In many cases, the inner product in the embedding (feature) space is very cheap to compute.
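The claim that the feature-space inner product is cheap can be checked directly on a small example of my own: for the homogeneous quadratic kernel on R^2, one kernel evaluation equals the inner product of an explicit three-dimensional feature map, without ever forming that map.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel on R^2:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """Kernel evaluation: one inner product in R^2, then a square."""
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(y)   # inner product computed in the 3-D feature space
rhs = k(x, y)           # same value, computed in the 2-D input space
```

For a degree-d polynomial kernel on R^n the explicit feature space has dimension of order n^d, while the kernel evaluation stays O(n); this gap is exactly the computational advantage being described.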

Properties of Kernels I

Definition (Mercer kernel). Let E be any set. A function k : E × E → R that is continuous, symmetric and finitely positive semi-definite is called here a Mercer kernel.

Definition (Finite positive semi-definiteness). A function k : E × E → R, where E is any set, is a finitely positive semi-definite kernel if

    ∑_{i=1}^m ∑_{j=1}^m k(x_i, x_j) λ_i λ_j ≥ 0

for any m ∈ N, λ_i ∈ R and x_i ∈ E, i ∈ [1, m]. This can be seen as the generalization of a positive semi-definite matrix.

Properties of Kernels II

Definition (RKHS). A Reproducing Kernel Hilbert Space (RKHS) F is a Hilbert space of complex-valued functions on a set E for which there exists a function k : E × E → C (the reproducing kernel) such that k(·, x) ∈ F for any x ∈ E and ⟨f, k(·, x)⟩ = f(x) for all f ∈ F (reproducing property).

If k is a symmetric positive definite kernel then, by the Moore-Aronszajn theorem, there is a unique RKHS with k as its reproducing kernel. A symmetric positive definite kernel k can be expressed as a dot product k : (x, y) ↦ ⟨φ(x), φ(y)⟩, where φ is a map from R^n to an RKHS H (kernel trick).

Properties of Kernels III

- For any x_1, ..., x_l, the l × l matrix K with entries K_ij = k(x_i, x_j) is symmetric and positive semi-definite; K is called the kernel matrix.
- A kernel k can be expressed as k : (x, y) ↦ ⟨φ(x), φ(y)⟩, where φ is a map from R^n to a Hilbert space H (kernel trick). The space H is called the feature space.
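These matrix properties are easy to verify numerically. A small sketch (data and bandwidth are mine) that builds a Gaussian RBF kernel matrix and checks symmetry and positive semi-definiteness through its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                 # 20 arbitrary points in R^3

# Gaussian RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

symmetric = np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)              # real eigenvalues, ascending
psd = eigvals.min() > -1e-10                 # numerically >= 0
```

Any data set and any Mercer kernel would pass the same check; a negative eigenvalue would signal that the chosen similarity function is not a valid kernel.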

Properties of Kernels IV

- The image of R^d under φ is a manifold S in H.
- Kernels can be interpreted as measures of distance and measures of angles on S.
- Simple geometric relations between S and hyperplanes of H can correspond to complex shapes in R^d.


Support Vector Machines

Definition (SVM). Support Vector Machines are a family of learning algorithms that use kernel methods to solve supervised learning problems. Common supervised learning tasks are classification and regression. SVMs work by solving quadratic programming problems that aim to minimize the generalization error.

Given a set S of l points x_i ∈ R^n, where each x_i belongs to one of two classes defined by y_i ∈ {−1, +1}, the objective is to find a hyperplane that divides S, leaving all points of the same class on the same side while maximizing the minimum distance between either of the two classes and the hyperplane [Vapnik 1995].

Optimal Separating Hyperplane

Source: Microsoft Research, Vision, Graphics, and Visualization Group, http://research.microsoft.com/en-us/groups/vgv/.

Dual Problem in the Nonlinear Case

The optimal hyperplane is obtained by solving the following quadratic programming (QP) problem:

    min_{α ∈ R^l} { αᵀKα/2 − ⟨1, α⟩ : ⟨α, y⟩ = 0 and 0 ≤ α ≤ C }.

This QP problem is the dual formulation of a QP problem that maximizes the margin of separation between the sets of points in the feature space. Given a solution α, the optimal hyperplane is expressed as

    {x ∈ R^n : ∑_{i=1}^l α_i y_i k(x_i, x) + b = 0},

where b is computed using the complementary slackness conditions of the primal formulation.
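The dual QP above can be handed to any off-the-shelf constrained solver. A sketch (not from the talk) using SciPy's SLSQP on a tiny linearly separable toy set with a linear kernel; here the matrix in the quadratic form absorbs the labels, K_ij = y_i y_j k(x_i, x_j), and all data and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2., 2.], [3., 3.], [2., 3.], [0., 0.], [1., 0.], [0., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])
l, C = len(y), 10.0

k = lambda a, b: a @ b                        # linear kernel
G = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix
K = (y[:, None] * y[None, :]) * G             # label-weighted kernel matrix

obj = lambda a: 0.5 * a @ K @ a - a.sum()     # alpha^T K alpha / 2 - <1, alpha>
jac = lambda a: K @ a - np.ones(l)
res = minimize(obj, np.zeros(l), jac=jac, method="SLSQP",
               bounds=[(0, C)] * l,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# Recover b from a free support vector (0 < alpha_i < C), where the
# complementary slackness conditions give y_i * f(x_i) = 1
sv = (alpha > 1e-6) & (alpha < C - 1e-6)
b = np.mean(y[sv] - G[sv] @ (alpha * y))
decision = G @ (alpha * y) + b                # sign gives the predicted class
```

SLSQP is fine for toy problems; dedicated decompositions such as SMO (mentioned later in these slides) are what make the dual tractable at scale.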

Binary Classification of Imbalanced Data with SVMs

Binary classification of imbalanced data requires a rewriting of the primal SVM problem, namely:

    min_{w,b,ξ}  (1/2)⟨w, w⟩_H + C_{+1} ∑_{y_i=+1} ξ_i + C_{−1} ∑_{y_i=−1} ξ_i
    subject to   y_i (⟨w, φ(x_i)⟩_H + b) ≥ 1 − ξ_i,  i ∈ [1, l],
                 ξ_i ≥ 0,  i ∈ [1, l].

C_{+1} is the trade-off coefficient for the minority class and C_{−1} is the trade-off coefficient for the majority class. For imbalanced data we wish to have C_{−1} < C_{+1}, i.e., the penalty for outliers in the minority class is greater than the one for the majority class. This approach is strongly related to robust classification.
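This per-class penalty is exposed directly by scikit-learn's SVC, whose class_weight argument multiplies C separately for each class; using it here is my choice, not something prescribed by the slides, and the imbalanced toy data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 190 majority (-1) vs 10 minority (+1) samples, overlapping clusters
X = np.vstack([rng.normal(0.0, 1.0, (190, 2)),
               rng.normal(2.0, 1.0, (10, 2))])
y = np.concatenate([np.full(190, -1), np.full(10, 1)])

# class_weight multiplies C per class: here C_{+1} = 10*C > C_{-1} = C,
# penalising errors on the minority class more heavily
plain    = SVC(kernel="rbf", C=1.0).fit(X, y)
weighted = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0, -1: 1.0}).fit(X, y)

recall = lambda m: (m.predict(X[y == 1]) == 1).mean()   # minority recall
```

With the heavier minority penalty, the decision boundary is pushed toward the majority class, trading some majority accuracy for minority recall, which is exactly the effect the modified primal is designed to produce.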

Illustration of SVM Training with Imbalanced Data

Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.

One-Class SVM for Anomaly Detection

Anomaly detection amounts to building an enclosure around a cloud of points coding non-anomalous objects, in order to separate them from the outliers that represent anomalies. This problem is known as the soft minimal hypersphere problem and, for points mapped into the feature space H, it is expressed as

    min_{c,r,ξ}  r² + C ∑_{i=1}^l ξ_i
    subject to   ‖φ(x_i) − c‖²_H ≤ r² + ξ_i,  i ∈ [1, l],
                 ξ_i ≥ 0,  i ∈ [1, l].
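A practical sketch of this idea, using scikit-learn's OneClassSVM (my choice of implementation; for the RBF kernel its nu-parameterised formulation is closely related to the soft hypersphere above, with nu upper-bounding the fraction of training points treated as outliers). Data are synthetic.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, (200, 2))          # non-anomalous cloud
anomalies = np.array([[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]])

# nu plays the role of the trade-off C: it bounds the fraction of
# training points allowed to fall outside the learned enclosure
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(normal)

inlier_preds  = model.predict(normal)       # +1 = inside the enclosure
anomaly_preds = model.predict(anomalies)    # -1 = outlier / anomaly
```

Only non-anomalous data are needed for training, which is what makes this formulation attractive for rare-event problems where anomalies are too scarce to learn directly.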

Example of a Soft Minimal Enclosing Hypersphere

Sublevel sets for different values of the constant γ.

Drawbacks of SVMs

- The soft-margin maximization paradigm minimizes the total error, which in turn introduces a bias toward the majority class.
- Offline calculations: unsuitable for processing data streams.
- Inadequate for large problems (except when using heuristics such as Platt's Sequential Minimal Optimization).


Tornado Experiments I

- The data were randomly divided into two sets: training/validation and independent testing.
- The complete training/validation set contains 361 tornadic observations and 5048 non-tornadic observations from 59 storm days.
- The independent testing set contains 360 tornadic observations and 5047 non-tornadic observations from 52 storm days.
- The percentage of tornadic observations in each data set is 6.7%.

Tornado Experiments II

- Cross-validation was applied with different combinations of kernel functions (linear, polynomial and Gaussian radial basis function) and parameter values on the training/validation set.
- Each classifier is tested on test observations drawn randomly with replacement using bootstrap resampling (Efron and Tibshirani, 1993) with 1000 replications on the independent testing set, to establish confidence intervals.
- The best support vector solution is the classifier with the highest mean Critical Success Index (Hit/(Hit + Miss + False Alarms)) on the validation set.
- The best classifier uses the Gaussian radial basis function kernel with radius 0.0001. We apply these optimal parameters to predict the outcomes of the testing set.
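The bootstrap procedure described above can be sketched for the CSI score. The labels and predictions below are synthetic stand-ins (roughly matching the 6.7% tornadic fraction), not the study's actual data; the resampling loop is the standard Efron-Tibshirani percentile bootstrap.

```python
import numpy as np

def csi(y_true, y_pred):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    hit  = ((y_pred == 1) & (y_true == 1)).sum()
    miss = ((y_pred == -1) & (y_true == 1)).sum()
    fa   = ((y_pred == 1) & (y_true == -1)).sum()
    return hit / (hit + miss + fa)

rng = np.random.default_rng(0)
# Hypothetical test-set labels and predictions, about 6.7% tornadic
y_true = np.where(rng.random(5407) < 0.067, 1, -1)
y_pred = np.where(rng.random(5407) < 0.9, y_true, -y_true)  # ~90% correct

# Bootstrap: resample test cases with replacement, 1000 replications
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    scores.append(csi(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(scores, [2.5, 97.5])   # 95% confidence interval
```

Resampling whole test cases preserves the class imbalance on average, so the interval reflects sampling uncertainty in the skill score rather than in the class ratio.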



Results for the Tornado Data Set

Results computed from the binary confusion matrix with a 95% confidence interval:

    Measure   Validation   Test
    POD       57% ± 13%    57% ± 13%
    FAR       18% ± 10%    31% ± 14%
    CSI       50% ± 10%    45% ± 12%
    Bias          ± 21%    83% ± 20%
    HSS       62% ±  9%    60% ± 11%

POD: probability of detection (hit/(hit + miss)); FAR: false alarm ratio (false alarm/(hit + false alarm)); CSI: critical success index (hit/(hit + false alarm + miss)); Bias: (hit + false alarm)/(hit + miss); HSS: Heidke skill score.

Source: I. Adrianto, T. B. Trafalis, and V. Lakshmanan. Support vector machines for spatiotemporal tornado prediction. International Journal of General Systems, 38(7):759-776, 2009.
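All five scores in the table follow from the four confusion-matrix counts. A sketch with illustrative counts of my own choosing (picked so the point estimates land near the Test column; they are not the study's actual counts):

```python
import numpy as np

def verification_scores(hit, miss, fa, cn):
    """Standard forecast verification scores from the binary confusion
    matrix counts: hit, miss, false alarm (fa), correct null (cn)."""
    pod  = hit / (hit + miss)                 # probability of detection
    far  = fa / (hit + fa)                    # false alarm ratio
    csi  = hit / (hit + miss + fa)            # critical success index
    bias = (hit + fa) / (hit + miss)          # forecast / observed events
    n = hit + miss + fa + cn
    # Heidke skill score: accuracy relative to chance agreement
    expect = ((hit + miss) * (hit + fa) + (cn + miss) * (cn + fa)) / n
    hss = (hit + cn - expect) / (n - expect)
    return pod, far, csi, bias, hss

pod, far, csi, bias, hss = verification_scores(hit=205, miss=155, fa=92, cn=4955)
```

Note how HSS discounts the huge number of correct nulls that dominate an imbalanced verification set; plain accuracy here would exceed 95% while the skill score stays near 0.6.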

Part II: Dynamic Forecasting Using Kernel Methods

Outline
4 Filtering with Kernel Methods
  Key Notions
  Approach Outline
  Assimilation with Kernel Methods
5 Applications to Meteorology
  Experimental Setup
  Lorenz 96 Model
  Quasi-Geostrophic Model

Objectives of Dynamic Forecasting Using Kernel Methods

Dynamical systems: physical systems are mathematically represented by states in some abstract space. Transitions between states are modeled with transition functions over the state space in order to simulate the system dynamics.

Objectives:
- To provide an alternative to Kalman filtering for predicting the future states of nonlinear dynamical systems.
- To use machine learning techniques and kernel methods to build nonlinear state predictors.

Kalman Filtering

Definition (Kalman filter). Given a sequence of perturbed measurements, a Kalman filter is a process that estimates the states of a dynamical system.

We consider only differentiable real-time nonlinear dynamical systems (to which correspond nonlinear Kalman filters). The state transition and observation models of the nonlinear dynamical system are

    ∂_t x(t) = f(x, u, t) + w(t)   and   z(t) = h(x, t) + v(t),

where x is the state of the system, z is the observation, f is the state transition function, h is the observation model, u is the control and (w, v) are the (Gaussian) noises.
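For reference, the baseline the talk compares against can be sketched in its simplest form: a linear Kalman filter for a scalar random-walk state (the nonlinear variants linearise f and h around the current estimate). All model parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q, r = 0.01, 0.5      # process and observation noise variances (assumed known)

# Simulate a scalar random-walk state and noisy observations z = x + v
x_true = np.cumsum(rng.normal(0, np.sqrt(q), 100))
z = x_true + rng.normal(0, np.sqrt(r), 100)

# Linear Kalman filter: predict (x_hat, P), then correct with each observation
x_hat, P = 0.0, 1.0
estimates = []
for zk in z:
    P = P + q                               # predict: random walk, F = 1
    K_gain = P / (P + r)                    # Kalman gain
    x_hat = x_hat + K_gain * (zk - x_hat)   # correct with the innovation
    P = (1 - K_gain) * P                    # update error covariance
    estimates.append(x_hat)
estimates = np.array(estimates)

rmse_raw = np.sqrt(np.mean((z - x_true) ** 2))         # raw observations
rmse_kf  = np.sqrt(np.mean((estimates - x_true) ** 2)) # filtered estimates
```

The filter's optimality hinges on the Gaussian, Markovian assumptions listed on the next slides, which is precisely where the kernel alternative relaxes the requirements.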

Example: Radar Tracking

From Pattern Recognition and Machine Learning by C. M. Bishop. Blue points: true positions; green points: noisy observations; red crosses: forecasts.

Kalman Filter and Kernel Methods: Comparison of Assumptions

- Unlike the linear Kalman filter, the nonlinear variants do not necessarily give an optimal state estimator. The filter may also diverge if the initial estimate is wrong or if the model is incorrect.
- For Kalman filters, the process must be Markovian and the perturbations must be independent and Gaussian. Implementations must deal with problems related to matrix storage, matrix inversion and/or matrix factorization.
- Kernel methods need no statistical assumptions on the process noise and work with both Markovian and non-Markovian processes. Their storage and computational requirements are modest.


Assimilation and Forecasting with Kernel Methods

1. Assimilation. The assimilation step attempts to recover the unperturbed system states from the current and past observations using kernel-based regression techniques. Kernel methods remove noise from the state trajectories and update them from the previous forecasts.

2. Forecasting. The last assimilated state can be used as an initial estimate for one iteration of a nonlinear Kalman filter. A polynomial predictive analysis of the last recorded state trajectories, using Lagrange interpolation with Chebyshev nodes, can provide reliable extrapolations. The generalization property of the SVM regression function can be used to estimate the next future state.
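The polynomial extrapolation step can be sketched with NumPy's Chebyshev basis: fitting a degree n-1 polynomial through n samples taken at Chebyshev nodes is equivalent to Lagrange interpolation at those nodes. The trajectory below is a smooth function I made up for illustration.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Chebyshev nodes on the recorded window [0, 2], mapped from [-1, 1]
n = 8
k = np.arange(n)
nodes = 1.0 + np.cos((2 * k + 1) * np.pi / (2 * n))

traj = lambda t: np.sin(1.5 * t) + 0.3 * t   # hypothetical state trajectory
x = traj(nodes)                              # recorded (assimilated) values

# Degree n-1 fit through n nodes = exact interpolation, i.e. the
# Lagrange interpolant at these nodes, expressed in the Chebyshev basis
poly = Chebyshev.fit(nodes, x, deg=n - 1, domain=[0.0, 2.0])

t_next = 2.1                                 # short extrapolation step
x_pred = poly(t_next)                        # predicted next state value
```

Chebyshev nodes keep the interpolant well conditioned (avoiding the Runge phenomenon of equispaced nodes), which is what makes short extrapolations past the window reliable for smooth trajectories.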

Advantages and Shortcomings of Kernel Methods

Advantages:
- Low memory requirements (some kernels require only O(n) elements to be stored in memory).
- Acceptable computational complexity (of the order of O(n²); data thinning can reduce the size of the input data set).
- Massive parallelization (can be applied separately to each trajectory).
- No statistical assumptions; no state transition model necessary.
- Can be combined with a Kalman filter if necessary.

Shortcomings:
- Estimation of some kernel parameters.


Interpolating State Trajectories without a Model

Main idea: find a non-trivial function f such that, for every given sample (t_i, x_i) ∈ R², we have f(t_i) = x_i (or f(t_i) lies in an interval centered at x_i with half-width ε ≥ 0). The interpolation function is expressed as an affine combination of kernel-based functions k(t, ·). The positive semi-definite matrix K with entries K_ij = k(t_i, t_j) is called the kernel matrix.

Kalman filtering works differently: contrary to this approach, no state trajectory interpolation takes place with KFs, and KFs absolutely need a model.

Fitting Functions

The non-trivial function f such that f(t_i) = x_i is chosen to belong to the function class

    F = { t ∈ R ↦ ∑_{i=1}^l α_i k(t_i, t) + b ∈ R : αᵀKα ≤ B² }.

The Rademacher complexity of F measures the capability of the functions of F to fit random data with respect to the probability distribution generating the data. The empirical Rademacher complexity of F, denoted R̂(F), satisfies

    R̂(F) ≤ (2B/l) √tr(K) + 2|b|/√l.

Minimizing the Generalization and the Empirical Errors

We use R̂(F) to control the upper bound on the generalization error of the interpolation function. Small empirical errors and a small R̂(F) both decrease this bound, therefore:

- We need to minimize the absolute value of b. Also, αᵀKα ≤ ‖K‖ ‖α‖², so minimizing αᵀα + b² contributes to a smaller R̂(F).
- The empirical error, defined by (∑_{i=1}^l ξ_i)/l, where the ξ_i are the differences between the function outputs and the targets, should be minimized.

Aim: minimize the quantity αᵀα + b² + C ξᵀξ with C > 0.

Optimization Problem

Introducing tolerances ρ_i > 0, the empirical errors ξ_i are equal to f(t_i) − x_i. These are the only constraints associated with the previous objective function. Hence the previous calculations lead to:

Optimization problem for data assimilation (Gilbert et al., 2010):

    min_{(α,ξ,b) ∈ R^{2l+1}} { αᵀα + b² + C ξᵀξ : Kα + b 1_l − x = ξ }.

The solution of this optimization problem is analytical:

    (K² + I_l/C + 1_l 1_lᵀ) d = x,   α = Kd,   b = 1_lᵀ d.

The solutions α and b describe the regression function (belonging to the function class F) that interpolates the state trajectories during assimilation.
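The closed form above is a single linear solve. A sketch (my own toy trajectory and Gaussian kernel choice, not the talk's data) that computes d, α and b, and checks the optimality identity Kα + b 1_l = x − d/C implied by the linear system:

```python
import numpy as np

rng = np.random.default_rng(0)
l, C = 25, 1e3
t = np.linspace(0.0, 5.0, l)                 # observation times
x = np.sin(t) + rng.normal(0, 0.05, l)       # noisy scalar state trajectory

# Gaussian kernel matrix on the time stamps
K = np.exp(-(t[:, None] - t[None, :]) ** 2)

ones = np.ones(l)
# Solve (K^2 + I/C + 1 1^T) d = x, then alpha = K d and b = 1^T d
d = np.linalg.solve(K @ K + np.eye(l) / C + np.outer(ones, ones), x)
alpha, b = K @ d, ones @ d

f_at_nodes = K @ alpha + b                   # interpolant at the t_i
residual = f_at_nodes - x                    # equals -d/C by the system above
```

Larger C shrinks the residuals toward exact interpolation, while smaller C trades fit for the smaller αᵀα + b² that controls R̂(F), mirroring the generalization argument of the previous slide.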

Computational Details

To compute the solution d of the linear system (K² + I_l/C + 1_l 1_lᵀ) d = x, we define σ = 1/√C and the matrix A = K + σ I_l (A is symmetric and positive definite), together with the sequences

    A ũ_0 = x,    A u_0 = ũ_0,   A ũ_n = u_n,   A u_{n+1} = 2σ (u_n − σ ũ_n),
    A ṽ_0 = 1_l,  A v_0 = ṽ_0,   A ṽ_n = v_n,   A v_{n+1} = 2σ (v_n − σ ṽ_n).

We set u = ∑_{n≥0} u_n and v = ∑_{n≥0} v_n. Both series converge rapidly and are truncated at a step m > 0 in practical problems. Once m is determined, we then have

    d = u − (⟨1_l, u⟩ / (1 + ⟨1_l, v⟩)) v.
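The final combination of u and v is the Sherman-Morrison rank-one update for the 1_l 1_lᵀ term. A sketch that verifies it numerically: here I replace the truncated series with direct solves for u = M⁻¹x and v = M⁻¹1_l (where M = K² + I_l/C), since only the rank-one formula is being checked; data and kernel are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
l, C = 15, 100.0
t = np.linspace(0.0, 3.0, l)
x = np.cos(t)                                # toy state trajectory

K = np.exp(-(t[:, None] - t[None, :]) ** 2)  # Gaussian kernel matrix
M = K @ K + np.eye(l) / C                    # K^2 + I/C
ones = np.ones(l)

# Sherman-Morrison: with u = M^{-1} x and v = M^{-1} 1_l,
# d = u - (<1_l, u> / (1 + <1_l, v>)) v  solves (M + 1_l 1_l^T) d = x
u = np.linalg.solve(M, x)
v = np.linalg.solve(M, ones)
d = u - (ones @ u) / (1.0 + ones @ v) * v

d_direct = np.linalg.solve(M + np.outer(ones, ones), x)   # reference solve
```

The practical gain of the slide's scheme is that only solves with A = K + σI_l are needed, never with the dense rank-one-updated matrix itself.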

Approach Summary
- We chose a class of functions in order to interpolate state trajectories without a model.
- We defined a fitness measure for this class of functions and linked it to the parameters describing a function in that class.
- We defined an optimization problem that minimizes the empirical errors and maximizes the fitness of the function interpolating the state trajectories.
- An analytical solution of the optimization problem was derived, and its computation was reduced to solving a sequence of linear systems.

Outline
4. Filtering with Kernel Methods: Key Notions; Approach Outline; Assimilation with Kernel Methods
5. Applications to Meteorology: Experimental Setup; Lorenz 96 Model; Quasi-Geostrophic Model

Experimental Setup

Machine and Software
All codes were implemented in MATLAB 7.9 on a 2002 Dell Precision Workstation 530 with two 2.4 GHz Intel Xeon processors and 2 GiB of RAM. EnKF forecasts were generated with version 0.23 of the EnKF Matlab toolbox by Pavel Sakov (available from Evensen's webpage at enkf.nersc.no).

Experimental Models
The kernel approach was tested on the Lorenz 96 model and on the quasi-geostrophic 1.5-layer reduced-gravity model. Forecasts were obtained by combining polynomial predictive analysis with kernel-based extrapolation during the assimilation stage.


Lorenz 96 Model: Description

Introduced by Lorenz and Emanuel (1998). The state transition model represents the values of atmospheric quantities at discrete locations spaced equally on a latitude circle (a 1D problem). At location $i$ on the latitude circle it reads
$$\frac{\partial x_i}{\partial t} = \underbrace{(x_{i+1} - x_{i-2})\,x_{i-1}}_{\text{advection}} \;-\; \underbrace{x_i}_{\text{dissipation}} \;+\; \underbrace{F}_{\text{external forcing}}.$$
The states represent an unspecified scalar meteorological quantity, e.g. vorticity or temperature (Lorenz and Emanuel). The model was introduced to determine which locations on a latitude circle are the most effective in improving weather assimilation and forecasts. Observations were generated with an error variance of 1, and the external forcing was set to $F = 8$; the system exhibited chaotic behavior.
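The model is straightforward to integrate. Below is a minimal NumPy sketch with a classical fourth-order Runge-Kutta step; the grid size, time step, and initial perturbation are illustrative choices, not the experimental settings of the talk.

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, cyclic in i."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt, F=8.0):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# 40 locations on the latitude circle, slightly perturbed rest state.
n, dt = 40, 0.05
x = np.full(n, 8.0)
x[0] += 0.01
for _ in range(1000):  # trajectories remain bounded but become chaotic
    x = rk4_step(x, dt)
```

With $F = 8$ the small initial perturbation spreads around the circle, producing the chaotic behavior mentioned above.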

Lorenz 96 Model: Assimilation

The following figures illustrate how kernel methods remove the observational noise and interpolate the state trajectories during assimilation. The kernel used is a Gaussian RBF kernel with $\sigma = 3\delta_t$.
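To make the noise-removal step concrete, here is a single-variable kernel-ridge sketch using the same Gaussian RBF kernel with $\sigma = 3\delta_t$; the signal, noise level, and regularization weight are invented, and the talk's figures come from the full assimilation procedure rather than this simplification.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scalar trajectory observed with unit-variance noise,
# mimicking the observation error used in the experiments.
dt = 0.05
t = np.arange(0.0, 10.0, dt)
truth = 4.0 * np.sin(0.8 * t)
obs = truth + rng.standard_normal(t.size)

# Gaussian RBF kernel with sigma = 3 * delta_t, as on the slide.
sigma = 3.0 * dt
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma**2))

# Regularized least-squares fit: (K^2 + lam I) alpha = K y.
lam = 1.0
alpha = np.linalg.solve(K @ K + lam * np.eye(t.size), K @ obs)
fit = K @ alpha

# The smoothed fit sits closer to the truth than the raw observations.
print(np.mean((fit - truth) ** 2) < np.mean((obs - truth) ** 2))
```

The narrow kernel width tied to the assimilation time step keeps the fit local in time while the regularization suppresses the high-frequency noise components.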

Lorenz 96 Model: Forecast Errors


Quasi-Geostrophic Model: Description

This is an atmospheric dynamical model involving an approximation of the actual winds, used in the analysis of large-scale extratropical weather systems. The system states are scalar quantities representing the air flow. Horizontal winds are replaced by their geostrophic values in the horizontal acceleration terms of the momentum equations, and horizontal advection in the thermodynamic equation is approximated by geostrophic advection; furthermore, vertical advection of momentum is neglected. It is a 2D problem: the atmosphere has a single level in the vertical and is represented by a square 33 × 33 grid, with each state located on a node of that grid. Observations were generated with an error variance of 1.

Quasi-Geostrophic Model: Assimilation Example

Despite the absence of a dynamical model, the kernel interpolation obtained from the noisy observations closely matches the true states.

Quasi-Geostrophic Model: Forecast Example

Kernel and EnKF forecasts at time step 51. The EnKF forecast is 8 units away from the true state, while the kernel forecast is 0.5 units away.

Quasi-Geostrophic Model: Forecast Errors

Conclusions I

What was achieved?
- A viable kernel-based approach to data assimilation and forecasting was introduced for nonlinear dynamical systems.
- Its performance on the meteorological models was predictable, ranging from matching an EnKF with 50 ensemble members to exceeding an EnKF with 100 ensemble members, with error decreasing as the amount of chaos in the dynamical system decreases.
- Encouraging results were obtained in removing observational noise and interpolating state trajectories; they represent an improvement over a standard EnKF with fewer than 20 ensemble members.

Conclusions II

Future Work
- We are currently applying these techniques to financial and petroleum-engineering problems involving the same type of multi-dimensional time series.
- We are developing approaches that identify independent factors influencing the shape of multi-dimensional time series in a nonlinear fashion.
- The same tools will be used to predict rare events and their magnitude.

Questions?

For Further Reading

R. C. Gilbert, M. B. Richman, L. M. Leslie, and T. B. Trafalis. Kernel methods for data-driven numerical modeling. Monthly Weather Review, submitted, 2010.

E. N. Lorenz and K. A. Emanuel. Optimal sites for supplementary weather observations: simulation with a small model. Journal of the Atmospheric Sciences, 55(3):399-414, 1998.

P. Sarma and W. H. Chen. Generalization of the ensemble Kalman filter using kernels for non-Gaussian random fields. In SPE Reservoir Simulation Symposium Proceedings. Society of Petroleum Engineers, 2009.

F. Steinke and B. Schölkopf. Kernels, regularization and differential equations. Pattern Recognition, 41(11):3271-3286, 2008.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY, USA, 1995.