Java Modules for Time Series Analysis

1 Java Modules for Time Series Analysis

2 Agenda
1. Clustering
2. Non-normal distributions
3. Multifactor modeling
4. Implied ratings
5. Time series prediction

3 1. Clustering
[Figure: time series grouped into Cluster 1, Cluster 2 and Cluster 3, each with its synthetic series]

4 Clustering
Goal: group time series so that series with similar historical behavior fall into the same group.
Input:
- A set of single time series (bond, share, fund prices) or groups of time series (for example interest rate market curves)
- Number of clusters
Output:
- Clusters of time series and clustering quality statistics
- Every cluster is represented by a prototype series (synthetic curve) with the same dimensionality as all other series
Use:
- Clustering reduces a huge number of series and thus makes time-consuming operations, such as the calculation of huge correlation matrices, feasible. The number of time series is reduced by identifying the cluster to which a series belongs and using the prototype of that cluster instead of the real series.
- Determining similar behavior of market factors or issuers (cartels)
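
The slides do not name the clustering algorithm. The following is a minimal sketch, assuming plain k-means with squared Euclidean distance and prototypes seeded from the first k series. The class SeriesKMeans and all method names are hypothetical and only illustrate how a prototype (synthetic) series can be obtained as the mean of the cluster members.

```java
/** Hypothetical sketch: k-means over equal-length series, one prototype per cluster. */
class SeriesKMeans {

    /** Squared Euclidean distance between two series of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Seeds the prototypes with copies of the first k series (a simplification). */
    static double[][] seed(double[][] series, int k) {
        double[][] prototypes = new double[k][];
        for (int c = 0; c < k; c++) prototypes[c] = series[c].clone();
        return prototypes;
    }

    /** Assigns every series to its closest prototype. */
    static int[] assign(double[][] series, double[][] prototypes) {
        int[] assignment = new int[series.length];
        for (int s = 0; s < series.length; s++) {
            int best = 0;
            for (int c = 1; c < prototypes.length; c++) {
                if (squaredDistance(series[s], prototypes[c])
                        < squaredDistance(series[s], prototypes[best])) {
                    best = c;
                }
            }
            assignment[s] = best;
        }
        return assignment;
    }

    /**
     * Alternates assignment and prototype update for a fixed number of iterations.
     * The prototypes array is updated in place; the result is the cluster index per series.
     */
    static int[] cluster(double[][] series, double[][] prototypes, int iterations) {
        for (int it = 0; it < iterations; it++) {
            int[] assignment = assign(series, prototypes);
            for (int c = 0; c < prototypes.length; c++) {
                double[] mean = new double[prototypes[c].length];
                int count = 0;
                for (int s = 0; s < series.length; s++) {
                    if (assignment[s] != c) continue;
                    for (int i = 0; i < mean.length; i++) mean[i] += series[s][i];
                    count++;
                }
                if (count == 0) continue; // an empty cluster keeps its old prototype
                for (int i = 0; i < mean.length; i++) mean[i] /= count;
                prototypes[c] = mean;
            }
        }
        return assign(series, prototypes);
    }
}
```

Calling seed and then cluster with the spread curves as rows would return one cluster index per curve, and the prototypes array would then hold the synthetic curves.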

5 Clustering
Clustering can be performed for:
- Time series (for example share or bond prices with a historical development)
- Curve time series (for example interest rate market curves with a historical development)
In addition to the clusters with their series and prototypes, clustering quality statistics are generated: inter- and intra-cluster statistics, adjusted R squared, average linkage, etc.
Some of these statistics can be used to determine the optimal number of clusters, i.e. the number of groups with minimal internal distance and maximal distance to each other.
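
How the quality statistics are computed is not shown in the slides. As a hedged illustration, the sketch below computes the within-cluster sum of squares and a plain (unadjusted) R squared, defined here as 1 minus within-cluster over total sum of squares. The class ClusterQuality, its method names and this exact definition are assumptions.

```java
/** Hypothetical sketch of two clustering quality statistics. */
class ClusterQuality {

    /** Mean series of the series in the given cluster; a negative cluster index selects all series. */
    static double[] mean(double[][] series, int[] assignment, int cluster) {
        double[] m = new double[series[0].length];
        int count = 0;
        for (int s = 0; s < series.length; s++) {
            if (cluster >= 0 && assignment[s] != cluster) continue;
            for (int i = 0; i < m.length; i++) m[i] += series[s][i];
            count++;
        }
        for (int i = 0; i < m.length; i++) m[i] /= count;
        return m;
    }

    /** Sum of squared distances of the selected series to their own mean series. */
    static double sumOfSquares(double[][] series, int[] assignment, int cluster) {
        double[] m = mean(series, assignment, cluster);
        double sum = 0.0;
        for (int s = 0; s < series.length; s++) {
            if (cluster >= 0 && assignment[s] != cluster) continue;
            for (int i = 0; i < m.length; i++) {
                double d = series[s][i] - m[i];
                sum += d * d;
            }
        }
        return sum;
    }

    /** Intra-cluster statistic: sum of squares inside all k clusters. */
    static double withinSumOfSquares(double[][] series, int[] assignment, int k) {
        double sum = 0.0;
        for (int c = 0; c < k; c++) sum += sumOfSquares(series, assignment, c);
        return sum;
    }

    /** Share of the total variation explained by the clustering (unadjusted). */
    static double rSquared(double[][] series, int[] assignment, int k) {
        return 1.0 - withinSumOfSquares(series, assignment, k) / sumOfSquares(series, assignment, -1);
    }
}
```

Evaluating these statistics for several candidate numbers of clusters gives an error curve from which a number of clusters can be chosen.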

6 Finding optimal number of clusters using clustering error
[Chart: clustering error (y-axis, 0,00 to 0,20) against number of clusters, with a table of adjusted R squared and error per number of clusters. Optimal number of clusters = 18]

7 Example: clustering of 20 spread curves
Actual Spread Curves (one column per maturity in years; the header gives the rate type of each column)
Num | Curve | MM | CM | CM | CM | CM | CM
1 | FORTUM-AEUR-MM | 0,1439% | 0,2293% | 0,3047% | 0,3686% | 0,4647% | 0,5828%
2 | SRBIA-AEUR-CR | 1,2863% | 1,2500% | 1,7880% | 2,1133% | 2,6375% | 3,0123%
3 | UKRAIN-AUSD-CR | 1,7606% | 1,9218% | 2,7834% | 3,2418% | 3,8589% | 4,5407%
4 | ITALY-AEUR-CR | 0,1022% | 0,1167% | 0,1896% | 0,2727% | 0,3965% | 0,4980%
5 | SLOVEN-AEUR-CR | 0,0376% | 0,1044% | 0,1351% | 0,1584% | 0,2338% | 0,3622%
6 | CZECH-AEUR-CR | 0,1295% | 0,1471% | 0,2220% | 0,2583% | 0,3627% | 0,4925%
7 | TURKEY-AUSD-CR | 0,8137% | 0,9853% | 1,6326% | 2,1247% | 2,7918% | 3,4798%
8 | ROMANI-AUSD-CR | 0,5478% | 0,5594% | 1,1125% | 1,3618% | 1,7735% | 2,0718%
9 | POLAND-AUSD-CR | 0,0883% | 0,1608% | 0,2432% | 0,3576% | 0,5238% | 0,6876%
10 | PEUGOT-AEUR-MM | 0,6865% | 0,8433% | 1,0406% | 1,2925% | 1,6265% | 1,7360%
11 | JPM-CUSD-MR | 1,0987% | 1,0272% | 1,1421% | 1,2384% | 1,3570% | 1,3909%
12 | DANBNK-AEUR-MM | 0,1830% | 0,2635% | 0,3686% | 0,4481% | 0,5610% | 0,6098%
13 | GS-AUSD-MR | 1,1586% | 1,1238% | 1,1868% | 1,2137% | 1,2638% | 1,2969%
14 | CROATI-AEUR-CR | 0,1554% | 0,2741% | 0,4423% | 0,5602% | 0,8274% | 1,0806%
15 | ESPSAN-CEUR-MM | 0,7725% | 0,9577% | 1,2941% | 1,5505% | 1,9120% | 1,9371%
16 | SUEDZU-AEUR-MM | 0,6148% | 0,7388% | 0,9829% | 1,2075% | 1,5032% | 1,6203%
17 | LEH-AUSD-MR | 6,7011% | 6,8482% | 5,4481% | 4,5185% | 3,2897% | 2,6982%
18 | HELLAS-CEUR-MM | 4,6793% | 4,9875% | 8,3839% | 9,9521% | 10,6632% | 10,6198%
19 | REPHUN-AUSD-CR | 0,2793% | 0,3050% | 0,5168% | 0,7809% | 1,1361% | 1,3992%
20 | BGARIA-AUSD-CR | 0,3123% | 0,4825% | 0,8722% | 1,1222% | 1,5268% | 1,8284%

8 All 20 spread curves
[Chart: Actual Spread Curves, y-axis 0,00% to 12,00%, one line for each of the 20 curves from FORTUM-AEUR-MM to BGARIA-AUSD-CR]

9 Clusters of spread curves
Spread Curve Clusters - Actual Rates (one column per maturity in years; the header gives the rate type of each column)
Num | Curve | MM | CM | CM | CM | CM | CM

Cluster 1 (Std Deviation 0,1725%)
1 | FORTUM-AEUR-MM | 0,1439% | 0,2293% | 0,3047% | 0,3686% | 0,4647% | 0,5828%
4 | ITALY-AEUR-CR | 0,1022% | 0,1167% | 0,1896% | 0,2727% | 0,3965% | 0,4980%
5 | SLOVEN-AEUR-CR | 0,0376% | 0,1044% | 0,1351% | 0,1584% | 0,2338% | 0,3622%
6 | CZECH-AEUR-CR | 0,1295% | 0,1471% | 0,2220% | 0,2583% | 0,3627% | 0,4925%
9 | POLAND-AUSD-CR | 0,0883% | 0,1608% | 0,2432% | 0,3576% | 0,5238% | 0,6876%
12 | DANBNK-AEUR-MM | 0,1830% | 0,2635% | 0,3686% | 0,4481% | 0,5610% | 0,6098%
14 | CROATI-AEUR-CR | 0,1554% | 0,2741% | 0,4423% | 0,5602% | 0,8274% | 1,0806%
Cluster Spread | 0,1200% | 0,1851% | 0,2722% | 0,3463% | 0,4814% | 0,6162%

Cluster 2 (Std Deviation 0,3019%)
8 | ROMANI-AUSD-CR | 0,5478% | 0,5594% | 1,1125% | 1,3618% | 1,7735% | 2,0718%
13 | GS-AUSD-MR | 1,1586% | 1,1238% | 1,1868% | 1,2137% | 1,2638% | 1,2969%
10 | PEUGOT-AEUR-MM | 0,6865% | 0,8433% | 1,0406% | 1,2925% | 1,6265% | 1,7360%
11 | JPM-CUSD-MR | 1,0987% | 1,0272% | 1,1421% | 1,2384% | 1,3570% | 1,3909%
15 | ESPSAN-CEUR-MM | 0,7725% | 0,9577% | 1,2941% | 1,5505% | 1,9120% | 1,9371%
16 | SUEDZU-AEUR-MM | 0,6148% | 0,7388% | 0,9829% | 1,2075% | 1,5032% | 1,6203%
19 | REPHUN-AUSD-CR | 0,2793% | 0,3050% | 0,5168% | 0,7809% | 1,1361% | 1,3992%
20 | BGARIA-AUSD-CR | 0,3123% | 0,4825% | 0,8722% | 1,1222% | 1,5268% | 1,8284%
Cluster Spread | 0,6838% | 0,7547% | 1,0185% | 1,2209% | 1,5123% | 1,6601%

Cluster 3 (Std Deviation 0,8026%)
2 | SRBIA-AEUR-CR | 1,2863% | 1,2500% | 1,7880% | 2,1133% | 2,6375% | 3,0123%
3 | UKRAIN-AUSD-CR | 1,7606% | 1,9218% | 2,7834% | 3,2418% | 3,8589% | 4,5407%
7 | TURKEY-AUSD-CR | 0,8137% | 0,9853% | 1,6326% | 2,1247% | 2,7918% | 3,4798%
17 | LEH-AUSD-MR | 6,7011% | 6,8482% | 5,4481% | 4,5185% | 3,2897% | 2,6982%
Cluster Spread | 2,6402% | 2,7512% | 2,9129% | 2,9995% | 3,1445% | 3,4328%

Cluster 4 (Std Deviation 0,0000%)
18 | HELLAS-CEUR-MM | 4,6793% | 4,9875% | 8,3839% | 9,9521% | 10,6632% | 10,6198%
Cluster Spread | 4,6793% | 4,9875% | 8,3839% | 9,9521% | 10,6632% | 10,6198%

10 Clusters of market curves
[Chart: first cluster with FORTUM-AEUR-MM, ITALY-AEUR-CR, SLOVEN-AEUR-CR, CZECH-AEUR-CR, POLAND-AUSD-CR, DANBNK-AEUR-MM, CROATI-AEUR-CR and the synthetic curve (Cluster Spread)]
[Chart: second cluster with ROMANI-AUSD-CR, GS-AUSD-MR, PEUGOT-AEUR-MM, JPM-CUSD-MR, ESPSAN-CEUR-MM, SUEDZU-AEUR-MM, REPHUN-AUSD-CR, BGARIA-AUSD-CR and the synthetic curve (Cluster Spread)]

11 Historical development
[Charts: Cluster 2 at 6 Months, 1 Year, 5 Years and 10 Years. Each panel plots the historical development of the eight member series (ROMANI-AUSD-CR, GS-AUSD-MR, PEUGOT-AEUR-MM, JPM-CUSD-MR, ESPSAN-CEUR-MM, SUEDZU-AEUR-MM, REPHUN-AUSD-CR, BGARIA-AUSD-CR) together with the cluster's synthetic curve]

12 2. Non-Normal Distributions
[Figure: an empirical distribution is mapped to a theoretical distribution type and its parameters; Cauchy and Normal curves shown for comparison]

13 Non-normal distributions
Goal: automatically identify the distribution type and its parameters from market time series, and use the Copula approach to simulate market factors in Monte Carlo VaR with the mapped distributions.
Input:
- The time series of the market factors
- Chosen standard distribution types (Beta, Cauchy, Student, Weibull, etc.)
Output:
- Identified distribution type
- The parameters of the identified distribution type
- Numerical estimate of the distance between the empirical distribution and every candidate distribution type (this allows the distribution types to be ranked and another well-fitting type to be chosen)
Use:
- Improving Monte Carlo VaR simulation by using correlated non-normally distributed samples instead of correlated normally distributed samples

14 Non-normal distributions: calculation of Value at Risk
[Chart: distribution of the market value with the expected value, the quantile Q at confidence level a, and VaR(a)]
The distribution of the time series of market factors is assumed to be normal in most cases. This does not correspond to reality: the time series often exhibit skewed and fat-tailed distributions, so the normal assumption underestimates the market risk of improbable large losses (fat-tail losses).
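
To make the quantile definition concrete, here is a minimal sketch, assuming VaR is reported as the positive loss at the (1 - a) quantile of a set of simulated or historical profit-and-loss values. The class ValueAtRisk, the index convention and the sign convention are assumptions and not taken from the slides.

```java
import java.util.Arrays;

/** Hypothetical sketch: empirical Value at Risk from profit-and-loss samples. */
class ValueAtRisk {

    /**
     * Returns the loss that is exceeded with probability (1 - confidence),
     * read from the sorted samples without interpolation.
     */
    static double valueAtRisk(double[] profitAndLoss, double confidence) {
        double[] sorted = profitAndLoss.clone();
        Arrays.sort(sorted); // worst outcomes first
        int index = (int) Math.floor((1.0 - confidence) * sorted.length);
        index = Math.min(Math.max(index, 0), sorted.length - 1);
        return -sorted[index]; // report the loss as a positive number
    }
}
```

With fat-tailed samples the low quantile moves further from the expected value than a normal fit would predict, which is the underestimation the slide describes.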

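15 Non-normal distributions: mapping risk factors to the best-fit distribution
[Chart: empirical distribution with fitted curves, including the Cauchy distribution (green) and the normal distribution]
The best fit is given by the Cauchy distribution (green).
The Beta distribution will produce a larger risk at a given confidence level because of its fat tail.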

16 Distribution parameters estimation
The main goal is to model the shape of the empirical distribution as closely as possible with a reproducible theoretical distribution.
Together with the identification of the distribution type, the distribution parameters are estimated from market data using the method of moments, least squares regression or maximum likelihood.
The additional parameters shift and scale keep the distribution parameter values out of undefined regions.
Data with a given distribution can be generated from:
- Distribution type
- Distribution parameters
- Additional parameters (shift, scale)
- Number of values
Cumulative distributions are used for the subsequent Copula Monte Carlo simulation.
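
As one example of the method of moments combined with the additional shift and scale parameters, the sketch below estimates the two shape parameters of a Beta distribution. It assumes the data is first mapped to the unit interval by (x - shift) / scale and that the population variance is used. The class BetaMoments is hypothetical; the moment formulas are the standard ones for the Beta distribution.

```java
/** Hypothetical sketch: method-of-moments estimate for a shifted and scaled Beta distribution. */
class BetaMoments {

    /**
     * Returns {shape1, shape2}. The data is mapped to (0, 1) with (x - shift) / scale,
     * then mean m and variance v are matched:
     * shape1 = m * (m * (1 - m) / v - 1), shape2 = (1 - m) * (m * (1 - m) / v - 1).
     */
    static double[] estimate(double[] data, double shift, double scale) {
        int n = data.length;
        double mean = 0.0;
        for (double x : data) mean += (x - shift) / scale;
        mean /= n;
        double variance = 0.0;
        for (double x : data) {
            double d = (x - shift) / scale - mean;
            variance += d * d;
        }
        variance /= n; // population variance
        double common = mean * (1.0 - mean) / variance - 1.0;
        if (common <= 0.0) {
            throw new IllegalArgumentException("variance too large for a Beta distribution");
        }
        return new double[] {mean * common, (1.0 - mean) * common};
    }
}
```

The guard shows why shift and scale matter: without mapping the data into the unit interval, the moment equations have no valid solution.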

17 Standard distribution parameters estimation
10 distribution types, each with distribution parameters and additional parameters:
Distribution | Parameter 1 | Parameter 2 | Additional parameter 1 | Additional parameter 2
Beta | Shape | Shape | Shift | Scale
Cauchy | Location | Scale | |
Exponential | Rate | --- | Shift | ---
Inverse Normal | Mu | Lambda | Shift | ---
Log Normal | Log Scale | Shape | Shift | ---
Normal | Mean | Variance | Shift | ---
Pareto | Scale | Shape | |
Rayleigh | Sigma | --- | Shift | ---
Student | Nu | --- | Shift | Scale
Weibull | Scale | Shape | Shift | ---

18 Distribution mapping
Two metrics are used to compare distributions:
- Histogram metric: the bin frequencies of the empirical histogram are compared against the bin probabilities of the theoretical histogram.
- Cumulative distances metric: ignoring the X-axis values, the cumulative distances between the data points of the market series are calculated. The same function is calculated for values generated from the theoretical distribution under consideration, and the two cumulative functions are compared.
Both the histogram and the cumulative distances are compared using the average squared error.

19 Histogram metric
[Chart: theoretical histogram and empirical histogram between min and max, with the distances between corresponding bins]
The best (mapped) distribution is the one with the minimum sum of squared distances between its theoretical histogram and the empirical histogram.
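
The histogram metric can be sketched as follows, assuming equal-width bins between min and max and a candidate distribution supplied as a cumulative distribution function. The class HistogramMetric and its signature are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the histogram metric. */
class HistogramMetric {

    /**
     * Average squared difference between empirical bin frequencies and the
     * bin probabilities implied by the candidate cumulative distribution function.
     */
    static double averageSquaredError(double[] data, double min, double max,
                                      int bins, DoubleUnaryOperator cdf) {
        double width = (max - min) / bins;
        int[] counts = new int[bins];
        for (double x : data) {
            int b = (int) ((x - min) / width);
            b = Math.min(Math.max(b, 0), bins - 1); // values on or beyond the edges go to the outer bins
            counts[b]++;
        }
        double error = 0.0;
        for (int b = 0; b < bins; b++) {
            double frequency = (double) counts[b] / data.length;
            double probability = cdf.applyAsDouble(min + (b + 1) * width)
                    - cdf.applyAsDouble(min + b * width);
            double d = frequency - probability;
            error += d * d;
        }
        return error / bins;
    }
}
```

Running this for every candidate distribution type and picking the smallest value corresponds to the mapping rule on the slide.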

20 Cumulative distances metric
[Chart: data values y with the distances d_i between consecutive values, and the graph of the cumulative distances p_i]
The cumulative distances are $p_1 = d_1$, $p_2 = d_1 + d_2$, $p_3 = d_1 + d_2 + d_3$, and in general $p_i = d_1 + d_2 + \dots + d_i$.
The best (mapped) distribution is the one with the minimum sum of squared distances between the empirical cumulative values and the corresponding theoretical cumulative values.
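
A possible reading of this metric is sketched below. It assumes the values are sorted first, that d_i is the gap between consecutive sorted values, and that the empirical and theoretical samples have the same size. The slide does not state these details, so the class CumulativeDistanceMetric and these choices are assumptions.

```java
import java.util.Arrays;

/** Hypothetical sketch of the cumulative distances metric. */
class CumulativeDistanceMetric {

    /** Returns p_1..p_(n-1), where p_i is the sum of the first i gaps between sorted values. */
    static double[] cumulativeDistances(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] p = new double[sorted.length - 1];
        double running = 0.0;
        for (int i = 0; i < p.length; i++) {
            running += sorted[i + 1] - sorted[i]; // d_(i+1)
            p[i] = running;
        }
        return p;
    }

    /** Average squared error between the cumulative distances of two equally sized samples. */
    static double averageSquaredError(double[] empirical, double[] theoretical) {
        double[] pe = cumulativeDistances(empirical);
        double[] pt = cumulativeDistances(theoretical);
        double error = 0.0;
        for (int i = 0; i < pe.length; i++) {
            double d = pe[i] - pt[i];
            error += d * d;
        }
        return error / pe.length;
    }
}
```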

21 Copula Monte Carlo VaR
Example for 2 market factors (lognormally and Beta distributed):
1. Normally distributed, correlated random samples are generated using the market risk correlation matrix.
2. The normal cumulative distribution maps them to uniformly distributed, correlated random samples in (0...1).
3. The inverse cumulative distribution x = F^-1(y) of each mapped distribution (lognormal, Beta) turns the uniform samples into correlated non-normally distributed samples.
4. The Monte Carlo simulation produces the VaR distribution, which may be skewed.
Correlated non-normally distributed samples are fed into the Monte Carlo simulation instead of the correlated normally distributed samples generated using the market risk correlation matrix.
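
The sampling chain above can be sketched for two factors with a Gaussian copula. To keep the example free of external libraries, Cauchy and exponential marginals stand in for the lognormal and Beta marginals of the slide because their inverse cumulative distributions have closed forms. The class CopulaSampler, the parameter choices and the polynomial approximation of the normal cumulative distribution (Abramowitz and Stegun, formula 26.2.17) are assumptions.

```java
import java.util.Random;

/** Hypothetical sketch: two correlated non-normal factors via a Gaussian copula. */
class CopulaSampler {

    /** Standard normal CDF, polynomial approximation with absolute error below 1e-7. */
    static double normalCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double tail = Math.exp(-0.5 * z * z) / Math.sqrt(2.0 * Math.PI) * poly;
        return z >= 0.0 ? 1.0 - tail : tail;
    }

    static double cauchyInverseCdf(double u, double location, double scale) {
        return location + scale * Math.tan(Math.PI * (u - 0.5));
    }

    static double exponentialInverseCdf(double u, double rate) {
        return -Math.log(1.0 - u) / rate;
    }

    /** Keeps u strictly inside (0, 1) so the inverse CDFs stay finite. */
    static double clamp(double u) {
        return Math.min(Math.max(u, 1e-12), 1.0 - 1e-12);
    }

    /** Returns n pairs {cauchy sample, exponential sample} with normal correlation rho. */
    static double[][] sample(int n, double rho, long seed) {
        Random rng = new Random(seed);
        double[][] out = new double[n][2];
        for (int i = 0; i < n; i++) {
            double z1 = rng.nextGaussian();
            // Cholesky factor of the 2x2 correlation matrix [[1, rho], [rho, 1]]
            double z2 = rho * z1 + Math.sqrt(1.0 - rho * rho) * rng.nextGaussian();
            double u1 = clamp(normalCdf(z1)); // uniform, still correlated
            double u2 = clamp(normalCdf(z2));
            out[i][0] = cauchyInverseCdf(u1, 0.0, 1.0);
            out[i][1] = exponentialInverseCdf(u2, 1.0);
        }
        return out;
    }
}
```

The rows returned by sample would replace the correlated normal draws as inputs to the revaluation step of a Monte Carlo VaR run.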

22 Copula Monte Carlo VaR
Skewed and fat-tailed VaR distributions
[Charts: skewed VaR distribution; fat-tailed VaR distribution]

23 Prototype system
[Screenshot with the panels: theoretical histograms, empirical histogram, cumulative values, parameter estimations, distributions generator, distances between theoretical and empirical distributions. Shown result: best fit for the Weibull distribution]

24 3. Multifactor models
Formula: Target factor = sum of Coefficients x Functions(Explanatory factors)
[Figure: example formula. Explanatory factors (time series): Instruments_Fund-FR, Instruments_Fund-LU, StockIndexCurve_DJIA, StockIndexCurve_GEX, StockIndexCurve_Nasdaq-Composite, StockIndexCurve_Nikkei225, StockIndexCurve_SDAXPI, StockIndexCurve_TECDAXPI. Functions applied: sqrt, ln, ^2.0. Constant term: -6739,26524. The target factor is obtained by the formula]

25 Multifactor models
Goal: build formulas, based on time series, that describe unknown market instruments in terms of instruments with known pricing models.
Input:
- The historical time series of the target factor (the instrument with an unknown pricing approach or unknown market factor dependency)
- Other available historical time series to be used as explanatory factors (indices, spread curves, interest rates, interbank rates, foreign exchange rates, etc.)
Output:
- A polynomial-like formula describing the dependency of the target factor on the explanatory factors
Use:
- The generated formula can be used to develop a new type of instrument with a pricing approach based on a set of known factors
- Obtaining the contribution of each factor to the price development and risk of the instrument

26 Multifactor models object
[Diagram: from the available market factors, formula building and formula calibration link the target instrument to the explanatory factors; over time, the target instrument time series is plotted against the target instrument by formula, the explanatory factor time series and the other factors]
The target instrument is calculated by the formula.
The formula is built and periodically calibrated using the time series of the target instrument and of the explanatory instruments.

27 Stages of modeling
Start
1. Target factor selection
2. Explanatory factors suggestion/selection
3. Basis functions combination determination
4. Regression coefficients determination
5. Final formula determination and error calculation
End
Legend of the flow chart: all given factors in the system; determined by system and/or human; determined by system

28 Explanatory factors selection
Once a target factor is selected, explanatory factors are selected by automatic suggestion and/or by hand.
Automatic suggestion can be done by:
- Clustering: the explanatory factors are taken from the cluster into which the target factor is classified. If the number of explanatory factors determined in this way is insufficient, the number of clusters can be decreased in order to increase the number of elements in the cluster.
- Minimal covariances between candidate factors: the covariances between all factors are calculated, and the first n minimal covariances determine the factors.
- Maximal covariances between candidate factors and the target factor
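
The last suggestion rule can be sketched as a ranking of candidates by their covariance with the target. The sketch assumes the sample covariance and ranks by absolute value, so that strongly negative co-movement also counts. The class FactorSelector and this ranking choice are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: suggest explanatory factors by covariance with the target factor. */
class FactorSelector {

    /** Sample covariance of two series of equal length. */
    static double covariance(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(0.0);
        double meanB = Arrays.stream(b).average().orElse(0.0);
        double sum = 0.0;
        for (int t = 0; t < a.length; t++) sum += (a[t] - meanA) * (b[t] - meanB);
        return sum / (a.length - 1);
    }

    /** Indices of the n candidates with the largest absolute covariance with the target. */
    static int[] topByCovariance(double[] target, double[][] candidates, int n) {
        final double[] score = new double[candidates.length];
        Integer[] order = new Integer[candidates.length];
        for (int i = 0; i < candidates.length; i++) {
            score[i] = Math.abs(covariance(target, candidates[i]));
            order[i] = i;
        }
        // sort indices by descending score
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -score[i]));
        int[] result = new int[Math.min(n, order.length)];
        for (int i = 0; i < result.length; i++) result[i] = order[i];
        return result;
    }
}
```

Covariance depends on the scale of each series, so in practice the series would likely be normalized (or correlation used) before ranking.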

29 Formula builder
After the target and explanatory factors are selected, the formula building process starts, in which the system determines:
- A combination of basis functions applied to the explanatory factors. The basis functions are used to improve the accuracy and to avoid linear dependencies between factors, which cause problems in the matrix equations.
- The regression coefficients β_i (Beta factors)

$y = \beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \dots + \beta_n f_m(x_n) + \beta_{n+1} + \varepsilon$

y: target factor
x_1, x_2, ..., x_n: explanatory factors
β_1, β_2, ..., β_n, β_{n+1}: regression coefficients
f_1, f_2, ..., f_m: basis functions
ε: error
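
For a fixed choice of basis functions, the regression coefficients can be obtained by ordinary least squares. The sketch below assumes one basis function per explanatory factor plus a constant term, and solves the normal equations by Gaussian elimination with partial pivoting. The class BasisRegression is hypothetical, and the slides do not say which solver the system uses.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: least-squares fit of y = sum_j beta_j * f_j(x_j) + beta_(n+1). */
class BasisRegression {

    /**
     * factors[j][t] is explanatory factor j at date t, basis[j] is the function applied to it.
     * Returns n + 1 coefficients; the last one is the constant term.
     */
    static double[] fit(double[][] factors, DoubleUnaryOperator[] basis, double[] target) {
        int n = factors.length, dates = target.length, m = n + 1;
        double[][] x = new double[dates][m];
        for (int t = 0; t < dates; t++) {
            for (int j = 0; j < n; j++) x[t][j] = basis[j].applyAsDouble(factors[j][t]);
            x[t][n] = 1.0; // column for the constant term
        }
        // augmented normal equations: (X^T X | X^T y)
        double[][] a = new double[m][m + 1];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < m; c++) {
                for (int t = 0; t < dates; t++) a[r][c] += x[t][r] * x[t][c];
            }
            for (int t = 0; t < dates; t++) a[r][m] += x[t][r] * target[t];
        }
        // forward elimination with partial pivoting
        for (int col = 0; col < m; col++) {
            int pivot = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < m; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= m; c++) a[r][c] -= factor * a[col][c];
            }
        }
        // back substitution
        double[] beta = new double[m];
        for (int r = m - 1; r >= 0; r--) {
            double s = a[r][m];
            for (int c = r + 1; c < m; c++) s -= a[r][c] * beta[c];
            beta[r] = s / a[r][r];
        }
        return beta;
    }
}
```

If two transformed factors are linearly dependent, a pivot becomes zero and the solve breaks down, which is the matrix equation problem the basis functions are meant to avoid.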

30 Basis functions combination
Basis functions: f_1 exponent, f_2 logarithm, f_3 sine, f_4 cosine, ..., f_m hyperbolic tangent
Explanatory factors: x_1 (GOV Bel), x_2 (FX USD), x_3 (Oil price), ..., x_n (Gold price), each observed at Date 1, Date 2, ..., Date t, ..., Date K
Target factor: y (GOV Aut) and its estimation ỹ (GOV Aut estimation)
A combination of basis functions is applied to the explanatory factors, for example:

$\tilde{y} = \beta_1 f_3(x_1) + \beta_2 f_1(x_2) + \dots + \beta_n f_2(x_n) + \beta_{n+1} + \varepsilon$

Distance: $\varepsilon = (y - \tilde{y})^2$

31 Prototype system Generated formula Graphic results: Target and Multi-Factor

32 Prototype system settings

33 4. Implied Rating
[Figure: time series are used for scale building; classification on the resulting scale gives the implied rating (example: rating BB, tendency BBB)]

34 Implied Rating (Basel III)
Goal: build a rating scale based on explicit CDS time series and use it to identify both the implied rating and the tendency of a new CDS input series.
Input:
- A set of CDS time series that relate to assets or issuers (CDS spread curves or indices, bond prices, share prices, etc.)
- Rating system: the number of ratings on the rating scale and their symbols
Output:
- Scale with boundaries between the ratings
Use:
When the built scale is supplied with a new time series representing an issuer, the system identifies:
- The current rating, based on the historical development, giving more importance to the latest values
- The tendency, i.e. the next probable rating

35 Steps to obtain implied rating
Establishment of the rating scale:
- The available time series are used to build a given number of rating degrees and to determine their boundaries
- The time series are distributed into the given rating degrees according to their historical behavior
- The center of every degree is determined using all time series that belong to the degree
- The boundaries are derived from adjacent centers as series equally distant from both
Classification: a new time series is assigned to a rating class by comparing it with the centers of the scale classes (which are also time series) and finding the closest one.
The tendency is determined by:
- Finding the second closest center of the rating degrees
- Finding the closest boundary of the classified rating level
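
The classification step can be sketched as a nearest-center search. The sketch assumes a squared distance with exponentially decaying weights (latest value weighted most), takes the second closest center as the tendency, and builds a boundary as the pointwise average of two adjacent centers. The class ImpliedRating and the distance definition are assumptions.

```java
/** Hypothetical sketch: classify a new series against the centers of a rating scale. */
class ImpliedRating {

    /** Squared distance in which the value i steps before the end has weight decay^i. */
    static double weightedDistance(double[] series, double[] center, double decay) {
        double sum = 0.0;
        int n = series.length;
        for (int i = 0; i < n; i++) {
            double weight = Math.pow(decay, n - 1 - i);
            double d = series[i] - center[i];
            sum += weight * d * d;
        }
        return sum;
    }

    /** Returns {rating index, tendency index}: the closest and second closest center. */
    static int[] classify(double[] series, double[][] centers, double decay) {
        int best = -1, second = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        double secondDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = weightedDistance(series, centers[c], decay);
            if (d < bestDistance) {
                second = best;
                secondDistance = bestDistance;
                best = c;
                bestDistance = d;
            } else if (d < secondDistance) {
                second = c;
                secondDistance = d;
            }
        }
        return new int[] {best, second};
    }

    /** Boundary between two adjacent degrees: pointwise average of their centers. */
    static double[] boundary(double[] lowerCenter, double[] upperCenter) {
        double[] b = new double[lowerCenter.length];
        for (int i = 0; i < b.length; i++) b[i] = 0.5 * (lowerCenter[i] + upperCenter[i]);
        return b;
    }
}
```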

36 Rating degrees boundaries
[Chart: time series of the rating degrees AA, AA- and A]
The time series of the rating degrees may overlap.
The points of a rating boundary are calculated as the average of the corresponding points of the centers of the series in the adjacent rating degrees.
The center of the series in a given degree does not lie in the middle between the degree boundaries, because in most cases the time series are non-uniformly distributed.

37 Rating degrees boundaries
[Chart: degrees A and AA with their boundary and degree centers, y-axis 0,20% to 1,80%]
The boundary lies midway between the centers of the series.
The center of the series does not lie midway between the boundaries.

38 Weighting the historical values
Weighting of the series values (EWMA with a decay factor) is applied in order to give more importance to the most recent values.
[Chart: series of degree AA, y-axis 0,00% to 1,60%; the last series values lie within the degree boundaries]

39 Determine the rating of a new series
In the classification phase, histograms are built for the distributions of the data within the best and second best degrees (corresponding to the rating and the tendency).
The histograms are shown together with the centers of the classes and the mean of the newly classified series.
The mean and standard deviation used to build the histograms are calculated with the same decay factor that was used to build the rating scale.
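
A decay-weighted mean and standard deviation can be sketched as below, assuming normalized weights proportional to decay^age, where the latest value has age 0. The class DecayStatistics is hypothetical and the exact weighting used by the system is not given in the slides.

```java
/** Hypothetical sketch: mean and standard deviation with an exponential decay factor. */
class DecayStatistics {

    /** Normalized weights; the last element of the series receives the largest weight. */
    static double[] weights(int n, double decay) {
        double[] w = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            w[i] = Math.pow(decay, n - 1 - i);
            total += w[i];
        }
        for (int i = 0; i < n; i++) w[i] /= total;
        return w;
    }

    static double mean(double[] values, double decay) {
        double[] w = weights(values.length, decay);
        double m = 0.0;
        for (int i = 0; i < values.length; i++) m += w[i] * values[i];
        return m;
    }

    static double standardDeviation(double[] values, double decay) {
        double[] w = weights(values.length, decay);
        double m = mean(values, decay);
        double v = 0.0;
        for (int i = 0; i < values.length; i++) {
            double d = values[i] - m;
            v += w[i] * d * d;
        }
        return Math.sqrt(v);
    }
}
```

With a decay factor of 1 the statistics reduce to the ordinary mean and population standard deviation; smaller factors pull both toward the latest values.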

40 Determine the rating of a new series
[Chart: Classification of a new series - Barclays Bank PLC. The new time series to be classified (yellow) is plotted against the degrees A and AA, y-axis 0,20% to 1,60%]
[Chart: histograms with the mean of the new series, the rating AA and the tendency to A, built from mean and standard deviation with decay factor]

41 Prototype system Time series used to build the scale Built ratings scale New input Rating and tendency Rating system Series within the selected degree Histograms for rating&tendency New input mean

42 5. Time Series Prediction
[Figure: prediction extends a time series by its predicted future values, giving the predicted time series]

43 Time series prediction
Goal: predict a given time series over a given time horizon by analyzing its historical development.
Input:
- A time series
- Settings that depend on the approach used (for example learning iterations, time window size, etc.)
Output:
- The given time series with additional predicted values
- Confidence bounds
- Prediction quality statistics
Use:
- The predicted values can be used as the most probable future values, for instance in algorithmic trading

44 Time series prediction
The most commonly used prediction methods are:
- Averages (MA, WMA, EWMA, etc.)
- Autoregressive methods (AR, ARMA, ARIMA, SARIMA, ARMAX, SETAR, etc.) with the Box-Jenkins methodology
- Trend extrapolation (based on LSE, trend polynomial fitting, etc.)
- Neural networks (MLP, RBF, SOM, ART, recurrent Elman/Jordan networks, etc.); neural networks are used in the current approach
- Other regression-based (e.g. observers) and econometric models
- Kalman, Wiener and other filters
- Wavelet-based methods
- Holt-Winters decomposition
- Hybrid approaches
The prediction can be used for technical analysis.
Confidence bounds are used.
Predictability indicators can be suggested (Hurst exponent, etc.).
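
As an illustration of the simplest family in the list above (averages), the sketch below produces a one-step EWMA forecast and a residual standard deviation that could serve as the basis for confidence bounds. The class ExponentialSmoothing, the initialization with the first value and the use of one-step errors are assumptions; this is not the neural network approach used by the module.

```java
/** Hypothetical sketch: EWMA (simple exponential smoothing) forecast with an error estimate. */
class ExponentialSmoothing {

    /** One-step-ahead forecast: the smoothed level after the last observation. */
    static double forecast(double[] values, double alpha) {
        double level = values[0];
        for (int t = 1; t < values.length; t++) {
            level = alpha * values[t] + (1.0 - alpha) * level;
        }
        return level;
    }

    /** Root mean square of the one-step-ahead errors made while smoothing the history. */
    static double residualStdDev(double[] values, double alpha) {
        double level = values[0];
        double sumSquares = 0.0;
        for (int t = 1; t < values.length; t++) {
            double error = values[t] - level; // level is the forecast made before seeing values[t]
            sumSquares += error * error;
            level = alpha * values[t] + (1.0 - alpha) * level;
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }
}
```

Under the additional assumption of roughly normal errors, forecast plus or minus 1.96 times residualStdDev would give approximate 95% bounds for the next value.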

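45 Prediction by neural network
Model identification: a sliding window moves over the historical values; each window is the input vector and the following value is the output vector of the neural network. The network is optimized against a target function.
Prediction: the trained neural network is applied recursively over the horizon (recursive prediction).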
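
The data flow around the network can be sketched without the network itself. The sketch assumes a window of fixed size, one output value per window, and a model passed in as a plain function from window to next value. The class SlidingWindow is hypothetical, and the lambda in the usage note is a stand-in for a trained neural network.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: training pairs from a sliding window and recursive prediction. */
class SlidingWindow {

    /** Input vectors: every run of `window` consecutive values that has a successor. */
    static double[][] inputs(double[] series, int window) {
        double[][] rows = new double[series.length - window][];
        for (int t = 0; t < rows.length; t++) {
            rows[t] = Arrays.copyOfRange(series, t, t + window);
        }
        return rows;
    }

    /** Output values: the value that follows each input window. */
    static double[] targets(double[] series, int window) {
        return Arrays.copyOfRange(series, window, series.length);
    }

    /** Feeds each prediction back into the window to predict `horizon` steps ahead. */
    static double[] predictRecursively(double[] history, int window, int horizon,
                                       ToDoubleFunction<double[]> model) {
        int n = history.length;
        double[] extended = Arrays.copyOf(history, n + horizon);
        for (int h = 0; h < horizon; h++) {
            double[] input = Arrays.copyOfRange(extended, n + h - window, n + h);
            extended[n + h] = model.applyAsDouble(input);
        }
        return Arrays.copyOfRange(extended, n, n + horizon);
    }
}
```

For example, predictRecursively(history, 2, 3, w -> 2 * w[1] - w[0]) extrapolates linearly for three steps; a trained network would take the place of the lambda. Because each step consumes earlier predictions, errors accumulate over the horizon.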

46 Prediction by neural network
Modeling process: data pre-processing, modeling of the NN architecture (manual, by trial and error), training, application of the NN model, evaluation.
[Diagram: historical values pass through preprocessing, the neural network and post-processing to give the prediction with confidence bounds over the horizon]
The prediction generally includes data pre-processing, solving of matrix equations (in batch or iteratively) and data post-processing.

47 Prototype system Preprocessing Prediction methods Values Time horizon Test Error graphic Confidence bounds

48 Modules dependencies
[Diagram: the five modules and the shared components they depend on]
Shared components: Series processing, Calculations, Neural Networks
1. Clustering
2. Non-normal distributions: distributions and parameters estimation
3. Multifactor Models: formula building, factors selection
4. Implied Ratings: rating scale building, histograms
5. Prediction: learning & prediction
Input: Time series