4.3 FUZZY IMAGE PROCESSING APPLIED TO TIME SERIES ANALYSIS
R. Andrew Weekley 1*, Robert K. Goodrich 1,2, Larry B. Cornman 1
1 National Center for Atmospheric Research**, Boulder, CO
2 University of Colorado, Department of Mathematics, Boulder, CO

* Corresponding author address: R. Andrew Weekley, NCAR/RAP, P.O. Box 3000, Boulder, CO; [email protected]
** The National Center for Atmospheric Research is sponsored by the National Science Foundation.

1. INTRODUCTION

The analysis of time series data plays a fundamental role in science and engineering. An important analysis step is the identification and classification of various features in the data. Quality control can be viewed as a subclass of general feature identification and classification - for example, differentiating between a true signal and a contaminating signal. Many algorithms exist for the quality control of time series data, such as Fourier or wavelet analysis, as well as robust and standard statistics (Abraham and Ledolter, 1983; Priestley, 1981; Barnett and Lewis, 1977). However, these algorithms are applicable only when certain assumptions are satisfied, such as stationarity and a relatively small number of outliers (less than 50% of the data). Unfortunately, when an instrument is failing, the assumptions of the standard methods are often violated, and the methods are therefore not applicable; a quality indicator is nevertheless still needed. Typically, the image-processing ability of a human analyst is used to identify and mitigate failure-mode data. Human analysts are adept at feature identification and classification, but in many applications an automated algorithm that performs this role is desired. In this paper, a machine-intelligent algorithm that mimics the feature classification and identification processing of the human analyst is presented and applied to the quality control of time series data. In cases where there are a large number of outliers - a situation that is problematic for most algorithms - this algorithm is able to classify the desired signal as well as the outliers.

Figure 1. Nominal anemometer time series data. The vertical axis is wind speed in m s^-1; the horizontal axis is time in seconds.
Figure 2. A nearby anemometer time series in a failure mode. Here a nut holding the anemometer head has worked loose.
Figure 3. Data from a spinning anemometer. The vertical axis is wind direction measured in degrees from north in a clockwise direction; the horizontal axis is time in seconds.
Figure 4. Data from an anemometer in a failure mode caused by a bad transistor.

For example, consider the time series in Figure 1, consisting of anemometer measurements of wind speed in m s^-1 as a function of time in seconds. Aside from a few isolated outliers, it is straightforward to see that this is good quality data, and most quality control algorithms would not have difficulty in diagnosing the outliers. On the other hand, the data in Figure 2 is certainly more complex, showing similar regions of good data as in the previous figure, but intermixed
with numerous outliers. This data is from an anemometer located three meters from the one whose data is shown in Figure 1, and from the same time period. In this situation, standard methods might work well on certain segments of the time series, but other sections - such as between 800 and 1000 seconds - might cause problems. However, except for some of the points close to the good data, a human analyst would not have much difficulty in discerning the good from the bad data (in fact, this anemometer was having a mechanical failure in which strong winds vibrated, then loosened, the nuts holding the anemometer in place).

Consider the additional cases shown in Figure 3 and Figure 4. From video footage, it has been observed that certain wind frequencies can excite normal modes of this type of anemometer's wind direction head and cause the device to spin uncontrollably. Data from such a case can be seen in Figure 3, where the vertical axis is wind direction measured in a clockwise direction from north and the horizontal axis is again time measured in seconds. Between about 500 and 1000 seconds the wind direction device is spinning, and the data become essentially a random sample from a uniform distribution between about 50 and 360 degrees. The true wind direction is seen intermittently at about 225 degrees, which is in general agreement with the value from another nearby anemometer. Figure 4 shows the wind direction at another time, distinct from that in Figure 3; the true wind direction is around 40 degrees. Notice the suspicious streaks in the time series data near 200 degrees, as well as other spurious data points. Again, standard time series algorithms would have a difficult time with these two examples, yet it is straightforward for a human analyst to identify both the suspect and the good data. This illustrates another aspect of the time series problem beyond identifying points as outliers, namely classifying the nature of the outliers. It is difficult to create a single algorithm that can detect and identify many different types of data quality problems.

The fact that the human analyst is able to quality control data in such pathological cases motivated the development of a multi-component, fuzzy logic, machine-intelligent algorithm: the Intelligent Outlier Detection Algorithm (IODA). IODA incorporates cluster analysis, fuzzy image processing, local and global analysis, correlation structure, and a priori knowledge when available, and returns a quality control index (confidence value) between 0 and 1 that indicates the reliability of the data. In this paper the techniques used to accomplish such processing are presented in the context of anemometer time series data. It is important to note that these techniques could be modified to accommodate time series data from other instruments, with different failure processes, or failure modes.

2. FUZZY IMAGE PROCESSING

When creating a fuzzy logic algorithm, the characteristics and rules a human expert might use to partition the data (Zimmerman, 1996) into a classification set C must be determined. Various tests are devised to ascertain whether a certain datum should be in C. The results of each test are transformed into values between 0 and 1 by membership functions. A membership function value near 1 for some test T is interpreted to mean that the test indicated the datum is likely to belong to C, and a value near zero indicates the datum is not likely to belong to C.
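As a concrete illustration, the following minimal sketch (Python with numpy; the two tests and the membership shapes are hypothetical examples, not the operational IODA tests) maps raw test results into [0, 1] and combines them:

```python
import numpy as np

def sigmoid_membership(x, center, width):
    """Map a raw test statistic to [0, 1]; values above `center`
    increasingly indicate membership in the class C."""
    return 1.0 / (1.0 + np.exp(-(x - center) / width))

def fuzzy_and(*memberships):
    """Product t-norm: all tests must agree for a high value."""
    return np.prod(np.stack(memberships), axis=0)

def fuzzy_or(*memberships):
    """Probabilistic-sum s-norm: any strong test raises the value."""
    m = np.stack(memberships)
    return 1.0 - np.prod(1.0 - m, axis=0)

# Two hypothetical tests for one datum: distance from a local fit
# (in units of sigma, negated so small distances score high) and
# the local point density around the datum.
fit_score = sigmoid_membership(-1.5, center=-3.0, width=0.5)
density_score = sigmoid_membership(0.7, center=0.4, width=0.1)
print(fuzzy_and(fit_score, density_score))  # confidence in [0, 1]
```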
These membership functions are combined in many different ways by fuzzy logic rules to obtain a final membership function. A threshold is set, and the points with final membership values above this threshold are defined to be in C. If a threshold is not used, the value of the combined membership function may be used to indicate the degree to which the datum belongs to C (a confidence value). This final membership function is called the fuzzy set C, since the membership value indicates the degree to which a datum belongs to C. This methodology allows for a multiplicity of tests and membership functions. In the case of fuzzy image processing, a membership value is assigned to each point from an analysis of an image (Chi, Yan and Pham, 1996). This value can be thought of as a height, and by interpolation a surface. See for example Figure 6, where a cold color (blue) represents a greater height in the surface.

3. INTELLIGENT OUTLIER DETECTION ALGORITHM (IODA)

Suppose the data from Figure 2 is broken into overlapping subregions using a sequence of running windows. For each data window, an estimate of the probability density function (i.e., a normalized histogram) is calculated. (Note that the window size must be selected large enough to contain enough data to give reasonable estimates of the density functions, but small enough to reflect the local structure of the non-stationary time series.) These histograms can be plotted as a waterfall plot, or stacked histograms, as shown in Figure 5.
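A minimal sketch of this running-window construction (Python with numpy; the window length, stride, and bin count are illustrative choices, not IODA's operational values) follows:

```python
import numpy as np

def histogram_field(t, y, window=128, stride=32, nbins=50):
    """Estimate a density for each overlapping running window and
    stack the normalized histograms into an image (the histogram
    field): column j is the density estimate for window j."""
    edges = np.linspace(y.min(), y.max(), nbins + 1)
    columns, centers = [], []
    for start in range(0, len(y) - window + 1, stride):
        hist, _ = np.histogram(y[start:start + window],
                               bins=edges, density=True)
        columns.append(hist)
        centers.append(t[start + window // 2])
    return np.asarray(centers), edges, np.column_stack(columns)

# Illustrative data: nominal wind speeds near 17 m/s with a block
# of dropouts near zero, loosely mimicking the loose-nut failure.
t = np.arange(3600.0)
y = 17.0 + np.random.randn(t.size)
y[800:1000] = np.abs(0.5 * np.random.randn(200))
times, speed_edges, field = histogram_field(t, y)
# Contouring `field` against `times` and the bin centers (e.g. with
# matplotlib's contourf) yields an image analogous to Figure 6.
```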
Figure 5. Plot of stacked time series histograms. The feature at the left represents the dropout data caused by the loose nut.
Figure 6. The data from Figure 5 plotted from left to right. The horizontal axis is time in seconds; the vertical axis is wind speed in m s^-1. The color represents the height of the histogram, with blue a greater height and red a height near zero.

The histogram for the first time window is shown at the bottom left; the plots then run up the left column as a function of time and continue from the bottom right plot to the top right. These stacked histograms can also be plotted as a contour image, left to right as a function of time, as shown in Figure 6. The image in Figure 6 is the first fuzzy interest map used in IODA, and is called the histogram field (Figure 6 is a plot of the entire hour of data, whereas Figure 5 shows only the first 555 data points). The contour plot in Figure 6 can be thought of as a surface, where the color of the contour represents the height of the surface above each point in the time-wind speed plane. The dark blue colors represent a larger height, whereas the dark red colors represent a height near zero.

It is natural for a human to see large clumps of blue in Figure 6. These regions in the image can be encircled using a contour algorithm, and they define concentrations or clusters of points, as shown in Figure 7, where the cluster boundaries that surround the data points of each cluster are drawn in blue. These clusters are found in the histogram field by selecting a threshold for the contour algorithm; if a lower threshold is selected, the clusters grow in size and connect. A sequence of clusters can be found by incrementally lowering the contour threshold, or "lowering the water." This expression reflects the idea that the contour surface is a set of mountain peaks and the threshold level represents a water level: as the level is lowered, the peaks become connected by ridge lines.

Figure 7. A threshold is selected and a contour plot is drawn around the data above the threshold. This breaks the data into several clusters.
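The following sketch illustrates the "lowering the water" procedure on such an image (assuming numpy and scipy's ndimage for connected-component labeling; the water levels are arbitrary illustrative values):

```python
import numpy as np
from scipy import ndimage

def lower_the_water(field, levels):
    """For a decreasing sequence of thresholds ("water levels"),
    return the connected regions of the image standing above each
    level; as the level drops, peaks grow and join along ridges."""
    clusters_by_level = {}
    for level in sorted(levels, reverse=True):
        labels, n = ndimage.label(field > level)
        clusters_by_level[level] = [np.argwhere(labels == k)
                                    for k in range(1, n + 1)]
    return clusters_by_level

# Stand-in for the histogram field: smoothed random peaks.
field = ndimage.gaussian_filter(np.random.rand(50, 100), sigma=3)
for level, clusters in lower_the_water(field, [0.55, 0.50, 0.45]).items():
    print("level %.2f: %d clusters" % (level, len(clusters)))
```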
Notice these blue clumps do not contain all the data points in the original time series - that is, there is cluster data and non-cluster data - but by inspection the analyst combines these local clusters into larger scale features. For instance, in Figure 6 a human expert might group the large clusters centered around 17 m s^-1 into one feature and the others near 1 m s^-1 into a second feature. In actuality there are three tasks at hand: characterize/categorize the clusters, group the clusters into features, and characterize/categorize the non-cluster data. In this way, the good and bad data are distinguished. For example, in this case there are three cluster categories: atmospheric clusters, failure mode clusters, and unknown clusters. Atmospheric clusters contain points that fit an expected model; failure mode (non-atmospheric) clusters fit an expected model for a particular failure mode; unknown clusters are neither atmospheric nor non-atmospheric. The notions of atmospheric and non-atmospheric clusters are quantified using fuzzy logic algorithms. These fuzzy logic algorithms require a priori knowledge of the characteristics of the atmospheric signal as well as knowledge of the failure modes. One of the strengths of fuzzy logic algorithms is the ease with which new characteristics can be added when needed or discovered. It is important to note that not all of the data points will clearly exhibit the a priori expected behavior. Again, this natural ambiguity in classification is handled by fuzzy logic methods.

3.1 Classifying clusters

In general, clusters found using the above techniques will, by definition, group data structures together. Clusters can be classified according to whether they are consistent with a known problem or failure mode, such as the loose nut scenario, or result from nominal atmospheric data, i.e., data with expected statistics. One such characterization is auto-correlation as defined by lags of the data. Consider the scatter plot of y(t) vs. y(t+1), or lag 1, for the loose nut case, shown in Figure 8, where y(t) is the wind speed at time t. The color of each point is the geometric mean of the initial confidences of the points y(i) and y(i+h):

C_{i,i+h} = \sqrt{C_i \, C_{i+h}}

The solid black line is the confidence-weighted linear best fit to the data (here h=1).

Figure 8. A plot of the ordered pairs (y(t), y(t+1)). The vertical and horizontal axes are wind speed in m s^-1.

Notice in the lag scatter plot there are two distinct groups of data: the atmospheric data centered near 18 m s^-1, and the dropout data centered near the origin. The fact that there are two groups of data in the lag plot indicates that the data is not stationary, and this fact can be used later to separate the histogram clusters into stationary groups of clusters. It is also possible to define a confidence-weighted auto-correlation. In lag(1) space, pairs of points that fit the expected model should cluster around a line with a slope close to one. A value of ρ = ρ(1) (the sample correlation) close to zero indicates a poor fit, and a value near one indicates an excellent fit; in fact, ρ squared represents the percent of variation in y(i+1) explained by the fit.

Techniques similar to those applied to the time series data can be applied to the lag plot to find clusters of points. An image can be created by calculating a local density using a tiling of overlapping rectangles; a contour algorithm can then be applied to the resulting image and clusters of data can be found (other data clustering techniques would work well for this problem too). Again the water can be lowered so that new clusters are found for each contour threshold, and a value of ρ can be calculated for each of these new clusters. The largest cluster that has a large value of ρ and contains points close to the best fit line is then selected, as shown in Figure 8; this is done by a fuzzy algorithm. The color of the contour surrounding the lag points represents the atmospheric score given to that cluster, where a cool color is a high score and a warm color is a low score. Notice that the high scoring cluster in Figure 8 contains the data from the time series plot that a human would probably classify as atmospheric. The pairs of points y(i) and y(i+1) in this lag cluster are termed atmospheric, since they fit the expected auto-correlation model of atmospheric data. It is now possible to calculate an atmospheric score for each cluster in time series space.
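A minimal sketch of the confidence-weighted lag statistics described above (assuming numpy; the synthetic series and uniform initial confidences are illustrative only):

```python
import numpy as np

def lag_statistics(y, conf, h=1):
    """Confidence-weighted lag-h best fit and sample correlation.
    Each pair (y(i), y(i+h)) is weighted by the geometric mean
    C_{i,i+h} = sqrt(C_i * C_{i+h}) of the pointwise confidences."""
    x, z = y[:-h], y[h:]
    w = np.sqrt(conf[:-h] * conf[h:])
    w = w / w.sum()
    xm, zm = np.sum(w * x), np.sum(w * z)
    cov = np.sum(w * (x - xm) * (z - zm))
    varx = np.sum(w * (x - xm) ** 2)
    varz = np.sum(w * (z - zm) ** 2)
    slope = cov / varx                # confidence-weighted best fit
    intercept = zm - slope * xm
    rho = cov / np.sqrt(varx * varz)  # weighted correlation
    return slope, intercept, rho

# For well-correlated "atmospheric" data the lag-1 pairs fall along
# a line of slope near one, and rho is near one.
y = 18.0 + np.cumsum(0.1 * np.random.randn(1000))
conf = np.ones_like(y)                # uniform initial confidences
slope, intercept, rho = lag_statistics(y, conf)
print("slope=%.2f  rho=%.2f" % (slope, rho))
```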
3.2 Additional membership values

Numerous additional membership values can be calculated. For instance, a membership value can be calculated by recursively fitting the time series with straight lines. Suppose a segment of data is fit with a line, and a quality of fit (such as ρ squared) is calculated for the data. If the quality indicator is too low, the best fit line is bisected and new fits are calculated for the two new sets of data; fits are calculated until the quality indicator is good enough or there are too few points in the fit. A local best fit confidence can then be calculated from how far a point is from the fit. Another, similar membership value can be calculated from the lag plot and the atmospheric lag cluster. Specifically, the number of sigma a point lies from the best fit line can be calculated given the variance in the atmospheric lag cluster data. A lag cluster nsigma confidence can then be calculated using an appropriate membership function such as e^{-x^2}.
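The sketch below illustrates both membership values (assuming numpy; min_r2 and min_points are hypothetical tuning parameters, as the paper does not give its thresholds):

```python
import numpy as np

def recursive_fit(t, y, min_points=10, min_r2=0.8, segments=None):
    """Fit a straight line to a segment; if r-squared is too low
    and enough points remain, bisect the segment and fit each half."""
    if segments is None:
        segments = []
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot if ss_tot > 0 else 1.0
    if r2 >= min_r2 or len(t) < 2 * min_points:
        segments.append((t[0], t[-1], slope, intercept))
    else:
        mid = len(t) // 2
        recursive_fit(t[:mid], y[:mid], min_points, min_r2, segments)
        recursive_fit(t[mid:], y[mid:], min_points, min_r2, segments)
    return segments

def nsigma_membership(residual, sigma):
    """Membership of the e^(-x^2) form, where x is the number of
    sigma a point lies from the best fit line."""
    return np.exp(-(residual / sigma) ** 2)

# A piecewise-linear series forces at least one bisection.
t = np.linspace(0.0, 100.0, 400)
y = np.where(t < 50.0, 0.2 * t, 10.0 - 0.1 * (t - 50.0))
y = y + 0.3 * np.random.randn(t.size)
for t0, t1, slope, intercept in recursive_fit(t, y):
    print("segment [%.0f, %.0f]: slope %.3f" % (t0, t1, slope))
```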
3.3 Final confidence calculation and the final feature

Recall from Figure 6 that there were multiple clusters in the primary mode data. These clusters can be combined into a single large cluster, or final feature, that spans the entire time interval. The idea is to partition the clusters into meta-clusters, or to cluster the clusters. In the histogram field shown in Figure 6, both the good clusters (centered on 17 m s^-1) and the dropout clusters (near zero) appear as bright blue regions. Recall that there were two clusters of data in the lag plot (Figure 8); if the time series data were stationary, there would be only a single lag cluster, i.e., the points in the lag-1 plot would be distributed along the one-to-one line. The fact that there are two such clusters can be used to partition the histogram clusters into meta-clusters that belong to the same stationary feature, by determining which data points occur in which cluster in both Figure 8 and Figure 6. For instance, suppose all the points in the large cluster near the center of Figure 8 are given a large stationary lag cluster membership value. Next, the stationary histogram cluster membership value is calculated by finding all the histogram clusters with a point that has a large stationary lag cluster membership value. Consequently, all clusters centered on 17 m s^-1 in Figure 6 will be given a large stationary lag cluster membership value, and hence belong to the same stationary feature.

A combined membership value, or final confidence, for each point can be calculated (Figure 9) by a fuzzy combination of all the individual membership values, i.e., the histogram membership value, the stationary feature membership value, the local best fit value, the lag cluster nsigma value, and so on. Notice that the combined membership value correctly gives a low confidence to the data dropouts, and to the spurious points that fall between the dropouts and the primary signal. The confidence values perform well in Figures 10 and 11 as well. Figure 10 shows the final confidence values for the data from Figure 3; most of the points from the time periods when the anemometer was spinning are given low membership values (red). Figure 11 shows the final confidence values for the data in Figure 4; the points in the suspicious streaks near 200 degrees are given low membership values (red). The confidence values in Figures 9, 10 and 11 identify most of the outliers. Simulations and human verifications have been done, and IODA does well in the classification of outliers; these studies will be reported elsewhere.

Figure 9. The data from Figure 2 is shown, where the color of each point is determined by the combined membership function. A blue color has a membership near 1, and a red color has a membership near zero.
Figure 10. The data from Figure 3 is shown, with blue colors representing points with high membership values.
Figure 11. The data from Figure 4 is shown. Notice the streaks in the data are given a low membership value.
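As an illustration of the combination step, a weighted mean over the individual membership fields is one simple rule (a sketch only - the paper does not specify IODA's exact combination operator, and the membership values below are hypothetical):

```python
import numpy as np

def final_confidence(memberships, weights=None):
    """Combine per-point membership fields into a final confidence
    using a weighted mean (one simple fuzzy combination rule)."""
    m = np.vstack(memberships)
    if weights is None:
        weights = np.ones(m.shape[0])
    return np.average(m, axis=0, weights=np.asarray(weights, float))

# Hypothetical membership values for a five-point series: the third
# and fifth points score poorly on every test.
histogram_m  = np.array([0.9, 0.8, 0.1, 0.9, 0.2])
stationary_m = np.array([0.8, 0.9, 0.2, 0.8, 0.1])
local_fit_m  = np.array([0.9, 0.7, 0.3, 0.9, 0.3])
nsigma_m     = np.array([1.0, 0.8, 0.1, 0.9, 0.2])
conf = final_confidence([histogram_m, stationary_m, local_fit_m, nsigma_m])
print(np.round(conf, 2))  # low final confidences flag likely outliers
```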
4. ADDITIONAL APPLICATIONS

The data in Figure 12 is the vertical wind as measured by an aircraft flying in the vicinity of Juneau. As before, the color of the points indicates the confidence in the data. There are two categories of data in this plot: semi-continuous data (mostly cool colors), and disperse data (warm points centered on 0 m s^-1). On a gross scale, the confidence values assigned to the data correspond to what a human expert might give the points. Upon closer inspection, however, there are low confidence points intermixed with the semi-continuous data. These halo points are not well auto-correlated and hence are given a low atmospheric membership value. The confidence values shown in Figure 12 were calculated by IODA without any modification or tuning of parameters, and it is important to note that during the development of IODA such an example was never explicitly considered (although individual aspects were, such as the near-uniform appearance of the disperse data). This example is encouraging evidence that the principles and implementation of IODA can be applied to time series data from sources other than anemometers.

Figure 12. An example of aircraft wind speed time series data is shown. The vertical axis is in m s^-1 and the horizontal axis is in seconds. High membership values are blue.

Figure 13 shows the result of a simulation. Points were selected at random from an interval of time with few outliers, and the selected data points were replaced with values drawn from a uniform distribution with the same range as the nominal anemometer data. As can be seen from Figure 13, the underlying time series is discernible to the human eye (blue). Furthermore, the confidences assigned to the time series again roughly correspond to what a human might assign to the data (given infinite patience). Such examples are seen in lidar data in the presence of a weak signal return.

Figure 13. A simulation of anemometer data with uniform background noise. Notice the signal is given high membership values (blue).

5. CONCLUSIONS

We have studied the use of fuzzy image processing techniques to find outliers in time series data. Even though IODA was developed using anemometer data and their specific failure modes, these techniques should apply to other data sets as well. Specifically, if the time series is approximately stationary over the analysis time window considered, the range of the correlation coefficients for nominal data is understood in the first few lag spaces, the outliers are not well correlated with the true data, and the statistical properties of the outliers are known, then such an analysis should perform well. That is, if a human expert can separate the true data from the outliers in these and other images, then there is hope of constructing fuzzy modules to separate the data from the outliers.

6. REFERENCES

Abraham, B. and J. Ledolter, 1983: Statistical Methods for Forecasting. John Wiley and Sons, New York, 445 pp.

Barnett, V. and T. Lewis, 1977: Outliers in Statistical Data, 3rd ed. John Wiley and Sons, New York, 584 pp.

Chi, Z., H. Yan and T. Pham, 1996: Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition. World Scientific, Singapore, 225 pp.

Priestley, M. B., 1981: Spectral Analysis and Time Series. Academic Press, New York, 890 pp.

Zimmerman, H. J., 1996: Fuzzy Set Theory and Its Applications. Kluwer Academic Publishers, Boston, 435 pp.

7. ACKNOWLEDGEMENTS

We thank Brant Foote for providing funding through the National Center for Atmospheric Research, Research Applications Program opportunity funds. We also thank Corinne Morse and Inger Gallo for their editorial assistance.