Three Myths about Dynamic Time Warping Data Mining
|
|
|
- Marilynn Harper
- 10 years ago
- Views:
Transcription
1 Three Myths about Dynamic Time Warping Data Mining Chotirat Ann Ratanamahatana Eamonn Keogh Department of Computer Science and Engineering University of California, Riverside Riverside, CA { ratana, eamonn }@cs.ucr.edu Abstract The Dynamic Time Warping (DTW) distance measure is a technique that has long been known in speech recognition community. It allows a non-linear mapping of one signal to another by minimizing the distance between the two. A decade ago, DTW was introduced into Data Mining community as a utility for various tasks for time series problems including classification, clustering, and anomaly detection. The technique has flourished, particularly in the last three years, and has been applied to a variety of problems in various disciplines. In spite of DTW s great success, there are still several persistent myths about it. These myths have caused confusion and led to much wasted research effort. In this work, we will dispel these myths with the most comprehensive set of time series experiments ever conducted. Keywords Dynamic Time Warping, Data Mining, Experimentation. 1 Introduction In recent years, classification, clustering, and indexing of time series data have become a topic of great interest within the database/data mining community. The Euclidean distance metric has been widely used [9], in spite of its known weakness of sensitivity to distortion in time axis [6]. A decade ago, the Dynamic Time Warping (DTW) distance measure was introduced to the data mining community as a solution to this particular weakness of Euclidean distance metric [2]. This method s flexibility allows two time series that are similar but locally out of phase to align in a nonlinear manner. In spite of its O(n 2 ) time complexity, DTW is the best solution known for time series problems in a variety of domains, including bioinformatics [1], medicine [4], engineering, entertainment [22], etc. The steady flow of research papers on data mining with DTW became a torrent after it was shown that a simple lower bound allowed DTW to be indexed with no false dismissals [6]. The lower bound requires that the two sequences being compared are of the same length, and that the amount of warping is constrained. This work allowed practical applications of DTW, including real-time queryby-humming systems [22], indexing of historical handwriting archives [17], and indexing of motion capture data [5]. In spite of the great success of DTW in a variety of domains, there still are several persistent myths about it. These myths have caused great confusion in the literature, and led to the publication of papers that solve apparent problems that do not actually exist. The three major myths are: Myth 1: The ability of DTW to handle sequences of different lengths is a great advantage, and therefore the simple lower bound that requires different-length sequences to be reinterpolated to equal length is of limited utility [1][19][21]. In fact, as we will show, there is no evidence in the literature to suggest this, and extensive empirical evidence presented here suggests that comparing sequences of different lengths and reinterpolating them to equal length produce no statistically significant difference in accuracy or precision/recall. Myth 2: Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints[19]. In fact, the opposite is true. As we will show, the 1% constraint on warping inherited blindly from the speech processing community is actually too large for real world data mining. Myth 3: There is a need (and room) for improvements in the speed of DTW for data mining applications. In fact, as we will show here, if we use a simple lower bounding technique, DTW is essentially O(n) for data mining applications. At least for CPU time, we are almost certainly at the asymptotic limit for speeding up DTW. In this paper, we dispel these DTW myths above by empirically demonstrate our findings with a comprehensive set of experiments. This work is part of an effort to redress these mistakes. In terms of number of objective datasets and size of datasets, our experiments are orders of magnitude greater than anything else in the literature. In
2 particular, our experiments required more than 8 billion DTW comparisons. The rest of the paper is organized as follows. The next three sections consider each of the three myths above with a comprehensive set of experiments, testing on a wide range of both real and synthetic datasets. Section 5 gives conclusions and directions for future work. Due to space limitations, we decided to omit the datasets details and background/review of DTW (which can be found in [16]). However, their full details and actual datasets have been publicly available for free download at [8]. 2 Does Comparing Sequences of Different Lengths Help or Hurt? Many recent papers suggest that the ability of classic DTW to deal directly with sequences of different length is a great advantage; some paper titles even contain the phrase of different lengths [3][13] showing their great concerns in solving this issue. These claims are surprising in that they are not supported by any empirical results in the papers in question. Furthermore, an extensive literature search through more than 5 papers dating back to the 1 s failed to produce any theoretical or empirical results to suggest that simply making the sequences to be of the same length has any detrimental effect. To test our claimed hypothesis that there is no significant difference in accuracies between using variable-length time series and equal-length time series in DTW calculation, we carry out an experiment as follows. For all variable-length time series datasets (,,, and Wordspotting See [8] for dataset details), we compute 1-nearest-neightbor classification accuracies (leaving-one-out) using DTW for all warping window sizes (1% to %) in two different ways:- (1) The 4S way; we simply reinterpolated the sequences to have the same length, and (2) By comparing the sequences directly using their original lengths. To give the benefit of the doubt to different-length case, for each individual warping window size, we do all four possible normalizations above, and the best performing of the four options is recorded as the accuracy for the variable-length DTW calculation. For completeness, we test over every possible warping constraint size. Note that we start the warping window size of 1% instead of % since % size is Euclidean distance metric, which is undefined when the time series are not of the same length. Also, when measuring the DTW distance between two time series of different lengths, the percentage of warping window applied is based on the length of the longer time series to ensure that we allow adequate amount of warping for each pair and deliver a fair comparison. The variable-length datasets are then linearly reinterpolated to have the same length of the longest time series within each dataset. Then, we simply compute the classification accuracies using DTW for all warping window sizes (1% to %) for each dataset. The results are shown in Figure Wordspotting Figure 1. A comparison of the classification accuracies between variable-length (dotted lines) and the (reinterpolated) equal-length datasets (solid lines) for each warping window size (1-%). The two options produce such similar results that in many places the lines overlap. Note that the experiments do strongly suggest that changing the amount of warping allowed does affect the accuracy (an issue that will be discussed in depth in the next section), but over the entire range on possible warping widths, the two approaches are nearly indistinguishable. Furthermore, a two-tailed test using a significance level of.5 between each variable-length and equal-length pair indicates that there is no statistically significant difference between the accuracy of the two sets of experiments. An even more telling result is the following. In spite of extensive experience with DTW and an extensive effort, we were unable to create an artificial problem where reinterpolating made a significant difference in accuracy. To further reinforce our claim, we also reinterpolate the datasets to have the equal length of the shortest and averaged length of all time series within the dataset. We still achieve similar findings. These results strongly suggest that work allowing DTW to support similarity search that does require reinterpolation, is simply solving a problem that does not exist. The oftenquoted utility of DTW, such as (DTW is useful) to measure similarity between sequences of different lengths [21], for being able to support the comparison of sequences of different lengths is simply a myth. 3 Are Narrow Constraints Bad? 85 8 Apart from (slightly) speeding up the computation, warping window constraints were originally applied mainly to prevent pathological warping (where a relatively small section of one sequence maps to a much larger section of another). The vast majority of the data mining researchers have used a Sakoe-Chiba Band with a 1% width for the
3 global constraint [1][14][18]. This setting seems to be the result of historical inertia, inherited from the speech processing community, rather than some remarkable property of this particular constraint. Some researchers believe that having wider warping window contributes to improvement in accuracy [22]. Or without realizing the great effect of the warping window size on accuracies, some applied DTW with no warping window constraints [12], or did not report the window size used in the experiments [11] (the latter case makes it particularly difficult for others to reproduce the experiment results). In [19], the authors bemoan the fact that (4S) cannot be applied when the warping path is not constrained and use this fact to justify introducing an alterative approach that works for the unconstrained case. To test the effect of the warping window size to the classification accuracies, we performed an empirical experiment on all seven classification datasets. We vary the warping window size from % (Euclidean) to % (no constraint/full calculation) and record the accuracies. Since we have shown in Section 2 that reinterpolation of time series into the same length is at least as good as (or better than) using the original variable-length time series, we linearly interpolate all variable-length datasets to have the same length of the longest time series within the dataset and measure the accuracy using the 1-nearest-neighbor with leaving-one-out classification method. The results are shown in Figure 2. As we hypothesized, wider warping constraints do not always improve the accuracy, as commonly believed [22]. More often, the accuracy peaks very early at much smaller window size (average = 4%). We also did an additional experiment, where half of the objects in the databases were randomly removed from the database iteratively. We measure the classification accuracies for each database size; as the database size decreases, the classification accuracy also declines and the peak appears at larger warping window size. This finding suggests that warping window size adjustment does affect accuracy, and that the effect also depends on the database size. This in turn suggests that we should find the best warping window size on realistic (for the task at hand) database sizes, and not try to generalize from toy problems. To summarize, there is no evidence to support the idea that we need to be able to support wider constraints. While it is possible that there exist some datasets somewhere that could benefit from wider constraints, we found no evidence for this in a survey of more than 5 papers on the topic. More tellingly, in spite of extensive efforts, we could not even create a large synthetic dataset for classification that needs more than 1% warping. In fairness, we should note that it is only in the database/data mining community that this misconception exists. Researchers that work on real problems have long ago noted that constraining the warping helps. For example, Tomasi et al. who work with chromatographic data noted Unconstrained dynamic time warping was found to be too flexible for this chromatographic data set, resulting in a overcompensation of the observed shifts [2], or Rath & Manmatha have carefully optimized the constraints for the task of indexing historical archives [17] Two-Pattern Figure 2. The classification accuracies for all warping window sizes (% to %). All accuracies peak at very small window sizes. All the evidence suggests that narrow constraints are necessary for accurate DTW, and the need to support wide (or no) constraints is just a myth. 4 Can DTW be further Speeded up? Wordspotting Control Chart Smaller warping windows speed up the DTW calculations simply because there is less area of the warping matrix to be searched. Prior to the introduction of lower bounding, the amount of speedup was directly proportional to the width of the warping window. For example, a nearest neighbor search with a 1% warping constraint was almost exactly twice as fast as a search done with a 2% window. However, it is important to note that with the introduction of lower bounding based on warping constraints (i.e. 4S), the speedup is now highly nonlinear in the size of the warping window. For example, a nearest neighbor search with a 1% warping constraint may be many times faster than twice a search done with a 2% window.
4 In spite of this, many recent papers still claim that there is a need and room for further improvement in speeding up DTW. Surprisingly, as we will show, the amortized CPU cost of DTW is essentially O(n), using trivial 4S technique. To really understand what is going on, we will avoid measuring the efficiency of DTW when using index structures. The use of such index structures opens the possibility of implementation bias [9]; it is simply difficult to know if the claimed speedup truly reflects a clever algorithm, or simply care in choice of buffer size, caching policy, etc. Instead, we measure the computation time of DTW for each pair of time series in terms of the amortized percentage of the warping matrix that needs to be visited for each pair of sequences in our database. This number depends only on the data itself and the usefulness of the lower bound. As a concrete example, if we are doing a one nearest neighbor search on 12 objects with a 1% warping window size, and the 4S algorithm only needs to examine 14 sequences (pruning the rest), then the amortized cost for this calculation would be (w * 14) / 12 =.12*w, where w is the area (in percentage) inside the warping window constraint along the diagonal (Sakoe-Chiba band). Note that 1% warping window size does not always occupy 1% of the warping matrix; it mainly depends on the length of the sequence as well (longer sequences give smaller w). In contrast, if 4S was able to prune all but 3 objects, the amortized cost would be (w * 3) / 12 =.3*w. The amount of pruning we should actually expect depends on the lower bounds. For example, if we used a trivial lower bound hard-coded to zero (pointless, but perfectly legal), then line 4 of Table 1 would always be true, and we would have to do DTW for every pair of sequences in our dataset. In this case, amortized percentage of the warping matrix that needs to be accessed for each sequence in our database would exactly be the area inside the warping window. If, on the other hand, we had a magic lower bound that returned the true DTW distance minus some tiny epsilon, then line 4 of the Table 1 would rarely be true, and we would have to do the full DTW calculation only rarely. In this case, the amortized percentage of the warping matrix that needs to be accessed would be very close to zero. We measured the amortized cost for all our datasets, and for every possible warping window size. The results (and its -1% warping zoom-in) are shown in Figure 3. The results are surprising. For reasonably large datasets, simply using a good lower bound insures that we rarely have to use the full DTW calculation. In essence, we can say that DTW is effectively O(n), and not O(n 2 ), when searching large datasets. For example, in the,, and 2-Pattern problems (all maximum accuracy at 3% warping), we only need to do much less than half a percent of the O(n 2 ) work that we would have been forced to do without lower bounding. For some of the other datasets, it may appear that we need to do a significant percentage of the CPU work. However, as we will see below, these results are pessimistic in that they reflect the small size of these datasets. Amortized percentage of the O(n 2 ) calculation required CtrlChrt 2-Pattern WordSpotting Warping Window Size (%) Warping Window Size (%) Figure 3. (left)the amortized percentage of warping matrix that needs to be accessed during the DTW calculation for each warping window size. The use of a lower bound helps prune off numerous unnecessary calculations. (right) Zoom-in of the warping range -1% If the amortized cost of DTW is linear, where does the claimed improvement from recent papers come from? It is true that these approaches typically use indices, rather than sequential search, but an index must do costly random access rather than the optimized linear scans of sequential search. In order to break even in terms of disk access time, they must avoid looking at more than 1% of the data [7], but for time series where even the reduced dimensionality (i.e. the Fourier or wavelet coefficients) is usually greater than 2 [9], it is not obvious that this is possible. Some recent papers that claim speedups credit the improved lower bounds, for example we present progressively tighter lower bounds that allow our method to outperform (4S) [19]. Indeed, it might be imagined that speedup could be obtained by having tighter lower bounds. Surprisingly, this is not true! We can see this with our simple experiment. Let us imagine that we have a wonderful lower bound, which always returns a value that is within 1% of the correct value (more concretely, a value uniformly distributed between % and % of the true DTW value). We will call this idealized lower bound LB_Magic. In contrast, the current best-known lower bounds typically return a value between 4% and 6% of the true value [6]. We can compare the speedup obtained by LB_Magic with the current best lower bound, LB_Keogh [6], on 1-nearest neighbor search. Note that we have to cheat for LB_Magic by doing the full DTW calculation then assigning it a value up to 1% smaller. We will use a warping constraint of 5%, which is about the mean value for the best accuracy (cf. Sect. 3). As before, we measured the amortized percentage of the warping matrix that needs to be accessed for each sequence in our database. Here, we use a randomwalk data Amortized percentage of the O(n 2 ) calculation required CtrlChrt 2-Pattern WordSpotting
5 of length 128 data points, and vary the database size from 1 objects to 4, objects. Figure 4 shows the results. Once again, the results are very surprising. The idealized LB_Magic allows a very impressive speedup; for the largest size database, it eliminates.7% of the CPU effort. However, the very simple lower bounding technique that has been in the literature for several years is able to eliminate.369% of the CPU effort! The difference is not quite so dramatic for very small datasets, say less than 16 objects. But here we can do unoptimized search in much less than a hundredth of a second. Note that we obtain similar results for other datasets. Amortized percentage of the O(n 2 ) calculations required No Lower Bound LB_Keogh LB_Magic Size of Database (Number of Objects) Figure 4. Amortized percentage of the warping matrix that needs to be accessed. As the size of the database increases, the amortized percentage of the warping matrix accessed becomes closer to zero. To summarize, for problems involving a few thousand sequences or more, each with a few hundred data points, the significant CPU cost of DTW is simply non-issue (as for problems involving less than a few thousand sequences, we can do them in less than a second anyway). The lesson for the data mining community from this experiment is the following; it is almost certainly pointless to attempt to speed up the CPU time for DTW by producing tighter lower bounds. Even if you could produce a magical lower bound, the difference it would make would be tiny, and completely dwarfed by minor implementation choices. 5 Conclusions and Future Work In this work, we have pointed out and investigated some of the myths in Dynamic Time Warping measure. We empirically validated our three claims. We hope that our results will help researchers focus on more useful problems. For example, while there have been dozens of papers on speeding up DTW in the last decade, there has only been one on making it more accurate [15]. Likewise, we feel that the speed and accuracy of DTW that we have demonstrated in this work may encourage researchers to apply DTW to a wealth of new problems/domains. 6 References [1] Aach, J. & Church, G. (21). Aligning gene expression time series with time warping algs. Bioinformatics(17), [2] Berndt, D. & Clifford, J. (14). Using dynamic time warping to find patterns in time series. AAAI Workshop on Knowledge Discovery in Databases, pp [3] Bozkaya, T, Yazdatani, Z, & Ozsoyoglu, Z.M. (17). Matching and Indexing Sequences of Different Lengths. CIKM [4] Caiani, E.G., Porta, A., Baselli, G., Turiel, M., Muzzupappa, S., Pieruzzi, F., Crema, C., Malliani, A., & Cerutti, S. (18). Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology, pp [5] Cardle, M. (23). Music-Driven Animation. Ph.D. Thesis, Cambridge University. [6] Keogh, E. (22). Exact indexing of dynamic time warping. In 28 th VLDB. Hong Kong. pp [7] Keogh, E., Chakrabarti, K., Pazzani, M., and Mehrotra, S. (21). Locally adaptive dimensionality reduction for indexing large time series databases. SIGMOD, pp [8] Keogh, E. & Folias, T. (22) The UCR Time Series Data Mining Archive. [ ~eamonn/tsdma] [9] Keogh, E. & Kasetty, S. (22). On the Need for Time Seires Data Mining Benchmarks: A Survey and Empirical Demonstration. In the 8 th ACM SIGKDD, pp [1] Kim, S.W., Park, S., & Chu, W.W. (24). Efficient processing of similarity search under time warping in sequence databases: an index-based approach. Inf. Syst. 29(5): [11] Kornfield, E.M, Manmatha, R., & Allan, J. (24). Text Alignment with Handwritten Documents. 1 st Int l workshop on Document Image Analysis for Libraris (DIAL), pp [12] Laaksonen, J., Hurri, J., and Oja, Erkki. (18). Comparison of Adaptive Strategies for On-Line Character Recognition. In proceedings of ICANN, pp [13] Park, S.,, Chu, W, Yoon, J., and Hsu, C (2). Efficient searchs for similar subsequences of different lengths in sequence databases. In ICDE-. [14] Rabiner, L., Rosenberg, A. & Levinson, S. (18). Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Trans. Acoustics Speech, and Signal Proc., Vol. ASSP-26, pp [15] Ratanamahatana, C.A. & Keogh, E. (24). Making Time-series Classification More Accurate Using Learned Constraints. SDM International conference, pp [16] Ratanamahatana, C.A. & Keogh, E. (24). Everything You Know about Dynamic Time Warping is Wrong. SIGKDD Workshop on Mining Temporal and Sequential Data. [17] Rath, T. & Manmatha, R. (23). Word image matching using dynamic time warping. CVPR, Vol. II, pp [18] Sakoe, H. & Chiba, S. (18). Dynamic programming algorithm optimization fro spoken word recognition. IEEE Trans. Acoustics, Speech, & Signal Proc, ASSP-26, [19] Shou, Y., Mamoulis, N., and Cheung, D.W. Efficient Warping of Segmented Time-series, HKU CSIS Tech rep, TR-24-1 [2] Tomasi, G., van den Berg, F., & Andersson, C. (24). Correlation Optimized Warping and DTW as Preprocessing Methods for Chromatographic Data. J. of Chemometrics. [21] Wong, T.S.F & Wong, M.H. (23). Efficient Subsequence Matching for Sequences Databases under Time Warping. IDEAS. [22] Zhu, Y. & Shasha, D. (23). Warping Indexes with Enve-lope Transforms for Query by Humming. SIGMOD,
Everything you know about Dynamic Time Warping is Wrong
Everything you know about Dynamic Time Warping is Wrong hotirat Ann Ratanamahatana Eamonn Keogh Department of omputer Science and Engineering University of alifornia, Riverside Riverside, A 951 { ratana,
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,
Time series databases. Indexing Time Series. Time series data. Time series are ubiquitous
Time series databases Indexing Time Series A time series is a sequence of real numbers, representing the measurements of a real variable at equal time intervals Stock prices Volume of sales over time Daily
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Effects of Time Normalization on the Accuracy of Dynamic Time Warping
Effects of Time Normalization on the Accuracy of Dynamic Time Warping Olaf Henniger Sascha Müller Abstract This paper revisits Dynamic Time Warping, a method for assessing the dissimilarity of time series.
Offline Word Spotting in Handwritten Documents
Offline Word Spotting in Handwritten Documents Nicholas True Department of Computer Science University of California, San Diego San Diego, CA 9500 [email protected] Abstract The digitization of written
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining -
Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining - Hidenao Abe, Miho Ohsaki, Hideto Yokoi, and Takahira Yamaguchi Department of Medical Informatics,
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
Classification of Household Devices by Electricity Usage Profiles
Classification of Household Devices by Electricity Usage Profiles Jason Lines 1, Anthony Bagnall 1, Patrick Caiger-Smith 2, and Simon Anderson 2 1 School of Computing Sciences University of East Anglia
Synchronization of sampling in distributed signal processing systems
Synchronization of sampling in distributed signal processing systems Károly Molnár, László Sujbert, Gábor Péceli Department of Measurement and Information Systems, Budapest University of Technology and
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Analysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
Character Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
What mathematical optimization can, and cannot, do for biologists. Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL
What mathematical optimization can, and cannot, do for biologists Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL Introduction There is no shortage of literature about the
Manifold Learning Examples PCA, LLE and ISOMAP
Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
FAST APPROXIMATE NEAREST NEIGHBORS WITH AUTOMATIC ALGORITHM CONFIGURATION
FAST APPROXIMATE NEAREST NEIGHBORS WITH AUTOMATIC ALGORITHM CONFIGURATION Marius Muja, David G. Lowe Computer Science Department, University of British Columbia, Vancouver, B.C., Canada [email protected],
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
The dream for a better life has made man look for gold coins from the World War II (1940) which were lost in various parts of Greece.
The search for gold and gold coins is one of the finest and the most widespread hobbies. It has its difficulties as well as its beautiful moments. The people involved in this hobby are commonly called
A comprehensive survey on various ETC techniques for secure Data transmission
A comprehensive survey on various ETC techniques for secure Data transmission Shaikh Nasreen 1, Prof. Suchita Wankhade 2 1, 2 Department of Computer Engineering 1, 2 Trinity College of Engineering and
Comparison of Elastic Matching Algorithms for Online Tamil Handwritten Character Recognition
Comparison of Elastic Matching Algorithms for Online Tamil Handwritten Character Recognition Niranjan Joshi, G Sita, and A G Ramakrishnan Indian Institute of Science, Bangalore, India joshi,sita,agr @ragashrieeiiscernetin
Local outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA
Are Image Quality Metrics Adequate to Evaluate the Quality of Geometric Objects? Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA ABSTRACT
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data
Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data Jun Wang Department of Mechanical and Automation Engineering The Chinese University of Hong Kong Shatin, New Territories,
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
A Stock Pattern Recognition Algorithm Based on Neural Networks
A Stock Pattern Recognition Algorithm Based on Neural Networks Xinyu Guo [email protected] Xun Liang [email protected] Xiang Li [email protected] Abstract pattern respectively. Recent
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Biometric Authentication using Online Signatures
Biometric Authentication using Online Signatures Alisher Kholmatov and Berrin Yanikoglu [email protected], [email protected] http://fens.sabanciuniv.edu Sabanci University, Tuzla, Istanbul,
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
A Performance Comparison of Five Algorithms for Graph Isomorphism
A Performance Comparison of Five Algorithms for Graph Isomorphism P. Foggia, C.Sansone, M. Vento Dipartimento di Informatica e Sistemistica Via Claudio, 21 - I 80125 - Napoli, Italy {foggiapa, carlosan,
SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY
3 th World Conference on Earthquake Engineering Vancouver, B.C., Canada August -6, 24 Paper No. 296 SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY ASHOK KUMAR SUMMARY One of the important
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
Community Mining from Multi-relational Networks
Community Mining from Multi-relational Networks Deng Cai 1, Zheng Shao 1, Xiaofei He 2, Xifeng Yan 1, and Jiawei Han 1 1 Computer Science Department, University of Illinois at Urbana Champaign (dengcai2,
EMERGING FRONTIERS AND FUTURE DIRECTIONS FOR PREDICTIVE ANALYTICS, VERSION 4.0
EMERGING FRONTIERS AND FUTURE DIRECTIONS FOR PREDICTIVE ANALYTICS, VERSION 4.0 ELINOR L. VELASQUEZ Dedicated to the children and the young people. Abstract. This is an outline of a new field in predictive
High-dimensional labeled data analysis with Gabriel graphs
High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use
How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode
Simultaneous Gamma Correction and Registration in the Frequency Domain
Simultaneous Gamma Correction and Registration in the Frequency Domain Alexander Wong [email protected] William Bishop [email protected] Department of Electrical and Computer Engineering University
MINING TIME SERIES DATA
Chapter 1 MINING TIME SERIES DATA Chotirat Ann Ratanamahatana, Jessica Lin, Dimitrios Gunopulos, Eamonn Keogh University of California, Riverside Michail Vlachos IBM T.J. Watson Research Center Gautam
CCPM: TOC Based Project Management Technique
CCPM: TOC Based Project Management Technique Prof. P.M. Chawan, Ganesh P. Gaikwad, Prashant S. Gosavi M. Tech, Computer Engineering, VJTI, Mumbai. Abstract In this paper, we are presenting the drawbacks
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
CHAPTER 6 TEXTURE ANIMATION
CHAPTER 6 TEXTURE ANIMATION 6.1. INTRODUCTION Animation is the creating of a timed sequence or series of graphic images or frames together to give the appearance of continuous movement. A collection of
IMPROVED NETWORK PARAMETER ERROR IDENTIFICATION USING MULTIPLE MEASUREMENT SCANS
IMPROVED NETWORK PARAMETER ERROR IDENTIFICATION USING MULTIPLE MEASUREMENT SCANS Liuxi Zhang and Ali Abur Department of Electrical and Computer Engineering Northeastern University Boston, MA, USA [email protected]
Removing Moving Objects from Point Cloud Scenes
1 Removing Moving Objects from Point Cloud Scenes Krystof Litomisky [email protected] Abstract. Three-dimensional simultaneous localization and mapping is a topic of significant interest in the research
Effect of Using Neural Networks in GA-Based School Timetabling
Effect of Using Neural Networks in GA-Based School Timetabling JANIS ZUTERS Department of Computer Science University of Latvia Raina bulv. 19, Riga, LV-1050 LATVIA [email protected] Abstract: - The school
REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
299 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
EXTENDED ANGEL: KNOWLEDGE-BASED APPROACH FOR LOC AND EFFORT ESTIMATION FOR MULTIMEDIA PROJECTS IN MEDICAL DOMAIN
EXTENDED ANGEL: KNOWLEDGE-BASED APPROACH FOR LOC AND EFFORT ESTIMATION FOR MULTIMEDIA PROJECTS IN MEDICAL DOMAIN Sridhar S Associate Professor, Department of Information Science and Technology, Anna University,
ClusterOSS: a new undersampling method for imbalanced learning
1 ClusterOSS: a new undersampling method for imbalanced learning Victor H Barella, Eduardo P Costa, and André C P L F Carvalho, Abstract A dataset is said to be imbalanced when its classes are disproportionately
FOREX TRADING PREDICTION USING LINEAR REGRESSION LINE, ARTIFICIAL NEURAL NETWORK AND DYNAMIC TIME WARPING ALGORITHMS
FOREX TRADING PREDICTION USING LINEAR REGRESSION LINE, ARTIFICIAL NEURAL NETWORK AND DYNAMIC TIME WARPING ALGORITHMS Leslie C.O. Tiong 1, David C.L. Ngo 2, and Yunli Lee 3 1 Sunway University, Malaysia,
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Detection and Restoration of Vertical Non-linear Scratches in Digitized Film Sequences
Detection and Restoration of Vertical Non-linear Scratches in Digitized Film Sequences Byoung-moon You 1, Kyung-tack Jung 2, Sang-kook Kim 2, and Doo-sung Hwang 3 1 L&Y Vision Technologies, Inc., Daejeon,
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Principal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II
! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
This material is posted here with permission of the IEEE Such permission of the IEEE does not in any way imply IEEE endorsement of any of Helsinki University of Technology's products or services Internal
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: [email protected]
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
A fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
56 Mining Time Series Data
56 Mining Time Series Data Chotirat Ann Ratanamahatana 1, Jessica Lin 1, Dimitrios Gunopulos 1, Eamonn Keogh 1, Michail Vlachos 2, and Gautam Das 3 1 University of California, Riverside 2 IBM T.J. Watson
Why Your CRM Process is Destroying Your Team s Prospecting and How to Fix It
Proof of Prospecting Why Your CRM Process is Destroying Your Team s Prospecting and How to Fix It When implementing any kind of sales improvement program, most sales organizations understandably focus
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams Pramod D. Patil Research Scholar Department of Computer Engineering College of Engg. Pune, University of Pune Parag
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan
Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification
Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Voronoi Treemaps in D3
Voronoi Treemaps in D3 Peter Henry University of Washington [email protected] Paul Vines University of Washington [email protected] ABSTRACT Voronoi treemaps are an alternative to traditional rectangular
The LCA Problem Revisited
The LA Problem Revisited Michael A. Bender Martín Farach-olton SUNY Stony Brook Rutgers University May 16, 2000 Abstract We present a very simple algorithm for the Least ommon Ancestor problem. We thus
Subspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China [email protected] Stan Z. Li Microsoft
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
