Time series databases. Indexing Time Series. Time series data. Time series are ubiquitous


Indexing Time Series

Time series databases
A time series is a sequence of real numbers representing the measurements of a real variable at equal time intervals, for example stock prices, volume of sales over time, daily temperature readings, or ECG data.
A time series database is a large collection of time series.

Time series data
A time series is a collection of observations made sequentially in time.
[Figure: an example time series, value axis plotted against time axis]

Time series are ubiquitous
People measure things: the president's approval rating, their blood pressure, the annual rainfall in Santa Barbara, the value of their Yahoo stock, the number of web hits per second.
And things change over time. Thus time series occur in numerous medical, scientific and business applications.

Applications
Search (similarity search in time series):
A doctor searching the ECG database for a particular pattern (one that implies a heart irregularity) for diagnosis.
A stock analyst searching the stock database for a particular stock price pattern for prediction.
Data mining:
Classification: classify patients based on ECG patterns.
Clustering: group websites with similar traffic patterns.
Association rules: if we see this peak followed by this plateau, then we will have a downward trend with a given confidence and support.

Time series problems (from a databases perspective)
The similarity problem: given \( X = x_1, x_2, \ldots, x_n \) and \( Y = y_1, y_2, \ldots, y_n \), define and compute Sim(X, Y). E.g. do stocks X and Y have similar movements?
Retrieve similar time series (similarity queries).

Types of queries
Whole match vs. sub-pattern match.
Range query vs. k-NN.
Join (self).

Examples
Find companies with similar stock prices over a time interval.
Find products with similar sell cycles.
Cluster users with similar credit card utilization.
Find similar subsequences in DNA sequences.
Find scenes in video streams.

Problems
[Figure: three stock price series, $price plotted against day, up to day 365]
Define the similarity (or distance) function. The distance function is chosen by an expert (e.g. Euclidean distance); the similarity function depends on the application.
Find an efficient algorithm to retrieve similar time series from a database (faster than sequential scan).

Euclidean similarity measure
View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence).
Define the (dis-)similarity between sequences X and Y as
\[ L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]
p = 1: Manhattan distance.
p = 2: Euclidean distance.

Euclidean model
Euclidean distance between two time series \( Q = \{q_1, q_2, \ldots, q_n\} \) and \( S = \{s_1, s_2, \ldots, s_n\} \), each with n datapoints:
\[ D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2} \]
[Figure: a query Q is compared with each series S in the database; the series are listed with their distance and rank]
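The \( L_p \) family is short enough to write out directly. The following is a minimal sketch, assuming both sequences are plain Python lists of equal length; the function name `lp_distance` is illustrative and does not come from the slides.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two short example sequences (illustrative values).
q = [0.0, 0.0]
s = [3.0, 4.0]
manhattan = lp_distance(q, s, p=1)   # |3| + |4|
euclidean = lp_distance(q, s, p=2)   # sqrt(9 + 16)
```

The cost is one pass over the n datapoints, which is the O(n) behaviour that makes this measure attractive for indexing.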

Euclidean distance: advantages
Easy to compute: O(n).
Allows scalable solutions to other problems, such as indexing, clustering, etc.

Euclidean distance: disadvantages
Does not allow for different baselines: stock X and stock Y may fluctuate around different price levels.
Does not allow for different scales: the two stocks may fluctuate over price ranges of different widths.
Synchronization of signals.
Missing points.

Dynamic Time Warping [Berndt, Clifford, 1994]
Allows acceleration and deceleration of signals along the time dimension.
Basic idea: consider \( X = x_1, x_2, \ldots, x_n \) and \( Y = y_1, y_2, \ldots, y_n \). We are allowed to extend each sequence by repeating elements. The Euclidean distance is now calculated between the extended sequences X' and Y'.
Matrix M, where \( m_{ij} = d(x_i, y_j) \).
[Figure: example alignment under Euclidean distance vs. DTW]

Dynamic Time Warping [Berndt, Clifford, 1994]: restrictions on warping paths
Monotonicity: the path should not go down or to the left.
Continuity: no elements may be skipped in a sequence.
Warping window: \( |i - j| \le w \).
[Figure: a warping path through the matrix of X = x_1, x_2, x_3 against Y = y_1, y_2, y_3, confined to the band between j = i + w and j = i - w]

Formulation: solution by dynamic programming
Let D(i, j) refer to the dynamic time warping distance between the subsequences \( x_1, x_2, \ldots, x_i \) and \( y_1, y_2, \ldots, y_j \):
\[ D(i, j) = |x_i - y_j| + \min\{ D(i-1, j),\; D(i-1, j-1),\; D(i, j-1) \} \]
The basic implementation is \( O(n^2) \), where n is the length of the sequences, because the problem has to be solved for each (i, j) pair.
If a warping window is specified, the cost is \( O(nw) \): only solve for the (i, j) pairs where \( |i - j| \le w \).
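The recurrence and the warping window can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the name `dtw_distance` is made up here, the point cost is the absolute difference \( |x_i - y_j| \), and the window is widened to at least the length difference so that the last cell stays reachable when the sequences differ in length.

```python
import math


def dtw_distance(x, y, w=None):
    """Dynamic time warping distance with an optional warping window w.

    D[i][j] holds the DTW distance between x[:i] and y[:j].
    Only cells with |i - j| <= w are filled, giving O(n*w) work.
    """
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)             # no restriction: full O(n*m) table
    w = max(w, abs(n - m))        # assumption: keep cell (n, m) reachable
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]


# Repeating an element costs nothing under DTW.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# With w = 0 only the diagonal is allowed, so equal-length inputs
# reduce to the sum of pointwise absolute differences.
diagonal_only = dtw_distance([1, 5, 2], [2, 3, 4], w=0)
```

Taking the minimum over the three predecessor cells is what enforces monotonicity and continuity: the path can only move right, up, or diagonally by one step.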

Longest Common Subsequence Measures (Allowing for Gaps in Sequences)
Basic LCS idea:
X = 3, 2, 5, 7, 4, 8, 10, 7
Y = 2, 5, 4, 7, 3, 10, 8, 6
LCS = 2, 5, 7, 10
Sim(X, Y) = |LCS| or Sim(X, Y) = |LCS| / n
[Figure: matched elements of X and Y, with the gaps skipped]

LCS for time series
We can extend the idea of LCS to time series using a threshold ε for matching. The LCS for two subsequences \( x_1, x_2, \ldots, x_i \) and \( y_1, y_2, \ldots, y_j \) is:
\[ LCS[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ 1 + LCS[i-1, j-1] & \text{if } |x_i - y_j| < \varepsilon \\ \max(LCS[i-1, j],\; LCS[i, j-1]) & \text{otherwise} \end{cases} \]

Metric properties and indexing
Edit distance is a metric, so we can use metric indexes.
DTW and LCS are not metrics, so we have to use specialized indexes.
How to convert this into a distance?
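The threshold-based LCS recurrence translates directly into a table fill. The sketch below is illustrative: `lcss_length` is a name chosen here, and the example reuses the integer sequences X and Y with ε = 0.5 so that only equal values match.

```python
def lcss_length(x, y, eps):
    """Length of the longest common subsequence of x and y,
    where x[i] and y[j] match when |x[i] - y[j]| < eps."""
    n, m = len(x), len(y)
    # L[i][j] is the LCS length of x[:i] and y[:j]; row and column 0 stay 0.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


X = [3, 2, 5, 7, 4, 8, 10, 7]
Y = [2, 5, 4, 7, 3, 10, 8, 6]
length = lcss_length(X, Y, eps=0.5)    # 4, e.g. the subsequence 2, 5, 7, 10
similarity = length / len(X)           # Sim(X, Y) = |LCS| / n
```

One common way to turn the similarity into a dissimilarity is 1 - |LCS| / n; that choice is an assumption here, since the slides leave the conversion as an open question.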

Similarity retrieval
Range query: find all time series S where \( D(Q, S) \le \varepsilon \).
Nearest neighbor query: find the k time series most similar to Q.
A method to answer the above queries: linear scan, which is very slow.

A better approach: the F-index [AFS93]
Use the DFT (other distance-preserving transforms are also possible); it preserves the \( L_2 \) norm.
Choose a few coefficients and index them.
Answer queries in a two-step process: search the index, then remove false hits.

Subsequence matching
Given a query Q and a data sequence S, search for similarity over all subsequences of S of length |Q|.
Sequential scan.
Atomic subsequence approach: divide Q and S into atomic subsequences, index the atomic subsequences and look for a match.

Window-based approach
Generate information about a window of length w at each point in the data sequence.
Use the DFT and choose the first few coefficients.
Generate a trail by plotting the information about each point.
Divide the trail into sub-trails.
Generate an MBR for each sub-trail.
Build an index on these MBRs, possibly from different sequences.

Division into sub-trails
I-naive: index the individual points.
I-fixed: a fixed number of points per MBR.
I-adaptive: compute the marginal cost of adding the next point to the existing sub-trail; if the marginal cost increases, start a new sub-trail, otherwise assign the point to the existing sub-trail.
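The two-step idea behind the F-index can be illustrated without an actual index structure. The sketch below is a simplification under stated assumptions: features are computed on the fly and scanned linearly instead of being stored in a spatial index, the DFT is scaled by \( 1/\sqrt{n} \) so that the full transform preserves the \( L_2 \) norm, and all function names are invented for this example.

```python
import cmath
import math


def dft_features(x, k):
    """First k DFT coefficients of x, scaled by 1/sqrt(n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(k)
    ]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def feature_distance(fa, fb):
    """Distance in feature space; never exceeds the true Euclidean distance
    because dropping coefficients can only remove non-negative terms."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def range_query(query, database, eps, k=2):
    fq = dft_features(query, k)
    # Step 1 (filter): keep every series whose feature distance is within eps.
    candidates = [s for s in database
                  if feature_distance(fq, dft_features(s, k)) <= eps]
    # Step 2 (refine): remove false hits using the true distance.
    return [s for s in candidates if euclidean(query, s) <= eps]


database = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 0, 10, 0], [4, 3, 2, 1]]
hits = range_query([1, 2, 3, 4], database, eps=1.5)
```

Because the feature distance lower-bounds the true distance, the filter step can return false hits but never dismisses a true answer, which is why the refine step is sufficient for correctness.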

What if the query length is greater than the window length w?
Q is the query, S is a database sequence, and |Q| = kw.

Prefix search
If \( d(Q, S[i..i+kw-1]) \le \varepsilon \), then \( d(Q[j..j+w-1], S[i+j..i+j+w-1]) \le \varepsilon \) for all \( j \le (k-1)w \).
Search the database using any atomic subsequence of the query Q.

Multi-piece search
If \( d(Q, S[i..i+kw-1]) \le \varepsilon \), then there exists j such that \( d(Q[jw..jw+w-1], S[i+jw..i+jw+w-1]) \le \varepsilon/\sqrt{k} \).
Search the database using each piece of the query Q within a radius of \( \varepsilon/\sqrt{k} \).
If each pair of corresponding atomic subsequences of Q and S[i..i+kw-1] is at a distance of \( \varepsilon/\sqrt{k} \), then the two sequences are at a distance of ε.
If the two sequences are at a distance of ε and we query using each atomic subsequence of Q, then for at least one of them we will find an atomic subsequence of S[i..i+kw-1] within a distance of \( \varepsilon/\sqrt{k} \).

Searching with a smaller sequence
Searching with an atomic subsequence of Q whose size is k times smaller within a radius of r is equivalent to searching with a radius of \( r\sqrt{k} \) with the whole Q.
For prefix search, the resulting search volume is \( (\sqrt{k})^d \) times the original volume, where d is the number of embedding dimensions.
For multi-piece search, the resulting search volume for each piece is the same as the original search volume. The combined search volume is k times the original search volume.
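The \( \varepsilon/\sqrt{k} \) radius for multi-piece search follows from a short averaging argument. The derivation below assumes Euclidean distance and writes \( d_j \) (a symbol introduced here) for the distance between the j-th piece of Q and the corresponding piece of S[i..i+kw-1].

```latex
\[
d\bigl(Q,\, S[i..i+kw-1]\bigr)^2 \;=\; \sum_{j=0}^{k-1} d_j^{\,2} \;\le\; \varepsilon^2
\quad\Longrightarrow\quad
\min_{j} d_j^{\,2} \;\le\; \frac{1}{k} \sum_{j=0}^{k-1} d_j^{\,2} \;\le\; \frac{\varepsilon^2}{k}
\quad\Longrightarrow\quad
\min_{j} d_j \;\le\; \frac{\varepsilon}{\sqrt{k}}
\]
```

The smallest of k non-negative terms cannot exceed their average, so at least one piece must fall within the reduced radius; prefix search uses only the weaker fact that each single term satisfies \( d_j \le \varepsilon \).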

Shift and scale invariance
[Figure: sequences drawn as points in the space with axes x_1, x_2, x_3; scaling and shifting maps x to a x + b]

Shift and scale invariance: an example
V_1 = [2, 6, 4, 10]
V_2 = [1, 3, 2, 5]
V_3 = [5, 9, 7, 13]
V_4 = [5, 7, 6, 9]
V_3 = V_1 + 3I, V_1 = 2 V_2, V_4 = V_2 + 4I
[Figure: the four sequences plotted as x against t]

Shift and scale invariant time sequences
Motivation:
The expression level of genes may be scaled or shifted due to experimental conditions.
Different countries may have different units (e.g. Fahrenheit vs. Celsius).
There can be linear relationships between time sequences (stock market data).

z-normalization [GK95]
\( Z(x) = (x - \mathrm{mean}(x)) / \mathrm{std}(x) \)
Does not minimize distance under shifting and scaling!
u = [,,, ], v = [3,,, ]: d(Z(u), Z(v)) = .94, whereas d(u, -.5v + .5) = .5.
The distance between the z-normalized sequences is higher than the distance after scaling and shifting.
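A small sketch shows what z-normalization does and does not remove. It assumes the population standard deviation (division by n) and maps a constant sequence to all zeros; both choices, and the name `z_normalize`, are assumptions of this example. The test vectors are chosen so that the second equals half of the first plus 4.

```python
import math


def z_normalize(x):
    """Z(x) = (x - mean(x)) / std(x), using the population std."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return [0.0] * n          # assumption: constant series map to zeros
    return [(v - mean) / std for v in x]


v1 = [2, 6, 4, 10]
v4 = [5, 7, 6, 9]                 # v4 = 0.5 * v1 + 4
z1 = z_normalize(v1)
z4 = z_normalize(v4)
# The normalized forms coincide up to floating-point rounding.
same = all(abs(a - b) < 1e-12 for a, b in zip(z1, z4))
```

Sequences related by a positive scale and a shift normalize to the same vector. For sequences that are only approximately related, the z-normalized distance is not the minimum over all scalings and shifts, which is the limitation the slide points out.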

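SE-plane [CW99]
To find the shortest distance between the lines SH(v) and SC(u), project them onto the shift-eliminated plane (SE-plane) and measure the distance between the point \( T_{SE}(SH(v)) \) and the line \( T_{SE}(SC(u)) \).
[Figure: vectors u and v, the shift line SH(v) along N = [1, 1, ..., 1], the scale line SC(u), and their projections \( T_{SE}(SH(v)) \) and \( T_{SE}(SC(u)) \) on the SE-plane]

Problem with this scheme
The result depends on which sequence is scaled and shifted:
u = [,,, ], v = [3,,, ]: with a = -, b = .5, d(u, v) = ; with a = -.5, b = .5, d(v, u) = .5.

Shift & Scale Invariant Distance
\[ D(u, v) = \min\{ d(u, v),\; d(v, u) \} = \min\{ \|T_{SE}(u)\|,\; \|T_{SE}(v)\| \} \, \sin(\theta_{uv}) \]
[Figure: \( T_{SE}(u) \) and \( T_{SE}(v) \) on the SE-plane, separated by the angle θ]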
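The geometry of the SE-plane can be checked numerically. The sketch below is an interpretation under stated assumptions rather than the method of [CW99]: shift elimination is taken to be subtraction of the mean (projection orthogonal to the all-ones vector), the one-sided distance is the distance from one shift-eliminated point to the line through the other, and all names are invented for this example.

```python
import math


def shift_eliminate(x):
    """Project x onto the plane orthogonal to N = [1, 1, ..., 1]."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def one_sided_distance(u, v):
    """min over a, b of ||u - (a*v + b)||: the distance from the
    shift-eliminated u to the line spanned by the shift-eliminated v."""
    cu, cv = shift_eliminate(u), shift_eliminate(v)
    uu = sum(a * a for a in cu)
    vv = sum(b * b for b in cv)
    if vv == 0.0:
        return math.sqrt(uu)      # v is constant: only the shift can help
    uv = sum(a * b for a, b in zip(cu, cv))
    return math.sqrt(max(0.0, uu - uv * uv / vv))


def invariant_distance(u, v):
    """Symmetric distance: the smaller of the two one-sided distances."""
    return min(one_sided_distance(u, v), one_sided_distance(v, u))


u = [1, 2, 3, 4]
v = [1, 1, 2, 2]
forward = one_sided_distance(u, v)    # 1.0
backward = one_sided_distance(v, u)   # sqrt(0.2), about 0.447
symmetric = invariant_distance(u, v)
```

For this pair the shift-eliminated norms are \( \sqrt{5} \) and 1 and \( \sin\theta = \sqrt{0.2} \), so the smaller norm times \( \sin\theta \) reproduces the symmetric value, in line with the closed form on the slide.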

CS-index Structure
[Figure sequence: step-by-step construction of the CS-index over points on the SE-plane, with fanout = 3 and page size = p]

Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
Keogh, Chakrabarti, Mehrotra, Pazzani, SIGMOD 2001

Dimensionality curse
n (query length) is large.
Index structures perform poorly at such high dimensionalities.
Solution: first reduce the dimensionality from n to an n' that can be handled efficiently by the index structure, then build the index on the n'-dimensional data (GEMINI framework, Faloutsos et al.).
Correctness criterion: \( D_{indexspace}(A, B) \le D_{true}(A, B) \), which guarantees no false dismissals.

Dimensionality reduction techniques
[Figure: a series X of n datapoints and its approximation X' under DFT, DWT (basis functions Haar 0 to Haar 7) and SVD (basis functions eigenwave 0 to eigenwave 7)]

Piecewise Aggregate Approximation (PAA)
Original time series (n-dimensional vector): \( S = \{s_1, s_2, \ldots, s_n\} \)
n'-segment PAA representation (n'-d vector): \( S' = \{sv_1, sv_2, \ldots, sv_{n'}\} \)
[Figure: a series and its 8-segment PAA with segment values sv_1 to sv_8, value axis against time axis]
The PAA representation satisfies the lower bounding property (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000).
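PAA and its lower-bounding distance fit in a few lines. This is a minimal sketch, assuming the series length is an exact multiple of the number of segments; `paa` and `paa_lower_bound` are names chosen for illustration. The bound used is the segment length times the squared difference of segment means, summed over segments, under a square root.

```python
import math


def paa(series, n_segments):
    """Mean of each of n_segments equal-length frames."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("length must be a multiple of n_segments")
    size = n // n_segments
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(n_segments)]


def paa_lower_bound(q, s, n_segments):
    """Distance in PAA space; never exceeds the Euclidean distance."""
    size = len(q) // n_segments
    pq, ps = paa(q, n_segments), paa(s, n_segments)
    return math.sqrt(size * sum((a - b) ** 2 for a, b in zip(pq, ps)))


q = [1, 2, 3, 4, 5, 6, 7, 8]
s = [2, 2, 4, 4, 5, 7, 7, 9]
reduced = paa(q, 4)                        # [1.5, 3.5, 5.5, 7.5]
lower = paa_lower_bound(q, s, 4)           # sqrt(2), about 1.414
true = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))   # 2.0
```

The inequality `lower <= true` is the correctness criterion of the slide applied to PAA: an index built on the reduced vectors can prune safely without false dismissals.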

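Can we improve upon PAA?
n'-segment PAA representation (n'-d vector): \( S' = \{sv_1, sv_2, \ldots, sv_{n'}\} \)

Adaptive Piecewise Constant Approximation (APCA)
n'/2-segment APCA representation (n'-d vector): \( S' = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\} \), where M = n'/2 is the number of segments, \( sv_i \) is the value of segment i and \( sr_i \) is its right endpoint.
[Figure: the same series under 8-segment PAA (sv_1 to sv_8) and under 4-segment APCA (sv_1 to sv_4 with right endpoints sr_1 to sr_4)]
APCA approximates the original signal better than PAA.
Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)

Construction of APCA
Apply the wavelet transform.
Keep the highest coefficients.
Reconstruct the series to obtain the intervals.
Use the actual means.
Merge adjacent regions as needed.

Distance measure
Exact (Euclidean) distance D(Q, S):
\[ D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2} \]
Lower bounding distance \( D_{LB}(Q, S) \), where \( qv_i \) is the mean of Q over the i-th segment of S and \( sr_0 = 0 \):
\[ D_{LB}(Q, S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1}) \, (qv_i - sv_i)^2} \]
[Figure: Q and S with D(Q, S), and Q projected onto the segments of S with \( D_{LB}(Q, S) \)]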
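The lower-bounding distance for APCA can be sketched as follows. To keep the example short, the segment boundaries are supplied by hand and the segment values are the actual means; the wavelet-based construction is not reproduced here. The representation as a list of `(sv, sr)` pairs with 1-based inclusive right endpoints, and both function names, are assumptions of this sketch.

```python
import math


def apca_from_boundaries(series, right_ends):
    """APCA-style representation [(sv_1, sr_1), ..., (sv_M, sr_M)]
    for given right endpoints, using the actual segment means."""
    rep, start = [], 0
    for sr in right_ends:
        segment = series[start:sr]
        rep.append((sum(segment) / len(segment), sr))
        start = sr
    return rep


def apca_lower_bound(q, apca_s):
    """D_LB: project the query onto the segments of S, then sum
    (segment length) * (qv_i - sv_i)^2 over the segments."""
    total, start = 0.0, 0
    for sv, sr in apca_s:
        qv = sum(q[start:sr]) / (sr - start)   # mean of Q on this segment
        total += (sr - start) * (qv - sv) ** 2
        start = sr
    return math.sqrt(total)


s = [1, 1, 1, 5, 5, 9, 9, 9]
q = [2, 0, 1, 5, 6, 8, 9, 10]
rep = apca_from_boundaries(s, [3, 5, 8])   # [(1.0, 3), (5.0, 5), (9.0, 8)]
lower = apca_lower_bound(q, rep)           # sqrt(0.5)
true = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))   # sqrt(5)
```

Unlike PAA, the segments have different lengths, so each term is weighted by \( sr_i - sr_{i-1} \) and the query must be re-projected onto the segment boundaries of each candidate series.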

Index on 2M-dimensional APCA space
[Figure: APCA points S1 to S9 in the 2M-dimensional APCA space, grouped into MBRs R1 to R4, and the corresponding index tree with entries R1 to R4 above the data items S1 to S9]
Any index structure can be used.

k-nearest neighbor algorithm
[Figure: a query Q and its distances MINDIST(Q, R2), MINDIST(Q, R3) and MINDIST(Q, R4) to the MBRs]
For any node U of the index structure with MBR R, \( MINDIST(Q, R) \le D(Q, S) \) for any data item S under U.

Index modification for MINDIST computation
APCA point: \( S = \{sv_1, sr_1, sv_2, sr_2, \ldots, sv_M, sr_M\} \)
APCA rectangle: S = (L, H), where
\( L = \{smin_1, sr_1, smin_2, sr_2, \ldots, smin_M, sr_M\} \) and
\( H = \{smax_1, sr_1, smax_2, sr_2, \ldots, smax_M, sr_M\} \)
[Figure: a 4-segment APCA with smin_i, sv_i, smax_i and sr_i marked for each segment]

MBR representation in time-value space
We can view the MBR R = (L, H) of any node U as two APCA representations \( L = \{l_1, l_2, \ldots, l_{N-1}, l_N\} \) and \( H = \{h_1, h_2, \ldots, h_{N-1}, h_N\} \), with N = 2M.
[Figure: \( L = \{l_1, \ldots, l_6\} \) and \( H = \{h_1, \ldots, h_6\} \) drawn on the value axis against the time axis, delimiting REGION 1, REGION 2 and REGION 3]

Regions
M regions are associated with each MBR. The i-th region spans the value range \( [l_{2i-1}, h_{2i-1}] \) and the time range \( [l_{2i-2} + 1, h_{2i}] \).
The i-th region is active at time instant t if it spans across t.
The value \( s_t \) of any time series S under node U at time instant t must lie in one of the regions active at t.
[Figure: REGION 1, REGION 2 and REGION 3 in time-value space, bounded by l_1 to l_6 and h_1 to h_6, with time instants t_1 and t_2 marked]

MINDIST computation
For time instant t,
\[ MINDIST(Q, R, t) = \min_{\text{region } G \text{ active at } t} MINDIST(Q, G, t) \]
Example from the figure, where regions 1 and 2 are active at t:
\[ MINDIST(Q, R, t) = \min\bigl(MINDIST(Q, \text{Region 1}, t),\; MINDIST(Q, \text{Region 2}, t)\bigr) = \min\bigl((q_t - h_1)^2,\; (q_t - h_3)^2\bigr) = (q_t - h_1)^2 \]
Over all time instants,
\[ MINDIST(Q, R) = \sqrt{\sum_{t=1}^{n} MINDIST(Q, R, t)} \]
Lemma: \( MINDIST(Q, R) \le D(Q, C) \) for any time series C under node U.

Experimental results
Compare APCA with SVD, DFT, DWT and PAA in terms of:
Time to compute the reduced representation.
Pruning power (quality of approximation).
Query performance (in terms of I/O and CPU costs).
A hybrid tree is used to index the reduced space.
Datasets: Electrocardiogram data and Mixed Bag data, with varying query length n.
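The MINDIST computation can be sketched once the regions are given explicitly. In this illustrative version each region is a hypothetical tuple `(t_start, t_end, v_low, v_high)` with 1-based inclusive time bounds, rather than being derived from the L and H vectors of an MBR. The per-region rule (squared gap to the nearest value bound, zero inside the region) is an assumption consistent with the worked example on the slide.

```python
import math


def mindist_region(q_t, v_low, v_high):
    """Squared distance from the value q_t to the interval [v_low, v_high]."""
    if q_t < v_low:
        return (v_low - q_t) ** 2
    if q_t > v_high:
        return (q_t - v_high) ** 2
    return 0.0


def mindist(query, regions):
    """MINDIST(Q, R): per time instant, take the smallest squared gap
    over the regions active at that instant, then sum and take the root."""
    total = 0.0
    for t, q_t in enumerate(query, start=1):
        gaps = [mindist_region(q_t, lo, hi)
                for (t_start, t_end, lo, hi) in regions
                if t_start <= t <= t_end]
        # If no region is active the instant contributes 0, which keeps
        # the result a valid lower bound.
        total += min(gaps, default=0.0)
    return math.sqrt(total)


regions = [(1, 2, 0.0, 1.0), (2, 4, 3.0, 5.0)]   # two overlapping regions
query = [2.0, 2.0, 6.0, 4.0]
bound = mindist(query, regions)                  # sqrt(1 + 1 + 1 + 0)
```

Any series whose value at each instant lies inside an active region is at least `bound` away from the query, which is what allows the k-nearest-neighbor search to skip whole nodes.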

[Figure: Time to compute the reduced representation. Time (sec) against database size for SVD, APCA, Fourier, Wavelet and PAA]
[Figure: Pruning power. Number of objects examined for a nearest-neighbor query, for Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]
[Figure: Search performance (I/O cost). Number of random disk accesses for Linear Scan, Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]
[Figure: Search performance (CPU cost). CPU time (sec) for Linear Scan, Fourier, Wavelet/PAA and APCA, against original dimensionality (n) and reduced dimensionality (n')]

References
Efficient Similarity Search in Sequence Databases, Rakesh Agrawal, Christos Faloutsos and Arun Swami, FODO conference, Evanston, Illinois, Oct. 13-15, 1993.
Fast Subsequence Matching in Time-Series Databases, Christos Faloutsos, M. Ranganathan and Yannis Manolopoulos, SIGMOD 1994.
Variable Length Queries for Time Series Data, Tamer Kahveci and Ambuj K. Singh, ICDE 2001.
Fast Time-Series Searching with Scaling and Shifting, Kelvin Kam Wing Chu and Man Hon Wong, ACM Principles of Database Systems (PODS), 1999.
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, Eamonn Keogh, Kaushik Chakrabarti, Sharad Mehrotra and Michael Pazzani, ACM SIGMOD 2001.
On Similarity Queries for Time-Series Data: Constraint Specification and Implementation, D. Goldin and P. Kanellakis, Proceedings of the 1st International Conference on Principles and Practice of Constraint Programming, 1995.
Similarity Searching for Multi-Attribute Sequences, T. Kahveci, A. K. Singh and A. Gurel, SSDBM 2002.


More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS First A. Dr. D. Aruna Kumari, Ph.d, ; Second B. Ch.Mounika, Student, Department Of ECM, K L University, chittiprolumounika@gmail.com; Third C.

More information

Wavelet analysis. Wavelet requirements. Example signals. Stationary signal 2 Hz + 10 Hz + 20Hz. Zero mean, oscillatory (wave) Fast decay (let)

Wavelet analysis. Wavelet requirements. Example signals. Stationary signal 2 Hz + 10 Hz + 20Hz. Zero mean, oscillatory (wave) Fast decay (let) Wavelet analysis In the case of Fourier series, the orthonormal basis is generated by integral dilation of a single function e jx Every 2π-periodic square-integrable function is generated by a superposition

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Indexing the Trajectories of Moving Objects in Networks

Indexing the Trajectories of Moving Objects in Networks Indexing the Trajectories of Moving Objects in Networks Victor Teixeira de Almeida Ralf Hartmut Güting Praktische Informatik IV Fernuniversität Hagen, D-5884 Hagen, Germany {victor.almeida, rhg}@fernuni-hagen.de

More information

Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra

Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra João Clímaco and Marta Pascoal A new method to detere unsupported non-doated solutions

More information

Detecting Network Anomalies. Anant Shah

Detecting Network Anomalies. Anant Shah Detecting Network Anomalies using Traffic Modeling Anant Shah Anomaly Detection Anomalies are deviations from established behavior In most cases anomalies are indications of problems The science of extracting

More information

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere! Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel

More information

Visual Analysis of the Behavior of Discovered Rules

Visual Analysis of the Behavior of Discovered Rules Visual Analysis of the Behavior of Discovered Rules Kaidi Zhao, Bing Liu School of Computing National University of Singapore Science Drive, Singapore 117543 {zhaokaid, liub}@comp.nus.edu.sg ABSTRACT Rule

More information

Privacy Preserving Similarity Evaluation of Time Series Data

Privacy Preserving Similarity Evaluation of Time Series Data Privacy Preserving Similarity Evaluation of Time Series Data Haohan Zhu Department of Computer Science Boston University zhu@cs.bu.edu Xianrui Meng Department of Computer Science Boston University xmeng@cs.bu.edu

More information

Alessandro Laio, Maria d Errico and Alex Rodriguez SISSA (Trieste)

Alessandro Laio, Maria d Errico and Alex Rodriguez SISSA (Trieste) Clustering by fast search- and- find of density peaks Alessandro Laio, Maria d Errico and Alex Rodriguez SISSA (Trieste) What is a cluster? clus ter [kluhs- ter], noun 1.a number of things of the same

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

More information

Towards Online Recognition of Handwritten Mathematics

Towards Online Recognition of Handwritten Mathematics Towards Online Recognition of Handwritten Mathematics Vadim Mazalov, joint work with Oleg Golubitsky and Stephen M. Watt Ontario Research Centre for Computer Algebra Department of Computer Science Western

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 goel@yahoo-inc.com John Langford Yahoo! Research New York, NY 10018 jl@yahoo-inc.com Alex Strehl Yahoo! Research New York,

More information

Graphical Representation of Multivariate Data

Graphical Representation of Multivariate Data Graphical Representation of Multivariate Data One difficulty with multivariate data is their visualization, in particular when p > 3. At the very least, we can construct pairwise scatter plots of variables.

More information

Module II: Multimedia Data Mining

Module II: Multimedia Data Mining ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA Module II: Multimedia Data Mining Laurea Magistrale in Ingegneria Informatica University of Bologna Multimedia Data Retrieval Home page: http://www-db.disi.unibo.it/courses/dm/

More information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

Content Delivery Networks. Shaxun Chen April 21, 2009

Content Delivery Networks. Shaxun Chen April 21, 2009 Content Delivery Networks Shaxun Chen April 21, 2009 Outline Introduction to CDN An Industry Example: Akamai A Research Example: CDN over Mobile Networks Conclusion Outline Introduction to CDN An Industry

More information

A hierarchical multicriteria routing model with traffic splitting for MPLS networks

A hierarchical multicriteria routing model with traffic splitting for MPLS networks A hierarchical multicriteria routing model with traffic splitting for MPLS networks João Clímaco, José Craveirinha, Marta Pascoal jclimaco@inesccpt, jcrav@deecucpt, marta@matucpt University of Coimbra

More information

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Zhao Wenbin 1, Zhao Zhengxu 2 1 School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu

More information

3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials

3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials 3. Interpolation Closing the Gaps of Discretization... Beyond Polynomials Closing the Gaps of Discretization... Beyond Polynomials, December 19, 2012 1 3.3. Polynomial Splines Idea of Polynomial Splines

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

CHAPTER-24 Mining Spatial Databases

CHAPTER-24 Mining Spatial Databases CHAPTER-24 Mining Spatial Databases 24.1 Introduction 24.2 Spatial Data Cube Construction and Spatial OLAP 24.3 Spatial Association Analysis 24.4 Spatial Clustering Methods 24.5 Spatial Classification

More information

Big Data Interpolation: An Effcient Sampling Alternative for Sensor Data Aggregation

Big Data Interpolation: An Effcient Sampling Alternative for Sensor Data Aggregation Big Data Interpolation: An Effcient Sampling Alternative for Sensor Data Aggregation Hadassa Daltrophe, Shlomi Dolev, Zvi Lotker Ben-Gurion University Outline Introduction Motivation Problem definition

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Jie Gao Department of Computer Science Stanford University Joint work with Li Zhang Systems Research Center Hewlett-Packard

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

0 )1, 2! 3!( +++ 4 5 677.689 :#4

0 )1, 2! 3!( +++ 4 5 677.689 :#4 ! # % & ( )!! ) ) +++,! &, ( &. / 0 )1, 2! 3!( +++ 4 5 677.689 :#4./7 9.8 7. ; A neural network for mining large volumes of time series data Bojian Liang and James Austin Advanced Computer Architectures

More information

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Load balancing in a heterogeneous computer system by self-organizing Kohonen network Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information