Tutorial on Geographic and Spatial Data Mining


1 Tutorial on Geographic and Spatial Data Mining 5th Italian Symposium on Advanced Database Systems - SEBD 7 Torre Canne, Italy June 7th Fraunhofer Society Joseph von Fraunhofer, German physicist and entrepreneur Fraunhofer mission: - do state-of-the-art research and use it in challenging customer projects - Funding is 33% research grants, 33% customer projects, 33% institutional funding 57 institutes, 4 locations, 2. employees, bill. annual volume Best-known invention: MP3 2

2 Fraunhofer IAIS: Intelligent Analysis- and Information Systems From sensor data to business intelligence, from media analysis to visual information systems: Our research allows companies to do more with data New name, long-standing experience - Founded in 26 as a merger of the Fraunhofer institutes AIS and IMK 23 people: scientists, project engineers, technical and administrative staff Located on Fraunhofer Campus Schloss Birlinghoven/Bonn Joint research groups and cooperation with Univ. Bonn 3 Fraunhofer IAIS: research and projects Core research areas: Machine learning and adaptive systems Data Mining and Business Intelligence Automated media analysis Interactive access and exploration Autonomous systems 4 2

Objectives
Although the tutorial covers statistical concepts, algorithms and data structures, it has a practical, application-oriented focus.
- Integration of various technologies and algorithms: how do they combine?
- Covers a broad range
- No familiarity with spatial concepts is assumed, only some basic familiarity with data mining approaches
Three objectives:
- to stimulate research on issues related to spatial data mining
- to stimulate development of more efficient spatial databases tailored for data mining applications
- to stimulate real-world applications

A main message
Spatial data mining is not an esoteric research topic; it is a practically and commercially very important, and sometimes business-critical, field. Later I give an example where the value of several dozen companies directly depends on the predictions given by our spatial data mining algorithms.

Spatial vs. Geographic Data Mining
- Geographic data is data related to the earth
- Spatial data mining deals with physical space in general, from the molecular to the astronomical level
- Geographic data mining is a subset of spatial data mining
- Almost all geographic data mining algorithms can work in a general spatial setting (with the same dimensionality)
- This tutorial focuses on geographic data in 2D, but most algorithms work on spatial data in general
- I do not talk about the specifics of molecular data, face detection, etc.

Agenda
Introduction: Spatial and Geographic Data Mining
Part I: Basic Concepts
- Spatial Databases and GIS
- Spatial Data Types
- Spatial Queries
- Construction of Complex Features
Part II: Exploratory Analysis of Spatial Data
Part III: Spatial and Geographic Data Mining Methods
- Autocorrelation
- Mining Point Data: Clustering, Kriging
- Mining Points, Lines, Areas: Clustering, Subgroup Discovery, Association Rules
- Mining Networks: A practical case study
- Mining Tracks in Space and Time: Mining from GPS Data
Challenges
Summary

Introduction: Spatial Data Mining

A classical example of spatial analysis
- Disease cluster: Dr. John Snow investigating the causes of a cholera epidemic, London, September 1854
- Infected water pump?
- A good representation is often the key to solving a problem

Good representation because...
- Represents the spatial relation of objects of the same type
- Represents the spatial relation of objects to other objects
- Shows only relevant aspects and hides irrelevant ones
- It is important not only where a cluster is but also what else is there (e.g. a water pump)!

Goals of Spatial Data Mining
- Identifying spatial patterns
- Identifying spatial objects that are potential generators of patterns
- Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)
- Presenting the information in a way that is intuitive to the analyst and supports further analysis

Spatial Data Mining
Data Mining + Geographic Information Systems = Spatial Mining

Basic Concepts: Spatial Databases and GIS

8 Public Sector Are there clusters of a certain disease? Is there a relationship between poverty and death rate? Are there crime hot spots or patterns? Commercial Where to build a new supermarket? Where are the customers that want to buy new product X? How many cars pass the main road per hour? Does it pay to install new antennas? What percentage of young females sees a billboard 5 located in Ripley avenue? Buildings Streets Schools Hospitals Rivers Factory Attribute Data Person p. Household No. of Cars Long-term illness Age Profession Ethnic group Unemployment Education Migrants Medical establishment 6 Shopping areas... 8

Elements of a spatial database
- Spatial data types
- Spatial operators (e.g. INSIDE)
- Spatial indexes
- Metadata
- Spatial query language, for example:
    SELECT c.holding_company, c.location
    FROM competitor c, bank b
    WHERE b.site_id = 64
      AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'
Examples from Oracle Spatial

Spatial Datatypes

10 Two basic types of representation: Fields and Discrete Objects Fields: Raster Data Line Discrete Objects: Vector Data Model Area 9 Vector Data: Data Structure Ordered sets of xy-coordinates defining points, lines, or polygons 3D or 4D also possible Straight lines between points Draw line from last to first coordinate Easy to scale (linear transformation) Data Structure Point Line (Polyline) Area (Polygon) (5,) ((5,),(9,6),(2,7)) ((5,),(9,6),(2,7), ) Storage efficient Relationships between objects (e.g. overlap) are not explicitly represented Aka Spaghetti Model 2

11 Two Main Types of Vector Data - non regular tesselations closed polylines that partition the space - discrete isolated objects: point, line, area Point Line Area (Polygon) Tesselations very useful for aggregation of discrete objects and for feature extraction 2 UK, Greater Manchester, Stockport Buildings ID Geometry Address (,),(2,2), Gladstone Street 5 2 (3,3),(4,4), Islington Road 2 Geometry Address Type 3 (5,5),(6,6), Ripley Avenue 23 Type 2 Factory ID GeometryName Ty pe (,), Gladstone Street 5 2 (3,3), Islington Road (5,5), Ripley Avenue 23 Hospitals Geometry Address Schools Streets ID GeometryName Ty pe (,), Gladstone Street 5 2 (3,3), Islington Road (5,5), Ripley Avenue 23 ID GeometryName Ty pe (,), Gladstone Street 5 2 (3,3), Islington Road (5,5), Ripley Avenue 23 Phone #Beds ID 2 Geometry (,),(2,2), (3,3),(4,4), Address Stepping Hill Great Moore Phone Rivers ID GeometryName Ty pe (,), Gladstone Street 5 2 (3,3), Islington Road (5,5), Ripley Avenue Description of objects are organized in relations (database tables) Each row in a table describes one object Different categories of objects are organized in separate relations each having its own set of attributes.

12 Hierarchy Often data are organized in spatial hierarchies, e.g. Country State Zip Area Voting District District Parcel County District 2 UK census data District n Hierarchies may overlap Ward Ward 2 Ward n Ward Ward Ward 23 Representation of data in a spatial database A set of relations R,...,R n such that each relation R i has a geometry attribute G i or an identifier A i such that R i can be linked (joined) to a relation R k having a geometry attribute G k - Geometry attributes G i consist of ordered sets of x,y-pairs defining points, lines, or polygons - Different types of spatial objects are organized in different relations R i (geographic layers), e.g. streets, rivers, enumeration districts, buildings, and - each layer can have its own set of attributes A,..., A n and at most one geometry attribute G 24 2

13 Representation of data in a spatial database A set of relations R,...,R n such that each relation R i has a geometry attribute G i or an identifier A i such that R i can be linked (joined) to a relation R k having a geometry attribute G k - Geometry attributes G i consist of ordered sets of x,y-pairs defining points, lines, or polygons - Different types of spatial objects are organized in different relations R i (geographic layers), e.g. streets, rivers, enumeration districts, buildings, and Does not fit well to - each layer can have its own set of attributes A,..., A n and standard at most data mining one geometry attribute G approaches! This is where the specific research challenge for geographical data mining comes from! 25 Raster Data How to represent phenomena conceived as fields? Divide the world into square cells No variation within cells Cell value may be average, max, min, sum,central point, Represent discrete objects as collections of one or more cells Represent fields by assigning attribute values to cells Legend Raster representation. Each color represents a different value of a nominalscale field Mixed conifer Douglas fir Oak savannah Grassland Longley et al (2) 26 3

14 Raster and Vector: Comparison Legend Mixed conifer Douglas fir Oak savannah Grassland Raster Modell Advantages: Simple data structure Simple logical and algebraic structures Disadvantages: Large data volumes imprecise geometry expensive transformations of coordinates implicit coordinates Vector Model Advantages: Specify geometry by coordinates Topological relationships High geometric accuracy Storage efficient Disadvantages: Complex data structure Compute intensive logical and algebraic operations Remember: Raster is vaster and vector is correcter 27 Spatial Queries p n ( p) ( p p ) 28 4

Spatial Queries
- Problem: the vector data model does not explicitly capture relationships among objects; they have to be inferred using spatial predicates
- Spatial predicates evaluate to true or false for given objects
- A query returns the set of objects for which the statement is true or, using aggregates (minimum, maximum, sum, average, ...), a summary of the objects for which the statement is true
- Queries are evaluated using a spatial join among different relations (layers)
- Here is where database technology and spatial indexing come in to do the job efficiently. Still, spatial queries can be extremely time-consuming!

Spatial Predicates: Egenhofer's 9-intersection model
- Each object has an interior (i), a boundary (b) and an exterior (e)
- This results in a 9-intersection matrix for the relation between two spatial objects A and B: the rows are the boundary, interior and exterior of A, the columns those of B
- A cell contains a 1 iff the intersection of the corresponding point sets is non-empty
[Figure: 9-intersection matrices for "A meets B", "A overlaps B" and "A contains B"]

Spatial Predicates
9-intersection model for 2 regions (Egenhofer 99):
- A disjoint B, B disjoint A
- A meets B, B meets A
- A overlaps B, B overlaps A
- A equals B, B equals A
- A covers B, B covered-by A
- A covered-by B, B covers A
- A contains B, B inside A
- A inside B, B contains A

Spatial Queries: Distance
Metric spaces:
- Symmetry: $d(i,j) = d(j,i)$
- Triangle inequality: $d(i,k) \le d(i,j) + d(j,k)$
- Euclidean distance: $d_e(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
Distance relation between polygons: minimum distance between any 2 points of the polygons
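The distance relation between polygons can be illustrated with a small sketch. It assumes that polygons are given as lists of (x, y) vertices, as in the vector data model, and that the two polygons neither intersect nor contain each other; under that assumption the minimum distance is always attained at a vertex of one polygon and an edge of the other. The function names are illustrative and not taken from any spatial database product.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as (x, y) pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_segment_distance(p, a, b):
    """Distance from point p to the straight segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return euclidean(p, a)
    # Projection parameter of p onto the line a-b, clamped to the segment.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return euclidean(p, (a[0] + t * dx, a[1] + t * dy))

def edges(polygon):
    """Edges of a polygon, closing the ring from the last to the first vertex."""
    return [(polygon[i], polygon[(i + 1) % len(polygon)])
            for i in range(len(polygon))]

def polygon_distance(poly_a, poly_b):
    """Minimum distance between two polygons.

    Assumption: the polygons do not intersect or contain each other.
    """
    best = math.inf
    for a1, a2 in edges(poly_a):
        for b1, b2 in edges(poly_b):
            best = min(best,
                       point_segment_distance(a1, b1, b2),
                       point_segment_distance(a2, b1, b2),
                       point_segment_distance(b1, a1, a2),
                       point_segment_distance(b2, a1, a2))
    return best

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(2, 0.5), (3, 0), (3, 1)]
print(polygon_distance(square, triangle))
```

This brute-force version compares every pair of edges; a spatial database would first narrow the candidates with a spatial index.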

Spatial Queries: Distance and Proximity
- Select the nearest neighbor in space
- Select all objects within a certain distance
Example (Oracle Spatial): select all competitors and their locations within 2 miles of the bank with id 64
    SELECT c.holding_company, c.location
    FROM competitor c, bank b
    WHERE b.site_id = 64
      AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'
[Figure: distance from location X to Hospital #1 (Main Street) and Hospital #2]

Distance in non-metric spaces
- Asymmetry: $d(i,j) \ne d(j,i)$
- The triangle inequality does not hold
- Examples: drive time, driving distance, costs

18 Stockport Database Schema River Water spatially interacts Shopping Region Spatial Join spatially interacts spatially interacts ED Standard Join =zone_id =zone_id TAB... TAB6 Attribute data 95 tables with census data, ~8 attributes Spatial Hierarchy Geographical Layers Building inside Street spatially interact spatially interact Vegetation =zone_id... TAB95 County District Wards Enumeration district 85 tables Relations between objects 35 implicit; very flexible and storage efficient, but compute intensive Implementation of Spatial Databases Many popular databases have spatial extensions by now: Oracle Spatial PostgreSQL MySQL (since 4.) 36 8

19 Construction of Complex Features p n ( p) ( p p ) 37 Spatial Functions Example: Oracle Spatial g Return a geometry - Union - Difference - Intersect Constructs new geometry objects from existing ones using point set theory Original Union - XOR - Buffer - CenterPoint - ConvexHull Return a number - Length Efficient implementation using computational geometry Difference XOR Intersect - Area - Distance

Constructing Cells: Buffer
How many competitors are in the catchment area of my shop? = How many shops are within the buffer?
- Simplistic approximation
- Does not take account of barriers (rivers, highways)
- Does not take into account the road system

Voronoi diagram
Which are my nearest competitors? What is the coverage of my radio antenna? = Find Voronoi neighbors
- Decompose space into regions around each point in a set of points S such that all the points in the region around $p_i$ are closer to $p_i$ than to any other point in S
- Complexity: $O(n \log n)$
- Related data structure: Delaunay triangulation (graph of Voronoi neighbors)
- Approximation
- Does not take account of barriers (rivers, highways)
- Does not take into account the road system
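Both constructions can be mimicked on point data with a few lines of code. The sketch below assumes shops are plain (x, y) points and uses straight-line distance, so it shares the limitations listed on the slides (no barriers, no road system). It answers the buffer question by filtering on a radius, and the Voronoi question by brute-force nearest-site lookup; it does not build the diagram itself, which is what the O(n log n) bound refers to. Function names are illustrative.

```python
import math

def within_buffer(center, points, radius):
    """Points that fall inside a circular buffer around center."""
    return [p for p in points if math.dist(center, p) <= radius]

def nearest_site(sites, query):
    """Index of the site whose Voronoi region contains the query point.

    Brute force: O(n) per query, no precomputed diagram.
    """
    return min(range(len(sites)), key=lambda i: math.dist(sites[i], query))

shop = (0, 0)
competitors = [(1, 0), (3, 4), (0, 2)]
print(within_buffer(shop, competitors, 2))   # competitors in the catchment area

antennas = [(0, 0), (10, 0)]
print(nearest_site(antennas, (4, 1)))        # which antenna covers this location
```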

Drive-Time Zone (Dijkstra)
How many competitors are in the catchment area of my shop?
- All street segments within a drive-time distance <= d from a given starting point
- Use Dijkstra's algorithm
- Complexity: between $O(|V|^2)$ and $O(|V| \log |V| + |E|)$, depending on the data structures used for the implementation
- Realistic approximation
- Takes account of barriers (rivers, highways)
- Takes into account the road system and the maximum speed on each road

Pre-processing
Several of the feature extractions are computationally quite expensive (at least for large data sets), and there is often a combinatorial explosion of features that might be constructed. Several strategies are used in spatial warehouse design:
- Selective pre-processing: materializing important joins in advance (storage requirements!)
- Approximate precomputing: e.g. using the minimum bounding rectangle to approximate a polygon
- Schema design (e.g. star schema with selective materialization): Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD
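A drive-time zone can be sketched with Dijkstra's algorithm on a small road graph. The graph encoding below (a dictionary mapping each junction to a list of (neighbor, drive time) pairs) and the junction names are assumptions made for the illustration; the sketch returns reachable junctions rather than street segments. It uses a binary heap from the standard library, which gives O((|V| + |E|) log |V|), between the two bounds quoted on the slide.

```python
import heapq

def drive_time_zone(graph, start, max_time):
    """Junctions reachable from start within max_time, with their drive times.

    graph: dict mapping a junction to a list of (neighbor, drive_time) pairs.
    """
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        time, node = heapq.heappop(heap)
        if time > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, cost in graph.get(node, []):
            new_time = time + cost
            # Only keep routes that stay inside the zone and improve on the best known time.
            if new_time <= max_time and new_time < best.get(neighbor, float("inf")):
                best[neighbor] = new_time
                heapq.heappush(heap, (new_time, neighbor))
    return best

# Hypothetical road network: drive times in minutes, one-way edges.
roads = {
    "shop": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1)],
    "c": [],
}
print(drive_time_zone(roads, "shop", 3))
```

Competitors in the catchment area are then those located at, or along streets between, the returned junctions.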

Spatial Database of Vector Objects: Discussion
- Relations between objects are implicit
  - very flexible: depending on the analysis task, different relationships can be constructed
  - storage efficient: no overhead for storing relationship information
  - compute intensive (thus spatial indexing is very important)
  - consider what and when to materialize
- Very rich possibilities to create new, non-trivial objects from existing ones
  - makes feature extraction an important topic for data mining
- Inherently multi-relational setting (but not first-order)
  - could also be formulated in a deductive database setting

Interactive Visualization of Spatial Data: Exploratory Data Analysis

Interactive Visualization of Spatial Data: Exploratory Data Analysis
(work by G. Andrienko & N. Andrienko, H. Voss and others at Fraunhofer IAIS)
For the theory behind CommonGIS, see the book: Andrienko, N. and Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer.

Geographic Information Systems and CommonGIS
Many commercial tools are available:
- ESRI ArcGIS
- MapInfo
- Intergraph
- Manifold
But CommonGIS is different and unique:
- Map-based exploratory data analysis
- Stresses interactive visualization and manipulation of statistical data in space
- Elaborate facilities for time-series visualization
CommonGIS can be acquired for non-commercial use by educational institutions for no fee. See web page

24 CommonGIS = Fraunhofer IAIS Tool for Map-based Exploratory Data Analysis - combines interactive cartography and statistics Multi-dimensional - Time-series visualization and analysis - Combines Vector-Raster transformation Decision support - Weighted Sums - Ideal Point Analysis Multivariate - Similarity analysis - Dominant Attribut - Integration with Weka (Clustering, Decision Trees) 47 CommonGIS: Visual analysis of spatial data Interactive spatial search for geographic objects and recognition of spatial patterns: dynamic choropleth maps, pie charts, bar charts, etc. with dynamic removal of outliers and dynamic queries Comparison of attribute values of geographic objects (relations and correlations) and comparison of spatial patterns (spatial correlations): (Linked) dynamic maps and interactive diagrams multiple (linked) dynamic maps 48 24

25 CommonGIS: Visual analysis of spatio-temporal data CommonGIS as an interactive browser to study how a spatial pattern evolves over time: time aware maps (animations) time series charts CommonGIS as an interactive browser for temporal behaviours of objects: set of controls for analysing time intervals (object animations) CommonGIS as an interactive browser of discrete space-time events to find spatiotemporal clusters: space-time cube 49 Time Series Sales per Shop and Product Category 5 25

26 Time-Series: Sales per Shop and Product Category Bäckerei Stehcafé Sitzcafé Terrasse Different Time Hierarchies (Year, Quarter, Month, Day ) 5 CommonGIS: Data transformation Transformation of data for further analysis: Attribute transformations: calculate statistical indices transform and combine attribute data arithmetically dynamic classifiers (linked with dynamic choropleth map) cross classifiers (linked with dynamic choropleth map) Geographic transformations: query, transform, combine, derive raster data illumination model raster -> vector transformations (i.e. raster -> area aggregation) point/line -> raster transformations 52 26

27 CommonGIS: Combination of Vector and image data 53 27

Geographic and Spatial Data Mining Methods

Autocorrelation

Spatial Variation
[Figure: field of soil moisture (Franke, diploma thesis, Leipzig Univ.)]
How are variables distributed in space?
Tobler's First Law of Geography: "Everything is related to everything else, but near things are more related than distant things."
- The distribution of variables depends on space
- Variables are autocorrelated

Spatial Autocorrelation: Binary Example
- Binary attribute (blue, white)
- Autocorrelation with respect to the four immediate neighbors
- Moran index (here): $I = \frac{n_{\text{equal}} - n_{\text{change}}}{n_{\text{equal}} + n_{\text{change}}}$
[Figure: four example grids with different values of I, including I = 0.86 and I = 0.39 (Goodchild, CATMOG, GeoBooks, Norwich)]

Moran's I
- Moran's I is a measure of spatial autocorrelation. It is a weighted product-moment correlation coefficient, where the weights reflect geographic proximity, and it is used to detect departures from spatial randomness.
- Departures from randomness indicate spatial patterns such as clusters and geographic trends.
- Values of I larger than 0 indicate positive spatial autocorrelation; values smaller than 0 indicate negative spatial autocorrelation.
With z the attribute of interest, $w_{ij}$ the weights and n the number of areal objects:
$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left(\sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} w_{ij}\right) \sum_{i=1}^{n} (z_i - \bar{z})^2}$$
[Figure: example with n = 4 areas A, B, C, D and the corresponding 4 x 4 weight matrix $w_{ij}$]

Spatial Autocorrelation
- similarity in location indicates similarity in attribute value
- differs from temporal autocorrelation:
  - 1-dimensional autocorrelation in time series, whereas spatial autocorrelation spreads in 2 or 3 dimensions
  - only forward causality in time series, whereas the direction of causality is not restricted in space
- depends on scale
[Figure: sunspot time series (number of sunspots by year); temperature of sunspots]
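Moran's I is easy to compute by hand for a small example. The sketch below assumes four areas A, B, C, D arranged in a 2 x 2 grid and a binary weight matrix in which areas sharing an edge get weight 1; the weight values on the original slide are not recoverable, so this matrix is an assumption for illustration.

```python
def morans_i(z, w):
    """Moran's I for attribute values z and a weight matrix w (list of lists)."""
    n = len(z)
    mean = sum(z) / n
    dev = [value - mean for value in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(w[i][j] * dev[i] * dev[j] for i, j in pairs)
    weight_sum = sum(w[i][j] for i, j in pairs)
    variance_sum = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_sum)

# Assumed layout:  A B
#                  C D   (areas sharing an edge are neighbors)
weights = [
    [0, 1, 1, 0],  # A: neighbors B, C
    [1, 0, 0, 1],  # B: neighbors A, D
    [1, 0, 0, 1],  # C: neighbors A, D
    [0, 1, 1, 0],  # D: neighbors B, C
]
print(morans_i([1, 0, 0, 1], weights))  # checkerboard pattern: every neighbor differs
print(morans_i([1, 1, 0, 0], weights))  # top row vs. bottom row
```

The checkerboard pattern yields the strongest possible negative autocorrelation on this grid, since every pair of neighbors has opposite deviations from the mean.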

Effects of Autocorrelation
- makes spatial abstraction possible
- makes standard approaches of analysis impossible: most statistics assume iid (independent and identically distributed) data
- makes local inference attractive: kriging, kNN, ...
- makes the choice of sampling interval hard: autocorrelation depends on scale
- makes interpolation easier than extrapolation
[Figure: correlation (+/-) against distance, showing spatial autocorrelation; zero autocorrelation = independence of location]

Problem types for Spatial Data Mining
Spatial Data Mining := partially automated search for patterns and models in large spatial databases
Classification of methods along the following hierarchy:
- Points
- Points, lines and areas
- Networks
- Tracks in space and time

Handling spatial data in Data Mining: Basic Options
- Treat as ordinary variables
  - no special algorithms needed
  - spatial properties are ignored, e.g. discontiguous areas
- Make spatial relationships explicit
  - e.g. infer topological relationships
  - expensive, but allows normal algorithms to be used
  - can be done as pre-processing or dynamically (the latter requires specialized algorithms)
- Specialized algorithms
  - neighborhood methods, kriging, Gaussian processes, density-based clustering
Use a proper combination of data, preprocessing, algorithms, and interaction software!

Mining Point Data

33 Mining Point Data Time Complexity Points Space Complexity 64 Clustering spatial point data Point data conceived as discrete objects Many approaches exists for clustering spatial point data In statistics, measures of spatial randomness or non-randomness have been developed (e.g. Ripley 99, Cressie 993) - Ripley s K function as measuring deviation from complete spatial randomness (as exemplified by a Poisson process) - Moran s I, which measures autocorrelation Bayesian approaches often coming from image analysis (cf. Lawson et al 22) In Geography, spatial clustering algorithms have been developed (Openshaw, GAM, 99) 65 6

Density-Based Clustering: a KDD approach [Ester et al. 1996]
- Suitable for large databases
- Discovers areas of high density and turns them into clusters
- Discovers clusters of arbitrary shape
- Can handle noise
- Algorithm: DBSCAN
Note: a relatively straightforward extension to vector data is possible (GDBSCAN); it requires a more complex definition of some key concepts (neighborhood and MinPts)

Clustering spatial data
- Distance-based clustering is inherently spatial
- but the assumption of convex clusters (e.g. k-means) is inappropriate for many geographical tasks
[Figure: point clusters of non-convex shape; source: Ester et al. 1997]

Definitions
Eps-neighborhood of a point p: $N_\varepsilon(p) := \{q \in D \mid dist(p, q) \le \varepsilon\}$
The choice of Eps is crucial!
A point p is directly density-reachable from q iff
1. $p \in N_\varepsilon(q)$
2. $|N_\varepsilon(q)| > MinPts$ (q is a core object)
- Not necessarily symmetric
[Figure: p is a border object, q is a core object; p is directly density-reachable from q, but q is not directly density-reachable from p]

Definitions 2
- Density-reachable: p is density-reachable from a point q wrt. Eps and MinPts iff there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$. Transitive, not symmetric.
- Density-connected: p is density-connected to q iff there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts. Symmetric.
[Figure: p is density-reachable from q, q is not density-reachable from p; p and q are density-connected to each other by o]

Density-connected clustering
A cluster C wrt. Eps and MinPts is a non-empty subset of the database D such that
(1) $\forall p, q$: if $p \in C$ and q is density-reachable from p wrt. Eps and MinPts, then $q \in C$
(2) $\forall p, q \in C$: p is density-connected to q wrt. Eps and MinPts
- Non-covered points are noise
- Each cluster contains at least MinPts points
- Exactly one clustering

Algorithm DBSCAN: Basic Idea
- Check the Eps-neighborhood of every unclassified point in the database
- If the neighborhood of p contains more than MinPts points, a new cluster with p as core object is built
- Collect directly density-reachable objects from this set, merging clusters as necessary
- Terminate when no new point can be added to any cluster
Complexity: $O(n \log n)$ when a spatial index is used, otherwise $O(n^2)$
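The basic idea can be written down as a minimal sketch. It follows the core-object condition as stated on the slides (more than MinPts points in the Eps-neighborhood, where the neighborhood includes the point itself); the original paper by Ester et al. uses "at least MinPts" instead. Neighborhoods are found by a linear scan, so this version runs in O(n^2); a spatial index would be needed for the O(n log n) bound. All function and variable names are illustrative.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i] (including i)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks non-covered points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) <= min_pts:      # slide condition: core needs > MinPts
            labels[i] = NOISE              # may later become a border object
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border object: reachable but not core
                continue
            if labels[j] is not None:
                continue                   # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) > min_pts:
                seeds.extend(j_neighbors)  # j is a core object: keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=2))
```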

Kriging: Spatial Interpolation

Kriging
- developed by G. Matheron in the 1960s, based on work of D. Krige
- geostatistical method of interpolation
- point data are conceived as samples from a continuous surface
- results are smoothly varying surfaces
- provides optimality given its assumptions (best linear unbiased estimate)
- variety of methods, e.g. Ordinary Kriging, Universal Kriging, Co-Kriging, Block Kriging, Stratified Kriging, Indicator Kriging, ...
[Figure: measurements and unknown values (marked "?")]
Good introduction: Burrough, P., McDonnell, R

Spatial Variation
Problem: the spatial variation of a continuous attribute is often too irregular to be modelled by a simple, smooth mathematical function
Solution: the variation can be described by a stochastic surface
- A stochastic process is a family of random variables Z(x) over the index set $D \subseteq \mathbb{R}^n$: $\{Z(x) : x \in D\}$
  - x: location in n-dimensional space
  - Z(x): random variable of interest, e.g. soil moisture
- A Gaussian process is a stochastic process for which any finite set of Z-variables has a joint multivariate Gaussian distribution.

Components of Spatial Variation
- structural component, having a constant mean or trend: $m(x)$
- random, but spatially correlated component (regionalized variable): $\varepsilon'(x)$
- spatially uncorrelated random noise term: $\varepsilon''$
$Z(x) = m(x) + \varepsilon'(x) + \varepsilon''$
The value at location x is a random variable.
[Figure: Z(x) decomposed into trend, autocorrelation and random noise]

Stationarity
Problem: a spatial data set is a single realization of a random process; inference is impossible without further restrictions on the spatial variation
Intrinsic stationarity (stationarity under translation):
- constant mean (E[...] = 0) or trend (E[...] > 0): $E[Z(x) - Z(x+h)] = \text{const.}$
- the variance of differences is independent of location: $E[\{Z(x) - Z(x+h)\}^2] = 2\gamma(h)$
Isotropy (stationarity under rotation): the spatial process evolves the same in all directions

Ordinary Kriging
Assumptions: intrinsic stationarity with a constant mean
- constant mean value in the sampling area: $E[Z(x) - Z(x+h)] = 0$
- the variance of differences depends only on the distance h between sites:
  $Var[Z(x) - Z(x+h)] = E[\{Z(x) - Z(x+h)\}^2] = E[\{\varepsilon'(x) - \varepsilon'(x+h)\}^2] = 2\gamma(h)$
Once structural effects have been accounted for, the remaining variation is homogeneous, so that differences between sites are merely a function of the distance between them. $\gamma(h)$ is the semivariance.

Ordinary Kriging
Procedure:
1. Estimate the semivariance $\gamma(h)$ from the data sample
2. Plot the experimental variogram
3. Fit a theoretical model to the experimental variogram
4. Estimate unknown values as a weighted sum of neighboring measurements; determine the optimal weights from the variogram

Semivariance and Experimental Variogram
- The semivariance depends only on the distance (lag) h
- Estimate the semivariance between all pairs of measurements with distance h (repeat for all possible h):
  $\hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \{z(x_i) - z(x_i + h)\}^2$, where n is the number of pairs of measurements separated by lag h
[Figure: experimental variogram, $\gamma(h)$ plotted against lag h]
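Step 1 of the procedure can be sketched as follows. Real measurements rarely lie at exactly the same distance from each other, so this sketch groups pairs into lag bins of a fixed width; the binning scheme, the function name and the sample transect are assumptions for illustration. For each bin it applies the estimator above: half the mean squared difference over all pairs in the bin.

```python
import math
from collections import defaultdict

def experimental_variogram(points, values, lag_width):
    """Estimate the semivariance per lag bin.

    Returns a dict mapping bin index k to the semivariance of all pairs whose
    distance h satisfies k * lag_width <= h < (k + 1) * lag_width.
    """
    squared_diffs = defaultdict(float)
    pair_counts = defaultdict(int)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.dist(points[i], points[j])
            k = int(h // lag_width)
            squared_diffs[k] += (values[i] - values[j]) ** 2
            pair_counts[k] += 1
    # gamma(h) = 1 / (2n) * sum of squared differences, n = number of pairs
    return {k: squared_diffs[k] / (2 * pair_counts[k]) for k in sorted(pair_counts)}

# Hypothetical transect: four measurements at unit spacing.
locations = [(0, 0), (1, 0), (2, 0), (3, 0)]
measurements = [1, 2, 4, 7]
print(experimental_variogram(locations, measurements, lag_width=1.0))
```

In this example the semivariance grows with the lag, which is the pattern expected for spatially autocorrelated data below the range.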

Variogram
[Figure: variogram $\gamma(h)$ against lag h, with nugget, range and sill marked]
nugget:
- $\gamma(0) = 0$ (by definition)
- the nugget effect represents small-scale variation and measurement errors
- estimate of $\varepsilon''$
range:
- spatial dependency
- here, the variance of differences increases with distance
- two points are more similar the closer they are
sill:
- the semivariance levels off
- the variance of differences is independent of distance

Variogram Models
- The experimental variogram must be fitted to an appropriate variogram model
- Most commonly used are the spherical, exponential, linear and Gaussian models
[Figure: $\gamma(h)$ against lag h for the spherical, exponential, linear and Gaussian models]

Interpolation of unknown Values
- The unknown value at location $x_0$ is estimated as a weighted sum of neighboring measurements: $Z^*(x_0) = \sum_{i=1}^{n} w_i Z(x_i)$
- The weights $w_i$ are determined according to two restrictions:
  1. $Z^*(x_0)$ is an unbiased estimate of $Z(x_0)$
  2. $Z^*(x_0)$ is an optimal estimate
- This requires solving a system of n+1 linear equations in the semivariances and weights

Equation System
- The restriction on the weights (Restriction 1) introduces the Lagrange parameter $\varphi$
- A system of (n+1) equations must be solved to obtain the optimal weights for each $x_0$:
$$\begin{pmatrix} \gamma(x_1, x_1) & \cdots & \gamma(x_1, x_n) & 1 \\ \vdots & \ddots & \vdots & \vdots \\ \gamma(x_n, x_1) & \cdots & \gamma(x_n, x_n) & 1 \\ 1 & \cdots & 1 & 0 \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_n \\ \varphi \end{pmatrix} = \begin{pmatrix} \gamma(x_1, x_0) \\ \vdots \\ \gamma(x_n, x_0) \\ 1 \end{pmatrix}$$
- Ordinary Kriging is an exact interpolator, i.e. the interpolated value at a sample location is identical to the measurement taken there
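The equation system can be set up and solved directly for a handful of measurements. The sketch below assumes the standard ordinary kriging system, in which the unbiasedness restriction forces the weights to sum to 1, and it uses a linear variogram model gamma(h) = h without nugget purely for illustration; in practice the model would come from fitting the experimental variogram. The small Gaussian elimination routine keeps the example free of external libraries.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def ordinary_kriging(points, values, x0, gamma):
    """Estimate the value at x0 from measurements; returns (estimate, weights)."""
    n = len(points)
    # Semivariances between all pairs of sample points, bordered by the
    # row and column of ones that belong to the Lagrange parameter.
    a = [[gamma(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(p, x0)) for p in points] + [1.0]
    solution = solve(a, b)
    weights = solution[:n]
    estimate = sum(w * z for w, z in zip(weights, values))
    return estimate, weights

def linear_model(h):
    """Assumed variogram model for this example: gamma(h) = h."""
    return h

samples = [(0, 0), (2, 0)]
measured = [10.0, 20.0]
print(ordinary_kriging(samples, measured, (1, 0), linear_model))
print(ordinary_kriging(samples, measured, (0, 0), linear_model))
```

The second call illustrates the exact-interpolator property: at a sample location the whole weight falls on that sample.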

Variants of Kriging
- Universal Kriging: the structural component may contain an external trend
- Co-Kriging: interpolation for one attribute incorporates information from another, correlated attribute; sparse measurements of an expensive variable are supported by plentiful measurements of a cheap variable
- Stratified Kriging: interpolation within sub-areas; the equations are adjusted to avoid discontinuities at boundaries
More details: Burrough, P., McDonnell, R

Mining Points, Lines, and Areas

44 Points, Lines and Areas Time Complexity Points, Lines, and Areas Points Space Complexity 86 Points, Lines and Areas Requirements: Point data Polygons aggregations Applications Customer Segmentation, Catchment Areas, Location Planning, Radio Network Analysis Examples: GDBScan Clustering Spatial Subgroup Minig Spatial Association Rules Spatial Model Trees 87 7

Clustering of Vector Data: GDBSCAN [Sander et al. 1998]
Extension of DBSCAN. Sample instantiations:
- Neighborhood: dist < ε, intersects/meets, neighbor
- Minimum weight of the neighborhood S: |S| ≥ MinCard, area ≥ MinArea, f(S) ≥ MinF

Spatial Subgroup Mining

Typical Data Mining representation
- spreadsheet data: exactly one table, atomic values
- Data mining for spatial data: very different from this representation

Subgroup Discovery Search (Klösgen 1996, Wrobel 1997)
- Subgroup discovery searches for deviation patterns: subgroups with a disproportionately high share of the target value (or mean of the target variable)
- Top-down search from the most general to the most specific subgroups, exploiting the partial ordering of subgroups ($S_1$ more general than $S_2$)
- Beam search expands only the n best ones at each level
- Hypotheses are evaluated according to the quality function
  $q = \frac{p(T|C) - p(T)}{\sqrt{p(T)(1 - p(T))}} \sqrt{n} \sqrt{\frac{N}{N - n}}$
  where N = total population, n = subgroup size, p(T) = target share in the total population, p(T|C) = target share in the subgroup
- Extension to a multi-relational representation in Wrobel (1997)
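The quality function rewards subgroups that are both large and strongly deviating. The sketch below reads the formula as the deviation of the target share, standardized by the population spread, scaled by the square root of the subgroup size and by a finite-population factor N / (N - n); the numbers in the example are invented for illustration.

```python
import math

def subgroup_quality(p_subgroup, p_total, n, N):
    """Quality of a subgroup of size n out of N objects.

    p_subgroup: target share in the subgroup, p(T|C)
    p_total:    target share in the total population, p(T)
    """
    deviation = (p_subgroup - p_total) / math.sqrt(p_total * (1 - p_total))
    return deviation * math.sqrt(n) * math.sqrt(N / (N - n))

# Hypothetical figures: 20% of 500 districts have a high illness rate,
# versus 50% of the 100 districts in the subgroup.
print(subgroup_quality(0.5, 0.2, n=100, N=500))
# A subgroup with the same target share as the population scores zero.
print(subgroup_quality(0.2, 0.2, n=100, N=500))
```

In a beam search, each candidate subgroup would be scored this way and only the best ones expanded at the next level.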

Translating Multirelational Subgroups to Object-relational SQL
Domain: a relational database schema $D = \{R_1, \dots, R_n\}$ with geometry attributes $G_i$
Hypothesis language: multirelational subgroups are represented by
- a concept set $C = \{C_i\}$, where each $C_i$ consists of a set of attribute-value pairs $\{A_1 = v_1, \dots, A_n = v_n\}$ from a relation in D
- a set of links $L = \{L_i\}$ linking concepts $C_i$, $C_k$ via their attributes, of the form ($C_i/A_m$ {= | inside | overlaps | ... | spatially_interact} $C_k/A_n$)
- a target attribute, which can be non-numeric (A = v) or a numeric aggregate (avg(A) = n)
Example:
C = {{district.long_term_illness=high, district.unemployment=high}, {street.name='Manchester Road'}}
L = {{district.geometry spatially_interact street.geometry}}
i.e. enumeration districts with a high rate of long-term illness and unemployment that are crossed by Manchester Road
Testing satisfaction of subgroup descriptions: the number of tuples in D that satisfy a subgroup description is evaluated using SQL select statements including joins over multiple relations.

Approach: Translation of Spatial Subgroup Mining to SQL (Klösgen, May 2002)
- Representing subgroups in object-relational SQL, i.e. a multi-relational representation
- Using a representation for spatial geometry based on a spatial database
- Division of work between RDBMS and Search Manager
- Combining visualization in abstract and physical space

48 Division of labour between RDBMS and Search Manager (May, Savinov 23) mining query Database Server Search Algorithm statistics Database integration: efficiently organize mining queries Mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses Mining Server search in hypothesis space generation and evaluation of hypotheses (subgroup patterns) 94 SPIN! Spatial Data Mining System Workspace Property Editor Flowchart-Tool Subgroup Result List Subgroup Viewer 95 2

49 Interactive Exploratory Analysis
- Combination of spatial and non-spatial visualization
- User selects and manipulates variables
- Powerful for analysis in low dimensions (3-4)
- Displays are dynamically linked
[Figure panels: Parallel Coordinate Plot; Choropleth Maps; Scatter Plot]

Visualization of spatial subgroups
- Example: high long-term illness in districts crossed by M6
[Figure panels: Subgroup Overview (p(T|C) vs. p(C)); Spatial Venn Diagram; Subgroup Linked Display]

50 Radio Network Planning in Telecommunication
SPIN! Mapviewer (Common GIS)
- Example subgroup: high call cut-off ratio in mountainous regions crossed by highways having a certain technical configuration
[Map legend: blue: motorway; brown: high altitude; black: subgroup]

Other commercial applications of Subgroup Discovery
- How are my customers characterized? Are there interesting profiles?
- Where to open the next supermarket? Does it create competition for my other supermarkets?
- Should I invest in UMTS in rural areas?

51 Spatial Association Rules
work and slides by Donato Malerba et al., Univ. Bari

Spatial association rules
- An association pattern P (s%) is a spatial association pattern if it contains at least one spatial relation
  Example: A large town intersects a road and is adjacent to water (62%)
- An association rule Q → R (s%, c%) is a spatial association rule if Q and R together form a spatial association pattern
  Example: IF a large town intersects a road THEN it is also adjacent to water (62%, 89%)
- Seminal work by Koperski & Han 1995
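The two percentages attached to a rule, support s% and confidence c%, can be illustrated with a deliberately simplified Python sketch. It assumes a propositional encoding in which each reference object is described by a set of atoms, some of which name spatial relations; this is only an illustration of the definitions, not the first-order (ILP) representation used by SPADA. All names and the four example towns are hypothetical.

```python
def support(objects, pattern):
    """Share of reference objects whose description contains every atom of the pattern."""
    matching = sum(1 for atoms in objects.values() if pattern <= atoms)
    return matching / len(objects)


def confidence(objects, body, head):
    """Share of the objects satisfying the body that also satisfy the head."""
    body_support = support(objects, body)
    if body_support == 0:
        return 0.0
    return support(objects, body | head) / body_support


def is_spatial(pattern, spatial_relations):
    """A pattern is spatial if it contains at least one spatial relation."""
    return any(atom in spatial_relations for atom in pattern)


# Hypothetical reference objects (large towns) and their atoms.
towns = {
    "t1": {"large_town", "intersects_road", "adjacent_to_water"},
    "t2": {"large_town", "intersects_road", "adjacent_to_water"},
    "t3": {"large_town", "intersects_road"},
    "t4": {"large_town", "adjacent_to_water"},
}
spatial_relations = {"intersects_road", "adjacent_to_water"}

body = {"large_town", "intersects_road"}
head = {"adjacent_to_water"}
s = support(towns, body | head)
c = confidence(towns, body, head)
spatial = is_spatial(body | head, spatial_relations)
```

In this toy data the rule holds for two of four towns (support 0.5), and two of the three towns that intersect a road are also adjacent to water (confidence 2/3).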

52 The problem
Given
- a spatial database (SDB) with a set of reference objects S
- some sets R_k, 1 ≤ k ≤ m, of task-relevant objects
- some spatial hierarchies H_k involving objects in R_k
- M granularity levels in the descriptions
- a set of granularity assignments ψ_k which associate each object in H_k with a granularity level
- a pair of thresholds minsup[l] and minconf[l] for each granularity level
- domain knowledge
Find strong multiple-level spatial association rules.

The solution
Solution (Appice et al., IDA Journal, 2003) based on an Inductive Logic Programming (ILP) approach
- spatial relations are easily handled
- a spatial pattern is a conjunction of first-order logic atoms
- θ-subsumption orders the space of spatial patterns
- monotonicity of support w.r.t. θ-subsumption: pruning of patterns at the same granularity level in the candidate generation phase
- monotonicity of pattern frequency w.r.t. granularity level: pruning of patterns at different granularity levels in the candidate generation phase
Implemented in SPADA (Spatial Pattern Discovery Algorithm)
European project SPIN (Spatial Mining for Data of Public Interest)

53 Extensions of initial solutions
- Efficiency improvement of pattern evaluation by caching support objects for each stored pattern
- Definition of a declarative bias to filter out rules on the basis of users' preferences (efficiency improvement is a byproduct)
  - In real-world applications a large number of spatial patterns can be generated even for a few hundred spatial objects
  - Most of the discovered patterns are useless for the application at hand
  - Urban accessibility application: only spatial patterns involving some sociological factor (household with no car) are interesting
- Integration of SPADA in the ARES system, which interfaces with a spatial DB (Oracle Spatial)

Mining Network Data

54 Networks
[Overview diagram; axes: Time complexity, Space complexity; data types shown: Points; Points, Lines, and Areas; Networks]

Points and Networks
- Requirements: point data, polygons, aggregations, spatial dependencies and relations, networks
- Examples: traffic frequency prediction
- Method: kNN

55 Case Study: Outdoor Advertising - Frequency Atlas
- Customer: Fachverband für Außenwerbung (FAW; German Outdoor Advertising Association)
- Task: performance value assessment of advertising media
- Traffic volume forecast, separate for private cars, public transport, pedestrians

Determining reach of a poster board
- Gesellschaft für Konsumforschung
- Frequency + Media factories = poster reach

56 The project in numbers
- Complete model for all German cities with more than 5. inhabitants (92 cities) = ca.. street segments!
- The complete model includes, for each segment:
  - car frequency
  - pedestrian frequency
  - public transport frequency
- The model is presently being extended to all cities with between. and 5. inhabitants

Basic Data: traffic measurements
- Manual traffic measurement at selected poster locations
  - 4 times 6 minutes at four days of the week at four times of day
- Additional empirical model of day totals
- Properties
  - Well-defined measurements
  - Extended measurement period, so concept drift cannot be excluded
- Total of 96. manual measurements

57 Secondary data
[Diagram labels: Street network; Sociodemographics + Socioeconomics; Points of Interest (POI); Frequency measurements; Public transport network; DATA MINING; Frequency classes]

How Spatial Autocorrelation helps
[Figure labels: Local Measurements; Inhomogeneous measurements on the same street]

58 Spatial kNN
Attributes of street segments:
- Name, type,. class
- Points of Interest
- Spatial coordinates
Locations with measurement values
Steps:
- Distance between two segments x_a, x_b: d(x_a, x_b), computed over the attributes x_{am}, x_{bm}, m = 1, ..., M
- Selection of the k closest segments x_1, ..., x_k
- Prediction for a new segment x_q:

$$ \hat{y}_q = \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)} $$

(The project actually used a specially adapted distance measure)

Spatial kNN - Properties
- kNN captures the autocorrelation inherent in the data well
- Allows background knowledge to be brought in by fine-tuning the distance function
- Database integrated (Oracle Spatial)
- Performs dynamic spatial queries (minimum distances among polygons)
Performance improvements
- Spatial queries use index structures (R-tree) but are still relatively costly (i.e. they dominate overall runtime)
- Partial evaluation of the distance function based on lower bounds for the distance, to minimize the number of spatial queries
- Can handle data sets that do not fit into main memory
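The prediction step of spatial kNN, a weighted average of the k nearest measured segments with weights that fall off with distance, can be sketched in a few lines of Python. The sketch assumes plain Euclidean distance over numeric attribute vectors and weights of 1/d; the project itself used a specially adapted distance measure evaluated inside the database, which is not reproduced here. Function names and the example data are hypothetical.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two attribute vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, measured, k=3, distance=euclidean):
    """Distance-weighted kNN prediction for one unmeasured segment.

    measured: list of (attribute_vector, measured_frequency) pairs
    """
    ranked = sorted(
        ((distance(query, x), y) for x, y in measured),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]
    # A segment at distance 0 would get infinite weight; return its value directly.
    for d, y in nearest:
        if d == 0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical segments described by (x, y) coordinates only.
measured = [((0.0, 0.0), 100.0), ((2.0, 0.0), 200.0), ((10.0, 10.0), 1000.0)]
prediction = knn_predict((1.0, 0.0), measured, k=2)
```

Passing the distance as a parameter mirrors the point made on the slide: background knowledge enters through the distance function, so swapping in a domain-specific measure does not change the prediction code.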

59 Smoothing based on flow constraints
- Measurement errors lead to inconsistencies
- Need a plausible assignment of frequencies
- Solution: use Kirchhoff's law as a constraint
  - Sum of inputs = sum of outputs
- The smoothing algorithm finds a locally optimal solution using constraint relaxation

Explaining frequencies
- Problem: the customer wants transparent values, not a black box => problem for spatial kNN
- Solution: fit an explanatory model to the predicted values
  - Makes it possible to understand why predictions are as they are
  - Makes it possible to identify potential outliers and areas of high uncertainty
- Use model trees
- Geographic space is encoded in x-y coordinates
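The flow constraint (sum of inputs = sum of outputs at every junction) can be illustrated with a very small relaxation scheme in Python. This is one simple possibility, chosen for illustration, and not necessarily the algorithm used in the project: each junction's imbalance is repeatedly spread evenly over its incident segments until the constraints hold. The function name, the data layout and the example network are assumptions, and the sketch does not prevent negative frequencies.

```python
def smooth_flows(flows, junctions, sweeps=100):
    """Relax measured frequencies until inflow equals outflow at each junction.

    flows:     {segment: measured frequency}
    junctions: list of (incoming_segments, outgoing_segments) pairs
    """
    flows = dict(flows)  # work on a copy, keep the measurements untouched
    for _ in range(sweeps):
        for incoming, outgoing in junctions:
            imbalance = sum(flows[s] for s in incoming) - sum(flows[s] for s in outgoing)
            # Spread the correction evenly over all segments meeting at the junction.
            share = imbalance / (len(incoming) + len(outgoing))
            for s in incoming:
                flows[s] -= share
            for s in outgoing:
                flows[s] += share
    return flows


# Hypothetical chain of three segments: e1 -> junction -> e2 -> junction -> e3.
measured = {"e1": 100.0, "e2": 130.0, "e3": 110.0}
junctions = [(["e1"], ["e2"]), (["e2"], ["e3"])]
smoothed = smooth_flows(measured, junctions)
```

In the chain example the three inconsistent measurements converge to a common value, their mean, which is the consistent assignment closest to what was measured.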

60 Numerical prediction with model trees
[Figure: model tree for ORTSTEIL = INNENSTADT (district = city centre). Split attributes: Straßenkategorie (street category: pedestrian zone, side street, main street), Bahnhof (railway station: no/yes), Distanz_zu_Bahnhof (distance to station), Anzahl_Restaurants (number of restaurants), X-Koordinate, Y-Koordinate. The leaves hold linear models LM1 to LM6, each predicting FREQUENZ (frequency) from terms such as X, ANZAHL_EINKAUF and MESSE]

Improving the model by spotting outliers based on model tree prediction
- Points with a large prediction error are checked
  - Visual inspection
  - Getting additional empirical input by taking new measurements
- Corrected values are the basis for the next round of model building, leading to improved results

61 Final Result: Frequency Map
[Map panels: Cars; Public Transport; Pedestrians]

Final result: frequency atlas (cars, public transport, pedestrians)
- ~ Million street segments predicted based on measurements
- Used for determining poster prices in Germany since 2006
- Rare instance of a spatial data mining problem that has become business critical

62 Spatial Model Trees [Malerba, Appice, Ceci 2005]
- Standard model trees (e.g. M5') can do spatial mining by splitting along x and y coordinates
- Mrs-Smoti (Malerba et al. 2004) is a variant of model trees that
  - allows regression nodes as interior nodes
  - handles autocorrelation directly: spatial regression model with dependencies in the response variables (spatially lagged response)
- It takes as input spatial objects, possibly belonging to separate thematic layers, stored in a spatial database S
  - target objects (main subject of analysis)
  - non-target objects (relevant for the task in hand)
- and outputs a spatial model tree T by
  - partitioning the training spatial data according to intra-layer and inter-layer relationships
  - associating different regression models to disjoint spatial areas
- Integrates spatial database queries (see Subgroup Discovery)
[Figure: example spatial model tree T with splitting nodes and regression nodes carrying linear models of the form Y = a + bX]

Mining Tracks in Space and Time

63 Tracks in Space and Time
[Overview diagram; axes: Time complexity, Space complexity; data types shown: Points; Points, Lines, and Areas; Networks; Tracks in Space and Time]

Tracks in space and time
- Requirements: point data, polygons, aggregations, networks, tracks, GPS/RFID/sensor measurements
- Applications: traffic prediction, mobility analysis
- Examples: sampling, event analysis, non-linear optimization

64 Mobility analysis based on GPS tracks
(Media Trend Journal, Nov. 2006)
- introduction of a new pricing model for poster sites based on GPS tracks
- registration of contact frequencies with poster sites
- contact extrapolation for target groups:
  - socio-demographic characteristics
  - residential areas

Time patterns
Patterns / Questions
- How long (days) does it take till x% of objects visit all locations?
- How long does it take till x% of objects visit at least one location twice?
Applications
- determine mobility of a group of people
- reach of poster networks
- find popularity of locations (theatres, supermarkets, hospitals)
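The first time-pattern question (how many days until x% of the objects have visited all locations) reduces to finding each object's completion day and then reading off a quantile. The Python sketch below assumes that tracks have already been matched to locations and are available as (object, day, location) records, and that only objects appearing in the records are counted; the function name and the data are hypothetical.

```python
import math


def days_until_share_visited_all(visits, locations, share):
    """First day by which at least `share` of all objects have visited every location.

    visits: list of (object_id, day, location) tuples
    share:  required fraction of objects, 0 < share <= 1
    Returns None if the share is never reached.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    needed = set(locations)
    seen = {obj: set() for obj, _, _ in visits}
    if not seen:
        return None
    completion_day = {}
    for obj, day, location in sorted(visits, key=lambda v: v[1]):
        if location in needed:
            seen[obj].add(location)
        if obj not in completion_day and seen[obj] == needed:
            completion_day[obj] = day
    required = math.ceil(share * len(seen))
    days = sorted(completion_day.values())
    return days[required - 1] if len(days) >= required else None


# Hypothetical records for four persons and two poster sites.
visits = [
    ("a", 1, "P1"), ("a", 2, "P2"),
    ("b", 1, "P1"), ("b", 3, "P2"),
    ("c", 1, "P1"),
    ("d", 5, "P1"), ("d", 5, "P2"),
]
half = days_until_share_visited_all(visits, ["P1", "P2"], 0.5)
```

Here half of the four persons have seen both sites by day 3, while the full group never does because person c visits only one site.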

65 Modelling tasks
- Modelling mobility for cities with GPS measurements for the overall population
- Predicting mobility for cities without measurements (hard task!)
- Extrapolating predictions in time

GeoPKDD - FET Project IST-014915
Geographic Privacy-aware Knowledge Discovery and Delivery
December 2005 - November 2008
Project Leader: Fosca Giannotti
General Project Idea
- extracting user-consumable forms of knowledge from large amounts of raw geographic data referenced in space and in time
- knowledge discovery and analysis methods for trajectories of moving objects, which change their position in time, and possibly also their shape or other significant features
- devising privacy-preserving methods for data mining from sources that typically contain personal sensitive data

66 The Consortium
ID | Acronym | Partner | Country
1 | KDDLAB | Knowledge Discovery and Delivery Laboratory, ISTI-CNR, Istituto di Scienza e Tecnologie dell'Informazione, Pisa; jointly with Univ. Pisa, Dept. of Computer Science | I
2 | LUC | Univ. Limburg, Theoretical Computer Science Group | B
3 | EPFL | EPFL, Lab. DB, Lausanne | CH
4 | FAIS | Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin | D
5 | WUR | Wageningen UR, Centre for GeoInformation | NL
6 | CTI | Research Academic Computer Technology Institute, Research and Development Division; jointly with Univ. Piraeus, Dept. of Informatics | GR
7 | UNISAB | Sabanci University, Faculty of Engineering and Natural Sciences | TK
8 | WIND | WIND Telecomunicazioni SpA, Direzione Reti Wind Progetti Finanziati & Technology Scouting | I
[Slide footer: Michael May]

Geographic Privacy-aware Knowledge Discovery Process
[Process diagram labels: Telecommunication company (WIND); trajectory reconstruction; Trajectories warehouse; Privacy-aware Data mining; ST patterns warehouse; interpretation, visualization; GeoKnowledge; Privacy enforcement; Public administration or business companies. Example applications: Aggregative Location-based services; Bandwidth/Power optimization; Mobile cells planning; Traffic Management; Accessibility of services; Mobility evolution; Urban planning]

67 GeoPKDD Specific Goals
- models for moving objects, and data warehouse methods to store their trajectories
- knowledge discovery and analysis methods for moving objects and trajectories
- techniques to make such methods privacy-preserving
- techniques for reasoning on spatio-temporal knowledge and on background knowledge
- techniques for delivering the extracted knowledge within the geographic framework

From Traces to Trajectories: the Source Data
(Source: Pedreschi & Giannotti, 2005)
[Figure: GSM network]
- Streams of log data of mobile phones, e.g. cells in the GSM/UMTS network
  - Entering the cell, e.g. (UserID, time, IDcell, in)
  - Exiting the cell, e.g. (UserID, time, IDcell, out)
  - Movements inside the cell? E.g. (UserID, time, X, Y, IDcell)
- Real trajectories are continuous functions
- Logs are a discrete sampling of real trajectories, dependent on the wireless network technology
  - irregular granularity in time and space
  - possible imperfection/imprecision
- An approximate reconstruction of the real trajectory from its log traces is needed

68 Movement patterns
(Source: Pedreschi & Giannotti, 2005)
- Clustering: group together similar trajectories; for each group produce a summary
- Frequent patterns: discover frequently followed (sub)paths
- Classification: extract behaviour rules from history; use them to predict the behaviour of future users

Why emphasis on privacy?
- The more and better data are gathered, the greater the vulnerability from correlation
- On the other hand, more and new data bring new opportunities
- Need to maintain privacy without giving up opportunities
- Need to obtain social acceptance through demonstrably trustworthy solutions
Privacy in GeoPKDD...
- is a technical issue, besides an ethical, social and legal one, in the specific context of ST (spatio-temporal) data
- How to formalize privacy constraints over ST data and ST patterns?
  - E.g., anonymity threshold on clusters of individual trajectories
- How to design DM algorithms that, by construction, only yield patterns that meet the privacy constraints?

69 Challenges

Causal Inference from Statistical Spatio-Temporal Data
- Current project at IAIS for a newspaper publisher: sales prediction for individual shops
- What happens if a shop closes or is sold out? Predict to which alternative shop customers go.
- Components:
  - Spatio-temporal clustering of shops
  - Time series prediction
  - Modeling customer behavior
  - Causal inference about customer behavior
- Example: if shop A closes, n% of A's customers go to B, m% to C

70 Sales data per day per shop are available for several years
- Use similarity of time series over some period for determining anomalies in behavior

[Map legend: Closed shop; Alternative shops; Other shops; link strength: strong, weak]
- Use spatial structure to infer potential alternative shops
- People went from A to B when A is closed and B shows an anomaly in behavior that cannot be explained otherwise

71 [Map legend: Closed shop; Alternative shops; Other shops; link strength: strong, weak]
- Diagrams such as this one can be generated automatically for historic cases
- Challenge: based on historic examples, come up with a predictive model

Ubiquitous Knowledge Discovery
- Ubiquitous Knowledge Discovery (embedded data mining; mobile and/or distributed microprocessors)
- Grid Mining (distributed architecture, grid computing)
- Knowledge discovery in mobile systems (robots, RFID, GPS, mobile phones, cars, ...)
- Static and dynamic sensor networks (Reality Mining)
- Privacy-preserving data mining
KDUbiq Coordination Action (EU, 2005-2008)

72 Ubiquitous Knowledge Discovery
Characteristics of ubiquitous knowledge discovery systems
- objects are distributed in time and space
- dynamic infrastructure (moving objects that appear and disappear)
- analysis takes place in real time; models evolve incrementally
- objects have access to local information only and never see the global picture: only knowledge of the local spatial environment
- typically, objects exchange information with other objects
Spatial Data Mining is a key issue here!
KDUbiq reflects the future research challenges involved in this area

Summary
- Spatial data form a rich environment for analysis
- Feature extraction and construction (spatial queries & functions, Voronoi, ...) play a very important role
- Efficiency is often a big concern
- A variety of approaches to Spatial Data Mining exist, coming from statistics, databases and machine learning
- We have seen examples of density-based clustering, kriging, subgroup discovery, association rules, model trees, kNN and survival analysis
- Methods differ in the data types they can handle
- Real-world applications are feasible today
- Many more challenges in the future due to ubiquitous environments!

73 Literature (1)
- Andrienko, N., Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach. Springer, 2005.
- Appice, A., Ceci, M., Lanza, A., Lisi, F.A., Malerba, D. (2003): Discovery of Spatial Association Rules in Georeferenced Census Data: A Relational Mining Approach. Intelligent Data Analysis, 7(6).
- Burrough, P., McDonnell, R.: Principles of Geographical Information Systems. OUP, 1998.
- Cressie, N. (1993): Statistics for Spatial Data. Wiley.
- Egenhofer, M.: Reasoning about binary topological relations. In Gunther, O. and Schek, H.-J. (eds), Second Symposium on Large Spatial Databases, volume 525 of LNCS. Springer, 1991.
- Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996): A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR.
- Giannotti, F., Nanni, M., Pedreschi, D.: Efficient Mining of Temporally Annotated Sequences. SDM 2006.
- Goodchild, M.F. (1986): Spatial Autocorrelation. CATMOG 47, Geobooks, Norwich, UK.
- Han, J., Stefanovic, N., Koperski, K.: Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
- Klösgen, W. (1996): Explora: A multipattern and multistrategy discovery assistant. In Fayyad et al. (eds), Advances in Knowledge Discovery and Data Mining. MIT Press.
- Klösgen, W., May, M.: Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database. PKDD 2002.
- Klösgen, W., May, M., Petch, J. (2003): Mining census data for spatial effects on mortality. Intelligent Data Analysis, 7(6).

Literature (2)
- Koperski, K., Han, J. (1995): Discovery of Spatial Association Rules in Geographic Information Databases. Proc. 4th Int. Symp. Advances in Spatial Databases, SSD.
- Koperski, K., Adhikary, J., Han, J.: Spatial Data Mining: Progress and Challenges. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, June 1996.
- Lawson, A.B., Denison, D. (eds) (2002): Spatial Cluster Modelling. Chapman & Hall/CRC, London.
- Lisi, F.A., Malerba, D. (2004): Inducing Multi-Level Association Rules from Multiple Relations. Machine Learning, 55:175-210.
- Longley, P., Goodchild, M., Maguire, D., Rhind, D. (2001): Geographic Information Systems and Science. Wiley.
- Malerba, D., Appice, A., Ceci, M. (2005): Mining Model Trees from Spatial Data. PKDD 2005, LNCS.
- May, M., Ragia, L. (2002): Spatial Subgroup Discovery Applied to the Analysis of Vegetation Data. PAKM 2002, LNCS 2569.
- May, M., Savinov, A.: SPIN! - An Enterprise Architecture for Spatial Data Mining. Knowledge-Based Intelligent Information and Engineering Systems, LNCS 2773, 2003.
- Openshaw, S., Craft, A. (1991): Using geographical analysis machines to search for evidence of cluster and clustering in childhood leukaemia and non-Hodgkin lymphomas in Britain. In G. Draper (ed), The Geographical Epidemiology of Childhood Leukaemia and Non-Hodgkin Lymphomas in Great Britain, Studies in Medical and Population Subjects No 53, OPCS, London, HMSO.
- Ripley, B. (1988): Statistical Inference for Spatial Processes. CUP.
- Sander, J., Ester, M., Kriegel, H.-P., Xu, X. (1998): Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169-194.
- Wrobel, S.: An Algorithm for Multi-relational Discovery of Subgroups. PKDD 1997.

74 Thanks!
Contact: Fraunhofer IAIS, Knowledge Discovery, Schloss Birlinghoven, Sankt Augustin