Geointelligence: New Opportunities and Research Challenges in Spatial Mining and Business Intelligence
Stefan Wrobel
Christine Körner, Michael May, Hans Voss
Fraunhofer Society
- Named after Joseph von Fraunhofer, German physicist and entrepreneur
- Fraunhofer mission:
  - do state-of-the-art research and use it in challenging customer projects
  - funding is 33% research grants, 33% customer projects, 33% institutional funding
- 57 institutes, 40 locations, 12,000 employees, 1 billion annual volume
- Best-known invention: MP3
Fraunhofer IAIS: Intelligent Analysis and Information Systems
- From sensor data to business intelligence, from media analysis to visual information systems: our research allows companies to do more with data
- New name, long-standing experience
  - Founded in 2006 as a merger of the Fraunhofer institutes AIS and IMK
- 230 people: scientists, project engineers, technical and administrative staff
- Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
- Joint research groups and cooperation with Univ. Bonn
Fraunhofer IAIS: research and projects
Core research areas:
- Machine learning and adaptive systems
- Data mining and business intelligence
- Automated media analysis
- Interactive access and exploration
- Autonomous systems
Outline
- Introduction to spatial data mining
  - Project example: Geomarketing
- The spatial data mining process
  - Project example: Outdoor media reach estimation
- The importance of data
  - Project example: Customer selection in the gas industry
- Spatial mining tools and visual analytics
  - CommonGIS and SPIN!
- Research challenge: track data
  - Project example: SPR
  - GeoPKDD
Why Spatial Data Mining now?
- Almost all data are (or can be) spatially referenced
- Almost all database and business intelligence systems handle spatial data
- New data sources push the topic
  - Satellite data (GPS, Galileo)
  - Toll collection data
  - Mobile phone data
  - RFID
  - GoogleEarth etc.
- Spatial data mining combines statistics, machine learning, databases, and visualization with spatial data
A classic example of spatial analysis
- Disease cluster: Dr. John Snow
- Deaths in the cholera epidemic, London, September 1854
- Infected water pump?
Goals of Spatial Data Mining
- Identifying spatial patterns
- Identifying spatial objects that are potential generators of patterns
- Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)
- Presenting the information in a way that is intuitive and supports further analysis
Spatial Data
- point objects
  - located at x, y, (z) coordinates
- area objects
  - suitable area description (circle, polygon, path boundary)
- fields
  - quantity assumed continuously defined in 2D or 3D (+ time!)
  - e.g. temperature
Example: without spatial attributes
Example: with spatial attributes
Handling spatial data
- treat as ordinary variables
  - no special algorithms needed
  - spatial properties ignored, e.g. discontiguous areas
- make spatial relationships explicit
  - e.g. infer topological relationships
  - expensive, but allows normal algorithms to be used
- specialized algorithms
  - neighborhood methods, kriging, Gaussian processes, density-based clustering
- Use a proper combination of data, preprocessing, algorithms, and interaction software!
Project example: Outdoor Advertising Reach - Frequency Atlas
- Customer: Fachverband für Außenwerbung (FAW; Outdoor Advertising Association)
- Task: performance value assessment of advertising media
- Traffic volume forecast, separate for private cars, public transport, and pedestrians
- Spatial data mining, active learning procedures
Determining the reach of a poster board
- Gesellschaft für Konsumforschung (GfK)
- Frequency + media factors = poster reach
The project in numbers
- Complete model for all German cities with more than 50,000 inhabitants (192 cities) = ca. 1,000,000 street segments!
- Complete model includes, for each segment:
  - car frequency
  - pedestrian frequency
  - public transport frequency
- The model is presently being extended to all cities with between 20,000 and 50,000 inhabitants
Basic data: traffic measurements
- Manual traffic measurement at selected poster locations
  - 4 times 6 minutes, on four days of the week, at four times of day
- Additional empirical model of day totals
- Properties
  - Well-defined measurements
  - Distribution of measurements tries to avoid systematic bias
  - Extended measurement period, so concept drift cannot be excluded
- Total of 96,000 manual measurements
Secondary data
- Street network
- Sociodemographics + socioeconomics
- Points of interest (POI)
- Frequency measurements
- Public transport network
[Diagram: the data sources above feed into DATA MINING; the map legend shows frequency classes 0, 200, 400, 600, 800, 1000, 1250, 1500, 1750, 2000, ...]
Smoothing based on flow constraints
- Measurement errors lead to inconsistencies
- Need a plausible assignment of frequencies
- Solution: use Kirchhoff's law as a constraint
  - Sum of inputs = sum of outputs
- Smoothing algorithm finds a locally optimal solution using constraint relaxation
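The flow constraint can be illustrated with a small sketch. It is not the project's smoothing algorithm, only one simple relaxation scheme under stated assumptions: streets are directed edges, only junctions with both inflow and outflow are constrained, and each junction's imbalance is spread equally over its incident edges until inflow equals outflow everywhere. The network, the numbers, and the names `smooth_flows` and `node_imbalance` are hypothetical.

```python
from collections import defaultdict

def node_imbalance(flow, node):
    """Inflow minus outflow at `node` for a dict {(from, to): frequency}."""
    inflow = sum(f for (u, v), f in flow.items() if v == node)
    outflow = sum(f for (u, v), f in flow.items() if u == node)
    return inflow - outflow

def smooth_flows(measured, tol=1e-9, max_sweeps=10_000):
    """Adjust measured frequencies until every interior junction is balanced."""
    flow = dict(measured)
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for u, v in flow:
        outgoing[u].append((u, v))
        incoming[v].append((u, v))
    # Pure sources and sinks (network boundary) are left unconstrained.
    interior = [n for n in incoming if n in outgoing]
    for _ in range(max_sweeps):
        worst = 0.0
        for n in interior:
            gap = node_imbalance(flow, n)
            worst = max(worst, abs(gap))
            # Spread the gap equally: lower inflows, raise outflows.
            share = gap / (len(incoming[n]) + len(outgoing[n]))
            for edge in incoming[n]:
                flow[edge] -= share
            for edge in outgoing[n]:
                flow[edge] += share
        if worst < tol:
            break
    return flow

# Hypothetical measurements: junction B receives 100 but emits 110.
measured = {("A", "B"): 100.0, ("B", "C"): 60.0, ("B", "D"): 50.0,
            ("C", "E"): 55.0, ("D", "E"): 50.0}
smoothed = smooth_flows(measured)
```

Each junction update fixes that junction exactly but may disturb its neighbours, which is why the sweeps are repeated until the largest remaining imbalance falls below `tol`.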
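Numerical prediction with model trees
[Figure: model tree for ORTSTEIL = INNENSTADT (LR) (district = city center). Split attributes shown: Fussgängerzone (pedestrian zone: Nein/Ja), Straßenkategorie (street category: Nebenstr./Hauptstr.), Bahnhof (railway station: Nein/Ja), Distanz_zu_Bahnhof (distance to station: <= 150 / > 150), Anzahl_Restaurants (number of restaurants: <= 5 / > 5 and <= 15 / > 15), X-Koordinate (<= 52.385 / > 52.385), Y-Koordinate (<= 9.6 / > 9.6). The leaves hold linear models LM1 to LM6.]
Example leaf model LM1:
FREQUENZ = 2277.3186 * X + 75.4087 * ANZAHL_EINKAUF - 142.4217 * MESSE - 21221.8497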
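How a model tree produces a number can be sketched as follows: a segment is routed through the splits to a leaf, and the leaf's linear model is evaluated. Only the LM1 coefficients come from the slide. The tree structure, the second leaf, and all function and key names are hypothetical, since the real tree cannot be recovered from the figure.

```python
def lm1(x, anzahl_einkauf, messe):
    """Leaf model LM1 with the coefficients shown on the slide."""
    return (2277.3186 * x + 75.4087 * anzahl_einkauf
            - 142.4217 * messe - 21221.8497)

def lm_placeholder(x, anzahl_einkauf, messe):
    """Stand-in for another leaf; these coefficients are invented."""
    return 500.0 + 50.0 * anzahl_einkauf

# A split node is (attribute, threshold, subtree_if_<=, subtree_if_>);
# a leaf is a function. This two-leaf tree is an assumption for illustration.
tree = ("fussgaengerzone", 0, lm_placeholder, lm1)

def predict(node, segment):
    """Walk down the splits, then evaluate the linear model at the leaf."""
    while not callable(node):
        attribute, threshold, low, high = node
        node = low if segment[attribute] <= threshold else high
    return node(segment["x"], segment["anzahl_einkauf"], segment["messe"])

in_zone = {"fussgaengerzone": 1, "x": 9.7, "anzahl_einkauf": 3, "messe": 0}
outside = {"fussgaengerzone": 0, "x": 9.7, "anzahl_einkauf": 3, "messe": 0}
frequency_in_zone = predict(tree, in_zone)
frequency_outside = predict(tree, outside)
```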
Final result: frequency atlas (cars, public transport, pedestrians)
- ~1 million street segments predicted based on 96,000 measurements
- Accuracy increased twofold
Project example: New customer acquisition for gas supplier
- Given
  - nationwide address data with consumer and group data
  - response data from the original calling campaign
- To be determined
  - nationwide addresses with a high probability of customer interest in a sales representative visit
[Diagram: regional table (regional address, interest in visit: known, e.g. "yes") next to nationwide table (nationwide address, consumer attributes, group attributes, interest in visit: "???")]
Project example: New customer acquisition for gas supplier
1. Use addresses to transfer consumer and group attributes to the regional sample
2. Construct a model for interest in visits based on the enhanced regional sample
3. Apply the model to the nationwide address data
[Diagram: regional table (regional address, interest in visit, consumer attributes, group attributes) and nationwide table (nationwide address, consumer attributes, group attributes, predicted interest in visit, e.g. 0.8%)]
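The three steps can be sketched in a few lines. This is a minimal illustration, not the project's model: the "model" here is just the observed response rate per attribute value, and the attribute `heating`, the addresses, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical nationwide data: address -> consumer/group attributes.
nationwide = {
    "a1": {"heating": "oil"}, "a2": {"heating": "oil"},
    "a3": {"heating": "gas"}, "a4": {"heating": "oil"},
    "a5": {"heating": "gas"},
}
# Hypothetical regional campaign responses (1 = interested in a visit).
regional = [
    {"address": "a1", "interested": 1},
    {"address": "a2", "interested": 0},
    {"address": "a3", "interested": 0},
]

def enrich(sample, attributes):
    """Step 1: transfer attributes to the regional sample via the address."""
    return [dict(attributes[r["address"]], **r)
            for r in sample if r["address"] in attributes]

def fit_rates(rows, feature):
    """Step 2: stand-in model, the response rate per feature value."""
    yes, total = defaultdict(int), defaultdict(int)
    for row in rows:
        total[row[feature]] += 1
        yes[row[feature]] += row["interested"]
    return {value: yes[value] / total[value] for value in total}

def score(attributes, rates, feature, default=0.0):
    """Step 3: apply the model to every nationwide address."""
    return {address: rates.get(attrs[feature], default)
            for address, attrs in attributes.items()}

rates = fit_rates(enrich(regional, nationwide), "heating")
scores = score(nationwide, rates, "heating")
```

In practice step 2 would be a proper classifier; the join in step 1 and the nationwide scoring in step 3 keep the same shape.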
Aggregation level of available consumer data
- 16 federal states
- 41 districts
- 441 counties
- ca. 8,300 zip codes
- ca. 13,900 cities
- ca. 12,300 statistical districts
- ca. 40,000 market clusters
- ca. 80,000 voting districts
- ca. 85,000 market cells
- ca. 1.5 million street segments
- ca. 20 million households (household data)
[Diagram axis labels: distribution, aggregation]
Interactive Exploratory Analysis: CommonGIS and SPIN!
- Choropleth maps showing the distribution of variable(s) in space
- Combining spatial and non-spatial displays (parallel coordinate plot, scatter plot)
- Variables selected and manipulated by the user
- Powerful for low-dimensional dependencies (3-4)
- Displays dynamically linked
Representation of spatial data in the database (Oracle) [Klösgen & May 02]
- A set of relations R_1, ..., R_n, such that
  - any relation R_i possesses a geometry attribute G_i
  - or an identifier A_i which allows joining R_i with another relation R_k, which in turn possesses a geometry attribute
- Geometry attributes G_i consist of sets of x,y-pairs, which define points, lines or polygons
- Different kinds of spatial objects are stored in different relations R_i (geographic layers), e.g. streets, rivers, districts, buildings
- Every layer has a single geometry attribute and its own proper set of attributes A_1, ..., A_n
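The relational layout can be mimicked in plain Python to make the two cases concrete: a layer that carries its own geometry, and a relation that reaches a geometry only through a join identifier. The layer names, attributes, and the helper `with_geometry` are hypothetical; the actual system stores these relations in Oracle.

```python
# Layer with its own geometry attribute: polygons as lists of (x, y) pairs.
districts = {
    "D1": {"name": "Nord", "geometry": [(0, 0), (4, 0), (4, 3), (0, 3)]},
    "D2": {"name": "Sued", "geometry": [(0, 3), (4, 3), (4, 6), (0, 6)]},
}

# Relation without geometry; `district_id` is the identifier that
# allows joining it with the `districts` layer.
households = [
    {"household": "H1", "income": 42000, "district_id": "D1"},
    {"household": "H2", "income": 31000, "district_id": "D2"},
    {"household": "H3", "income": 55000, "district_id": "D9"},  # no match
]

def with_geometry(rows, layer, key):
    """Join rows to a layer and attach that layer's geometry attribute."""
    return [dict(row, geometry=layer[row[key]]["geometry"])
            for row in rows if row[key] in layer]

located = with_geometry(households, districts, "district_id")
```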
Division of labor between Oracle RDBMS and search manager [Klösgen & May 02]
- Database server: answers mining queries
- Mining server: runs the search algorithm
  - search in hypothesis space
  - generation and evaluation of hypotheses (subgroup patterns)
- Database integration: efficiently organize mining queries
- A mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses
[Diagram: the mining server's search algorithm sends a mining query to the database server and receives sufficient statistics]
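A rough sketch of this division of labor, using Python's built-in sqlite3 in place of Oracle: one GROUP BY query returns counts per attribute combination, and the mining side scores several subgroup hypotheses from those counts without touching the raw rows again. The table, its columns, and the quality function (weighted relative accuracy, a common subgroup quality measure) are assumptions, not necessarily what SPIN! uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE district (urban INTEGER, coastal INTEGER, target INTEGER)")
conn.executemany(
    "INSERT INTO district VALUES (?, ?, ?)",
    [(1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0),
     (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1)])

def sufficient_statistics(connection):
    """Database side: one mining query, aggregated counts per cell."""
    return connection.execute(
        "SELECT urban, coastal, COUNT(*), SUM(target) "
        "FROM district GROUP BY urban, coastal").fetchall()

def wracc(stats, condition):
    """Mining side: weighted relative accuracy of one subgroup hypothesis."""
    total = sum(n for _, _, n, _ in stats)
    positives = sum(t for _, _, _, t in stats)
    size = sum(n for u, c, n, _ in stats if condition(u, c))
    hits = sum(t for u, c, _, t in stats if condition(u, c))
    if size == 0:
        return 0.0
    return (size / total) * (hits / size - positives / total)

stats = sufficient_statistics(conn)
# Several hypotheses evaluated from the same aggregated counts.
quality_urban = wracc(stats, lambda urban, coastal: urban == 1)
quality_coastal = wracc(stats, lambda urban, coastal: coastal == 1)
quality_urban_inland = wracc(stats, lambda u, c: u == 1 and c == 0)
```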
Mobility analysis based on GPS tracks
- introduction of a new pricing model for poster sites based on GPS tracks
- registration of contact frequencies with poster sites
- contact extrapolation for target groups:
  - socio-demographic characteristics
  - residential areas
Time patterns
- Patterns / Questions
  - How long (in days) does it take until x% of objects have visited all locations?
  - How long does it take until x% of objects have visited at least one location twice?
- Applications
  - determine the mobility of a group of people
  - reach of poster networks
  - find the popularity of locations (theatres, supermarkets, hospitals)
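The first question can be answered directly from visit logs. The sketch below assumes each object's track is already reduced to (day, location) visits; the data and the function names are hypothetical.

```python
import math

def day_all_visited(visits, locations):
    """First day on which an object has seen every location, else None."""
    wanted, seen = set(locations), set()
    for day, location in sorted(visits):
        seen.add(location)
        if seen >= wanted:
            return day
    return None

def days_until_share(tracks, locations, share):
    """Smallest day by which at least `share` of all objects have visited
    all locations; None if that share is never reached."""
    days = sorted(
        d for d in (day_all_visited(v, locations) for v in tracks.values())
        if d is not None)
    needed = math.ceil(share * len(tracks))
    if needed < 1 or needed > len(days):
        return None
    return days[needed - 1]

# Hypothetical visit logs: object -> list of (day, location).
tracks = {
    "p1": [(1, "A"), (2, "B")],
    "p2": [(1, "A"), (5, "B")],
    "p3": [(3, "A")],
    "p4": [(1, "B"), (1, "A")],
}
half = days_until_share(tracks, ["A", "B"], 0.5)
three_quarters = days_until_share(tracks, ["A", "B"], 0.75)
everyone = days_until_share(tracks, ["A", "B"], 1.0)
```

Objects that never complete the set (like `p3`) count toward the population but never toward the share, which is why the 100% question has no answer here.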
Challenges of track data
- Goals:
  - investigate the relationship between spatio-temporal data and frequency measurements
  - improve prediction performance with active learning
- Data:
  - tracks of mobile phones and/or GPS devices
  - street map
  - possibly frequency measurements
- Tasks:
  - track-to-street mapping
  - prediction of traffic frequencies (regression)
Track-to-street mapping
- Mapping of tracks from cell level to street level
- Many possibilities
Track-to-street mapping
- Mapping of tracks from cell level to street level
- Suppose we have prior knowledge about traffic frequencies, i.e. we know the highly frequented streets
- Then some routes become more likely
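One way to make "some routes become more likely" concrete is to weight each candidate route by the known frequencies of its street segments. The sketch below uses the geometric mean of segment frequencies as the weight so that route length does not dominate. This weighting is an assumption for illustration, not the method of the cited thesis, and it requires strictly positive frequencies; the routes and numbers are hypothetical.

```python
import math

def route_probabilities(routes, prior_frequency):
    """Normalize geometric-mean segment frequencies into route probabilities."""
    weights = {}
    for name, segments in routes.items():
        product = math.prod(prior_frequency[s] for s in segments)
        weights[name] = product ** (1.0 / len(segments))
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Two candidate routes through the same sequence of cells.
routes = {"main": ["s1", "s2"], "side": ["s3", "s4"]}
prior_frequency = {"s1": 900.0, "s2": 400.0, "s3": 100.0, "s4": 100.0}
probabilities = route_probabilities(routes, prior_frequency)
```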
Frequency Prediction with Track Data
- Steps in extrapolation:
  - count the number of intersections of streets and tracks within a certain timeframe (e.g. one week)
  - extrapolate from sample to population
- Problems:
  - sample is not representative (biased), e.g. more young people than older people have mobile phones, and old and young people show different traffic behavior
  - mobile data is sensitive, possibly only opt-in customers
  - streets with 0-frequency
  - large gaps within tracks
  - censored data (people drop out of the survey before the end)
  - noise
[Figure: tracks plotted over X-Coordinate and Y-Coordinate]
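The two extrapolation steps can be sketched as counting and scaling. To address the bias problem, the sketch scales each demographic group separately by its population/sample ratio; that per-group correction is an assumption added here, and the groups, sizes, and function names are hypothetical.

```python
from collections import Counter

def street_counts(passages, first_day, last_day):
    """Step 1: count track/street intersections per (group, street)
    within the timeframe. `passages` holds (group, day, street) tuples."""
    return Counter((group, street) for group, day, street in passages
                   if first_day <= day <= last_day)

def extrapolate(counts, sample_size, population_size):
    """Step 2: scale each group's count to the population and sum per street."""
    totals = Counter()
    for (group, street), n in counts.items():
        totals[street] += n * population_size[group] / sample_size[group]
    return dict(totals)

# Hypothetical sample: five passages by young people, two by older people.
passages = [("young", day, "s1") for day in range(1, 6)]
passages += [("old", 2, "s1"), ("old", 9, "s1")]

counts = street_counts(passages, first_day=1, last_day=7)
weekly = extrapolate(counts,
                     sample_size={"young": 100, "old": 50},
                     population_size={"young": 1000, "old": 2000})
```

Streets that no sampled track crosses do not appear in the result at all, which is the "streets with 0-frequency" problem listed above.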
Probabilistic active track-to-street mapping [PhD thesis Körner]
Tasks:
1. track-to-street mapping
2. extrapolation of traffic frequencies
3. improvement of (online) sampling by active learning
[Figure panels: 1. mapping, 2. extrapolation, 3. active learning]
Integration of Spatial Background Knowledge
- Aggregation of attributes within a buffer around a given location
- Spatially defined buffer: places within a radius of 200 m
  - e.g. 4 restaurants within 200 m of X
- Temporally defined buffer (driving zones): what places can be reached on foot / by car within the next 20 minutes
  - e.g. 2 hospitals reachable within 12 min
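The spatially defined buffer reduces to a distance test. A minimal sketch, assuming coordinates are already projected to metres so that Euclidean distance is meaningful; the place list and the function name are hypothetical. The temporally defined buffer would need a street network with travel times and is not covered here.

```python
import math

def count_in_buffer(center, places, radius_m=200.0, kind=None):
    """Count places (optionally of one kind) within radius_m of center."""
    cx, cy = center
    return sum(
        1 for place in places
        if (kind is None or place["kind"] == kind)
        and math.hypot(place["x"] - cx, place["y"] - cy) <= radius_m)

# Hypothetical points of interest in projected coordinates (metres).
places = [
    {"kind": "restaurant", "x": 50.0, "y": 0.0},
    {"kind": "restaurant", "x": 0.0, "y": 150.0},
    {"kind": "restaurant", "x": 250.0, "y": 0.0},
    {"kind": "hospital", "x": 100.0, "y": 100.0},
]
restaurants_nearby = count_in_buffer((0.0, 0.0), places, kind="restaurant")
all_nearby = count_in_buffer((0.0, 0.0), places)
```

The resulting counts can be attached to a street segment as ordinary attributes, so a standard learner can use them.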
Research Questions
1. Can track data be used for frequency prediction? What problems arise?
2. How can track data and frequency measurements benefit from each other?
   - improvement of track-to-street mapping with frequency data
   - enhancement of frequency prediction using tracks
3. How to incorporate active learning to improve the data model?
   - How to select places for additional traffic measurements?
   - How to select persons for track monitoring?
Summary
- New data sources make spatial mining a very promising topic
- Spatial data mining is a process consisting of data, preprocessing, algorithms and visualization
  - Project examples: Geomarketing, outdoor media frequencies
- Selection of the right data is crucial
  - Project example: gas industry
- Spatial data mining is inherently visual
  - Tools such as CommonGIS
- Research challenge: track data
  - Project example: SPR GPS tracks, GeoPKDD
- And we are hiring!