Mobility Data Mining and Analytics

Mobility Data Mining and Analytics. 2nd Datasim Summer School, 14th July 2014. S. Rinzivillo, KDD Lab, ISTI-CNR, Pisa, Italy

BIG DATA availability: what we buy, whom we interact with, what we search for, where we go

Country-wide mobile phone data

Social Network Analysis (Analisi di Reti Sociali), April-May 2011. July 13, 2014

World Cup 2014. "Football is a simple game: 22 men chase a ball for 90 minutes and at the end, the Germans always win" -- Gary Lineker (after Italy 1990 Final) @bigdatatales http://bigdatatales.com

Urban Mobility Complexity: vehicles

Urban Mobility Complexity: phones

Crash Course on MDM: how can we manage the complexity arising from huge amounts of data?

4-stage mobility data mining (layers from top to bottom): semantics; derived models; basic trajectory patterns and models; raw trajectory data

Trajectory data
- Mobility of an object is described by a set of trips
- Each trip is a trajectory, i.e. a sequence of time-stamped locations: (x_1, y_1, t_1), (x_2, y_2, t_2), (x_3, y_3, t_3), (x_4, y_4, t_4), (x_5, y_5, t_5)

Basic mobility patterns and models
- T-Cluster: represents a group of similar trajectories
- T-Pattern: represents trajectory segments that visit the same sequence of regions with similar transition times
- T-Flock: represents trajectory segments that move together for a time interval

Basic mobility patterns & models: T-clustering
- Trajectories are grouped based on similarity
- Several possible notions of similarity:
  - Start/End points
  - Shape of trajectory
  - Shape & time
  - Etc.
Nanni, Pedreschi. Time-focused clustering of trajectories of moving objects. J. of Intelligent Information Systems, 2006.
Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko. Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008.

Density-Based Clustering (figure panels: K-means, density-based)

Average Euclidean Distance, Synchronized
- Align points temporally
- Optionally assign penalties to non-matching points
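The synchronized distance above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes trajectories are lists of (x, y, t) tuples sorted by time, aligns them by linear interpolation at the union of their timestamps, and uses a hypothetical `penalty` parameter for timestamps covered by only one trajectory.

```python
from bisect import bisect_left
from math import hypot

def position_at(traj, t):
    """Linearly interpolate the (x, y) position of traj at time t.

    traj is a list of (x, y, t) tuples sorted by t. Returns None when t
    falls outside the time span covered by the trajectory.
    """
    times = [p[2] for p in traj]
    if not traj or t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return traj[i][0], traj[i][1]
    (x0, y0, t0), (x1, y1, t1) = traj[i - 1], traj[i]
    w = (t - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)

def synchronized_distance(a, b, penalty=None):
    """Average Euclidean distance between temporally aligned points.

    Timestamps where only one trajectory is defined are skipped, or
    contribute `penalty` when one is given (an assumption of this sketch).
    """
    times = sorted({p[2] for p in a} | {p[2] for p in b})
    total, count = 0.0, 0
    for t in times:
        pa, pb = position_at(a, t), position_at(b, t)
        if pa is not None and pb is not None:
            total += hypot(pa[0] - pb[0], pa[1] - pb[1])
            count += 1
        elif penalty is not None:
            total += penalty
            count += 1
    return total / count if count else float("inf")
```

Two trajectories that never overlap in time get an infinite distance unless a penalty is supplied, which keeps them out of the same density-based cluster.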

Common Destination
- Select the last point P_last of each trajectory
- D(T, T') = Euclidean(P_last, P'_last)

Common Origins
- Select the first point P_first of each trajectory
- D(T, T') = Euclidean(P_first, P'_first)
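Both endpoint-based distances reduce to a single Euclidean distance between two points. A minimal Python sketch, assuming trajectories are non-empty lists of (x, y, t) tuples in planar coordinates (the function names are illustrative):

```python
from math import hypot

def common_destination(a, b):
    """Euclidean distance between the last points of two trajectories."""
    (xa, ya, _), (xb, yb, _) = a[-1], b[-1]
    return hypot(xa - xb, ya - yb)

def common_origin(a, b):
    """Euclidean distance between the first points of two trajectories."""
    (xa, ya, _), (xb, yb, _) = a[0], b[0]
    return hypot(xa - xb, ya - yb)
```

Because only one point per trajectory is inspected, these functions are cheap enough to serve as the first, coarse step of a progressive clustering.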

Route Similarity
- Alignment of points, multiple matches
- Average Euclidean Distance
- Penalties for non-matching initial points (no penalties for destinations)

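Process Overview (figure): the dataset is first clustered with a simple and very efficient distance measure, yielding clusters and noise; more selective and particular distance functions (or more restrictive parameters) are then applied to the clusters, yielding subclusters and further noise; the result is knowledge.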

Basic mobility patterns & models: T-pattern
- T-Pattern example: Area A -> (Δt = 5 minutes) -> Area B -> (Δt = 35 minutes) -> Area C. The areas carry the spatial information, the transition times the temporal information.
- Variations: absolute time, visit duration, distance traveled, speed, sensor/user-provided measures (temp., pressure, ratings, ...)
Giannotti, Nanni, Pedreschi, Pinelli. Trajectory pattern mining. Proc. ACM SIGKDD 2007

Basic mobility patterns & models: T-Flocks
- Group of objects that move together (close to each other) for a time interval
M. Wachowicz, R. Ong, C. Renso, M. Nanni: Finding moving flock patterns among pedestrians through collective coherence. International Journal of Geographical Information Science 25(11): 1849-1864 (2011)

Derived patterns and models
- Combination & refinement of basic patterns and models
  - Individual Mobility Profile: routines consistently followed by a single moving object
  - T-PTree: predictive tree built by combining T-Patterns

Derived patterns and models: mobility profiles
- User history: an ordered sequence of spatio-temporal points.
- Trips construction: cutting the user history when a stop is detected. Parameters: stops spatial threshold, stops temporal threshold.
- Grouping: performing a density-based clustering equipped with a spatio-temporal distance function. Parameters: spatial tolerance, temporal tolerance.
- Pruning: groups with a small number of trips are pruned. Parameter: support threshold.
- Profile extraction: the medoid of each group becomes one of the user's routines, and the whole set becomes the user's mobility profile.
Trasarti, Pinelli, Nanni, Giannotti. Mining mobility user profiles for car pooling. ACM SIGKDD 2011
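The trips construction step can be sketched in Python as follows. This is one plausible reading of "cutting the user history when a stop is detected", not the published algorithm: it assumes a stop occurs between two consecutive points that are close in space (within `max_stop_dist`) but far apart in time (at least `min_stop_time`). Both parameter names are hypothetical stand-ins for the spatial and temporal stop thresholds.

```python
from math import hypot

def split_into_trips(history, max_stop_dist, min_stop_time):
    """Cut a time-ordered list of (x, y, t) points into trips.

    A cut is made between two consecutive points when the object barely
    moved (distance <= max_stop_dist) for a long time (gap >= min_stop_time).
    Fragments with fewer than two points are discarded.
    """
    trips, current = [], []
    for point in history:
        if current:
            x0, y0, t0 = current[-1]
            x1, y1, t1 = point
            stayed = hypot(x1 - x0, y1 - y0) <= max_stop_dist
            if stayed and (t1 - t0) >= min_stop_time:
                if len(current) > 1:
                    trips.append(current)
                current = []
        current.append(point)
    if len(current) > 1:
        trips.append(current)
    return trips
```

The resulting trips are the input of the grouping step, where a density-based clustering with a spatio-temporal distance function would be applied.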

Derived patterns and models: T-Prediction Tree
- Rule-based prediction model
- Each T-Pattern is used as a case
- Tree = combination / simplification of a set of T-Patterns
Monreale, Pinelli, Trasarti, Giannotti. Where Next: a predictor on Trajectory pattern mining. Proc. ACM SIGKDD 2009

Derived patterns and models: T-PTree
- Example: compare the actual trajectory against the T-PTree
- Spatial and temporal similarity are used to choose the best rule

Semantic Annotation
- Semantic trajectories: translate (x,y,t) trajectories to sequences of events with a semantics
- Semantic enrichment: tag and classify trajectories or patterns based on domain knowledge or mined information

Semantic trajectories
- First transform a (geometric) trajectory into a semantic representation, then apply data mining.
- A semantic trajectory is represented as a sequence of stops (places where objects stay still) & moves (trajectory segments where objects change position)
- Example: Tr1 = < Hotel [21-08], Monument [9-13], Restaurant [14-16] >

Mobility Diaries
- Data-driven diaries
- Describe daily mobility routines by means of a set of semantic trajectories...

... Mobility Diaries

Mobility Diaries
Figure 1: (a) The top two eigenbehaviors for Subject 4 of the Reality Mining dataset; the lighter the color, the higher the probability (image taken from [5]). (b) Exemplary LDA topics extracted from the Reality Mining dataset (image taken from [6]).
Applying PCA or LDA to a set of these arrays allows extracting some low-dimensional latent variables (eigenvectors and LDA topics, respectively) representing underlying patterns in the data, and offering conditional probability ...
Ferrari and Mamei. Classification & Prediction of Whereabouts Patterns from Reality Mining Data Sets. Pervasive & Mobile Computing Journal, Dec. 2011

M-Atlas system Download from: http://m-atlas.eu

M-Atlas input
- M-Atlas: an atlas for urban mobility behaviors. A framework to query, analyze and navigate the results on mobility data

M-Atlas platform
- A toolkit to extract, store and combine different kinds of models to build mobility knowledge discovery processes.

M-Atlas System Centralized database which contains all the data, patterns and models. It is possible to extend the system with new algorithms and new data, pattern or model types.

Practically, the system adds new object-relational types to the database in order to represent the new types of data, patterns and models. The advantage of having an object-relational representation is threefold: (i) it allows the definition of complex data such as lists and trees, ...

We distinguish between models and patterns: a pattern is a representation of a local property that holds over a sub-group of mobility data, e.g., a flock of trajectories; on the other hand, a model is a representation of a global property that holds over an entire dataset: accordingly, a model is either a global aggregate (e.g., speed distribution in a trajectory dataset) or a collection of patterns (e.g., the clustering that partitions an entire dataset into separate clusters).

Objects taxonomy in M-Atlas
- Data Object: Spatial Object, Temporal Object, Moving Object
- M-Model: T-Reachability, T-Clustering, T-ODMatrix, T-PTree
- M-Pattern: T-Pattern, T-Flow, T-Cluster, T-Flock

Fig. 8 The M-Atlas type hierarchy. M-Model, M-Pattern and Data are the basic types of data. We can notice the relationship between M-Models and M-Patterns. For example, the T-Clustering model is represented by a set of T-Cluster patterns, while the T-PTree model is an aggregation of T-Patterns.

DMQL: data constructor

CREATE DATA Travels BUILDING MOVING_POINTS
FROM (SELECT userid, lon, lat, datetime FROM RawData ORDER BY userid, datetime)
SET MOVING_POINT.MAX_SPACE_GAP = 0.2 AND
    MOVING_POINT.MAX_TIME_GAP = 1800

M-Pattern Types

A mobility pattern, M-Pattern in short, represents the common behavior of a (sub-)group of trajectories, obtained as a result of a data mining algorithm. The types of M-Patterns currently supported by M-Atlas are shown in Figure 9.

Fig. 9 M-Pattern types: (a) T-Cluster, (b) T-Pattern, (c) T-Flock, (d) T-Flow

T-Cluster. A T-Cluster (Figure 9(a)) is defined as a set S = {(t_1, l), (t_2, l), ...} of labelled trajectories, which share the same membership tag l. The trajectories of a T-Cluster are grouped on the basis of their similarity according to a specified similarity function, chosen from a repertoire of possible choices.

T-Pattern. It is represented as tp = (R, T, s), where R = <r_0, ..., r_k> is a sequence of regions, T = <t_1, ..., t_k> is a sequence of relative time intervals t_j = [t_j^s, t_j^e] associated to each region, and s is the support of tp, i.e., the number of trajectories that are compatible with tp in space and time. Informally, a T-Pattern can be represented as r_0 -(t_1)-> r_1 ... -(t_k)-> r_k. Originally introduced in [17], a T-Pattern (Figure 9(b)) is a concise description of frequent behaviors, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements).

T-Flock. A T-Flock f = (I, r, b) represents a spatio-temporal coincidence of a group of moving points, where I = [t_min, t_max] is the time interval of the coincidence, b is the base moving point and r is the spatial buffer around b which is used to determine the coincidence. This spatio-temporal coincidence defines a common behavior of the people which move ...

T-Flow. The T-Flow tf = <R_1, R_2, w> represents a flow of w trajectories which move from region R_1 to region R_2 (Figure 9(d)).

M-Model Types

Mobility models, M-Models in short, are the global models extracted by a data mining algorithm, where the adjective global indicates the fact that each such model describes the entire input dataset. Figure 10 illustrates some of the available M-Models in M-Atlas; other M-Models are simply the entire collection of T-Patterns, T-Clusters and T-Flocks mined over a trajectory dataset.

Fig. 10 M-Models types: (a) Reachability plot, (b) T-PTree and (c) T-ODMatrix.

Reachability plot. It is a histogram of distances between trajectories, obtained considering a specific distance function (Figure 10(a)). More precisely, it is a sequence of pairs Rp = <(t_1, d_1), ..., (t_n, d_n)> where t_j is a trajectory and d_j is the distance between t_j and t_{j+1}, where t_{j+1} is the nearest neighbor of t_j which does not occur in {t_1, ..., t_j}. Using a threshold for distance, the reachability plot identifies a set of T-Clusters representing the partition of the whole dataset into labelled groups of similar trajectories.

T-PTree. A T-Pattern Tree, T-PTree in short, is a compact representation of a set of T-Patterns (Figure 10(b)). It is a prefix tree PT = {root, N, E}, where N is the set of nodes of the tree, E is the set of edges and root is the root of the tree. Each node n_i = {r, supp} ...

DMQL: model constructors

... of a data mining method with a specified parameter setting. M-Atlas ... constructor for each method in its data mining library ... An example of a mining constructor query is the following, which generates a set of clusters under specific parameters:

CREATE MODEL ClusteringTable AS MINE T-CLUSTERING
FROM (SELECT t.id, t.trajobj FROM TrajectoryTable t)
SET T-CLUSTERING.FUNCTION = ROUTE_SIMILARITY AND
    T-CLUSTERING.EPS = 100 AND
    T-CLUSTERING.MIN_PTS = 20

3.2 Spatio-temporal query primitives

The querying primitives over data, models and patterns are ...

The user interface
- The process tree, which organizes the analyses done. Each node has a type: Trajectories, Map, Clustering, Flocks, etc. Each node is described by the chain of DMQL queries executed from the root.
- The map, loaded from OpenStreetMap and composed of different layers.
- Pre-built tools. Each one performs a set of DMQL queries on the selected node. Each tool has a set of parameters.
- Contextual menu: each node type has different options and tools.
- Additional panels for navigation or pattern selection.

Mobility Data Mining process as a DMQL query

CREATE MODEL MilanODMatrix AS MINE ODMATRIX
FROM (SELECT t.id, t.trajectory FROM TrajectoryTable t),
     (SELECT orig.id, orig.area FROM MunicipalityTable orig),
     (SELECT dest.id, dest.area FROM MunicipalityTable dest)

CREATE RELATION CenterToNESuburbTrajectories USING ENTAIL
FROM (SELECT t.id, t.trajectory
      FROM TrajectoryTable t, MilanODMatrix m
      WHERE m.origin = Milan AND m.destination IN (Monza,..., Brugherio))

CREATE MODEL ClusteringTable AS MINE T-CLUSTERING
FROM (SELECT t.id, t.trajectory FROM CenterToNESuburbTrajectories t)
SET T-CLUSTERING.FUNCTION = ROUTE_SIMILARITY AND
    T-CLUSTERING.EPS = 400 AND
    T-CLUSTERING.MIN_PTS = 5

CREATE RELATION DistributionCluster USING CONTAINS
FROM (SELECT t.id, t.trajectory, c.cid
      FROM ClusteringTable c, TrajectoryTable t
      WHERE c.tid = t.id),
     (SELECT * FROM Periods p)
WHERE cid IN (0,2,3)

Mobility Atlas of a City Understanding urban human mobility

The (GeoP)KDD process (figure): raw data (mobile phone data, GPS tracks) -> mobility data -> mobility data mining -> mobility patterns -> end user (mobility manager)

Sensing the movement: several data sources available

GSM data
- Mobile cellular networks handle information about the positioning of mobile terminals
  - CDR (Call Data Records): call logs (tower position, time, duration, ...)
  - Handover data: time of tower transition
- More sophisticated network measurements allow tracking of all active (calling) handsets

GPS tracks
- Onboard navigation devices send GPS tracks to central servers

Ide;Time;Lat;Lon;Height;Course;Speed;PDOP;State;NSat
8;22/03/07 08:51:52;50.777132;7.205580; 67.6;345.4;21.817;3.8;1808;4
8;22/03/07 08:51:56;50.777352;7.205435; 68.4;35.6;14.223;3.8;1808;4
8;22/03/07 08:51:59;50.777415;7.205543; 68.3;112.7;25.298;3.8;1808;4
8;22/03/07 08:52:03;50.777317;7.205877; 68.8;119.8;32.447;3.8;1808;4
8;22/03/07 08:52:06;50.777185;7.206202; 68.1;124.1;30.058;3.8;1808;4
8;22/03/07 08:52:09;50.777057;7.206522; 67.9;117.7;34.003;3.8;1808;4
8;22/03/07 08:52:12;50.776925;7.206858; 66.9;117.5;37.151;3.8;1808;4
8;22/03/07 08:52:15;50.776813;7.207263; 67.0;99.2;39.188;3.8;1808;4
8;22/03/07 08:52:18;50.776780;7.207745; 68.8;90.6;41.170;3.8;1808;4
8;22/03/07 08:52:21;50.776803;7.208262; 71.1;82.0;35.058;3.8;1808;4
8;22/03/07 08:52:24;50.776832;7.208682; 68.6;117.1;11.371;3.8;1808;4

- Sampling rate: 30 secs
- Spatial precision: 10 m
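Records in this format can be read with the Python standard library. The sketch below is illustrative only: it assumes the timestamp is written as day/month/two-digit year, which the sample does not state explicitly, and it keeps just a few of the ten fields.

```python
import csv
from datetime import datetime
from io import StringIO

# Two records copied from the sample above.
SAMPLE = """Ide;Time;Lat;Lon;Height;Course;Speed;PDOP;State;NSat
8;22/03/07 08:51:52;50.777132;7.205580; 67.6;345.4;21.817;3.8;1808;4
8;22/03/07 08:51:56;50.777352;7.205435; 68.4;35.6;14.223;3.8;1808;4
"""

def read_track(text):
    """Parse semicolon-separated GPS records into a list of dicts."""
    reader = csv.DictReader(StringIO(text), delimiter=";")
    points = []
    for row in reader:
        points.append({
            "id": int(row["Ide"]),
            # Assumed layout: dd/mm/yy hh:mm:ss
            "time": datetime.strptime(row["Time"], "%d/%m/%y %H:%M:%S"),
            "lat": float(row["Lat"]),
            "lon": float(row["Lon"]),
            "speed": float(row["Speed"]),
        })
    return points

track = read_track(SAMPLE)
```

Each parsed point carries the (lat, lon, time) triple needed to build the time-stamped trajectories used throughout the rest of the material.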

Road side sensors
- Measure the flow of a specific road arc
- Laser-based sensors
- Inductive loops
- Traffic cameras

Other data sources
- Social web services
  - Flickr
  - Foursquare
  - Gowalla
  - Twitter
- Presence estimation
  - Hotel statistics
  - Airport departures and arrivals
  - Bus and public transportation
  - Park usage
- Weather conditions

Dimensions to explore
- Space
  - Administrative borders (e.g., city)
  - Distance travelled: how much a person is travelling
- Individual
  - Preferred locations
  - EigenMobility
- Time
  - Hour of day
  - Day of week
  - Weekdays/weekends

A small city: Pisa

First dimension: space. Travel length distribution

Travel length on the map

Exploring Origins and Destinations (figure: origin-destination matrix among Pisa, Firenze, Lucca, Livorno and Siena, with sums; map panels: from everywhere to Firenze, 26 January; from everywhere to Lucca, 26 January; from everywhere to everywhere, all times; from everywhere to Lucca, all times; daily selection from 26 Jan to 30 Jan)

Exploring the origins of trips (distance bands: 0-5 km, 5-15 km, > 150 km)

Exploring origins of trips > 150 km: 19 trips

Second dimension: time. When do people move to Pisa?

Let's focus on the city level (distance bands: 0-5 km, 5-15 km)

Trips segmented by similarity

Explore clusters: Florence

Explore clusters: A1

Explore clusters: A12

Explore Clusters: Valdera

Explore clusters: Versilia

Trip segmentation by time

Trips Segmented by Time: from 5 to 8

Discover traffic jams

Aggregate trips by common destinations

Industry: Saint Gobain

Residential Area: I Passi

Residential vs Industrial

Services: Montacchiello

Extracting travellers' profiles
- Analysis focused on the single individual
- Find his/her systematic mobility: user trips -> mobility profile -> routines

Services: Montacchiello (Profiles)

Impact of systematic mobility on access patterns

What-if scenarios

Service: Montacchiello (Car Pooling?)
- Trajectory Blue, DT: 06:46:53
- Trajectory Red, DT: 11:52:06
- Trajectory Green, DT: 06:51:41
- Blue can give a ride to Green

Application: Car pooling. Proactive suggestions of ride-sharing opportunities, without the need for the user to explicitly specify the trips of interest. Matching two routines: Mobility profile share-ability:

Communities of users

Networks as a mining tool
S. Rinzivillo, S. Mainardi, F. Pezzoni, M. Coscia, D. Pedreschi, F. Giannotti. Discovering the Geographical Borders of Human Mobility. KI - Künstliche Intelligenz, 2012.

Mobility coverages

Step 1: spatial regions

Step 2: evaluate flows among regions

Step 3: forget geography

Step 4: perform community detection

Step 5: map back to geography

Step 6: draw borders
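Steps 1 to 3 above can be sketched in Python as follows. This is an illustration under assumptions that the steps do not spell out: regions are square grid cells of side `cell_size` (in the same planar units as the coordinates), and a flow is counted once per trip from the cell of its first point to the cell of its last point. The resulting weighted edge list no longer depends on geography and could be handed to any community detection algorithm for Step 4.

```python
from collections import Counter
from math import floor

def cell_of(x, y, cell_size):
    """Step 1: map a planar point to the index of its square grid cell."""
    return (floor(x / cell_size), floor(y / cell_size))

def region_flows(trips, cell_size):
    """Step 2: count trips between origin and destination cells.

    trips is a list of trajectories, each a list of (x, y, t) tuples.
    Trips starting and ending in the same cell are ignored.
    Step 3: the returned Counter maps (origin_cell, destination_cell)
    to a weight, i.e. a weighted directed graph without coordinates.
    """
    flows = Counter()
    for trip in trips:
        origin = cell_of(trip[0][0], trip[0][1], cell_size)
        destination = cell_of(trip[-1][0], trip[-1][1], cell_size)
        if origin != destination:
            flows[(origin, destination)] += 1
    return flows
```

After community detection, Step 5 amounts to looking up each cell index in the detected communities and coloring the corresponding square on the map.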

Final result

Final result: compare with municipality borders

Borders in different time periods
- Only weekday movements: similar to global clustering, strong influence of systematic movements
- Only weekend movements: strong fragmentation, the influence of systematic movements (home-work) is missing

Borders at regional scale

Final results
Fig. 7: The resulting clusters obtained with different spatial granularities: (a) 500 m, (b) 1,000 m, (c) 2,000 m, (d) 5,000 m, (e) 10,000 m, (f) 20,000 m.
... topology analysis of the networks performed in Section IV, that identified the most promising cell sizes at values smaller ...

Comparison with the new provinces

Explore borders by time
- Use temporal projections to extract mobility networks
- Identified three main periods
  - Week days
  - Week ends
  - Whole week
- Having GPS data extending over 4 weeks, we extracted 12 distinct networks, named week0, weekday0, weekend0, week1, and so on
Coscia, M., Rinzivillo, S., Giannotti, F. and Pedreschi, D. Optimal Spatial Resolution for the Analysis of Human Mobility. In ASONAM, 2012.

Degree distribution by time (figure: log-log plot of p(d) against degree d, one curve per network: Weekdays1-4, Weekend1-4, Week1-4)

Network properties (by day) (figure: number of nodes and edges, and number of connected components, per day from May 2nd to May 29th)

Borders quality

Semantic Enrichment

NetMob 2013. MP4-A Project: Mobility Planning For Africa. Mirco Nanni, Roberto Trasarti, Barbara Furletti, Lorenzo Gabrielli, Peter Van Der Mede, Joost De Bruijn, Erik De Romph, Gerard Bruil

The Challenge
- Incompleteness issue
  - Call Detail Records describe the location of users only during activity (calls, messages)
  - Most individual mobility might be invisible
- Lack of semantics
  - No information about activities and purpose
- Spatial uncertainty issue
  - Location described in terms of cells having dynamic and sometimes large extent

The approach (summary)
- Analyze raw GSM data to infer systematic mobility of individuals
- Build origin-destination matrices
  - Describe (expected) flows between areas
- Build a transportation model
  - Assigns the O/D matrix to the OSM road network through the OmniTRANS system

Systematic mobility
- A single trace of an individual can be poorly informative about his/her movements
- Yet, several daily traces of the same individual might make it possible to identify regular places and trips
- The whole individual mobility is then summarized by its systematic movements (morning routine and afternoon routine between H and W)
  - They will be used as the typical daily schedule of the individual

Systematic O/D matrix
- Combine the ten 2-week datasets into one
- For each user, extract significant L1 L2
- Aggregate (individual) systematic movements into (collective) systematic flows
- Examples: outgoing traffic, incoming traffic

Mobile phone socio-meters
Analyze individual call habits to recognize profiles:
- Residents
- Commuters
- Visitors/Tourists

Call Habit Profiles
- Week: working days & weekend
- Time slots: 0:00-7:59, 8:00-18:59, 19:00-23:59
- Result: the user's call habit profile
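A call habit profile of this kind can be sketched as a count of calls per (day type, time slot) cell. The Python below is a simplified illustration, not the authors' procedure: it assumes the input is a list of call timestamps for one user in the area of interest and that Saturday and Sunday form the weekend.

```python
from collections import Counter
from datetime import datetime

# Time slots from the slide, as (label, first hour, hour after last).
SLOTS = (("0:00-7:59", 0, 8), ("8:00-18:59", 8, 19), ("19:00-23:59", 19, 24))

def call_habit_profile(call_times):
    """Count calls per (day type, time slot) for one user.

    call_times is an iterable of datetime objects. Day type is
    "weekend" for Saturday/Sunday (an assumption), else "working days".
    """
    profile = Counter()
    for ts in call_times:
        day_type = "weekend" if ts.weekday() >= 5 else "working days"
        slot = next(name for name, start, end in SLOTS if start <= ts.hour < end)
        profile[(day_type, slot)] += 1
    return profile
```

For example, a user whose calls fall mostly in the working-day 8:00-18:59 cell and rarely in the night or weekend cells would look more like a commuter than a resident.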

Profiles (figures): resident profile; commuter profile; visitor profile (night visitors, daylight visitors)

User profile quantification: resident profile, commuter profile, visitor profile

Investigating semantic regularity of human mobility lifestyle
Vinicius Monteiro de Lira, Federal University of Pernambuco, Brazil, vcml@cin.ufpe.br
Valeria Cesario Times, Federal University of Pernambuco, Brazil, vct@cin.ufpe.br
Patricia Cabral Tedesco, Federal University of Pernambuco, Brazil, pcart@cin.ufpe.br
Salvatore Rinzivillo, ISTI-CNR, Pisa, Italy, salvatore.rinzivillo@isti.cnr.it
Chiara Renso, ISTI-CNR, Pisa, Italy, chiara.renso@isti.cnr.it
18th International Database Engineering & Applications Symposium (IDEAS '14), Porto, Portugal

INTRODUCTION
The appearance and wide distribution of position-enabled personal devices boosted the study of the mobility behavior of individuals based on crowdsourced data. When these positioning data are enriched with semantic information (i.e., the place visited), we have semantic trajectories. The semantics helps in understanding human dynamics.

About Regularity
- We study the tendency of mobile individuals to be regular or irregular when choosing the places and the time to perform some activities: semantic (or activity-based) regularity
- Definition of spatial and temporal entropy as a measure of the semantic regularity of users, computed from crowdsensed data
- Values range from 0 to 1, where 1 means highest regularity and 0 means lowest regularity or no regularity

Why study semantic regularity?
- Regularity profiles can characterize one specific aspect of the user lifestyle
- We give a quantitative measure of the regularity habits of the people under observation
- This can be useful in:
  - Recommendation systems
  - Carpooling
  - Advertisement

METHODOLOGY
The semantic regularity behavior is measured according to two dimensions:
- Spatial: how much a user tends to visit the same places to perform a given activity.
- Temporal: the regularity of the user in performing an activity in a preferred temporal interval.

METHODOLOGY
Three phases:
(i) Data collection of users' visits to Points of Interest (POIs);
(ii) Estimation of the regularity measures;
(iii) Extraction of the semantic regularity profiles.

Semantic regularity - Example
Visits dataset (we associate a category of place to an activity with a static mapping):

Place category | Activity   | Time
University     | Work/Study | 14.00-18.00
Gym            | Leisure    | 18.30-20.00
Restaurant     | Eating     | 12.45-13.30

Place category | Activity   | Time
University     | Work/Study | 14.30-18.30
Gym            | Leisure    | 19.00-20.30
Restaurant     | Eating     | 13.00-14.00

Visits and frequency distributions
The Visits dataset provides the mobility information to associate a person with a POI poi_id she visited:
< VisitID; UserID; poi_id; poi_cat; timestamp >
Formally, for a POI p of category C we define the spatial relative frequency distribution SRFD of user u as:
$$SRFD(u,C,p) = P(u \text{ in } p \mid C) = \frac{\#\text{visits to } p}{\#\text{visits to } C}$$
Formally, for a time interval t and a category C we define the temporal relative frequency distribution TRFD of u as:
$$TRFD(u,C,t) = P(u \text{ in } t \mid C) = \frac{\#\text{visits in } t}{\#\text{visits to } C}$$

The Entropy measures
Given a user u and a place category C, the Spatial Entropy (SH) is:
$$SH(u,C) = -\sum_{p \in C} SRFD(u,C,p)\,\log SRFD(u,C,p)$$
And, analogously, the Temporal Entropy (TH) over the set of time intervals I:
$$TH(u,C) = -\sum_{t \in I} TRFD(u,C,t)\,\log TRFD(u,C,t)$$
The Spatial Maximum Entropy (SMH) for each category:
$$SMH(C) = \log |C|$$
The Temporal Maximum Entropy (TMH) for each category:
$$TMH(C) = \log |I|$$

Semantic regularity
- Given a user u and a category C, the Semantic Spatial Regularity for C is denoted SSR(u,C).
- Given a user u, a set of intervals I and a category C, the Semantic Temporal Regularity for C is denoted STR(u,C).
- A semantic regularity profile for a user u consists of a set of tuples < Ci, SSR(u,Ci), STR(u,Ci) > for all categories of places (activities) Ci in C1, C2, ..., Cn.
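The spatial entropy and a regularity score derived from it can be sketched in Python as follows. The normalization used here, SSR(u,C) = 1 - SH(u,C)/SMH(C), is an assumption of this sketch, chosen so that values range from 0 to 1 with 1 meaning highest regularity; the text above does not confirm it. The function names and the `n_pois_in_category` parameter (standing for |C|) are illustrative.

```python
from collections import Counter
from math import log

def entropy(counts):
    """Shannon entropy (natural log) of a Counter of visit counts."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def spatial_regularity(visited_pois, n_pois_in_category):
    """Assumed form: SSR = 1 - SH / SMH, with SMH = log |C|.

    visited_pois lists the POI ids (all of one category) visited by a
    single user, one entry per visit.
    """
    if not visited_pois:
        raise ValueError("the user has no visits in this category")
    if n_pois_in_category < 2:
        return 1.0  # a single POI leaves no room for irregularity
    sh = entropy(Counter(visited_pois))
    smh = log(n_pois_in_category)
    return 1.0 - sh / smh
```

A user who always goes to the same gym scores 1, while a user spreading visits evenly over all gyms scores 0. The temporal counterpart would follow the same pattern with time intervals in place of POIs.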

Example
Example of Semantic Spatial Regularity and Semantic Temporal Regularity for gyms: we compute the Spatial Entropy (SH) and the Temporal Entropy (TH). Based on these, we can see that the spatial regularity SSR for the gym is high, while the temporal regularity STR is low.

EXPERIMENTS
- We tested our methodology using a dataset of check-ins generated from a Location-based Social Network (LBSN) called Brightkite.
- The dataset has a total of 968,784 check-ins performed by 2806 users around the world between March 22nd, 2008 and October 18th, 2010.
- Check-ins: user identification, geographic coordinates and time instant.
- The Foursquare API is used to semantically annotate the places where users performed the check-ins.
- 13 main categories of POIs are mapped to the most common activities.

EXPERIMENTS - restaurants
Restaurant category: most people tend to change the place when they go eating, and also the time when they go. Most of the users are irregular in space and time. (Figure labels: Most REGULAR, Most IRREGULAR)

EXPERIMENTS - University
University category: we clearly notice a very regular spatial behavior. Most of the users are distributed close to the value 1 (more regular) on the spatial dimension. (Figure labels: Most REGULAR, Most IRREGULAR)

EXPERIMENTS (figure: quadrants TL, TR, BL and BR, ranging from high regularity to high irregularity)

MAPMOLTY tool
MAPMOLTY computes a number of measures, called loyalty indicators, to summarize the loyalty level of each POI from different categories. The application is built upon the map to ease navigation and visualization in the area of interest.
Vinicius de Lira, Chiara Renso, Salvatore Rinzivillo, Valeria Cesario Times and Patricia Tedesco. MAPMOLTY: a web tool for discovering place loyalty based on mobile crowdsource data. Demo paper at ICWE 2014

Collect Movements from the Crowd
Investigate approaches to mine urban mobility patterns and anomalies by analyzing socially created trajectories:
- Extract mobility from geo-enabled social media
- Enrich with contextual/semantic information to extract more insights about the nature of the movements

Twitter Data
- Microblogging platform
- Users may send short messages (up to 140 characters) about what is around them
- Georeferenced tweets
- 600k tweets (300k geotagged)
- 33k users
- 8 weeks (May-June 2012)

How to build Tweet-trajectories
- Aggregate consecutive tweets according to a spatio-temporal threshold
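The aggregation step can be sketched roughly as follows. The deck does not spell out how the thresholds are applied, so this sketch makes three assumptions: a gap longer than `max_gap_s` between consecutive tweets of the same user starts a new trajectory, a tweet closer than `min_dist_m` to the last kept point is treated as the same place and skipped, and trajectories with fewer than two points are discarded. The defaults echo the "75 min./100 mt." figures quoted later in the deck. The `Tweet` class and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Tweet:
    user: str
    t: float      # timestamp in seconds
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_trajectories(tweets, max_gap_s=75 * 60, min_dist_m=100.0):
    """Group each user's time-ordered tweets into trajectories (assumed rule)."""
    by_user = {}
    for tw in sorted(tweets, key=lambda x: (x.user, x.t)):
        by_user.setdefault(tw.user, []).append(tw)

    trajectories = []
    for user_tweets in by_user.values():
        current = [user_tweets[0]]
        prev_time = user_tweets[0].t
        for tw in user_tweets[1:]:
            if tw.t - prev_time > max_gap_s:
                # Temporal gap too long: close the current trajectory.
                if len(current) >= 2:
                    trajectories.append(current)
                current = [tw]
            else:
                last = current[-1]
                # Keep the tweet only if the user has actually moved.
                if haversine_m(last.lat, last.lon, tw.lat, tw.lon) >= min_dist_m:
                    current.append(tw)
            prev_time = tw.t
        if len(current) >= 2:
            trajectories.append(current)
    return trajectories


# Made-up tweets around Barcelona, for illustration only.
tweets = [
    Tweet("a", 0, 41.38, 2.17),
    Tweet("a", 600, 41.39, 2.17),
    Tweet("a", 1200, 41.39, 2.17),               # same place: skipped
    Tweet("a", 1200 + 3 * 3600, 41.40, 2.18),     # long gap: new trajectory
    Tweet("a", 1200 + 3 * 3600 + 600, 41.41, 2.18),
    Tweet("b", 0, 41.38, 2.17),                   # single tweet: dropped
]
trajs = build_trajectories(tweets)
print([len(t) for t in trajs])
```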

Sampling rate distribution of tweets

Trajectory Extraction

Origin Destination Analysis

Origin Destination Analysis: relevant flows
[Figure panels: From Airport; From Sagrada Familia]

Semantic Enrichment

Foursquare
- User-contributed timestamped positions
- 9 top-level categories:
  - Nightlife Spot
  - Travel & Transport
  - Outdoors & Recreation
  - Shop & Service
  - College & University
  - Food
  - Arts & Entertainment
  - News
  - Residence
  - Professional & Other Places

Semantic trajectory mining: MWC2012 Semantic Trajectories
- Dataset: 9689 trajectories built (thresholds: 75 min / 100 m) from geo-located tweets in Barcelona during the week of the Mobile World Congress 2012 (MWC2012), the week before and the week after.
- Trajectories were semantically enriched by classifying each one as performed by a tourist or a local, and associated with the most likely Foursquare venues.

Semantic Trajectory Mining
Semantic Origin/Destination matrix built from the top-level Foursquare category of the origin and destination of each trajectory.
- Start of trajectory: Foursquare place: Burger King, L'Hospitalet; Foursquare top category: Food
- End of trajectory: Foursquare place: 22@, Glories; Foursquare top category: Professional & Other Places

Semantic Trajectory Mining
Semantic Origin/Destination matrix built from the top-level Foursquare category of the origin and destination of each trajectory.

Semantic Trajectory Mining
Semantic Origin/Destination matrix built from the top-level Foursquare category of the origin and destination of each trajectory.
[Figure panels: Week before; MWC2012 week; Week after]
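A semantic Origin/Destination matrix of this kind can be sketched as a simple count over (origin category, destination category) pairs. The input format (one list of Foursquare top-level category labels per trajectory) and the function name are assumptions made for illustration; the sample trajectories are invented, apart from the Food to Professional & Other Places pair taken from the slide example.

```python
from collections import Counter


def semantic_od_matrix(trajectories):
    """Count trajectories by (category of first point, category of last point).

    `trajectories` is assumed to be an iterable of lists of category labels,
    one label per trajectory point, already enriched with Foursquare data.
    """
    od = Counter()
    for categories in trajectories:
        if len(categories) >= 2:
            od[(categories[0], categories[-1])] += 1
    return od


# Illustrative input only.
sample = [
    ["Food", "Travel & Transport", "Professional & Other Places"],
    ["Food", "Professional & Other Places"],
    ["Travel & Transport", "Food"],
    ["Residence"],  # a single point has no origin/destination pair
]
od = semantic_od_matrix(sample)
for (origin, dest), n in sorted(od.items()):
    print(f"{origin} -> {dest}: {n}")
```

Building one such matrix per week (week before, MWC2012 week, week after) would allow the cell-by-cell comparison the slides show as figures.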

Join Semantics with Spatial Flow
[Figure: bar chart of trips by category entering Sants Montjuic, for Week 0, Week 1 and Week 2; y-axis: # of trips (0 to 90); categories: Food, Arts & Entertainment, Outdoors & Recreation, Professional & Other Places, Travel & Transport, Shop & Service, Nightlife Spot]

Join Semantics with Spatial Flow
[Figure: bar chart of trips by category exiting from Sants Montjuic, for Week 0, Week 1 and Week 2; y-axis: # of trips (0 to 40); categories: Food, Arts & Entertainment, Outdoors & Recreation, Professional & Other Places, Travel & Transport, Shop & Service, Nightlife Spot, College & University]

Where have you been today? Annotating trajectories with DayTag
S. Rinzivillo, F. Siqueira, L. Gabrielli, C. Renso, V. Bogorny. SSTD 2013, Demo Paper, Munich

Sensing People Behavior: Surveys
Cons:
- Low spatial precision
- Low temporal accuracy
- Limited in time (usually one or two days)
- Underestimation of short stops (e.g. ATM)
Pros:
- Semantically rich
- User view of movement
- Motivation of the movement

Sensing People Behavior: GPS
Cons:
- No semantic information
- Difficult for the user to reconstruct movement motivations
Pros:
- High spatial precision
- High temporal accuracy
- Unlimited tracking time
- Precise reconstruction of movement dynamics (acceleration, route, speed)
- Low-cost technology

DayTag

DayTag: Anatomy

DayTag: Timeline

DayTag: Spatial Reference

DayTag: Semantic Information

Change traffic with your TAGs
tagmyday.isti.cnr.it

Join us: Move, Tag, Send

Personal Data Store

Activity Distribution: incoming flow to Calci from Pisa

Atlas of Urban Mobility

Pisa Incoming Traffic

Trip distribution per day
[Figure: trip counts for Pisa, S. Giuliano and Cascina; x-axis: 0 to 23; two y-axes: 0 to 1600 and 0 to 120]

From DATA to KNOWLEDGE
[Diagram labels: Demographic data, Transport data, Movement data, Geographic data; Data, T-Clustering, T-Patterns, Models, Validation, Forecasts]

Deployment of a model
[Diagram labels: Data Integration and Semantic Enrichment; Service; Continuously Sensed indicator; Periodically Sensed indicator; Dashboard; Validation]
    CREATE MODEL MilanODMatrix AS MINE ODMATRIX
    FROM (SELECT t.id, t.trajectory FROM TrajectoryTable t),
         (SELECT orig.id, orig.area FROM MunicipalityTable orig),
         (SELECT dest.id, dest.area FROM MunicipalityTable dest)
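To give a rough idea of what an origin/destination matrix over trajectories and areas contains, here is a small Python sketch. It is not the semantics of the MINE ODMATRIX statement, which the slide does not define. It assumes that each trajectory is a list of (x, y) points, that each area is an axis-aligned rectangle standing in for a municipality polygon, and that a trajectory is counted once for the pair (area of first point, area of last point). All names and coordinates are hypothetical.

```python
from collections import Counter


def area_of(point, areas):
    """Return the name of the first rectangle containing `point`, else None.

    ASSUMPTION: areas are (xmin, ymin, xmax, ymax) rectangles; real
    municipality boundaries would need a proper point-in-polygon test.
    """
    x, y = point
    for name, (xmin, ymin, xmax, ymax) in areas.items():
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return name
    return None


def od_matrix(trajectories, areas):
    """Count trajectories by (origin area, destination area)."""
    od = Counter()
    for traj in trajectories:
        if not traj:
            continue
        origin = area_of(traj[0], areas)
        dest = area_of(traj[-1], areas)
        if origin is not None and dest is not None:
            od[(origin, dest)] += 1
    return od


# Made-up areas and trajectories in an arbitrary planar coordinate system.
areas = {"area_1": (0, 0, 10, 10), "area_2": (20, 0, 30, 10)}
trajectories = [
    [(1, 1), (15, 5), (25, 5)],   # area_1 -> area_2
    [(2, 3), (5, 5)],             # area_1 -> area_1
    [(22, 2), (12, 4), (3, 3)],   # area_2 -> area_1
    [(50, 50), (1, 1)],           # origin outside every area: ignored
]
matrix = od_matrix(trajectories, areas)
print(dict(matrix))
```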

Privacy by Design in Data Mining

7 Billion October 2011

The dark side: Privacy Risks
- Big data of human activity contain sensitive personal information
- Opportunities for discovering knowledge with analytical and data mining tools grow hand in hand with the risks of privacy violation
- An important question: may data publishing and mining violate individual privacy?

De-identified User Trajectory
- Human data may reveal many facets of private life
- Privacy protection is increasingly difficult and cannot be accomplished by de-identification alone
- The color darkness of each region is proportional to the number of different visits
- By discovering who lives in that home and works in that company, we can identify the user

How can we guarantee privacy protection in Data Mining? Privacy by Design Paradigm

Privacy by Design Paradigm
- Design frameworks to counter the threats of undesirable and unlawful effects of privacy violation without obstructing the knowledge discovery opportunities of data mining technologies
- Natural trade-off between privacy quantification and data utility
- Our idea: Privacy by Design in Data Mining
  - Philosophy and approach of embedding privacy into the design, operation and management of information processing technologies and systems

Privacy by Design in Data Mining
- The framework is designed with assumptions about:
  - The sensitive data that are the subject of the analysis
  - The attack model, i.e., the knowledge and purpose of a malicious party that wants to discover the sensitive data
  - The target analytical questions that are to be answered with the data
- Design a privacy-preserving framework able to:
  - transform the data into an anonymous version with a quantifiable privacy guarantee
  - guarantee that the analytical questions can be answered correctly, within a quantifiable approximation that specifies the data utility

Our Frameworks
- Privacy by Design for Data Publishing
  - Trajectory Anonymization by spatial generalization
  - Trajectory Anonymization by semantic generalization
- Privacy by Design for Data Mining Outsourcing
  - Privacy-Preserving Mining of Association Rules from Outsourced Transaction Databases
- Privacy by Design for GSM User Profiles
- Privacy by Design in Distributed Movement Data

Privacy by Design for Movement Data Publication A. Monreale, G. Andrienko, N. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, S. Wrobel. Movement Data Anonymity through Generalization. Journal of Transactions on Data Privacy

Privacy-Preserving Framework
- Anonymization of movement data while preserving clustering
- Trajectory Linking Attack: the attacker
  - knows some points of a given trajectory
  - and wants to infer the whole trajectory
- Countermeasure: method based on
  - spatial generalization of trajectories
  - k-anonymization of trajectories
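The linking attack can be illustrated with a toy computation. This sketch assumes that trajectories are sequences of generalized locations, that the attacker's knowledge is an ordered subsequence of the victim's trajectory, and that the chance of re-identification is 1 divided by the number of published trajectories compatible with that knowledge. These are simplifying assumptions for illustration; the function names are hypothetical.

```python
def is_subsequence(known, trajectory):
    """True if the items of `known` appear in `trajectory` in the same order."""
    it = iter(trajectory)
    # Each `any` consumes the iterator up to the match, preserving order.
    return all(any(p == q for q in it) for p in known)


def reidentification_probability(known, dataset):
    """1 / (number of trajectories compatible with the attacker's knowledge)."""
    matches = sum(1 for traj in dataset if is_subsequence(known, traj))
    return 1.0 / matches if matches else 0.0


# Made-up published dataset of cell sequences.
dataset = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "C"],
]
print(reidentification_probability(["A", "C"], dataset))  # two candidates
print(reidentification_probability(["A", "B"], dataset))  # a single candidate
```

In this reading, a k-anonymous publication would keep the probability at or below 1/k for any knowledge the attack model allows.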

Trajectory Generalization
Given a trajectory dataset:
1. Partition the territory into Voronoi cells
2. Transform trajectories into sequences of cells

Partition of the territory
- Characteristic points extraction: starts (1), ends (2), points of significant turns (3), points of significant stops and representative points from long straight segments (4)
- Spatial clusters: group the extracted points into clusters with the desired spatial extent (MaxRadius), which defines the degree of generalization
- Voronoi tessellation: partition the territory into Voronoi cells using the centroids of the spatial clusters as generating points

Generation of trajectories
Divide the trajectories into segments that link Voronoi cells. For each trajectory:
- the area a1 containing its first point p1 is found
- the following points are checked
- if a point pi is not contained in a1, the area a2 containing it is found
- and so on
Generalized trajectory: from a sequence of areas to a sequence of centroids of areas
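The cell-assignment steps above can be sketched in a few lines. The sketch relies on the fact that a point lies in the Voronoi cell of its nearest generating point, so no explicit tessellation is built. It assumes planar coordinates with Euclidean distance and a given list of cluster centroids; the function names and sample data are illustrative, not the authors' implementation.

```python
import math


def nearest_cell(point, centroids):
    """Index of the centroid closest to `point`, i.e. its Voronoi cell."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def generalize(trajectory, centroids):
    """Replace a trajectory by the sequence of centroids of the cells it visits.

    Consecutive points falling in the same cell collapse into one entry,
    matching the 'find a1, then a2 when a point leaves a1' procedure.
    """
    cells = []
    for p in trajectory:
        c = nearest_cell(p, centroids)
        if not cells or cells[-1] != c:
            cells.append(c)
    return [centroids[c] for c in cells]


# Made-up centroids and trajectory in planar coordinates.
centroids = [(0, 0), (10, 0), (10, 10)]
trajectory = [(1, 1), (2, 0), (8, 1), (9, 2), (10, 9)]
generalized = generalize(trajectory, centroids)
print(generalized)
```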

Generalization vs k-anonymity
- Generalization may not be sufficient to ensure k-anonymity:
  - For each generalized trajectory, do at least k-1 other people have the same trajectory?
- Two transformation strategies:
  - KAM-CUT: publishing only the k-frequent prefixes of the generalized trajectories
  - KAM-REC: recovering portions of trajectories that occur at least k times, minimizing the noise
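The idea of publishing only k-frequent prefixes can be illustrated with a simplified sketch. It is not the authors' KAM-CUT algorithm, whose details are not given on the slide. It assumes that each generalized trajectory is a sequence of cell identifiers and that every trajectory is truncated to its longest prefix shared by at least k trajectories, or dropped if no such prefix exists. All names are hypothetical.

```python
from collections import Counter


def prefix_support(trajectories):
    """Number of trajectories that start with each prefix."""
    support = Counter()
    for traj in trajectories:
        for n in range(1, len(traj) + 1):
            support[tuple(traj[:n])] += 1
    return support


def cut_to_k_frequent_prefixes(trajectories, k):
    """Truncate each trajectory to its longest prefix with support >= k."""
    support = prefix_support(trajectories)
    published = []
    for traj in trajectories:
        n = len(traj)
        while n > 0 and support[tuple(traj[:n])] < k:
            n -= 1
        if n > 0:
            published.append(list(traj[:n]))
    return published


def prefixes_are_k_frequent(published, k):
    """Check that every published trajectory is a prefix of >= k published ones."""
    support = prefix_support(published)
    return all(support[tuple(t)] >= k for t in published)


# Made-up generalized trajectories (sequences of cell identifiers).
data = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"], ["E", "F"]]
published = cut_to_k_frequent_prefixes(data, k=2)
print(published)
print(prefixes_are_k_frequent(published, k=2))
```

In this toy run the rare suffix "D" and the unique trajectory E, F are suppressed, which shows the utility cost that a strategy like KAM-REC tries to reduce by recovering frequent portions.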

Dataset
- Trajectory data from the city of Milan
- GPS traces from about 17,000 vehicles

Clustering on Anonymized Trajectories

Probability of re-identification

Conclusion
- Opportunities and challenges in gaining deep insight into human mobility
- Mobility models as a dual piece of knowledge
  - Enabler for new services
  - Decision support for planning and design
- Creation and extraction of complex models supported by an integrated platform: M-Atlas
  - Management of complex analytical processes
  - Deployment of services

Conclusion
- Privacy is an ever-growing concern in our society
- Privacy concerns often lead to skepticism
  - Effects on the use of technologies
  - Effects on the opportunities of data understanding
- Providing methodologies for risk evaluation and data control

Key publications
- F Giannotti, M Nanni, F Pinelli, D Pedreschi. Trajectory pattern mining. ACM SIGKDD 2007
- F Giannotti, D Pedreschi. Mobility, data mining and privacy: Geographic knowledge discovery. Springer, 2008
- A Monreale, F Pinelli, R Trasarti, F Giannotti. WhereNext: a location predictor on trajectory pattern mining. ACM SIGKDD 2009
- S Rinzivillo, D Pedreschi, M Nanni, F Giannotti, N Andrienko, G Andrienko. Visually driven analysis of movement data by progressive clustering. Information Visualization 7 (3-4), 225-239, 2008
- D Wang, D Pedreschi, C Song, F Giannotti, AL Barabasi. Human mobility, social ties, and link prediction. ACM SIGKDD 2011
- F Giannotti, M Nanni, D Pedreschi, F Pinelli, C Renso, S Rinzivillo, R Trasarti. Unveiling the complexity of human mobility by querying and mining massive trajectory data. The VLDB Journal 20 (5), 2011
- R Trasarti, F Pinelli, M Nanni, F Giannotti. Mining mobility user profiles for car pooling. ACM SIGKDD 2011

Key publications
- M Coscia, G Rossetti, F Giannotti, D Pedreschi. Demon: a local-first discovery method for overlapping communities. ACM SIGKDD 2012
- S Rinzivillo, S Mainardi, F Pezzoni, M Coscia, D Pedreschi, F Giannotti. Discovering the geographical borders of human mobility. KI-Künstliche Intelligenz 26 (3), 2012
- D Pennacchioli, M Coscia, S Rinzivillo, D Pedreschi, F Giannotti. Explaining the Product Range Effect in Purchase Data. IEEE BigData 2013
- B Furletti, L Gabrielli, C Renso, S Rinzivillo. Analysis of GSM Calls Data for Understanding User Mobility Behavior. IEEE BigData 2013
- L Milli, A Monreale, G Rossetti, D Pedreschi, F Giannotti, F Sebastiani. Quantification trees. IEEE ICDM 2013

Vision papers
- F Giannotti, D Pedreschi, A Pentland, P Lukowicz, D Kossmann, J Crowley, D Helbing. A planetary nervous system for social mining and collective awareness. The European Physical Journal Special Topics 214 (1), 49-75, 2012
- J van den Hoven, D Helbing, D Pedreschi, J Domingo-Ferrer, F Giannotti. FuturICT The road towards ethical ICT. The European Physical Journal Special Topics 214 (1), 153-181, 2012
- M Batty, KW Axhausen, F Giannotti, A Pozdnoukhov, A Bazzani, M Wachowicz. Smart cities of the future. The European Physical Journal Special Topics 214 (1), 481-518, 2012