Mining Public Transport User Behaviour from Smart Card Data



Similar documents
Standardization of Components, Products and Processes with Data Mining

Calculation of Transit Performance Measures Using Smartcard Data

WEB-BASED ORIGIN-DESTINATION SURVEYS: AN ANALYSIS OF RESPONDENT BEHAVIOUR

Smart Card Data in Public Transit Planning: A Review

Interuniversity Research Centre on Enterprise Networks, Logistics and Transportation

MEASURING THE SIZE OF SMALL FUNCTIONAL ENHANCEMENTS TO SOFTWARE

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Mining in Telecommunication

Enhancing the value of an incidents database with an interactive visualization tool

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

ATSB RESEARCH AND ANALYSIS REPORT ROAD SAFETY. Characteristics of Fatal Road Crashes During National Holiday Periods

Data Mining Algorithms Part 1. Dejan Sarka

Scheduling Employees in Quebec s Liquor Stores with Integer Programming

Innovations in London s transport: Big Data for a better customer experience

Introduction. A. Bellaachia Page: 1

ADHAWK WORKS ADVERTISING ANALTICS ON A DASHBOARD

Fleet Size and Mix Optimization for Paratransit Services

Comparing data from mobile and static traffic sensors for travel time assessment

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Building Data Cubes and Mining Them. Jelena Jovanovic

Database Marketing, Business Intelligence and Knowledge Discovery

Introduction to Data Mining

How To Use Neural Networks In Data Mining

Using Data Mining for Mobile Communication Clustering and Characterization

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

SPATIAL DATA CLASSIFICATION AND DATA MINING

Security Requirements Analysis of Web Applications using UML

Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad

USING SMART CARD DATA FOR BETTER DISRUPTION MANAGEMENT IN PUBLIC TRANSPORT Predicting travel behavior of passengers

Azure Machine Learning, SQL Data Mining and R

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Social Media Mining. Data Mining Essentials

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Tracking System for GPS Devices and Mining of Spatial Data

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Data-mining-based methodology for the design of product families

From Big Data to Smart Data How to improve public transport through modelling and simulation.

Refundable Tax. revenuquebec.ca

Regulation respecting the distribution of information and the protection of personal information

USING BIG DATA OF AUTOMATED FARE COLLECTION SYSTEM FOR ANALYSIS AND IMPROVEMENT OF BRT- BUS RAPID TRANSIT LINE IN ISTANBUL

Chapter 20: Data Analysis

PUBLIC PRESCRIPTION DRUG INSURANCE PLAN. Provisions of the REFERENCE DOCUMENT

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Big Data for Transportation: Measuring and Monitoring Travel

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Operational Indicators for public transport companies

Computing construction logic in a multi-calendar environment using the Chronographic Method

Credit Cards: Disentangling the Dual Use of Borrowing and Spending

Cross-Analysis of Hazmat Road Accidents Using Multiple Databases

A Proposed Data Mining Model to Enhance Counter- Criminal Systems with Application on National Security Crimes

Numerical Field Extraction in Handwritten Incoming Mail Documents

Comparison of K-means and Backpropagation Data Mining Algorithms

Hexaware E-book on Predictive Analytics

Benchmarking web site functions Hugues Boisvert CMA International Centre, Montreal, Canada, and

Crowdsourcing mobile networks from experiment

Qlik s Associative Model

BOR 6335 Data Mining. Course Description. Course Bibliography and Required Readings. Prerequisites

Knowledge Discovery from patents using KMX Text Analytics

Application of Data Mining Methods in Health Care Databases

«VISUALIZATION OF POTENTIAL CUSTOMERS»

1. CANADA DAY 2012 PARKING LOT PARTIES IN THE BYWARD MARKET FÊTE DU CANADA 2012 CÉLÉBRATIONS TENUES DANS LE PARC DE STATIONNEMENT DU MARCHÉ BY

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Customer Analytics. Turn Big Data into Big Value

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

MS1b Statistical Data Mining

URBAN MOBILITY IN CLEAN, GREEN CITIES

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

Equational Reasoning as a Tool for Data Analysis

Customer Classification And Prediction Based On Data Mining Technique

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Course Syllabus For Operations Management. Management Information Systems

A Study of Web Log Analysis Using Clustering Techniques

Spatio-Temporal Patterns of Passengers Interests at London Tube Stations

Study on Weber State University Parking Parking & Shuttle Buses

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

VISUALIZATION OF GEOMETRICAL AND NON-GEOMETRICAL DATA

Statistics Canada s National Household Survey: State of knowledge for Quebec users

Data Mining Solutions for the Business Environment

Agent-Based Simulation to Anticipate Impacts of Tactical Supply Chain Decision-Making in the Lumber Industry

Available online at Available online at Advanced in Control Engineering and Information Science

Tutorial Segmentation and Classification

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION

Travaux publics et Services gouvernementaux Canada. Title - Sujet RFSA FOR THE PROVISION OF SOFTWARE. Solicitation No. - N de l'invitation

Overview of the. Tax Credit. Services for Seniors. revenuquebec.ca

The Data Mining Process

Keywords Data Mining, Knowledge Discovery, Direct Marketing, Classification Techniques, Customer Relationship Management

Considerations on Audience Measurement Procedures for Digital Signage Service

Implementation of Data Mining Techniques to Perform Market Analysis

Impact of Online Tracking on a Vehicle Routing Problem with Dynamic Travel Times

Ticketing and user information systems in Public Transport in Thessaloniki area

HELSINKI UNIVERSITY OF TECHNOLOGY T Enterprise Systems Integration, Data warehousing and Data mining: an Introduction

Improving the Wholesales Trough Using the Data Mining Techniques

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Visualization methods for patent data

An Overview of Knowledge Discovery Database and Data mining Techniques

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Transcription:

Mining Public Transport User Behaviour from Smart Card Data Bruno Agard Catherine Morency Martin Trépanier October 2007 CIRRELT-2007-42

Bruno Agard 1,2, Catherine Morency 1,3, Martin Trépanier 1,2,3,* 1 Interuniversity Research Centre on Enterprise Networks, Logistics and Transportation (CIRRELT), Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Canada H3C 3J7 2 Département de mathématiques et génie industriel, École Polytechnique de Montréal, C.P. 6079, succursale Centre-ville, Montréal, Canada H3C 3A7 3 Département des génies civil, géologique et des mines, École Polytechnique de Montréal, C.P. 6079, succursale Centre-ville, Montréal, Canada H3C 3A7 Abstract. In urban public transport, smart card data is made of millions of observations of users boarding vehicles over the network across several days. The issue that we address is whether data mining techniques can be used to study user behaviour from these observations. This must be done with the help of transportation planning knowledge. Hence, this paper presents a common transportation planning / data mining methodology for user behaviour analysis. Experiments were conducted on data from a Canadian transit authority. This experience demonstrates that a combination of planning knowledge and data mining tool allows to produce travel behaviours indicators, mainly regarding regularity and daily patterns, from data issued from operational and management system. Results show that the public transport users of this study can rapidly be divided in four major behavioural groups, whatever type of ticket they use. Keywords. Smart card, public transportation, information system, data mining. Acknowledgements. The authors wish to acknowledge the support and collaboration of the Société de transport de l Outaouais, who gracefully provided data for this study. The research project is also supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), and by the Agence métropolitaine de transport of Montreal (AMT). Results and views expressed in this publication are the sole responsibility of the authors and do not necessarily reflect those of CIRRELT. Les résultats et opinions contenus dans cette publication ne reflètent pas nécessairement la position du CIRRELT et n'engagent pas sa responsabilité. * Corresponding author: Martin.Trepanier@cirrelt.ca Dépôt légal Bibliothèque nationale du Québec, Bibliothèque nationale du Canada, 2007 Copyright Agard, Morency, Trépanier and CIRRELT, 2007

1. INTRODUCTION While Smart Card Automated Fare Collection Systems popularity is increasing in public transport, a large quantity of data is gathered each day in the existing systems. Although they are mainly aimed to revenue collection, SCAFC systems can help planners to better understand public transport user behaviours, thus helping to improve the service. Indeed, these systems contain data on each boarding in a public transport network, with exact time and some precision on location. The datasets produced by these systems grow rapidly and requires both efficient data exploration tool and discrimination ability from planning experience. It is our belief that data mining gathers useful data processing methods to enhance the analytical relevance of smart card data. Hence, this paper tries to join the better of two worlds data mining methods and public transport planning models in order to obtain an improved portrait of user behaviours in a public transport system equipped with a SCAFC system. User behaviour is a quite large research field, and this paper will focus mainly on weekday s trips habits. In the literature review, we will expose some works on smart card data analysis in public transport. We will recall some references in data mining advances. Nonetheless, we could not find any previous work combining the two research fields. In the methodology section, the case study and the smart card data structure will be presented. The experiments section relates the different approaches used for analysis and the hypothesis for the different clustering actions. Finally, the paper exposes results on two aspects: data mining methods spin-offs, and user behaviour key elements. 2. REVIEW 2.1 Smart card in public transport The complex fare system that is used by many public transport authorities can be better managed with the help of a Smart Card Automated Fare Collection System, because smart cards can store more than one transport document at a time and the card is validated automatically (Card Technology Today, 2003, 2004). In the next years, the need to integrate fare policies within large metropolitan areas will promote smart card usage (Bonneau 2002). However, privacy is an important issue that could retain smart card implementation. The French Council for Computer and Liberty recommends being careful with such data because one may reconstitute the personal movements of a specific person (CNIL 2003). But Clarke (2001) recalls that smart card data is not different from other individual data collection systems like credit card usage, road tolls, police corps database. CIRRELT-2007-42 1

With technological and ethical problems resolved, several advantages arise from the analysis of SCAFC data (Bagchi and White 2004): Access to larger sets of individual data; Possibilities of links between user and card information; Continuous data available for long periods of time; Better knowledge of a large part of the transit users. These authors conducted a study on the passenger transfer behaviours on the Bradford and Merseyside transit networks (UK). The absence of alighting location information was then identified as the main issue for further analysis. In a recent paper (Bagchi and White 2005), they propose different actions to avoid undermining data quality in SCAFC systems. They especially insist on the need of implementing complementary surveys to validate SCAFC data. They also propose that the organizations prepare a 2- year settling-in period to implement such systems. In the case of the Société de transport de l Outaouais (STO), Trépanier et al. (2004) have shown the potentialities of using SCAFC data for public transport network planning with the help of a Transportation Object-Oriented Modelling. 2.2 Data mining tools and applications Anand and Büchner (1998) defined data mining as the discovery of non-trivial, implicit, previously unknown, and potentially useful and understandable patterns from large data sets. Westphal and Blaxton (1998) categorized data mining functions as classification, estimation, segmentation, and description: Classification involves assigning labels to new data based on the knowledge extracted from historical data. Estimation deals with filing in missing values in the fields of a new record as a function of fields in other records. Segmentation (called also clustering) divides a population into smaller sub-populations with similar behaviour according to a predefined metric. It maximizes homogeneity within a group and maximizes heterogeneity between the groups. Description and visualization are used to explain the relationships among the data. Frequent patterns may be extracted in the form of A=>B rules with two measures of quality: the support which represents the number of times A occurs as a fraction of the total number of examples and confidence which expresses the number of times B exists in the data when A is present. Data mining techniques offer applications in many areas, on can cite as examples Braha (2001) who propose many applications of in design and manufacturing as well as Berry and Linoff (2004) who presented numerous examples and applications of data mining in marketing, sales, and customer support. CIRRELT-2007-42 2

Agard and Kusiak (2004) developed a methodology for the design of product families with data mining. da Cunha et al. (2005) analyzed manufacturing quality operation for resequencing. 3. METHODOLOGY 3.1 Case study Experiments were conducted at the Société de transport de l Outaouais (STO), Gatineau, Quebec. The STO is a medium-size public transport authority operating 200 buses and servicing 240,000 inhabitants. The STO operates its smart card system since 2001. Today, more than 80% of all STO passengers hold a smart card. Moreover, every STO bus is equipped with GPS reader. At each boarding, stop location and bus route are stored in the database along with a timestamp. Since the STO uses a high-level secured procedure to ensure the privacy of the data, smart card data are completely anonymous. No nominal information on user is known in any kind. 3.2 Smart Card Data structure A special information system has been developed at the STO to manage SCAFC data. This fare collection information system is made up of several subsystems as shown on Figure 1. Smart cards are first bought at emission locations and can be recharged at further locations. Then, when the user boards the buses, the smart card s fare gets validated. The validation is done following those steps: the bus system contains the planned runs for the day (a run is a sequence of stops to be deserved; it usually represents one direction of a route). The Global Positioning System (GPS) reader on the bus identifies the stop where the boarding is made. The system validates the run (correct route) at this location. Card number, date, time, validation status and stop number are stored at each boarding. This information is downloaded to the central server at each end of day. CIRRELT-2007-42 3

Smart card holder validation (boarding) Bus with smart card reader and GPS user info boarding data (asynchrone) planned routes and runs for the bus Smart card emission locations (4) Smart card recharge locations (20) financial transactions user and card infos user data privacy control reports «SIVT»server boarding data overall planned routes and runs Accounting system Service operation information system Fig. 1. The STO smart card information system The central database management system (called SIVT server, Système d information et de validation des titres) holds the information of users and boardings. The data is voluntarily separated to preserve confidentiality; individual user information will not be used in this study. The central server is also fed by service operation information system and smart card point-of-sale systems. Figure 2 presents the simplified data model that stores boarding data. The key table is Boardings, which contains a record for each smart card validation on the network. Boarding refusals are also stored in the table, so they will be ignored for the analysis. A link is also made between boarding and the public transport network geometry ( Routes and Route-stops ). The link ensures the validity of the time and location of boardings. Finally, information on cards and card recharges are available but will not be used here. CIRRELT-2007-42 4

Points of sale Card types Card refills Cards Boarding types Boardings Date, time Card # Route, direction Stop # Boarding type Validation result Routes Route-stops Fig. 2. Simplified database model for boarding data storage 3.3 Mining tools Different data mining techniques and tools were used in this project: Filtering data Clusters : K-mean, HAC Group characterization The dataset was extracted from a SQL Server database and was analysed within the TANAGRA free software (Rakotomalala, 2005). 3.4 Dataset The dataset is a compilation of 2,147,049 boarding validations made by 25,452 card holders on the STO network between January 10 th and April 1 st, 2005. The transactions were summarized into twelve 1-week record for each card (238,895 user-weeks overall). Each record is divided into 20 binary variables, representing 5 weekdays X 4 periods per day (AM = 5:30 to 8:59, MI = 9:00 to 15:29, PM = 15:30 to 17:59 and SO = 18:00 and over). This day division is common to transport planning studies in Quebec, other countries may have different time intervals. Table 1 presents sample records. Fields J2_AM to J6_SO represent the 20 periods (2=Monday,, 6=Friday). For example, the card #12244 is of category T1 (in the group adult ). This card has been validated (user has boarded a bus at least once) on Monday morning (J2_AM) and Monday PM peak hour (J2_PM) on week S1, on week S2, and so on. In addition, we can assume that the holder of card #23231 has made at least one trip at each period on Monday in the first week. CIRRELT-2007-42 5

Information will be extracted from that dataset in order to split the user-weeks in homogeneous groups according to their behaviours and thus explain (or reveal) the comportment of each group. In the sample dataset, customers are classified as adults, students and elderly for commercial raisons. This classification relies on the transit card type. Table 1: Sample dataset Num_card Num_title group week J2_AM J2_MI J2_PM J2_SO... J6_SO 12244 T1 adult S1 1 0 1 0 12244 T1 adult S2 1 0 1 0 1234312 T2 adult S1 1 0 0 1 23231 T34 student S1 1 1 1 1 3.5 Protocol Let us remind that the goal is to characterize user behaviour by looking at similar trip habits during weekdays and through the weeks. The analysis will conduct to an aggregate view of the 25,452 card holders. The first step considers the whole dataset as an ensemble and subdivides it in large homogeneous clusters on the base of the trip patterns observed on a weekly basis. The goal is to determine natural groupings of user-weeks. The number of groups is not fixed a priori; the Hierarchical Ascending Clustering method (HAC) is employed (Lebart et al., 2000). In order to accelerate the HAC algorithm, a first grouping is computed with a k-mean method to provide 20 groups. The result of the k-mean clustering becomes the input of the HAC. In order to build groups that have the same behaviours, the inputs of the clustering methods are constituted of all the Jx_xx columns. In the second step, we analyse the composition of the natural grouping with respect to card type in order to see if the clustering method only reconstructs pre-known segments (adults, students, and elderly). In the last step of this study, we examine the variability of the group belongings for the twelve weeks of observation. This will give a good idea of the regularity of the habits over time. It will also help identify the unusual weeks in terms of travel behaviours. 4. RESULTS 4.1 General user behaviour The processing of the whole dataset produces 4 clusters of user-weeks with similar patterns. Two of the clusters (Group 1 and Group 3) have easily interpretable travel behaviours. Fig. 3: The first group (45.6% of the user-weeks) clearly relates to the people with regular trips to and from constraint activities such as work since they mainly travel during the peak hours. On average, 79.4% of these users will travel during the AM peak hour and 71.0% during the PM peak hour on CIRRELT-2007-42 6

weekdays. The proportion of users travelling during the other periods is quite low, 6.4% during the day and 2.6% during the evening. Group 1 % users boarding on the transit network 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% "J2_AM" "J2_MI" "J2_PM" "J2_SO" "J3_AM" "J3_MI" "J3_PM" "J3_SO" "J4_AM" "J4_MI" "J4_PM" "J4_SO" "J5_AM" "J5_MI" "J5_PM" "J5_SO" "J6_AM" "J6_MI" "J6_PM" "J6_SO" Monday Tuesday Wednesday Thursday Friday Fig. 3. General user behaviour Group 1 Fig. 4: The third group (14.3% of the user-weeks) relates to people with regular activities in the first part of the day with, on average, 77.6% of the users travelling during the AM peak hour and 74.8% during the day. Group 3 % users boarding on the transit network 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% "J2_AM" "J2_MI" "J2_PM" "J2_SO" "J3_AM" "J3_MI" "J3_PM" "J3_SO" "J4_AM" "J4_MI" "J4_PM" "J4_SO" "J5_AM" "J5_MI" "J5_PM" "J5_SO" "J6_AM" "J6_MI" "J6_PM" "J6_SO" Monday Tuesday Wednesday Thursday Friday Fig. 4. General user behaviour Group 3 The two other clusters (Group 2 and Group 4 with respectively 14.8% and 25.2% of the user-weeks) gather users with similar patterns of non-travelling. Actually, no clear travel pattern can be observed for Group 2 while Group 4 gathers users with lower use of the network. At most, 36% of the users of this group will use the transit network during the day period. CIRRELT-2007-42 7

Group 2 % users boarding on the transit network 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% "J2_AM" "J2_MI" "J2_PM" "J2_SO" "J3_AM" "J3_MI" "J3_PM" "J3_SO" "J4_AM" "J4_MI" "J4_PM" "J4_SO" "J5_AM" "J5_MI" "J5_PM" "J5_SO" "J6_AM" "J6_MI" "J6_PM" "J6_SO" Monday Tuesday Wednesday Thursday Friday Fig. 5. General user behaviour Group 2 Group4 % users boarding on the transit network 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% "J2_AM" "J2_MI" "J2_PM" "J2_SO" "J3_AM" "J3_MI" "J3_PM" "J3_SO" "J4_AM" "J4_MI" "J4_PM" "J4_SO" "J5_AM" "J5_MI" "J5_PM" "J5_SO" "J6_AM" "J6_MI" "J6_PM" "J6_SO" Monday Tuesday Wednesday Thursday Friday Fig. 6. General user behaviour Group 4 4.2 Composition of the natural groups The following tables summarize the composition of the four clusters created by the data mining method. On the first hand, Table 2 presents the distribution of the user-weeks in the four groups according to the type of card hold by the traveller. For instance, it shows that almost 80% of the weeks of travel of elderly card holders are classified in Group 4. This is no surprise since this Group has a low level of mobility. It also confirms that the symmetrical movements observed for user-weeks of group 1 are partly explained by the plausible level of constraint of the activities of the adults, with almost 60% of the adult card holders being in this group. CIRRELT-2007-42 8

Table 2: Distribution of user-weeks in the four clusters according to card type Card type Gr1 Gr2 Gr3 Gr4 TOT Adult 58,8% 13,9% 9,2% 18,1% 100% Student 21,0% 17,7% 26,4% 34,8% 100% Elderly 6,2% 6,4% 7,9% 79,5% 100% Table 2 also shows that the weeks of travel of student card holders are spread over the four natural groups. This is quite intriguing. It may be caused by the fact that some students automatically receive transit passes due to their status without actually needing it for their main travel needs. What is then observed is a very low subset of their travelling behaviours on weekdays. On the second hand, Table 3 presents the composition of the four natural clusters in terms of card type (of the travellers for which user-weeks are examined). We see that the adult card holders represent 85% of the user-weeks of Group 1. This also explains the regularity of the travel patterns on weekdays of this Group. This table shows that student are the main segment composing Group 3 which shows regular travel patterns in the first part of the weekdays. Table 3: Composition of the four clusters regarding card type Card type Adult Student Elderly Total Gr1 85,6% 13,9% 0,5% 100% Gr2 62,4% 36,1% 1,4% 100% Gr3 42,7% 55,4% 1,8% 100% Gr4 47,7% 41,7% 10,6% 100% What these tables show as well is that some card types are linked to regular behaviours such as the elderly cards (app. 80% of the users with this type of card belong to group 4) but that other types can reveal variable travel behaviours over several weeks. 4.3 Variability over 12 weeks A better understanding of these two groups is obtained by the study of the variability of belonging of the users over the 12 weeks. For the student case, a study of the belonging among these groups over the 12 weeks, presented in Figure 7, clearly spots one irregular week of travel. The atypical distribution observed on week 10 is due to the school break where students temporarily adopt other activity patterns. CIRRELT-2007-42 9

Variability of cluster belonging over 12 weeks for student card holders 100% % users in each natural group 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Gr4 Gr3 Gr2 Gr1 W3: Jan 10 W4: Jan 17 W5: Jan 24 W6: Jan 31 W7: Feb 07 W8: Feb 14 W9: Feb 21 W10: Feb 28 W11: Mar 07 W12: Mar 14 W13: Mar 21 W14: Mar 28 Fig. 7. Variability of cluster belonging over 12 weeks of the student card holders Adult Student Elderly Gr1 Gr2 Gr3 Gr4 ALL BETWEEN 80% & 99% UNDER 80% Fig. 8. Proportion of users regarding their cluster belonging over weeks (week # 10 removed) In the figure 8, the black sectors of the pie charts indicate users that belong to the same group for all weeks (gray is for most of the weeks). The size of the charts is proportional to the number of users. We can see that a large majority of the adults of group 1 have a regular behaviour. It is less the case for adults of other groups. Students have less regular behaviour. That could be related to 1) irregular college schedules and 2) students are sometimes accompanied to school by their parents and will not use public transit. Elderly CIRRELT-2007-42 10

people of group 4 are concentrated in Group 4, but let us remind that this group is atypical with a low level of mobility. 5. CONCLUSION This study is only a first impression about mining data from Smart Card Automated Fare Collection systems. First results tell that data mining techniques help to identify and characterize market segments among public transportation users. It also demonstrates the wide set of statistics that can be calculated from the dataset; much work is still to come to pinpoint the most important statistics for the STO and the user behaviour researchers. Further studies will be conducted to better characterise both supply and demand on the STO public transport network. Typically, the following elements may be examined: Geospatial trip behaviour (with the help of geospatial data mining techniques). Specific route usage over space and time (to measure the turnover of the users). More detailed mining in time period (for example, using the exact time of the first boarding of each day over a year period). 6. REFERENCES ANAND, S.S. and BÜCHNER, A.G. (1998), Decision Support Using Data Mining, Financial Times Pitman Publishers, London, UK. BAGCHI, M., WHITE, P.R. (2004), What role for smart-card data from bus system?, Municipal Engineer 157, mars 2004, p.39-46 BERRY, M.J.A. and LINOFF, G. (2004), Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley. BONNEAU, W. and editors (2002). The role of smart cards in mass transit systems, Card Technology Today, June 2002, p.10. BRAHA, D. (2001), Data Mining for Design and Manufacturing, Kluwer, Boston, MA. CHAPLEAU, Robert (1986). Transit Network Analysis and Evaluation with a Totally Disaggregate Approach, Selected proceedings of the World Conference on Transportation Research, Vancouver. CLARKE, Roger (2001). Person location and person tracking: Technologies, risks and policy implications, Information Technology & People, 14 (2), 2001, pp. 206-231. CIRRELT-2007-42 11

CNIL Commission nationale de l informatique et des libertés (2003). Recommandation relative à la collecte et au traitement d'informations nominatives par les sociétés de transports collectifs dans le cadre d applications billettiques, CNIL, Délibération N 03-038. LEBART, L., MORINEAU, A., PIRON, M., (1998), Statistique exploratoire multidimensionnelle, Ed. Dunod, 2000. RAKOTOMALALA, R. (2005), "TANAGRA : un logiciel gratuit pour l'enseignement et la recherche", in Actes de EGC'2005, RNTI-E-3, vol. 2, pp.697-702. TRÉPANIER, M., BARJ, S., DUFOUR, C., POILPRÉ, R. (2004), Examen des potentialités d'analyse des données d'un système de paiement par carte à puce en transport urbain, Exposé préparé pour la séance sur "Utilisation des systèmes de transport intelligents (STI) à l'appui de la gestion de la circulation" du congrès annuel de 2004 de l'association des transports du Canada à Québec (Québec), p.4, 10-14. WESTPHAL, C. and BLAXTON, T. (1998), Data Mining Solutions, John Wiley, New York. AGARD, B. and KUSIAK, A. (2004) Data-Mining Based Methodology for the Design of Product Families, International Journal of Production Research, Vol. 42, No. 15, pp. 2955-2969. C. Da CUNHA, C., AGARD, B., and KUSIAK, A. (2005), Improving manufacturing quality by resequencing assembly operations: a data-mining approach, 18th International Conference on Production Research ICPR 18, University of Salerno, Fisciamo, Italy, July 31 August 4. 7. ACKNOWLEDGEMENTS The authors wish to acknowledge the support and collaboration of the Société de transport de l Outaouais, who gracefully provided data for this study. The research project is also supported by the Natural Science and Engineering Research Council of Canada (NSERC), and by the Agence métropolitaine de transport of Montreal (AMT). CIRRELT-2007-42 12