A hidden Markov model for criminal behaviour classification



RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University of Florence, Italy.

Background

Analysis of criminal behaviour:
- we want to model offending patterns while taking into account the nature of offending and the sequence of offence types;
- criminal histories are recorded as official histories in the England and Wales Offenders Index, a court-based record of the criminal histories of all offenders in England and Wales from 1963 to the current day;
- general population sample of $n = 5{,}470$ individuals paroled from the cohort of those born in 1953, and followed through to 1993;
- offences are combined into $J = 10$ major categories described in the Offenders Index Codebook (1998);
- following Francis et al. (2004) we define $T = 6$ time windows or age strips: 10-15, 16-20, 21-25, 26-30, 31-35.

Univariate latent Markov model

Used by Bijleveld and Mooijaart (2003):
- the offending pattern of a subject within age strip $t$, $t = 1, \ldots, T$, is represented by a single discrete random variable $X_t$;
- $\{X_t\}$ depends only on a random process $\{C_t\}$;
- $\{C_t\}$ follows a first-order homogeneous Markov chain with $k$ states, initial probabilities $\pi_c$ and transition probabilities $\pi_{c_1 c_2}$;
- the joint distribution of $\{X_t\}$ may be expressed as

$$p(X_1 = x_1, \ldots, X_T = x_T) = \sum_{c_1} \phi_{x_1 \mid c_1} \pi_{c_1} \sum_{c_2} \phi_{x_2 \mid c_2} \pi_{c_1 c_2} \cdots \sum_{c_T} \phi_{x_T \mid c_T} \pi_{c_{T-1} c_T},$$

where $\phi_{x \mid c} = p(X_t = x \mid C_t = c)$.

Multivariate extension

- $X_{tj}$ is a binary random variable equal to 1 if the subject is convicted for an offence of type $j$ within age strip $t$ and to 0 otherwise;
- we assume local independence, i.e. that for $t = 1, \ldots, T$ the variables $X_{tj}$, $j = 1, \ldots, J$, are conditionally independent given $C_t$:

$$\phi_{x \mid c} = p(X_t = x \mid C_t = c) = \prod_{j=1}^{J} \lambda_{j \mid c}^{x_j} (1 - \lambda_{j \mid c})^{1 - x_j},$$

where $\lambda_{j \mid c} = p(X_{tj} = 1 \mid C_t = c)$, $X_t = (X_{t1}, \ldots, X_{tJ})$ and $x_j$ denotes the $j$-th element of the vector $x$.
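The local independence formula can be illustrated with a minimal Python sketch. The function name `emission_prob` and the three conviction probabilities are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def emission_prob(x, lam_c):
    """p(X_t = x | C_t = c) for a binary vector x under local independence.

    x     : list of 0/1 indicators, one per offence type j
    lam_c : list of conviction probabilities lambda_{j|c} for one latent state c
    """
    prob = 1.0
    for x_j, lam in zip(x, lam_c):
        # Each offence type contributes lam if convicted, 1 - lam otherwise.
        prob *= lam if x_j == 1 else 1.0 - lam
    return prob


# Hypothetical conviction probabilities for J = 3 offence types in one state.
lam_c = [0.2, 0.05, 0.5]
p = emission_prob([1, 0, 1], lam_c)  # 0.2 * (1 - 0.05) * 0.5
```

Summing `emission_prob` over all $2^J$ binary patterns gives 1, which is a quick sanity check on any set of $\lambda_{j \mid c}$ values.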

Restricted version of the model (unidimensional Rasch)

We assume that for each type of offence

$$\operatorname{logit}(\lambda_{j \mid c}) = \alpha_c + \beta_j, \qquad (1)$$

where
- $\alpha_c$ is the tendency to commit crimes of a subject in latent class $c$ (i.e. an individual characteristic);
- $\beta_j$ is the ease of committing a crime of type $j$.

The constraint allows an appropriate labelling of the latent classes, which can be ordered so that

$$\lambda_{j \mid 1} \le \cdots \le \lambda_{j \mid k}, \qquad j = 1, \ldots, J.$$

This constraint is used to formulate a latent class version of the Rasch (1961) model, which is well known in the psychometric literature.
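As a small illustration of constraint (1), the sketch below builds the conviction probabilities from class and offence parameters. The values of `alpha` and `beta` are hypothetical, and `rasch_probs` is an illustrative name rather than anything defined in the slides.

```python
import math


def inv_logit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def rasch_probs(alpha, beta):
    """Return lam[c][j] = inv_logit(alpha[c] + beta[j])."""
    return [[inv_logit(a + b) for b in beta] for a in alpha]


# Hypothetical parameters: k = 3 latent classes, J = 2 offence types.
alpha = [0.0, 1.5, 3.0]   # increasing tendency to offend
beta = [-4.0, -2.0]       # offence-specific effects
lam = rasch_probs(alpha, beta)
```

Because the inverse logit is increasing, sorting the classes by $\alpha_c$ orders the $\lambda_{j \mid c}$ in the same way for every offence type $j$.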

Restricted version of the model (multidimensional Rasch)

The previous model assumes that every type of offence depends on the same latent trait, which may be too restrictive. We consider instead that the crimes may be partitioned into $s$ homogeneous subgroups, so that

$$\operatorname{logit}(\lambda_{j \mid c}) = \sum_{d=1}^{s} \delta_{jd} \alpha_{cd} + \beta_j, \qquad (2)$$

where
- $\alpha_{cd}$ is the tendency of a subject in latent class $c$ to commit crimes in subgroup $d$;
- $\delta_{jd}$ is equal to 1 if crime $j$ is in subgroup $d$ and to 0 otherwise.

We can thus classify the offences into groups such that crimes belonging to the same group share the same latent trait.
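Constraint (2) can be sketched in the same way. In this hypothetical example there are two latent classes, two subgroups and three offence types; the indicator matrix `delta` and all numeric values are made up for illustration.

```python
import math


def inv_logit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def multidim_rasch_probs(alpha, beta, delta):
    """lam[c][j] = inv_logit(sum_d delta[j][d] * alpha[c][d] + beta[j])."""
    lam = []
    for alpha_c in alpha:
        row = []
        for j, beta_j in enumerate(beta):
            # delta[j] selects the latent trait of the subgroup containing j.
            trait = sum(delta[j][d] * alpha_c[d] for d in range(len(alpha_c)))
            row.append(inv_logit(trait + beta_j))
        lam.append(row)
    return lam


# Hypothetical values: k = 2 classes, s = 2 subgroups, J = 3 offence types.
alpha = [[0.0, 0.0], [1.0, 3.0]]
beta = [-3.0, -2.0, -4.0]
delta = [[1, 0], [1, 0], [0, 1]]  # offences 1 and 2 in subgroup 1, offence 3 in subgroup 2
lam = multidim_rasch_probs(alpha, beta, delta)
```

With a single subgroup (`delta` a column of ones) the function reduces to the unidimensional model (1).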

Likelihood inference

The log-likelihood of the model for an observed cohort of $n$ subjects is

$$\ell(\theta) = \sum_{i=1}^{n} \log[L_i(\theta)],$$

where $\theta$ denotes all the parameters and $L_i(\theta)$ is the function $p(x_{i1}, \ldots, x_{iT})$ defined above, evaluated at $\theta$.

- $L_i(\theta)$ may be computed through the well-known recursions in the hidden Markov literature (see Levinson et al., 1983, and MacDonald and Zucchini, 1997, Sec. 2.2);
- $\ell(\theta)$ is maximized with the EM algorithm, which requires the log-likelihood of the complete data, $\ell^{*}(\theta)$.
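One common form of these recursions is the scaled forward recursion, sketched below for a single subject. This is a generic illustration under the local independence assumption, not necessarily the exact implementation used by the authors; the function names and the toy parameter values are hypothetical.

```python
import math


def emission_prob(x, lam_c):
    """p(X_t = x | C_t = c) under local independence."""
    prob = 1.0
    for x_j, lam in zip(x, lam_c):
        prob *= lam if x_j == 1 else 1.0 - lam
    return prob


def log_likelihood_one(xs, pi, Pi, lam):
    """log L_i(theta) for one subject via the scaled forward recursion.

    xs  : list of T binary vectors (one per age strip)
    pi  : initial probabilities, pi[c]
    Pi  : transition probabilities, Pi[c1][c2]
    lam : conviction probabilities, lam[c][j]
    """
    k = len(pi)
    loglik = 0.0
    forward = None
    for t, x in enumerate(xs):
        if t == 0:
            forward = [pi[c] * emission_prob(x, lam[c]) for c in range(k)]
        else:
            forward = [
                emission_prob(x, lam[c])
                * sum(forward[b] * Pi[b][c] for b in range(k))
                for c in range(k)
            ]
        # Rescale at each step to avoid numerical underflow.
        scale = sum(forward)
        loglik += math.log(scale)
        forward = [f / scale for f in forward]
    return loglik


# Hypothetical toy model: k = 2 states, J = 2 offence types, T = 3 age strips.
pi = [0.6, 0.4]
Pi = [[0.9, 0.1], [0.3, 0.7]]
lam = [[0.1, 0.05], [0.6, 0.4]]
ll = log_likelihood_one([[0, 0], [1, 0], [1, 1]], pi, Pi, lam)
```

The cohort log-likelihood $\ell(\theta)$ is then the sum of `log_likelihood_one` over the $n$ subjects. The cost per subject is of order $T k^2$, instead of the $k^T$ terms of the direct sum over all state sequences.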

The complete data log-likelihood may be expressed as

$$\ell^{*}(\theta) = \sum_{c} v_{1c} \log \pi_c + \sum_{c_1} \sum_{c_2} u_{c_1 c_2} \log \pi_{c_1 c_2} + \sum_{i} \sum_{t} \sum_{c} \sum_{j} v_{itc} \{ x_{itj} \log \lambda_{j \mid c} + (1 - x_{itj}) \log(1 - \lambda_{j \mid c}) \},$$

where $v_{itc}$ is a dummy variable, referring to the $i$-th subject, which is equal to 1 if $C_t = c$ and to 0 otherwise, $v_{tc} = \sum_i v_{itc}$, and $u_{c_1 c_2}$ is the number of transitions from the $c_1$-th to the $c_2$-th state.

EM algorithm

- E-step: computes the conditional expected value of $\ell^{*}(\theta)$, given the observed data and the current value of the parameters.
- M-step: updates the parameter estimates by maximizing the expected value of $\ell^{*}(\theta)$ computed in the E-step.

When the model is constrained (unidimensional or multidimensional Rasch), the parameters $\alpha_{cd}$ and $\beta_j$ are estimated by fitting to the data a logistic model with a suitable design matrix $Z$, defined according to the model of interest.
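For the unconstrained model, the M-step has a closed form. The updates below are the standard multinomial and Bernoulli maximizers of the complete data log-likelihood and are stated here as a sketch rather than as the authors' exact formulation. Hats denote the conditional expectations computed in the E-step, and $\lambda_{j \mid c}$ denotes the conviction probability for offence type $j$ in state $c$:

$$\pi_c = \frac{\hat v_{1c}}{n}, \qquad \pi_{c_1 c_2} = \frac{\hat u_{c_1 c_2}}{\sum_{c} \hat u_{c_1 c}}, \qquad \lambda_{j \mid c} = \frac{\sum_{i} \sum_{t} \hat v_{itc}\, x_{itj}}{\sum_{i} \sum_{t} \hat v_{itc}}.$$

Each update is a ratio of expected counts: expected subjects starting in state $c$, expected transitions out of $c_1$ that end in $c_2$, and expected convictions for offence $j$ among the subject-periods spent in state $c$.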

Choice of the number of classes ($k$)

The optimal number of latent classes can be chosen
- with the likelihood ratio between the model with $k$ states and that with $k + 1$ states,

$$D_k = -2(\hat\ell_k - \hat\ell_{k+1}),$$

for increasing values of $k$; or
- using the Bayesian Information Criterion (Kass and Raftery, 1995), defined as

$$\mathrm{BIC}_k = -2\hat\ell_k + r_k \log(n),$$

where $r_k$ is the number of parameters in the model with $k$ states. According to this strategy, the optimal number of states is the one for which $\mathrm{BIC}_k$ is minimum.

Choice of the number of latent traits

The crimes are clustered using a hierarchical algorithm. At each step the algorithm aggregates the two clusters of crimes that are closest in terms of the deviance between the model fitted at the previous step and the multidimensional Rasch model fitted after the aggregation of the two clusters. The steps are iterated until the BIC of the resulting model is lower than that of the unconstrained model. The algorithm stops when all the items are grouped together.
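The greedy aggregation can be sketched as follows. This is only an outline of the search, assuming a user-supplied `deviance` callback; in the actual procedure that callback would fit the multidimensional Rasch model for the candidate partition, whereas here a toy function on hypothetical numeric items stands in for it.

```python
from itertools import combinations


def agglomerate(items, deviance):
    """Greedy agglomeration of items into clusters.

    At each step, merge the pair of clusters whose merged partition has the
    smallest value of deviance(partition). Returns the list of partitions
    visited, from all singletons down to a single cluster.
    """
    partition = [frozenset([item]) for item in items]
    path = [list(partition)]
    while len(partition) > 1:
        best = None
        for a, b in combinations(partition, 2):
            candidate = [c for c in partition if c not in (a, b)] + [a | b]
            score = deviance(candidate)
            if best is None or score < best[0]:
                best = (score, candidate)
        partition = best[1]
        path.append(list(partition))
    return path


# Toy stand-in for the model deviance: total within-cluster range.
def toy_deviance(partition):
    return sum(max(c) - min(c) for c in partition)


path = agglomerate([0.0, 0.1, 5.0, 5.2], toy_deviance)
```

A model selection criterion such as the BIC can then be evaluated on each partition in `path` to pick the preferred number of latent traits.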

An application

We applied the model to a sample of $n = 5{,}470$ males taken from the dataset illustrated above; we used the estimated number of live births in the cohort year 1953 as reported by Prime et al. (2001). For a number of classes between 1 and 7 we obtain

| $k$ | $\hat\ell_k$ | $r_k$ | $\mathrm{BIC}_k$ |
|---|---|---|---|
| 1 | -21,341 | 10 | 42,768 |
| 2 | -20,076 | 23 | 40,349 |
| 3 | -19,643 | 38 | 39,612 |
| 4 | -19,284 | 55 | 39,041 |
| 5 | -19,142 | 74 | 38,921 |
| 6 | -19,086 | 95 | 38,990 |
| 7 | -19,010 | 118 | 39,036 |

We choose $k = 5$ states, as this gives the smallest BIC.
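The BIC column can be re-derived from the other columns as a consistency check. The short sketch below assumes that the log-likelihoods are negative (as they must be for discrete data, so any missing minus signs are taken to be lost in transcription) and that $n = 5{,}470$ enters the penalty term.

```python
import math

n = 5470

# k -> (maximized log-likelihood, number of parameters), from the table.
table = {
    1: (-21341, 10),
    2: (-20076, 23),
    3: (-19643, 38),
    4: (-19284, 55),
    5: (-19142, 74),
    6: (-19086, 95),
    7: (-19010, 118),
}


def bic(loglik, n_params, n_obs):
    """BIC = -2 * log-likelihood + number of parameters * log(sample size)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


bics = {k: bic(loglik, r, n) for k, (loglik, r) in table.items()}
best_k = min(bics, key=bics.get)
```

The recomputed values agree with the reported BIC column up to the rounding of the log-likelihoods, and the minimum is attained at five states.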

Choice of the clusters

Using the hierarchical algorithm, the best fit (BIC = 35,433) was obtained for the following aggregation into clusters of the 10 types of crime, shown with the estimates of the $\beta_j$.

| Offence category ($j$) | Latent trait (1, 2 or 3) | $\beta_j$ |
|---|---|---|
| Violence against the person | X | 5.824 |
| Sexual offences | X | 7.787 |
| Burglary | X | 7.004 |
| Robbery | X | 10.212 |
| Theft and handling stolen goods | X | 5.375 |
| Fraud and Forgery | X | 6.473 |
| Criminal Damage | X | 5.890 |
| Drug Offences | X | 6.720 |
| Motoring Offences | X | 8.170 |
| Other offences | X | 7.493 |

Estimated $\alpha$ parameters

Values of the estimated tendencies of the subject for each latent state $c$ in every subgroup:

| $c$ | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.000 |
| 2 | 0.134 | 2.860 | 9.513 |
| 3 | 3.315 | 7.100 | 6.192 |
| 4 | 3.831 | 4.445 | 5.02 |
| 5 | 5.283 | 6.990 | 7.439 |

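Estimate of $\pi$ and $\Pi$

Initial probabilities $\pi_c$:

| $\pi_1$ | $\pi_2$ | $\pi_3$ | $\pi_4$ | $\pi_5$ |
|---|---|---|---|---|
| 0.393 | 0.552 | 0.054 | 0.000 | 0.000 |

The transition probabilities of the Markov chain, from state $c$ (rows) to each state (columns), are the following:

| $c$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0.996 | 0.000 | 0.000 | 0.003 | 0.000 |
| 2 | 0.364 | 0.375 | 0.010 | 0.226 | 0.024 |
| 3 | 0.000 | 0.241 | 0.288 | 0.172 | 0.300 |
| 4 | 0.555 | 0.012 | 0.000 | 0.429 | 0.005 |
| 5 | 0.000 | 0.071 | 0.014 | 0.445 | 0.470 |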
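Because the chain is homogeneous, the estimated initial and transition probabilities determine the marginal distribution over the latent states in every age strip. The sketch below propagates the initial distribution through the estimated transition matrix; the helper name `step` is illustrative, and the choice of five transitions assumes the $T = 6$ age strips defined earlier.

```python
# Estimated initial probabilities and transition matrix, as reported above.
pi = [0.393, 0.552, 0.054, 0.000, 0.000]
Pi = [
    [0.996, 0.000, 0.000, 0.003, 0.000],
    [0.364, 0.375, 0.010, 0.226, 0.024],
    [0.000, 0.241, 0.288, 0.172, 0.300],
    [0.555, 0.012, 0.000, 0.429, 0.005],
    [0.000, 0.071, 0.014, 0.445, 0.470],
]


def step(dist, Pi):
    """One transition: new[b] = sum_a dist[a] * Pi[a][b]."""
    k = len(dist)
    return [sum(dist[a] * Pi[a][b] for a in range(k)) for b in range(k)]


# Marginal state distribution in each of the T = 6 age strips.
marginals = [pi]
for _ in range(5):
    marginals.append(step(marginals[-1], Pi))
```

Since the reported probabilities are rounded to three decimals, each distribution in `marginals` sums to 1 only approximately.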

Advantages of the proposed methodology

- We achieve a parsimonious description of the dynamic process underlying the data;
- the approach is based on a general population sample and not on an offender-based sample as in other studies;
- it allows a wide choice of models to be estimated, ranging from the simple latent class model to the constrained model with subgroups, and the best one to be chosen;
- it can provide important information for policy, such as incarceration or incapacitation policy against offenders.

Future extensions

- Constrain the probabilities $\lambda_{j \mid c}$ to be equal to 0 for one latent class, so that this class may be identified as that of non-offenders;
- consider models in which the transition probabilities may vary with age (non-homogeneous Markov chains);
- consider restricted models in which the transition matrix has a particular structure (e.g. triangular, symmetric);
- include explanatory variables, such as gender or race, in the model.

References

Bijleveld, C. J. H. and Mooijaart, A. (2003). Latent Markov Modelling of Recidivism Data. Statistica Neerlandica, 57, 3, 305-320.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., B, 39, 1-38.

Feng, Z. and McCulloch, C. E. (1996). Using Bootstrap Likelihood Ratios in Finite Mixture Models. J. R. Statist. Soc., B, 58, 3, 609-617.

Francis, B., Soothill, K. and Fligelstone, R. (2004). Identifying Patterns and Pathways of Offending Behaviour: A New Approach to Typologies of Crime. European Journal of Criminology, 1, 47-87.

Kass, R. E. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773-795.

Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.

Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62, 1035-74.

Lindsay, B., Clogg, C. and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 86, 96-107.

McCutcheon, A. L. and Thomas, G. (1995). Patterns of drug use among white institutionalized delinquents in Georgia. Evidence from a latent class analysis. Journal of Drug Education, 25, 61-71.

MacDonald, I. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. London: Chapman & Hall.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: John Wiley.

Research Development and Statistics Directorate (1998). Offenders Index Codebook. London: Home Office. Available at http://homeoffice.gov.uk/rds/pdfs/oicodes.pdf.

Prime, J., White, S., Liriano, S. and Patel, K. (2001). Criminal careers of those born between 1953 and 1978. Statistical Bulletin 4/01. London: Home Office.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the IV Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-333.

Wiggins, L. M. (1973). Panel Analysis: Latent Probability Models for Attitudes and Behavior Processes. Amsterdam: Elsevier.