Sequential Data Analysis: Issues With Sequential Data


Gilbert Ritschard, Alexis Gabadinho, Matthias Studer
Institute for Demographic and Life Course Studies, University of Geneva, and NCCR LIVES: Overcoming vulnerability, life course perspectives
http://mephisto.unige.ch/traminer
September - November, 2012

G. Ritschard (2012). Distributed under licence CC BY-NC-ND 3.0

Outline: 1 ... 2 ... 3 State codings, 4 Weights, 5 Data size, 6 Conclusion

Missing data in sequences: coding the missing states

How shall we handle missing values?

Missing values in the expanded (STS) form of a sequence occur, for example, when:
- sequences do not start on the same date while a calendar time axis is used;
- the follow-up time is shorter for some individuals than for others, yielding sequences that do not end at the same position;
- the observation at some positions is missing due to nonresponse, yielding internal gaps in the sequences.

Handling may be different for each of the listed situations:
- In case of different start times, maintain the starting missing values to preserve alignment across sequences, or possibly left-align the sequences by switching to a process time axis.
- In case of different end times, the ending missing terms can simply be ignored.
- In case of information missing due to nonresponse, add an explicit nonresponse state to the alphabet, or maintain the missing values to preserve alignment.
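The three situations listed above differ only in where the missing positions sit relative to the valid states. As a minimal base R sketch (the function name classify_missing and the toy sequence are made up for illustration and are not part of TraMineR), each missing position can be labelled as left, gap or right:

```r
# Classify each missing position of one sequence in STS form
# (one state per position, NA = missing) as "left", "gap" or "right".
# Hypothetical helper written for illustration only.
classify_missing <- function(x) {
  valid <- which(!is.na(x))
  type <- rep(NA_character_, length(x))
  if (length(valid) == 0) {
    # No valid state at all: the distinction is undefined.
    # This sketch arbitrarily labels every position "right".
    type[] <- "right"
    return(type)
  }
  pos <- seq_along(x)
  miss <- is.na(x)
  type[miss & pos < min(valid)] <- "left"   # before the first valid state
  type[miss & pos > max(valid)] <- "right"  # after the last valid state
  type[miss & pos > min(valid) & pos < max(valid)] <- "gap"
  type
}

s <- c(NA, NA, "school", NA, "job", "job", NA)
missing_type <- classify_missing(s)
print(missing_type)
```

Valid positions keep an NA label in the result, so the output can be tabulated directly to count how many states are missing on each side.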

Coding left, gaps and right missing states

To allow such differentiated treatments, TraMineR distinguishes left, in-between and right missing values. Use the left, gaps and right arguments of seqdef() to specify how each type of missing value should be encoded. By default, gaps and left-missing states are coded as NA, while all missing values encountered after the last valid (rightmost) state in a sequence are considered void elements (right="DEL"); i.e., the sequence is considered to end after the last valid state.

Incomplete sequences (sequences with missing states) are more the rule than the exception. Unlike event history analysis (survival analysis), which can handle censored data, sequence analysis has no universal, elegant way of handling censored data.

Strategies in the presence of incomplete sequences

What can we do in the presence of incomplete sequences?
- Delete all incomplete sequences.
- Delete sequences with more than an acceptable number of missing states.
- Consider the NA state as an element of the alphabet.
- Impute some missing states. Assumptions that are not too restrictive often make it possible to guess the value of some missing states. For example, we can assume that people living with both parents at age 20 have lived with them since birth.
- Use a mix of the previous solutions.

Reliability of analysis with incomplete sequences

When states are missing at random, the global picture given by the sequences remains satisfactory whatever the handling strategy for the missing states.
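Two of the strategies above, dropping sequences with too many missing states and imputing left-missing states under a simple assumption, can be sketched in base R. Everything here is illustrative: the toy data, the threshold of two missing states, the state codes "P" (living with both parents) and "A" (away), and the helper backfill_left are assumptions, not TraMineR functions.

```r
# Toy sequences in STS form: one row per case, one column per position.
seqs <- rbind(
  c("P", "P", NA,  "P", "A"),
  c(NA,  NA,  NA,  "P", "A"),
  c("P", NA,  NA,  NA,  NA)
)

# Strategy: delete sequences with more than an acceptable number
# of missing states (here, an arbitrary threshold of 2).
n_na <- rowSums(is.na(seqs))
kept <- seqs[n_na <= 2, , drop = FALSE]

# Strategy: impute left-missing states when the first valid state
# equals `state`, i.e. assume the person was in that state from the start.
backfill_left <- function(x, state) {
  valid <- which(!is.na(x))
  if (length(valid) > 0 && valid[1] > 1 && x[valid[1]] == state) {
    x[seq_len(valid[1] - 1)] <- state
  }
  x
}
# apply() returns one column per row of `seqs`, hence the transpose.
imputed <- t(apply(seqs, 1, backfill_left, state = "P"))
print(kept)
print(imputed)
```

Only left-missing states are filled in; gaps and right-missing states are left untouched, since the assumption says nothing about them.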

Illustration: randomly turning states into NA in mvad

To illustrate, we randomly insert missing states into the mvad data:
1. Randomly select a proportion p of sequences to be modified.
2. In each selected sequence,
   - insert a random proportion < p_G of gaps,
   - set as missing a random proportion < p_L of states from the left,
   - set as missing a random proportion < p_R of states from the right.

For the next examples, we used p = 0.6, p_G = 0.2, p_L = 0.4, p_R = 0.5.

Missing states were introduced with seqgen.missing() from TraMineRextras:

R> mvadm.seq <- seqgen.missing(mvad.seq, p.cases = 0.6, p.left = 0.4,
     p.gaps = 0.2, p.right = 0.5, mt.gaps = "nr", mt.right = "nr")

[Figure: I-plot, rendering with and without missing states]
[Figure: d-plot, rendering with missing states, with.missing=TRUE]
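To make the two-step procedure concrete, here is a rough base R sketch that follows the description given in the text. It is not the implementation of seqgen.missing(): the function insert_missing, its arguments and the way proportions are drawn are assumptions made for illustration.

```r
# Randomly turn states into NA in a matrix of sequences (rows = cases).
# Hypothetical helper; the draws below are one possible reading of
# "a random proportion < p" and may differ from seqgen.missing().
insert_missing <- function(seqs, p_cases, p_left, p_gaps, p_right) {
  n <- nrow(seqs)
  len <- ncol(seqs)
  # Step 1: select a proportion p_cases of the sequences.
  chosen <- sample.int(n, size = round(p_cases * n))
  for (i in chosen) {
    # Step 2: draw the number of positions to blank on each side.
    n_left  <- floor(runif(1, 0, p_left)  * len)
    n_right <- floor(runif(1, 0, p_right) * len)
    n_gaps  <- floor(runif(1, 0, p_gaps)  * len)
    if (n_left > 0)  seqs[i, seq_len(n_left)] <- NA
    if (n_right > 0) seqs[i, (len - n_right + 1):len] <- NA
    # Gaps are drawn among the positions that are still valid.
    interior <- if (n_left + n_right < len) (n_left + 1):(len - n_right) else integer(0)
    n_gaps <- min(n_gaps, length(interior))
    if (n_gaps > 0) {
      seqs[i, interior[sample.int(length(interior), n_gaps)]] <- NA
    }
  }
  seqs
}

set.seed(42)
toy <- matrix("A", nrow = 50, ncol = 20)
toy_missing <- insert_missing(toy, p_cases = 0.6, p_left = 0.4,
                              p_gaps = 0.2, p_right = 0.5)
print(table(rowSums(is.na(toy_missing))))
```

A gap drawn next to the left or right missing block simply extends that block, which is acceptable for a sketch but worth knowing when reading the resulting counts.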

[Figure: d-plot, rendering without missing states, with.missing=FALSE]

A crucial point when analyzing state sequences is to choose a relevant time alignment:
- Calendar date: same start date for each sequence.
- Process time, i.e., time since an event of interest:
  - birth date (position defined by age);
  - date when starting to live with a partner, first childbirth, ...;
  - start of first job, first unemployment month, immigration date, ...

Loading the srh data

We illustrate with sequences of self-reported health from the SHP (30% sample data in srh30.rdata).

R> source(paste(scriptdir, "extractseqfromw.r", sep = ""))
R> load(paste(datadir, "srh30.rdata", sep = ""))
R> srh <- srh30
R> srh.shortlab <- c("b2", "B1", "M", "G1", "G2")
R> srh.longlab <- c("not well at all", "not very well", "so, so", "well", "very well")
R> srh.alph <- c("not well at all", "not very well", "so, so (average)", "well", "very well")
R> var <- getcolumnindex(srh, "P$$C01")
R> xtlab <- 1999:(1999 + length(var) - 1)
R> mycol5 <- brewer.pal(5, "RdYlGn")
R> srh.seq <- seqdef(srh[, var], right = NA, alphabet = srh.alph,
     states = srh.shortlab, labels = srh.longlab, cnames = xtlab, cpal = mycol5)
R> x <- apply(is.na(srh[, var]), 1, sum)
R> sel <- (x < seqlength(srh.seq) - 1)
R> srh <- srh[sel, ]
R> srh.seq <- srh.seq[sel, ]

[Figure: Self-reported health, SHP 1999/2010. Sequences aligned on calendar year]

Changing alignment

Changing alignment with seqstart() from TraMineRextras:

R> startyear <- 1999
R> birthyear <- srh$birthy
R> agesrh <- seqstart(srh[, var], data.start = startyear, new.start = birthyear)
R> colnames(agesrh) <- 1:ncol(agesrh)
R> agesrh <- agesrh[, 10:90]
R> agesrh.seq <- seqdef(agesrh, alphabet = srh.alph, states = srh.shortlab,
     labels = srh.longlab, cpal = mycol5, right = NA, xtstep = 10)

[Figure: Self-reported health, SHP 1999/2010. Sequences aligned on age]
[Figure: Self-reported health, SHP 1999/2010. Sequences aligned on age, with ignored right missing positions, right="DEL"]
[Figure: Self-reported health, SHP 1999/2010. Focus on people born between 1930 and 1934]
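The idea behind switching from a calendar axis to an age axis can be sketched in base R. This is not the code of seqstart(): the helper realign_on_age, the toy data and the convention age = calendar year minus birth year are assumptions, and the column numbering may differ by one from what seqstart() produces.

```r
# Re-align calendar-time sequences on age.
# seqs: matrix with one column per calendar year, starting at data_start.
# birth_year: one birth year per row; ages: the ages to keep as columns.
realign_on_age <- function(seqs, data_start, birth_year, ages) {
  out <- matrix(NA_character_, nrow = nrow(seqs), ncol = length(ages),
                dimnames = list(NULL, as.character(ages)))
  years <- data_start + seq_len(ncol(seqs)) - 1
  for (i in seq_len(nrow(seqs))) {
    age <- years - birth_year[i]   # assumed convention for age
    ok <- age %in% ages            # keep only ages inside the target window
    out[i, as.character(age[ok])] <- seqs[i, ok]
  }
  out
}

# Two toy cases observed in 1999, 2000 and 2001.
health <- rbind(c("G1", "G1", "M"),
                c("G2", "M",  "M"))
aligned <- realign_on_age(health, data_start = 1999,
                          birth_year = c(1980, 1981), ages = 18:21)
print(aligned)
```

The realigned matrix has missing values on the left and on the right wherever a case was not observed at a given age, which is why the choice of left and right coding matters after a change of alignment.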

Time granularity

Time granularity: density of state positions within a given time length.
- Defined by the duration of the unit of time used.
- Examples: year, quarter, month, week, day, hour, ...

We can switch from a fine granularity to a coarser one, but we cannot switch to a finer granularity than the one available in the data.

Changing time granularity of the mvad data

Change granularity with seqgranularity() from TraMineRextras:

R> mvadg.seq <- seqgranularity(mvad.seq, tspan = 12)

[Figure: mvad data, monthly vs yearly states]

State codings

What is the optimal alphabet size? The larger the alphabet, the less clear the results.

Similarly to time aggregation, we can also merge elements of the alphabet.
- This is useful when different states reflect similar situations. For example, in mvad the distinction between further education (FE) and school (SC) is not so clear. Merging those categories improves the readability of the outcomes.
- Avoid merging dissimilar states. Do not hide useful distinctions such as full time and part time.
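Both kinds of aggregation discussed above, coarser time units and merged states, amount to simple operations on the matrix of states. The base R sketch below is illustrative only: coarsen_last and merge_states are made-up helpers, keeping the last state of each span is just one possible aggregation rule, and "FS" is an arbitrary code for the merged state.

```r
# Coarsen time granularity by keeping the last state of each span of
# `tspan` positions; an incomplete trailing span is dropped.
coarsen_last <- function(seqs, tspan) {
  n_new <- ncol(seqs) %/% tspan
  seqs[, seq_len(n_new) * tspan, drop = FALSE]
}

# Merge several states of the alphabet into a single new state.
merge_states <- function(seqs, from, to) {
  seqs[seqs %in% from] <- to   # NA positions are left untouched
  seqs
}

# One toy monthly sequence over 24 months:
# 10 months of school, 8 of further education, 6 of employment.
monthly <- matrix(c(rep("SC", 10), rep("FE", 8), rep("EM", 6)), nrow = 1)

yearly <- coarsen_last(monthly, tspan = 12)
merged <- merge_states(monthly, from = c("FE", "SC"), to = "FS")
print(yearly)
print(table(merged))
```

With the "last state" rule the ten months of school disappear from the yearly sequence, which illustrates the information lost when moving to a coarser granularity.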

Merging two states: merging Further education with School in mvad

R> mvadr.seq <- seqrecode(mvad.seq, recodes = list(FS = c("FE", "SC")))
R> seqdplot(mvadr.seq, group = mvad$gcse5eq, border = NA)

[Figure: d-plot of mvad by gcse5eq, with Further education and School merged]

Weights

- Weights serve to improve sample representativeness.
- Weights are also useful for reducing the size of the sequence data by retaining only unique sequences; the weight then reflects the number of cases sharing the same unique sequence.

In any case, when weights are present, they should be accounted for.
- In TraMineR, use the weights= argument of seqdef().
- When assigned to the state sequence object, weights are automatically accounted for in produced plots, distributions, statistics, ...
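The use of weights to shrink the data can be illustrated with a few lines of base R. The toy sequences and variable names are assumptions made for this sketch; the resulting counts are the kind of vector that would then be supplied as weights.

```r
# Five toy sequences, three of which are identical.
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("A", "A", "B"),
  c("B", "B", "B")
)

# One key per sequence, then keep each distinct sequence once.
keys <- apply(seqs, 1, paste, collapse = "-")
first <- !duplicated(keys)
uniq <- seqs[first, , drop = FALSE]

# Weight = number of cases sharing the same unique sequence.
w <- as.vector(table(keys)[keys[first]])

# The weighted state distribution on the reduced data matches the
# unweighted distribution on the full data (checked here at position 2).
full_dist <- table(seqs[, 2])
weighted_dist <- tapply(w, uniq[, 2], sum)
print(full_dist)
print(weighted_dist)
```

The reduction only pays off when many cases share the same sequence, which is more likely with short sequences and small alphabets.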

Results may be quite different

R> layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), heights = c(2, 1.3))
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, weighted = FALSE,
     title = "Non weighted")
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, title = "Weighted")
R> seqlegend(mvad.seq, ncol = 2, position = "top")

Which weights to use with panel data?

Each wave of a panel survey usually includes two weights:
- a transversal weight (representativeness of the current population);
- a longitudinal weight (representativeness of the initial population), which applies to full trajectories.

Which weights should be used for incomplete trajectories? For sequences over a subinterval of time? There is no evident solution. Weights lose their meaning when cases are filtered out!

In SHP there are weights for cases for Sample I (1999) and for Sample I+II (2004). See http://www.swisspanel.ch/img/pdf/user_guide_e_short.pdf

Data size, scalability

Three types of size limitations:
- Number of sequences: no problem up to about 10 000. The main problem is the matrix of pairwise dissimilarities!
- Sequence length: no problem up to a few hundred (300). In some functions, the default limit is set to 100 and should be increased.
- Size of alphabet: not too big a problem for computation, but rendering becomes difficult with more than, say, 20 elements. Default colors are available only for |A| <= 12.

What can we do?
- For the number of sequences: work on a representative sample of the sequences.
- For sequence length: change the time granularity, or split the position (time) scale and work on subintervals.
- For the size of the alphabet: merge elements of the alphabet.
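Why the pairwise dissimilarity matrix is the bottleneck becomes clear from a quick count. The sketch below assumes each dissimilarity is stored as an 8-byte double, which is how R stores numeric values; actual memory use during computation may be higher.

```r
# Memory needed for the dissimilarities between n sequences.
n <- 10000

n_pairs <- n * (n - 1) / 2        # number of distinct pairs of sequences
full_gb <- n^2 * 8 / 1e9          # full n x n matrix of doubles, in GB
half_gb <- n_pairs * 8 / 1e9      # lower triangle only, in GB

cat("pairs:", n_pairs, "\n")
cat("full matrix (GB):", full_gb, "\n")
cat("lower triangle (GB):", round(half_gb, 2), "\n")

# The cost grows quadratically: 10 times more sequences
# require 100 times more memory for the full matrix.
full_gb_100k <- (10 * n)^2 * 8 / 1e9
cat("full matrix for 100 000 sequences (GB):", full_gb_100k, "\n")
```

This quadratic growth is the reason for working on a representative sample, or on unique sequences with weights, once the number of sequences becomes large.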

Conclusion

Many issues in sequence analysis. Solutions necessitate trade-offs:
- losing sequences (cases) vs allowing for missing states;
- losing sequences (cases) vs restricting time coverage;
- ...

Holistic view provided by sequence analysis. Cost: we cannot account for the most recent cohorts with yearly data. For example, studying the life course until age 45 with the SHP biographical survey of 2002 means, if we want only complete trajectories, that the youngest people included were born in 1957.

The finer the granularity, the less constrained we are.

Thank you! Questions? See you next week.