Sequential Data Analysis: Issues With Sequential Data


Sequential Data Analysis
Issues With Sequential Data

Gilbert Ritschard, Alexis Gabadinho, Matthias Studer
Institute for Demographic and Life Course Studies, University of Geneva, and NCCR LIVES: Overcoming vulnerability, life course perspectives
September - November,

G. Ritschard (2012), 1/44. Distributed under licence CC BY-NC-ND 3.0

Outline
[...] State codings; 4 Weights; 5 Data size; 6 Conclusion

Missing data in sequences: coding the missing states

How shall we handle missing values?

Missing values in the expanded (STS) form of a sequence occur, for example, when:
- sequences do not start on the same date while a calendar time axis is used;
- the follow-up time is shorter for some individuals than for others, so that sequences do not end at the same position;
- the observation at some positions is missing due to nonresponse, which leaves internal gaps in the sequences.

Handling may differ for each of these situations:
- Different start times: keep the starting missing values to preserve alignment across sequences, or left-align the sequences by switching to a process time axis.
- Different end times: the ending missing states can simply be ignored.
- Information missing due to nonresponse: add an explicit nonresponse state to the alphabet, or keep the missing values to preserve alignment.

Coding the missing states: left, gaps and right missing states

To allow such differentiated treatments, TraMineR distinguishes left, in-between and right missing values.
Use the left, gaps and right arguments of seqdef() to specify how each type of missing value should be encoded.
By default, gaps and left-missing states are coded as NA, while all missing values encountered after the last valid (rightmost) state in a sequence are considered void elements (right="DEL"); i.e., the sequence is considered to end after the last valid state.

Incomplete sequences (sequences with missing states) are more the rule than the exception.
Unlike Event History Analysis (survival analysis), which can handle censored data, sequence analysis has no universal, elegant way of handling censored data.

Strategies in presence of incomplete sequences

What can we do in presence of incomplete sequences?
- Delete all incomplete sequences.
- Delete sequences with more than an acceptable number of missing states.
- Consider the NA state as an element of the alphabet.
- Impute some missing states. Mild assumptions often make it possible to infer the value of some missing states. For example, we can assume that people living with both parents at age 20 have lived with them since birth.
- Use a mix of the previous solutions.

Reliability of analysis with incomplete sequences

When states are missing at random, the global picture given by the sequences remains satisfactory whatever the handling strategy for the missing states.
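To make the left, gaps and right arguments of seqdef() more concrete, here is a minimal sketch on a hypothetical three-case toy data set (the data frame toy and the object names are assumptions made for illustration, not part of the slides). It only contrasts the default coding with two alternatives.

```r
library(TraMineR)

# Hypothetical toy data in STS form: one row per case, one column per position.
toy <- data.frame(
  t1 = c(NA,  "A", "A"),   # case 1 starts late (left missing)
  t2 = c("A", NA,  "B"),   # case 2 has an internal gap
  t3 = c("B", "B", NA),    # case 3 ends early (right missing)
  t4 = c("B", "B", NA),
  stringsAsFactors = FALSE
)

# Defaults: left = NA, gaps = NA, right = "DEL".
s.default <- seqdef(toy, alphabet = c("A", "B"))

# Keep right-missing positions as missing states instead of void elements.
s.keep <- seqdef(toy, alphabet = c("A", "B"), right = NA)

# Delete left-missing positions, which shifts case 1 to its first valid state.
s.left <- seqdef(toy, alphabet = c("A", "B"), left = "DEL")

print(s.default)
seqlength(s.default)  # case 3 is shorter: its right-missing positions are void
seqlength(s.keep)     # all cases keep the full length
```

Comparing the printed objects shows how the same raw data lead to different sequence lengths and alignments depending on the chosen coding.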

Illustration: randomly turning states into NA in mvad

To illustrate, we randomly insert missing states into the mvad data:
1. Randomly select a proportion p of sequences to be modified.
2. In each selected sequence,
   - insert a random proportion < p_G of gaps,
   - set as missing a random proportion < p_L of states from the left,
   - set as missing a random proportion < p_R of states from the right.

For the next examples, we used p = .6, p_G = .2, p_L = .4, p_R = .5.

Missing states were introduced with seqgen.missing(), from TraMineRextras:

R> mvadm.seq <- seqgen.missing(mvad.seq, p.cases = 0.6, p.left = 0.4,
+    p.gaps = 0.2, p.right = 0.5, mt.gaps = "nr", mt.right = "nr")

Rendering with and without missing states: I-plot [figure]

Rendering with and without missing states: d-plot, with.missing=TRUE [figure]

Rendering with and without missing states: d-plot, with.missing=FALSE [figure]

A crucial point when analyzing state sequences is to choose a relevant time alignment:
- Calendar date: same start date for each sequence.
- Process time, i.e., time since an event of interest:
  - birth date (position defined by age);
  - date when starting to live with a partner, first childbirth, ...;
  - start of first job, first unemployment month, immigration date, ...

Loading the srh data

We illustrate with sequences of self-reported health from the SHP (30% sample data in srh30.rdata).

R> library(RColorBrewer)  # provides brewer.pal()
R> source(paste(scriptdir, "extractseqfromw.r", sep = ""))
R> load(paste(datadir, "srh30.rdata", sep = ""))
R> srh <- srh30
R> srh.shortlab <- c("B2", "B1", "M", "G1", "G2")
R> srh.longlab <- c("not well at all", "not very well", "so, so",
+    "well", "very well")
R> srh.alph <- c("not well at all", "not very well", "so, so (average)",
+    "well", "very well")
R> var <- getcolumnindex(srh, "P$$C01")
R> xtlab <- 1999:(1999 + length(var) - 1)  # one label per wave, starting in 1999
R> mycol5 <- brewer.pal(5, "RdYlGn")
R> srh.seq <- seqdef(srh[, var], right = NA, alphabet = srh.alph,
+    states = srh.shortlab, labels = srh.longlab, cnames = xtlab,
+    cpal = mycol5)
R> ## keep only cases with at least two valid states
R> x <- apply(is.na(srh[, var]), 1, sum)
R> sel <- (x < seqlength(srh.seq) - 1)
R> srh <- srh[sel, ]
R> srh.seq <- srh.seq[sel, ]

Illustration: Self-reported health, SHP 1999/2010. Sequences aligned on calendar year [figure]

Changing alignment

Changing alignment with seqstart() from TraMineRextras.

R> startyear <- 1999  # value lost in the transcript; 1999 assumed (first SHP wave used above)
R> birthyear <- srh$birthy
R> agesrh <- seqstart(srh[, var], data.start = startyear,
+    new.start = birthyear)
R> colnames(agesrh) <- 1:ncol(agesrh)
R> agesrh <- agesrh[, 10:90]
R> agesrh.seq <- seqdef(agesrh, alphabet = srh.alph, states = srh.shortlab,
+    labels = srh.longlab, cpal = mycol5, right = NA, xtstep = 10)

Illustration: Self-reported health, SHP 1999/2010. Sequences aligned on age [figure]

Illustration: Self-reported health, SHP 1999/2010. Sequences aligned on age, with ignored right missing positions, right="DEL" [figure]

Illustration: Self-reported health, SHP 1999/2010. Focus on people born between 1930 and 1934 [figure]
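As a rough illustration of what switching from calendar time to age involves, the following base R sketch realigns a hypothetical two-case data set by hand. It is not the seqstart() implementation; the objects srh.toy, birthyear and by.age are invented for the example, and age is assumed to be survey year minus birth year.

```r
# Hypothetical calendar-aligned data: columns are survey years 1999..2002.
srh.toy <- matrix(c("G1", "G1", "M",  "M",
                    "G2", "G2", "G2", "G1"),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, 1999:2002))
birthyear <- c(1950, 1948)

# Age of each case at each survey year (assumption: age = year - birth year).
years <- as.integer(colnames(srh.toy))
ages <- outer(birthyear, years, function(b, y) y - b)

# Empty age-indexed matrix covering every age observed in the data.
age.range <- min(ages):max(ages)
by.age <- matrix(NA_character_, nrow(srh.toy), length(age.range),
                 dimnames = list(NULL, age.range))

# Copy each observed state to the column of the corresponding age.
for (i in seq_len(nrow(srh.toy))) {
  by.age[i, as.character(ages[i, ])] <- srh.toy[i, ]
}
print(by.age)
```

The realigned matrix has leading and trailing NA for every case whose observed age span is narrower than the overall age range, which is why the choice of left and right coding matters after a change of alignment.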

Time granularity

Time granularity: density of state positions within a given time length.
- It is defined by the duration of the time unit used.
- Examples: year, quarter, month, week, day, hour, ...

We can switch from a fine granularity to a coarser one, but we cannot switch to a finer granularity than the one available in the data.

Change granularity with seqgranularity() from TraMineRextras:

R> mvadg.seq <- seqgranularity(mvad.seq, tspan = 12)

Changing time granularity of the mvad data: monthly vs yearly states [figure]

State codings

State codings: What is the optimal alphabet size?
- The larger the alphabet, the less clear the results.
- Similarly to time aggregation, we can also merge elements of the alphabet.
- Merging is useful when different states reflect similar situations. For example, in mvad, the distinction between further education (FE) and school (SC) is not so clear. Merging those categories improves the readability of the outcomes.
- Avoid merging dissimilar states. Do not hide useful distinctions such as full time versus part time.
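The effect of moving to a coarser granularity can be sketched in base R on a single hypothetical monthly sequence. The helper coarsen() and its pick argument are invented for this illustration and are not the seqgranularity() implementation; the point is only that the rule used to pick one state per block changes the resulting yearly sequence.

```r
# Hypothetical helper: keep one state per block of `tspan` positions.
coarsen <- function(states, tspan, pick = c("last", "first")) {
  pick <- match.arg(pick)
  n.blocks <- floor(length(states) / tspan)  # an incomplete trailing block is dropped
  idx <- if (pick == "last") {
    seq_len(n.blocks) * tspan                # positions 12, 24, 36, ...
  } else {
    (seq_len(n.blocks) - 1) * tspan + 1      # positions 1, 13, 25, ...
  }
  states[idx]
}

# 36 monthly states: 10 months of school, 14 of further education, 12 of employment.
monthly <- rep(c("SC", "FE", "EM"), times = c(10, 14, 12))

yearly.last  <- coarsen(monthly, tspan = 12, pick = "last")
yearly.first <- coarsen(monthly, tspan = 12, pick = "first")
print(yearly.last)
print(yearly.first)
```

Here the school spell disappears from the yearly sequence when the last state of each year is kept, which illustrates the information lost by aggregation.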

State codings: merging two states

Merging Further education with School in mvad:

R> mvadr.seq <- seqrecode(mvad.seq, recodes = list(fs = c("FE", "SC")))
R> seqdplot(mvadr.seq, group = mvad$gcse5eq, border = NA)

Merging Further education with School in mvad [figure]

Weights

- Weights serve to improve sample representativeness.
- Weights are also useful for reducing the sequence data size by retaining only unique sequences. Each weight then reflects the number of cases sharing the same unique sequence.
- In any case, when weights are present, they should be accounted for. In TraMineR, specify them with the weights= argument of seqdef().
- When assigned to the state sequence object, weights are automatically accounted for in produced plots, distributions, statistics, ...
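The idea of retaining only unique sequences, with weights equal to the number of cases sharing each sequence, can be sketched in base R as follows. The data frame seqs and the object names are hypothetical; cases are assumed to carry no prior survey weights (otherwise those weights would be summed rather than counted).

```r
# Hypothetical STS data: cases 1 and 4 share the same sequence.
seqs <- data.frame(
  t1 = c("A", "A", "B", "A"),
  t2 = c("A", "B", "B", "A"),
  t3 = c("B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# One string key per case, e.g. "A-A-B".
key <- do.call(paste, c(seqs, sep = "-"))

# Number of cases per distinct sequence.
counts <- table(key)

# Keep the first occurrence of each sequence and attach its frequency as weight.
first <- !duplicated(key)
uniq <- seqs[first, ]
uniq$weight <- as.vector(counts[key[first]])
print(uniq)
```

The reduced table could then be passed to seqdef() with the state columns as data and uniq$weight as the weights= argument, so that plots and statistics still reflect all original cases.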

Weights: results may be quite different

R> layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), heights = c(2, 1.3))
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, weighted = FALSE,
+    title = "Non weighted")
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, title = "Weighted")
R> seqlegend(mvad.seq, ncol = 2, position = "top")

Weights: which weights to use with panel data?

Each wave of a panel survey usually includes two weights:
- a transversal weight (representativeness of the current population);
- a longitudinal weight (representativeness of the initial population), which applies to full trajectories.

Which weights should be used for incomplete trajectories? For sequences over a subinterval of time? There is no evident solution.
Weights lose their meaning when cases are filtered out!
In SHP there are weights for cases for Sample I (1999) and for Sample I+II (2004). See [...]

Data size, scalability

Three types of size limitations:
- Number of sequences: no problem up to about [...]. The main problem is the matrix of pairwise dissimilarities!
- Sequence length: no problem up to a few hundred (300). In some functions, the default limit is set to 100 and should be increased.
- Size of alphabet: not too big a problem for computation, but rendering becomes difficult with more than, say, 20 elements. Default colors are available only for alphabets of at most 12 states.

Size limitations: What can we do?
- For the number of sequences: work on a representative sample of the sequences.
- For the sequence length: change the time granularity, or split the position (time) scale and work on subintervals.
- For the size of the alphabet: merge elements of the alphabet.
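To see why the matrix of pairwise dissimilarities is the main bottleneck, a back-of-the-envelope calculation helps. The sketch below assumes a full n x n matrix of 8-byte doubles, which is how base R stores a numeric matrix; the function name dist_matrix_gib is hypothetical, and the actual memory footprint of a given analysis may differ.

```r
# Memory needed for a full n x n matrix of doubles, in GiB.
dist_matrix_gib <- function(n, bytes = 8) {
  n^2 * bytes / 1024^3
}

n.seq <- c(1000, 10000, 50000)
data.frame(n = n.seq, GiB = round(dist_matrix_gib(n.seq), 3))

# Doubling the number of sequences multiplies the memory by four.
dist_matrix_gib(20000) / dist_matrix_gib(10000)
```

With 10,000 sequences the matrix already takes about 0.75 GiB, and the requirement grows quadratically, which is the rationale for working on a representative sample or on unique sequences.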

Conclusion

- There are many issues in sequence analysis.
- Solutions necessitate trade-offs:
  - losing sequences (cases) vs allowing for missing states;
  - losing sequences (cases) vs restricting time coverage; ...
- Sequence analysis provides a holistic view.
  - Cost: we cannot account for the most recent cohorts with yearly data.
  - For example, studying the life course until age 45 with the SHP biographical survey of 2002 means, if we want only complete trajectories, that the youngest people retained were born in 1957 (2002 - 45).
- The finer the granularity, the less constrained we are.

Thank you! Questions?
See you next week.