Foundations and Trends® in Computer Graphics and Vision, Vol. 1, No. 2/3 (2005)
© 2006 D.A. Forsyth, O. Arikan, L. Ikemoto, J. O'Brien, D. Ramanan

Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis

David A. Forsyth (University of Illinois Urbana-Champaign), Okan Arikan (University of Texas at Austin), Leslie Ikemoto (University of California, Berkeley), James O'Brien (University of California, Berkeley) and Deva Ramanan (Toyota Technological Institute at Chicago)

Abstract

We review methods for kinematic tracking of the human body in video. The review is part of a projected book that is intended to cross-fertilize ideas about motion representation between the animation and computer vision communities. The review confines itself to the earlier stages of motion, focusing on tracking and motion synthesis; future material will cover activity representation and motion generation. In general, we take the position that tracking does not necessarily involve (as is usually thought) complex multimodal inference problems. Instead, there are two key problems, both easy to state. The first is lifting, where one must infer the configuration of the body in three dimensions from image data. Ambiguities in lifting can result in multimodal inference problems, and we review what little is known about the extent to which a lift is ambiguous. The second is data association, where one must determine which pixels in an image
come from the body. We see a tracking-by-detection approach as the most productive, and review various human detection methods. Lifting, and a variety of other problems, can be simplified by observing temporal structure in motion, and we review the literature on data-driven human animation to expose what is known about this structure. Accurate generative models of human motion would be extremely useful in both animation and tracking, and we discuss the profound difficulties encountered in building such models. Discriminative methods, which should be able to tell whether an observed motion is human or not, do not yet work well, and we discuss why. There is an extensive discussion of open issues. In particular, we discuss the nature and extent of lifting ambiguities, which appear to be significant at short timescales and insignificant at longer timescales. This discussion suggests that the best tracking strategy is to track a 2D representation, and then lift it. We point out some puzzling phenomena associated with the choice of human motion representation (joint angles vs. joint positions). Finally, we give a quick guide to resources.
1 Tracking: Fundamental Notions

In a tracking problem, one has some measurements that appear at each tick of a (notional) clock, and, from these measurements, one would like to determine the state of the world. There are two important sources of information. First, measurements constrain the possible state of the world. Second, there are dynamical constraints: the state of the world cannot change arbitrarily from one instant to the next. Tracking problems are of great practical importance. There are very good reasons to want to, say, track aircraft using radar returns (good summary histories include [51, 53, 188]; comprehensive reviews of technique in this context include [32, 39, 127]). Not all measurements are informative. For example, suppose one wishes to track an aircraft, where the state might involve pose, velocity and acceleration variables, and the measurements might be radar returns giving distance and angle to the aircraft from several radar aerials. Some of the radar returns measured might not come from the aircraft. Instead, they might be the result of noise, of other aircraft, of strips of foil dropped to confuse radar apparatus (chaff or window), or of other sources. The problem of determining which measurements are informative and which are not is known as data association.
Data association is the dominant difficulty in tracking objects in video. This is because so few of the very many pixels in each frame lie on objects of interest. It can be spectacularly difficult to tell which pixels in an image come from an object of interest and which do not. There is a very wide variety of methods for doing so, the details of which largely depend on the specifics of the application. Surprisingly, data association is not usually explicitly discussed in the computer vision tracking literature. However, whether a method is useful rests pretty directly on its success at data association; differences in other areas tend not to matter all that much in practice.

1.1 General observations

The literature on tracking people is immense. Furthermore, the problem has quite different properties depending on precisely what kind of representation one wishes to recover. The most important variable appears to be spatial scale. At a coarse scale, people are blobs. For example, we might view a plaza from the window of a building or a mall corridor from a camera suspended from the ceiling. Each person occupies a small block of pixels. While we should be able to tell where a person is, there isn't much prospect of determining where the arms and legs are. At this scale, we can expect to recover representations of occupancy (where people spend time) or of patterns of activity (how people move from place to place, and at what time). At a medium scale, people can be thought of as blobs with attached motion fields. An example is a television program of a soccer match, where individuals are small in the frame. In this case, one can tell where a person is. Arms and legs are still difficult to localize, because they cover relatively few pixels, and there is motion blur. However, the motion fields around the body yield some information as to how the person is moving.
One could expect to be able to tell from this information where a runner is in the phase of the run: are the legs extended away from the body, or crossing? At a fine scale, the arms and legs cover enough pixels to be detected, and one wants to report the configuration of the body.
We usually refer to this case as kinematic tracking. At a fine spatial scale, one may be able to report such details as whether a person is picking up or handling an object. There is a variety of ways in which one could encode and report configuration, depending on the model adopted (is one to report the configuration of the arms? the legs? the fingers?) and on whether these reports should be represented in 2D or in 3D. We will discuss various representations in greater detail later. Each scale appears to be useful, but there are no reliable rules of thumb for determining what scale is most useful for what application. For example, one could see ways to tell whether people are picking up objects at a coarse scale. Equally, one could determine patterns of activity from a fine scale. Finally, some quite complex determinations about activity can be made at a surprisingly coarse scale. Tracking tends to be much more difficult at the fine scale, because one must manage more degrees of freedom and because arms and legs can be small, and can move rather fast. In this review, we focus almost entirely on the fine scale; even so, space will not allow detailed discussion of all that has been done. Our choice of scale is dictated by the intuition that good fine-scale tracking will be an essential component of any method that can give general reports on what people are doing in video. There are distinctive features of this problem that make fine-scale tracking difficult:

State dimension: One typically requires a high dimensional state vector to describe the configuration of the body in a frame. For example, assume we describe a person using a 2D representation. Each of ten body segments (torso, head, upper and lower arms and legs) will be represented by a rectangle of fixed size (that differs from segment to segment).
This representation will use an absolute minimum of 12 state variables (position and orientation for one rectangle, and relative orientation for every other). A more practical version of the representation allows the rectangles to slide with respect to one another, and so needs 27 state variables. Considerably more variables are required for 3D models.
Nasty dynamics: There is good evidence that such motions as walking have predictable, low-dimensional structure [335, 351]. However, the body can move extremely fast, with large accelerations. These large accelerations mean that one can stop moving predictably very quickly, for example by jumping in the air during a walk. For straightforward mechanical reasons, the body parts that move fastest tend to be small and on one end of a long lever which has big muscles at the other end (forearms, fingers and feet, for example). This means that the body segments that the dynamical model fails to predict are going to be hard to find because they are small. As a result, accurate tracking of forearms can be very difficult.

Complex appearance phenomena: In most applications one is tracking clothed people. Clothing can change appearance dramatically as it moves, because the forces the body applies to the clothing change, and so the pattern of folds, caused by buckling, changes. There are two important results. First, the pattern of occlusions of texture changes, meaning that the apparent texture of the body segment can change. Second, each fold will have a typical shading pattern attached, and these patterns move in the image as the folds move on the surface. Again, the result is that the apparent texture of the body segment changes. These effects can be seen in Figure 1.4.

Data association: There is usually no distinctive color or texture that identifies a person (which is why people are notoriously difficult to find in static images). One possible cue is that many body segments appear at a distinctive scale as extended regions with rather roughly parallel sides. This isn't too helpful, as there are many other sources of such regions (for example, the spines of books on a shelf). Textured backgrounds are a particularly rich source of false structures in edge maps.
Much of what follows is about methods to handle data association problems for people tracking.
1.2 Tracking by detection

Assume we have some form of template that can detect objects reasonably reliably. A good example might be a face detector. Assume that faces don't move all that fast, and there aren't too many in any given frame. Furthermore, the relationship between our representation of the state of a face and the image is uncomplicated. This occurs, for example, when the faces we view are always frontal or close to frontal. In this case, we can represent the state of the face by what it looks like (which, in principle, doesn't change because the face is frontal) and where it is. Under these circumstances, we can build a tracker quite simply. We maintain a pool of tracks. We detect all faces in each incoming frame. We match faces to tracks, perhaps using an appearance model built from previous instances and also, at least implicitly, a dynamical model. This is where our assumptions are important; we would like faces to be sufficiently well-spaced with respect to the kinds of velocities we expect that there is seldom any ambiguity in this matching procedure. This matching procedure should not require one-to-one matches, meaning that some tracks may not receive a face, and some faces may not be allocated a track. For every face that is not attached to a track, we create a new track. Any track that has not received a face for several frames is declared to have ended (Algorithm 1 breaks out this approach). This basic recipe for tracking by detection is worth remembering. In many situations, nothing more complex is required, and the recipe is used without comment in a variety of papers. As a simple example, at coarse scales and from the right view, background subtraction and looking for dark blobs of the right size is sufficient to identify human heads. Yan and Forsyth use this observation in a simple track-by-detection scheme, where heads are linked across frames using a greedy algorithm.
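The track-pool recipe described above can be sketched in a few lines. The greedy nearest-detection matching, the match radius, and the patience before reaping a track are illustrative assumptions, not values from any particular system.

```python
# Minimal sketch of a track pool: match detections to tracks, spawn
# tracks for unmatched detections, reap tracks that go unmatched.
# Detections are (x, y) points; thresholds are illustrative.

MATCH_RADIUS = 30.0   # max distance for linking a detection to a track
MAX_MISSES = 5        # frames without a detection before a track is reaped

class Track:
    def __init__(self, pos, frame):
        self.positions = [(frame, pos)]
        self.misses = 0

    @property
    def last_pos(self):
        return self.positions[-1][1]

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def update(tracks, detections, frame):
    """Advance the track pool by one frame; returns the surviving tracks."""
    unused = list(detections)
    for t in tracks:
        # greedily take the closest unclaimed detection, if near enough
        if unused:
            best = min(unused, key=lambda d: dist(d, t.last_pos))
            if dist(best, t.last_pos) <= MATCH_RADIUS:
                t.positions.append((frame, best))
                t.misses = 0
                unused.remove(best)
                continue
        t.misses += 1
    # spawn a track for every unclaimed detection
    tracks.extend(Track(d, frame) for d in unused)
    # reap tracks that have gone unmatched too long
    return [t for t in tracks if t.misses <= MAX_MISSES]
```

Run per frame as `tracks = update(tracks, detections, frame_index)`; a detection far from every existing track starts a new track, exactly as in the recipe.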
The method is effective for obtaining estimates of where people go in public spaces. The method will need some minor improvements and significant technical machinery as the relationship between state and image measurements grows more obscure. However, in this simple form, it gives some insight into general tracking problems.

Assumptions: We have a detector which is reasonably reliable for all aspects that matter. Objects move relatively slowly with respect to the spacing of detector responses. As a result, a detector response caused either by another object or by a false positive tends to be far from the next true position of our object.

First frame: Create a track for each detector response.

Nth frame: Link tracks and detector responses. Typically, each track gets the closest detector response if it is not further away than some threshold. If the detector is capable of reporting some distinguishing feature (color, texture, size, etc.), this can be used too. Spawn a new track for each detector response not allocated to a track. Reap any track that has not received a measurement for some number of frames.

Cleanup: We now have trajectories in space-time. Link any where this is justified (perhaps by a more sophisticated dynamical or appearance model, derived from the candidates for linking).

Algorithm 1: The simplest tracking by detection method.

The trick of creating tracks promiscuously and then pruning any track that has not received a measurement for some time is quite general and extremely effective. The process of linking measurements to tracks is the aspect of tracking that will cause us the most difficulty (the other aspect, inferring states from measurements, is straightforward though technically involved). This process is made easier if measurements have features that distinctively identify the track from which they come. This can occur because, for example, a face will not change gender from frame to frame, or because tracks are widely spaced with respect
to the largest practical speed (so that allocating a measurement to the closest track is effective). All this is particularly useful for face tracking, because face detection (determining which parts of an image contain human faces, without reference to the individual identity of the faces) is one of the substantial successes of computer vision. Neither space nor energy allows a comprehensive review of this topic here. However, the typical approach is: one searches either rectangular or circular image windows over translation, scale and sometimes rotation; corrects illumination within these windows by methods such as histogram equalization; then presents these windows to a classifier which determines whether a face is present or not. There is then some post-processing on the classifier output to ensure that only one detection occurs at each face. This general picture appears in relatively early papers [299, 331, 332, 382, 383]. Points of variation include: the details of illumination correction; appropriate search mechanisms for rotation; appropriate classifiers (cf. [259, 282, 333, 339]); and building an incremental classification procedure so that many windows are rejected early and so consume little computation (see [186, 187, 407, 408] and the huge derived literature). There is a variety of strategies for detecting faces using parts, an approach that is becoming increasingly common (compare [54, 173, 222, 253, 256]); faces are also becoming a common category in so-called object category recognition.

Background subtraction

The simplest detection procedure is to have a good model of the background. In this case, everything that doesn't look like the background is worth tracking. The simplest background subtraction algorithm is to take an image of the background and then subtract it from each frame, thresholding the magnitude of the difference.
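A minimal sketch of this simplest scheme, assuming grayscale frames stored as 2D lists. The threshold is an illustrative choice, and the flood fill that groups foreground pixels into blobs is a common follow-on step rather than part of the scheme as stated.

```python
# Subtract a fixed background image, threshold the difference, and
# group foreground pixels into 4-connected blobs.

THRESHOLD = 25  # illustrative difference magnitude regarded as foreground

def foreground_blobs(frame, background):
    """frame/background: equal-sized 2D lists of gray values.
    Returns a list of blobs, each a list of (row, col) pixels."""
    rows, cols = len(frame), len(frame[0])
    fg = [[abs(frame[r][c] - background[r][c]) > THRESHOLD
           for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if fg[r][c] and not seen[r][c]:
                blob, stack = [], [(r, c)]
                seen[r][c] = True
                while stack:  # flood fill one connected component
                    y, x = stack.pop()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and fg[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                blobs.append(blob)
    return blobs
```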
Changes in illumination will defeat this approach. A natural improvement is to build a moving-average estimate of the background, to keep track of illumination changes (e.g. see [343, 417]; gradients can be incorporated). In outdoor scenes,
this approach is defeated by such phenomena as leaves moving in the wind. More sophisticated background models keep track of maximal and minimal values at each pixel, or build local statistical models at each pixel [59, 122, 142, 176, 177, 375, 376]. Under some circumstances, background subtraction is sufficient to track people and perform a degree of kinematic inference. Wren et al. describe a system, Pfinder, that uses background subtraction to identify body pixels, then identifies arm, torso and leg pixels by building blobby clusters. Haritaoglu et al. describe a system called W4, which uses background subtraction to segment people from an outdoor view. Foreground regions are then linked in time by applying a second-order dynamic model (velocity and acceleration) to propagate median coordinates (a robust estimate of the centroid) forward in time. Sufficiently close matches trigger a search process that matches the relevant foreground component in the previous frame to that in the current frame. Because people can pass one another or form groups, foreground regions can merge, split or appear. Regions appearing, splitting or merging are dealt with by creating (resp. fusing) tracks. Good new tracks can be distinguished from bad new tracks by looking forward in the sequence: a good track continues over time. Allowing a tracker to create new tracks fairly freely, and then telling good from bad by looking at the future in this way, is a traditional, and highly useful, trick in the radar tracking community (e.g. see the comprehensive book by Blackman and Popoli). The background subtraction scheme is fairly elaborate, using a range of thresholds to obtain a good blob (Figure 1.1). The resulting blobs are sufficiently good that the contour can be parsed to yield a decomposition into body segments.
The method then segments the contour using convexity criteria, and tags the segments using: distance to the head, which is at the top of the contour; distance to the feet, which are at the bottom of the contour; and distance to the median, which is reasonably stable. All this works because, for most configurations of the body, one will encounter body segments in the same order as one walks around the contour (Figure 1.2). Shadows are a perennial nuisance for background subtraction, but they can be dealt with using a stereoscopic reconstruction, as Haritaoglu et al. show.
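The second-order dynamic model used to propagate median coordinates forward in time can be sketched as follows. Estimating velocity and acceleration by finite differences from the last three medians is an illustrative simplification, not the W4 implementation.

```python
# Sketch of a second-order (velocity + acceleration) prediction for a
# blob's median coordinate, in the spirit of the W4 linking step.

def predict_next(positions):
    """positions: list of (x, y) medians, oldest first; needs >= 3 entries.
    Returns the predicted median for the next frame."""
    (x0, y0), (x1, y1), (x2, y2) = positions[-3:]
    vx, vy = x2 - x1, y2 - y1                                # latest velocity
    ax, ay = (x2 - x1) - (x1 - x0), (y2 - y1) - (y1 - y0)    # acceleration
    # constant-acceleration extrapolation one frame ahead
    return (x2 + vx + ax, y2 + vy + ay)
```

The prediction then seeds the search for the matching foreground component in the next frame.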
Fig. 1.1 Background subtraction identifies groups of pixels that differ significantly from a background model. The method is most useful for some cases of surveillance, where one is guaranteed a fixed viewpoint and a static background changing slowly in appearance. On the left, a background model; in the center, a frame; and on the right, the resulting image blobs. The figure is taken from Haritaoglu et al.; in this paper, the authors use an elaborate method involving a combination of thresholds to obtain good blobs. Figure 1.2 illustrates a method due to these authors that obtains a kinematic configuration estimate by parsing the blob. Figure from W4: Real-time surveillance of people and their activities, Haritaoglu et al., IEEE Trans. Pattern Analysis and Machine Intelligence, 2000, © 2000 IEEE.

Fig. 1.2 For a given view of the body, body segments appear in the outline in a predictable manner. An example for a frontal view appears on the left. Haritaoglu et al. identify vertices on the outline of a blob using a form of convexity reasoning (right (b) and right (c)), and then infer labels for these vertices by measuring the distance to head (at the top), feet (at the bottom) and median (below right). These distances give possibly ambiguous labels for each vertex; by applying a set of topological rules obtained using examples of multiple views like that on the left, they obtain an unambiguous labelling. Figure from W4: Real-time surveillance of people and their activities, Haritaoglu et al., IEEE Trans. Pattern Analysis and Machine Intelligence, 2000, © 2000 IEEE.
Deformable templates

Image appearance (or simply appearance) is a flexible term used to refer to the aspects of an image that are being encoded and should be matched. Appearance models might encode such matters as: edge position; edge orientation; the distribution of color at some scale (perhaps as a histogram, perhaps as histograms for each of some set of spatially localized buckets); or texture (usually in terms of statistics of filter outputs). A deformable template or snake is a parametric model of image appearance, usually used to localize structures. For example, one might have a template that models the outline of a squash [191, 192] or the outline of a person, place the template on the image in about the right place, and let a fitting procedure figure out the best position, orientation and parameters. We can write this out formally as follows. Assume we have some form of template that specifies image appearance as a function of some parameters. We write this template, which gives (say) image brightness (or color, or texture, and so on) as a function of space x and some parameters θ, as T(x; θ). We score a comparison between the image at frame n, which we write as I(x, t_n), and this template using a scoring function ρ, as ρ(T(x; θ), I(x, t_n)). A point template is built as a set of active sites within a model coordinate frame. These sites are to match keypoints identified in the image. We now build a model of the acceptable sets of active sites obtained as shape, location, etc., change. Such models can be built with, for example, the methods of principal component analysis. We can now identify a match by obtaining image keypoints, building a correspondence between image keypoints and active sites on the template, and identifying parameters that minimize the fitting error. An alternative is a curve template, an idea originating with the snakes of [191, 192].
We choose a parametric family of image curves (for example, a closed B-spline) and build a model of acceptable shapes,
using methods like principal component analysis on the control points. There is an excellent account of methods in the book of Blake and Isard. We can now identify a match by summing values of some image-based potential function over a set of sample points on the curve. A particularly important case occurs when we want the sample points to be close to image points where there is a strong feature response, say an edge point. It can be inconvenient to find every edge point in the image (a matter of speed), and this class of template allows us to search for edges only along short sections normal to the curve, an example of a gate. Deformable templates have not been widely used as object detectors, because finding a satisfactory minimum (one that lies on the object of interest, most likely a global minimum) can be hard. The search is hard to initialize because one must identify the feature points that should lie within the gate of the template. However, in tracking problems this difficulty is mitigated if one has a dynamical model of some form. For example, the object might move slowly, meaning that the minimum for frame n will be a good start point for frame n + 1. As another example, the object might move with a large, but near constant, velocity. This means that we can predict a good start point for frame n + 1 from frame n. A significant part of the difficulty is caused by image features that don't lie on the object, meaning that another useful case occurs in the near absence of clutter: perhaps background subtraction, or the imaging conditions, ensures that there are few or no extra features to confuse the fitting process. Baumberg and Hogg track people with a deformable template built using a B-spline as above, with principal components used to determine the template. They use background subtraction to obtain an outline for the figure, then sample the outline.
For this kind of template, correspondence is generally a nuisance, but in some practical applications this information can be supplied from quite simple considerations. For example, Baumberg and Hogg work with background-subtracted data of pedestrians at fairly coarse scales from fixed views. In this case, sampling the outline at fixed fractions of length, starting at the lowest point on the principal axis, yields perfectly acceptable correspondence information.
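This correspondence trick can be sketched as follows, assuming the outline is a closed polygon. Using the lowest vertex (largest row coordinate, in image conventions) as the start point is an illustrative simplification of "lowest point on the principal axis".

```python
# Sketch: sample a closed outline at fixed fractions of its arc length,
# starting from the lowest vertex, so corresponding sample points line
# up across frames.

import math

def resample_outline(points, n_samples):
    """points: closed polygon as (x, y) vertices. Returns n_samples
    points at equal arc-length steps, starting at the lowest vertex."""
    # rotate so the lowest point (largest y, image coordinates) comes first
    start = max(range(len(points)), key=lambda i: points[i][1])
    pts = points[start:] + points[:start]
    pts.append(pts[0])  # close the polygon
    seg = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
    total = sum(seg)
    samples, i, acc = [], 0, 0.0
    for k in range(n_samples):
        target = total * k / n_samples  # fixed fraction of total length
        while acc + seg[i] < target:
            acc += seg[i]
            i += 1
        t = (target - acc) / seg[i] if seg[i] else 0.0
        (x0, y0), (x1, y1) = pts[i], pts[i + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples
```

Sampling two outlines with the same `n_samples` yields point pairs that can be fed directly to a PCA shape model.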
Robustness

We have presented scoring a deformable template as a form of least squares fitting problem. There is a basic difficulty in such problems. Points that are dramatically in error, usually called outliers and traditionally blamed on typist error [153, 330], can be overweighted in determining the fit. Outliers in vision problems tend to be unavoidable, because nature is so generous with visual data that there is usually something seriously misleading in any signal. There is a variety of methods for managing the difficulties created by outliers that are used in building deformable template trackers. An estimator is called robust if the estimate tends to be only weakly affected by outliers. For example, the average of a set of observations is not a robust estimate of the mean of their source (because if one observation is, say, mistyped, the average could be wildly incorrect). The median is a robust estimate, because it will not be much affected by the mistyped observation. Gating, the scheme of finding edge points by searching out some distance along the normal from a curve, is one strategy to obtain robustness. In this case, one limits the distance searched. Ideally, there is only one edge point in the search window, but if there are more one takes the closest (or strongest, mutatis mutandis, depending on application details). If there is nothing, one accepts some fixed score, chosen to make the cost continuous. This means that the cost function, while strictly not differentiable, is not dominated by very distant edge points. These are not seen in the gate, and there is an upper bound on the error any one site can contribute. An alternative is to use an M-estimator. One would like to score the template with a function of the squared distance between site and measured point.
This function should behave like the squared distance for small values and be close to some constant for large values (so that large values don't contribute large biases). A natural form is

ρ(u) = u / (u + σ)

so that, for d^2 small with respect to σ, we have ρ(d^2) ≈ d^2/σ, and for d^2 large with respect to σ we have ρ(d^2) ≈ 1. The advantage of this
approach is that nearby edge points dominate the fit; the disadvantage is that even fitting problems that are originally convex are no longer convex when the strategy is applied. Numerical methods are consequently more complex, and one must use multiple start points. There is little hope of having a convex problem, because different start points correspond to different splits of the data set into important points and outliers; there is usually more than one such split. Again, large errors no longer dominate the estimation process, and the method is almost universally applied for flow templates.

The Hausdorff distance

The Hausdorff distance is a method to measure similarity between binary images (for example, edge maps; the method originates in Minkowski's work in convex analysis, where it takes a somewhat different form). Assume we have two sets of points P and Q; typically, each point is an edge point in an image. We define the Hausdorff distance between the two sets to be

H(P, Q) = max(h(P, Q), h(Q, P))

where

h(P, Q) = max_{p in P} min_{q in Q} ||p − q||.

The distance is small if there is a point in Q close to each point in P and a point in P close to each point in Q. There is a difficulty with robustness, as the Hausdorff distance is large if there are points with no good matches. In practice, one uses a variant of the Hausdorff distance (the generalized Hausdorff distance) where the distance used is the k-th ranked of the available distances rather than the largest. Define F_k to be the operator that orders the elements of its input from largest to smallest, then takes the k-th largest. We now have

H_k(P, Q) = max(h_k(P, Q), h_k(Q, P))

where

h_k(P, Q) = F_k_{p in P} ( min_{q in Q} ||p − q|| )
(for example, if there are 2n points in P, then h_n(P, Q) will give the median of the minimum distances). The advantage of all this is that some large distances get ignored. Now we can compare a template P with an image Q by determining some family of transformations T(θ) and then choosing the set of parameters θ̂ that minimizes H_k(T(θ)P, Q). This will involve some form of search over θ. The search is likely to be simplified if, as applies in the case of tracking, we have a fair estimate of θ̂ to hand. Huttenlocher et al. track using the Hausdorff distance. The template, which consists of a set of edge points, is itself allowed to deform. Images are represented by edge points. They identify the instance of the latest template in the next frame by searching over translations θ of the template to obtain the smallest value of H_k(T(θ)P, Q). They then translate the template to that location, and identify all edge points that are within some distance of the current template's edge points. The resulting points form the template for the next frame. This process allows the template to deform to take into account, say, the deformation of the body as a person moves. Performance in heavily textured video must depend on the extent to which the edge detection process suppresses edges, and on the setting of this distance parameter (a large distance and lots of texture is likely to lead to catastrophe).

1.3 Tracking using flow

The difficulty with tracking by detection is that one might not have a deformable template that fully specifies the appearance of an object. It is quite common to have a template that specifies the shape of the domain spanned by the object and the type of its transformation, but not what lies within. Typically, we don't know the pattern, but we do know how it moves. There are several important examples: Human body segments tend to look like a rectangle in any frame, and the motion of this rectangle is likely
to be either Euclidean or affine, depending on imaging circumstances. A face in a webcam tends to fill a blob-like domain and undergo mainly Euclidean transformations. This is useful for those building user interfaces where the camera on the monitor views the user, and there are numerous papers dealing with this. The face is not necessarily frontal (computer users occasionally look away from their monitors) but tends to be large, blobby and centered. Edge templates, particularly those specifying outlines, are usually used because we don't know what the interior of the region looks like. Quite often, as we have seen, we know how the template can deform and move. However, we cannot score the interior of the domain because we don't know (say) the pattern of clothing being worn. In each of these cases, we cannot use tracking by detection as above because we do not possess an appropriate template. As a matter of experience, objects don't change appearance much from frame to frame (alternatively, we should use the term appearance to apply to properties that don't change much from frame to frame). All this implies that parts of the previous image could serve as a template if we have a motion model and a domain model. We could use a correspondence model to link pixels in the domain in frame n with those in the domain in frame n + 1. A good linking should pair pixels that have similar appearances. Such considerations as camera properties, the motion of rigid objects, and computational expense suggest choosing the correspondence model from a small parametric family. All this gives a formal framework. Write a pixel position in the n-th frame as x_n, the domain in the n-th frame as D_n, and the transformation from the n-th frame to the (n+1)-th frame as T_{n→n+1}(·; θ_n). In this notation θ_n represents the parameters of the transformation from the n-th frame to the (n+1)-th frame, and we have x_{n+1} = T_{n→n+1}(x_n; θ_n). We assume we know D_n.
We can obtain D_{n+1} from D_n as T_{n→n+1}(D_n; θ_n). Now we can score the parameters θ_n representing the
change in state between frames n and n+1 by comparing D_n with D_{n+1} (which is a function of θ_n). We compute some representation of image information R(x), and, within the domain D_{n+1}, compare R(x_{n+1}) with R(T_{n→n+1}(x_n; θ_n)), where the transformation is applied to the domain D_n.

Optic flow

Generally, a frame-to-frame correspondence should be thought of as a flow field (or an optic flow field): a vector field in the image giving local image motion at each pixel. A flow field is fairly clearly a correspondence, and a correspondence gives rise to a flow field (put the tail of the vector at the pixel position in frame n, and the head at the position in frame n+1). The notion of optic flow originates with Gibson (see, for example, ). A useful construction in the optic flow literature assumes that image intensity is a continuous function of position and time, I(x, t). We then assume that the intensity of image patches does not change with movement. While this assumption may run into trouble with illumination models, specularities, etc., it is not outrageous for small movements. Furthermore, it underlies our willingness to compare pixel values in frames. Accepting this assumption, we have

dI/dt = ∇I · (dx/dt) + ∂I/∂t = 0

(known as the optic flow equation; see, for example, ). Flow is represented by dx/dt. This is important, because if we confine our attention to an appropriate domain, comparing I(T(x; θ_n), t_{n+1}) with I(x, t_n) involves, in essence, estimating the total derivative. In particular,

I(T(x; θ_n), t_{n+1}) − I(x, t_n) ≈ dI/dt.

Furthermore, the equivalence between correspondence and flow suggests a simpler form for the transformation of pixel values. We regard T(x; θ_n) as taking x from the tail of a flow arrow to the head. At short timescales, this justifies the view that T(x; θ_n) = x + δx(θ_n).
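To illustrate how the optic flow equation is used in practice, the sketch below stacks it over every pixel of a patch and solves for a single shared translation by least squares (a Lucas-Kanade style estimate; the function name and synthetic frames are illustrative, not any specific method from the text).

```python
import numpy as np

def estimate_translation(I0, I1):
    """Least-squares estimate of a small translation between frames.

    Stacks the optic flow equation Ix*u + Iy*v + It = 0 over all
    pixels and solves for the flow (u, v) shared by the whole patch.
    """
    Iy, Ix = np.gradient(I0)   # spatial derivatives (rows, then columns)
    It = I1 - I0               # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic frames: a smooth pattern, then the same pattern shifted
# by half a pixel along x. x varies along columns, y along rows.
x, y = np.meshgrid(np.arange(64), np.arange(64))
I0 = np.sin(0.2 * x) + np.cos(0.3 * y)
I1 = np.sin(0.2 * (x - 0.5)) + np.cos(0.3 * y)
u, v = estimate_translation(I0, I1)   # u should be close to 0.5, v to 0
```

The linearization only holds for small displacements, which is why practical systems iterate this estimate or run it over a coarse-to-fine pyramid.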
Image stabilization

This form of tracking can be used to build boxes around moving objects, a practice known as image stabilization. One has a moving object on a fairly uniform background, and would like to build a domain such that the moving object is centered on the domain. This has the advantage that one can look at relative, rather than absolute, motion cues. For example, one might take a soccer player running around a field, and build a box around the player. If one then fixes the box and its contents in one place, the vast majority of motion cues within the box are cues to how the player's body configuration is changing. As another example, one might stabilize a box around an aerial view of a moving vehicle; now the box contains all visual information about the vehicle's identity.

Efros et al. use a straightforward version of this method, where domains are rectangles and flow is pure translation, to stabilize boxes around people viewed at a medium scale (for example, in a soccer video). In some circumstances, good results can be obtained by matching a rectangle in frame n with the rectangle in frame n+1 that has the smallest sum of squared differences, which might be found by blank search, assisted perhaps by velocity constraints. This is going to work best if the background is relatively simple (say, the constant green of a soccer field), as then the background isn't a source of noise, so the figure need not be segmented (Figure 1.3). For more complex backgrounds, the approach may still work if one performs background subtraction before stabilization. At a medium scale it is very difficult to localize arms and legs, but they do leave traces in the flow field.
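A minimal sketch of this kind of SSD stabilization, with a brute-force search over translations within a small window (the function name, search radius, and toy frames are all illustrative, not from Efros et al.'s implementation):

```python
import numpy as np

def stabilize_box(prev_frame, box, next_frame, radius=5):
    """Re-locate a rectangular box in the next frame by brute-force
    search over translations, minimizing the sum of squared differences.

    box: (row, col, height, width) in prev_frame.
    radius: search window in pixels around the previous position.
    """
    r, c, h, w = box
    template = prev_frame[r:r + h, c:c + w].astype(float)
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if (rr < 0 or cc < 0 or
                    rr + h > next_frame.shape[0] or cc + w > next_frame.shape[1]):
                continue  # candidate rectangle falls outside the frame
            candidate = next_frame[rr:rr + h, cc:cc + w].astype(float)
            ssd = ((candidate - template) ** 2).sum()
            if best is None or ssd < best[0]:
                best = (ssd, (rr, cc, h, w))
    return best[1]

# Toy frames: a bright square on a uniform "field", moving by (2, 3).
frame0 = np.zeros((40, 40)); frame0[10:18, 10:18] = 1.0
frame1 = np.zeros((40, 40)); frame1[12:20, 13:21] = 1.0
new_box = stabilize_box(frame0, (8, 8, 12, 12), frame1)  # (10, 11, 12, 12)
```

The velocity constraint mentioned above amounts to centering the search window on the box's predicted position rather than its previous one.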
The stabilization procedure means that the flow information can be computed with respect to a torso coordinate system, resulting in a representation that can be used to match at a kinematic level, without needing an explicit representation of arm and leg configurations (Figure 1.3).

Cardboard people

Flow-based tracking has the advantage that one doesn't need an explicit model of the appearance of the template. Ju et al. build a model of legs in terms of a set of articulated rectangular patches ("cardboard people"). Assume we have a domain D in the nth image I(x, t_n)
Fig. 1.3 Flow-based tracking can be useful for medium-scale video. Efros et al. stabilize boxes around the torso of players in football video, using a sum of squared differences (SSD) as a cost function and straightforward search to identify the best translation values. As the figure on the left shows, the resulting boxes are stable with respect to the torso. On the top right, larger versions of the boxes for some cases. Note that, because the video is at medium scale, it is difficult to resolve arms and legs, which are severely affected by motion blur. Nonetheless, one can make a useful estimate of what the body is doing by computing an estimate of optic flow (bottom right, F_x, F_y), rectifying this estimate (bottom right, F_x^+, F_x^−, F_y^+, F_y^−) and then smoothing the result (bottom right, Fb_x^+, etc.). The result is a smoothed estimate of where particular velocity directions are distributed with respect to the torso, which can be used to match and label frames. Figure from "Recognizing Action at a Distance", Efros et al., IEEE Int. Conf. Computer Vision 2003, © 2003 IEEE.

and a flow field δx(θ) parametrized by θ. Now this flow field takes D to some domain in the (n+1)th image, and establishes a correspondence between pixels in the nth and the (n+1)th image. Ju et al. score

∑_{x∈D} ρ(I_{n+1}(x + δx(θ)) − I_n(x))

where ρ is some measure of image error, which is small when the two compare well and large when they are different. Notice that this is a very general approach to the tracking problem, with the difficulty that, unless one is careful about the flow model, the problem of finding a minimum might be hard. To our knowledge, the image score is always applied to pixel values, and it seems interesting to wonder what would happen if one scored a difference in texture descriptors. Typically, the score is not minimized directly, but is approximated with the optic flow equation and with a Taylor series.
We have

∑_{x∈D} ρ(I(x + δx(θ), t_{n+1}) − I(x, t_n))
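The score above can be evaluated directly for any candidate flow. A minimal sketch, using the Geman-McClure estimator as the robust ρ and a pure integer translation standing in for the parametric flow δx(θ); the function names, σ value, and toy data are all illustrative assumptions, not Ju et al.'s implementation.

```python
import numpy as np

def geman_mcclure(r, sigma=0.1):
    """Robust penalty rho(r): approximately quadratic for small
    residuals, saturating for outliers (e.g. occluded pixels)."""
    return r ** 2 / (r ** 2 + sigma ** 2)

def flow_score(I_n, I_n1, domain, delta, sigma=0.1):
    """Sum of rho(I_{n+1}(x + delta) - I_n(x)) over pixels x in domain.

    domain: boolean mask on I_n; delta: integer translation (dr, dc)
    standing in for the parametric flow delta_x(theta).
    """
    rows, cols = np.nonzero(domain)
    residual = I_n1[rows + delta[0], cols + delta[1]] - I_n[rows, cols]
    return geman_mcclure(residual, sigma).sum()

# Toy example: frame n+1 is frame n shifted right by one pixel, so the
# score is minimized (exactly zero) at delta = (0, 1).
rng = np.random.default_rng(0)
I_n = rng.random((20, 20))
I_n1 = np.roll(I_n, 1, axis=1)
mask = np.zeros((20, 20), bool); mask[5:15, 5:14] = True
scores = {d: flow_score(I_n, I_n1, mask, d) for d in [(0, 0), (0, 1), (1, 0)]}
best = min(scores, key=scores.get)
```

In practice the minimum is not found by enumeration like this: as the text notes, the score is linearized with the optic flow equation and a Taylor series, and the resulting problem is solved incrementally.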