Data Clustering for Forecasting

1 Data Clustering for Forecasting. James B. Orlin (MIT Sloan School and OR Center), Mahesh Kumar (MIT OR Center), Nitin Patel (Visiting Professor), Jonathan Woo (ProfitLogic Inc.)

2 Overview of Talk: Overview of clustering. Error-based clustering. Use of clustering in forecasting. But first, a few words from Scott Adams.

6 What is clustering? Clustering is the process of partitioning a set of data or objects into clusters with the following properties. Homogeneity within clusters: data that belong to the same cluster should be as similar as possible. Heterogeneity between clusters: data that belong to different clusters should be as different as possible.

7 Overview of this talk: Provide a somewhat personal view of the significance of clustering in life, and why it has not met its promise. Provide our technique for incorporating uncertainty about data into clustering, so as to reduce uncertainty in forecasting.

8 Iris Data (Fisher, 1936). Figure: scatter plot of the iris data on axes can1 and can2, with points labeled by species (Setosa, Versicolor, Virginica).

9 Cluster the iris data: This is a 2-dimensional projection of 4-dimensional data (sepal length and width, petal length and width). From the projection alone it is not clear whether there are 2, 3 or 4 clusters; there are in fact 3. Clusters are usually chosen to minimize some metric (e.g., the sum of squared distances from the center of the cluster).
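
As an illustration of the metric mentioned on this slide, here is a minimal sketch of the within-cluster sum of squared distances. The function name `within_cluster_ss` and the four toy points are assumptions made for the example; they are not from the talk.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]          # rows assigned to cluster k
        centroid = members.mean(axis=0)        # center of the cluster
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Hypothetical toy data: two tight groups in the plane.
pts = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = within_cluster_ss(pts, [0, 0, 1, 1])    # groups nearby points together
bad = within_cluster_ss(pts, [0, 1, 0, 1])     # groups far-apart points together
print(good, bad)
```

A clustering that keeps nearby points together scores lower on this metric, which is why it is a common objective.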

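10 Iris Data. Figure: scatter plot on axes can1 and can2, with points labeled by species (Setosa, Versicolor, Virginica).

11 Iris Data, using ellipses. Figure: the same scatter plot on axes can1 and can2, with an ellipse drawn for each species (Setosa, Versicolor, Virginica).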

12 Why is clustering important: a personal perspective. Two very natural aspects of intelligence are grouping (clustering) and categorizing. It's an organizing principle of our minds and of our lives. Just a few examples: We cluster life into work life and family life. We cluster our lives by our roles: father, mother, sister, brother, teacher, manager, researcher, analyst, etc. We cluster our work life in various ways, perhaps organized by projects, by whom we report to, or by who reports to us. We even cluster the talks we attend, perhaps by quality, by what we learned, or by where they were held.

13 More on Clustering in Life. More clustering examples: Going shopping: products are clustered in the store (useful for locating things). As a professor, I need to cluster students into letter grades: what really is the difference between a B+ and an A-? (useful in evaluations). When we figure out what to do, we often prioritize by clustering things (important vs. unimportant). We cluster people along multiple dimensions based on appearance, intelligence, character, religion, sexual orientation, place of origin, etc. Conclusion: Humans cluster and categorize by nature. It is part of our intelligence.

14 Fields that have used clustering: Marketing (market segmentation, catalogues). Chemistry (the periodic table is a great example). Finance (making sense of stock transactions). Medicine (clustering patients). Data mining (what can we do with transactional data, such as click stream data?). Bioinformatics (how can we make sense of proteins?). Data compression and aggregation (can we cluster massive data sets into smaller data sets for subsequent analysis?). Plus much more.

15 Has clustering been successful in data mining? Initial hope: clustering would find many interesting patterns and surprising relationships. This hope has arguably not been met, at least not nearly enough; perhaps it requires too much intelligence, and perhaps we can do better in the future. Nevertheless, clustering has been successful in using computers for things that humans are quite bad at: dealing with massive amounts of data, and effectively using knowledge of uncertainty.

16 An issue in clustering: the effect of scale. Background: an initial motivation for our work in clustering (as sponsored by the e-business Center) was to eliminate the effect of scale in clustering.

17 A Chart of 6 Points. Figure titled "Clustering 6 points".

18 Two Clusters of the 6 Points. Figure titled "Clustering 6 points".

19 We added two points and adjusted the scale. Figure titled "Clustering 8 points".

20 3 clusters of the 8 points. Figure titled "Clustering 8 points". The 6 points on the left are clustered differently.

21 Scale Invariance: A clustering approach is called scale invariant if it produces the same solution regardless of the scales used. The approach developed next is scale invariant.
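
To see why scale matters for ordinary distance-based clustering, the sketch below shows that the nearest neighbor of a point under Euclidean distance can change when one axis is rescaled (for example, a change of units). The three points and the helper `nearest_neighbor` are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

def nearest_neighbor(points, idx):
    """Index of the point closest (in Euclidean distance) to points[idx]."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts - pts[idx], axis=1)
    d[idx] = np.inf                      # exclude the point itself
    return int(np.argmin(d))

# Hypothetical data: point 0 is close to point 1 along x and to point 2 along y.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
rescaled = pts * np.array([10.0, 1.0])   # express the x-axis in different units

before = nearest_neighbor(pts, 0)        # neighbor of point 0 in original units
after = nearest_neighbor(rescaled, 0)    # neighbor of point 0 after rescaling
print(before, after)
```

Since a method that merges nearest points first would start with a different pair after rescaling, its final clusters can differ as well. A scale-invariant method avoids this.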

22 Using clustering to reduce uncertainty: Try to find the average of the 3 populations. Figure: iris scatter plot on axes can1 and can2, with points labeled by species (Setosa, Versicolor, Virginica).

23 Using uncertainty to improve clustering: an example with 4 points in 1 dimension. The four points were obtained as sample means for four samples, two from one distribution and two from another. Objective: cluster into two groups of two each so as to maximize the probability that each cluster represents two samples from the same distribution.

24 Standard Approach: Consider the four data points, and cluster based on these values. Figure: the resulting clusters.

25 Incorporating Uncertainty. A common assumption in statistics: data come from populations or distributions; from data, we can estimate the mean of the population and the standard deviation of the original population. Usual approach to clustering: keep track of the estimated mean; ignore the standard deviation (estimate of the error). Our approach: use both the estimated mean and the estimate of the error.

26 The two samples on the left were samples with 10,000 points each. The samples on the right were two samples with 100 points each. The radius corresponds to standard deviation. Smaller circles → larger data sets → more certainty.
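
The link between sample size and certainty can be made concrete with the standard error of a sample mean, s / sqrt(n). The sketch below uses a population standard deviation of 0.29, which is an assumed value for illustration only; the sample sizes are the ones on the slide.

```python
import math

def standard_error(sample_std, n):
    """Standard error of the mean of n independent observations."""
    return sample_std / math.sqrt(n)

ASSUMED_STD = 0.29                                 # hypothetical, not from the talk

se_large = standard_error(ASSUMED_STD, 10_000)     # sample with 10,000 points
se_small = standard_error(ASSUMED_STD, 100)        # sample with 100 points
print(se_large, se_small, se_small / se_large)
```

With 100 times as many points, the standard error shrinks by a factor of 10, which is why the larger samples are drawn with smaller circles.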

27 probability = 4/19 probability = 8/19 probability = 7/

28 10,000 points with mean points with mean ,100 points with mean.501 True mean:.5 10,000 points with mean points with mean ,100 points with mean.537 True mean:.53 28

29 More on using uncertainty: We will use clustering to reduce uncertainty. We will use our knowledge of the uncertainty to improve the clustering. In the previous example, the correct clustering was the one with probability = 8/19. We had generated 20 sets of four points at random; the data was from the second set of four points.

30 Error-based clustering: 1. Start with n points in k-dimensional space (the next example has 15 points in 2 dimensions). Each point has an estimated mean as well as a standard deviation of the estimate. 2. Determine the likelihood of each pair of points coming from the same distribution. 3. Merge the two points with the greatest likelihood. 4. Return to Step 2.

31 Using Maximum Likelihood. Maximum likelihood method: Suppose we have G clusters, $C_1, C_2, \ldots, C_G$. Out of the exponentially many possible clusterings, which clustering is most likely with respect to the observed data? Objective: $$\max \sum_{k=1}^{G} \Big(\sum_{i \in C_k} \sigma_i^{-1} x_i\Big)^{t} \Big(\sum_{i \in C_k} \sigma_i^{-1}\Big)^{-1} \Big(\sum_{i \in C_k} \sigma_i^{-1} x_i\Big)$$ Computationally difficult!
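
The sketch below evaluates this objective for a given clustering. It reads the objective as the sum over clusters of (sum of σ_i⁻¹ x_i)ᵗ (sum of σ_i⁻¹)⁻¹ (sum of σ_i⁻¹ x_i), and it treats each σ_i as an error covariance matrix. Both are assumptions made for the example, as are the function name `ml_objective` and the 1-dimensional toy data.

```python
import numpy as np

def ml_objective(means, covs, labels):
    """Objective value of one clustering; larger is read as more likely."""
    means = np.asarray(means, dtype=float)     # shape (n, d): estimates x_i
    covs = np.asarray(covs, dtype=float)       # shape (n, d, d): assumed error covariances
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        precisions = [np.linalg.inv(covs[i]) for i in idx]
        p_sum = sum(precisions)                                   # sum of sigma_i^{-1}
        b = sum(p @ means[i] for p, i in zip(precisions, idx))    # sum of sigma_i^{-1} x_i
        total += float(b @ np.linalg.solve(p_sum, b))
    return total

# Hypothetical 1-D data: four estimates, each with error variance 0.01.
means = [[0.0], [0.1], [1.0], [1.1]]
covs = [[[0.01]]] * 4
obj_good = ml_objective(means, covs, [0, 0, 1, 1])   # nearby estimates together
obj_bad = ml_objective(means, covs, [0, 1, 0, 1])    # distant estimates together
print(obj_good, obj_bad)
```

Evaluating one clustering is cheap; the difficulty noted on the slide comes from the number of clusterings that would have to be compared.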

32 Heuristic solution based on maximum likelihood. Greedy heuristic: Start with n single-point clusters. Combine the pair of clusters that leads to the maximum increase in the objective value (based on maximum likelihood). Stop when we have G clusters. Similar to hierarchical clustering.

33 Error-based clustering: At each step, combine the pair of clusters $C_i$, $C_j$ with the smallest $(x_i - x_j)^{t} (\sigma_i + \sigma_j)^{-1} (x_i - x_j)$, where $x_i, x_j$ are the maximum likelihood estimates of the means of the clusters and $\sigma_i, \sigma_j$ are the standard errors in the $x$'s. We define the distance between two clusters as $$\mathrm{distance}(C_i, C_j) = (x_i - x_j)^{t} (\sigma_i + \sigma_j)^{-1} (x_i - x_j).$$ Computationally much easier!
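
A minimal sketch of this distance, reading it as (x_i − x_j)ᵗ (σ_i + σ_j)⁻¹ (x_i − x_j) and assuming that each σ is an error covariance matrix (squared standard errors on the diagonal). The function name `error_distance` and the numbers are hypothetical. The second half checks scale invariance: rescaling an axis rescales the means and the covariances together, and the distance is unchanged.

```python
import numpy as np

def error_distance(x_i, cov_i, x_j, cov_j):
    """(x_i - x_j)^t (cov_i + cov_j)^{-1} (x_i - x_j), covariances assumed."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    pooled = np.asarray(cov_i, dtype=float) + np.asarray(cov_j, dtype=float)
    return float(diff @ np.linalg.solve(pooled, diff))

# Hypothetical cluster means and error covariances in 2 dimensions.
x_a, cov_a = np.array([1.0, 2.0]), np.diag([0.04, 0.01])
x_b, cov_b = np.array([1.5, 2.2]), np.diag([0.05, 0.03])
d = error_distance(x_a, cov_a, x_b, cov_b)

# Rescale the first coordinate by 100 (e.g. a change of units).
S = np.diag([100.0, 1.0])
d_scaled = error_distance(S @ x_a, S @ cov_a @ S.T, S @ x_b, S @ cov_b @ S.T)
print(d, d_scaled)
```

Because the errors are measured in the same units as the data, the units cancel in the distance.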

34 Error-based Clustering Algorithm: $\mathrm{distance}(C_i, C_j) = (x_i - x_j)^{t} (\sigma_i + \sigma_j)^{-1} (x_i - x_j)$. Start with n singleton clusters. At each step, combine the pair of clusters $C_i$, $C_j$ with the smallest distance. Stop when we have the desired number of clusters. It is a generalization of Ward's method.
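
The greedy procedure on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each σ is treated as an error covariance matrix, and a merged cluster is summarized by the precision-weighted mean of its two parts (the maximum likelihood estimate for Gaussian errors with known covariances). The names `merge` and `error_based_clustering` and the data are hypothetical.

```python
import numpy as np

def merge(x_i, cov_i, x_j, cov_j):
    """Precision-weighted mean and its covariance (assumed merge rule)."""
    p_i, p_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    cov = np.linalg.inv(p_i + p_j)
    return cov @ (p_i @ x_i + p_j @ x_j), cov

def error_based_clustering(means, covs, n_clusters):
    """Greedy merging of the pair with the smallest error-based distance."""
    clusters = [([i], np.asarray(m, dtype=float), np.asarray(c, dtype=float))
                for i, (m, c) in enumerate(zip(means, covs))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                diff = clusters[a][1] - clusters[b][1]
                pooled = clusters[a][2] + clusters[b][2]
                d = float(diff @ np.linalg.solve(pooled, diff))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        ids_a, x_a, c_a = clusters[a]
        ids_b, x_b, c_b = clusters[b]
        x_new, c_new = merge(x_a, c_a, x_b, c_b)
        clusters = [cl for k, cl in enumerate(clusters) if k not in (a, b)]
        clusters.append((ids_a + ids_b, x_new, c_new))
    return clusters

# Hypothetical 1-D data: two precise estimates and one noisy estimate.
means = [[0.50], [0.52], [0.58]]
covs = [[[1e-5]], [[1e-5]], [[1e-2]]]
result = error_based_clustering(means, covs, n_clusters=2)
groups = sorted(sorted(ids) for ids, _, _ in result)
print(groups)
```

In this toy case the two precise estimates stay apart even though their values are closest, because their small errors make a difference of 0.02 significant, while the noisy estimate is compatible with either.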

35 The mean is the dot. The error is given by the ellipse. A small ellipse means that the data is quite accurate.

36 Determine the two elements most likely to come from the same distribution. Merge them into a single element.

37 Merge them into a single element. Determine the two elements most likely to come from the same distribution.

38 Continue this process, reducing the number of clusters one at a time.

50 Here we went all the way to a single cluster. We could stop with 2 or 3 or more clusters. We can also evaluate different numbers of clusters at the end.

51 Rest of the Lecture: The use of clustering in forecasting, developed while Mahesh Kumar worked at ProfitLogic. Joint work: Mahesh Kumar, Nitin Patel, Jonathan Woo.

52 Motivation: Accurate sales forecasting is very important in the retail industry in order to make good decisions about shipping, allocation, and pricing. Figure: supply chain from Manufacturer to Wholesaler to Retailer to Customer. Kumar et al. used clustering to help produce accurate sales forecasts.

53 Forecasting Problem. Goal: forecast sales. Parameters that affect sales: price; when a product is introduced; promotions; inventory; base demand as a function of time of the year; random effects.

54 Seasonality. Definition: Seasonality is the hypothesized underlying base demand of a group of similar merchandise as a function of time of the year. It is a vector of size 52, describing variations over the year. It is independent of external factors like changes in price, promotions, inventory, etc., and is modeled as a multiplicative factor. E.g., two portable CD players have essentially the same seasonality, but they may differ in price, promotions, inventory, etc.
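
A small sketch of what "a vector of size 52, modeled as a multiplicative factor" could look like in code. Everything specific here is an assumption for illustration: the sine-shaped curve, the normalization to an average of 1, and the `expected_sales` helper are not from the talk.

```python
import numpy as np

weeks = np.arange(52)
# Hypothetical summer-peaking shape, highest at week 26.
raw = 1.0 + 0.5 * np.sin(2 * np.pi * (weeks - 13) / 52)
seasonality = raw / raw.mean()       # assumed normalization: average factor is 1

def expected_sales(base_weekly_demand, week, other_effects=1.0):
    """Base demand scaled by the seasonal factor and by all other effects."""
    return base_weekly_demand * seasonality[week] * other_effects

# Two hypothetical products with the same seasonality but different pricing effects.
full_price = [expected_sales(100.0, w) for w in (0, 26)]
discounted = [expected_sales(100.0, w, other_effects=1.3) for w in (0, 26)]
peak_week = int(np.argmax(seasonality))
print(peak_week, full_price, discounted)
```

Because the factor is multiplicative, the ratio of peak to off-peak sales is the same for both products even though their sales levels differ.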

55 Seasonality Examples (made-up data). Figure: weekly sales for summer shoes and weekly sales for winter boots.

56 Objective: determine the seasonality of products. Difficulty: observation of a product's seasonality is complicated by other factors: when the product is introduced; sales and promotions; inventory. Solution methods: preprocess data to compensate for sales, promotions, and inventory effects; average over lots of very similar products to eliminate some of the uncertainty. Further clustering of products can eliminate more uncertainty.

57 Retail Merchandise Hierarchy. Chain: J-Mart. Department: Shoes. Class: Men's summer shoes. Item: Debok walkers. Sales data is available for items.

58 Modeling Seasonality: $$\mathrm{Seas}_i = \{(x_{i1}, \sigma_{i1}), (x_{i2}, \sigma_{i2}), \ldots, (x_{i52}, \sigma_{i52})\} = (x_i, \sigma_i)$$ Seasonality is modeled as a vector with 52 components. Assumptions: We assume the errors are Gaussian. We treat the estimates of the σ's as if they are the correct values.
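
Under these assumptions, pooling two seasonality estimates that share the same true seasonality gives a more certain estimate. The sketch below combines two 52-week estimates week by week with inverse-variance weights. It additionally assumes that errors are independent across the two estimates; the `combine` helper and the numbers are hypothetical.

```python
import numpy as np

def combine(x1, s1, x2, s2):
    """Inverse-variance weighted mean and its standard error, per week."""
    w1, w2 = 1.0 / s1**2, 1.0 / s2**2
    x = (w1 * x1 + w2 * x2) / (w1 + w2)
    s = np.sqrt(1.0 / (w1 + w2))
    return x, s

# Hypothetical estimates (x_i, sigma_i) for two classes, 52 weeks each.
x1, s1 = np.full(52, 1.0), np.full(52, 0.1)   # more precise estimate
x2, s2 = np.full(52, 1.2), np.full(52, 0.2)   # noisier estimate
x, s = combine(x1, s1, x2, s2)
print(x[0], s[0])
```

The combined standard error is smaller than either input, which is the sense in which clustering classes with the same seasonality reduces uncertainty.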

59 Illustration on simulated data: Kumar et al. generated data with 3 different seasonalities. They then combined similar products and produced estimates of seasonalities. Clustering produced much better final estimates.

60 Simulation Study: 3 different seasonalities were used to generate sales data for 300 items. All 300 items were divided into 12 classes, giving 12 estimates of seasonality coefficients along with associated errors. Clustering into three clusters was used to forecast the correct seasonalities.

61 Seasonalities (figure).

62 Initial seasonality estimates (figure).

63 Clustering: Cluster classes with similar seasonality to reduce errors. Example: men's winter shoes, men's winter coats. Standard clustering methods do not incorporate information contained in the errors: hierarchical clustering, K-means clustering, Ward's method.

64 Further Clustering: They used K-means, hierarchical clustering, and Ward's technique. They also used error-based clustering.

65 K-means, hierarchical (avg), Ward's: Result (figure).

66 Error-based Clustering: Result (figure).

67 Real Data Study: Data from the retail industry. 6 departments: books, sporting goods, greeting cards, videos, etc. 45 classes. Sales forecasts compared: without clustering, standard clustering, error-based clustering.

68 Forecast Result (an example). Figure: Sales plotted against Weeks, comparing No Clustering, Standard Clustering, and Error-based Clustering.

69 Result Statistics: $$\text{Average Forecast Error} = \frac{|\text{ForecastSale} - \text{ActualSale}|}{\text{ActualSale}}$$
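
A minimal sketch of such a statistic, reading the slide's formula as the absolute difference between forecast and actual sales divided by actual sales, averaged over weeks. The absolute value, the averaging over weeks, the function name, and the numbers are all assumptions made for the example.

```python
def average_forecast_error(forecast, actual):
    """Mean over weeks of |forecast - actual| / actual (assumed reading)."""
    if len(forecast) != len(actual) or not actual:
        raise ValueError("forecast and actual must be non-empty and equal in length")
    ratios = [abs(f - a) / a for f, a in zip(forecast, actual)]
    return sum(ratios) / len(ratios)

# Hypothetical weekly sales figures.
forecast_sales = [110.0, 90.0, 100.0, 120.0]
actual_sales = [100.0, 100.0, 100.0, 100.0]
err = average_forecast_error(forecast_sales, actual_sales)
print(err)
```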

70 Summary and Conclusion: A new clustering method that incorporates information contained in errors. It has strong theoretical justification under appropriate assumptions. It is computationally easy. It works well in practice.

71 Summary and Conclusion. Major point: if one is using clustering to reduce uncertainty, then it makes sense to use error-based clustering. Scale invariance. Error-based clustering has strong theoretical justification and works well in practice. The concept of using errors can be applied to many other applications where one has a reasonable estimate of errors.