Big Data Analysis. Rajen D. Shah (Statistical Laboratory, University of Cambridge), joint work with Nicolai Meinshausen (Seminar für Statistik, ETH Zürich)


1 Big Data Analysis Rajen D Shah (Statistical Laboratory, University of Cambridge) joint work with Nicolai Meinshausen (Seminar für Statistik, ETH Zürich) University of Cambridge Mathematical Sciences Showcase 29 January 2014 Rajen Shah (Cambridge) Big Data Analysis 29 Jan / 25


2 What is Big Data?
The size of the data is such that computational considerations become important when choosing what algorithm to use.

3 Large-scale classification with binary data
Customer 1a: {cheese, eggs, juice, milk, ...}
Customer 2a: {cereal, cheese, eggs, milk, ...}
Customer 3a: {cheese, crisps, eggs, milk, ...}
Customer 1b: {cereal, eggs, juice, milk, ...}
Customer 2b: {book, cheese, crisps, DVD, eggs, ...}
Customer 3b: {cheese, crisps, juice, milk, ...}
Given two groups of customers, the aim is to find a collection of items that is often bought together within one group, but only rarely in the other group.

4 Large-scale classification with binary data
Can also imagine having two groups of documents. Here the aim would be to find groups of words that occur frequently together in one class, but not in the other.
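A minimal sketch of this aim, using only the items listed for the six customers. The function name `prevalence` and the scoring rule (difference in prevalence between the two groups) are assumptions made for illustration, not something the slides specify. Every subset of up to three items is scored by brute force, which is feasible here only because there are eight items.

```python
from itertools import combinations

# Baskets copied from the customer example; only the listed items are used.
group_a = [
    {"cheese", "eggs", "juice", "milk"},
    {"cereal", "cheese", "eggs", "milk"},
    {"cheese", "crisps", "eggs", "milk"},
]
group_b = [
    {"cereal", "eggs", "juice", "milk"},
    {"book", "cheese", "crisps", "DVD", "eggs"},
    {"cheese", "crisps", "juice", "milk"},
]


def prevalence(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


# All items seen in either group.
items = sorted(set().union(*group_a, *group_b))

# Assumed score: prevalence in group a minus prevalence in group b.
best = max(
    (frozenset(c) for d in (1, 2, 3) for c in combinations(items, d)),
    key=lambda s: prevalence(s, group_a) - prevalence(s, group_b),
)

print(sorted(best), prevalence(best, group_a), prevalence(best, group_b))
# ['cheese', 'eggs', 'milk'] 1.0 0.0
```

Each of cheese, eggs and milk is common in both groups on its own; only the combination separates them.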

5 Alternative view of the data
      book  cereal  cheese  crisps     DVD    eggs   juice    milk
1a       0       0       1       0       0       1       1       1
2a       0       1       1       0       0       1       0       1
3a       0       0       1       1       0       1       0       1
1b       0       1       0       0       0       1       1       1
2b       1       0       1       1       1       1       0       0
3b       0       0       1       1       0       0       1       1

6 Searching for subsets
(Slides 6 to 8 repeat the customer-by-item binary table with columns book, cereal, cheese, crisps, DVD, eggs, juice, milk.)


10 Scaling of the algorithm
With p different variables, the number of subsets of size 2 (or potential two-way interactions) is roughly p^2/2. The number of three-way interactions is roughly p^3/6. In general, the number of d-way interactions will be O(p^d).
Take p = 10,000. Then the number of two-way interactions is roughly 5 × 10^7, i.e. 50 million. Three-way interactions: > 10^11, i.e. 100 billion.
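The counts can be checked directly. This short sketch compares the exact number of d-way interactions, the binomial coefficient "p choose d", with the approximation p^d/d!. The helper name `n_interactions` is an assumption introduced for illustration.

```python
from math import comb, factorial


def n_interactions(p, d):
    """Exact number of d-way interactions among p variables: p choose d."""
    return comb(p, d)


p = 10_000
for d in (2, 3):
    exact = n_interactions(p, d)
    approx = p**d / factorial(d)  # the rough count p^d / d!
    print(f"d={d}: exact={exact:,}  approx={approx:.3g}")
```

For p = 10,000 the exact counts are 49,995,000 pairs and 166,616,670,000 triples, in line with the rough figures of 50 million and more than 100 billion.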

11 Restricting the search space
(Slides 11 to 17 repeat the customer-by-item binary table with columns book, cereal, cheese, crisps, DVD, eggs, juice, milk.)

18 Decision trees and related methods
Decision trees like CART (Breiman et al., 1984) build up interactions / patterns greedily, starting from the individual variables. They often work well, but give no guarantee that a strong interaction will be found. Alternative strategies such as linear models, logistic regression and association rule mining techniques all have the same drawback.

19 Toy example where most current methods fail
(Binary data table with observations in rows and Var1 to Var9 in columns.)

20 Idea: look at rows rather than columns
(Slides 20 and 21 show a binary data table with observations in rows and Var1 to Var9 in columns.)
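A small sketch of the row-wise idea, using the customer baskets from the earlier example rather than the toy table. Instead of enumerating subsets of columns, it intersects the rows (observations) of one group: any item set bought by every customer in the group survives the intersection. The helper name `common_items` is an assumption for illustration.

```python
from functools import reduce

# Baskets copied from the customer example; only the listed items are used.
group_a = [
    {"cheese", "eggs", "juice", "milk"},
    {"cereal", "cheese", "eggs", "milk"},
    {"cheese", "crisps", "eggs", "milk"},
]
group_b = [
    {"cereal", "eggs", "juice", "milk"},
    {"book", "cheese", "crisps", "DVD", "eggs"},
    {"cheese", "crisps", "juice", "milk"},
]


def common_items(baskets):
    """Items present in every basket: the intersection of all rows."""
    return reduce(set.intersection, baskets)


print(sorted(common_items(group_a)))  # ['cheese', 'eggs', 'milk']
print(sorted(common_items(group_b)))  # []
```

Two set intersections recover the three-item pattern in group a, with no need to enumerate the 56 possible triples of the eight items.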

22 Arrange the search in a tree
Computing intersections between large sets can be time-consuming. However, computing an intersection between a small and a large set is very cheap. We should try to re-use intersections between large sets that we have calculated.

23 Example: Tic-Tac-Toe Data
Dataset with endgames of Tic-Tac-Toe games. Learn the rules of the game (or probabilities of winning) by looking at the database. Each variable is coded as binary (e.g. is the first square occupied by a black stone?). Marginal effects are weak.

24 Arranging the search on a tree
Random Intersection Tree: intersections are shown in the nodes, random observations along the edges.
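A rough sketch of how such a tree might be built, assuming a fixed depth and branching factor. The function name, the parameters `depth` and `branching`, and the synthetic data are all hypothetical choices for illustration, and the sketch may differ from the authors' algorithm. Each node stores an intersection, each edge draws a random observation, and an intersection computed at a node is re-used by all of its descendants.

```python
import random
from collections import Counter


def random_intersection_tree(observations, depth, branching, rng):
    """Return the non-empty item sets found at the leaves of one random tree.

    The root holds one randomly drawn observation. Each child holds its
    parent's set intersected with a freshly drawn random observation.
    """
    level = [frozenset(rng.choice(observations))]
    for _ in range(depth):
        next_level = []
        for node in level:
            if not node:
                continue  # an empty intersection cannot recover anything
            for _ in range(branching):
                next_level.append(node & rng.choice(observations))
        level = next_level
    return {node for node in level if node}


# Hypothetical synthetic class: 50 sparse binary variables, with a planted
# three-way interaction present in about 90% of the observations.
rng = random.Random(0)
planted = frozenset({3, 17, 42})
p, n = 50, 200


def draw_observation():
    noise = {j for j in range(p) if rng.random() < 0.1}
    if rng.random() < 0.9:
        return frozenset(noise | planted)
    return frozenset(noise)


class_a = [draw_observation() for _ in range(n)]

# Count in how many of 20 trees each leaf set appears.
counts = Counter()
for _ in range(20):
    counts.update(random_intersection_tree(class_a, depth=5, branching=3, rng=rng))

for itemset, hits in counts.most_common(3):
    print(sorted(itemset), hits)
```

Because the sets shrink quickly as one moves down the tree, most intersections involve one small set, which is the cheap case. A full method would also check each candidate's prevalence in the other class.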

25 Analysis of the algorithm
We can try to study the average number of operations required to recover an interaction with a given probability. One can show that for the coin flipping toy example, the number of operations required is just over O(p) (there are some additional log(p) terms). More generally, whenever the prevalence of the target interaction is high and the data is sparse, Random Intersection Trees tend to perform quite well.

26 Analysis of the algorithm
To speed up evaluation of the prevalence of candidate interactions, we use a method based on a technique from Computer Science called min-wise hashing (Broder, 1998).
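The slides do not say how min-wise hashing is adapted to prevalence estimation, so the following sketch shows only the basic technique. Under a random permutation, the probability that two sets share the same minimum element equals their Jaccard similarity |A ∩ B| / |A ∪ B|. All names, sizes and example sets below are hypothetical.

```python
import random


def minhash_signature(items, permutations):
    """Minimum permuted index of `items` under each permutation."""
    return [min(perm[x] for x in items) for perm in permutations]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


rng = random.Random(0)
universe_size = 1000     # hypothetical number of observations
num_permutations = 200   # hypothetical signature length

# Explicit random permutations of the observation indices.
permutations = []
for _ in range(num_permutations):
    perm = list(range(universe_size))
    rng.shuffle(perm)
    permutations.append(perm)

# Hypothetical sets of observation indices in which two variables are active.
rows_x = set(range(0, 300))
rows_y = set(range(150, 450))

exact = len(rows_x & rows_y) / len(rows_x | rows_y)  # 150 / 450
estimate = estimate_jaccard(
    minhash_signature(rows_x, permutations),
    minhash_signature(rows_y, permutations),
)
print(f"exact={exact:.3f}  estimate={estimate:.3f}")
```

Once the signatures are stored, comparing two sets costs time proportional to the signature length rather than to the size of the sets.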

27 Discussion
Simple problems in the small data setting can become interesting in the big data setting. Sometimes solutions to big data problems must draw on ideas from both Computer Science and Statistics.
