COS 424: Interacting with Data
Lecturer: Robert Schapire
Lecture #5
Scribe: Megan Lee
February 20, 2007

1 Decision Trees

During the last class, we talked about growing decision trees from a dataset. In doing this, there are two conflicting goals:

- achieving a low training error
- building a tree that is not too large

We discussed a greedy, heuristic algorithm that tries to achieve both of these conflicting goals.

2 Classification Error

Suppose that we cut off the growing process at various points, and at each such point we evaluate the error of the tree built so far. This leads to a graph of tree size versus error, where error is the probability of making a mistake. There are two error rates to be considered:

- training error (the fraction of mistakes made on the training set)
- testing error (the fraction of mistakes made on the testing set)

The error curves are as follows:

[Figure: tree size vs. training error, and tree size vs. testing error.]
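To make these curves concrete, the following is a minimal sketch, not from the lecture, of how one might trace training and testing error as the tree is allowed to grow. The library (scikit-learn), the synthetic dataset, and the use of max_leaf_nodes as the measure of tree size are all assumptions made for illustration.

    # Hedged sketch: training vs. testing error as tree size grows.
    # Assumes scikit-learn and a synthetic dataset; not the lecture's own code.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    for n_leaves in [2, 4, 8, 16, 32, 64, 128, 256]:
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
        tree.fit(X_train, y_train)
        train_err = 1 - tree.score(X_train, y_train)  # fraction of training mistakes
        test_err = 1 - tree.score(X_test, y_test)     # fraction of testing mistakes
        print(f"{n_leaves:4d} leaves  train error {train_err:.3f}  test error {test_err:.3f}")

On data with label noise (flip_y above), the printed training error typically keeps falling as the tree grows, while the testing error falls and then rises again, which is the overfitting behavior discussed next.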

As the tree size increases, training error decreases. Testing error also decreases at first, since we expect the test data to be similar to the training data, but at a certain point the training algorithm starts training to the noise in the data and becomes less accurate on the testing data. At this point we are no longer fitting the data; we are fitting the noise in the data. Therefore, the testing error curve starts to increase once the tree is too big and too complex to perform well on testing data (an application of Occam's razor). This is called overfitting the data: the tree is fitted to spurious patterns. As the tree grows in size, it fits the training data perfectly but is of little practical use on other data, such as the testing set.

We want to choose a tree at the minimum of the testing error curve, but we do not have access to the test curve during training. We build the tree using only the training error curve, which keeps decreasing with tree size. Again, we have two conflicting goals: there is a tradeoff between training error and tree size.

3 Tree Size

There are two general methods of controlling the size of the tree:

- grow the tree more carefully and try to end the growing process at an appropriate point early on
- grow the biggest tree possible (one that completely fits the data), then prune it to be smaller (this is the more common method)

One common technique is to separate the training set into two parts, the growing set and the pruning set. The tree is grown using only the growing set; the pruning set is then used to estimate the testing error of all possible subtrees, and the subtree with the lowest error on the pruning set is chosen as the decision tree. Here the pruning set is used as a proxy for the testing set, with the hope that the pruning-set curve resembles the test curve. For example, 2/3 of the training set may be used for growing, while 1/3 is used for pruning. A disadvantage of this method is that training data is wasted, which is a serious problem if the dataset is small.

Another approach is to explicitly optimize a tradeoff between the number of errors and the size of the tree. Consider the value

    (#training errors) + constant · (size of tree)

Now there is only one value that must be minimized to determine the optimal tree. This value attempts to capture the two conflicting interests simultaneously.
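Both size-control ideas can be sketched in a few lines. The snippet below is a hypothetical illustration rather than the lecture's procedure: it uses scikit-learn's cost-complexity pruning path merely as a convenient way to enumerate a nested family of candidate subtrees, and then selects a subtree either by its error on a held-out pruning set or by the penalized count (#training errors) + constant · (tree size). The dataset, the 2/3-1/3 split, and the constant C are assumptions.

    # Hedged sketch of the two tree-size control methods from Section 3.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1500, n_features=20, flip_y=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
    # Split the training data into a growing set (about 2/3) and a pruning set (about 1/3).
    X_grow, X_prune, y_grow, y_prune = train_test_split(X_train, y_train, test_size=0.33, random_state=0)

    # Enumerate a nested family of candidate subtrees of the fully grown tree.
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_grow, y_grow).ccp_alphas
    candidates = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_grow, y_grow) for a in alphas]

    # Method 1: choose the subtree with the lowest error on the pruning set.
    best_pruned = min(candidates, key=lambda t: 1 - t.score(X_prune, y_prune))

    # Method 2: choose the subtree minimizing (#training errors) + constant * (tree size).
    C = 2.0  # assumed tradeoff constant
    def penalized(t):
        return np.sum(t.predict(X_grow) != y_grow) + C * t.get_n_leaves()
    best_penalized = min(candidates, key=penalized)

    for name, t in [("pruning-set choice", best_pruned), ("penalized choice", best_penalized)]:
        print(name, "| leaves:", t.get_n_leaves(), "| test error: %.3f" % (1 - t.score(X_test, y_test)))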

4 Assumptions in creating decision trees

As with any algorithm, there are various assumptions made when building decision trees. Three of these assumptions are that:

- The data can be described by features, such as the features of Batman characters. Sometimes we assume these features are discrete, but we can also use decision trees when the features are continuous. Binary decisions are made on the basis of continuous features by determining a threshold that divides the range of values into intervals correlated with decisions.

- The class label can be predicted using a logical set of decisions that can be summarized by the decision tree.

- The greedy procedure will be effective on the data that we are given, where effectiveness means finding a small tree with low error.

5 Decision tree history

Decision trees have been widely used since the 1980s. CART was an algorithm widely used in the statistical community, and ID3 and its successor, C4.5, were dominant in the machine learning community. These algorithms are fast, fairly easy to program, and interpretable (i.e. understandable). A drawback of decision trees is that they are generally not as accurate as other machine learning methods; we will look at some of these state-of-the-art algorithms later in the course. It is difficult to explain why decision trees are not optimally accurate. Decision trees may fail if the data is probabilistic or noisy. Features in the tree are not weighted, and simplicity is hard to control, with overfitting a constant problem while growing the tree.

6 Theory: a mathematical model for the learning problem

Until now, we have taken a very intuitive approach. Although we know that we need data, low error, and a simple rule, there are still many unresolved questions. For example, what does it mean for a rule to be simple? Why is simplicity so important? How much data is enough? What can we guarantee about accuracy? And how can we explain overfitting? We want to formalize the learning problem and define a measure of complexity.

6.1 Data

Training and testing examples should be similar. For example, if we were classifying images of handwritten digits by the digits they represent, we would want training examples of all the digits 0-9, not only of 0-5; in the latter case, the algorithm would fail miserably. Generally, the testing and training examples are similar if they are produced by the same process. The following is a formalization of this idea.

Assume that the data is random, and that the testing and training examples are generated by the same source, a distribution D. From this distribution D, we get an example x. The distribution is unknown, but all examples are IID since they originate from the same distribution. There is also a target function, c(x), that indicates the true label of each example. During training, the learning algorithm is given a set of examples x_1, ..., x_m, each drawn from the distribution D, and each example is labeled with its correct label, c(x_1), ..., c(x_m). This data is fed to the learning algorithm, which outputs a prediction rule (i.e. hypothesis) h; given a new example x, the rule outputs a prediction h(x).

We measure the goodness of the classifier by its generalization error: the probability that a new example x, chosen at random from the distribution D, is misclassified. This is equivalent to the expected test error, and is denoted err(h). Our goal is to minimize the generalization error:

    generalization error = Pr_{x ~ D}[h(x) ≠ c(x)] = E[test error] = err(h)
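This formal setup can be mirrored in a few lines of code. The sketch below is a hypothetical illustration: the distribution D, the target function c, and the simple threshold rule h are all made up for the example, and the last line estimates err(h) by Monte Carlo sampling of fresh examples from D.

    # Hedged sketch of the setup in Section 6.1.
    import numpy as np

    rng = np.random.default_rng(0)

    def draw_from_D(m):
        # Sample m examples from an assumed distribution D (illustration only).
        return rng.uniform(-1, 1, size=(m, 2))

    def c(x):
        # Assumed target function giving the true label of each example.
        return (x[:, 0] + x[:, 1] > 0).astype(int)

    # Training set: x_1, ..., x_m with correct labels c(x_1), ..., c(x_m).
    X_train = draw_from_D(100)
    y_train = c(X_train)

    # A deliberately simple learned hypothesis h: a threshold on the first feature.
    def h(x):
        return (x[:, 0] > 0).astype(int)

    # Monte Carlo estimate of err(h) = Pr_{x ~ D}[h(x) != c(x)] using fresh examples.
    X_fresh = draw_from_D(100_000)
    print("estimated generalization error:", np.mean(h(X_fresh) != c(X_fresh)))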

6.2 Complexity

Complexity here is an informal term, defined as the opposite, or lack, of simplicity. Consider two trees: h_A, which consists of five nodes, and h_B, which consists of one node. Obviously, h_A is more complex. But why is this so? One way of thinking about the complexity of these trees is to consider the length of the programs needed to compute them. That is, how many bits or ASCII characters would we need to write the tree down? The program for h_A would be far longer than the program for h_B, which would be one line. But this is a slippery idea. Suppose that the next edition of Java included a library function that exactly computed tree h_A. Now h_A requires only one line of code, and our definition of complexity is invalid. Although description length is an intuitive notion of complexity, it is a rather slippery measure.

That is, complexity should not be measured for a single rule. Instead, it is actually a class of rules that is simple or complex. In this case, h_A belongs to a class of hypotheses, H_5, which is the set of all trees with five nodes. The real reason h_A is more complex than h_B is that h_A belongs to the class H_5, which is more complex, or richer, than the class H_1, the set of trees with one node. Now, why is H_5 richer than H_1? H_5 must be bigger because it contains all trees in H_1; H_5 is the set of all possible trees that have five nodes. For now, we are assuming that trees in the same class have the same complexity. This lets us think about the complexity of a set of trees rather than the complexity of individual trees.

Our claim is that we need more data to learn with the same accuracy in a large hypothesis space than in a small hypothesis space. Essentially, the size of the hypothesis space is important:

    |H_5| > |H_1|

To illustrate this point, imagine the following class experiment. There are training objects 1-10 and testing objects 11-20. Each person in the class represents a hypothesis, and each person writes down his or her guess (0 or 1) for each object in the training set. When the true labels of the training set are revealed, each person calculates his or her error. During class, the most accurate hypothesis achieved 80% training accuracy. The two people who achieved 80% accuracy then checked their accuracy on the testing set, on which they were only 30-40% accurate. In fact, everybody's true generalization error is 50%, since the target values were generated by randomly flipping a coin. Nevertheless, since there are many people and therefore many hypotheses, there is a chance that a few people will have high accuracy during training. When there is a large set of hypotheses, there is a good chance that one of them will seem to do well on the training set, although it may still perform poorly on a testing set. In this way, the testing accuracy depends on the size of the hypothesis space. (The hypothesis space exists before any data is seen. Once we see the data, the best hypothesis in this space is determined.)
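The class experiment is easy to simulate. In the hedged sketch below, the number of hypotheses and the random seed are arbitrary assumptions; the point it reproduces is that with many hypotheses, the best training accuracy can look impressive even though every hypothesis has true generalization error 1/2.

    # Hedged simulation of the class experiment in Section 6.2.
    import numpy as np

    rng = np.random.default_rng(1)
    n_hypotheses = 100                            # each person in the class is one hypothesis
    train_labels = rng.integers(0, 2, size=10)    # true labels of training objects 1-10 (coin flips)
    test_labels = rng.integers(0, 2, size=10)     # true labels of testing objects 11-20 (coin flips)

    # Each hypothesis guesses 0 or 1 for every object.
    train_guesses = rng.integers(0, 2, size=(n_hypotheses, 10))
    test_guesses = rng.integers(0, 2, size=(n_hypotheses, 10))

    train_acc = (train_guesses == train_labels).mean(axis=1)
    test_acc = (test_guesses == test_labels).mean(axis=1)

    best = np.argmax(train_acc)
    print("best training accuracy:", train_acc[best])   # often well above 0.5
    print("its testing accuracy:  ", test_acc[best])    # close to 0.5 on average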

6.3 Theorem

Let lg|H| be a measure of the complexity (i.e. size) of the hypothesis space, in terms of the number of bits required to describe any hypothesis in the space. This complexity is proportional to the size of the tree:

    complexity = lg|H| = O(n)

where n is the number of nodes in the tree.

Theorem 1. Say that algorithm A finds a hypothesis h_A ∈ H that is consistent with all m training examples (it makes no mistakes on the training set); that is, assume that the training error is zero. Then

    err(h_A) ≤ (ln|H| + ln(1/δ)) / m

with probability at least 1 − δ (i.e., with high probability).

The small constant δ is included because the algorithm cannot be perfect: whenever we are using random data, there is the possibility of getting a weird dataset. For example, the algorithm would perform badly if the digit 7 never appeared in a training set of zip codes. In this way, there is always a small chance that an algorithm will perform very poorly. The bound says that as the amount of data, m, increases, the error decreases. Also, as the complexity increases, the error increases as well.
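For intuition about the numbers in the bound, here is a small worked example with assumed values (not from the lecture). Suppose every hypothesis in H can be written down in 20 bits, so |H| ≤ 2^20 and ln|H| ≤ 20 ln 2 ≈ 13.9. Take δ = 0.01, so ln(1/δ) ≈ 4.6. With m = 1000 consistent training examples the theorem gives

    err(h_A) ≤ (13.9 + 4.6) / 1000 ≈ 0.019,

i.e. at most about 2% generalization error with probability at least 0.99, whereas with only m = 100 examples the same calculation gives the much weaker guarantee of roughly 19%.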

The proof follows.

Proof. We want to show that, for the hypothesis h_A ∈ H returned by the algorithm,

    err(h_A) ≤ (ln|H| + ln(1/δ)) / m.

Let

    ε = (ln|H| + ln(1/δ)) / m.

Say that a hypothesis h is ε-bad if err(h) > ε. We want to show that h_A is not ε-bad with probability at least 1 − δ, that is,

    Pr[h_A is not ε-bad] ≥ 1 − δ,

so it suffices to prove

    Pr[h_A is ε-bad] ≤ δ.

We are proving a bound on this probability. Since h_A is consistent with the data,

    Pr[h_A is ε-bad] = Pr[h_A is consistent and ε-bad].

If this event holds, then there is some hypothesis in H that is consistent and ε-bad. Thus,

    Pr[h_A is consistent and ε-bad] ≤ Pr[∃ h ∈ H : h is consistent and ε-bad].    (1)

Let

    B = {h ∈ H : h is ε-bad} = {h_1, ..., h_k}.

Then (1) is equal to

    Pr[∃ h ∈ B : h is consistent] = Pr[h_1 is consistent ∨ ... ∨ h_k is consistent].

This is the probability that some ε-bad hypothesis is consistent. By the union bound, Pr[a ∨ b] ≤ Pr[a] + Pr[b], so

    Pr[h_1 is consistent ∨ ... ∨ h_k is consistent] ≤ Pr[h_1 is consistent] + ... + Pr[h_k is consistent].

Now compute the probability that any single h ∈ B is consistent with the training set:

    Pr[h is consistent] = Pr[h(x_1) = c(x_1) ∧ ... ∧ h(x_m) = c(x_m)]
                        = Pr[h(x_1) = c(x_1)] · ... · Pr[h(x_m) = c(x_m)].

This becomes a product because the examples are IID. For each x,

    Pr[h(x) = c(x)] = 1 − err(h) ≤ 1 − ε,

so

    Pr[h is consistent] ≤ (1 − ε)^m.

Putting everything together,

    Pr[h_A is consistent and ε-bad] ≤ Pr[h_1 is consistent] + ... + Pr[h_k is consistent]
                                    ≤ |B| (1 − ε)^m
                                    ≤ |H| (1 − ε)^m
                                    ≤ |H| e^{−εm}
                                    = δ,

since 1 + x ≤ e^x for all x, and since, by the definition of ε,

    ε = (ln|H| + ln(1/δ)) / m    implies    |H| e^{−εm} = δ.

We have shown that Pr[h_A is ε-bad] ≤ δ.
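As a sanity check on the argument, the hedged simulation below constructs a small, entirely artificial setting: a finite hypothesis class of random labelings (plus the target itself, so a consistent hypothesis always exists) over a finite domain, with examples drawn IID and uniformly. It estimates the probability of the event the proof bounds by δ, namely that some ε-bad hypothesis is consistent with the sample. All sizes and constants are assumptions for illustration; the estimated probability should come out far below δ, as the theorem guarantees.

    # Hedged empirical check of the event bounded in the proof of Theorem 1.
    import numpy as np

    rng = np.random.default_rng(2)
    domain_size, H_size, m, delta, trials = 100, 200, 50, 0.05, 500
    eps = (np.log(H_size) + np.log(1 / delta)) / m

    bad_event = 0
    for _ in range(trials):
        c = rng.integers(0, 2, size=domain_size)            # target labeling of the domain
        H = rng.integers(0, 2, size=(H_size, domain_size))  # random hypotheses
        H[0] = c                                            # guarantee a consistent hypothesis exists
        err = (H != c).mean(axis=1)                         # true error of each hypothesis under uniform D
        sample = rng.integers(0, domain_size, size=m)       # m IID training examples
        consistent = (H[:, sample] == c[sample]).all(axis=1)
        bad_event += int(np.any(consistent & (err > eps)))  # some consistent, eps-bad hypothesis?

    print(f"epsilon = {eps:.3f}")
    print(f"empirical Pr[some consistent eps-bad h] = {bad_event / trials:.4f} (bound: delta = {delta})")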