Outliers

Outliers are data points that lie outside the general linear pattern whose midline is the regression line. A rule of thumb is that outliers are points whose standardized residual is greater than 3.3 in absolute value (corresponding to the .001 alpha level). Removing outliers from the data set under analysis can at times dramatically affect the performance of a regression model. Outliers should be removed if there is reason to believe that other variables not in the model explain why the outlier cases are unusual; that is, these cases need a separate model. Alternatively, outliers may suggest that additional explanatory variables need to be brought into the model (that is, the model needs respecification). Another alternative is to use robust regression, whose algorithm gives less weight to outliers but does not discard them.

The leverage statistic, h, also called the hat-value, identifies cases that influence the regression model more than others. Belsley, Kuh, and Welsch (1980) define the leverage (h_i) of the ith observation, for a regression on a single independent variable x, as

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)\,s_x^2}$$

Leverage assesses how far a value of the independent variable is from its mean: the farther away the observation, the more leverage it has. From the definition you can see that leverage is mitigated by a larger sample size (any single point should have less influence) and by a larger variance of the independent variable (again, any single point should have less influence).

$$0 < h < 1$$

The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model). A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately.

STATA command:

    predict h, hat

Cook's distance, D, is another measure of the influence of a case. Cook's distance measures the effect of deleting a given observation. Observations with larger D values than the rest of the data are those which have unusual leverage. D > 4/n is the criterion that indicates a possible problem.

STATA command:

    predict D, cooksd
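The leverage definition above can be tried on a small data set. The following is a minimal Python sketch, not part of the original handout; the function name `leverage` and the data values are hypothetical, chosen so that one case lies far from the mean of x.

```python
def leverage(x):
    """Hat-values for a regression on one independent variable.

    Implements h_i = 1/n + (x_i - mean)^2 / ((n - 1) * s_x^2),
    where s_x^2 is the sample variance of x.
    """
    n = len(x)
    mean = sum(x) / n
    # Sum of squared deviations, which equals (n - 1) * s_x^2.
    ss = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / ss for xi in x]


# Hypothetical data: nine clustered values and one far from the mean.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverage(x)

for xi, hi in zip(x, h):
    # Apply the handout's rule of thumb: leverage over .5 is undue.
    note = "examine" if hi > 0.5 else ""
    print(f"x = {xi:>2}  h = {hi:.3f}  {note}")
```

Only the distant case crosses the .5 rule of thumb here. Adding more clustered cases, or spreading the clustered values more widely, lowers its leverage, which is the mitigation by sample size and variance described above.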

DFBETAS is another statistic for assessing the influence of a case. If DFBETAS > 0, the case increases the slope; if DFBETAS < 0, the case decreases the slope. The case may be considered an influential outlier if $|\mathrm{DFBETAS}| > 2/\sqrt{n}$.

STATA command:

    dfbeta

creates DFBETAs for all variables, or, for an individual variable,

    predict DFx1, dfbeta(x1)

DFITS measures how much the fitted value for a case changes when that observation is dropped from the analysis. DFITS is defined as

$$\mathrm{DFITS}_i = \mathrm{Rstudent}_i \sqrt{\frac{h_i}{1 - h_i}}$$

where Rstudent is the studentized residual. The case may be considered an influential outlier if $|\mathrm{DFITS}| > 2\sqrt{k/n}$.

STATA command:

    predict DFITS, dfits

Studentized residuals and deleted studentized residuals are also used to detect outliers with high leverage. A "studentized residual" is the observed residual divided by its standard deviation. The "studentized deleted residual," also called the "jackknife residual," is the observed residual divided by the standard deviation computed with the given observation left out of the analysis. Analysis of outliers usually focuses on deleted residuals. Other synonyms include externally studentized residual or, misleadingly, standardized residual. There will be a t value for each residual, with df = n - k - 1, where k is the number of independent variables. When t exceeds the critical value for a given alpha level (e.g., .05), the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range where 95% of the cases normally lie; points substantially off the straight line are potential leverage problems.

STATA command:

    predict student, rstudent

or

    predict standard, rstandard

Partial regression plots, also called partial regression leverage plots or added variable plots, are yet another way of detecting influential sets of cases. Partial regression plots are a series of bivariate plots of the dependent variable against each of the independent variables in turn, with both variables adjusted for the remaining independent variables. The plots show cases by number or label instead of dots. One looks for cases which are outliers on all or many of the plots.
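To make the construction of a partial regression plot concrete, here is a minimal Python sketch with two independent variables. It is an illustration under stated assumptions, not the handout's own code: the data and the helper names (`simple_fit`, `residuals`, `coef_x1`) are hypothetical. The points of the plot for x1 are the residuals of y given x2 against the residuals of x1 given x2, and the slope through those points should match the coefficient on x1 in the full regression.

```python
def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def coef_x1(x1, x2, y):
    """Coefficient on x1 in the regression of y on x1 and x2,
    from the two-regressor normal equations on centered data."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


# Hypothetical data, roughly y = 1 + 2*x1 - 3*x2 plus small disturbances.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
noise = [0.5, -0.5, 0.25, -0.25, -0.5, 0.5, -0.25, 0.25]
y = [1 + 2 * a - 3 * b + e for a, b, e in zip(x1, x2, noise)]

e_y = residuals(x2, y)    # vertical axis: y adjusted for x2
e_x1 = residuals(x2, x1)  # horizontal axis: x1 adjusted for x2
_, av_slope = simple_fit(e_x1, e_y)

print("slope in the partial regression plot:", round(av_slope, 6))
print("coefficient on x1 in full regression:", round(coef_x1(x1, x2, y), 6))
```

A case that sits far from the others along the horizontal axis of such a plot has high leverage for that particular coefficient, which is why the plots are read one variable at a time.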

STATA command:

    avplots

Example using the Murder.dta data set.

    reg mrdrte exec unem d90 d93

          Source |       SS       df       MS          Number of obs =     153
                 |                                     F(  4,   148) =    3.05
           Model |  977.390644     4  244.347661       Prob > F      =  0.0190
        Residual |  11867.9475   148  80.1888343       R-squared     =  0.0761
                 |                                     Adj R-squared =  0.0511
           Total |  12845.3381   152  84.5088034       Root MSE      =  8.9548

          mrdrte |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
            exec |   .1627547   .1939295     0.84   0.403    -.2204738    .5459832
            unem |   1.390786   .4508653     3.08   0.002     .4998207    2.281751
             d90 |   2.675335   1.816934     1.47   0.143      -.91515     6.26582
             d93 |   1.607317   1.774768     0.91   0.367    -1.899842    5.114476
           _cons |  -1.864393   3.069517    -0.61   0.545    -7.930134    4.201349

    predict e, residual
    predict yhat
    predict standard, rstandard
    predict student, rstudent
    predict h, hat
    predict D, cooksd
    predict DFITS, dfits
    predict W, welsch
    dfbeta

        DFexec: DFbeta(exec)
        DFunem: DFbeta(unem)
         DFd90: DFbeta(d90)
         DFd93: DFbeta(d93)
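Several of the quantities saved by the predict commands above are linked by the DFITS definition given earlier: DFITS is the studentized residual scaled by the square root of h/(1 - h). The short Python sketch below illustrates that scaling. The function name `dfits` and the (studentized residual, leverage) pairs are hypothetical and are not values from Murder.dta.

```python
import math


def dfits(rstudent, h):
    """DFITS_i = Rstudent_i * sqrt(h_i / (1 - h_i))."""
    return rstudent * math.sqrt(h / (1.0 - h))


# Hypothetical (studentized residual, leverage) pairs.
cases = [(0.8, 0.02), (2.5, 0.02), (0.8, 0.40), (2.5, 0.40)]

for r, h in cases:
    print(f"rstudent = {r:.1f}  h = {h:.2f}  DFITS = {dfits(r, h):.3f}")
```

The same residual yields a much larger DFITS when leverage is high, so a case needs both a sizable residual and appreciable leverage before it moves its own fitted value very far.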

    rvfplot

[Figure: residuals versus fitted values]

To create a standardized residual plot:

    graph twoway scatter standard yhat, yline(0)

[Figure: standardized residuals versus fitted values]

To identify the outliers, label each point with its state:

    graph twoway scatter standard yhat, yline(0) mlabel(state)

[Figure: standardized residuals versus fitted values, points labeled by state]

    avplots, mlabel(state)

[Figure: added variable plots, points labeled by state, one panel per independent variable:
  e( mrdrte | X ) versus e( exec | X ): coef = .16275467, se = .19392954, t = .84
  e( mrdrte | X ) versus e( unem | X ): coef = 1.3907856, se = .45086525, t = 3.08
  e( mrdrte | X ) versus e( d90 | X ): coef = 2.6753348, se = 1.8169343, t = 1.47
  e( mrdrte | X ) versus e( d93 | X ): coef = 1.6073174, se = 1.774768, t = .91]

To show the influence of each observation graphically, weight the marker size by Cook's D:

    graph twoway scatter standard yhat [aweight=D], msymbol(oh) yline(0)

[Figure: standardized residuals versus fitted values, marker size weighted by Cook's D]

The leverage versus residual squared plot marks the means of leverage and squared residuals. Leverage tells us how much potential for influencing the regression an observation has.

    lvr2plot

To examine the numerical measures for outliers, compare each statistic with its cutoff:

    leverage:    $h > 2k/n$
    Cook's D:    $D > 4/n$
    DFITS:       $|\mathrm{DFITS}| > 2\sqrt{k/n}$
    Welsch's W:  $|W| > 3\sqrt{k}$
    DFBETA:      $|\mathrm{DFBETA}| > 2/\sqrt{n}$

For example,

    sort D
    list state yhat D DFITS W in -5/l

          state       yhat          D       DFITS          W
    149.          13.15609   .0150447   -.2742131   -3.513339
    150.          6.897557   .0452811    .4927538    6.137678
    151.          15.01208   .0504226    -.500824   -8.794935
    152.          9.990127   .2905912    1.547051     19.3077
    153.           11.5646   .4116747    1.832155     22.9866

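Recall that D > 4/n indicates a possible problem. With n = 153, the cutoff is D > .02614379. A case with $|\mathrm{DFITS}| > 2\sqrt{k/n}$ may be considered an influential outlier. With n = 153 and k = 4, the cutoff is |DFITS| > .3234. In the listing above, observations 150 through 153 exceed both cutoffs.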
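The cutoff arithmetic can be checked with a few lines of Python. This is a sketch rather than part of the original Stata session. It assumes the conventional forms of the cutoffs (4/n for Cook's D, 2·sqrt(k/n) for DFITS, 3·sqrt(k) for Welsch's W, 2/sqrt(n) for DFBETA, 2k/n for leverage) and takes k as the number of independent variables, as the handout does. The D, DFITS and W values are copied from the five listed cases. The names `cutoffs`, `rows` and `flagged` are illustrative.

```python
import math

n, k = 153, 4  # observations and independent variables in the example

# Conventional cutoffs (assumed forms, see the lead-in).
cutoffs = {
    "h": 2 * k / n,
    "D": 4 / n,
    "DFITS": 2 * math.sqrt(k / n),
    "W": 3 * math.sqrt(k),
    "DFBETA": 2 / math.sqrt(n),
}

for name, value in cutoffs.items():
    print(f"{name:>6} cutoff = {value:.7f}")

# The five cases with the largest D from the listing: (obs, D, DFITS, W).
rows = [
    (149, 0.0150447, -0.2742131, -3.513339),
    (150, 0.0452811, 0.4927538, 6.137678),
    (151, 0.0504226, -0.500824, -8.794935),
    (152, 0.2905912, 1.547051, 19.3077),
    (153, 0.4116747, 1.832155, 22.9866),
]

# Record which statistics exceed their cutoff in absolute value.
flagged = {}
for obs, d, dfits_value, w in rows:
    stats = (("D", d), ("DFITS", dfits_value), ("W", w))
    flagged[obs] = [name for name, v in stats if abs(v) > cutoffs[name]]
    print(obs, flagged[obs])
```

Comparing absolute values matters for DFITS and W, since both can be negative, as observations 149 and 151 show.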