Detection of Health Insurance Fraud with Discrete Choice Model: Evidence from Medical Expense Insurance in China




Abstract: Health insurance fraud increases inefficiency and inequality in our society. To address this widespread problem, cost-effective techniques are needed to detect fraudulent claims. With a dataset from medical expense insurance in China, we propose a discrete choice model to identify predicting factors of fraudulent claims, and we address the major limitations of the discrete choice model by considering oversampling of fraudulent cases as well as mislabeling of fraudulent claims as legitimate (omission error). Our results show that a few factors, such as the hospital's qualification and the policyholder's renewal status, could be used to predict fraudulent claims for further investigation.

Key words: Medical expense insurance, insurance fraud, discrete choice model, predicting factor, omission error

1 Introduction

Health insurance is a critical mechanism for financing healthcare needs in a modern society. Health insurance fraud comes as an unwanted byproduct, contributing to rising health insurance costs and resulting in significant social welfare loss. According to the Global Health Care Anti-fraud Network (GHCAN), health insurance fraud has become a worldwide problem suffered by both developed countries with sophisticated healthcare systems and developing countries with emerging health insurance markets. Globally, it is estimated that the annual total cost of health insurance fraud could reach $260 billion, or 6% of global healthcare spending. [1] In the U.S. alone, it is estimated that health insurance fraud costs up to $80 billion annually, accounting for 3% of annual national health care spending. [2] In an emerging market such as China, the commercial health insurance market is still at a nascent stage in terms of premium income [3], but fraud is already widespread, causing losses equal to 10%-30% of premium income (Mao, 2008; Munich Re, 2013). [4] The China Insurance Regulatory Commission (CIRC) estimated that the growth rate of insurance fraud cases was around 20% in 2011, and in response to this rising problem CIRC proposed to build its own insurance anti-fraud system in 2012. [5]

Multiple stakeholders should be involved to detect fraudulent claims effectively and accurately, including academia, the insurance industry, regulatory institutions and international organizations such as the GHCAN. In a developed market, all stakeholders work coherently and develop an advanced fraud detection system using abundant data and predictive analytics to provide efficient fraud management. [6] In an emerging market such as China, the typical procedure to detect health insurance fraud still follows simple guidance criteria such as a claim amount threshold, and then largely relies on the experience and skill of an individual claim adjuster to perform a manual investigation. Both efficiency and accuracy could be improved dramatically with an automated fraud detection system.
Despite the urgent need, as far as we know, there has been no study focused on health insurance fraud in China yet. We attempt to fill this gap and provide evidence on contributing factors in predicting health insurance fraud in this emerging market. We develop our hypotheses and theoretical background in the following section, then present our data and the empirical models and discuss the results. Concluding remarks are presented at the end.

[1] http://www.ghcan.org/challenge.html
[2] http://www.fbi.gov/about-us/investigate/white_collar/health-care-fraud
[3] In 2012, the premium income of health insurance in China was 86.3 billion Yuan ($14 billion), comprising merely 8% of the total life insurance premium.
[4] http://www.docin.com/p-718280146.html
[5] http://www.circ.gov.cn/tabid/5171/infoid/219312/frtid/5225/default.aspx
[6] http://www.fico.com/en/products/fico-insurance-fraud-manager-health-care-edition/

2 Theory and Hypotheses Development

2.1 Overview on Fraud Detection Methodology

The methods of detecting insurance fraud fall largely into two groups. Supervised learning methods make use of prior information on the dependent variable (fraudulent or legitimate) in a training subset of data to obtain patterns in predicting variables. Examples of supervised learning methods include discrete choice models (Artís et al., 1999, 2002; Belhadji et al., 2000; Caudill et al., 2005), other standard econometric models (Weisberg and Derrig, 1991, 1995, 1998), expert systems (Major and Riedinger, 2002; Stefano and Gisella, 2001), as well as active learning and cost-sensitive learning methods. Unsupervised learning methods do not rely on a predetermined status of the dependent variable but extract information from the predicting variables directly. Examples include cluster analysis, unsupervised neural networks (Brockett et al., 1998) and other data mining methods (Kou et al., 2004; Yamanishi, 2004).

Compared to unsupervised methods, supervised methods tend to be more accurate since additional information on the dependent variable is employed in the training sample. They have three major limitations, however: first, it could be difficult and (or) costly to obtain labels for the training sample; second, due to the nature of fraud, unbalanced data (too few fraudulent cases compared with legitimate ones) is almost inevitable and requires specific treatment; third, the labeling of the dependent variable could be inaccurate (misclassification problem).

In our study, we obtained a dataset with prior information on whether each claim is fraudulent, therefore the choice of a supervised learning method is natural. Among different supervised methods, we choose the discrete choice model. It is straightforward to use, and the results can be easily interpreted. In addition, we use weighted exogenous sampling maximum likelihood estimation to address the oversampling of fraudulent claims in our sample, and further consider omission error to address the inaccuracy of the predetermined labeling of the dependent variable.

2.2 Literature Summary on Health Insurance Fraud Predicting Indicators

In the area of detecting insurance fraud, various methods are applied in different lines of products, as shown in Table 1.

Table 1 Summary of relevant literature (categories: methodology; empirical studies in auto insurance, health insurance, and other lines (BI in auto))

Supervised learning methods: Derrig (2002); Hausman et al. (1998); Li et al. (2008); Manski and Lerman (1977); Artís et al. (1999); Artís et al. (2002); Belhadji et al. (2000); Caudill et al. (2005); Derrig and Ostaszewski (1995); Stefano and Gisella (2001); He et al. (1997); Liou et al. (2008); Viaene et al. (2002); Weisberg and Derrig (1990, 1993, 1995, 1998)

Unsupervised learning methods: Brockett et al. (1998); Kou et al. (2004); Yamanishi (2004); He et al. (2000); Liou et al. (2008); Major and Riedinger (2002); Ortega et al. (2006); Shin et al. (2012); Yamanishi et al. (2004); Yang and Hwang (2006); Ai et al. (2009); Brockett et al. (2002)

While there is a series of empirical studies on insurance fraud in auto lines (for either property damage or bodily injury claims) (Artís et al., 1999; Brockett et al., 2002; Caudill et al., 2005; Derrig and Ostaszewski, 1995), scholars have started to present findings in health insurance as data become available (He et al., 1997; Liou et al., 2008; Major and Riedinger, 2002; Yamanishi et al., 2004). As suggested by Li et al. (2008), due to legal issues or concerns over privacy protection, papers presenting details on indicators for health care fraud are scarce.

Most of the existing studies employ unsupervised learning methods. Major and Riedinger (1992) analyzed Electronic Fraud Detection (EFD), a system used by an insurance company, which provides a general framework for classifying health insurance fraud indicators. It includes five categories, i.e. financial indicators, medical logic indicators (whether a medical situation would normally happen), abuse indicators (frequency of treatment), logistics indicators (the place, timing and sequence of activities), and identification indicators (the way providers present information). Specifically for health fraud committed by medical laboratories, Yamanishi et al. (2004) used an outlier detection method to identify the distribution of test categories (chemical, microbiology, and immunology), the number of different patients, and the test frequency as potential indicators to detect fraud. Regarding abusive utilization in outpatient clinics, Shin et al. (2012) use a scoring model to detect abusive outpatient billing patterns using profiling information extracted from electronic insurance claims in South Korea. They rely on domain experts to generate an index to decide whether further investigation is warranted. The index includes measures of the composition of various charges (total utilization, medications, injections, laboratory tests, and diagnostic radiology), total charges for the five most frequent diagnoses, rates of utilization of specific services (antibiotics and corticosteroids), and utilization of visits and prescription drugs. Similar to Shin et al. (2012), Liou et al. (2008) also take the healthcare provider as the unit for examining fraudulent medical claims, and use three different approaches: logistic regression, neural network and classification trees. They use nine variables (average days of drug dispense, average drug cost, average consultation and treatment fees, average diagnosis fees, average dispensing service fees, average medical expenditure, average amount claimed, average drug cost per day, and average medical expenditure per day) and find eight of the nine to be significant predictors.
Our study differs from the previous literature in three ways. First, we adopt a more sophisticated discrete choice model to detect medical fraud, a type of method that was not frequently used before. Second, we are the first to focus on a product providing inpatient medical expense insurance in China. Third, we focus on indicators of individual fraudulent behavior (the insured) rather than institutional behavior (the healthcare provider), therefore our results could provide more informative inference for insurers.

2.3 Hypotheses Development for Specific Indicators

We choose characteristics of the healthcare provider and service (type of hospital, number of days in hospital this time and previously under this product, total cost, and composition of total cost across bed charge, medicine, care, diagnosis, treatment, operation and lab test) and characteristics of the policy (coverage, renewal status, claim duration, file duration, previous claim frequency, etc.) as our fraud indicators, controlling for demographics of the insured (sex, age, occupation, marital status, and income). Specifically, we hypothesize that:

a. A few variables defining the nature of the hospital are predictive of medical fraud. The type/ranking of the hospital could be predictive of fraud. Lower-ranked community clinics could be networked more easily and are therefore more prone to fraudulent behavior than the nationwide top hospitals (ranked III-A). In addition, if a hospital is a qualified provider under the insurance contract, the probability of fraud would decrease. Furthermore, if the policyholder seeks service from a recommended provider, the probability of fraud should also decrease.
b. The number of days stayed in hospital and total cost for the current stay. These two variables depend on each other to some extent. We hypothesize that the more days the patient spends in hospital, or the larger the bill for the stay, the more likely it is a signal of fraudulent behavior. As these signals draw the attention of the claim adjuster, there is a higher probability of fraudulent behavior being discovered.
c. Composition of cost. The total cost consists of seven categories: bed charge, medicine charge, diagnosis charge, treatment charge, test charge, operation charge and charge for care (labor) delivered.
If one or a few categories are dominant in the total cost, it could be a potential signal of a fraudulent claim.
d. Coverage type. If it is a planned fraud, the fraudster may tend to purchase a policy with a higher limit and more comprehensive coverage; we therefore hypothesize a positive correlation between coverage type and fraud.
e. Renewal status, number of days stayed in hospital in previous claims, and number of claims filed previously. These three variables indicate the history of a given insured with the product. We hypothesize that a renewed customer is less likely to commit fraud. Furthermore, if the insured filed claims previously, then he/she has undergone claim auditing before, which diminishes the probability of fraud.
f. Number of other policies with the same company. We hypothesize that if the customer bought other policies (such as auto insurance), then he/she is less likely to commit fraud, because information gathered from other policies could be used by the insurer in claim auditing.
g. Self-claim preparation. If a claim is filed and the materials are prepared by the insured himself/herself, we hypothesize that the probability of fraud would diminish.
h. Claim duration. It is the number of days between policy commencement and hospitalization. If it is a planned fraud, the fraudster tends to shorten the claim duration; therefore there is a negative correlation between claim duration and fraud.
i. File duration. It is the number of days between hospitalization and submission of complete claim files. For fraudulent claims, it might take longer to forge the materials, resulting in a positive correlation between file duration and fraud.

3 Data

3.1 Medical Expense Insurance Fraud in China

There are three main types of health insurance products in China, namely medical expense insurance, critical illness insurance and accident insurance with health expense coverage. We chose medical expense insurance as our target product because it is the dominant health insurance product, and fraud is more prevalent than in the other two products. We obtained data on an individual inpatient medical expense insurance product from a leading health insurance company in China. Insured persons aged between 28 days and 59 years are eligible to purchase this product. It is designed with three levels of coverage, with the premium depending on age, gender and coverage level. The coverage limits in various sub-categories are described in Table 2. There is no deductible, and the copayment percentage is 20%. An additional coverage of 5% of the medical expense claim payoff is provided if the insured seeks healthcare from a recommended hospital.

Table 2 Insurance coverage for individual medical expense insurance

| Medical expense coverage in subcategories (in Yuan) | Limit | Low coverage | Medium coverage | High coverage |
|---|---|---|---|---|
| Bed charge | Average daily limit | 50 | 80 | 100 |
| Bed charge | Total limit | 4,500 | 7,200 | 9,000 |
| Medicine charge | Average daily limit | 100 | 150 | 200 |
| Medicine charge | Total limit | 9,000 | 13,500 | 18,000 |
| Care charge | | 200 | 500 | 900 |
| Diagnosis charge | | 200 | 500 | 900 |
| Treatment charge | | 1,500 | 3,000 | 4,500 |
| Lab charge | | 2,000 | 4,000 | 6,000 |
| Operation charge | | 2,000 | 4,000 | 6,000 |

Additional coverage (in Yuan): additional 5% of medical expense claim payoff.

There exists a range of definitions for health insurance fraud, from hard fraud in the form of criminal actions to soft fraud in the form of over-utilization or over-estimation of existing expense (Ai et al., 2009). In this product, the major types of fraud include concealing a pre-existing condition, forgery of medical expense receipts and documents, as well as inflating days of inpatient service. There is virtually no consensus on the definition of insurance fraud in the existing literature. We use the insurer's decision as a proxy for insurance fraud in model 1 and adjust for the insurer's omission error in model 2.
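The coverage rules of Section 3.1 and Table 2 can be illustrated with a small calculation. The sketch below is hypothetical: the paper does not spell out the order in which limits, copayment and the recommended-hospital bonus are applied, so it assumes that each category is capped first (daily limits multiplied by the length of stay), that the insurer then pays 80% of the capped amount, and that the 5% bonus is applied to that payoff. The function name `claim_payoff` and the example expenses are invented for illustration.

```python
# Limits in Yuan copied from Table 2, keyed by coverage level.
LIMITS = {
    "low": {"bed_daily": 50, "bed_total": 4500, "medicine_daily": 100, "medicine_total": 9000,
            "care": 200, "diagnosis": 200, "treatment": 1500, "lab": 2000, "operation": 2000},
    "medium": {"bed_daily": 80, "bed_total": 7200, "medicine_daily": 150, "medicine_total": 13500,
               "care": 500, "diagnosis": 500, "treatment": 3000, "lab": 4000, "operation": 4000},
    "high": {"bed_daily": 100, "bed_total": 9000, "medicine_daily": 200, "medicine_total": 18000,
             "care": 900, "diagnosis": 900, "treatment": 4500, "lab": 6000, "operation": 6000},
}
COPAYMENT = 0.20          # share borne by the insured (Section 3.1)
RECOMMENDED_BONUS = 0.05  # additional coverage for recommended hospitals


def claim_payoff(level, days, expenses, recommended_hospital):
    """Hypothetical payoff rule: cap each category, apply copayment, then the bonus."""
    lim = LIMITS[level]
    covered = 0.0
    # Bed and medicine charges: assumed capped by daily limit x days and by the total limit.
    for cat in ("bed", "medicine"):
        covered += min(expenses.get(cat, 0.0), lim[f"{cat}_daily"] * days, lim[f"{cat}_total"])
    # Remaining categories: capped by a single limit each.
    for cat in ("care", "diagnosis", "treatment", "lab", "operation"):
        covered += min(expenses.get(cat, 0.0), lim[cat])
    payoff = covered * (1 - COPAYMENT)
    if recommended_hospital:
        payoff *= 1 + RECOMMENDED_BONUS
    return payoff


# Invented example: a 10-day stay under low coverage.
example_expenses = {"bed": 800, "medicine": 900, "lab": 2500}
payoff_plain = claim_payoff("low", 10, example_expenses, recommended_hospital=False)
payoff_recommended = claim_payoff("low", 10, example_expenses, recommended_hospital=True)
print(payoff_plain, payoff_recommended)
```

Under these assumptions the bed charge is capped at 500 Yuan and the lab charge at 2,000 Yuan, so the capped total is 3,400 Yuan before copayment.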

3.2 Sample Selection

We obtained data on all claims filed in 2009 and 2010 for this inpatient medical expense insurance product. Claims are divided into two categories, zero payoff and non-zero payoff, according to the insurer's claim decision. We treat zero claim payoff as definite evidence for the existence of fraud. The non-zero payoff claims can be further divided into fully paid (adjusted for copayment and coverage limit) and partially paid claims. However, the majority of the partially paid claims are due to the deduction of payment from the social medical insurance program, so it would be unfair to label them as fraud. Therefore we treat all partially paid claims as legitimate claims in our analysis, and regard only zero claim payoff as fraud cases.

Table 3 Summary of total claims and sampled claims

| Year | Total # of claims | Fraudulent cases | Fraud % in population | Sampled # of claims | Sampled fraudulent cases | Fraud % in sample |
|---|---|---|---|---|---|---|
| 2009 | 3,868 | 236 | 6.10% | 451 | 155 | 34.37% |
| 2010 | 4,205 | 255 | 6.06% | 512 | 224 | 43.75% |
| Overall | 8,073 | 491 | 6.08% | 963 | 379 | 39.36% |

Table 3 summarizes our sampling procedure. Overall in 2009 and 2010, the percentage of fraudulent cases (zero claim payoff) is around 6%. In order to capture enough fraud cases in the training sample to analyze its predictor variables, we use non-random sampling, so in our sample the percentage of fraudulent cases increases to 39%. [7] We will adjust for non-random sampling in specification 2.

[7] Among all 491 fraud cases in 2009 and 2010, we aimed to capture all information, but due to duplication of claims and missing information, we were left with 77% of all fraud claims, resulting in 379 sampled fraudulent cases. Among all 7,582 legitimate claims, we randomly selected 600 claims (300 each from 2009 and 2010), and due to duplication and missing information, this resulted in 585 sampled legitimate cases comprising 7.7% of all legitimate claims.

3.3 Descriptive Statistics

Table 4 gives a complete summary of variable definitions and descriptive statistics for our sample. Overall, the data provide information on three different levels: first, characteristics of the insured (sex, age, occupation, marital status, and income); second, characteristics of the healthcare provider and service (hospital type, days of hospital stay, total cost, composition of the total cost); third, characteristics of the policy (coverage, claim history information and number of policies purchased from other insurance companies). Table 5 shows descriptive measures for the two subsamples of fraudulent and legitimate claims in comparison.

Table 4 Variable Definition and Summary Statistics

| Variable | Definition | Mean | Standard deviation | Minimum | Maximum |
|---|---|---|---|---|---|
| Dependent variable | | | | | |
| Fraud1 | Equals 1 if the claim is rejected completely, 0 otherwise. | 0.394 | 0.489 | 0 | 1 |
| Characteristics of the insured | | | | | |
| sex | Equals 1 if male, 0 if female. | 0.496 | 0.500 | 0 | 1 |
| Age_claim | Age of insured when the claim is filed. | 28.150 | 19.559 | 0 | 60 |
| child_dummy | Equals 1 if the insured's age is between 0 and 18 when the claim is filed, 0 otherwise. | 0.332 | 0.471 | 0 | 1 |
| adult_dummy | Equals 1 if the insured's age is between 19 and 59 when the claim is filed, 0 otherwise. | 0.663 | 0.473 | 0 | 1 |
| elder_dummy | Equals 1 if the insured's age is 60 or above when the claim is filed, 0 otherwise. | 0.005 | 0.072 | 0 | 1 |
| occupation | A standard classification of occupation type from 1 to 6, with a greater number corresponding to greater risk. | 2.109 | 0.847 | 0 | 4 |
| marital | Equals 1 if married, 0 otherwise. | 0.614 | 0.487 | 0 | 1 |
| income | Individual annual income. | 60,885 | 49,831 | 6,000 | 500,000 |
| Characteristics of healthcare provider and service | | | | | |
| hosp_type | Equals 3 if it is a grade III-A hospital, 2 if it is a grade III hospital, 1 if it is a grade II-A hospital, and 0 otherwise. Grade III-A hospitals are the best ones in China. | 1.890 | 1.048 | 0 | 3 |
| hosp_rec | Equals 2 if the hospital is on the recommendation list of the insurer, 1 if it is an assigned hospital of the insurer, and 0 otherwise. | 1.038 | 0.712 | 0 | 2 |
| hosp_rec_dummy1 | Equals 1 if the hospital is a qualified hospital of the insurer but not on the recommendation list, 0 otherwise. | 0.492 | 0.500 | 0 | 1 |
| hosp_rec_dummy2 | Equals 1 if the hospital is not a qualified hospital of the insurer, 0 otherwise. | 0.235 | 0.424 | 0 | 1 |
| hosp_day | Number of days that the insured stayed in hospital this time. | 14.020 | 13.900 | 0 | 218 |
| hosp_day_pre | Number of days for previous hospital stays under this policy. | 0.541 | 3.635 | 0 | 72 |
| tot_cost | Total expenditure. | 8,209 | 18,943 | 262 | 476,385 |
| bed_per | Percentage of expenditure on bed cost. | 0.083 | 0.117 | 0.000 | 1.000 |
| med_per | Percentage of expenditure on medicine. | 0.453 | 0.223 | 0.000 | 1.000 |
| care_per | Percentage of expenditure on care (labor). | 0.017 | 0.029 | 0.000 | 0.500 |
| diag_per | Percentage of expenditure on diagnosis service. | 0.013 | 0.036 | 0.000 | 0.502 |
| treat_per | Percentage of expenditure on treatment. | 0.184 | 0.162 | 0.000 | 1.000 |
| test_per | Percentage of expenditure on lab test. | 0.193 | 0.152 | 0.000 | 0.900 |
| oper_per | Percentage of expenditure on operation cost. | 0.058 | 0.121 | 0.000 | 0.647 |
| Characteristics of the policy | | | | | |
| coverage_type | The level of coverage (corresponding to the levels in Table 2). | 1.130 | 0.380 | 1 | 3 |
| self_policyholder | Equals 1 if the insured is the policy holder, 0 otherwise. | 0.563 | 0.496 | 0 | 1 |
| renew | The total number of years since the insured first purchased this product. | 2.980 | 1.770 | 1 | 7 |
| num_other_policy | Number of valid policies the insured purchased from other insurance companies. | 0.078 | 0.318 | 0 | 3 |
| self_claim | Equals 1 if the insured filed the claim himself/herself, 0 otherwise. | 0.733 | 0.443 | 0 | 1 |
| claim_duration | Number of days between policy commencement date and hospital admission date. | 193.351 | 97.493 | 0 | 364 |
| file_duration | Number of days between hospital admission date and claim material submission date. | 68.627 | 84.520 | 6 | 829 |
| claimfreq_pre | Number of claims filed prior to current claim. | 0.736 | 1.679 | 0 | 18 |

Table 5 Summary Statistics for Two Subsamples

| Variable | Fraudulent: Mean | Fraudulent: Standard deviation | Fraudulent: Minimum | Fraudulent: Maximum | Legitimate: Mean | Legitimate: Standard deviation | Legitimate: Minimum | Legitimate: Maximum | Mean difference | P-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Dependent variable | | | | | | | | | | |
| fraud1 | 1.000 | 0.000 | 1 | 1 | 0.000 | 0.000 | 0 | 0 | | |
| Characteristics of the insured | | | | | | | | | | |
| sex | 0.464 | 0.499 | 0 | 1 | 0.517 | 0.500 | 0 | 1 | 0.0527 | 0.110 |
| age_claim | 31.369 | 17.612 | 0 | 60 | 26.060 | 20.470 | 0 | 60 | -5.3095*** | 0.000 |
| child_dummy | 0.240 | 0.428 | 0 | 1 | 0.392 | 0.489 | 0 | 1 | 0.1520*** | 0.000 |
| adult_dummy | 0.755 | 0.431 | 0 | 1 | 0.603 | 0.490 | 0 | 1 | -0.1519*** | 0.000 |
| elder_dummy | 0.005 | 0.073 | 0 | 1 | 0.005 | 0.072 | 0 | 1 | -0.0001 | 0.977 |
| occupation | 1.939 | 0.816 | 0 | 4 | 2.219 | 0.849 | 0 | 4 | 0.2799*** | 0.000 |
| marital | 0.686 | 0.465 | 0 | 1 | 0.567 | 0.496 | 0 | 1 | -0.1192*** | 0.000 |
| income | 62,244 | 55,267 | 10,000 | 500,000 | 60,003 | 45,989 | 6,000 | 500,000 | -2241.0330 | 0.496 |
| Characteristics of healthcare provider and service | | | | | | | | | | |
| hosp_type | 1.923 | 1.090 | 0 | 3 | 1.868 | 1.020 | 0 | 3 | -0.0553 | 0.424 |
| hosp_rec | 0.860 | 0.708 | 0 | 2 | 1.154 | 0.691 | 0 | 2 | 0.2940*** | 0.000 |
| hosp_rec_dummy1 | 0.480 | 0.500 | 0 | 1 | 0.500 | 0.500 | 0 | 1 | 0.0198 | 0.549 |
| hosp_rec_dummy2 | 0.330 | 0.471 | 0 | 1 | 0.173 | 0.379 | 0 | 1 | -0.1569*** | 0.000 |
| hosp_day | 13.960 | 12.023 | 0 | 113 | 14.058 | 15.002 | 0 | 218 | 0.0978 | 0.915 |
| hosp_day_pre | 0.322 | 2.251 | 0 | 22 | 0.683 | 4.297 | 0 | 72 | 0.3613 | 0.132 |
| tot_cost | 10,392 | 27,077 | 262 | 476,385 | 6,792 | 10,565 | 515 | 145,878 | -3600.7280*** | 0.004 |
| bed_per | 0.088 | 0.170 | 0.000 | 1.000 | 0.080 | 0.062 | 0.000 | 0.613 | -0.0082 | 0.286 |
| med_per | 0.430 | 0.261 | 0.000 | 1.000 | 0.468 | 0.193 | 0.000 | 1.000 | 0.0384*** | 0.009 |
| care_per | 0.015 | 0.036 | 0.000 | 0.500 | 0.018 | 0.023 | 0.000 | 0.234 | 0.0027 | 0.156 |
| diag_per | 0.015 | 0.054 | 0.000 | 0.502 | 0.011 | 0.017 | 0.000 | 0.215 | -0.0037 | 0.119 |
| treat_per | 0.191 | 0.188 | 0.000 | 1.000 | 0.179 | 0.144 | 0.000 | 1.000 | -0.0116 | 0.278 |
| test_per | 0.181 | 0.146 | 0.000 | 0.801 | 0.201 | 0.155 | 0.000 | 0.900 | 0.0202** | 0.044 |
| oper_per | 0.081 | 0.139 | 0.000 | 0.647 | 0.043 | 0.105 | 0.000 | 0.629 | -0.0377*** | 0.000 |
| Characteristics of the policy | | | | | | | | | | |
| premium | 703 | 275 | 326 | 2398 | 734 | 245 | 326 | 1778 | 31.4033* | 0.065 |
| coverage_type | 1.161 | 0.440 | 1 | 3 | 1.110 | 0.334 | 1 | 3 | -0.0514** | 0.040 |
| self_policyholder | 0.633 | 0.483 | 0 | 1 | 0.517 | 0.500 | 0 | 1 | -0.1161*** | 0.000 |
| renew | 2.171 | 1.537 | 1 | 7 | 3.502 | 1.715 | 1 | 7 | 1.3307*** | 0.000 |
| num_other_policy | 0.069 | 0.301 | 0 | 3 | 0.084 | 0.328 | 0 | 3 | 0.0153 | 0.466 |
| self_claim | 0.789 | 0.409 | 0 | 1 | 0.697 | 0.460 | 0 | 1 | -0.0920*** | 0.002 |
| claim_duration | 185.285 | 93.049 | 0 | 364 | 198.586 | 99.980 | 0 | 363 | 13.3007** | 0.039 |
| file_duration | 81.277 | 97.303 | 7 | 829 | 60.418 | 74.010 | 6 | 650 | -20.8592*** | 0.000 |
| claimfreq_pre | 0.327 | 0.934 | 0 | 8 | 1.002 | 1.977 | 0 | 18 | 0.6745*** | 0.000 |
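The mean differences and p-values in Table 5 can be approximated from the reported summary statistics alone. The sketch below is an illustration, not the authors' procedure: the paper does not name the test it used, so a pooled-variance two-sample t-test is assumed here, and the subsample sizes (379 fraudulent claims, and 963 - 379 = 584 legitimate claims) are inferred from Table 3. The helper name `mean_difference_test` is hypothetical.

```python
from scipy.stats import ttest_ind_from_stats

# Subsample sizes inferred from Table 3 (an assumption of this sketch).
N_FRAUD = 379
N_LEGIT = 963 - 379


def mean_difference_test(mean_fraud, sd_fraud, mean_legit, sd_legit):
    """Return (legitimate mean - fraudulent mean, two-sided p-value).

    Assumes a pooled-variance two-sample t-test computed from summary statistics.
    """
    result = ttest_ind_from_stats(
        mean_legit, sd_legit, N_LEGIT,
        mean_fraud, sd_fraud, N_FRAUD,
        equal_var=True,
    )
    return mean_legit - mean_fraud, result.pvalue


# Rows copied from Table 5: (fraud mean, fraud SD, legitimate mean, legitimate SD).
rows = {
    "sex": (0.464, 0.499, 0.517, 0.500),
    "tot_cost": (10392, 27077, 6792, 10565),
    "renew": (2.171, 1.537, 3.502, 1.715),
}
for name, stats in rows.items():
    diff, p_value = mean_difference_test(*stats)
    print(f"{name}: difference = {diff:.4f}, p-value = {p_value:.3f}")
```

With rounded inputs the results differ slightly from the table, but under this assumption the p-values for `sex` and `tot_cost` land close to the reported 0.110 and 0.004.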

4 Model and Methodology

4.1 Discrete-choice Model

The model takes the form of a binary probit regression with the dependent variable equal to one if the claim is identified as a fraudulent case. Assume the following functional relationship:

$$Y^* = X\beta + e$$

where $Y^*$ is the latent variable, $X$ is a vector of the observed explanatory variables, $\beta$ is a vector of unknown parameters, and $e$ is a disturbance term. The claim is determined to be fraudulent if $Y^* > 0$; otherwise it is legitimate. Let the observed indicator of fraud be $Y$; then we have:

$$Y = \begin{cases} 1, & \text{if } Y^* > 0 \\ 0, & \text{otherwise} \end{cases}$$

The probability of fraud is

$$\Pr(Y = 1 \mid X) = \Pr(Y^* > 0 \mid X) = \Pr(X\beta + e > 0 \mid X) = \Pr(e > -X\beta \mid X) = 1 - F(-X\beta)$$

The probability of the claim being legitimate is

$$\Pr(Y = 0 \mid X) = \Pr(Y^* \le 0 \mid X) = \Pr(X\beta + e \le 0 \mid X) = \Pr(e \le -X\beta \mid X) = F(-X\beta)$$

where $F(\cdot)$ is the cumulative distribution function of $e$. If we assume that $e$ follows a standard normal distribution, the model is a probit model. Let the cumulative distribution function of the standard normal distribution be $\Phi(\cdot)$; then, by the symmetry of the normal distribution,

$$\Pr(Y = 1 \mid X) = \Phi(X\beta), \qquad \Pr(Y = 0 \mid X) = 1 - \Phi(X\beta)$$

The log-likelihood function is

$$\ln L = \frac{1}{n} \sum_{i=1}^{n} \left[ Y_i \ln \Phi(X_i\beta) + (1 - Y_i) \ln\left(1 - \Phi(X_i\beta)\right) \right] \qquad (1)$$

Due to the sampling method and the nature of our data, we improve the probit model in two directions in the following two sections. Model 1 in Section 4.2 addresses the oversampling problem and model 2 in Section 4.3 attempts to address the misclassification problem.

4.2 Probit Model with Weighted Exogenous Sampling Maximum Likelihood Estimation

Overall, 6% of all claims in 2009 and 2010 are fraudulent, but in our sample the share of fraudulent cases increases to 39% because of oversampling of fraudulent claims. To adjust for the oversampling, we follow Manski and Lerman (1977) and use a weighted exogenous sampling maximum likelihood (WESML) estimator. It modifies the classic log-likelihood function and provides a consistent and asymptotically normal estimator. Artís, Ayuso and Guillen (1999) use this method to correct for the oversampling of fraud claims in auto insurance. Consider the following weighted exogenous sampling likelihood function corresponding to our model:

$$\ln L_w(y) = \omega_1 \sum_{\{y_i = 1\}} \ln(p_i) + \omega_0 \sum_{\{y_i = 0\}} \ln(1 - p_i) \qquad (2)$$

where $p_i$ is the model probability that claim $i$ is fraudulent and

$$\omega_1 = \frac{\pi_1}{\pi_2}, \qquad \omega_0 = \frac{1 - \pi_1}{1 - \pi_2}$$

Here $\pi_1$ is the percentage of fraudulent samples in the total claims (population), and $\pi_2$ is the percentage of fraudulent samples in the sample. The weighting factors are summarized in Table 6. We obtain the estimates by maximizing equation (2).
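The weighting scheme of Section 4.2 can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `wesml_weights` and `weighted_probit_nll` are hypothetical, the data are simulated, and the optimizer (BFGS from SciPy) is an arbitrary choice. The 2009 shares are taken from the counts in Table 3 (155 sampled fraudulent claims, 3,868 total claims, 451 sampled claims).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def wesml_weights(population_share, sample_share):
    """Manski-Lerman weights: (weight for fraud cases, weight for legitimate cases)."""
    w_fraud = population_share / sample_share
    w_legit = (1 - population_share) / (1 - sample_share)
    return w_fraud, w_legit


def weighted_probit_nll(beta, X, y, w_fraud, w_legit):
    """Negative of the weighted probit log-likelihood in equation (2)."""
    xb = X @ beta
    # norm.logcdf(-xb) equals ln(1 - Phi(xb)) and stays stable in the tails.
    return -(w_fraud * np.sum(y * norm.logcdf(xb))
             + w_legit * np.sum((1 - y) * norm.logcdf(-xb)))


# Weights for 2009 from the counts in Table 3.
w1_2009, w0_2009 = wesml_weights(155 / 3868, 155 / 451)

# Simulated population (hypothetical): latent index -1.6 + 0.5 x + e.
rng = np.random.default_rng(0)
N = 20000
x = rng.normal(size=N)
y_pop = (-1.6 + 0.5 * x + rng.normal(size=N) > 0).astype(float)

# Choice-based sample: keep every fraud case, draw twice as many legitimate ones.
fraud_idx = np.flatnonzero(y_pop == 1)
legit_idx = rng.choice(np.flatnonzero(y_pop == 0), size=2 * fraud_idx.size, replace=False)
idx = np.concatenate([fraud_idx, legit_idx])
X = np.column_stack([np.ones(idx.size), x[idx]])
y = y_pop[idx]

w_fraud, w_legit = wesml_weights(y_pop.mean(), y.mean())
fit = minimize(weighted_probit_nll, x0=np.zeros(2),
               args=(X, y, w_fraud, w_legit), method="BFGS")
print(round(w1_2009, 3), round(w0_2009, 3), fit.x)
```

Because fraud cases are over-represented in the simulated sample, an unweighted probit would overstate the intercept; the weights scale each group back to its population share so that the estimates stay close to the simulated coefficients.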

Table 6 Summary of the weighting factors

| Year | $\theta_1$ | $\theta_2$ | $\omega_0 = (1-\theta_1)/(1-\theta_2)$ | $\omega_1 = \theta_1/\theta_2$ |
|---|---|---|---|---|
| 2009 | 4.01% | 34.37% | 1.463 | 0.117 |
| 2010 | 5.33% | 43.75% | 1.683 | 0.122 |
| Total | 4.69% | 39.36% | 1.572 | 0.119 |

4.3 Maximum Likelihood Estimation with Omission Error

Detecting fraud is a classification problem. There are two types of misclassification, but in this paper we assume that every claim classified as fraudulent is truly fraudulent, so the only possible misclassification is omission error (fraudulent claims that the insurer fails to detect). Following the method proposed by Hausman et al. (1998), we take omission error into consideration in Model 2 and estimate the percentage of fraudulent claims that are not detected. Artís, Ayuso and Guillén (2002) also apply this method to the auto insurance market.

Assume a regression model for $Y_i^*$ such that:

$$Y_i^* = X_i\beta + e_i$$

Let $\tilde{Y}_i$ be a dichotomous variable indicating the true presence of fraud such that:

$$\tilde{Y}_i = \begin{cases} 1, & \text{if } Y_i^* > 0 \\ 0, & \text{otherwise} \end{cases}$$

If there is no measurement error in the response, $\tilde{Y}_i$ indicates the true outcome with the following probability:

$$\operatorname{Prob}(\tilde{Y}_i = 1 \mid X_i) = \operatorname{Prob}(Y_i^* > 0 \mid X_i)$$

Within the misclassification framework, assume that the observed dependent variable may differ from the underlying outcome. Call the observed binary variable $Y_i$. Assume that the probabilities of misclassification are as follows:

$$\alpha_0 = \operatorname{Prob}(Y_i = 1 \mid \tilde{Y}_i = 0), \qquad \alpha_1 = \operatorname{Prob}(Y_i = 0 \mid \tilde{Y}_i = 1)$$
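To make the role of the two misclassification probabilities concrete, the sketch below applies the law of total probability: a claim is observed as fraudulent either because it is truly fraudulent and detected, or because it is legitimate and wrongly flagged. The function name and the numerical inputs are hypothetical and serve only as illustration.

```python
from statistics import NormalDist


def observed_fraud_probability(index, alpha0, alpha1):
    """Pr(observed Y = 1 | X) for a probit index X * beta.

    alpha0: probability that a legitimate claim is recorded as fraudulent.
    alpha1: probability that a fraudulent claim is recorded as legitimate.
    """
    p_true = NormalDist().cdf(index)  # probability the claim is truly fraudulent
    return alpha0 * (1.0 - p_true) + (1.0 - alpha1) * p_true


# With no misclassification the expression reduces to the plain probit probability.
print(observed_fraud_probability(0.0, 0.0, 0.0))   # 0.5

# Hypothetical omission error of 5 percent and no false accusations.
print(observed_fraud_probability(0.0, 0.0, 0.05))
```

When alpha0 is zero, the observed probability is simply the true probability scaled down by the share of fraudulent claims that are detected, which is the case of omission error only.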

In our specification, we assume $\alpha_0 = 0$ and estimate $\alpha_1$. The conditional expectation of the observed dependent variable is given by:

$$E(Y_i \mid X_i) = (1 - \alpha_1)\,\Phi(X_i\beta)$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The corrected log-likelihood function is:

$$\ln L = \sum_{i=1}^{n} \left[ Y_i \ln\left((1 - \alpha_1)\,\Phi(X_i\beta)\right) + (1 - Y_i) \ln\left(1 - (1 - \alpha_1)\,\Phi(X_i\beta)\right) \right] \qquad (3)$$

$\alpha_1$ can be estimated by maximizing the log-likelihood function in equation (3).

5 Empirical Results and Discussions

Corresponding to the model specifications in section 4, we consider three specifications. First, we estimate a standard probit model. Second, we take into account the over-representation of fraudulent claims in our sample. Third, we also consider the omission error. The dependent variable is the claim decision made by the insurance company: we treat a completely rejected claim as fraudulent, so the dependent variable equals one for such claims and zero otherwise. The explanatory variables include indicators of fraudulent claims as well as control variables for the insured.

We perform a likelihood ratio test; the test statistic is 18.9 with 1 degree of freedom. This indicates a significant improvement when we include the omission error parameter (specification 3) compared with the restricted model without omission error (specification 2). In specification 3, the parameter $\alpha_1$, which estimates the probability of omission error, is significantly different from zero. The estimate implies that fraudulent claims are underestimated by 4.66 percent. The complete regression results are shown in Table 7.
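The strength of the likelihood ratio result can be gauged with a short calculation. The sketch below assumes the usual chi-square reference distribution with 1 degree of freedom and uses the identity that a chi-square(1) variable is the square of a standard normal variable, so only the standard library is needed. The helper names are hypothetical.

```python
import math


def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def chi2_sf_1df(stat):
    """Upper-tail probability P(chi-square(1) > stat) = P(|Z| > sqrt(stat))."""
    return math.erfc(math.sqrt(stat / 2.0))


# Statistic reported in the text: 18.9 with 1 degree of freedom.
p_value = chi2_sf_1df(18.9)
print(f"p-value = {p_value:.2e}")
print("reject the restricted model at the 1% level:", p_value < 0.01)
```

For comparison, the 5 percent critical value of a chi-square(1) distribution is about 3.84, so a statistic of 18.9 lies far in the rejection region.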

Table 7 Regression Results

| Variables | 1: Probit | 2: Over-sampling addressed | 3: Omission error addressed |
|---|---|---|---|
| sex | 0.00518 (0.0973) | 0.0230 (0.160) | 0.0036 (0.483) |
| child_dummy | -0.501* (0.275) | -0.660 (0.442) | -2.411* (1.421) |
| elder_dummy | 0.598 (0.660) | 0.493 (1.004) | 1.309 (2.665) |
| occupation | -0.137* (0.0751) | -0.0911 (0.124) | -0.263 (0.338) |
| marital | -0.230 (0.191) | -0.277 (0.316) | -1.578 (1.252) |
| lnincome | 0.0117 (0.0855) | 0.0217 (0.143) | -0.218 (0.469) |
| hosp_type | 0.0297 (0.0505) | 0.0143 (0.0800) | -0.0225 (0.17) |
| hosp_rec_dummy1 | 0.146 (0.119) | 0.138 (0.187) | 0.29 (0.414) |
| hosp_rec_dummy2 | 0.738*** (0.138) | 0.760*** (0.230) | 1.814** (0.899) |
| hosp_day | -0.0108** (0.00455) | -0.0109 (0.00731) | -0.042* (0.0224) |
| hosp_day_pre | -0.0495** (0.0203) | -0.0532* (0.0311) | -0.165* (0.0885) |
| lntot_cost | 0.249*** (0.0725) | 0.199* (0.113) | 0.166 (0.259) |
| bed_per | 1.264*** (0.490) | 1.042 (0.761) | 1.749 (1.528) |
| care_per | -2.508 (1.604) | -2.510 (2.314) | -8.015 (6.27) |
| diag_per | 3.569* (1.824) | 2.833 (3.096) | 4.797 (7.642) |
| treat_per | 0.00177 (0.299) | 0.181 (0.493) | 1.52 (1.628) |
| test_per | -0.123 (0.330) | 0.0447 (0.534) | 0.698 (1.714) |
| oper_per | 0.707* (0.397) | 0.895 (0.664) | 4.767 (2.922) |
| coverage_type | 0.177 (0.129) | 0.133 (0.206) | 0.319 (0.474) |
| self_policyholder | 0.128 (0.149) | 0.213 (0.244) | 0.922 (0.66) |
| renew | -0.319*** (0.0316) | -0.286*** (0.0504) | -0.785** (0.345) |
| num_other_policy | -0.188 (0.148) | -0.132 (0.231) | 0.052 (0.61) |
| self_claim | -0.0796 (0.175) | -0.146 (0.290) | -0.196 (0.809) |
| claim_duration | -0.00195*** (0.000490) | -0.00181** (0.000807) | -0.00518** (0.00214) |
| file_duration | 0.00173*** (0.000600) | 0.00216* (0.00119) | 0.0246** (0.0124) |
| claimfreq_pre | -0.0795** (0.0399) | -0.0653 (0.0567) | -0.0318 (0.114) |
| Constant | -1.193 (1.129) | 0.436 (1.882) | 6.115 (6.49) |
| $\alpha_1$ | - | - | 0.0466*** (0.0135) |
| $\alpha_0$ | - | - | 0 |
| Pseudo R2 | 0.2305 | 0.2057 | - |
| Observations | 963 | 963 | 963 |

*** p<0.01, ** p<0.05, * p<0.1; standard errors are in parentheses.

In Table 7, we find that most of the parameter signs are consistent with our expectation. Table 8 lists the expected versus the obtained parameter signs.

Table 8 Comparison of the Expected and the Obtained Parameter Signs

| Variable | Obtained sign | Expected sign |
|---|---|---|
| hosp_type | Inconsistent [8] | - |
| hosp_rec_dummy1 | + | + |
| hosp_rec_dummy2 | + | + |
| hosp_day | - | + |
| hosp_day_pre | - | - |
| lntot_cost | + | + |
| bed_per | + | indefinite |
| care_per | - | indefinite |
| diag_per | + | indefinite |
| treat_per | + | indefinite |
| test_per | inconsistent | indefinite |
| oper_per | + | indefinite |
| coverage_type | + | + |
| self_policyholder | + | indefinite |
| renew | - | - |
| num_other_policy | inconsistent | - |
| self_claim | - | - |
| claim_duration | - | - |
| file_duration | + | + |
| claimfreq_pre | - | - |

[8] Inconsistent indicates that the signs of the coefficient are not all the same across the three specifications.
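The significance stars in Table 7 can be checked approximately from each coefficient and its standard error. The sketch below assumes a two-sided test based on the standard normal distribution, which is the usual large-sample approximation for maximum likelihood estimates; whether the original software used exactly this reference distribution is an assumption. The function names are hypothetical.

```python
import math


def two_sided_p(coef, std_err):
    """Two-sided p-value of z = coef / std_err under a standard normal."""
    z = coef / std_err
    return math.erfc(abs(z) / math.sqrt(2.0))


def stars(p):
    """Map a p-value to the star convention used under Table 7."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


# (variable, specification, coefficient, standard error) taken from Table 7.
examples = [
    ("hosp_rec_dummy2", 1, 0.738, 0.138),
    ("renew", 3, -0.785, 0.345),
    ("child_dummy", 1, -0.501, 0.275),
    ("lnincome", 1, 0.0117, 0.0855),
]

for name, spec, coef, se in examples:
    p = two_sided_p(coef, se)
    print(f"{name} (spec {spec}): z = {coef / se:.2f}, p = {p:.4f} {stars(p)}")
```

Coefficients whose ratio to the standard error sits very close to a critical value may be starred differently depending on rounding in the published table.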

Most of the coefficient signs are consistent across the three specifications, except for those of income (lnincome), hospital type (hosp_type), test percentage (test_per) and number of other policies (num_other_policy). The coefficients of these four explanatory variables are not significant, however.

As shown in Table 8, we find several indicators of fraudulent medical claims. Most of them relate either to the medical service and provider or to characteristics of the insurance policy.

The hosp_rec_dummy2 variable shows a strong positive relationship with a claim being fraudulent: if the insured seeks medical service from a provider that the insurer does not regard as qualified, the claim is more likely to be fraudulent. By contrast, the hosp_rec_dummy1 variable, which indicates a qualified provider that is not recommended by the insurer, is not significant. It does have the expected positive sign, suggesting that claims involving providers not on the recommendation list have a higher probability of fraud than claims involving recommended providers.

The lengths of the current and the prior hospital stay are both significant indicators of medical fraud. The signs are negative in all three specifications, meaning that the longer the insured stays in the hospital, currently or previously, the lower the probability that the claim is fraudulent. The obtained sign for the length of the current hospital stay differs from our original hypothesis. We propose two reasons. First, the longer the hospital stay, the higher the probability that the claim will be subjected to scrutiny in claim handling, so an insured who plans to commit fraud will keep the hospital stay within a reasonable limit. Second, there is a coverage limit on the bed charge that can be reimbursed under this insurance product, so in a planned fraud the fraudster will limit the length of his or her stay.

The influence of total cost is significant at the 1 percent level in specification 1 and at the 10 percent level in specification 2. The parameter signs in all three specifications are positive, indicating that the higher the total cost, the higher the probability of a fraudulent claim, which is consistent with our expectation.

Unlike the results of a prior study (Shin et al., 2012), the influence of the composition of expenditure is not significant in general. Only bed charge, diagnosis expenditure and operation cost are significant at the 10 percent level or better in specification 1, and none is significant when oversampling or omission error is taken into consideration. The main reason, we propose, is that prior studies either controlled for diagnosis information or focused on a certain kind of disease (Ireson, 1997). Our sample has a limited number of observations and various disease types; without controlling for disease type, the cost composition cannot be used to predict fraudulent cases.

The renew variable indicates the total number of years since the insured first purchased this product. Consistent with our expectation, the longer the insured has renewed with the same insurer, the less likely he or she is to commit fraud.

The claim_duration and file_duration variables are both significant in all three specifications, and their signs are consistent with our expectation. The claim_duration variable measures the number of days between policy commencement and hospital admission. The negative sign shows that an insured who intends to commit fraud is eager to forge the accident soon after the policy begins. The file_duration variable measures the number of days between hospital admission and submission of the claim materials. The positive sign shows that insureds who spend more time preparing the claim materials are more likely to commit fraud.

The number of claims filed prior to the current claim has a negative impact on the probability of fraud, as expected, but it is significant (at the 5 percent level) only in specification 1.

Among the control variables for the characteristics of the insured, most are not statistically significant when omission error is considered, except for child_dummy. The sign of the child_dummy parameter is negative, as expected, since children are less likely to be involved in medical insurance fraud.

Table 9 Marginal effects

| Variables | Probit | Model 1 | Model 2 |
|---|---|---|---|
| sex | 0.00193 (0.0362) | 0.00282 (0.0197) | 9.253E-07 (-) |
| child_dummy | -0.187* (0.103) | -0.0810 (0.0543) | -6.204E-04 (-) |
| elder_dummy | 0.223 (0.246) | 0.0605 (0.123) | 3.367E-04 (-) |
| occupation | -0.0510* (0.0280) | -0.0112 (0.0152) | -6.779E-05 (-) |
| marital | -0.0857 (0.0712) | -0.0341 (0.0386) | -4.061E-04 (-) |
| lnincome | 0.00435 (0.0318) | 0.00266 (0.0175) | -5.599E-05 (-) |
| hosp_type | 0.0111 (0.0188) | 0.00176 (0.0098) | -5.780E-06 (-) |
| hosp_rec_dummy1 | 0.0544 (0.0443) | 0.0170 (0.0230) | 7.454E-05 (-) |
| hosp_rec_dummy2 | 0.275*** (0.0511) | 0.0933*** (0.0285) | 4.668E-04 (-) |
| hosp_day | -0.00401** (0.00169) | -0.00133 (0.00090) | -1.080E-05 (-) |
| hosp_day_pre | -0.0184** (0.00753) | -0.00654* (0.00384) | -4.238E-05 (-) |
| lntot_cost | 0.0926*** (0.0270) | 0.0244* (0.0139) | 4.259E-05 (-) |
| bed_per | 0.471** (0.183) | 0.128 (0.0920) | 4.499E-04 (-) |
| care_per | -0.934 (0.598) | -0.308 (0.282) | -2.062E-03 (-) |
| diag_per | 1.330* (0.681) | 0.348 (0.375) | 1.234E-03 (-) |
| treat_per | 0.000660 (0.112) | 0.0223 (0.0606) | 3.911E-04 (-) |
| test_per | -0.0459 (0.123) | 0.00549 (0.0655) | 1.795E-04 (-) |
| oper_per | 0.263* (0.148) | 0.110 (0.0809) | 1.226E-03 (-) |
| coverage_type | 0.0660 (0.0481) | 0.0163 (0.0253) | 8.210E-05 (-) |
| self_policyholder | 0.0475 (0.0555) | 0.0261 (0.0299) | 2.372E-04 (-) |
| renew | -0.119*** (0.0117) | -0.0351*** (0.00634) | -2.019E-04 (-) |
| num_other_policy | -0.0699 (0.0551) | -0.0162 (0.0285) | 1.339E-05 (-) |
| self_claim | -0.0297 (0.0650) | -0.0179 (0.0357) | -5.051E-05 (-) |
| claim_duration | -0.000728*** (0.000182) | -0.000223** (0.0000988) | -1.333E-06 (-) |
| file_duration | 0.000645*** (0.000224) | 0.000265* (0.000144) | 6.320E-06 (-) |
| claimfreq_pre | -0.0296** (0.0148) | -0.00802 (0.00702) | -8.180E-06 (-) |

*** p<0.01, ** p<0.05, * p<0.1; robust standard errors clustered by groups are in parentheses.

Marginal effects at the means of the independent variables are reported in Table 9. We note that the marginal effects in specification 3 are very small compared with those of the other two models. The underlying reason is that the latent variable $Y^*$ in specification 3 is higher than in specifications 1 and 2. In a probit model, the probability of a case being fraudulent is $\Phi(X\beta)$; therefore the marginal effect of $X$ is $\phi(X\beta)\beta$, in which $\phi(\cdot)$ denotes the density function of the standard normal distribution. $X\beta$ represents the latent variable $Y^*$ and can be calculated after the regression by setting each element of $X$ to its mean. As we take both over-representation and omission error into consideration in specification 3, the predicted $Y^*$ becomes larger, so the density $\phi(X\beta)$ and hence the marginal effect diminish, as shown in Figure 1.

[Figure 1: Marginal Effects in Different Specifications. The plot shows $\Phi(Y^*)$ on the vertical axis (0 to 1) against $Y^*$ on the horizontal axis (-2.5 to 2.5), with the values of $Y^*$ in specifications 1, 2 and 3 marked.]

To check the adequacy of our models, we report the classification results in Tables 10, 11 and 12. We chose the threshold for predicting a fraudulent claim using a grid search, compromising between the best classification in the whole sample and the best classification among fraudulent cases.
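The threshold compromise can be sketched as a small search over the grid reported in the appendix (Table 13). The rule coded here, which takes the threshold with the highest overall percentage among those that keep correctly classified fraudulent claims at or above 85 percent, follows the criterion stated for specification 2 in the discussion of the classification tables; the function names are hypothetical. The second helper recomputes the three percentages from a classification table, using the counts of Table 11.

```python
# Table 13 rows: threshold -> (total %, fraudulent %, legitimate %).
SPEC2 = {
    0.40: (44.65, 100.00, 8.73), 0.45: (47.04, 99.74, 12.84),
    0.50: (48.08, 99.47, 14.73), 0.55: (50.47, 98.94, 19.01),
    0.60: (54.00, 98.42, 25.17), 0.65: (57.42, 97.89, 31.16),
    0.70: (59.81, 96.83, 35.79), 0.75: (63.86, 94.20, 44.18),
    0.80: (67.91, 88.92, 54.28), 0.85: (71.13, 81.00, 64.73),
    0.90: (73.10, 70.71, 74.66), 0.95: (72.48, 50.40, 86.82),
}
SPEC3 = {
    0.40: (55.24, 99.47, 26.54), 0.45: (56.49, 98.94, 28.94),
    0.50: (58.15, 98.68, 31.85), 0.55: (58.98, 98.42, 33.39),
    0.60: (60.44, 98.15, 35.96), 0.65: (61.99, 97.89, 38.70),
    0.70: (63.03, 97.36, 40.75), 0.75: (64.80, 96.04, 44.52),
    0.80: (65.94, 93.93, 47.77), 0.85: (67.71, 91.29, 52.40),
    0.90: (68.95, 85.75, 58.05), 0.95: (70.61, 67.28, 72.77),
}


def pick_threshold(grid, min_fraud_pct=85.0):
    """Highest overall accuracy among thresholds meeting the fraud floor."""
    eligible = {t: v for t, v in grid.items() if v[1] >= min_fraud_pct}
    return max(eligible, key=lambda t: eligible[t][0])


def classification_rates(legit_ok, legit_wrong, fraud_wrong, fraud_ok):
    """Return (total %, fraudulent %, legitimate %) correctly classified."""
    n_legit = legit_ok + legit_wrong
    n_fraud = fraud_wrong + fraud_ok
    total = 100.0 * (legit_ok + fraud_ok) / (n_legit + n_fraud)
    return (round(total, 2),
            round(100.0 * fraud_ok / n_fraud, 2),
            round(100.0 * legit_ok / n_legit, 2))


print(pick_threshold(SPEC2), pick_threshold(SPEC3))  # 0.8 0.9
print(classification_rates(317, 267, 42, 337))       # counts from Table 11
```

Raising the floor on correctly classified fraudulent claims pushes the chosen threshold down, at the cost of more legitimate claims being flagged.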

Table 10 Classification Table for Specification 1

| Observed type | Predicted legitimate | Predicted fraudulent | Total |
|---|---|---|---|
| Legitimate | 474 | 110 | 584 |
| Fraudulent | 145 | 234 | 379 |
| Total | 619 | 344 | 963 |

When the estimated probability of fraud exceeded 0.5, the predicted type was fraud.

Table 11 Classification Table for Specification 2

| Observed type | Predicted legitimate | Predicted fraudulent | Total |
|---|---|---|---|
| Legitimate | 317 | 267 | 584 |
| Fraudulent | 42 | 337 | 379 |
| Total | 359 | 604 | 963 |

When the estimated probability of fraud exceeded 0.8, the predicted type was fraud.

Table 12 Classification Table for Specification 3

| Observed type | Predicted legitimate | Predicted fraudulent | Total |
|---|---|---|---|
| Legitimate | 339 | 245 | 584 |
| Fraudulent | 54 | 325 | 379 |
| Total | 393 | 570 | 963 |

When the estimated probability of fraud exceeded 0.9, the predicted type was fraud.

In specification 1 (the basic probit model), using a threshold of 0.5 [9], the total percentage of observations correctly classified was 74 percent, which is acceptable. The conditional percentage of legitimate claims that were correctly classified was 81 percent. However, the conditional percentage of fraudulent claims that were correctly classified was only 62 percent, showing that the probit model without weighted sampling and omission error is not ideal for detecting medical insurance fraud.

In specification 2, the threshold was set to 0.8, since it yields the highest overall classification percentage while keeping the share of correctly classified fraudulent cases above 85 percent. In this case, the conditional percentage of fraudulent claims correctly classified was about 89 percent and the percentage of legitimate claims correctly classified was 54 percent. Overall, 68 percent of observations were correctly classified. In this way, the model is more effective in detecting fraud than the basic probit model.

Using the same criteria as in specification 2, the threshold was set to 0.9 in specification 3 to yield the best compromise between overall performance and performance on fraudulent claims. The conditional percentage of fraudulent claims correctly classified was about 86 percent and the percentage of legitimate claims correctly classified was 58 percent. The total percentage of correct classification was 69 percent (664 of 963 observations in Table 12), which is acceptable in terms of both adequacy and efficiency in detecting medical insurance fraud.

[9] For the complete results of the threshold grid search, please refer to the appendix.

6 Concluding Remarks

Health insurance fraud causes higher insurance prices and a significant welfare loss to society; detecting fraud is therefore important for improving efficiency in the insurance industry. Fraud detection techniques have been studied extensively by both academics and industry analysts, yet most empirical studies focus on health insurance fraud in developed countries, and there is little evidence on the nascent commercial health insurance market in China. We use a discrete choice model that accounts for over-sampling and omission error to identify the predictive factors of medical insurance fraud, and we find that the hospital's qualification, the total cost of healthcare, the policyholder's renewal status, claim duration and file duration are contributing factors. Our research makes a significant contribution by broadening the understanding of predictive variables for health insurance fraud in China. We expect our analysis to help insurers in China to better evaluate their claims and to improve the efficiency and accuracy of claim management.

Appendix: The grid search results for the classification thresholds are shown in Table 13.

Table 13 The percentage of correctly classified claims under different levels of threshold

| Specification | Correctly classified (%) | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specification 1 | total* | 73.31 | 73.94 | 73.52 | 73.52 | 72.79 | 71.34 | 69.68 | 66.87 | 65.01 | - | - | - |
| Specification 1 | fraudulent* | 75.46 | 68.87 | 61.74 | 55.67 | 48.28 | 39.05 | 30.34 | 21.11 | 14.78 | - | - | - |
| Specification 1 | legitimate* | 71.92 | 77.23 | 81.16 | 85.10 | 88.70 | 92.29 | 95.21 | 96.58 | 97.60 | - | - | - |
| Specification 2 | total | 44.65 | 47.04 | 48.08 | 50.47 | 54.00 | 57.42 | 59.81 | 63.86 | 67.91 | 71.13 | 73.10 | 72.48 |
| Specification 2 | fraudulent | 100.00 | 99.74 | 99.47 | 98.94 | 98.42 | 97.89 | 96.83 | 94.20 | 88.92 | 81.00 | 70.71 | 50.40 |
| Specification 2 | legitimate | 8.73 | 12.84 | 14.73 | 19.01 | 25.17 | 31.16 | 35.79 | 44.18 | 54.28 | 64.73 | 74.66 | 86.82 |
| Specification 3 | total | 55.24 | 56.49 | 58.15 | 58.98 | 60.44 | 61.99 | 63.03 | 64.80 | 65.94 | 67.71 | 68.95 | 70.61 |
| Specification 3 | fraudulent | 99.47 | 98.94 | 98.68 | 98.42 | 98.15 | 97.89 | 97.36 | 96.04 | 93.93 | 91.29 | 85.75 | 67.28 |
| Specification 3 | legitimate | 26.54 | 28.94 | 31.85 | 33.39 | 35.96 | 38.70 | 40.75 | 44.52 | 47.77 | 52.40 | 58.05 | 72.77 |

Note: Total: the total percentage of observations correctly classified. Fraudulent: the percentage of fraudulent claims correctly classified. Legitimate: the percentage of legitimate claims correctly classified.

Reference:

Ai, J., P. Brockett, and L. Golden (2009), "Assessing Consumer Fraud Risk in Insurance Claims: An Unsupervised Learning Technique Using Discrete and Continuous Predictor Variables", North American Actuarial Journal, 13(4): 439-458.

Artís, M., M. Ayuso, and M. Guillén (1999), "Modelling Different Types of Automobile Insurance Fraud Behaviour in the Spanish Market", Insurance: Mathematics and Economics, 24: 67-81.

Artís, M., M. Ayuso, and M. Guillén (2002), "Detection of Automobile Insurance Fraud with Discrete Choice Models and Misclassified Claims", The Journal of Risk and Insurance, 69(3): 325-340.

Belhadji, El Bachir, G. Dionne, and F. Tarkhani (2000), "A Model for the Detection of Insurance Fraud", The Geneva Papers on Risk and Insurance, 25(4): 517-538.

Brockett, P. L., R. Derrig, L. Golden, A. Levine, and M. Alpert (2002), The Journal of Risk and Insurance, 69(3): 341-371.

Brockett, P. L., X. Xia, and R. A. Derrig (1998), "Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud", The Journal of Risk and Insurance, 65(2): 245-274.

Caudill, S. B., M. Ayuso, and M. Guillén (2005), "Fraud Detection Using a Multinomial Logit Model with Missing Information", The Journal of Risk and Insurance, 72(4): 539-550.

Derrig, R. A. (2002), "Insurance Fraud", The Journal of Risk and Insurance, 69(3): 271-287.

Derrig, R. A., and K. M. Ostaszewski (1995), "Fuzzy Techniques of Pattern Recognition in Risk and Claim Classification", The Journal of Risk and Insurance, 62(3): 447-482.

Hausman, J. A., J. Abrevaya, and F. M. Scott-Morton (1998), "Misclassification of the Dependent Variable in a Discrete-response Setting", Journal of Econometrics, 87: 239-269.

He, H., J. Wang, W. Graco, and S. Hawkins (1997), "Application of Neural Networks to Detection of Medical Fraud", Expert Systems with Applications, 13(4): 329-336.

He, H., W. Graco, and X. Yao (1999), "Application of Genetic Algorithm and k-Nearest Neighbour Method in Medical Fraud Detection", 2nd Asia-Pacific Conference on Simulated Evolution and Learning (SEAL 98), Nov. 24-27, 1998.

Ireson, C. L. (1997), "Critical Pathways: Effectiveness in Achieving Patient Outcomes", Journal of Nursing Administration, 27(6): 16-23.

Kou, Y., C. Lu, S. Sirwongwattana, and Y. Huang (2004), "Survey of Fraud Detection Techniques", International Conference on Networking, Sensing & Control, Taipei, Taiwan, March 21-23, 2004.

Li, J., K. Huang, J. Jin, and J. Shi (2008), "A Survey on Statistical Methods for Healthcare Fraud Detection", Health Care Management Science, 11: 275-287.

Liou, F., Y. Tang, and J. Chen (2008), "Detecting Hospital Fraud and Claim Abuse through Diabetic Outpatient Services", Health Care Management Science, 11: 353-358.

Major, J. A., and D. R. Riedinger (2002), "EFD: A Hybrid Knowledge/Statistical-Based System for the Detection of Fraud", The Journal of Risk and Insurance, 69(3): 309-324.

Manski, C., and S. R. Lerman (1977), "The Estimation of Choice Probabilities from Choice Based Samples", Econometrica, 45(8): 1977-1988.

Mao, L. (2008), "Research on the Health Insurance Anti-fraud in China", working paper, http://www.docin.com/p-224528482.html.

Ortega, P. A., C. J. Figueroa, and G. A. Ruz (2006), "A Medical Claim Fraud/Abuse Detection System based on Data Mining: A Case Study in Chile", in Proceedings of International Conference on Data Mining, Las Vegas, Nevada, USA.

Shin, H., H. Park, J. Lee, and W. C. Jhee (2012), "A Scoring Model to Detect Abusive Billing Patterns in Health Insurance Claims", Expert Systems with Applications, 39: 7441-7450.

Stefano, B., and F. Gisella (2001), "Insurance Fraud Evaluation: A Fuzzy Expert System", 2001 IEEE International Fuzzy Systems Conference.

Viaene, S., R. A. Derrig, B. Baesens, and G. Dedene (2002), "A Comparison of State-of-the-Art Classification Techniques for Expert Automobile Insurance Claim Fraud Detection", The Journal of Risk and Insurance, 69(3): 373-421.

Weisberg, H. I., and R. A. Derrig (1991), "Fraud and Automobile Insurance: A Report on the Baseline Study of Bodily Injury Claims in Massachusetts", Journal of Insurance Regulation, 9: 497-541.

Weisberg, H. I., and R. A. Derrig (1995), "Identification and Investigation of Suspicious Claims", in: AIB Cost Containment/Fraud Filing (DOI Docket R95-12) (Boston, Mass.: Automobile Insurers Bureau of Massachusetts).

Weisberg, H. I., and R. A. Derrig (1998), "Quantitative Methods for Detecting Fraudulent Automobile Bodily Injury Claims", Risques, July-September, 35: 75-99.

Yamanishi, K., J. Takeuchi, G. Williams, and P. Milne (2004), "On-line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms", Data Mining and Knowledge Discovery, 8: 275-300.

Yang, W., and S. Hwang (2006), "A Process-mining Framework for the Detection of Healthcare Fraud and Abuse", Expert Systems with Applications, 31: 56-68.