Inferring Individual Level Relationships from Aggregate Data *



Kihong Eom **, Youngjae Jin ***

< ABSTRACT >

This paper introduces a technique for inferring individual level relationships from aggregate data. Social scientists encounter the ecological fallacy problem when individual level data are not available, yet the individual level relationship is sought. This is particularly the case when they attempt to examine a subject for which no survey data are available or reliable. A series of cross-level inference techniques have been introduced since Goodman's seminal work (1959). We introduce a technique that significantly improves cross-level inference, Gary King's solution. For the purpose of verification, we examined regional voting in the 16th National Assembly elections of South Korea. The estimates of regional voting are compared with those from survey results. We found that the estimates from King's solution closely match those from survey results if the cell frequency in the survey results is large enough. King's solution produces more reliable estimates when the cell frequency in the survey results is small.

Key words: Cross-level inference, ecological fallacy, aggregate data, individual level relationship

* An earlier version was presented at the 53rd meeting of the International Statistical Institute, Seoul, Korea, August 22-29, 2001.
** Frostburg State University, Lecturer (email: keom@frostburg.edu)
*** Yonsei University, Associate Professor (email: y2jin@yonsei.ac.kr)

I. Introduction

Who voted for Chung-hee Park in South Korea in the 1963 presidential election? What was the level of regional voting in that election? How many times has a woman had an abortion in her lifetime? These are all interesting questions, yet they are hard to examine, partly because the matter is historical so that survey data are not available, and partly because, even if data were available, survey respondents may have given a politically correct answer, so that results might be biased.

The purpose of this paper is to introduce a technique for solving these problems, Gary King's ecological regression. To verify this technique, we compare estimates from survey results with those from King's method. The case selected for the verification is "regional voting" in the 16th Korean National Assembly elections. Regional voting refers to the concentration of votes along regional party lines in a number of Korean regions (Kim 1994; Lee 1998; Lee and Brunn 1996). 1) Stated another way, voters whose hometown is Jeolla mostly vote for the candidate of the party whose leaders were born in the region. Since this appears to happen regardless of the quality of candidates and the ideology of the party, voting patterns result in parties often being representatives of regions instead of districts or the nation (Shin, Jin, Gross and Eom 2005). However, the level of regional voting has been a difficult topic to examine, because voting is secret and a survey respondent may give a politically correct answer.

1) By definition, regionalism refers to the voters' affective identifications with, and support for, candidates with roots in their respective regions (Kim and Koh 1980, 81).

Since Goodman's seminal work (1959), several techniques for cross-level inference have been developed to solve the ecological fallacy problem (Palmquist 1993). King's solution (1997) is well known for producing efficient and robust estimates. In addition, it contains information on the uncertainty of estimates at the level of analysis. After explaining King's solution, we analyze regional voting in the 16th National Assembly elections. The estimates from King's solution are compared with those from survey results. We found that estimates from King's solution closely match those from survey results if the cell frequency in the survey results is large enough. King's solution produces more reliable estimates when the cell frequency in the survey results is small.

II. Ecological Fallacy and Ecological Regressions

Cross-level inference is "the process of using aggregate (i.e., "ecological") data to infer discrete individual level relationships of interest" (King 1997, xv). It provides a solution for the problem of ecological fallacy. In this section, we introduce the problem of ecological fallacy. We then describe a series of efforts to solve this problem.

1. Ecological Fallacy

It is well known that using aggregate level data to figure out individual level relationships generates the ecological fallacy problem, which produces biased and inefficient estimates (Palmquist 1993). For example, suppose that our research question is to compare the level of literacy between the foreign born and the native (Robinson 1950). Further, assume that we have three groups (the sophisticated, the regular and the foreign born), and that both the sophisticated and the foreign born prefer to live in a city while the regular prefer to live in a rural area. If a researcher regresses the literacy rate on the percentage of the foreign born at the county level, he or she may find that the greater the percentage of foreign born, the higher the literacy rate. This would be a surprising result, because the foreign born are less likely to be literate. However, if one analyzes the relationship at the individual level, he or she may find a different and more convincing result: the native tend to have a higher literacy rate than the foreign born. This discrepancy occurs because the sophisticated as well as the foreign born reside in the same type of area, i.e., the city. Without consideration of the aggregation unit, findings from aggregate data misrepresent the individual level relationship. This shows the inappropriateness of using aggregate data to examine the individual level relationship.
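The literacy example can be made concrete with a small sketch. The county sizes and group literacy rates below are purely hypothetical assumptions chosen for illustration (they are not Robinson's data or data from this paper), and the names COUNTIES, LITERACY, ols_slope and summarize are introduced only for this sketch. It contrasts the county-level regression slope with the individual-level literacy rates.

```python
# Hypothetical counties: (sophisticated natives, regular natives, foreign born).
# All counts and rates are illustrative assumptions, not data from the paper.
COUNTIES = {
    "city_a": (600, 100, 300),
    "city_b": (700, 100, 200),
    "rural_c": (100, 850, 50),
    "rural_d": (60, 920, 20),
}
LITERACY = (0.95, 0.70, 0.50)  # assumed literacy rate of each group


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def summarize(counties, literacy):
    """Compare the aggregate regression with individual-level rates."""
    foreign_share, literacy_rate = [], []
    native_total = native_literate = 0.0
    foreign_total = foreign_literate = 0.0
    for soph, regular, foreign in counties.values():
        population = soph + regular + foreign
        lit_native = literacy[0] * soph + literacy[1] * regular
        lit_foreign = literacy[2] * foreign
        # County-level (aggregate) observations
        foreign_share.append(foreign / population)
        literacy_rate.append((lit_native + lit_foreign) / population)
        # Individual-level totals, normally unobserved
        native_total += soph + regular
        native_literate += lit_native
        foreign_total += foreign
        foreign_literate += lit_foreign
    return {
        "aggregate_slope": ols_slope(foreign_share, literacy_rate),
        "native_rate": native_literate / native_total,
        "foreign_rate": foreign_literate / foreign_total,
    }


result = summarize(COUNTIES, LITERACY)
print(result)
```

Under these assumed numbers the county-level slope is positive even though the native literacy rate exceeds the foreign born rate, which is the discrepancy described in the text.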

2. Ecological Regressions

To solve the aggregation bias, several methods have been introduced. A common assumption the models make can be described in the table below.

<Table 1> Robinson's problem

                        Literate (L)   Illiterate (IL)   Marginal
The Native (N)               ?               ?             20000
The Foreign born (F)         ?               ?              1000
Marginal                   15000            6000           21000

Let's suppose that in a total population of 21,000 we observe only the marginal population values for the Native (N) and the Foreign born (F): 20,000 and 1,000. We also know the marginal values for the Literate (L) and the Illiterate (IL). Our research problem is to find the cell frequencies marked with question marks: how many of the native are literate, and how many of the foreign born are literate? We then calculate literacy ratios for the native and the foreign born and examine whether birthplace is related to literacy.

One way to solve this problem is as follows. Let's suppose that we know the value of the upper left cell by pure luck: the number of the literate among the native is 15,000. Once we have this information, we can calculate the rest of the cell values accordingly. The results are shown in table 2.
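The arithmetic of filling in the remaining cells can be sketched in a few lines. This is only an illustration of the bookkeeping; the function name complete_table is a hypothetical helper introduced here, and the inputs are the marginals of table 1 together with the upper left cell assumed known in the text.

```python
def complete_table(row_totals, col_totals, top_left):
    """Fill a 2x2 table from its marginals and the upper-left cell."""
    (row_1, row_2), (col_1, col_2) = row_totals, col_totals
    if row_1 + row_2 != col_1 + col_2:
        raise ValueError("row and column marginals must sum to the same total")
    top_right = row_1 - top_left          # illiterate natives
    bottom_left = col_1 - top_left        # literate foreign born
    bottom_right = row_2 - bottom_left    # illiterate foreign born
    cells = [[top_left, top_right], [bottom_left, bottom_right]]
    if min(min(row) for row in cells) < 0:
        raise ValueError("known cell is inconsistent with the marginals")
    return cells


# Marginals from table 1; 15,000 literate natives is the value assumed known.
cells = complete_table((20000, 1000), (15000, 6000), 15000)
# Row-wise literacy ratios (share of each row falling in each column).
ratios = [[cell / sum(row) for cell in row] for row in cells]
print(cells)
print(ratios)
```

A single known cell is enough because a 2x2 table with fixed marginals has only one free cell.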

<Table 2> A solution for Robinson's problem

                        Literate (L)   Illiterate (IL)   Marginal
The Native (N)             15000          (5000)           20000
The Foreign born (F)        (0)           (1000)            1000
Marginal                   15000           6000            21000

Since the native population who can read and write is 15,000, the number of the illiterate among the native is 5,000 in a population of 20,000. In addition, the number of all the literate is 15,000 and the number of the literate native is 15,000, and thus the number of the literate foreign born is zero. For the purpose of comparison, table 2 can be rewritten as table 3.

<Table 3> Literacy Ratios

                        Literate (L)   Illiterate (IL)   Marginal
The Native (N)              0.75            0.25            0.95
The Foreign born (F)        0.00            1.00            0.05
Marginal                    0.71            0.29            1.00

Of the native, the literacy ratio is 0.75 while it is 0.00 for the foreign born. This leads to the conclusion that the native are more likely to be literate than the foreign born. This example shows that we are able to disaggregate aggregate data if we "correctly" impose some constraints on the parameter of interest. In this case, we assumed that some information on the number of the literate among the native is available. We can generalize table 3 in the following table.

<Table 4> General Form of Ecological Regression

                        Literate (L)     Illiterate (IL)      Marginal
The Native (N)          $\beta_i^N$      $1-\beta_i^N$        $X_i$
The Foreign (F)         $\beta_i^F$      $1-\beta_i^F$        $1-X_i$
Marginal                $T_i$            $1-T_i$

$X_i$ is the proportion of the native, $\beta_i^N$ is the literacy ratio for the native, and $\beta_i^F$ is the literacy ratio for the foreign born. $T_i$ is the proportion of the literate and "i" is an aggregation unit. With some constraints, the parameters of interest ($\beta_i^N$ and $\beta_i^F$) can be calculated using the aggregate values of $X_i$ and $T_i$.

With this general form, two approaches for ecological regression have been developed: the method of bounds and the statistical approach. The method of bounds uses deterministic information in the data (Achen and Shively 1995). Let's suppose once again that we attempt to estimate the literacy ratio among the native and the foreign born with aggregate information. The relationship in table 4 can be written as follows:

$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i)$   (1)

Then,

$\beta_i^N = \frac{T_i}{X_i} - \frac{1 - X_i}{X_i} \beta_i^F$

Since the βs are proportions, they must lie between 0 and 1: $0 \le \beta_i^N \le 1$ and $0 \le \beta_i^F \le 1$. In addition, if $\beta_i^F = 0$, $\frac{T_i}{X_i}$ becomes the maximum value for $\beta_i^N$, while if $\beta_i^F = 1$, $\frac{T_i - (1 - X_i)}{X_i}$ becomes the minimum value for $\beta_i^N$. Hence, the lower and upper limits for $\beta_i^N$ are:

$\left[ \max\left( \frac{T_i - (1 - X_i)}{X_i}, 0 \right), \min\left( \frac{T_i}{X_i}, 1 \right) \right]$.

With the same procedure, we can obtain the lower and upper limits for $\beta_i^F$ as follows:

$\left[ \max\left( \frac{T_i - X_i}{1 - X_i}, 0 \right), \min\left( \frac{T_i}{1 - X_i}, 1 \right) \right]$.

With our example, the range of plausible values for $\beta_i^N$ is [Max(0.7, 0), Min(0.75, 1)] = [0.7, 0.75]. Therefore, we can significantly narrow down the plausible values of the parameter $\beta_i^N$. However, in many cases the method of bounds produces bounds that are too broad, especially when the distribution of the proportions ($X_i$ and/or $T_i$) is considerably skewed. For example, the range of $\beta_i^F$ values in our example is [Max(-5, 0), Min(15, 1)] = [0, 1]. In this case, the method of bounds does not reduce the range of plausible values for $\beta_i^F$.
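The bounds in the running example can be checked numerically. The following is a minimal sketch, assuming the accounting identity $T_i = \beta_i^N X_i + \beta_i^F (1 - X_i)$ for a single unit; the function name method_of_bounds is a hypothetical helper introduced here, and the inputs are the marginal proportions implied by table 1.

```python
def method_of_bounds(x, t):
    """Deterministic bounds on (beta_N, beta_F) given T = beta_N*X + beta_F*(1-X).

    x: proportion of the native in the unit (0 < x < 1)
    t: proportion of the literate in the unit (0 <= t <= 1)
    """
    if not (0 < x < 1 and 0 <= t <= 1):
        raise ValueError("need 0 < x < 1 and 0 <= t <= 1")
    # beta_F = 1 gives the lower limit of beta_N, beta_F = 0 the upper limit.
    beta_n = (max((t - (1 - x)) / x, 0.0), min(t / x, 1.0))
    # beta_N = 1 gives the lower limit of beta_F, beta_N = 0 the upper limit.
    beta_f = (max((t - x) / (1 - x), 0.0), min(t / (1 - x), 1.0))
    return beta_n, beta_f


# Marginal proportions from table 1: 20,000 natives and 15,000 literate
# in a population of 21,000.
x = 20000 / 21000
t = 15000 / 21000
beta_n_bounds, beta_f_bounds = method_of_bounds(x, t)
print(beta_n_bounds)  # approximately (0.70, 0.75)
print(beta_f_bounds)  # (0.0, 1.0): uninformative for the small group
```

The small group (here the foreign born, about 5 percent of the population) receives the full [0, 1] interval, which illustrates why skewed $X_i$ makes the bounds uninformative.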

The second approach for ecological regressions uses the logic of statistical association. If there is an association between variables, it will occur across units with some fluctuation. The first model was developed by Leo A. Goodman (1959). He argues that if we can reasonably assume three things, we can infer individual behavior from aggregate data. His assumptions are a constant effect of parameters, a linear function, and a normal distribution of residuals. Following his suggestion, equation (1) can be rewritten as follows:

$T_i = \beta^N X_i + \beta^F (1 - X_i) + e_i$,   (2)

where $e_i$ is the residual and the βs are constant across units. If the three conditions are met, he argues, the parameters (βs) and their standard errors are correctly inferred.

Goodman's model, however, has several problems (Voss 2000). First, the constant effect assumption is not substantively reasonable. For example, requiring the literacy ratio for the native to be constant across units is too restrictive. If the parameters (βs) vary with the unit, the estimates may over- or underestimate the true βs due to the aggregation bias. Second, since Goodman's model produces only a single estimate, it is hard to know individual behavior within a unit.

A series of models have been developed to solve or relax these assumptions (Achen and Shively 1995). For example, the homogeneous model utilizes, rather than estimates, information from observed data. That is, in our example, the homogeneous model observes the literacy ratio among the entire native population, and then uses this ratio as a benchmark for inferring individual relationships. The same procedure is applied to the literacy ratio for the entire foreign born population. It is only useful when units are highly segregated, however. It becomes unreliable when both the native and the foreign born are mixed in the same unit.

The informed assumption model uses informed knowledge instead of the observed literacy ratio. For example, we may have prior information that all of the foreign born are illiterate. In this case, $\beta^F$ becomes zero, and we can use this information to calculate $\beta^N$ and the rest of the βs, as shown in table 4. However, in most cases, prior information is unattainable. Moreover, researchers receive no warning when this informed knowledge is incorrect, which results in biased estimates of the parameters (Voss 2000).

The final example of ecological regressions has a different premise. The neighborhood model assumes that the parameters of interest are the same within a unit ($\beta_i^N = \beta_i^F$), yet vary across units ($\beta_i^N \ne \beta_j^N$, where $i \ne j$). Therefore, equation (2) becomes

$T_i = \beta_i X_i + \beta_i (1 - X_i) + e_i = \beta_i + e_i$,

where $\beta_i$ is a function of $X_i$. For example, this model assumes that the literacy ratio of the native and that of the foreign born are the same within the same unit, while the literacy ratio varies across units. As one may notice, the assumption the neighborhood model makes is too strong. Even if it is a plausible assumption, we do not have to estimate a model, because we already have an answer to our research question of whether birthplace is related to the level of literacy.

With the exception of the neighborhood model, a survey of ecological regressions shows some common problems. First, all of the models assume a constant effect of parameters. This seems too restrictive, because the parameters of interest are hardly constant across units. Second, the models produce only a single estimate. Since we attempt to infer the individual level relationship, a single estimate is unlikely to be satisfactory. Finally, if an equation has more than two parameters to be estimated, it is hard to imagine how these methods can be extended.

Gary King (1997) provides an interesting method to solve these problems. First, he does not assume a constant effect; rather, he assumes that a parameter varies with a common underlying dimension. Second, because of the varying parameter, we may have an estimate per unit. In addition, since his method uses additional information from the method of bounds, the estimates become more efficient. His method can be written as follows (King 1997, 93-94):

$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i) + e_i$, where $P(\beta_i^N, \beta_i^F) = TN(\beta_i^N, \beta_i^F \mid B, \Sigma)$   (3)

The probability density of the parameters ($\beta_i^N$, $\beta_i^F$) follows a truncated normal distribution with limits $\beta_i^N \in [0, 1]$ and $\beta_i^F \in [0, 1]$. With the help of the method of bounds, these limits for ($\beta_i^N$, $\beta_i^F$) can be narrowed down as follows:

$\beta_i^N \in \left[ \max\left( \frac{T_i - (1 - X_i)}{X_i}, 0 \right), \min\left( \frac{T_i}{X_i}, 1 \right) \right]$

$\beta_i^F \in \left[ \max\left( \frac{T_i - X_i}{1 - X_i}, 0 \right), \min\left( \frac{T_i}{1 - X_i}, 1 \right) \right]$.

The mean and variance matrix of ($\beta_i^N$, $\beta_i^F$) are

$B = \begin{pmatrix} B^N \\ B^F \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_N^2 & \sigma_{NF} \\ \sigma_{NF} & \sigma_F^2 \end{pmatrix}$.

If his three assumptions are met, estimation produces an efficient and robust estimate. 2)

2) The three assumptions are a single model of the parameters, the absence of spatial correlation, and no correlation between the marginals and the parameters. King, Rosen, and Tanner (1999, 67-68) show, however, that violation of the third assumption does not produce biased estimates if the bounds of the parameters are narrow enough.

The estimation procedure of King's solution can be summarized as follows:

1) The first step calculates the bounds of the parameters.
2) The second step estimates the parameters from truncated bivariate normal distributions within the bounds.

If one extends the model to more than two parameters, the estimates from the first estimation are used as marginal values. For example, if one is interested in the proportion of regional voting, he or she first estimates the turnout rate among those who were born in a certain region in a given district. The estimated turnout rate is then used to form the marginal values for the proportion of regional voting. This can be diagrammed as below:

<Figure 1> King's Solution: the first step

                  Vote               Not vote             Marginal
Jeolla            $\beta_i^J$        $1-\beta_i^J$        $X_i$
Other Regions     $\beta_i^{J'}$     $1-\beta_i^{J'}$     $1-X_i$
Marginal          $T_i$              $1-T_i$

where $X_i$ is the proportion of the voting age population who were born in Jeolla, $T_i$ is the proportion of voters who turn out to vote, $\beta_i^J$ is the turnout rate among those who were born in Jeolla, $\beta_i^{J'}$ is the turnout rate among those who were born in a region other than Jeolla, and "i" is a district indicator.

<Figure 2> King's Solution: the second step

Columns: Vote (NCNP, Other parties), Not vote, Marginal
Jeolla: $\lambda_i^J$, $1-\lambda_i^J$, $\beta_i^J$, $1-\beta_i^J$, $x_i$
Other Regions: $\lambda_i^{J'}$, $1-\lambda_i^{J'}$, $\beta_i^{J'}$, $1-\beta_i^{J'}$, $1-x_i$
Marginal: $P_i$, $1-P_i$, $T_i$

where "$x_i$" is the estimated proportion of voters whose hometown is in Jeolla and who turn out to vote, $P_i$ is the vote share for a candidate whose party label is the National Congress for New Politics (NCNP), and $\lambda_i^J$ is the proportion of regional voting.

The first step is to examine the turnout rates ($\beta_i^J$ and $\beta_i^{J'}$) for those who were born in Jeolla ($X_i$) and for those who were born in areas other than Jeolla ($1-X_i$). Once we obtain estimates for the βs, these estimates are used to calculate the marginal values for the regional voting estimates ($x_i$ and $1-x_i$). The second step starts with the calculation of the bounds of the parameters (λs) and then estimates the parameters across units. Note, however, that since $x_i$ has a component to be estimated, it is not a fixed variable. Therefore, extending tables produces more uncertain estimates due to the added uncertainty originating from the first estimation. 3) In the next section, we apply King's solution to find the level of regional voting in the Korean National Assembly elections of 2000.

3) Note that the parameters (λ and (1-λ)) are weighted by the voting age population in a given district.

III. Application: Disaggregating Regional Voting

The 2000 election outcomes in Korea suggest that three regions tend to exhibit partisan regionalism: Jeolla, Gyeongsang, and Chungcheong. The Jeolla region covers the Jeollabuk-do and Jeollanam-do areas, the Gyeongsang region refers to the Gyeongsangbuk-do and Gyeongsangnam-do areas, and the Chungcheong region means the Chungcheongbuk-do and Chungcheongnam-do areas. Regional dominance by a particular party was specified in terms of the birthplace of particular party leaders. A leader of the Grand National Party, Kim Yong Sam, was born in the Gyeongsang region; a leader of the National Congress for New Politics, Kim Dae Jung, in the Jeolla region; and a leader of the United Liberal Democrats, Kim Chong Phil, in the Chungcheong region. This link between the birthplace of a party leader and the dominance of a particular party is well documented in contemporary Korean politics (Kim 1994; Lee 1998; Lee and Brunn 1996).

In this section, using Gary King's ecological regression, we attempt to disaggregate aggregate votes along regional party lines in a district. King's method infers regional voting at the candidate level. The percentage of regional voting at the candidate level is averaged across regional blocs and compared to estimates from survey results. The following equations are to be estimated:

[equations (4), (5) and (6) are missing in the source]

where P is the vote share of a candidate, λ is the proportion of regional voting, and λ' is the proportion of non-regional voting. J indicates Jeolla, G Gyeongsang, and C Chungcheong. "$x_i$" is [expression missing in the source], where β is the turnout rate for those who were born in a certain region, and $X_i$ is the proportion of voters who were born in a certain region. "i" indicates a district. The level of analysis is the candidate level. Estimation is done with the program called "EzI." 4)

In the 16th National Assembly elections of Korea (April 13, 2000), 194 incumbent and 449 non-incumbent candidates ran for office (National Election Commission 2000). We focus our analysis on the vote share for candidates of the three major parties. 5) The percentage of those who registered their birthplace in a given district was collected with the help of one of the major parties. 6) The results are shown in table 5.

4) "EzI" was developed by Kenneth Benoit and Gary King (released in 2001). It is available from http://gking.harvard.edu/stats.shtml, visited May 1, 2002.
5) We focus on only these three parties because they comprised over 96% of the single member district seats in the 16th National Assembly Election.
6) Because of a contributor's request, the source of the data has not been released. Data on districts in Chungcheong-do are not available, so the number of districts in this study is 106.

<Table 5> Regional Voting Estimates from Ecological Inference

Regional Voting Level (Average λs)

Regional Blocs         GNP        NCNP       ULD
Seoul                  61.00%     75.27%     2.57%
Busan                  67.29%     64.66%     1.62%
Daegu                  62.02%     70.49%     2.26%
Incheon                61.05%     73.75%     2.51%
Gwangju                56.14%     79.46%     0.95%
Ulsan                  57.61%     66.25%     1.23%
Gyeonggi-do            60.43%     73.92%     2.48%
Gangwon-do             60.14%     71.80%     2.61%
Jeollabuk-do           57.52%     61.04%     1.76%
Jeollanam-do           43.85%     63.86%     0.00%
Gyeongsangbuk-do       54.88%     71.17%     1.83%
Gyeongsangnam-do       55.35%     68.99%     2.18%
Average                52.21%     61.74%     1.04%

Source: compiled by the authors.
Note: Average λ is calculated by averaging district level regional voting ($\lambda_i$) within regional blocs. GNP stands for the Grand National Party, NCNP for the National Congress for New Politics and ULD for the United Liberal Democrats.

Table 5 shows that on average the percentage of regional voting (61.74%) is the highest among those who were born in Jeolla, which may benefit candidates of the National Congress for New Politics. The percentage of regional voting for those who were born in Gyeongsang ranked second. Not surprisingly, the level of regional voting is the lowest for those who were born in Chungcheong. This resulted in less concentration of votes on candidates running under the United Liberal Democrats (ULD). In the 15th National Assembly Elections of 1996, the ULD won 25 of the 28 seats. But in the elections of 2000, the ULD was only able to win 11 of the 24 seats in Chungcheong.

The same pattern holds when one examines the percentage of regional voting within a regional bloc. For example, 79.46 percent of those who were born in Jeolla cast a regional vote if they reside in districts within Gwangju. More than 70 percent of voters who were born in Jeolla also voted for candidates of the NCNP in districts within Seoul, Daegu, Incheon, Gyeonggi-do, Gangwon-do and Gyeongsangbuk-do.

Those who were born in Gyeongsang tend to cast slightly fewer regional votes, yet still at a significant level. On average, more than half of the voters who were born in Gyeongsang cast a regional vote in the 16th National Assembly elections. This is especially the case in districts within Seoul, Busan, Daegu, Incheon, Gyeonggi-do, and Gangwon-do, where more than 60 percent of voters who were born in Gyeongsang voted for candidates of the GNP. It is also the case, though to a lesser extent, for those who reside in Gwangju, Ulsan, Jeollabuk-do, Gyeongsangbuk-do, and Gyeongsangnam-do.

Not surprisingly, those who were born in Chungcheong cast the fewest regional votes. Only a handful of voters who were born in Chungcheong cast a regional vote on average. However, it should be noted that data for districts within the Chungcheong area were not available, and thus the level may be underestimated.

In sum, regional voting estimates from King's method provide supportive evidence for the argument that regional voting is a nationwide problem (Kim 1994; Lee 1998; Lee and Brunn 1996). Not only is the level of regional voting significant in districts within the known regional voting blocs (Jeolla and Gyeongsang), but it also appears to be substantial in districts outside these regional blocs. However, there is significant fluctuation in the level of regional voting across regions. For example, the percentage of regional voting for those who were born in Jeolla varies from 61.04% in Jeollabuk-do to 79.46% in Gwangju, while it varies from 43.85% in Jeollanam-do to 67.29% in Busan for voters who were born in Gyeongsang. We can conclude that the level of regional voting is not constant, but varies across regional blocs.

The results from ecological inference can be verified by survey results. The procedure is the same as above except that the figures are obtained from individual level data. The first step is to identify voters who were born in a certain region, and then to calculate how many of these voters voted for the pertinent party. The Korean Social Science Data Center conducted a survey of the 16th National Assembly Elections on April 13, 2000 (Korean Social Science Data Center 2000). A multistage quota sampling technique was used to collect a random sample by regional blocs. 1,100 interviews were completed with a rejection rate of 5 percent. Fortunately, the survey includes questions on the hometown and the vote choice of a respondent. These two questions were used to construct regional voting; for example, if a respondent was born in the Jeolla area and voted for the NCNP, the response is coded as a regional vote for the NCNP. In Seoul, fifty five respondents were born in Jeolla. Thirty four out of the fifty five voted for the NCNP. Therefore, the percentage of regional voting for the NCNP is 61.82 percent for Seoul. Table 6 shows the percentage of regional voting in regional blocs.

<Table 6> Regional Voting Estimates from Survey Results

Regional Voting (Percentage/Frequency)

Regional Blocs                    GNP              NCNP              ULD
Seoul                             46.88% (32)      61.82% (55)       3.03% (33)
Incheon/Gyeonggi-do               65.00% (20)      57.14% (35)       7.50% (40)
Gangwon-do                         0.00% (1)      100.00% (1)        0.00% (3)
Daejeon/Chungcheongnam-do          0.00% (4)       20.00% (5)       20.34% (59)
Chungcheongbuk-do                 33.33% (3)        0.00% (1)       27.59% (29)
Gwangju/Jeollanam-do              25.00% (4)       50.68% (73)       0.00% (4)
Jeollabuk-do                       0.00% (1)       60.00% (40)       0.00% (5)
Busan/Ulsan/Gyeongsangnam-do      64.24% (151)     31.25% (16)       0.00% (8)
Daegu/Gyeongsangbuk-do            53.15% (111)     33.33% (3)       25.00% (4)
Average                           30.09%           46.02%            9.27%

Source: The Korean Social Science Data Center (2000). Figures in parentheses are the number of respondents.

Table 6 shows that regional voting is the most evident for those who were born in Jeolla, followed by those who were born in Gyeongsang and in Chungcheong. The level of regional voting is slightly lower than that from King's solution. On average, 46.02 percent voted for the NCNP if they were born in Jeolla, while it is 30.09 percent if voters were born in Gyeongsang. Significant fluctuation appears across regional blocs. In particular, if the cell frequency is too small, the variation of regional voting is beyond the acceptable range. For example, in Gangwon-do, where the cell frequency is one, the percentage of regional voting is 100 percent among those who were born in Jeolla, while it is zero percent in Chungcheongbuk-do, where the cell frequency is also one.

If one focuses on the level of regional voting where the number of respondents is sufficient, we find similarity in the level of regional voting between estimates from King's method and estimates from survey results. For example, according to survey results, the percentage of regional voting in Seoul is 61.82 percent for the NCNP while the comparable figure is 75.27 percent by the ecological regression. It is 60 percent in Jeollabuk-do by survey results, while it is 61.04 percent by the ecological regression. We conclude that, where the number of respondents is sufficient, estimates from King's solution closely match those from survey results.

IV. Conclusion

Applying aggregate level findings to individual level relationships generates biased estimates, known as the ecological fallacy problem. Social scientists often encounter difficulty in conducting research at the individual level with aggregate data. In particular, if a research question concerns a past event for which survey data are not available, it is almost impossible to pursue the research. Further, if there is a politically correct answer to survey questions, it is hard to obtain unbiased estimates.

In this paper, we introduced a way to infer individual-level relationships from aggregate data. We began with aggregation bias, which leads to the ecological fallacy problem. A series of approaches have been suggested to address aggregation bias. We emphasized the method by Gary King, which combines the method of bounds with statistical association. King's solution is well known as a method that produces robust and efficient estimates even when aggregation bias is severe.

The solution was applied to infer regional voting at the candidate level. The percentage of regional voting was averaged across regional blocs, and the average percentages were then compared to estimates from survey results. We found that the estimates from King's method closely match the estimates from survey results if the cell frequency in the survey results is large enough. We also found that the former are more reliable than the latter if the cell frequency in the survey results is small.

Ecological regressions offer a new avenue for examining questions that were previously impossible to answer. For example, we can examine who voted for Chung-hee Park in the 1963 Korean presidential election. We can further ask why they voted for him; for example, was a generation effect related to the outcome of the 1963 Korean presidential election? Furthermore, we can use ecological regressions to examine whether a voter who votes for one party's candidate in a congressional election votes for a different party's candidate in a presidential election (Burden and Kimball 1998).

We should note, however, that ecological regressions also have limitations. If tables are extended beyond 2 by 2, the uncertainty of the estimates increases. Scholars of ecological regression are attempting to reduce this uncertainty (King, Rosen, and Tanner 1999; Rosen, Jiang, King, and Tanner forthcoming).
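The survey-based check reported in Table 6 amounts to simple arithmetic, and a short Python sketch may make the role of cell frequency concrete. It recomputes the Seoul figure (34 of 55 Jeolla-born respondents voting for the NCNP) and, as an illustration that is not part of the original analysis, attaches a 95 percent Wilson score interval to cells of different sizes. The function names and the choice of the Wilson interval are assumptions made here for illustration.

```python
import math


def regional_vote_share(votes_for_party, born_in_region):
    """Percentage of respondents born in a region who voted for its party."""
    return 100.0 * votes_for_party / born_in_region


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (illustrative choice)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Cells taken from Table 6 (NCNP column): (votes for NCNP, respondents born in Jeolla).
# The Seoul counts are stated in the text; the single-respondent cells follow
# directly from the reported 100% and 0% with a cell frequency of one.
cells = {
    "Seoul": (34, 55),
    "Gangwon-do": (1, 1),
    "Chungcheongbuk-do": (0, 1),
}

for bloc, (votes, total) in cells.items():
    share = regional_vote_share(votes, total)
    low, high = wilson_interval(votes, total)
    print(f"{bloc}: {share:.2f}% (n={total}), "
          f"interval {100 * low:.1f}% to {100 * high:.1f}%")
```

With 55 respondents the interval is reasonably narrow, whereas a single respondent leaves an interval spanning most of the possible range, which is consistent with the caution about small cell frequencies above.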

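The method of bounds that King's solution builds on can be sketched for a single 2 by 2 table. Assume x is the share of a district's voters born in a given region and t is the share of the district's voters who chose the regional party; both symbols and the numbers below are hypothetical and are not taken from the data in this paper. The accounting identity t = b*x + w*(1 - x), with b and w between 0 and 1, restricts b, the share of region-born voters who supported the party, to a deterministic interval. The sketch covers only this bounds step, not King's full estimator.

```python
def bounds_on_regional_vote(x, t):
    """Deterministic bounds on b given t = b*x + w*(1 - x), with 0 <= b, w <= 1.

    x: share of voters born in the region (must be greater than 0)
    t: share of voters choosing the regional party
    """
    if not (0.0 < x <= 1.0) or not (0.0 <= t <= 1.0):
        raise ValueError("x must be in (0, 1] and t must be in [0, 1]")
    # Lower bound: assume every voter born elsewhere (share 1 - x) chose the party.
    lower = max(0.0, (t - (1.0 - x)) / x)
    # Upper bound: assume every vote for the party came from region-born voters.
    upper = min(1.0, t / x)
    return lower, upper


# Hypothetical districts: (share born in region, share voting for the party).
for x, t in [(0.8, 0.7), (0.3, 0.5)]:
    low, high = bounds_on_regional_vote(x, t)
    print(f"x={x}, t={t}: b lies between {low:.3f} and {high:.3f}")
```

In the first hypothetical district the bounds are informative because most voters were born in the region; in the second they span the whole unit interval, which is where the statistical part of the method has to carry the inference.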
< REFERENCE >

Achen, Christopher H. and W. Phillips Shively (1995). Cross-Level Inference. Chicago: University of Chicago Press.

Benoit, Kenneth and Gary King (1996). "A Preview of EI and EzI: Program for Ecological Inference." Social Science Computer Review 14: 433-438.

Burden, Barry C. and David C. Kimball (1998). "A New Approach to the Study of Ticket Splitting." American Political Science Review 92: 533-544.

Goodman, Leo (1959). "Some Alternatives to Ecological Correlation." American Journal of Sociology 64: 610-625.

Kim, Jae-On and B.C. Koh (1980). "The Dynamics of Electoral Politics: Social Development, Political Participation, and Manipulation of Electoral Laws." In Political Participation in Korea: Democracy, Mobilization, and Stability, edited by Chong Lim Kim. Santa Barbara: CLIO Books. 59-84.

King, Gary, Ori Rosen, and Martin A. Tanner (1999). "Binomial-Beta Hierarchical Models for Ecological Inference." Sociological Methods & Research 28: 61-90.

King, Gary (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton, NJ: Princeton University Press.

Korean Social Science Data Center (2000). A Survey on Voters' Attitudes toward the 16th General Election. Seoul: Korean Social Science Data Center.

Lee, Dong Ok and Stanley D. Brunn (1996). "Politics and regions in Korea: an analysis of the recent presidential election." Political Geography 15: 99-119.

Lee, Nam Young (1998). "Regionalism and Voting Behavior in South Korea." Korea Observer 29: 611-633.

National Election Commission. http://www.nec.go.kr (2000. 6. 1).

Palmquist, Bradley Lowell (1993). Ecological Inference, Aggregate Data Analysis of U.S. Elections, and the Socialist Party of America. Ph.D. dissertation, University of California, Berkeley.

Robinson, W. S. (1950). "Ecological Correlations and the Behavior of Individuals." American Sociological Review 15: 351-357.

Rosen, Ori, Wenxin Jiang, Gary King, and Martin A. Tanner (Forthcoming). "Bayesian and Frequentist Inference for Ecological Inference: the R X C Case." Statistica Neerlandica.

Shin, Myungsoon, Youngjae Jin, Donald A. Gross, and Kihong Eom (2005). "Money Matters in Party-Centered Politics: Campaign Spending in Korean Congressional Elections." Electoral Studies 24: 85-101.

Voss, D. Stephen (2000). Familiarity Doesn't Breed Contempt: The Political Geography of Racial Polarization. Ph.D. dissertation, Harvard University.