An Exploratory Spatial Data Analysis of Income and Education Inequality in Pakistan

Sofia Ahmed
Joint Doctoral Program in International Economics, SIS/CIFREM
October 2009

This draft is preliminary and incomplete, not for citation.

Abstract

Econometric studies of income inequality generally treat regions as independent entities, ignoring the likely possibility of spatial interaction, particularly within a country. This interaction may cause spatial dependency or clustering, which is referred to as spatial autocorrelation. This chapter analyzes the relationship between the spatial clustering of income and education in the districts of Pakistan by employing exploratory spatial data analysis (ESDA) techniques. Global and local measures of spatial autocorrelation were computed using the Moran's I index to obtain estimates of the existing spatial autocorrelation in income and education levels across districts. The results reveal a surprising absence of knowledge spillovers in terms of education attainment rates across districts close to large cities with high education attainment rates. District-wise incomes, on the other hand, reveal a clear spatial autocorrelation pattern whereby high income districts tend to be neighbors of other high income districts. By detecting outliers and clusters, ESDA allows policy makers to focus on the geography of inequalities. This highlights the need to pursue spatial analysis at lower geographical units such as the district level, instead of the common practice of provincial analysis in Pakistan.

Introduction

This dissertation analyzes the spatial evolution of income inequality and its causes in Pakistan. As a first step, this chapter investigates whether spatial clustering of income and average education levels can explain the distribution of income across Pakistani districts. The technique used is exploratory spatial data analysis (ESDA), which describes and visualizes spatial distributions, identifies spatial outliers, detects agglomerations and local spatial autocorrelation, and highlights the types of spatial heterogeneity (Haining 1990; Bailey and Gatrell 1995; Anselin 1988; Le Gallo and Ertur 2003; Oort 2004, p. 107).

The chapter is organized as follows. Section 1 describes the data; Section 2 gives an overview of the methodology; Section 3 explains the global and local techniques for detecting spatial autocorrelation; Section 4 analyzes the results of applying the ESDA techniques to district income and education data; finally, Section 5 summarizes and evaluates.

1. Data

This study uses micro data from the Pakistan Social and Living Standards Measurement survey (PSLM), which has been produced annually by the Federal Bureau of Statistics (FBS) of Pakistan since 2004. It is the only socio-economic micro data set that is representative at both the provincial and the district level. Collecting data at the district level was a sensible initiative, since such data were required for planning in the context of the decentralization that began in 2004. Moreover, the sample size of the district level data is substantially larger than that of the provincial level data contained in micro data surveys such as the Household Income and Expenditure Survey (HIES) of Pakistan and the Labour Force Survey (LFS) of Pakistan. This enables researchers to draw socio-economic information that is representative at lower administrative levels as well.
Currently, this study uses only the PSLM survey of 2005-06, but it aims to extend the estimations over the period from 2000 to 2009 in order to study temporal changes along with spatial changes. The PSLM survey for 2005-06 provides district level welfare indicators for a sample of about 15,453 households. The data are statistically comparable with the Pakistan Census Data (1998), with some margin of sampling error. The survey covers districts in all four provinces of Pakistan, namely Punjab, Sindh, North West Frontier Province (NWFP), and Balochistan. The Federally Administered Tribal Areas (FATA region) along the Afghan border in the north west and Azad Kashmir are not included in the data.

The PSLM is divided into two parts. The first part contains data on socio-economic characteristics such as education, health, population welfare, immunization, pre/post natal care, family planning, water supply, and sanitation. The second part contains household income and expenditure data. The quality and reliability of the PSLM data are ensured through cross checking of field work at various stages. Regional/field offices carry out initial data editing before the data are handed over to the Federal Bureau of Statistics in Islamabad, where they undergo various consistency checks by the data entry programs.

2. Methodology

Because data collected at the provincial or rural/urban level of disaggregation are abundant, most socio-economic studies of Pakistan are province-based. Pakistani provinces, however, are extremely diverse internally in terms of their economic structures, cultures, languages, natural resources and geography. Regional policy making therefore requires analyzing socio-economic issues at an even smaller geographical disaggregation. For this reason, the spatial unit of analysis in this study is the district.

In terms of geographical disaggregation, Pakistan (excluding the Federally Administered Tribal Areas (FATA) and Azad Kashmir) has 4 levels consisting of 4 provinces, 107 districts, 377 sub-districts, and 45,653 villages. A lower level unit of analysis is not used for two main reasons. Firstly, data on regional scales below the district level in Pakistan suffer from reliability issues. The second issue is more technical.
In order to provide information on the 45,653 villages of Pakistan instead of the 107 districts, the project would need a distance matrix with 45,653 × (45,653 + 1) / 2 = 1,042,121,031 free elements to be evaluated; hence the use of district level data.
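The count of free elements quoted above follows from the symmetry of a distance matrix: only the diagonal and one triangle need to be evaluated, which gives n(n + 1)/2 entries. A minimal sketch of that arithmetic is given here; the function name `free_elements` is introduced only for illustration.

```python
def free_elements(n: int) -> int:
    """Number of free elements in a symmetric n x n matrix (diagonal included)."""
    return n * (n + 1) // 2


# Village-level versus district-level analysis, using the counts given in the text.
villages = free_elements(45_653)
districts = free_elements(107)

print(f"villages:  {villages:,}")   # 1,042,121,031
print(f"districts: {districts:,}")  # 5,778
```

The district-level matrix is smaller by roughly five orders of magnitude, which is the technical argument for the choice of spatial unit.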

2.1 Why a spatial economic analysis?

A fundamental concept in geography is that proximate locations often share more similarities than locations far apart. This idea is commonly referred to as Tobler's first law of geography and is incorporated in spatial modeling, which typically aims to look for associations instead of trying to develop explanations (Haining 2003, p. 358). Classical statistical inference, such as conventional regression, is inadequate for an in-depth spatial analysis because it fails to take into account spatial effects and the problems of spatial data analysis, such as spatial autocorrelation, identification of spatial outliers, edge effects, the modifiable areal unit problem and the lack of spatial independence [1]. These reasons necessitate the use of spatial exploratory and explanatory methods that explicitly take spatial effects into account.

2.2 Spatial effects

Spatial effects can be divided into two main kinds: spatial dependence and spatial heterogeneity. Spatial heterogeneity refers to instability in the behavior of the relationships under study. This implies that parameters and functional relationships vary across space and are not homogeneous throughout the data set. Spatial dependence, in contrast, refers to the lack of independence between observations that is often present in cross-sectional data sets. It can be considered a functional relationship between what happens at one point in space and what happens at another. If the Euclidean sense of space is extended to include general space (consisting of policy space, inter-personal distance, social networks, etc.), spatial dependence becomes a phenomenon with a wide range of applications in the social sciences.

Two factors can lead to spatial dependence. First, measurement errors may exist for observations in contiguous spatial units. Second, inappropriate functional frameworks may be used in the presence of different spatial processes (such as diffusion, exchange and transfer, interaction and dispersal), as a result of which what happens at one location is partly determined by what happens elsewhere in the system under analysis.

[1] Modifiable Areal Unit Problem: when attributes of a spatially homogeneous phenomenon (e.g. people) are aggregated into districts, the resulting values (e.g. totals, rates and ratios) are influenced by the choice of the district boundaries just as much as by the underlying spatial patterns of the phenomenon.

Assuming stationarity or structural stability over space is highly unrealistic when the variable under study is observed at different locations. Like the temporal autocorrelation often found in time series data, spatial autocorrelation violates the standard assumption of independence among observations. Hence standard regression analysis that does not account for spatial dependence can yield biased estimators and unreliable significance tests. As a remedy, spatial autocorrelation statistics have been devised to detect, measure and analyze the degree of dependence among observations.

2.3 Quantifying spatial effects

Spatial dependence raises the need to determine which spatial units in a system are related, how spatial dependence occurs between them, and what kind of influence they exercise on each other. Formally, these questions are answered by using the concept of neighborhood, expressed in terms of distance or contiguity. Boundaries of spatial units can be used to determine contiguity or adjacency, which can be of several orders (e.g. first order contiguity or more). Contiguity can be defined as linear contiguity (i.e. counties that share a border with the county of interest and lie immediately to its left or right), rook contiguity (i.e. counties that share a common side with the county of interest), bishop contiguity (i.e. counties that share a vertex with the county of interest), double rook contiguity (i.e. two counties to the north, south, east and west of the county of interest), and queen contiguity (i.e. counties that share a common side or a vertex with the county of interest) (LeSage 1999). Other common conceptualizations of spatial relationships include inverse distance, travel time, fixed distance bands, and k-nearest neighbors.
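The rook, bishop and queen definitions are easiest to see on a regular grid of cells. The following sketch builds the corresponding binary contiguity matrices for such a grid. It is an illustration under simplifying assumptions only: real districts are irregular polygons, and the function `grid_contiguity` is a hypothetical helper, not part of any library or of the study's code.

```python
import numpy as np


def grid_contiguity(rows: int, cols: int, rule: str = "queen") -> np.ndarray:
    """Binary contiguity matrix for a rows x cols grid of square cells.

    rule = "rook":   neighbors share a side
    rule = "bishop": neighbors share only a vertex
    rule = "queen":  neighbors share a side or a vertex
    """
    n = rows * cols
    W = np.zeros((n, n), dtype=int)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is never its own neighbor
                    shares_side = dr == 0 or dc == 0
                    if rule == "rook" and not shares_side:
                        continue
                    if rule == "bishop" and shares_side:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        W[i, rr * cols + cc] = 1
    return W


W_rook = grid_contiguity(3, 3, "rook")
W_bishop = grid_contiguity(3, 3, "bishop")
W_queen = grid_contiguity(3, 3, "queen")

# Number of neighbors of the centre cell (index 4) under each rule.
print(W_rook[4].sum(), W_bishop[4].sum(), W_queen[4].sum())  # 4 4 8
```

On a grid, the queen matrix is simply the sum of the rook and bishop matrices, which mirrors the verbal definition of queen contiguity as "a common side or a vertex".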
The most popular way of representing a type of contiguity or adjacency is binary contiguity (Cliff and Ord 1973, 1981) expressed in a spatial weight matrix (W). In spatial econometrics, W describes the composition of the spatial relationships among different points in space. The spatial weight matrix enables us to relate a variable at one point in space to the observations of that variable in other spatial units of the system. It is used as a variable when modeling the spatial effects contained in the data. Generally it is based on either distance or contiguity between spatial units. Consider a spatial weight matrix for three units:

$$W = \begin{pmatrix} 0 & w_{12} & w_{13} \\ w_{21} & 0 & w_{23} \\ w_{31} & w_{32} & 0 \end{pmatrix}$$

where $w_{ij}$ may be the inverse distance between two units i and j, or it may be 1 if they share a border or a vertex and 0 otherwise. The W matrix displays the properties of a spatial system and can be used to gauge the prominence of a spatial unit within the system. The usual expectation is that values at adjacent locations will be similar.

2.4 The spatial weight matrix for Pakistan

The choice of the W matrix and its conceptualization has to be carefully based on theoretical reasoning and on the historical factors underlying the phenomenon under study. For example, for cluster detection and influence analysis inverse distance is the most appropriate measure, but when assessing the geographic distribution of a region's commuters, travel time or cost would be a better choice.

This paper employs two W matrices for Pakistan. The first is a simple binary contiguity W matrix (BC) based on the concept of queen contiguity, i.e. if a district i shares a border or a vertex with another district j, the two are considered neighbors and $w_{ij}$ takes the value 1, and 0 otherwise. This matrix is also zero along its diagonal, implying that a district cannot be a neighbor to itself. Hence it is a symmetric binary matrix of dimension 81 × 81 (81 being the total number of districts analyzed) [2]. This matrix captures the influence of geographically adjacent neighbors on each other. A simple binary contiguity matrix is a standard starting point, and its influence is often compared with that of other types of W matrices.

The second W matrix developed for Pakistan is based on the inverse average road distance between the centroid of a district and the centroid of its nearest district(s) with a large city (ID matrix). If the nearest district is not the neighbor of a district with a large city, then the value of $w_{ij}$ is computed from the distance between the centroid of that district and the centroid of the district containing the capital city of the province in which that district is located. This matrix is a symmetric non-binary matrix, again of dimension 81 × 81. Of the 81 districts studied, only 14 fall into the category of a district with a large city according to the coding scheme of the PSLM survey. These are Islamabad as the federal capital; Lahore, Faisalabad, Rawalpindi, Multan, Gujranwala, Sargodha, Sialkot, and Bahawalpur in Punjab; Karachi, Hyderabad and Sukkur in Sindh; Peshawar in the North West Frontier Province; and Quetta in Balochistan.

Road distance was selected instead of train distance, which is normally used in studies of urban areas, because in Pakistan the road network is much better developed than the railway network. As a result, Pakistan's transport system depends primarily on road transport, which accounts for 90 percent of national passenger traffic and 96 percent of freight movement every year (The Economic Survey of Pakistan 2007-08, p. 225).

Inverse distance matrices have more explanatory power as partitions of geographic space, especially when the phenomenon under study involves the exchange or transfer of information and knowledge (in our case wages and education). They establish a decay function that weighs the effect of events in geographically proximate units more heavily than those in geographically distant units. Since a country is not a flat piece of land, Euclidean distance, or distance "as the crow flies", makes little economic sense when we are trying to investigate the effect of distance from districts with a large city on regional wages.

[2] The total number of districts in Pakistan is 104, but the PSLM covers 81 districts. For Balochistan, the geographical unit of analysis is the division, since the data are available only at the division level. Divisions were one level above districts and one level below provinces. As a geographical unit, they were abolished in 2001. In subsequent years, however, the PSLM is also district representative for Balochistan, rather than division representative only.
The density of a country's infrastructure network is an important influence. For this reason we have used the distance calculation service of Google Maps. It provides not only the Euclidean (straight line) distance between districts based on their longitude and latitude, but also the maximum and minimum road distance from one district to another, taking into account the existing road network of Pakistan. The weight used in this paper is the inverse of the average of the maximum and the minimum road distance between the centroids of two districts.
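The construction of an inverse distance weight from the average of the minimum and maximum road distances, together with the row standardization applied to both matrices in this paper, can be sketched as follows. The three-district distance tables are hypothetical numbers chosen for illustration, and the helper names `inverse_distance_weights` and `row_standardize` are assumptions rather than the code used in the study.

```python
import numpy as np

# Hypothetical minimum and maximum road distances (km) between the centroids
# of three districts; these are NOT values from the PSLM study.
d_min = np.array([[0.0, 100.0, 300.0],
                  [100.0, 0.0, 250.0],
                  [300.0, 250.0, 0.0]])
d_max = np.array([[0.0, 140.0, 340.0],
                  [140.0, 0.0, 270.0],
                  [340.0, 270.0, 0.0]])


def inverse_distance_weights(d_min: np.ndarray, d_max: np.ndarray) -> np.ndarray:
    """w_ij = 1 / (average of min and max road distance); diagonal stays 0."""
    d_avg = (d_min + d_max) / 2.0
    W = np.zeros_like(d_avg)
    off_diagonal = ~np.eye(len(d_avg), dtype=bool)
    W[off_diagonal] = 1.0 / d_avg[off_diagonal]
    return W


def row_standardize(W: np.ndarray) -> np.ndarray:
    """Divide each weight by its row sum; rows without neighbors stay zero."""
    W = W.astype(float)
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)


W_id = inverse_distance_weights(d_min, d_max)
W_std = row_standardize(W_id)
print(W_std.round(3))
```

Note that the raw inverse distance matrix is symmetric, whereas its row-standardized version generally is not, because each row is scaled by a different sum.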

Finally, both matrices are row-standardized, i.e. each weight is divided by its row sum. Row standardization is recommended whenever the distribution of the variables under consideration is potentially biased due to errors in the sampling design or due to an imposed aggregation scheme.

3. Exploratory spatial data analysis

This paper applies exploratory spatial data analysis techniques to district-wise data on wages, employment and education. Before spatial econometric models are estimated, the presence of spatial dependence has to be detected. This is done by using exploratory spatial data analysis (ESDA). The technique employed in this study is Moran's I statistic. The global Moran's I measures the spatial association of data collected from points in space, capturing similarities and dissimilarities in observations across the whole system (Anselin, 1995). In the presence of uneven spatial clustering, however, Local Indicators of Spatial Association are used. They measure the contribution of individual spatial units to the global Moran's I statistic (Anselin, 1995). The study also generates Moran scatterplots to show the spatial distribution of district wage and education levels across Pakistan.

3.1 Measures of spatial autocorrelation

i) Global spatial autocorrelation

Spatial autocorrelation occurs when the spatial distribution of the variable of interest exhibits a systematic pattern (Cliff and Ord 1981). Positive (negative) spatial autocorrelation occurs when a geographical area tends to be surrounded by neighbors with similar (dissimilar) values of the variable of interest. As previously mentioned, this paper uses Moran's I statistic to detect the global spatial autocorrelation present in the data [3].

[3] Other well known measures of spatial autocorrelation include Geary's c statistic and Getis and Ord's G statistic; see Anselin (1995a, p. 22-23).

Moran's I is the most widely used measure for detecting and explaining spatial clustering, not only because of its interpretative simplicity but also because it can be decomposed into a local statistic and provides graphical evidence of the presence or absence of spatial clustering. It is defined as

$$I = \frac{n}{S_0} \, \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (1)$$

where $y_i$ is the observation of variable y in location i, $\bar{y}$ is the mean of the observations across all locations, n is the total number of geographical units or locations, and $w_{ij}$ is one of the elements of the weights matrix, indicating the spatial relationship between location i and location j. $S_0$ is a scaling factor equal to the sum of all the elements of the W matrix:

$$S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \qquad (2)$$

$S_0$ is equal to n for row-standardized weights matrices (which is the preferred way to implement the Moran's I statistic), since each row then adds up to 1. The first term in equation (1) then becomes equal to 1 and Moran's I simplifies to a ratio of spatial cross products to variance. Under the null hypothesis of no spatial autocorrelation, the theoretical mean of Moran's I is given by

$$E(I) = -\frac{1}{n - 1} \qquad (3)$$

The expected value is thus negative and tends to zero as the sample size increases, since it is a function of n (the sample size) only. Moran's I ranges from -1 (perfect spatial dispersion) to +1 (perfect spatial correlation), while a value of 0 indicates a random spatial pattern. If Moran's I is larger than its expected value, the distribution of y displays positive spatial autocorrelation, i.e. the value of y at each location i tends to be similar to the values of y at spatially contiguous locations. If I is smaller than its expected value, the distribution of y is characterized by negative spatial autocorrelation, implying that the value of y at each location i tends to differ from the values of y at spatially contiguous locations. Inference is based on z-values computed as

$$z = \frac{I - E(I)}{\mathrm{sd}(I)} \qquad (4)$$

i.e. the expected value of I is subtracted from I and the difference is divided by the standard deviation of I. The theoretical variance of Moran's I depends on the assumptions made about the data and the nature of spatial autocorrelation. This paper presents the results under the randomization assumption, i.e. each observed value could equally have occurred at any location [4]. Under this assumption the z-value asymptotically follows a normal distribution, so that its significance can be evaluated using a standard normal table (Anselin 1992a). A positive (negative) and significant z-value for Moran's I, accompanied by a low p-value, indicates positive (negative) spatial autocorrelation [5].

Finally, the results of Moran's I depend on the specification of the weights matrix. Interpretations change depending on whether the matrix is based on physical distance or economic distance. However, a pattern of decreasing spatial autocorrelation with increasing orders of contiguity (distance decay) is commonly observed in most spatial autoregressive processes regardless of the matrix specification (Oort 2004, p. 109).

[4] The other two assumptions are the assumption of a normal distribution of the variables in question (normality assumption) and a randomization approach using a reference distribution for I that is generated empirically (permutation assumption). For details and formulas of the randomization assumption, see Sokal et al. (1998).

[5] Negative spatial autocorrelation reflects a lack of clustering that goes beyond even that of a random pattern. The checkerboard pattern is an example of perfect negative spatial autocorrelation.

ii) Local spatial autocorrelation

Since Moran's I as a global statistic is based on simultaneous measurements from many locations, it provides only broad measurements of spatial association, ignores location-specific details, and cannot identify which local spatial clusters (or hot spots) contribute most to the global statistic. As a remedy, local statistics commonly referred to as Local Indicators of Spatial Association (LISA) have been developed in exploratory spatial data analysis. They are used along with graphical visualization of the spatial clustering in a Moran scatterplot.

The Moran scatterplot is derived from the global Moran's I statistic. Recall that, with a row-standardized matrix and $z_i = y_i - \bar{y}$, the Moran's I formula can be written as

$$I = \frac{\sum_{i} z_i \sum_{j} w_{ij} z_j}{\sum_{i} z_i^2} \qquad (5)$$

This is similar to the formula for the coefficient b of a linear regression, with the exception of $\sum_{j} w_{ij} z_j$, which is the so-called spatial lag of location i.

Therefore I is formally equivalent to the regression coefficient in a regression of a location's spatial lag (Wz) on the location itself. The Moran scatterplot uses this interpretation, enabling us to visualize Moran's I in a scatterplot of Wz versus z, where z is the variable expressed in deviations from its mean. Moran's I is then the slope of the regression line in the scatterplot. A lack of fit in this scatterplot indicates local spatial associations (local pockets/non-stationarity). The scatterplot is centered on 0 and is divided into four quadrants that represent different types of spatial association. Graphical evidence alone, however, does not give the significance levels of the spatial clustering, so we complement the Moran scatterplot with a local statistic.

Local statistics or indicators can reveal the locations that display significant deviation from spatial randomness in the presence of global spatial autocorrelation (hot spots), as well as the significant outliers in a diagnostic analysis for local stability. Anselin (1995b) defines a LISA as a statistic that satisfies the following two requirements:

1) The LISA for each observation gives an indication of the spatial clustering of similar values around that observation;
2) The sum of the LISAs for all observations is proportional to a global indicator of spatial association.

We use the local Moran's I statistic, which satisfies the above requirements. Each local Moran's I for a particular location indicates the extent of spatial clustering around it, and with a row-standardized weights matrix the sum of all local Moran's I values is equal to the global Moran's I. The local Moran's I can be defined as:

$$I_i = \frac{z_i \sum_{j} w_{ij} z_j}{\sum_{k} z_k^2} \qquad (6)$$

The null hypothesis tested in this case is that there is no association between the value observed at location i and the values observed at its neighbors, i.e. the values of $I_i$ are zero. Positive (negative) local spatial autocorrelation exists when we obtain positive (negative) values for $I_i$ and its z-scores, which indicate the clustering of similar (dissimilar) values of y around location i.
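The decomposition of the global statistic into local contributions can be checked numerically. The sketch below assumes a row-standardized weights matrix and a local statistic scaled by the sum of squared deviations, so that the local values add up to the global Moran's I as stated in the text. The four-unit chain and its values are invented for illustration, and `local_moran` is a hypothetical helper name.

```python
import numpy as np


def local_moran(y: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Local Moran's I for each unit, scaled so the values sum to the global I.

    Assumes W is row-standardized (each row sums to 1).
    """
    z = y - y.mean()   # deviations from the mean
    lag = W @ z        # spatial lag: weighted average of neighbors' deviations
    return z * lag / (z @ z)


# Four units on a line (0-1-2-3), row-standardized first-order contiguity.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 2.0, 8.0, 9.0])  # low values next to low, high next to high

I_local = local_moran(y, W)
z = y - y.mean()
I_global = (z @ W @ z) / (z @ z)    # global Moran's I for row-standardized W

print(I_local)                  # [0.24 0.03 0.03 0.24]
print(I_local.sum(), I_global)  # both approximately 0.54
```

All four local values are positive here because each unit sits next to neighbors on the same side of the mean; the two end units contribute most to the global statistic.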

4. Global spatial autocorrelation: District Incomes

This section briefly summarizes the results of the estimation exercises carried out to detect spatial autocorrelation in district-wise wage rates. This is the starting point of the analysis before we proceed, in subsequent chapters, to a spatial econometric analysis of the determinants of wage variation across Pakistani districts [6]. The variable has been obtained from the micro data set by estimating the average log wage for each district and then comparing these averages across districts. As a robustness measure, we have estimated the global and local measures of spatial autocorrelation using two W matrices instead of one. Table 1 shows the results of Moran's test for average log district wage rates using the two weights matrices. In both cases, the null hypothesis of no spatial dependence is rejected at the 1% significance level.

Table 1: Global autocorrelation results for l_wage

Weight matrix    I (binary contiguity)    II (inverse distance)
Moran's I        0.688                    0.495
E(I)             -0.013                   -0.013
Sd(I)            0.084                    0.109
Z                8.391                    4.655
p-value          0.000                    0.000

[6] In this preliminary version of the chapter, income is treated as synonymous with the district monthly wages of all salaried persons interviewed in the survey, simply to check for the presence of spatial autocorrelation. Later, a more comprehensive definition of income will be used, encompassing wages, income in kind, transfers and pensions.
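The quantities reported in Table 1 can be illustrated with a short computation of the global Moran's I and a z-value. Note the substitution: the paper derives sd(I) analytically under the randomization assumption, whereas this sketch uses the empirical permutation approach mentioned in the footnote on alternative assumptions, because it needs no variance formula. The ten-unit chain, its values, and the function names are assumptions made for illustration only.

```python
import numpy as np


def morans_i(y: np.ndarray, W: np.ndarray) -> float:
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)


def moran_permutation_test(y, W, n_perm=999, seed=0):
    """Permutation inference: reshuffle y over locations to get a reference
    distribution for I, then compute a z-value and a one-sided pseudo p-value."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)
    sims = np.array([morans_i(rng.permutation(y), W) for _ in range(n_perm)])
    z_score = (observed - sims.mean()) / sims.std(ddof=1)
    p_value = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    return observed, z_score, p_value


# Ten units on a line with binary first-order contiguity and a smooth trend in y.
n = 10
W_chain = np.zeros((n, n))
for i in range(n - 1):
    W_chain[i, i + 1] = W_chain[i + 1, i] = 1.0
y_trend = np.arange(n, dtype=float)

I_obs, z_obs, p_obs = moran_permutation_test(y_trend, W_chain)
expected_I = -1.0 / (n - 1)   # theoretical mean under the null hypothesis
print(round(I_obs, 3), round(expected_I, 3))  # 0.778 -0.111
```

Because the values rise steadily along the chain, the observed I lies far above its expected value, and almost no random reshuffling of the values produces a statistic as large.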

4.1 Local spatial autocorrelation: District Incomes

The Moran scatterplot (Figures 1 and 2) provides a more disaggregated view of the nature of the global autocorrelation. It provides information not only on the presence of clusters in the data but also on the outliers contained in it. The scatterplot is divided into four quadrants, each of which represents a different type of spatial association:

The upper right quadrant represents spatial clustering of a district with a high average wage rate around neighbors that also have high average wages. This quadrant is also called the High-High zone (HH), since z and Wz both have high values. In general these are locations that have a positive value for the local Moran's I.

The upper left quadrant represents a district with a low average wage rate surrounded by neighbors that have high average wages. This quadrant is also called the Low-High zone (LH), since z is low while Wz is high, indicating a low outlier among neighbors with high values. In general these are locations that have a negative value for the local Moran's I.

The lower left quadrant represents spatial clustering of a district with a low average wage rate around neighbors that also have low average wages. This quadrant is also called the Low-Low zone (LL), since z and Wz both have low values. In general these are locations that have a positive value for the local Moran's I.

The lower right quadrant represents a district with a high average wage rate surrounded by neighbors that have low average wages. This quadrant is also called the High-Low zone (HL), since z is high while Wz is low, indicating a high outlier among neighbors with low values. In general these are locations that have a negative value for the local Moran's I.
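The four quadrants can be assigned mechanically from the signs of z and Wz, and the slope of the regression of Wz on z recovers the global Moran's I when the weights are row-standardized. The sketch below uses an invented four-unit chain with one low-wage outlier; the data and the helper name `moran_quadrants` are assumptions for illustration, not results from the study.

```python
import numpy as np


def moran_quadrants(y: np.ndarray, W: np.ndarray):
    """Return standardized values z, spatial lags Wz, and quadrant labels.

    HH/LL mark clusters of similar values; LH/HL mark spatial outliers.
    Assumes W is row-standardized.
    """
    z = (y - y.mean()) / y.std()
    lag = W @ z
    labels = np.where(
        z >= 0,
        np.where(lag >= 0, "HH", "HL"),
        np.where(lag >= 0, "LH", "LL"),
    )
    return z, lag, labels


# Four units on a line (0-1-2-3), row-standardized first-order contiguity.
W_line = np.array([[0.0, 1.0, 0.0, 0.0],
                   [0.5, 0.0, 0.5, 0.0],
                   [0.0, 0.5, 0.0, 0.5],
                   [0.0, 0.0, 1.0, 0.0]])
wages = np.array([9.0, 1.0, 8.0, 9.0])  # unit 1 is a low outlier

z_vals, lag_vals, quadrants = moran_quadrants(wages, W_line)
print(list(quadrants))  # ['HL', 'LH', 'HL', 'HH']

# Slope of the Moran scatterplot regression line (Wz on z) equals Moran's I.
slope = np.polyfit(z_vals, lag_vals, 1)[0]
moran_i = (z_vals @ lag_vals) / (z_vals @ z_vals)
print(np.isclose(slope, moran_i))  # True
```

Unit 1 lands in the Low-High quadrant, while its two immediate neighbors fall into High-Low because their lags are dragged down by it; only the last unit, whose single neighbor is also high, is classified High-High.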

Figure 1: Spatial autocorrelation of district incomes using the binary contiguity matrix
[Moran scatterplot of l_wage (Moran's I = 0.688); horizontal axis: z, vertical axis: Wz; points labeled by district.]

Figure 1 uses the binary contiguity W matrix to produce a positive global Moran's I (z-score = 8.391), represented by the slope of the black line. On a local level it is confirmed by the shape and the direction of the scatterplot. There are relatively few extreme outliers or atypical locations that deviate from the global pattern of positive spatial autocorrelation.

Figure 2 (below) uses the inverse distance matrix to produce the scatterplot. Compared to the previous scatterplot, it has a lower value for the global Moran's I (z-score = 4.655) since the clusters here are not based on geographic contiguity but on geographic proximity. Hence we conclude that for the year 2005-06, there exists statistically significant local spatial autocorrelation for district wages.
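The difference between the two neighborhood definitions can be made concrete with a small sketch. It assumes row-standardized weights and straight-line distances between district centroids; neither choice is stated in this section, and the three-district layout, the coordinates and the function names are hypothetical.

```python
import numpy as np

def row_standardize(w):
    """Scale each row of a weights matrix so that it sums to one."""
    w = np.asarray(w, dtype=float)
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows without neighbours are left as zeros rather than divided by zero.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def inverse_distance_weights(coords):
    """Raw inverse-distance weights 1/d_ij between all pairs of points."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = np.zeros_like(dist)
    positive = dist > 0           # skips the diagonal (and coincident points)
    w[positive] = 1.0 / dist[positive]
    return w

# Hypothetical example: three districts A, B, C in a row.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]

w_bc = row_standardize(adjacency)                             # contiguity
w_id = row_standardize(inverse_distance_weights(centroids))   # proximity
```

In the contiguity matrix, A and C are not neighbors at all, whereas under inverse distance every district receives some weight that declines with distance. This is one reason the two matrices can produce different values of Moran's I for the same variable.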

Figure 2: Spatial autocorrelation of district incomes using the inverse distance matrix
[Moran scatterplot of l_wage (Moran's I = 0.495); horizontal axis: z, vertical axis: Wz; points labeled by district.]

While the Moran scatterplot provides information mainly on the clusters, we use the detailed estimates of the local Moran's I provided in Appendix A for outlier detection. Lahore, Karachi and Peshawar, the provincial capitals of Punjab, Sindh and NWFP respectively, all emerge with low or negative but statistically insignificant z-scores. This indicates that wage rates in their neighboring districts are lower than theirs, but there is no indication of spillovers in terms of labor remuneration, and we cannot reject the null hypothesis of no spatial association between them and their neighboring districts at the 95% confidence level.

The LISAs for the log of average district wages and the Moran scatterplot produce three main statistically significant clusters when we use the inverse distance matrix. All three of them belong to Punjab. While the cluster of Rawalpindi, Islamabad and Chakwal falls into the High-High zone, the cluster of Vehari, Bahawalnagar, Bahawalpur, Muzaffargarh and R Y Khan falls into the Low-Low zone, i.e. comparatively lower wage rates in and around these districts in the province of Punjab. These clusters are evidence of spillovers between these districts.
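For readers unfamiliar with the LISA statistic used here, the following sketch computes the local Moran's I for each unit of a toy data set. It assumes the common definition with a row-standardized W and the variance computed over all n units; the exact scaling used by the software behind Appendix A is not stated in the text, and the names and data are hypothetical.

```python
import numpy as np

def local_morans_i(y, w):
    """Local Moran's I_i = (dev_i / m2) * sum_j w_ij * dev_j for every unit i."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    dev = y - y.mean()                # deviations from the mean
    m2 = (dev @ dev) / y.size         # variance over all n units
    return dev * (w @ dev) / m2

# Hypothetical toy example: four units on a line, row-standardized contiguity.
w_toy = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 1.0, 0.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

li_toy = local_morans_i(y_toy, w_toy)
global_from_local = li_toy.mean()     # equals the global I when rows of W sum to one
```

With a row-standardized matrix the average of the local statistics equals the global Moran's I, which is why the local values can be read as each district's contribution to the global pattern.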

4.2 Global spatial autocorrelation: District Education Attainment

We have also carried out an analysis of the average district-wise education attainment level, measured as the average number of schooling years completed in a district. It is expected that neighbors of districts with high education attainment should also have high educational awareness and hence similar, if not equal, attainment levels. We again use the global and local versions of Moran's I, along with a Moran scatterplot, for the two weights matrices.

Table 2: Global autocorrelation results for education attainment

Weight matrix    I (binary contiguity)    II (inverse distance)
Moran's I        0.180                    -0.024
E(I)             -0.013                   -0.013
Sd(I)            0.084                    0.109
Z                2.291                    -0.103
p-value          0.022                    0.918

The analysis shows that the average knowledge spillover is weak in most districts that are neighbors to districts with high education attainment levels. This finding of virtually no knowledge spillovers becomes even more pronounced when neighbors are defined in terms of inverse distances rather than contiguous units. Hence the data show more outliers than clusters. Therefore, contrary to the economic prediction of spillovers, having Karachi as a neighbor may translate into higher education incentives for its neighboring districts but does not actually translate into higher education levels.
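The p-values in Table 2 can be checked against the reported z-values with a short calculation. The sketch assumes that inference is based on the normal approximation with a two-tailed test, which is consistent with the reported figures but is an assumption here; the function name is hypothetical.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

p_bc = two_tailed_p(2.291)    # binary contiguity matrix, Table 2
p_id = two_tailed_p(-0.103)   # inverse distance matrix, Table 2
```

Rounded to three decimals these give 0.022 and 0.918, matching the table: the null of no spatial dependence is rejected at the 5% level under contiguity but not under inverse distance.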

Figure 3: Spatial autocorrelation of district education levels using the binary contiguity matrix
[Moran scatterplot of yrsed main (Moran's I = 0.180); horizontal axis: z, vertical axis: Wz; points labeled by district.]

The spatial pattern of autocorrelation is quite diffuse when we use the binary contiguity (BC) matrix for the analysis. The positive Moran's I value indicates that neighboring districts share similar values of average district education attainment, but overall autocorrelation is still weak. Karachi and Thatta emerge as the most significant outliers when we analyze the local Moran's I values using the BC and the inverse distance (ID) matrices. However, while Karachi falls into the High-Low zone, Thatta falls into the Low-High zone.[7] Similarly, under both neighborhood structures Islamabad, Rawalpindi, Abbottabad, Chakwal and Jhelum emerge as a statistically significant cluster of districts with high average education attainment levels.

The global spatial autocorrelation using the ID matrix is negative but close to 0 and statistically insignificant. This indicates that we cannot reject the null hypothesis of no spatial association, i.e. that average education levels are randomly distributed across districts.[8]

[7] The results for districts with significant spatial autocorrelation are reported in Appendix A, parts (c) and (d).
[8] The Moran scatterplot for average district education attainment using the ID matrix is provided in Appendix A, part (e).

5. Conclusion

This chapter presents the initial results of applying ESDA techniques to district-wise income and education data. The two main preliminary findings that emerge from this cross-sectional analysis are that, although the distribution of district-wise income exhibits a significant tendency for income to cluster in space (i.e. the presence of autocorrelation), the distribution of education is spatially random.

The chapter, however, remains incomplete without extending the analysis over time in order to examine and report the varying nature of spatial autocorrelation in district incomes and education. For this, the immediate next step is to append the PSLM data sets from 2000 to 2009. If the absence of knowledge spillovers persists over the years, we will carry out a political economy analysis of the reasons for regional disparities in education.

Moreover, the detection of significant spatial autocorrelation in income levels across districts calls for a spatial econometric analysis that takes this fact into account. The presence of clusters and outliers supports the use of the spatial lag model to capture the spillover of income between districts. However, missing data on district incomes or omitted variables could also necessitate the use of a spatial error model (which reflects spatial autocorrelation in measurement errors) in analyzing the effect of inequality on district income levels. The next chapter will consider these issues in detail.
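For reference, the two specifications mentioned above are usually written as follows. This is the standard textbook form rather than the specification estimated in the next chapter, and the notation is an assumption: y is the vector of district incomes, X the matrix of explanatory variables, W the spatial weights matrix, rho and lambda the spatial autoregressive parameters, and epsilon an i.i.d. error term.

```latex
% Spatial lag model: income in a district depends on the weighted
% average income of its neighbours (spillovers in the outcome itself).
\begin{equation}
  y = \rho W y + X \beta + \varepsilon
\end{equation}

% Spatial error model: the spatial dependence sits in the disturbance,
% e.g. through omitted variables or measurement error that are
% correlated across neighbouring districts.
\begin{equation}
  y = X \beta + u, \qquad u = \lambda W u + \varepsilon
\end{equation}
```

When rho = 0 and lambda = 0, both models reduce to the ordinary linear regression y = X beta + epsilon, so tests on these parameters indicate whether a spatial specification is needed.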

Bibliography

Anselin, Luc (1988b). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Press.

Anselin, Luc (1995a). SpaceStat: A Software Program for the Analysis of Spatial Data (version 1.80). Morgantown: Regional Research Institute, West Virginia University.

Anselin, Luc (1995b). "Local Indicators of Spatial Association: LISA." Geographical Analysis 27: p.93-115.

Anselin, Luc (1996). "The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association," in M. Fischer, H.J. Scholten and D. Unwin (eds.), Spatial Analytical Perspectives on GIS. London: Taylor and Francis.

Anselin, Luc (2003a). "Spatial Externalities, Spatial Multipliers and Spatial Econometrics." International Regional Science Review 26: p.147-152.

Arbia, Giuseppe (2006). Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence. Berlin, Heidelberg: Springer-Verlag.

Haining, Robert (2003). Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge University Press.

Le Gallo, Julie and Ertur, Cem (2003). "An Exploratory Spatial Data Analysis of European Regional Disparities, 1980-1995," in Bernard Fingleton (ed.), European Regional Growth (Advances in Spatial Sciences). Springer.

Van Oort, Frank G. (2004). Urban Growth and Innovation: Spatially Bounded Externalities in the Netherlands. Aldershot: Ashgate.

Wooldridge, Jeffrey M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Appendix A: Measures of local spatial autocorrelation

(a) Local spatial autocorrelation using the binary contiguity weights matrix [9]

Moran's Ii (l_wage)
------------------------------------------------------------------
dist                |       Ii    E(Ii)   sd(Ii)        z p-value*
--------------------+---------------------------------------------
Lodhran             |    1.273   -0.013    0.483    2.660    0.008
Karachi             |   -0.435   -0.013    0.985   -0.429    0.668
Lahore              |   -0.227   -0.013    0.697   -0.307    0.759
Peshawar            |   -0.146   -0.013    0.692   -0.193    0.847
Okara               |    1.286   -0.013    0.429    3.024    0.002
Islamabad           |    2.384   -0.013    0.692    3.463    0.001
Bahawalnagar        |    2.014   -0.013    0.483    4.193    0.000
Rawalpindi          |    2.263   -0.013    0.483    4.709    0.000
Sahiwal             |    1.828   -0.013    0.389    4.726    0.000
D G Khan            |    2.900   -0.013    0.562    5.187    0.000
Bahawalpur          |    2.032   -0.013    0.389    5.250    0.000
Vehari              |    2.251   -0.013    0.389    5.813    0.000
Layyah              |    2.648   -0.013    0.429    6.196    0.000
R Y Khan            |    4.417   -0.013    0.562    7.888    0.000
Rajanpur            |    4.737   -0.013    0.562    8.459    0.000
Khanewal            |    3.616   -0.013    0.333   10.900    0.000
Muzaffargarh        |    4.797   -0.013    0.358   13.426    0.000
------------------------------------------------------------------
*2-tail test

[9] The local Moran statistics have been computed for each of the 81 districts and are available on request. Only the statistics for the main city districts and the statistically significant ones are reported here.

(b) Local spatial autocorrelation using the inverse distance matrix

Moran's Ii (l_wage)
------------------------------------------------------------------
dist                |       Ii    E(Ii)   sd(Ii)        z p-value*
--------------------+---------------------------------------------
Vehari              |    1.992   -0.013    0.985    2.035    0.042
Karachi             |   -0.435   -0.013    0.985   -0.429    0.668
Lahore              |   -0.227   -0.013    0.697   -0.307    0.759
Peshawar            |    0.306   -0.013    0.265    1.202    0.229
Chakwal             |    2.173   -0.013    0.985    2.220    0.026
Bahawalnagar        |    2.212   -0.013    0.985    2.259    0.024
Muzaffargarh        |    1.795   -0.012    0.772    2.342    0.019
R Y Khan            |    2.657   -0.013    0.985    2.711    0.007
Bahawalpur          |    1.633   -0.012    0.568    2.897    0.004
Rajanpur            |    2.958   -0.013    0.985    3.017    0.003
Islamabad           |    3.015   -0.013    0.690    4.389    0.000
Rawalpindi          |    3.140   -0.013    0.684    4.611    0.000
------------------------------------------------------------------
*2-tail test

(c) Local spatial autocorrelation of district education using the inverse distance matrix

Moran's Ii (yrsed main)
------------------------------------------------------------------
dist                |       Ii    E(Ii)   sd(Ii)        z p-value*
--------------------+---------------------------------------------
Karachi             |   -3.578   -0.013    0.988   -3.608    0.000
Thatta              |   -2.242   -0.012    0.744   -2.996    0.003
Jhelum              |    1.922   -0.012    0.713    2.714    0.007
Abbottabad          |    3.082   -0.013    0.988    3.132    0.002
Chakwal             |    3.434   -0.013    0.988    3.487    0.000
Sialkot             |    1.794   -0.013    0.506    3.572    0.000
Haripur             |    5.128   -0.013    0.988    5.201    0.000
Rawalpindi          |    4.844   -0.013    0.686    7.078    0.000
Islamabad           |    5.451   -0.013    0.692    7.891    0.000
------------------------------------------------------------------

(d) Local spatial autocorrelation of district education using the binary contiguity matrix

Moran's Ii (yrsed main)
------------------------------------------------------------------
dist                |       Ii    E(Ii)   sd(Ii)        z p-value*
--------------------+---------------------------------------------
Karachi             |   -3.578   -0.013    0.988   -3.608    0.000
Thatta              |   -1.292   -0.013    0.563   -2.271    0.023
Jhelum              |    1.014   -0.013    0.391    2.627    0.009
Chakwal             |    1.144   -0.013    0.431    2.684    0.007
Rajanpur            |    1.784   -0.013    0.563    3.188    0.001
Sialkot             |    1.583   -0.013    0.485    3.290    0.001
Abbottabad          |    2.016   -0.013    0.563    3.600    0.000
Islamabad           |    4.568   -0.013    0.694    6.596    0.000
Rawalpindi          |    3.260   -0.013    0.485    6.749    0.000
------------------------------------------------------------------
*2-tail test

(e) Spatial autocorrelation of district education levels using the ID matrix
[Moran scatterplot of yrsed main (Moran's I = -0.024); horizontal axis: z, vertical axis: Wz; points labeled by district.]

(f) District Map of Pakistan