USE OF GEOSPATIAL AND WEB DATA FOR OECD STATISTICS
CCSA SPECIAL SESSION ON SHOWCASING BIG DATA
1 OCTOBER 2015
Paul Schreyer
Deputy-Director, Statistics Directorate, OECD
OECD APPROACH
* OECD: Facilitator of discussion on new data sources for NSOs
* OECD's own use of new data sources
* From Big Data to Smart Data
  * Not every New data source is Big
  * Not every Big data source is New
Business value analysis: why are we working on this?
* More granularity or coverage of existing data (e.g. spatial disaggregation)
* New output (e.g. measuring trust, inequalities)
* Greater timeliness: nowcasting
* Increased impact: analysis supporting the OECD mission, possibility to link areas
* Increased responsiveness: capacity to address new topics quickly and respond to what-if questions
Business process analysis: Necessary capabilities
* Capacity to identify, evaluate and access new data sources
* Command of methodology
* Proven quality and metadata frameworks
* Suitable IT infrastructures
* Established legal and ethical frameworks
* Skills and training capacity
4 types of new sources and examples of use cases
Types of sources:
* Web crawling, web scraping
* Content analysis
* Mobility studies
* Sensor and geospatial data
Examples of use cases:
* Online real estate prices (OECD GOV)
* Measuring trade restrictiveness by scraping and analysing trade laws (OECD TAD)
* African Economic Outlook (AEO): Civil tensions and political governance indicators (OECD DEV)
* Big Data Measures of Human Well-Being: Evidence from US Google Index (OECD STD)
* Measure transport reliability from geolocalisation logs (ITF)
* Air quality and land cover data (OECD GOV)
* Enriching the metropolitan database using geo-spatial data (OECD GOV)
* PIAAC log file data (OECD EDU)
EXAMPLE 1 ENVIRONMENTAL INDICATORS Using geospatial data (satellite data)
Average population exposure to air pollution (PM2.5)
Key messages that the indicator should communicate:
* Where air pollution is above recommended levels
* Where improvements in air quality have happened
* Linking air pollution to health
Source: Raster (satellite observations)
* Raster: van Donkelaar et al. (2014)
* Resolution: ~10 km2
* Years: 1998-2012

Ground-based stations
* Advantages:
  * Direct measures
  * Offer regular readings of air pollution over time
  * More pollutants are available
* Disadvantages:
  * Low coverage in developing countries
  * Uneven coverage within and across countries
  * PM2.5 concentration rarely monitored
  * Site selection, measurement techniques and reporting methods differ across regions and countries

Satellite observations
* Advantages:
  * Global coverage
  * Consistent method to compute air pollution in cities, regions and countries
  * Consistent time-series data, spanning more than a decade
* Disadvantages:
  * Modelled data
  * Satellite observations are less precise for bright surfaces (snow or desert)
  * Current data are multi-year averages; evaluation of short-term events is often unavailable
Basic methodology
1. The satellite-based values of air pollution are multiplied by the population living in the area (using a 1 km2 resolution grid)
2. The exposure to air pollution in a region is given by the sum of the population-weighted values of PM2.5 in the 1 km2 grid cells falling within the boundaries of the region
3. Finally, dividing this aggregated value by the total population in the region, we obtain the average exposure to PM2.5 concentration in a region
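The three steps above amount to a population-weighted mean of PM2.5 per region. The Python sketch below is a minimal illustration, not the OECD implementation: it assumes each 1 km2 grid cell has already been assigned a region, a population count and a PM2.5 value (the spatial overlay that produces those is not shown), and the function name `average_exposure` and the sample figures are hypothetical.

```python
from collections import defaultdict


def average_exposure(cells):
    """Population-weighted average PM2.5 per region.

    `cells` is an iterable of (region, population, pm25) tuples,
    one per grid cell. This input layout is an assumption made
    for illustration only.
    """
    weighted_sum = defaultdict(float)   # sum of population * PM2.5 (steps 1 and 2)
    total_population = defaultdict(float)

    for region, population, pm25 in cells:
        weighted_sum[region] += population * pm25
        total_population[region] += population

    # Step 3: divide by total population; regions with no residents are skipped.
    return {
        region: weighted_sum[region] / total_population[region]
        for region in weighted_sum
        if total_population[region] > 0
    }


# Made-up grid cells: (region, population, PM2.5 in micrograms per cubic metre)
cells = [
    ("A", 1000, 10.0),
    ("A", 3000, 20.0),
    ("B", 500, 8.0),
]
result = average_exposure(cells)
print(result)
```

In region A the more populated cell has the higher concentration, so the weighted average (17.5) lies above the simple mean of the two cell values (15.0). That is the point of weighting by population rather than by area.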
Levels and trends in OECD cities
* 68% of the urban population in OECD countries (376 million people) are exposed to pollution above the WHO's recommended levels.
* OECD estimates show wide variation in PM2.5 exposure levels across cities within countries, the largest in Mexico, Italy, Japan and Korea.
[Figure: PM2.5 exposure by country, showing metropolitan minimum, country average and metropolitan maximum; countries labelled with number of cities. Source: Brezzi and Sanchez-Serra (2014)]
Other example: raster sources used for land cover
* Europe: Corine land cover; resolution 25 metres; years 2000-06; classification of urban land: 44 land urban classes
* USA: National land cover dataset (NLCD); resolution 30 metres; years 2001-06; classification of urban land: 21 land cover classes
* Japan: Japan National Land Service Information data; resolution 100 metres; years 1997-2006; classification of urban land: 11 land cover classes
* World: MODIS 500 Map of Global Urban Extent; resolution 500 metres; year 2008; classification of urban land: 17 land cover classes
Feeds into the OECD Regional Well-Being Database
Links: Regional Well-Being database; Regional Well-Being web tool
EXAMPLE 2 TRADE POLICY ANALYSIS Using qualitative data from government websites
Basic idea
* Traditionally:
  * Policy questionnaires to countries
  * Manual screening of government websites
* New:
  * Machine-based monitoring of government websites
  * Automatic check for changes to, or additions of, rules and regulations
* Test case: qualitative information for the OECD's trade restrictiveness information and index
How?
Text comparison - Initial discovery
* Run a text comparison between the original document and the updated document
* Detect and flag specific paragraphs changed or updated inside long documents
Text comparison - Advanced discovery
* Changes in rules and regulations can also happen through new pages
* Use big data techniques to compare in-house structured information to the universe of laws and regulations in a given country
* Work on text definitions similar to the original ones to help identify potentially relevant documents
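The initial-discovery step (flagging changed paragraphs inside a long document) could be sketched with Python's standard `difflib` module. This is an illustration under stated assumptions rather than the tooling used in the pilot: paragraphs are assumed to be separated by blank lines, the helper names are hypothetical, and the sample legal text is made up.

```python
import difflib


def split_paragraphs(text):
    """Split on blank lines; assumes that is how paragraphs are delimited."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def changed_paragraphs(old_text, new_text):
    """Return (kind, old_paragraphs, new_paragraphs) for every non-equal span.

    `kind` is one of difflib's opcode tags: 'replace', 'insert' or 'delete'.
    """
    old = split_paragraphs(old_text)
    new = split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up example documents, for illustration only.
old_doc = (
    "Article 1. Foreign equity is capped at 49%.\n\n"
    "Article 2. Licences are issued by the ministry."
)
new_doc = (
    "Article 1. Foreign equity is capped at 25%.\n\n"
    "Article 2. Licences are issued by the ministry.\n\n"
    "Article 3. Board members must be residents."
)

changes = changed_paragraphs(old_doc, new_doc)
for kind, before, after in changes:
    print(kind, before, after)
```

Comparing whole paragraphs keeps the flagged output at the granularity an analyst would review. A word-level diff inside each flagged paragraph could be layered on top if needed.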
IT Tools
* Web-crawling: scripts to systematically scan governmental websites where regulations can be found (federal, provincial, regional, etc.)
* Web-scraping: scripts to extract the relevant information from documents, possibly based on articles and paragraphs (text analysis)
* Document conversion: most laws and regulations are in PDF, and some in other formats; they need to be converted to text documents before text analysis can run
* Text comparison: tools and dictionaries to compare the text of updated documents with the original text and to calculate similarity coefficients with other documents, in a variety of languages, with the option to also use proximity of similar words
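One simple way to calculate a similarity coefficient between a reference text and candidate documents is cosine similarity over word counts. The slides do not say which coefficient was used, so the choice of cosine similarity, the function names and the sample texts below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word tokens; \\w matches accented letters in Python 3."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two texts' word-count vectors (0.0 to 1.0)."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(reference, candidates):
    """Return (score, name) pairs, most similar candidate first."""
    scored = [
        (cosine_similarity(reference, text), name)
        for name, text in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Made-up reference definition and candidate documents.
reference = "foreign equity restrictions in telecommunications services"
candidates = {
    "telecom_act": "restrictions on foreign equity in telecommunications services providers",
    "fisheries_notice": "seasonal quotas for coastal fisheries",
}

ranked = rank_candidates(reference, candidates)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

Word counts ignore word order and synonyms. The "proximity of similar words" option mentioned above would need a dictionary or stemming step in front of `tokens`, which this sketch leaves out.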
Web scraping / Text analysis Promising results on French legal texts (Legifrance)
Summary
* Significant potential
* Use cases and pilots provide important reality checks
* Smart data and multiple sources, not necessarily big data
* Initiatives have sprung up in many parts of the OECD
* They need to be accompanied by the overall strategy being developed at the OECD
Thank you!