Big Data in Price Statistics - Scanner Data and Web-scraping
Ingolf Boettcher, Vienna, 22 October 2015
www.statistik.at - Wir bewegen Informationen ("We move information")
What is scanner data (in price statistics)?
- Sales information at article level (turnover + quantity)
- Generated by point-of-sale terminals in shops
What is scanner data (in price statistics)?
Not a sample of transactions, but a census of all transactions.
[Figure: non-alcoholic beverages (= 100%) split into lemonade, colas, juice and other]
What is scanner data (in price statistics)?

SHOP ID | PERIOD | PRODUCT ID / EAN | PRODUCT Group | PRODUCT Description | TURNOVER | SALES (Quantity)
1234 | 2015_11 | 123455 | Juices | BrandXY Orange, mild, 500 ML, bottle | 220 EUR | 110
1234 | 2015_11 | 984744 | Juices | ... | ... | ...
1234 | ... | ... | ... | ... | ... | ...

Price (unit value):
$$p_i = \frac{\text{Turnover}_i}{\text{Sales}_i} = \frac{220}{110} = 2 \text{ EUR}$$
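The unit value calculation in the table above can be sketched in a few lines. This is a minimal illustration, not Statistics Austria's implementation; the class `ScannerRecord` and its field names are hypothetical and chosen only to mirror the table columns.

```python
from dataclasses import dataclass


@dataclass
class ScannerRecord:
    """One row of scanner data (hypothetical field names mirroring the slide)."""
    shop_id: str
    period: str
    ean: str
    turnover_eur: float
    quantity: float


def unit_value(record: ScannerRecord) -> float:
    """Price as unit value: turnover divided by quantity sold."""
    if record.quantity <= 0:
        raise ValueError("quantity must be positive to compute a unit value")
    return record.turnover_eur / record.quantity


# The example row from the slide: 220 EUR turnover, 110 units sold.
record = ScannerRecord("1234", "2015_11", "123455", 220.0, 110.0)
print(unit_value(record))  # 2.0
```

A unit value is an average over all transactions of the article in the period, so it reflects promotions in proportion to the quantities actually sold.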
What is scanner data (in price statistics)?
A lot of data
Estimation (rule of thumb):

Number of households in Austria | 3,800,000
Average number of retail transactions per month per household | about 100
Total number of retail transactions per month | 380,000,000
What is scanner data (in price statistics)?
A lot of data
Actual test: scanner data for the personal care segment, provided by an Austrian drugstore chain (about 350 shops):

Number of sales records at article level per week per shop | about 3,000
Total number of sales records per month | about 4,200,000
Share of the personal care segment in the CPI basket of goods | 1.4%
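The orders of magnitude on this slide and the preceding rule-of-thumb slide can be reproduced with simple arithmetic. The sketch below assumes four weeks per month, which is not stated on the slides but is consistent with the reported total of about 4,200,000 records.

```python
# Rule of thumb: households times transactions per household per month.
households = 3_800_000
transactions_per_household_month = 100
total_transactions_month = households * transactions_per_household_month

# Drugstore test: records per shop per week, scaled to all shops and a month.
shops = 350
records_per_shop_week = 3_000
weeks_per_month = 4  # assumption, not stated on the slide
records_per_month = shops * records_per_shop_week * weeks_per_month

print(total_transactions_month)  # 380000000
print(records_per_month)         # 4200000
```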
What is web-scraped data (in price statistics)?
Website content turned into a spreadsheet
What is web-scraped data (in price statistics)?

SHOP URL | PRODUCT ID | PRODUCT Group | PRODUCT Description | PRICE
www.. | 123455 | Juices | BrandXY Orange, mild, 500 ML, bottle | 2 EUR
www.. | 984744 | Juices | ... | ...
www.. | ... | ... | ... | ...
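Turning website content into a spreadsheet can be illustrated with Python's standard library. This is only a sketch: the HTML snippet, its class names, the `data-id` attribute and the second product are invented for illustration, and real shop pages use their own markup. Statistics Austria uses point-and-click software rather than hand-written code.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical product listing; real shop pages will look different.
SAMPLE_HTML = """
<ul>
  <li class="product" data-id="123455">
    <span class="name">BrandXY Orange, mild, 500 ML, bottle</span>
    <span class="price">2.00 EUR</span>
  </li>
  <li class="product" data-id="984744">
    <span class="name">BrandXY Apple, 1000 ML, carton</span>
    <span class="price">1.50 EUR</span>
  </li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.rows.append({"product_id": attrs.get("data-id"), "name": "", "price": ""})
        elif tag == "span" and attrs.get("class") in ("name", "price") and self.rows:
            self._field = attrs["class"]

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] += data.strip()


parser = ProductParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows

# Write the scraped rows as CSV, i.e. a spreadsheet-like table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["product_id", "name", "price"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```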
Why scanner and web-scraped data for price statistics?
The Consumer Price Index (inflation)
- Representative measure of consumer prices (about 780 product groups, e.g. milk, petrol)
- Regional price collection in 20 cities + central price collection by Statistics Austria staff = about 42,000 prices
Why scanner and web-scraped data for price statistics?
Broader and more segmented assortments
- Past / Today: Whole Milk, Skimmed Milk
- Today / Future: [Figure: milk assortment split into many variants labelled Organic, Non-Organic, Whole, Skimmed, Fat-Free and Lactose-Free]
Why scanner and web-scraped data for price statistics?
More dynamic (promotional) pricing
- Past / Today: List Price, Promotion Price
- Today / Future: List Price, Promotion Price, Member Price, Individual Price for Members, time-specific prices (e.g. flights, hotels), location-specific prices, hardware-specific prices
Advantages of using scanner and web-scraped data for price statistics
Web-scraped data:
- Better coverage (time, markets, products)
Scanner data:
- Better coverage (time, space, markets, products)
- Transaction prices
- Information on current quantities
Advantages of using scanner data
Information on current quantities allows new types of price indices:
$$P_{\text{Laspeyres}} = \frac{\sum_i \text{Price}_{i,t_1} \cdot \text{Quantity}_{i,t_0}}{\sum_i \text{Price}_{i,t_0} \cdot \text{Quantity}_{i,t_0}}$$
- International standard for Consumer Price Indices
- Takes into account availability of data sources
- Tendency to overstate inflation
Advantages of using scanner data
Information on current quantities allows new types of price indices:
$$P_{\text{Paasche}} = \frac{\sum_i \text{Price}_{i,t_1} \cdot \text{Quantity}_{i,t_1}}{\sum_i \text{Price}_{i,t_0} \cdot \text{Quantity}_{i,t_1}}$$
- Not available before scanner data
- Tendency to understate inflation
$$P_{\text{Fisher}} = \sqrt{P_{\text{Laspeyres}} \cdot P_{\text{Paasche}}}$$
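The three index formulas can be sketched as follows. The prices and quantities are invented for illustration (two hypothetical articles "A" and "B"); the functions use the standard textbook definitions, with base-period quantities for Laspeyres, current-period quantities for Paasche, and the geometric mean of both for Fisher.

```python
from math import sqrt


def laspeyres(p0, p1, q0):
    """Price change weighted by base-period quantities q0."""
    return sum(p1[i] * q0[i] for i in p0) / sum(p0[i] * q0[i] for i in p0)


def paasche(p0, p1, q1):
    """Price change weighted by current-period quantities q1."""
    return sum(p1[i] * q1[i] for i in p0) / sum(p0[i] * q1[i] for i in p0)


def fisher(p0, p1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))


# Hypothetical data: article A gets more expensive and consumers buy less of it.
p0 = {"A": 2.0, "B": 1.0}   # prices in period t0
p1 = {"A": 2.2, "B": 1.0}   # prices in period t1
q0 = {"A": 100, "B": 200}   # quantities in period t0
q1 = {"A": 80, "B": 220}    # quantities in period t1

L = laspeyres(p0, p1, q0)
P = paasche(p0, p1, q1)
F = fisher(p0, p1, q0, q1)
print(round(L, 4), round(P, 4), round(F, 4))
```

In this example consumers substitute away from the article whose price rose, so the Laspeyres index lies above the Paasche index and the Fisher index falls between the two.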
Challenges scanner data
Mapping of articles to CPI classification (text mining)

SHOP ID | PERIOD | PRODUCT ID | PRODUCT Group | PRODUCT Description | TURNOVER | SALES (Quantity)
1234 | 2015_11 | 123455 | Juices | BrandXY Orange, mild, 500 ML, bottle | 220 EUR | 110

= CPI EA: Orange Juice, 500-1000ML

CPI hierarchy: CPI (ALL) > Division (FOOD & BEVERAGES) > Group (BEVERAGES) > Class (JUICE) > Sub-Class (ORANGE JUICE, APPLE JUICE, ...)

Example mapping rule:
1234 (Retailer code for JUICE) AND (>=1000 AND <=2000) AND (Orange OR BRANDX OR BRAND Y OR [FLAVOR_NAME] OR BRAND_B ...) AND NOT (APPLE OR Mango OR Grapefruit ...)
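A rule of this kind can be sketched as a small keyword classifier. This is a simplified illustration under assumptions, not the rule engine used by Statistics Austria: the keyword lists, the function names and the volume range (taken here from the class name "Orange Juice, 500-1000ML") are hypothetical.

```python
import re

INCLUDE_KEYWORDS = ("orange",)                     # hypothetical include list
EXCLUDE_KEYWORDS = ("apple", "mango", "grapefruit")  # hypothetical exclude list


def volume_ml(description):
    """Extract a volume such as '500 ML' or '1,5 L' and return it in millilitres."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*(ml|l)\b", description, flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1).replace(",", "."))
    return value * 1000 if match.group(2).lower() == "l" else value


def is_orange_juice_500_1000(retailer_group, description):
    """Apply the rule: juice group AND volume range AND include AND NOT exclude."""
    text = description.lower()
    volume = volume_ml(description)
    return (
        retailer_group == "Juices"
        and volume is not None
        and 500 <= volume <= 1000
        and any(keyword in text for keyword in INCLUDE_KEYWORDS)
        and not any(keyword in text for keyword in EXCLUDE_KEYWORDS)
    )


print(is_orange_juice_500_1000("Juices", "BrandXY Orange, mild, 500 ML, bottle"))  # True
print(is_orange_juice_500_1000("Juices", "BrandXY Orange-Mango, 500 ML, bottle"))  # False
```

Plain substring matching is fragile (brand names, abbreviations, spelling variants), which is one reason article mapping is listed among the challenges.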
Challenges scanner data
Common methodological framework for the use of scanner data: work in progress
- Temporal aggregation (days vs. week)
- Spatial aggregation (shop vs. region)
- Article matching
- Article mapping to COICOP
- Data cleaning (outliers, etc.)
- Type of price index
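Why the aggregation level matters can be seen in a small example. The sketch below uses invented weekly figures for one article in one shop and compares the unit value over the pooled period with the simple mean of the weekly unit values; it illustrates the choice, and does not recommend either option.

```python
# Hypothetical weekly records: (turnover in EUR, quantity sold).
weekly_records = [
    (200.0, 100),  # regular week, unit value 2.00 EUR
    (300.0, 200),  # promotion week, unit value 1.50 EUR
]


def aggregated_unit_value(records):
    """Unit value over the pooled period: total turnover / total quantity."""
    total_turnover = sum(turnover for turnover, _ in records)
    total_quantity = sum(quantity for _, quantity in records)
    if total_quantity <= 0:
        raise ValueError("total quantity must be positive")
    return total_turnover / total_quantity


def mean_of_weekly_unit_values(records):
    """Unweighted mean of the per-week unit values."""
    return sum(turnover / quantity for turnover, quantity in records) / len(records)


aggregated = aggregated_unit_value(weekly_records)
simple_mean = mean_of_weekly_unit_values(weekly_records)
print(round(aggregated, 4), simple_mean)
```

The pooled unit value is lower than the simple mean here because the promotion week carries more quantity, so the two aggregation choices give different prices for the same data.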
Challenges scanner data
- Product re-launches: new ID (EAN/GTIN) for the same product
- Example and picture: Statistics Netherlands (Antonio Chessa)
Web-scraping (in price statistics)
Statistics Austria: point-and-click web-scraping software
Challenges web-scraping
- Websites change frequently: high workload for web-scraper maintenance
- Identification of articles with high turnover
- Identification of re-launches
- Legality of crawling websites
Challenges web-scraping: websites change frequently
Decision to use point-and-click web-scraping software (import.io)
No IT developer needed, therefore:
- Cheap
- Flexible
- No programming skills required
Challenges web-scraping: identification of articles with high turnover?
Challenges web-scraping: legality of crawling websites
- Technical hurdles of websites may not be circumvented (robots.txt)
- The database may not be replicated as a whole elsewhere through web-scraping
- Web-scraping may not negatively affect a website's performance
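Respecting robots.txt can be checked programmatically with Python's standard `urllib.robotparser` module. The sketch below parses an invented robots.txt for a hypothetical shop (`shop.example`) and a hypothetical user agent name, without any network access; a real scraper would load the file from the target website instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def may_scrape(url, user_agent="price-stats-bot"):
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)


print(may_scrape("https://shop.example/products/juice"))  # True
print(may_scrape("https://shop.example/checkout/cart"))   # False
```

A check like this covers only the first point on the slide; limiting request frequency and not copying a whole database remain separate obligations.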
Potential of scanner data and web-scraping
[Figure: Austrian CPI basket of goods]
Potential of scanner data and web-scraping
[Figure: map of European countries by status of scanner data use]
Legend:
- Already use scanner data for HICP/CPI compilation
- Testing phase using continuous real-time scanner data transmissions
- Testing phase using historical scanner data
- Testing phase using purchased historical scanner data from market research institutes
Countries labelled: Netherlands, France, Austria, Germany, Sweden, Luxembourg, Poland, Italy, Switzerland, Belgium, Great Britain, Norway, Portugal, Spain, Denmark (from Jan 2016), Slovenia, Slovakia, Hungary