Keeping up with the modern consumer - online data in price statistics


DATA COLLECTION SESSION F

Keeping up with the modern consumer - online data in price statistics

Kjersti Nyborg Hov, Statistics Norway, Division of price statistics
Leiv Tore Salte Rønneberg, Statistics Norway, Division of price statistics

Kjersti Nyborg Hov [1], Leiv Tore Salte Rønneberg [2]

Abstract. Ordering and buying goods or services for private use over the internet is increasing in popularity. As private consumers' spending habits move away from physical shops toward e-commerce, statisticians must follow. This new channel offers a lot in terms of data collection for price statistics such as the consumer price index (CPI) and purchasing power parities (PPP). First, the price information is available: enterprises have an incentive to provide prices to consumers, so this information is also available for official statistics. This often holds true for physical stores as well, as many of them have an online solution alongside their physical location. Second, the channel holds great promise for collecting the necessary data electronically and automatically. Our ambition is thus two-fold: we would like to reach the online market better, but also to move more of our traditional data collection to the internet wherever possible. By better utilizing the information available online, the goal is to greatly increase the number of price observations in the indices, as well as to reduce the burden on respondents of filling out surveys. While the information is available for all to see, it is not necessarily easy to extract in a systematic or consistent manner. In recent years, though, easy-to-implement software solutions for scraping data from the web have become available.

In this paper we will:
- Highlight possibilities and difficulties of this new consumption pattern for the calculation of price statistics such as the CPI and the PPP.
- Give an overview of possible ways to access the available information in a systematic way, using IT tools such as web scrapers, and of how we are utilizing these tools today.
- Discuss the possibilities of more robust data collection from the internet, by moving towards more automation and the semantic web.

Key words: data collection, web scraping.

[1] Statistics Norway, Adviser, Oslo, Norway, Email: kjersti.nyborg.hov@ssb.no
[2] Statistics Norway, Senior executive officer, Oslo, Norway, Email: leiv.ronneberg@ssb.no

1. Introduction

Ordering and buying goods or services for private use over the internet is growing in popularity. For many Nordic consumers, e-commerce has become a natural way to buy goods and services, and e-commerce is growing fast. Rapid development of information technology and secure payment systems has expanded households' internet purchases. This new purchasing channel compels price statisticians to revise their traditional price collection methods. Since 2014, Statistics Norway (SN) has experimented with a new data collection method called web scraping, where an automated internet robot extracts prices and other useful product information directly from the internet. Both the experiences with, and the results of, using a web scraping tool are presented in this paper.

The paper starts with a short introduction to the Consumer Price Index (CPI) and its main objectives. Second, it gives an overview of e-commerce in Norway, and of both the coverage and treatment of online prices in the Norwegian CPI as of today. Next, the concept of web scraping is presented together with the results from using different experimental techniques for both goods and services. Special attention is then drawn to the legal issues concerning automated extraction of data online. Lastly, the paper focuses on possible future opportunities for online data collection using web scraping techniques.

2. Background

Statistics Norway, Division of price statistics, was awarded a financial contribution from Eurostat for a project concerning the use and analysis of online prices for the period January 2014 to March 2016. This paper is based on the final technical report of that project. The internet is already an important purchasing channel, and it is important that a representative CPI not only covers online purchases and their price movements, but also reflects the appropriate impact of online prices.
When goods and services are purchased online, and their price levels differ from those in traditional stores, it is important to reflect this in price indices such as the CPI. This is especially true when the online market shows a different price development from traditional stores.

Another objective of the project was to make the online data collection as efficient as possible, following general strategies in Statistics Norway. Today, most online prices are collected manually from the internet, which is rather time-consuming. The other option is to collect online prices via web questionnaires sent to the stores, which merely shifts the labour burden from SN to the data providers. In order to achieve an efficient data collection of online prices, SN looked into different automated data extraction techniques, so-called web scraping.

3. The consumer price index

In short, the Norwegian consumer price index (CPI) is a measurement of the cost of living in Norway. The aim is to measure the actual price changes of goods and services as they are experienced by the typical Norwegian household. To do this, a few ingredients are needed. First, we need to know which goods and services households actually consume. Second, we need to know how much of their budget is allotted to these particular items. Third, we need to know the prices of these goods and services.

It is neither possible, nor appropriate, to cover all goods and services acquired by private households, so a sample is chosen. The sample is based on the representativeness and availability of the items and services. By representativeness, we mean an item or service whose attributes, characteristics and price development resemble those of other consumer goods or services in the same segment. Instead of collecting prices for all smartphones in the market, we choose a few smartphones that are meant to represent the rest of them in terms of price development and other attributes. Thus, the selected item or service represents a wider group of items or services.

The sample of representative goods and services makes up what is called the basket. Once a year, the basket is revised and updated with new items and services; in addition, those items and services that are no longer considered representative are omitted. How often an item or service is replaced or omitted depends largely on the type and characteristics of the representative item or service, and also on how wide or narrow our item specifications are. Electronic devices such as computers, cell phones and television sets are frequently subject to replacement, as new and better technology makes these items short-lived. DVD players were replaced by Blu-ray players, which in turn will probably be deemed unrepresentative because of the increased use of streaming services.
Other representative items, such as trousers, t-shirts and underwear, are long-lived and seldom replaced, since these items remain both representative and available.

Much as for the selection of goods and services, there is also a sample of stores. The sample of stores is based on SN's Register of Establishments and Enterprises and its defined industries, and it covers both physical and online stores. There is no need to have full coverage of stores, but again the sample should represent the full population of stores. The sampling plan makes sure that the CPI covers the stores with the highest turnover, which are thus the most representative stores, while also taking the geographical dispersion of the population into consideration.

The traditional way of collecting prices, mostly of tangible goods, is by web questionnaires [3] sent out to the representative sample of stores located in different parts of the country. Statistics Norway has a long history of sending paper questionnaires to the data providers, where the shopkeepers themselves complete the questionnaires and return them. Twenty years ago, paper questionnaires were the main data source in the Norwegian CPI. Still today, SN collects prices by means of web questionnaires; however, increasing attention to a lower response burden for data providers, low costs and effectiveness, together with the desire to improve data quality, has led us to explore the possibilities of new and more advanced data collection methods.

[3] Statistics Norway has a history of collecting most of the prices for the CPI through questionnaires. Web questionnaires have now taken over, and as of 2015, the data providers no longer fill out paper questionnaires.

[Figure 1: Data collection in the Norwegian CPI. Pie chart with the categories web questionnaires (34 %), scanner data, household data, data collected from the internet, and other statistics in Statistics Norway; the remaining shares shown are 30 %, 19 %, 15 % and 2 %. Source: Statistics Norway]

As of today, web questionnaires make up 34 per cent of the total basket of goods and services, measured by CPI weights, as seen in figure 1. Other alternative methods of collecting price data, such as scanner data and manual collection of online data, are gradually replacing the more traditional data collection techniques.

4. Online prices in the CPI

Increasing digitalization has made the web a common marketplace. More and more consumers are purchasing goods and services through online solutions, and this is also the case for Norwegian consumers. A closer look at the Norwegian e-commerce market shows that there has been, and still is, rapid growth in this sales channel. In fact, Norwegian consumers are among those who shop the most online. According to E-commerce Europe and their Northern Europe B2C E-commerce Report 2015, Norwegian online purchases of goods and services were estimated at EUR 10.3 billion in 2014, up from EUR 8.9 billion in 2013. With this total of online sales, Norway showed the strongest e-commerce in Northern Europe in both 2013 and 2014. Statistics Norway's second-quarter 2015 survey of ICT [4] usage in households shows that 76 per cent of private households have used the internet for buying or ordering goods and services during the last 12 months. And, according to a 2016 report by PostNord, more than 1 out of 3 Norwegian consumers purchase goods or services online at least once a month. Several obvious factors make Norwegians and their fellow Nordics some of the most experienced online shoppers: high living standards, strong purchasing power and widespread internet access.

[4] The notion ICT covers technology related to processing, presentation and storing of information, in addition to technology for communication and exchange of information.

Although traditional stores still maintain a strong position, the growing e-commerce market compels price statistics such as the CPI to include online prices. If one could assume that the prices and price development in the traditional and the online store are identical, there would be no need to collect prices from both locations. Collecting prices from only one of them would be sufficient to calculate the actual price development of an item or service. However, this assumption might not hold. There are various reasons why prices online might differ from those of a traditional store. First, it is plausible that prices in a traditional store are adjusted less often than in an online store; menu costs are close to non-existent online, and prices can be changed instantaneously, even dynamically in response to user behaviour. Second, a pure online store keeps costs down by having fewer employees and lower rental costs. Third, delivery costs incurred by online stores could contribute to price differences between the two channels.

On the other hand, one could argue that traditional stores that have an online solution alongside their physical locations are likely to have the same prices in both sales channels. In this case, consumers may use the online stores to look up prices and stock availability across several potential businesses and their physical outlets, and then shop at the physical store that stocks the item, has the lowest price, and is located close to the consumer.

For some items in the basket, online collection of prices is already taking place.
Airfares, package holidays, cinema tickets and many others are services that Norwegian consumers mostly purchase online, so the data collection for the CPI is also done online. This type of data collection is typically a labour-intensive, manual exercise: visiting webpages, looking up prices, and entering the data into spreadsheets. Roughly 19 per cent of the total sample of goods and services, measured by CPI weight shares, is collected in this fashion. These prices are, to a large degree, related to services. With the exception of food and non-alcoholic beverages, pharmaceuticals, and several others, prices for many goods are still collected from web questionnaires. In addition, due to our sampling design, and the fact that industry categories are aggregated, we have not always managed to draw and include the most important online stores in our sample for questionnaires. The result is a sample of stores that possibly under-represents the online market.

Given the growth of e-commerce among Norwegian consumers, and the fact that we do not know whether price developments are equal, it is vital for the Norwegian CPI to capture online prices as well as prices from traditional stores. One solution is to solve our sampling issues and send web questionnaires to the online stores, but this seems antiquated and inefficient given that the information is already publicly available. A better route is to utilize automatic tools to collect data from these sites, reducing both the burden on respondents and the manual labour for SN. Scanner data is another viable option, one that even includes the actual transactions made, not just prices.

5. Data collection using web scraping

Collecting data from websites can be done in several different ways, but quite often copying and pasting the desired data is the best you can do. Occasionally, however, there are structures in the way information is presented online that can be utilized to automate at least parts of the process.

5.1 The structure of a web site

The web sites we are interested in come in many forms, have different structures, and display varying numbers of products on offer. Sites that provide services typically contain only a few prices, listed in an unstructured fashion; a dentist may, for example, list prices for 3 basic treatments simply in a line of text. However, sites that sell goods often display many products and list them in a structured way: large online stores often list their products in large tables including detailed product specifications. This gives rise to two distinct axes of interest in our data collection:
- Few vs. many prices
- Unstructured vs. structured data

The more structured the data is, the easier it is to extract using automatic tools. The greatest gain is therefore found where the site has many products and lists them in a structured fashion, as most large online stores do. The CPI also covers products that are listed in a more unstructured fashion with few prices per site, such as cinema and theatre tickets. A selection of these is manually collected each month.

It may be worthwhile to clarify the notion of structured data. All websites inherently have some kind of structure from the HTML behind them; a website has a tree-like structure, where the elements of the page (nodes) branch out from the main body.
We will refer to product data as structured if there exists a unique node representing the product, with attributes such as product name and price branching off this node. In the left-hand panel of figure 2, the site S contains a table A that has three elements X, Y and Z. These elements represent products with prices p_x, p_y, p_z and product names n_x, n_y, n_z respectively. Each product has its own node, with price and product name as separate nodes branching off the product node; this data is well structured. It is also worth noting that once the product nodes (X, Y, Z) are found, the paths to the price and the product name are identical for each node.
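The definition of structured data above can be illustrated with a minimal Python sketch. It assumes a hypothetical, well-formed page fragment (real HTML is often not valid XML and would need a more tolerant parser), and the element names, class attributes and prices are invented for illustration. The point it shows is that one path locates the product nodes, and the same relative paths then yield name and price for every product.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: site S with a table A holding product nodes X, Y and Z.
# Markup, class names and prices are invented for this illustration.
PAGE = """
<html><body>
  <table id="A">
    <tr class="product"><td class="name">n_x</td><td class="price">10.50</td></tr>
    <tr class="product"><td class="name">n_y</td><td class="price">20.00</td></tr>
    <tr class="product"><td class="name">n_z</td><td class="price">7.25</td></tr>
  </table>
</body></html>
"""


def extract_products(page: str) -> list[tuple[str, float]]:
    """Return (name, price) pairs by following fixed paths in the tree."""
    root = ET.fromstring(page)
    products = []
    # Path from the root of the tree to each product node.
    for node in root.findall("./body/table[@id='A']/tr[@class='product']"):
        # From every product node, the paths to name and price are identical.
        name = node.find("./td[@class='name']").text
        price = float(node.find("./td[@class='price']").text)
        products.append((name, price))
    return products


if __name__ == "__main__":
    for name, price in extract_products(PAGE):
        print(name, price)
```

If the same three prices were embedded in a sentence instead, no such path would exist, which is the difficulty with unstructured data.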

[Figure 2: Structured (left) vs. unstructured (right) data. Left: site S contains a table A with product nodes X, Y and Z, each with a name node (n_x, n_y, n_z) and a price node (p_x, p_y, p_z). Right: site S contains a single text node T reading "The price of n_x is p_x, the price of n_y is p_y and the price of n_z is p_z."]

An example of unstructured data might also be useful to better grasp the difference between the two. In the right-hand panel of figure 2, the site S has only a text T on it, reading "The price of n_x is p_x, the price of n_y is p_y and the price of n_z is p_z." While it is easy enough for us to recognize and infer the same information from the two sites, for a robot it is a difficult task, since there is no simple defined path for extracting the data. One option is to extract the whole text, but it would still take some human interaction to interpret it and put it into a table suitable for calculation. These two examples are archetypes, with real data taking on forms anywhere between them.

By giving simple instructions on which branch to take from each node, a path is created from the root of the tree to a terminal node. Given a path and a URL, a simple web scraper will go to the given URL, follow the path, and return the data at the end of it. A more advanced web crawler will, in addition to what the scraper does, gather links on the webpage as it goes along, follow them, and repeat the process at each page until some stopping criterion is met. Usually, large online stores will structure all their products identically, so that the same paths will work on all URLs connected with that particular site. In this fashion, a crawler can scrape similarly structured information from the entire website.

5.2 Software: Import.io

While there are many ways to extract structured data (all you really need is something that can navigate paths on a website), we've chosen to utilize the free-to-use application Import.io [5]. In the basic setup, Import.io requires minimal programming skills and works in a visual fashion by learning from user interactions. This was an important criterion in choosing which software to use. Basically, you click on the information you are interested in and the program infers the necessary paths. After a few example sites, the robot is ready to use. Now, given a new website, the robot will return the information from the inferred paths, provided a similar structure is found on the new site. This means that it is easy to extract large quantities of data from a single online store, as products are structured in a similar fashion across the store. Note, however, that training a new robot is necessary for every new online store of interest.

[5] https://www.import.io/

Another advantage of Import.io is the possibility of communicating with the robots through an application programming interface (API), which we do through a simple Java application. By doing our own programming, it was easier for us to control the data coming from the robots. For example, if a site is not returning data, the Java program makes the robot retry the site several times until data is returned or the URL is deemed empty. We could also clean the data as it downloads and print it in a format suitable for further calculation. At the beginning of the project, these manually coded procedures and checks were vital, but later versions of Import.io have some of them built in.

5.3 Personal care products and consumer electronics

In our implementation of Import.io, we've focused on four major e-commerce contenders in the areas of consumer electronics and personal care products. The online stores are all of the "many prices, structured data" type, and we've built a scraper robot for each site. From the four scrapers we have received approximately 4300 price observations daily, for approximately 60 different consumer goods, for over a year. These include personal care products such as shampoo, hair conditioner, mascara and lipstick, and consumer electronics such as vacuum cleaners, television sets and laptops. We've focused on scraping the items that the sites themselves list as most sold in each category.
We base this decision on the assumption that the stores themselves have no incentive to mislead their customers. Thus the items scraped are considered representative of the consumer good in question. The automated data collection has been fairly stable, with little maintenance needed once we tweaked the Java program to include automatic retries of URLs. Figure 3 gives an overview of the number of prices received per day.
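The retry and cleaning logic described in section 5.2 can be sketched roughly as follows. This is an illustration in Python, not the Java application used in the project: the `fetch` callable stands in for whatever call asks a robot for data, and the function names, the number of attempts and the price format ("5 990,00 kr") are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    url: str,
    fetch: Callable[[str], list],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[list]:
    """Call fetch(url) until it returns rows; None means the URL is deemed empty."""
    for attempt in range(1, max_attempts + 1):
        rows = fetch(url)
        if rows:
            return rows
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait a little before asking again
    return None


def clean_price(raw: str) -> float:
    """Convert a price string such as '5 990,00 kr' (assumed format) to a float."""
    kept = "".join(ch for ch in raw if ch.isdigit() or ch == ",")
    return float(kept.replace(",", "."))


def make_flaky_fetch(failures: int) -> Callable[[str], list]:
    """Hypothetical stand-in for a robot that returns no data a few times."""
    calls = {"n": 0}

    def fetch(url: str) -> list:
        calls["n"] += 1
        if calls["n"] <= failures:
            return []
        return [{"name": "Laptop", "price": "5 990,00 kr"}]

    return fetch


if __name__ == "__main__":
    rows = fetch_with_retries("https://example.invalid/laptops", make_flaky_fetch(2))
    if rows is None:
        print("URL deemed empty; needs manual follow-up")
    else:
        for row in rows:
            print(row["name"], clean_price(row["price"]))
```

Returning `None` rather than an empty list makes the "URL deemed empty" case explicit, so that a warning can be raised and the URL replaced manually.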

[Figure 3: Number of observations scraped per day since March 2015. Vertical axis: number of price observations per day (0 to 7000); horizontal axis: March 2015 to May 2016.]

While the scraping is automatic, it has still relied on our computers being turned on; therefore, the robots have not been run on weekends or national holidays. This explains most of the gaps in the graph. The red line marks a structural shift in the number of observations. This is due to one of the sites changing from a relatively simple table structure to a fancier infinite-scroll structure. We have not yet been able to reprogram our robot to deal with this; consequently, the number of price observations from this site dropped from about 1000 per day to 300.

Every once in a while, a site will move its products around, i.e. change the URL. When this happens, our program warns us, and we have to manually replace the old, empty URL with the new one containing the data. On average, this happens a few times a month per robot, and a bit more often during seasonal changes in inventory (summer BBQ products, for example). A website may also change its entire structure, so that the robot has to be retrained. For an experienced Import.io user, this will not take more than a few minutes thanks to the point-and-click interface. In our case, this has been a rare event; on average maybe once or twice a year per robot. It might appear that we have been lucky in this respect. Experiences from other price statisticians doing similar work indicate that this could be a bigger hassle than it has been for us. It is clear, though, that the maintenance time of the system is proportional to the number of sites scraped or robots used.

Besides the technological challenges, there are several methodological ones. Major bottlenecks in a system based on web scraping are the classification of items and index aggregation. These issues are beyond the scope of this paper.
5.4 Airfares

Another area where automated data collection is both feasible and beneficial is airline fares. Most air travel in Norway is booked online, and our collection of data is at present a fairly manual task. We visit the airlines' web sites, look up prices, and copy and paste them into a spreadsheet for further analysis. Looking up several prices, for several dates, destinations and airlines, is thus very time-consuming. By utilizing tools intended for travel agents, we have tried to automate this process as well, mainly motivated by efficiency.

Most airlines provide fare information to travel agents and search sites, but we experienced problems using web scrapers consistently, both on the search sites and on the airlines' web sites. To solve these issues we looked into utilizing a travel search engine owned by Google called QPX Express. The QPX Express engine allowed us to send a request containing all price-determining characteristics, such as departure and destination airport or city, date of departure, time of day, preferred cabin and airline carrier. In response we got a list of solutions to our request with detailed information about each leg of the itinerary, including departure and arrival times, flight numbers, carrier-specific information, total sale price and taxes. The major advantage of this engine was that we could simply press a button to collect all the prices in a matter of minutes, saving us several hours of labour.

The Google QPX engine was run alongside the regular collection for a little under a year, and we collected the same destinations and departure dates as the regular CPI collection; but we included all flights to the destination on that particular day, not just a sample of them. This was done to check whether the two sources were comparable, i.e. that the price development was similar. Results are shown in figure 4, with indices calculated from August 2015 to April 2016 [6].

[Figure 4: Monthly chained price indices of airline fares, August 2015 = 100. Series: official CPI (recalculated) and Google QPX data, August 2015 to April 2016; vertical axis from 90 to 140.]
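As a rough illustration of what a monthly chained index with August 2015 = 100 involves, the Python sketch below links month-on-month price changes into one series. The paper does not state which elementary formula is used for airfares, so the geometric mean of price ratios (a Jevons-type relative) over routes observed in both months is an assumption, and the route codes and prices are invented.

```python
from math import exp, log


def monthly_relative(prev: dict, curr: dict) -> float:
    """Geometric mean of price ratios for items priced in both months (assumed formula)."""
    common = sorted(set(prev) & set(curr))
    if not common:
        raise ValueError("no matching items between the two months")
    return exp(sum(log(curr[k] / prev[k]) for k in common) / len(common))


def chained_index(monthly_prices: list, base: float = 100.0) -> list:
    """Chain the month-on-month relatives, starting from base in the first month."""
    index = [base]
    for prev, curr in zip(monthly_prices, monthly_prices[1:]):
        index.append(index[-1] * monthly_relative(prev, curr))
    return index


if __name__ == "__main__":
    # Invented prices in NOK per route; the first month plays the role of August 2015.
    months = [
        {"OSL-BGO": 1000.0, "OSL-TRD": 800.0},
        {"OSL-BGO": 1100.0, "OSL-TRD": 800.0},
        {"OSL-BGO": 1100.0, "OSL-TRD": 880.0, "OSL-TOS": 1500.0},
    ]
    for i, value in enumerate(chained_index(months)):
        print(f"month {i}: {value:.1f}")
```

Because each link only uses items matched between adjacent months, a route that appears later (here the hypothetical OSL-TOS) enters the index from the following month without disturbing earlier links.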
To a large degree, the two sources were comparable, and the engine looked like it could be a feasible alternative to our manual collection, shaving hours off the monthly production routine. However, after discovering that we were in violation of the terms and conditions of the Google QPX service, the collection was discontinued. Future work will still explore the use of booking engines for airfare collection, but we will need to work directly with the airlines or in collaboration with other search sites.

[6] The official Norwegian CPI is recalculated to make the two series comparable, using monthly chaining and August 2015 = 100.

5.5 Legal issues

Automated extraction of data from the internet is a new way of collecting prices for statistical purposes, and in order to take advantage of the data it is necessary to address various issues, for instance legislation. Making frequent extractions of huge amounts of data from web sites - is it really legal? Do we need permission to extract data from someone's web site? To our knowledge, that depends on what kind of data we are scraping, the amount of information accessed and copied, the extent to which the access adversely affects the page owner's system, and the use of the data.

One important feature to consider is whether or not web scraping is against the terms of use of the web sites. We often agree to use a site according to its terms when we access and stay on a specific web site. What is allowed on one web site may be prohibited on another. In many cases the web sites do not have any terms of use available at all. Most of the web sites we have been looking into underline that all information on their sites is protected by copyright law. Many of the sites state that data should not be downloaded or copied without explicit consent from the site owners.

The Norwegian Statistics Act, however, clearly states that SN may impose an obligation to provide information necessary to produce official statistics; hence we are legally entitled to collect the data without having to alert the owners. We have nevertheless chosen to inform the sites' owners in order to establish an open dialogue and cooperation (and to avoid technical obstacles).
Foreign online retailers or foreign server solutions may pose other challenges, as discussed in connection with airfare collection. In any case, it is important to follow best practices for data extraction to avoid harming the web sites, their owners and their interests.

6. Future possibilities

We have focused our work on utilizing the existing structures of websites to extract data, and are thus limited to web sites that have pre-existing structures. Another approach might be to persuade enterprises to structure their web sites appropriately, either through legislation or through incentives. As of October 2015, the electricity prices in the Norwegian CPI come from a price comparison site operated by the Norwegian Consumer Council (NCC). The site lets you compare electricity prices across several companies and choose the one best suited to your usage and location. While we get our prices directly from the backend of this site, what is interesting is how they collect prices from over 100 electricity companies on a daily basis.

As consumer-facing enterprises, all electricity companies list their prices online. The listings are often of the unstructured type, with prices simply displayed in a line of text or in a basic table. Instead of creating 100 separate robots, one for each individual site, each of which would have to be retrained whenever the information was moved, the NCC was able to get the companies to tag the relevant information. This involves adding a small piece of code to the name of the product and to the price information. The tags themselves are pre-specified by the NCC for the different types of information they need. Tagging does not alter the visuals of the site, but it provides structure. Below is the visual appearance of a company's website, and the HTML behind it, for a single electricity product.

Lier Spotpris
Produktet er av typen Spotpris.
Avtalen har et månedsgebyr på 49,00 kroner + 2,7 2,7øre/kWh i påslag.
(In English: "The product is of the type Spotpris [spot price]. The contract has a monthly fee of 49.00 kroner + 2.7 2.7 øre/kWh in surcharge.")

<strong> <span itemprop="productName">Lier Spotpris</span> </strong> <br> <br>
Produktet er av typen <span itemprop="productType">Spotpris</span>. <br>
Avtalen har et månedsgebyr på <span itemprop="monthlyfee">49,00</span> kroner + <span itemprop="elCertificateprice">2,7</span> <span itemprop="addonprice">2,7</span>øre/kwh i påslag.

Figure 5. Visual appearance of website (above) and HTML code (below).

The product in question is called Lier Spotpris and is tagged in the HTML with the tag productName; the type of electricity plan is Spotpris and is tagged as productType, and so on.

In addition, the NCC developed a robot that works a bit differently from the ones we have discussed so far. It is a crawler, but instead of following pre-specified paths, the robot searches the site for tagged information and retrieves it. This means that every company can have a differently structured website, but the same robot works on all of them.
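As a rough illustration of why tagging lets one robot serve differently structured sites, the sketch below pulls every itemprop-tagged value out of a page using only Python's standard library. It is not the NCC's robot: the class ItempropParser, the helpers extract_tagged and to_number, and the second sample page (a made-up table layout with an invented product name) are assumptions made for this example, and it assumes that tagged elements are not nested inside each other.

```python
from html.parser import HTMLParser


class ItempropParser(HTMLParser):
    """Collect the text inside every element that carries an itemprop attribute.

    Assumption: tagged elements are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}       # itemprop name -> extracted text
        self._current = None   # itemprop of the element being read, if any
        self._open_tag = None  # tag name that opened the current element

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name is not None and self._current is None:
            self._current = name
            self._open_tag = tag
            self.fields[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._open_tag:
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None


def extract_tagged(html):
    """Return a dict of itemprop name -> text for one page."""
    parser = ItempropParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def to_number(text):
    """Convert a Norwegian-style decimal string such as '49,00' to a float."""
    return float(text.replace(",", "."))


# The HTML from Figure 5 (prices written as running text).
FIGURE_5_HTML = """
<strong><span itemprop="productName">Lier Spotpris</span></strong><br><br>
Produktet er av typen <span itemprop="productType">Spotpris</span>.<br>
Avtalen har et månedsgebyr på <span itemprop="monthlyfee">49,00</span> kroner +
<span itemprop="elCertificateprice">2,7</span>
<span itemprop="addonprice">2,7</span>øre/kwh i påslag.
"""

# Hypothetical second site that uses a table instead of running text.
TABLE_HTML = """
<table><tr>
<td itemprop="productName">Eksempel Fastpris</td>
<td itemprop="monthlyfee">39,00</td>
</tr></table>
"""

if __name__ == "__main__":
    for page in (FIGURE_5_HTML, TABLE_HTML):
        fields = extract_tagged(page)
        print(fields["productName"], to_number(fields["monthlyfee"]))
```

The same extract_tagged call handles both layouts because it looks only at the tags, never at the surrounding page structure. A full crawler would additionally follow the links on each page, which this sketch leaves out.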
The robot is dropped on the first page of the web site, automatically moves through the site and searches for tags. Given the incentives of enterprises to display price information to the consumer, we hope to see more of these types of solutions in the future. If movie theatres, telephone service providers, kindergartens, dentists, taxi companies etc. did just a tiny bit of this, it would make price comparisons easier, and thus also price statistics. Further, SN spends a considerable amount of resources on gathering information from Norwegian municipalities through questionnaires. Some of the information gathered is publicly available on the municipalities' websites, such as water fees, prices of waste services, property tax etc. If these websites could be structured in a similar manner, time would be saved for all involved.

7. Conclusion

As described, the Internet has become an important purchasing channel; online prices should therefore, with adequate impact, be included in the CPI. Traditional collection of online prices, such as by web questionnaires sent to the stores or manual collection by SN staff, is labour-intensive and not necessarily efficient. The prices of goods and services online are already available for anyone to see, which makes it possible to collect them more efficiently, here using web scraping tools. Experience from using such tools during the last year and a half shows that web scraping from structured web sites is an efficient way of collecting the data needed in the CPI. However, there are methodological issues concerning both classification and index aggregation, as well as legal issues, that need to be resolved before such an approach can be implemented in the official CPI. Future work on online prices will thus concern resolving these issues. Further, we would like to expand the online data collection to other consumption areas, and also to cover unstructured prices.

8. References

PostNord (2016). E-commerce in the Nordics 2016. Available at: http://www.postnord.com/en/media/publications/e-commerce/ (Accessed 11 March 2016).

Statistics Norway (2015). ICT usage in households, 2015, 2nd quarter. Available at: https://www.ssb.no/en/teknologi-og-innovasjon/statistikker/ikthus/aar/2015-10-01 (Accessed 11 March 2016).

Ecommerce Europe (2015). Northern Europe 2014. Infographic. Available at: http://www.ecommerce-europe.eu/facts-figures (Accessed 11 March 2016).