The Impact of Big Data on Social Research
David Rhind, Sharon Witherspoon
www.nuffieldfoundation.org
The landscape to be covered
- What is Big Data? Just consultants' hype?
- Key questions for SRA
- Technology + other drivers of change
- New sources of data and their uses
- Big challenges
- Back to the future: the next Census
- Presentation also matters
- Conclusions
What is/are Big Data?
- VOLUME: too large to handle with standard contemporary analytical tools, i.e. a subjective / relative measure. The total amount of data has grown exponentially: it has been estimated that more data was harvested between 2010 and 2012 than in all of preceding human history. Source: http://www.bbc.co.uk/news/business-17682304 (claim certainly made by Mike Lynch; original source IBM?)
- VELOCITY: how fast data is being produced and how fast it must be produced to meet demand.
- VARIETY: many different forms of data are used, structured and unstructured (the majority), held in different types of databases as text documents, emails, imagery, videos and much else
- PROBLEMS: hype, bias in (large) samples, focus on correlations not causality, understanding the results
Context and key questions for SRA
- Current practice is mostly survey-based
- A divide exists between expertise in data collection and analysis skills
- National shortfall in quantitative analytical skills
- Will Big Data, etc. change the ground-rules of research practice?
- Are established practices becoming obsolete?
- Or do we need to assimilate what's new into established principles of research?
Drivers of change
- Extraordinary rate of technological enhancement
- Austerity: better value for money (vfm) sought
- Transparency
- Job creation / increased wealth
- Calls for better / more up-to-date data, information and evidence
- Threats to traditional approaches, e.g. EU Parliament and Data Protection: specific and explicit consent
- Public sector manifestations of change: data scientists sought by government, support of Open Data Institute, ONS exploration of options, data.gov, ESRC £64m funding and ADRCs
Technology change
- Apollo 11, 1969
- The iPhone 4S, 2012, in my pocket: more computing power than Apollo; 3000 x the storage of the IBM 305 disk drive
- IBM 305 disk drive, 1956: leased for $35,000/year
- $150/year
New(ish) sources of data
- Mobile phone sensors
- Proxy: satellite remote sensing at 31cm resolution (how to reflect people data?)
- Proxy: web scraping (e.g. inflation measures)
- Crowd sourcing, e.g. OpenStreetMap
- Management / administrative data (public and private sector)
- Modelling starting from historic data
Visitors and locals in Paris
Source: Eric Fischer
Uses of different data types
- Obtaining data about things: easy? See the remote sensing examples
- People:
  - location and movement of people: technically easy via CCTV and smartphones
  - ethnicity and age data: approximations from names
  - profiles from private sector data or linked governmental administrative data: technically easy
- The best solution is usually a combination of data types, e.g. land cover and use from imagery and company records
Real time data collection is now routine for some applications
Source: UK MoD under the Open Government Licence, Google and US Geological Survey
Different uses of imagery at different resolutions
- 10m resolution: see roads and water features
- 1 to 2 metres resolution: see some cars and individual houses
- 30 to 60cm resolution: see all visible cars, manholes
Source: DigitalGlobe 2014
Extreme crowd sourcing: Pyongyang on OpenStreetMap
Also MH370
Source: UK MoD under the Open Government Licence, Google and US Geological Survey
Admin data / management information
Obvious advantages: already exists, often continuously maintained; linkage of personal admin data facilitates valuable research and fraud reduction
BUT
- You get (at best) what is created for other purposes
- Content or classification changes mess up time series
- Personal admin data sharing and privacy debate
- Has raw data quality been audited properly (English police recorded crime statistics)?
Ratio between CSEW incidents and crime recorded by the police
Adding value = a commercial asset
Can have huge value, e.g. Climate Corporation:
- 2006 start-up by 2 ex-Google staff
- Linked US government weather, crop yield and soil data
- Provides yield forecasting and planting advice, weather and crop insurance
- Bought by Monsanto in October 2013 for $930m
Big Challenges
- Trade-off between data integrity and currency. How good is good enough? How fast is fast enough?
- We want to anticipate the future as well as know the past
- Private sector increasingly active in data collection and exploitation, e.g. Markit surveys used by the Bank of England
- Internationalisation of data collection/assembly is growing
- Public understanding: problem with use of technical language, e.g. the public doesn't really understand the "n-year flood" concept; PM confusion of deficit and debt
- Changed role of data constructor/statistician? Mentors and advocates?
- Is this all a matter for the very young?
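The "n-year flood" point can be made concrete with a short calculation. The following is a minimal sketch, assuming the usual textbook reading that each year independently has a 1-in-n chance of such a flood; the function name `prob_at_least_one_flood` is illustrative and does not come from the presentation.

```python
def prob_at_least_one_flood(n_year: float, horizon_years: int) -> float:
    """Chance of at least one 'n-year' flood within a given horizon.

    Assumption (a simplification): every year independently has
    probability 1 / n_year of seeing such a flood.
    """
    if n_year < 1:
        raise ValueError("n_year must be at least 1")
    if horizon_years < 0:
        raise ValueError("horizon_years must not be negative")
    annual_probability = 1.0 / n_year
    # Complement of "no flood in any of the years".
    return 1.0 - (1.0 - annual_probability) ** horizon_years


if __name__ == "__main__":
    for horizon in (1, 30, 100):
        p = prob_at_least_one_flood(100, horizon)
        print(f"100-year flood within {horizon} year(s): {p:.1%}")
```

Under that assumption a "100-year flood" has roughly a 26% chance of occurring at least once in 30 years and roughly a 63% chance in 100 years, which is far from the common reading of "once every hundred years".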
Back to the future with surveys?
The 2011 Census
- 2011 Census survey data collection went well, but total cost was £480m
- Basically very similar to what has been done for decades; 16% completed on-line
- Results started to become available 15 months after the survey, but much was still being published after 3.5 years
- Changing society: more difficult to complete forms
- Statistics Commission, Treasury Select Committee and UKSA said no more traditional census
LFS Response Rates 1993 to 2008
Source: ONS
US experience is similar: an average of 20% reduction in 20 years
The 2021 Census
- Very strong support from public consultation for continuation of some form of Census
- ONS plan now accepted in principle by government
- Model is for an on-line Census+:
  - aim to achieve high (e.g. 65%) online completion of forms
  - aim to enrich census data by adding variables derived from admin data wherever possible
  - much research under way
- US Bureau of Census experimenting with use of smartphone-derived data
Source: ONS
Data presentation also matters!
Basic arithmetical error: it should be almost 400, not almost 4000!
PM confusing deficit and debt
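The deficit/debt distinction is one of flow versus stock, and a few lines of arithmetic make it visible. The sketch below uses entirely hypothetical figures (in £bn) and a hypothetical helper `debt_path`; it treats each year's deficit as simply added to the debt, which is a simplification.

```python
from itertools import accumulate


def debt_path(opening_debt, annual_deficits):
    """Debt at the end of each year, given the opening debt.

    Deficit is a flow: the amount borrowed in a single year.
    Debt is a stock: the running total of everything borrowed so far.
    """
    running_totals = accumulate(annual_deficits, initial=opening_debt)
    # Drop the first element, which is just the opening debt itself.
    return list(running_totals)[1:]


if __name__ == "__main__":
    # Hypothetical figures in £bn, chosen only for illustration.
    deficits = [150, 120, 100, 80]
    print("Deficit each year:", deficits)
    print("Debt at year end: ", debt_path(1000, deficits))
```

In this illustration the deficit falls every year while the debt keeps rising, so "cutting the deficit" and "paying down the debt" are not interchangeable statements.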
National Infrastructure Plan: pipeline value by sector (£m)
Moral: how information is presented can seriously mislead (note log scale on Chart 2)
[Chart: pipeline value by sector; vertical axis: capital value (£ million), up to 250,000; sectors shown include Communications, Flood, Transport, Water]
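Why a log scale can mislead is easy to show numerically. This is a rough sketch with made-up sector values (not the National Infrastructure Plan figures) and a hypothetical helper `bar_lengths`; it assumes bars on the log chart are drawn upward from a baseline of 1.

```python
import math


def bar_lengths(values, log_scale=False, baseline=1.0):
    """Drawn length of each bar relative to the longest bar.

    On a linear axis the length is proportional to the value itself.
    On a log axis (assumed to start at `baseline`) the length is
    proportional to log10(value / baseline). Values must exceed baseline.
    """
    if log_scale:
        raw = {name: math.log10(v / baseline) for name, v in values.items()}
    else:
        raw = dict(values)
    longest = max(raw.values())
    return {name: length / longest for name, length in raw.items()}


if __name__ == "__main__":
    # Hypothetical values in £m: one sector is 100 times the other.
    pipeline = {"Sector A": 200_000, "Sector B": 2_000}
    print("Linear axis:", bar_lengths(pipeline))
    print("Log axis:   ", bar_lengths(pipeline, log_scale=True))
```

With these invented numbers the smaller bar is 1% of the larger one on a linear axis but more than 60% of it on the log axis, so a reader who misses the scale will badly misjudge the relative sizes.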
Conclusions
- Much Big Data hype, but a revolution is under way
- This will change the way we assemble data and do social science to extract added value
- Much more work will be done by multi-disciplinary teams with higher-level analytic, quantitative and presentational skills in various disciplines
- Greater focus is still needed on data quality issues
- Need a focus on data sharing governance, ethics and safeguards, and on advocacy of benefits
- Q-Step will help a BIT. But organisations like the SRA and its members have an important role!
Thank you