«Big data» versus «small data»: the case of gripe (flu) in Spanish


Antonio Moreno-Sandoval (Department of Linguistics, UAM & IIC)
Esteban Moro (Department of Mathematics, UC3M & IIC)

The starting point
"Big data hubris" is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
Lazer, Kennedy, King and Vespignani: "The Parable of Google Flu: Traps in Big Data Analysis". Science, vol. 343, 14 March 2014.

The Google Flu Trends (GFT) failure
→ Nature reported in February 2013 that GFT was predicting more than double the number of doctor visits for influenza reported by the US medical authorities.
- Method: counting flu-related searches
- GFT 2009:
  - big data were overfitting the small number of cases
  - GFT was part flu detector, part winter detector
- GFT 2013, again, has been persistently overestimating flu prevalence for a much longer time

The cause of the error, according to Google
1. Big data overestimation
2. Algorithm dynamics, which pollute and manipulate data by amplifying rumors and trending topics.

What is Big Data?
- The Four V's of Big Data, according to IBM:
  1. Volume: scale of data (2.3 trillion GB created each day)
  2. Velocity: analysis of streaming data
  3. Variety: different forms of data (text, video, healthcare records, etc.)
  4. Veracity: uncertainty of data

Wikipedia definition
Big data is a broad term for data sets so large and complex that traditional data processing applications are inadequate. The term often refers simply to the use of predictive analytics to extract value from the (unstructured) data.

What is Big LANGUAGE Data?
- 30 billion pieces of content are shared on Facebook every month
- 4 billion hours of video are watched on YouTube each month
- 400 million tweets are sent per day by about 200 million monthly active users
- In dozens of different languages

Can we linguistically process big language data?
- Our experiment: to replicate Google's approach using Twitter messages in Spanish that:
  1. were geolocated in Spain
  2. included the word gripe
  3. were sent between January 2012 and August 2014
→ we collected our Spanish Flu Corpus on Twitter

Spanish Flu Corpus on Twitter
- 2759 tweets (including retweets)
- 327072 words
- 140 weeks
- sent from locations in Spain

Procedure
- Hypothesis: an increase in the number of messages containing GRIPE predicts an approaching peak in cases
- Verification: check the prediction against the cases reported by the Spanish Health System (real data sent by doctors in health centers and hospitals)
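One way to make the verification step concrete is a lagged correlation between weekly tweet counts and weekly reported cases. The slides do not state which measure was used, so the following is only a minimal sketch: the function names, the choice of Pearson correlation, and the toy numbers are all assumptions for illustration, not data or methods from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(tweet_counts, reported_cases, lag):
    """Correlate tweets in week t with reported cases in week t + lag."""
    if lag == 0:
        return pearson(tweet_counts, reported_cases)
    return pearson(tweet_counts[:-lag], reported_cases[lag:])


# Invented toy series (NOT data from the study), one value per week.
toy_tweets = [2, 3, 8, 15, 9, 4, 2, 1]
toy_cases = [10, 12, 20, 60, 110, 70, 30, 15]

# The lag (in weeks) at which tweets best anticipate reported cases.
best_lag = max(range(0, 3),
               key=lambda k: lagged_correlation(toy_tweets, toy_cases, k))
```

If the hypothesis holds, the correlation should peak at a positive lag, meaning that tweet volume rises before the officially reported cases do.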

Elimination of noise in data
- noise: institutional or press messages:
  100.000 personas aún no han podido vacunarse contra la gripe (100,000 people have still not been able to get vaccinated against the flu)
  → we removed all messages with a URL
- good data:
  → estoy en la cama con gripe (I am in bed with the flu)
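The URL filter described above can be sketched in a few lines. This is an illustrative assumption about how such a filter might look, not the authors' code: the regular expression is a simple heuristic, and the example URL is hypothetical.

```python
import re

# Heuristic URL detector (an assumption): matches http(s) links and bare "www." links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def remove_messages_with_urls(messages):
    """Keep only the messages that contain no URL."""
    return [m for m in messages if not URL_PATTERN.search(m)]


sample = [
    # Press-style message; the link is a made-up placeholder.
    "100.000 personas aún no han podido vacunarse contra la gripe http://example.com/noticia",
    # Personal message, the kind of "good data" the slide keeps.
    "estoy en la cama con gripe",
]
kept = remove_messages_with_urls(sample)
```

The filter is deliberately blunt: it assumes that institutional and press tweets usually link to a source, while first-person reports usually do not.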

Results

Discussion
- Messages on Twitter in Spanish also magnify the real number of flu cases, as Google Flu Trends does (http://www.google.org/flutrends/es/#es).
- GFT is based on flu-related searches. Google compared the query counts with data from flu surveillance systems.

The problem: VERACITY in data
- Analysis of the Flu Corpus to discover which factors harm the prediction.
→ Distinguishing Good Data from Bad Data

Bad data
- Figurative and humorous language: a joke:
  No sé si es un constipado virulento o una virurápida.
  (I don't know if this is a virulent cold or a virufast flu)
→ separating real cases from figurative ones requires good handling of pragmatic aspects (intention, irony, metaphor...)

More Bad Data
- when the text contains gripe but the speaker does not suffer from it:
  Acabo de ver un anuncio de Gelocatil gripe y me he acordado de @...
  (I just saw an ad for XXX flu and it reminded me of @...)
→ all those wrong cases contribute to overestimation.

Some refinements
- Finding synonyms and variants of gripe in the Twitter corpus: gripazo, griposo
- Detecting sequences that indicate that someone has contracted the disease:
  con la gripe en casa (at home with the flu)
  menudo gripazo he pillado (what a nasty flu I have caught)
  con tos y moqueando (coughing and with a runny nose)
  la gripe me mata (the flu is killing me)
→ THERE ARE VERY FEW LEXICAL AND SYNTACTIC PATTERNS IN THE TWITTER CORPUS
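The two refinements can be approximated with regular expressions. The sketch below is a rough illustration built only from the examples on the slide: the pattern list, the function names, and the added feminine form griposa are assumptions, and a handful of patterns like these will miss most real phrasings.

```python
import re

# Lexical variants of "gripe" named on the slide, plus "griposa" (an assumption).
VARIANT = re.compile(r"\bgrip(?:e|azo|os[oa])\b", re.IGNORECASE)

# Hypothetical cue patterns derived from the slide's example sequences.
INFECTION_CUES = [
    re.compile(r"\bcon (?:la )?grip(?:e|azo)\b", re.IGNORECASE),
    re.compile(r"\bgripazo he pillado\b", re.IGNORECASE),
    re.compile(r"\bla gripe me mata\b", re.IGNORECASE),
    re.compile(r"\bcon tos y moqueando\b", re.IGNORECASE),
]


def mentions_flu(text):
    """True if the text contains gripe or one of its variants."""
    return VARIANT.search(text) is not None


def suggests_infection(text):
    """True if the text matches a cue that the writer actually has the flu."""
    return any(pattern.search(text) for pattern in INFECTION_CUES)
```

For example, `suggests_infection("estoy en la cama con gripe")` is true, while the Gelocatil advertisement tweet mentions gripe without matching any infection cue.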

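Collocations: tener & gripe
- Si tienes <gripe> lo mejor es descansar. (If you have the flu, the best thing is to rest.)
- además tengo un <gripazo> de cuidado (besides, I have a really nasty flu)
- Seguramente tenga <gripe> (I probably have the flu)
- Odio tener <gripe> en temporada de calor (I hate having the flu in the hot season)
- tengo un <gripazo> flipante (I have an unbelievable flu)
- que tengo el <gripazo> padre (I have the mother of all flus)
- Creo que tengo <gripe> post-estrés (I think I have post-stress flu)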

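Now, the Small Data
- Analysing gripe in a medical corpus (MultiMedica) with the Sketch Engine: just 341 occurrences, against 2759 tweets in the Twitter corpus
- gripe collocates with:
  virus, brote, caso, estación, azote, epidemia, vacuna, temporada (virus, outbreak, case, season, scourge, epidemic, vaccine, season)
  padecer, sufrir, tratar, cambiar, superar (to suffer from, to suffer, to treat, to change, to overcome)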
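To give a feel for how collocates are found, the sketch below counts the words that occur within a few tokens of gripe. This is a deliberately simplified window-based count, not the grammar-based Word Sketch computed by the Sketch Engine; the function name, the window size, and the two toy sentences are assumptions for illustration and do not come from MultiMedica.

```python
import re
from collections import Counter


def window_collocates(sentences, node="gripe", window=3):
    """Count words appearing within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts


# Invented example sentences (NOT taken from the medical corpus).
toy_sentences = [
    "El virus de la gripe causa un brote cada temporada",
    "La vacuna contra la gripe reduce el brote",
]
collocates = window_collocates(toy_sentences)
```

A real analysis would also filter out function words such as la and de and rank the remaining candidates with an association measure rather than raw counts.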

gripe Word Sketch

Conclusions
- Our study supports Lazer et al. 2014: "instead of focusing on a 'big data revolution', perhaps it is time we were focused on an 'all data revolution', where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world."

Conclusions (2)
- In terms of corpora, small but well-selected collections of linguistic data should be combined with large repositories from the internet and social networks, since small data sometimes offer information that cannot be inferred from big data.