Using Web and Social Media for Influenza Surveillance



Similar documents
Monitoring Influenza Trends through Mining Social Media

Text and Structural Data Mining of Influenza Mentions in Web and Social Media

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 10 March 6 th, March 12 th, 2016

Influenza Surveillance Weekly Report CDC MMWR Week 16: Apr 17 to 23, 2016

Internet Search Activity & Crowdsourcing

EHR Surveillance for Seasonal and Pandemic Influenza in Primary Care Settings

Ebola data from the Internet: An opportunity for syndromic surveillance or a news event?

What is Big Data? The three(or four) Vs in Big Data In 2013 the total amount of stored information is estimated to be Volume.

North Texas School Health Surveillance: First-Year Progress and Next Steps

Illinois Influenza Surveillance Report

SOCIAL NETWORK SIMULATION AND MINING SOCIAL MEDIA TO ADVANCE EPIDEMIOLOGY. Courtney David Corley. Dissertation Prepared for the Degree of

Influenza and Pandemic Flu Guidelines

Online Communicable Disease Reporting Handbook For Schools, Child-care Centers & Camps

Adapted from a presentation by Sharon Canclini, R.N., MS, FCN Harris College of Nursing and Health Sciences Texas Christian University

PREPARING FOR A PANDEMIC. Lessons from the Past Plans for the Present and Future

Enhancing Respiratory Infection Surveillance on the Arizona-Sonora Border BIDS Program Sentinel Surveillance Data

Pandemic Risk Assessment

excerpted from Reducing Pandemic Risk, Promoting Global Health For the full report go to

SPECIAL REPORTING ISSUE 2007

Real-time monitoring of flu epidemics through linguistic and statistical analysis of Twitter messages

PREDICT MARKET SHARE WITH USERS ONLINE ACTIVITIES DATA: AN INITIAL STUDY ON MARKET SHARE AND SEARCH INDEX OF MOBILE PHONE

Overview. Why this policy? Influenza. Vaccine or mask policies. Other approaches Conclusion. epidemiology transmission vaccine

Original Research PRACTICE-BASED RESEARCH

Key Facts about Influenza (Flu) & Flu Vaccine

MACHA Annual Meeting November 6, 2014 Normal, Illinois. Strategies to increase influenza vaccination rates among college students

WHO Regional Office for Europe update on avian influenza A (H7N9) virus

BE SURE. BE SAFE. VACCINATE.

Swine Influenza Special Edition Newsletter

INTERNET SEARCH STATISTICS AS A SOURCE OF BUSINESS INTELLIGENCE: SEARCHES ON FORECLOSURE AS AN ESTIMATE OF ACTUAL HOME FORECLOSURES

Planning for 2009 H1N1 Influenza. A Preparedness Guide for Small Business

ECDC INTERIM GUIDANCE

(Big) Data Analytics: From Word Counts to Population Opinions

Global Influenza Surveillance Network (GISN) Activities in the Eastern Mediterranean Region

Healthcare data analytics. Da-Wei Wang Institute of Information Science

Preparing for. a Pandemic. Avian Flu:

Tracking the flu pandemic by monitoring the Social Web

UNCOVERING SOCIAL MEDIA DATA FOR PUBLIC HEALTH SURVEILLANCE

Pandemic. PlanningandPreparednesPacket

Real-Time Digital Flu Surveillance using Twitter Data

Major Public Health Threat

Guidance for School Responses to Influenza

COMMONWEALTH of VIRGINIA Department of Health

AVIAN INFLUENZA. The pandemic influenza clock is ticking. We just don t know what time it is. Laurene Mascola, MD, MPH, FAAP

AABB Interorganizational Task Force on Pandemic Influenza and the Blood Supply

Preparing for the consequences of a swine flu pandemic

Spatio-Temporal Patterns of Passengers Interests at London Tube Stations

Overview of FDA s active surveillance programs and epidemiologic studies for vaccines

Illinois Long Term Care Facilities and Assisted Living Facilities

How to Prevent an Influenza Pandemic in the Workplace

Michigan Department of State Police Emergency Management Division. Volume: February 23, 2006

Towards detecting influenza epidemics by analyzing Twitter messages

Program in Public Health Course Descriptions

Three-level Response Alert System & Infection Control in the Community

ECDC SURVEILLANCE REPORT

PUBLIC HEALTH MEETS SOCIAL MEDIA: MINING HEALTH INFO FROM TWITTER

WHO checklist for influenza pandemic preparedness planning

Swine Flu and Common Infections to Prepare For. Rochester Recreation Club for the Deaf October 15, 2009

Meaningful Use - HL7 Version 2

Workforce Guidelines: H1N1 Influenza and Flu-like Illness

PREPARING YOUR ORGANIZATION FOR PANDEMIC FLU. Pandemic Influenza:

Chapter 5. INFECTION CONTROL IN THE HEALTHCARE SETTING

DESIGN AND IMPLEMENTATION OF A WEB-BASED GIS FOR PATIENTS REFERRAL TO HOSPITALS IN ZARIA METROPOLIS

Development of an Influenza Risk Assessment Tool

SARS Experiences in China:

HIV NOMOGRAM USING BIG DATA ANALYTICS

Kim Cervantes, MPH, Coordinator / Regional Epidemiology Program Barbara Carothers, LPN, Public Health Representative 1

Competency 1 Describe the role of epidemiology in public health

Medical Big Data Interpretation

Human Infl uenza Pandemic. What your organisation needs to do

Data Analytics for Healthcare: Creating understanding from big data

I. PREREQUISITE For information regarding prerequisites for this course, please refer to the Academic Course Catalog.

Preparing for and responding to influenza pandemics: Roles and responsibilities of Roche. Revised August 2014

Pandemic Influenza Response Plan. Licking County Health Department 675 Price Road Newark, OH 43055

HPAI Response HPAI Response Goals November 18, 2015

Public Health Measures

This article describes the development, testing, and

National Syndromic Surveillance Program (AKA BioSense) Tools for West Virginia syndromic surveillance

FOR INFORMATION CONTACT:

Decision Support Models for Public Health Informatics

FREQUENTLY ASKED QUESTIONS SWINE FLU

Facts you should know about pandemic flu. Pandemic Flu

Summary of infectious disease epidemiology course

CONNECTICUT TECHNICAL HIGH SCHOOL SYSTEM PANDEMIC INFLUENZA & HIGHLY COMMUNICABLE DISEASE PLAN

Ebola: Teaching Points for Nurse Educators

TUBERCULOSIS (TB) SCREENING GUIDELINES FOR RESIDENTIAL FACILITIES AND DRUG

H1N1 Flu Vaccine Available to All Virginia Beach City Public Schools Students

4

Ontario Pandemic Influenza Plan for Continuity of Electricity Operations

International Health Regulations

SA Health Hazard Leader for Human Disease

SOCIAL MEDIA: A NEW DATA SOURCE FOR PUBLIC HEALTH. Mark Dredze Johns Hopkins University Michael Paul, Alex Lamb, David Broniatowski

Thanks to Jim Goble, National City Corporation, for providing the resource material contained in this guide.

Planning for an Influenza Pandemic

PEOSH Model Tuberculosis Infection Control Program

SWINE FLU: FROM CONTAINMENT TO TREATMENT

Meaningful Use - HL7 Version 2

Operational Services

Immunization Healthcare Branch. Meningococcal Vaccination Program Questions and Answers. Prepared by

Exelon Pandemic Planning. Utility Interface With the NRC. Ralph Bassett

AV1300 STAFF INFLUENZA IMMUNIZATION AND EXCLUSION POLICY

Transcription:

Using Web and Social Media for Influenza Surveillance Courtney D Corley 1, Diane J Cook 2, Armin R Mikler 3 and Karan P Singh 4 Abstract Analysis of Google influenza-like-illness (ILI) search queries has shown a strongly correlated pattern with Centers for Disease Control (CDC) and Prevention seasonal ILI reporting data. Web and social media provide another resource to detect increases in ILI. This paper evaluates trends in blog posts that discuss influenza. Our key finding is that from 5-October 2008 to 31-January 2009 a high correlation exists between the frequency of posts, containing influenza keywords, per week and CDC influenza-like-illness surveillance data. Keywords Health informatics, disease surveillance, public health epidemiology, information retrieval, social media analytics 1 Introduction Influenza diagnosis based solely on the presentation of symptoms is limited as these symptoms may be associated with many other diseases. Serologic and antigen tests require that a patient with influenza-like-illness (ILI) be examined by a physician who can either conduct a rapid diagnosis test or take blood samples for laboratory testing. This suggests that many cases of influenza remain undiagnosed. While the presence of influenza in an individual can be confirmed through specific diagnostic tests, the influenza prevalence in the population at any given time is unknown and can only be estimated. In the past, such estimates have relied solely on the extrapolation of diagnosed cases, making it difficult to identify the various phases of seasonal influenza, or the identification of a more serious manifestation of a flu epidemic. Web and social media (WSM) provide a resource to detect increases in ILI. This paper evaluates blog posts that discuss influenza, analysis show a signif- 1 Pacific Northwest National Laboratory, Richland, WA, e-mail: court@pnl.gov 2 School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, e-mail: cook@eecs.wsu.edu 3 Department of Computer Science and Engineering, University of North Texas, Denton, TX, e-mail: mikler@unt.edu 4 Department of Biostatistics, School of Public Health, University of North Texas Health Science Center, Fort Worth, TX, e-mail: ksingh@hsc.unt.edu

2 icant correlation with the US 2008-2009 seasonal influenza epidemic. We briefly discuss a history of infectious disease outbreaks and recent approaches in online public health surveillance of influenza are discussed with regards to outbreak responses. Next, the data set used in our analysis is presented and the methodology for information extraction and trend analysis is outlined posting trends. Through discovery and verification of trends in influenza related blogs, we verify a correlation to Centers for Disease Control and Prevention (CDC) influenza-like-illness patient reporting at sentinel healthcare providers. 1.1 Background Epidemics of infectious diseases have plagued humankind since historical times. There are accounts of epidemics dating back to the times of Hippocrates (459-377 B.C.) and the ancient Greeks [1]. Fourteenth century Europe lost a quarter of its 100 million people to Black Death. The fall of the Aztec empire in 1521 was due to smallpox that eradicated half of its 3 ½ million population. The pandemic influenza of 1918 caused over 20 million excess deaths in 12 months. More recently, the severe acute respiratory syndrome (SARS) outbreak of 2003 highlighted the rapid spread of an epidemic at the global level. The outbreak, emanating from a small Guangzhou province in China, spread around the world requiring a concerted response from public health administrations around the world and the World Health Organization (WHO) to curtail the epidemic [5]. The WHO and CDC [2] actively engage in worldwide surveillance of infectious diseases, and prioritize prevention and control measures at the root cause of epidemics. The pervasiveness and ubiquity of Internet and World Wide Web resources provide individuals with access to many information sources that facilitate selfdiagnosis; one can combine specific disease symptoms to form search queries. The results of such search queries often lead to sites that may help diagnose the illness and offer medical advice. Recently, Google has addressed this issue by capturing the keywords of queries and identifying specific searches that involve search terms that indicate influenza-like-illness (ILI) [4]. Published research on influenza Internet surveillance also includes search advertisement click-through [3] using a set of Yahoo search queries containing the words flu or influenza [9] and health website access logs [6, 7]. Other information sources, such as telephone triage services, can be useful to ILI detection. The findings in Yih et al. [11], show that telephone triage service is not a reliable measure for influenza surveillance due to service coverage; however, it may be beneficial in certain situations where other surveillance measures are inadequate.

2 Data and Methodology 3 Spinn3r (www.spinn3r.com) is a Web and social media indexing service that conducts real-time indexing of all blogs, with a throughput of over 100,000 new blogs indexed per hour. Blog posts are accessed through an open source Java application-programming interface (API). Metadata available with this data set includes the following (if reported): blog title, blog url, post title, post url, date posted (accurate to seconds), description, full HTML encoded content, subject tags annotated by author, and language. Data is selected from an arbitrary time period of 20 weeks, beginning at 5- October 2008 and ending at 31-January 2009. (total 97,955,349; weblogs 69,627,831; forums 1,986,656; mainstream media 21,543,027; other 4,797,835). Weblog, micro-blog and mainstream media items containing characteristic keywords are extracted from Web and online social media published in the same time frame. Characteristic keywords include flu, influenza, H5N1, H3N1, and other keywords relevant to this task. This paper defines the blog-world to be English language, non-spam, blog posts. We also consider the following terms to be equivalent: blog post & blog item and blogger & blog site. Indexing, parsing and link extraction code was written in Python, parallelized using pympi and executed on a cluster at the Center for Computational Epidemiology and Response Analysis at the University of North Texas. This compute resource has eight nodes (2.66GHz Quad Core Xeon processors), 64 core, 256 GB memory, 30 TB of network storage [8, 10]. In our analysis, we extract English language items from the blog-world index when a lexical match exists to the terms influenza and flu anywhere in its content (misspellings and synonyms are not included). The blog items are grouped by month, week (Sunday to Saturday) and by day of week. The extracted blog items containing influenza keywords are termed flu-content posts or FC-posts. FC-post trends can be monitored using the social media mining methodology presented in this paper. This methodology facilitates identification of outbreaks or increases of influenza infection in the population. This paper's most significant finding is a strong correlation between the frequency of FC-posts per week and Centers for Disease Control and Prevention influenza-like-illness surveillance data. 3 Results We hypothesize that the frequency of blog-world flu-posts correlates with a patient reporting of influenza-like-illness during the US flu-season. To verify this statement we compare our data to Centers for Disease Control and Prevention surveillance reports from sentinel healthcare providers. The CDC website states the Outpatient Influenza-like-illness Surveillance Network (ILINet) consists of about 2,400 healthcare providers in 50 states reporting approximately 16 million patient

4 visits each year. Each provider reports data to CDC on the total number of patients seen, and the number of those patients with influenza-like-illness (ILI) by age group. For this system, ILI is defined as fever (temperature of 100F [37.8C] or greater) and a cough and/or a sore throat in the absence of a known cause other than influenza [2]. The CDC ILINet surveillance and FC-post per week data are plotted in Figure 1. CDC influenza-like-illness symptoms per visit at sentinel US healthcare providers labels the primary Y-axis. The secondary Y-axis marks the FC-post per week frequency normalized by the corresponding blog-world week post count. Correlation between the two data series measured with a Pearson correlation coefficient, r. Evidence to support our hypothesis (a correlation exists between CDC ILINet reports and Web and social media mined FC-post frequency) is the Pearson's correlation coefficient evaluated between the two data series. The Pearson correlation evaluates to unity if the two data series are exactly matching, r=1. If no correlation exists between the data series, the Pearson correlation evaluates to zero, r=0. In our analysis, the 20 ILI and FC-post data points correlate strongly with a high Pearson correlation, r=0.626, and the correlation is significant with 95% confidence. Fig. 1: CDC ILINet vs Normalized FC-post Frequency per Week. Each FC-posts per week data point is normalized by the corresponding blog-world posts per week count. Pearson correlation = 0.626, 95% confidence. 4 Future Work Once FC-posts have been extracted, one can further monitor influenza outbreaks by evaluating the perspective of blog authors. Bloggers having a direct knowledge of influenza infection are more valuable to disease surveillance than those who author objective or opinion items. Bloggers who persistently author FC-posts are less

likely to be infected with influenza and are more likely to be writing about avian influenza (bird flu). The following post excerpts demonstrate influenza-content author perspective. Self Identified: ``What began as an irritating cold became what I think might be the flu last night. I woke up in bed around three this morning with sore muscles, congested lungs/nose and chills running throughout my body.'' Secondhand: ``According to ESPN.com, Ravens quarterback Troy Smith has lost 'a considerable amount of weight' while being hospitalized with tonsillitis and flu-like symptoms. Smith and veteran Kyle Boller likely won't play in Sunday's season opener, leaving the workload to rookie Joe Flacco and Joey Harrington, who was signed Monday.'' Objective (or opinion): ``Domesticated birds may become infected with avian influenza virus through direct contact with infected waterfowl or other infected poultry, or through contact with surfaces or materials like that of water or feed that have been contaminated with the virus.'' Identifying the perspective of influenza keyword posts facilitates determining its contribution to disease surveillance, three author perspectives are identified. A FC-post is either a self-identification of having ILI symptoms, secondhand (or by proxy) of another individual having ILI or the post is an opinion or objective article containing ILI keywords. Secondhand knowledge can be writing about a friend, schoolmate, family-member or co-worker but a blogger could also post details on famous individual such as a sports player. The season opening of American football coincides with the data and many FC-posts identify athletes who are unable to play because of an ILI. Automatic classification of the influenza post author's perspective is ongoing research. 5 5 Conclusion Web and social media provide a novel disease surveillance resource. We presented a method which evaluates blog posts containing keywords influenza or flu and the results from analysis show strong co-occurrence with the US 2008-2009 flu season. This paper's key finding is that from 5-October 2008 to 31-January 2009 a high correlation exists between the frequency of posts, containing influenza keywords, per week and Centers for Disease Control and Prevention influenza-likeillness surveillance data.

6 Acknowledgments We would like to thank the National Science Foundation (NSF) for partial support under grant NSF IIS-0505819 and the Technosocial Predictive Analytics Initiative, part of the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory (PNNL). PNNL is operated by Battelle for DOE under contract DE-ACO5-76RLO 1830. The contents of this publication are the responsibility of the authors and do not necessarily represent the official views of the NSF. References 1. Bailey, N.: The Mathematical Theory of Epidemics. Griffin (1957) 2. CDC-Website: Influenza surveillance reports. Website (accessed 25 July 2009). http://www.cdc.gov/flu/weekly/fluactivity.htm 3. Eysenbach, G.: Infodemiology: tracking flu-related searches on the web for syndromic surveillance. AMIA Annual Symposium proceedings pp. 244 8 (2006) 4. Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L.: Detecting influenza epidemics using search engine query data. Nature 457(7232), 1012 4 (2009). DOI 10.1038/nature07634 5. Heymann, D., Rodier, G.: Global surveillance, national surveillance, and sars. Emerging Infectious Diseases 10(2) (2004) 6. Hulth, A., Rydevik, G., Linde, A., Montgomery, J.: Web queries as a source for syndromic surveillance. PLoS ONE 4(2), e4378 (2009) 7. Johnson, H.A.,Wagner, M.M., Hogan,W.R., Chapman,W., Olszewski, R.T., Dowling, J., Barnas, Analysis of web access logs for surveillance of influenza. Studies in health technology and informatics 107(Pt 2), 1202 6 (2004) 8. Miller, P.: pympi an introduction to parallel python using mpi. Livermore National Laboratories (2002). URL https://computing.llnl.gov/code/pdf/pympi.pdf 9. Polgreen, P.M., Chen, Y., Pennock, D.M., Nelson, F.D.: Using internet searches for influenza surveillance. Clin Infect Dis 47(11), 1443 8 (2008). DOI 10.1086/593098 10. Rossum, G.V., Drake, F.: Python language reference. Network Theory Ltd (2003). URL http://www.altaway.com/resources/python/reference.pdf 11. Yih, W., Teates, K., Abrams, A., Kleinman, K., Kulldorff, M., Pinner, R., Harmon, R., Wang, S., Platt, R., Montgomery, J.: Telephone triage service data for detection of influenza-like illness. PLoS ONE 4(4), e5260 (2009)