How To Predict On The Web For HFMD




Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Predicting Epidemic Tendency through Search Behavior Analysis

Danqing Xu, Yiqun Liu, Min Zhang, Shaoping Ma, Anqi Cui, Liyun Ru
State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
xudanqing06@gmail.com

Abstract

The possibility that influenza activity can be detected through search log analysis has been explored in recent years. However, previous studies have mainly focused on influenza, and little attention has been paid to other epidemics. With an analysis of web user behavior data, we consider the problem of predicting the tendency of hand-foot-and-mouth disease (HFMD) (footnote 1), whose outbreak in 2010 resulted in a great panic in China. In addition to search queries, we consider users' interactions with search engines. Given the collected search logs, we cluster HFMD-related search queries, medical pages and news reports into the following sets: epidemic-related queries (ERQs), epidemic-related pages (ERPs) and epidemic-related news (ERNs). Furthermore, we count their frequencies as different features, and we conduct a regression analysis against current HFMD occurrences. The experimental results show that these features perform well in both accuracy and timeliness.

1 Introduction

Seasonal epidemics pose a tremendous threat to public health. The panic caused by both Influenza A (H1N1) and Severe Acute Respiratory Syndrome (SARS) has had a terrible impact on both economic and social development worldwide. With the rapid development of the Internet, search engines have become an important gateway to obtaining information. Consequently, many social events, including the spread of epidemics, can be traced from users' search logs. The Web provides abundant medical resources for users. Approximately 80% of consumers turn first to the Internet when confronted with health problems [Fox, 2006].
* Supported by Natural Science Foundation (60736044, 60903107) and Research Fund for the Doctoral Program of Higher Education of China (20090002120005)
1 http://en.wikipedia.org/wiki/hfmd

The idea that epidemic tendency can be detected through Web information has been explored in recent years [Ginsberg et al., 2009; Heather et al., 2004; Hulth et al., 2009; Pelat et al., 2009; Philip et al., 2008]. These studies show that the frequency of online search queries is strongly correlated with epidemic events. This correlation makes it possible to detect epidemic outbreaks in areas with a large population of web users. However, most studies focus on influenza, and their results have not been applied to other diseases.

In recent years, HFMD has become a serious disease for Chinese infants from one to three years old. A large outbreak of HFMD in 2010 resulted in a great panic in China. At the peak of the epidemic, tens of thousands of children were infected every week. The cycle of this infectious disease is brief and can be observed. No suitable HFMD vaccine has been developed so far, and infants may die without timely treatment. Thus, studies of HFMD have become indispensable for epidemic research [Dong and Sun, 2008; Wang et al., 2009]. In an attempt to provide early detection, [Wang et al., 2009] establish a seasonal HFMD trend model to predict future HFMD trends. This approach may have some viability, but a sudden outbreak cannot be predicted in a timely manner through this model.

In this paper, we track HFMD activity by analyzing huge volumes of search logs. Generally speaking, most online health information seekers first take actions such as self-diagnosis and self-medication when they show slight symptoms [Fox, 2006]. If the symptoms get worse, they may again search online for specific symptoms or diseases and decide whether or not to see a doctor. For example, when a child with HFMD begins to have a slight cough, his/her parents may not know the specific disease and will merely search for this symptom online.
With more severe symptoms, the specific disease can be recognized, and more detailed information will again be sought from search engines. Finally, the child is diagnosed as an HFMD case. Through this whole process, we find that there is a considerable lag between searching and reporting, and that users' search logs may contain related information much earlier. Therefore, it is possible to detect or even predict an epidemic tendency on the basis of information-seeking behaviors.

Previous studies have discovered that the frequencies of search queries can not only estimate current influenza activity [Ginsberg et al., 2009] but also predict an increase of influenza in advance [Philip et al., 2008]. These search queries come from patients and non-patients, and different user interactions may represent different user needs. If a user clicks related medical pages, we think that he/she has a higher probability of being infected than those who do not click any medical pages. In addition, some news websites provide alerts of important infectious events and outbreaks during the HFMD season. These news reports may influence web users and prompt some of them to submit related queries to search engines. [Fox, 2006] reveals that approximately 7% of respondents without signs of the disease follow news stories about epidemics every day. However, little attention is paid to these situations. If the frequency of ERQs alone is considered, some situations may be misidentified or not covered.

In addition to the ERQs proposed in previous studies [Ginsberg et al., 2009; Heather et al., 2004; Hulth et al., 2009; Pelat et al., 2009; Philip et al., 2008], this paper introduces two new features: epidemic-related pages (ERPs) and epidemic-related news (ERNs). The ERPs feature records the frequency of medical pages being visited, while the ERNs feature represents the popularity of epidemic-related news stories. On the basis of these features, this paper provides a solution to predict the number of future HFMD cases one week in advance of when the outbreak occurs.

The main contributions of this paper are as follows:
1. In addition to search queries, we bring user interaction into the analysis. Besides ERQs, this paper introduces two new features, ERPs and ERNs, so that other search situations are covered;
2. This is the first time that online search surveillance has been developed for HFMD. Previous works mainly focus on influenza. Given search logs, our results show that web surveillance is also suitable for HFMD.

The rest of this paper is organized as follows: After a description of the related work in the next section, we introduce the user behavior data sets and details of the three proposed features in Section 3.
In Section 4, different features are combined on the basis of linear models, and the performance of each model is evaluated. We discuss and conclude this work in Sections 5 and 6.

2 Related Work

With the widespread use of the Internet, many patients tend to obtain information online when they encounter medical symptoms. Fox [Fox, 2006] shows that approximately 80% of American adults search online for medical information about specific diseases or symptoms every year. This survey also shows that 7% of users search for health information or follow epidemic news stories on a typical day. On average, 66% of seekers begin their online search from a search engine, while 27% begin at a health-related website. The Web can help users better understand their health status within a short amount of time. This fact makes it possible to utilize search logs to detect or predict epidemic activity in a timely manner.

Ginsberg et al. use Google search logs in the United States to estimate influenza activity in each state [Ginsberg et al., 2009]. They propose a method of analyzing a large number of Google search queries to track the trend of influenza-like illness (ILI). They select influenza-like illness related queries (IRQs or ERQs) and draw a comparison between the influenza activity and the total number of influenza-related queries submitted in some areas during the outbreak period. Their results suggest a high correlation between ILI-related queries and officially released ILI occurrences. Philip et al. also examine the relationship between influenza-related searches and actual influenza cases on the basis of Yahoo! search logs [Philip et al., 2008]. They find that the frequency of influenza-like symptom searches can not only detect influenza activity but also predict an increase in influenza mortality in advance. All of their results show that the selected ERQs strongly correlate with the current level of influenza.

In addition, [White and Horvitz, 2009] survey 515 individuals' online health-related searches, and they perform a log-based study of how people search for related medical information online.
Their result shows that there may be some escalation of medical concern in user search logs, where queries about serious illness commonly follow initial queries about common symptoms. This result suggests that the Web has the potential to increase search engine users' anxieties, leading to cyberchondria. Thus, if the frequency of ERQs alone is counted, errors may be introduced in the estimation. If user interaction is taken into consideration, this error may be avoided.

In addition to online queries, Heather et al. study the relationship between actual influenza occurrences and the number of health-related website visits [Heather et al., 2004]. They discover that the frequency of influenza-related article accesses is strongly positively correlated with the CDC's traditional surveillance data.

3 Dataset and Features

3.1 Dataset

HFMD typically occurs in small epidemics in kindergartens during the spring and summer months. During this period, the Chinese Center for Disease Control and Prevention (CDC) reports the total number of HFMD-infected cases every week to monitor HFMD occurrences. These cases are collected weekly from clinical laboratories and hospitals in 31 provinces. Taking data integrity and continuity into consideration, HFMD data from February 2009 to September 2010 (two HFMD seasons) were selected as our experimental data. These data are available at http://www.chinacdc.cn/. In addition, with the help of a widely-used Chinese commercial search engine, anonymous search logs were gathered for the same period. Approximately 70 million search entries from 12 million users are collected every day. In consideration of user privacy, only queries submitted and URLs clicked by users were extracted in our experiment.

3.2 Features

The Internet provides an abundance of resources, and it determines how people search for information. Typically, users submit queries to search engines and click the corresponding items according to their needs. Both queries and click interactions are treated as representations of users' intents. If a user only submits a query but does not click any result, we assume that this kind of behavior is not as meaningful as a query followed by a clicked result. In addition, an increase in search queries can be caused by important news reports. Previous studies [Ginsberg et al., 2009; Heather et al., 2004; Hulth et al., 2009; Pelat et al., 2009; Philip et al., 2008] have mainly adopted the frequency of epidemic-related search queries to track the current epidemic activity. Thus, the above situations cannot be well distinguished from each other. In addition to search queries, we introduce both click interactions and the influence of public news. In our experiment, the number of HFMD-related medical articles (pages) being clicked is an important basis for judging the current level of HFMD activity. At the same time, we select the number of related web news reports to represent the popularity of HFMD in public media.

Epidemic-related queries (ERQs). A total of 66% of health seekers submit a health inquiry to a search engine for online information [Fox, 2006]. Hence, queries can be treated as an important resource for web mining and can play key roles in representing users' search intent. On the basis of this idea, Ginsberg et al. find that the frequencies of certain influenza-related queries are highly correlated with the number of influenza cases [Ginsberg et al., 2009]. In addition, Philip et al. discover that the total counts of some queries about influenza-like symptoms can predict an increase of influenza activity [Philip et al., 2008]. Following previous work, we select ERQs as the first feature, and we quantify them by counting their total frequency during a certain period.
In our experiment, the ERQs are the set of HFMD-related queries, and they are obtained by query clustering [Baeza-Yates and Tiberi, 2007; Beeferman and Berger, 2000; Chan et al., 2004; Wen et al., 2001].

Epidemic-related medical pages (ERPs). The Internet provides abundant information, and users click different results according to their needs. [Fox, 2006] shows that 27% of health information seekers obtain online health information through websites. Medical websites are an important source of information for health seekers. If the number of visits to HFMD medical websites or pages increases, an outbreak of HFMD may occur during this period. In this paper, we collect HFMD-related medical articles from popular medical websites, and we gather them into the set ERPs. The frequencies of these pages being clicked are summed to quantify this feature.

Epidemic-related web news (ERNs). In addition to medical websites, news websites can provide health information. [Fox, 2006] reports that the most recent health news can affect the behavior of 53% of health seekers, and 7% of people follow health-related news every day. Thus, many searchers may be guided by public media. An increase of public concern may bring about a sharp rise in the frequency of ERQs and ERPs, but this increase may not mean that the number of infected patients increases. Similarly to ERPs, HFMD-related news stories are extracted from certain popular news websites, and the total count of these reports is the quantification of ERNs.

Given collected search logs, queries are clustered into different categories. To some extent, queries can be regarded as a kind of description for the clicked documents. These pages are also representations of the corresponding queries. We can assume that queries connected to a common page have a similar topic, and the contents of these web pages can be an expression of this topic. A click-through graph is constructed from the collected search logs. A simple click-through graph is shown in Figure 1.
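The click-through graph, and the seed-based expansion over it that the query-clustering steps describe, can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not define how the weight between a query and a page is computed, so the sketch uses the raw click count as the weight, and the names `build_click_graph`, `expand_seed`, `threshold` and `max_iterations` are hypothetical. The filtering of high-frequency terms and the manual cleanup are not shown.

```python
from collections import defaultdict


def build_click_graph(log_entries):
    """Build a bipartite click-through graph from (query, clicked_url) pairs.

    Assumption: the edge weight is the number of clicks linking a query
    and a page. The paper does not specify its weight function.
    """
    query_to_pages = defaultdict(lambda: defaultdict(float))
    page_to_queries = defaultdict(lambda: defaultdict(float))
    for query, url in log_entries:
        query_to_pages[query][url] += 1.0
        page_to_queries[url][query] += 1.0
    return query_to_pages, page_to_queries


def expand_seed(log_entries, seed_query, threshold=1.0, max_iterations=10):
    """Grow a query set Q and a page set P starting from one seed query."""
    query_to_pages, page_to_queries = build_click_graph(log_entries)
    queries, pages = {seed_query}, set()
    for _ in range(max_iterations):
        size_before = len(queries)
        # Queries -> pages: keep pages whose edge weight meets the threshold.
        for query in list(queries):
            for url, weight in query_to_pages[query].items():
                if weight >= threshold:
                    pages.add(url)
        # Pages -> queries: keep queries whose edge weight meets the threshold.
        for url in list(pages):
            for query, weight in page_to_queries[url].items():
                if weight >= threshold:
                    queries.add(query)
        # Stop once the query set no longer grows.
        if len(queries) == size_before:
            break
    return queries, pages
```

With a toy log in which "hfmd" and "hfmd symptoms" share a clicked page while "hay fever" leads elsewhere, the expansion from the seed "hfmd" collects the first two queries and leaves "hay fever" out, since no page connects it to the seed.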
Figure 1: An example of a click-through graph

The query-clustering steps are listed as follows:
Step 1: Define the initial sets Q and P; Q stands for the ERQ set, and P represents the set of connected pages. Q = {initial ERQ}, P = ∅.
Step 2: For any query in Q, calculate the weight between this query and its adjacent web pages. If the weight is no less than the fixed threshold (here the threshold is set to 1.0), add this page to P.
Step 3: For any page in P, calculate the weight between this page and every query connected to it. If the weight is no less than the fixed threshold, add this query to Q.
Step 4: Go to Step 2 until the size of Q remains unchanged or the number of iterations reaches a given value.

Typically, queries and web pages of high frequency have a broad topic and link different topics together. Therefore, to avoid a bad effect on the weight calculation, we filter the high-frequency terms at the very start of our experiment. Here, we select the medical name of the HFMD disease as the initial ERQ. Thus, after the above iterative process, we obtain numerous HFMD-related queries. Some search queries such as "hay fever" may coincide with HFMD seasons but have no relation to HFMD. To make our study more accurate, these queries in the ERQs are manually filtered. Finally, the size of the ERQ set is 66 (the size is 45 in [Ginsberg et al., 2009]). In addition, HFMD medical articles and news stories were gathered from medical sites and news sites, respectively. These two kinds of pages constitute the ERP and ERN sets. These features are quantified by counting the frequencies of their respective sets. Examples of ERQs, ERPs and ERNs are illustrated in Table 1.

Table 1: Examples of ERQs, ERPs and ERNs

Item                                                   Type
Sore throat                                            ERQ
Fever                                                  ERQ
Headache                                               ERQ
HFMD                                                   ERQ
http://www.qqbaobao.com/s/shouzukoubng                 ERP
http://jbk.39.net/kesh/erke/6adc9.html                 ERP
http://news.hnjkw.net/hyzx/jbyw/2011/031632815.html    ERN

In our experiment, we adopt the correlation coefficient as the evaluation indicator, as is usual in previous studies. The correlation coefficient $\rho_{A,B}$ between two vectors A and B is calculated as follows:

$$\rho_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \, \sigma_B} = \frac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_i (a_i - \bar{a})^2 \cdot \sum_i (b_i - \bar{b})^2}} \qquad (1)$$

where $\bar{a}$ and $\bar{b}$ represent the average values of vectors A and B, respectively. All three features and the actual HFMD occurrences are normalized by dividing by their own maximum values. A comparison between the normalized ratios of these features is illustrated in Table 2.

Table 2: Coefficient analysis over different time lags

Lag (weeks)    ERQs     ERPs     ERNs
0              0.725    0.707    0.415
1              0.763    0.765    0.509
2              0.667    0.667    0.561
3              0.658    0.650    0.606
4              0.491    0.473    0.595
5              0.251    0.243    0.583

In Table 2, the correlation between actual HFMD occurrences and the total frequency of ERQs is 0.725 at lag 0. This value shows that the search frequency of HFMD-related queries can generally reflect current HFMD activity. However, this value is smaller than the result (0.91) in [Ginsberg et al., 2009], which is mostly ascribed to the following two aspects:
1. The methods of selecting the ERQ set are different. We automatically select our ERQs on the basis of click-through data, while they use linear regression to select the top 45 queries;
2. Our study target is HFMD, while they focus on influenza. These two diseases may differ from each other.

In addition to ERQs, we also study the relationship between the number of HFMD cases and ERPs and ERNs. As in [Heather et al., 2004], the total number of visits to HFMD-related medical pages is strongly correlated with HFMD morbidity, and this correlation value reaches 0.707.
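The lag analysis behind Table 2 can be illustrated with a small sketch: a weekly feature series is shifted by 0 to 5 weeks and correlated with the weekly case counts using the Pearson coefficient of Equation 1. The function names and the toy series are assumptions for illustration, not the authors' code or data. Note that dividing a series by its maximum does not change the Pearson coefficient, so the normalization matters for plotting and for the later regression rather than for this table.

```python
from math import sqrt


def normalize_by_max(series):
    """Scale a series by its own maximum value."""
    peak = max(series)
    return [value / peak for value in series]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def lagged_correlation(feature, cases, lag):
    """Correlate the feature at week t - lag with the cases at week t."""
    if lag == 0:
        return pearson(feature, cases)
    return pearson(feature[:-lag], cases[lag:])


def best_lag(feature, cases, max_lag=5):
    """Return the lag with the highest correlation and all per-lag scores."""
    scores = {lag: lagged_correlation(feature, cases, lag)
              for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores


# Toy series: the case curve repeats the feature curve one week later.
feature = [1, 2, 5, 9, 4, 2, 1, 1, 1, 1]
cases = [1, 1, 2, 5, 9, 4, 2, 1, 1, 1]
lag, scores = best_lag(feature, cases)
print(lag)
```

On this toy input the selected lag is one week, because the case series is an exact one-week-delayed copy of the feature series.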
The total frequencies of ERQs and ERPs have a similar tendency over time, and their notable peaks precede the peak of actual HFMD activity. Unlike ERQs and ERPs, the number of reported news stories is not completely positively correlated with occurrences. The peak of ERNs is much earlier than the peak of HFMD activity. The ERQ and ERP features obtain the best fit with a one-week preceding lag, while the coefficient between the ERNs and actual HFMD occurrences reaches its maximum value with a three-week lag. These lag values provide an important basis for the following predictive models.

From the medical point of view, the incubation period of HFMD is commonly a week, that is, the transitional period from showing early slight symptoms to getting sick is a week. This fact may provide a reasonable explanation for the lags of ERQs and ERPs. If a user shows slight symptoms, that user may submit one or more ERQs, visit medical treatment pages (ERPs) and be confirmed as a case after a week. For ERQs, ERPs and ERNs, we think their frequencies are approximately linear with the actual occurrences. With the time lags taken into consideration, a simple log-odds linear model (adopted in [Ginsberg et al., 2009]) is established as follows:

$$\operatorname{logit}(occ_t) = \beta_1 \operatorname{logit}(ERQs_{t-1}) + \beta_2 \operatorname{logit}(ERPs_{t-1}) + \beta_3 \operatorname{logit}(ERNs_{t-3}) + \varepsilon \qquad (2)$$

where $\beta_1$, $\beta_2$, $\beta_3$ are multiplicative coefficients, $\varepsilon$ is the error term, and $\operatorname{logit}(x)$ is the natural log of $x/(1-x)$.

4 Experiments and Analysis

Let us consider the following two circumstances when an epidemic breaks out:
1. A user shows slight symptoms and submits ERQs for medical treatment;
2. Another user without symptoms also submits ERQs in response to recent epidemic news.

ERQs cannot distinguish these two situations from each other. The number of ERPs will increase in the first situation, while the second situation will merely lead to increasing ERNs. We think these problems will be better solved by introducing both ERPs and ERNs. To study their roles in HFMD prediction, these features are introduced one at a time in our experiment.
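A minimal sketch of fitting the log-odds model of Equation 2 by ordinary least squares is given below, using only the Python standard library. Several things here are assumptions rather than details from the paper: the fitted models include a constant term (as the reported models do), all inputs are assumed to lie strictly between 0 and 1 (a series normalized by its maximum contains the value 1, for which the logit is undefined, so it would need rescaling or clipping first), and the helper names are hypothetical.

```python
from math import exp, log


def logit(x):
    """Natural log of x / (1 - x); x must lie strictly between 0 and 1."""
    return log(x / (1.0 - x))


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_logit_model(cases, features, lags):
    """Fit logit(cases[t]) on lagged logit features plus an intercept.

    `features` is a list of weekly series and `lags` the matching lags in
    weeks. Returns [beta_1, ..., beta_k, intercept].
    """
    start = max(lags)
    rows, targets = [], []
    for t in range(start, len(cases)):
        rows.append([logit(f[t - lag]) for f, lag in zip(features, lags)] + [1.0])
        targets.append(logit(cases[t]))
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_logit_model(coefficients, feature_values):
    """Map lagged feature values (each in (0, 1)) to a prediction in (0, 1)."""
    z = coefficients[-1] + sum(
        c * logit(v) for c, v in zip(coefficients, feature_values))
    return 1.0 / (1.0 + exp(-z))
```

Calling `fit_logit_model(cases, [erqs, erps, erns], [1, 1, 3])` would correspond to the three-feature form of Equation 2, where `erqs`, `erps` and `erns` stand for hypothetical weekly series scaled into the open interval (0, 1).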
A simple multivariable logistic linear regression [Christensen, 1997] is selected. We collect search logs for two HFMD seasons, and we establish different linear models on the data of 2009 (the training set). The final model is validated on the unseen data of 2010. To make full use of the existing data, we adopt the following learning algorithm: on the basis of the training set, we learn an initial model for the first iteration on the test set. During each iteration on the test set, the previous weeks are also added to the training set and the model is relearned. The different models are illustrated in Table 3.

Table 3: Results of different models on the training and test sets

Model                                                                                                          Training set (2009)   Test set (2010)
1: logit(occ_t) = 0.9297 * logit(ERQs_{t-1}) - 0.9090                                                          0.763                 0.735
2: logit(occ_t) = 0.9536 * logit(ERPs_{t-1}) - 0.1198                                                          0.746                 0.729
3: logit(occ_t) = 0.5137 * logit(ERNs_{t-3}) - 0.3408                                                          0.612                 0.595
4: logit(occ_t) = 0.3157 * logit(ERQs_{t-1}) + 0.6582 * logit(ERPs_{t-1}) - 0.3624                             0.784                 0.813
5: logit(occ_t) = 0.8478 * logit(ERQs_{t-1}) + 0.1616 * logit(ERNs_{t-3}) - 0.7152                             0.802                 0.824
6: logit(occ_t) = 0.2042 * logit(ERQs_{t-1}) + 0.6830 * logit(ERPs_{t-1}) - 0.0739 * logit(ERNs_{t-3}) - 0.1326  0.836                 0.891

In Equation 2, we first fit the relationship between the actual HFMD occurrences and ERQs, ERPs and ERNs using the unary linear regression method. The correlation value of Model 1 is the largest for all single features, and the ERNs feature has the worst coefficient (0.595). As a single feature, the ERPs or ERQs feature can not only detect epidemic occurrences but also predict future tendency. However, the ERNs feature is not a good single feature. Hence, we need to combine these features using binary linear regression to improve these single models. At the peak of HFMD outbreaks, many healthy people are likely to follow HFMD-related reports and search for HFMD-related queries. These users are misidentified as patients in Model 1. In addition, some patients with the disease begin their search at a website, and these users are not identified as patients in the single model. To address these situations, we add the other selected features, ERPs and ERNs, to the single models. ERPs_{t-1} represents the click frequencies of epidemic-related medical pages during week t-1, and ERNs_{t-3} is the number of news items during week t-3. The correlation values between empirical occurrences and these fitted results are calculated. The comparison between the baseline and the binary-feature models is illustrated in Table 3. For Model 4, the correlation between the actual occurrences and the model combining both the ERQs and ERPs features reaches 0.813, and its performance is better than the models using only the ERQs or ERPs feature.
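The learning procedure described above, in which each already-observed test week is folded back into the training data before the next prediction, can be sketched as an expanding-window loop. This is a simplified illustration with a single lagged feature and a closed-form least-squares line. The function names are hypothetical, and the inputs are assumed to lie strictly between 0 and 1 so that the logit is defined.

```python
from math import exp, log


def logit(x):
    """Natural log of x / (1 - x) for x strictly between 0 and 1."""
    return log(x / (1.0 - x))


def inverse_logit(z):
    """Map a log-odds value back to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expanding_window_predictions(feature, cases, lag, n_train):
    """Predict cases[t] for every t >= n_train, refitting before each step.

    Weeks 0 .. n_train - 1 play the role of the initial training season.
    Each later week is predicted by a model fitted on all weeks before it,
    so earlier test weeks join the training data as the loop advances.
    n_train must be at least lag + 2 so that the first fit has two points.
    """
    predictions = []
    for t in range(n_train, len(cases)):
        xs = [logit(feature[s - lag]) for s in range(lag, t)]
        ys = [logit(cases[s]) for s in range(lag, t)]
        slope, intercept = fit_line(xs, ys)
        predictions.append(inverse_logit(slope * logit(feature[t - lag]) + intercept))
    return predictions
```

The predictions returned for the test weeks can then be compared with the reported case counts, for example through the correlation coefficient used throughout the evaluation.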
This fact indicates that the ERPs feature usefully assists prediction, improving on the method that uses only the ERQs feature. Next, we explain why the ERPs feature improves predictive performance. According to [Fox, 2006], approximately 27% of people seek health-related information from a health-related website. Some healthy people submit HFMD-related queries but do not click any HFMD-related medical articles. They may have other needs. The ERPs feature only records the frequency of medical articles being clicked. If the ERPs feature is added to the predictive model, this situation can be filtered out and the prediction will become closer to actual HFMD occurrences, which is why the ERPs are introduced.

Next, we use the ternary linear regression method to further refine the models. To some extent, high ERQs and ERPs do not mean large numbers of patients; they may also be caused by high ERNs. The frequencies of the ERQs come from both patients and non-patients. When an HFMD epidemic breaks out, epidemic-related news stories also increase, and many healthy people may follow these stories. In Model 6, the coefficient of ERNs is negative, meaning that it can partly separate news-driven queries from latent patients. Model 6 is the best of all models, and we select it as our final predictive model. Figure 2 shows our predictive result at four time points through the 2010 HFMD season. A notable increase is predicted during April, and the peak is reached on May 24. Next, the occurrences begin to decline during the following period. All of these results are later validated by official CDC data.

Figure 2: Predictive effect of our final model (black: predicted values, gray: actual values)

To monitor the predictive effect in a timely manner, we develop a simple HFMD monitoring system based on our predictive model. Admittedly, our system still has some drawbacks. This system is not designed to be a replacement for traditional monitoring networks.
We hope our system will be useful for future epidemic research and will enable public health officials to be well prepared for epidemic emergencies. In the future, we will continue to improve our method and system.

5 Discussion

Some healthy people follow epidemic news by submitting epidemic-related queries (ERQs), while some patients search for information from medical websites. Consequently, ERQs cannot draw a complete picture of epidemic patients. Therefore, we propose two novel features to separate the healthy users from the patients. Unlike previous studies, this paper makes an attempt to predict the future numbers of epidemic cases. From the comparison between different models, we can see that the ternary-feature model has the best predictive effect, indicating that these new features play important roles in the prediction of an epidemic.

However, our method has some limitations:
1. Our predictive model can only estimate the tendency of future epidemic cases (rise or drop), and concrete occurrence rates are not accurately calculated;
2. User reliability is not taken into consideration in our experiment. Hence, some noisy data may have a negative effect on the final result.

This paper develops web resource surveillance for HFMD for the first time, and it obtains a better effect on both accuracy and timeliness. This method of web mining is not limited to influenza and HFMD, and it can also be used to predict other infectious diseases. We expect that our features and methods may provide helpful information to health officials for seasonal epidemics.

6 Conclusion

Previous studies show that influenza activity can be traced from web query logs. On the basis of these earlier works, this paper develops a similar approach to HFMD prediction. In this paper, two novel features, ERPs and ERNs, are introduced to separate the healthy from the latent patients. Furthermore, we conduct a log-based study, and we obtain HFMD-related queries, pages and news. To validate the effect of these features, we conduct a systematic comparison between these features, and the experimental result indicates that both the ERPs and ERNs play key roles in predicting epidemic activities.
Finally, we develop an online prediction system, and the monitored results show that our model is fairly effective at predicting future epidemic tendency. We hope that our predictive method will be helpful for future epidemic research and will help provide people with earlier alerts, giving them enough time to prepare for an imminent epidemic outbreak by taking certain measures, such as obtaining vaccines.

References

[Baeza-Yates and Tiberi, 2007] Ricardo Baeza-Yates and Alessandro Tiberi. Extracting semantic relations from query logs. In Proceedings of the 13th International Conference on Knowledge Discovery and Data Mining, pages 76-85, San Jose, California, August 12-15, 2007.
[Beeferman and Berger, 2000] Doug Beeferman and Adam Berger. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 407-416, Boston, MA, USA, August 20-23, 2000.
[Chan et al., 2004] Wing Shun Chan, Wai Ting Leung, and Dik Lun Lee. Clustering search engine query log containing noisy clickthroughs. In Proceedings of the 2004 International Symposium on Applications and the Internet, pages 305-308, Tokyo, Japan, January 26-30, 2004.
[Christensen, 1997] Ronald Christensen. Log-Linear Models and Logistic Regression, Second Edition. Springer-Verlag, 1997.
[Dong and Sun, 2008] Zhaohua Dong and Ping Sun. Clinical analysis of 325 children with hand-foot-mouth disease in 2005 and 2007. Journal of Clinical Pediatrics, pages 470-472, 2008.
[Fox, 2006] Susannah Fox. Online Health Search 2006. Pew Internet and American Life Project, 2006.
[Ginsberg et al., 2009] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457: 1012-1014, 2009.
[Heather et al., 2004] Heather A. Johnson, Michael M. Wagner, William R. Hogan, Wendy Chapman, Robert T. Olszewski, John Dowling, and Gary Barnas. Analysis of web access logs for surveillance of influenza. Stud Health Technol Inform, 107: 1202-1207, 2004.
[Hulth et al., 2009] Anette Hulth, Gustaf Rydevik, and Annika Linde. Web queries as a source for syndromic surveillance. PLoS ONE, 4(2): e4378, 2009.
[Pelat et al., 2009] Camille Pelat, Clement Turbelin, Avner Bar-Hen, Antoine Flahault, and Alain-Jacques Valleron. More diseases tracked by using Google Trends. Emerging Infectious Diseases, 15(8): 1327-1328, 2009.
[Philip et al., 2008] Philip M. Polgreen, Yiling Chen, David M. Pennock, Forrest D. Nelson, and Robert A. Weinstein. Using internet searches for influenza surveillance. Clinical Infectious Diseases, 47(11): 1443-1448, 2008.
[Wang et al., 2009] Ruiping Wang, Yali Chun, and Yiling Wu. Predicting hand-foot-mouth disease condition in Songjiang District of Shanghai City based on seasonal trend model. China Preventive Medicine, 10(11): 1025-1028, 2009.
[Wen et al., 2001] Ji-Rong Wen, Jian-Yun Nie, and Hong-Jiang Zhang. Clustering user queries of a search engine. In Proceedings of the 10th International World Wide Web Conference, pages 162-168, Hong Kong, May 1-5, 2001.
[White and Horvitz, 2009] Ryen White and Eric Horvitz. Cyberchondria: studies of the escalation of medical concerns in web search. ACM Transactions on Information Systems, 23(4): 770-812, November, 2009.