Monitoring Frequency of Change By Li Qin




Abstract

Control charts are widely used in process monitoring problems. This paper gives a brief review of control charts for monitoring a proportion and some initial ideas for using them to monitor the change frequency of a web page, whose estimator can be expressed as a function of a proportion. Experiments and/or simulation should be done to compare the performance of these control charts.

1. Introduction

Control charts such as the Shewhart p-chart, various CUSUM charts, and the SPRT chart have been widely used in process monitoring problems such as quality control in manufacturing. Either all items are inspected, or samples are taken from the process when 100% inspection is economically or practically impossible. Items are classified as defective or nondefective (nonconforming or conforming) based on the test result. Applications usually focus on a change (usually an increase) in the proportion of defective items.

To estimate the change frequency of a web page, a crawler may visit the page periodically and determine whether it has changed by computing a checksum of the page at each access. Web pages have been shown to change according to a Poisson process. The frequency ratio is defined as the ratio of the change frequency to the access frequency. We can estimate the frequency ratio first and then obtain the change frequency indirectly by multiplying the frequency ratio by the access frequency. The frequency ratio can be estimated by -log(X/n) [CGMa], where n is the total number of accesses and X is the number of accesses at which the page had not changed during the checking period.

The estimated change frequency can be applied to improve the freshness of data warehouses, web caching policies, and data mining [CGMb]. One challenge for these applications is that the change frequency itself may change, so a proper method for monitoring shifts in the change frequency is needed. This paper therefore examines the possibility of using existing control charts to monitor the change frequency. As with estimation, we can monitor the change frequency indirectly by monitoring the frequency ratio.

This paper is organized as follows: Section 2 reviews how to estimate the change frequency; Section 3 introduces the control charts to be considered; Section 4 presents some initial ideas for using control charts to monitor the change frequency and some questions to be considered; Section 5 concludes the paper.

2. Estimating the Change Frequency

In most cases, when we try to estimate the change frequency of web pages, we do not have their complete change history, i.e. we do not know exactly when each web page changes or how many times it has changed between consecutive accesses. Our discussion below is therefore based on an incomplete change history.

In order to estimate the change frequency, [CGMa] traced the daily change history of 720,000 web pages from 270 sites for four months. The experiment shows that web pages change according to a Poisson process. The frequency ratio r is defined as the ratio of the change frequency λ to the access frequency f, so r = λ/f. An intuitive estimator for the frequency ratio is X/T, where X is the number of detected changes and T is the monitoring period measured in access intervals (so that T equals the number of accesses n). This estimator has been proved to be biased and not consistent, since the bias does not decrease as the sample size increases.

Because of these drawbacks, [CGMa] proposed an improved estimator, -log(X/n), where n is the total number of accesses and X is the number of accesses at which the web page had not changed. For example, if we access a web page once a day for 10 days and the page has not changed at 7 of the accesses, the estimated frequency ratio is r = -log(7/10) = 0.36. This is slightly larger than the intuitive estimate 3/10 = 0.3, since some changes may have been missed between accesses. Performance analysis shows that this estimator has smaller bias than the intuitive estimator and is more efficient and consistent.
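The estimator of Section 2 can be checked with a few lines of Python. This is a minimal sketch rather than code from [CGMa]: the function names are hypothetical, and the guard against zero unchanged accesses (where the logarithm is undefined) is an added assumption.

```python
import math


def estimate_frequency_ratio(unchanged, accesses):
    """Estimate r = -log(X/n), with X unchanged accesses out of n accesses."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    if not 0 < unchanged <= accesses:
        # Assumption: -log(X/n) is undefined at X = 0, so that case is rejected.
        raise ValueError("unchanged must satisfy 0 < unchanged <= accesses")
    return -math.log(unchanged / accesses)


def estimate_change_frequency(unchanged, accesses, access_frequency):
    """Change frequency = frequency ratio * access frequency."""
    return estimate_frequency_ratio(unchanged, accesses) * access_frequency


# The example from the text: 10 daily accesses, page unchanged at 7 of them.
r_hat = estimate_frequency_ratio(7, 10)
naive = (10 - 7) / 10  # intuitive estimator: detected changes / accesses
print(round(r_hat, 4), naive)
```

With one access per day the access frequency is 1, so the estimated change frequency per day equals the estimated frequency ratio.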

The change frequency λ can itself change in practice. We do not yet have any experiment showing how the change frequency changes. If it changes very quickly, it will be difficult and impractical to estimate λ and use it in applications. We therefore assume that the change frequency remains relatively stable for at least some period of time. What we can do is test it periodically to see whether it has changed and whether the change exceeds a predefined threshold. Since the frequency ratio r can be estimated as -log(X/n), the proportion to be monitored, $p$, is X/n. We are interested in detecting both increases and decreases in $p$. In order not to miss too many changes, the crawler should access the web page as frequently as possible. Usually, the crawler cannot access web pages more than once a day, and we are not interested in web pages that change more than once a day, so the access frequency can be chosen as one access per day.

3. Control Charts

[W997] gives a review and bibliography of control charts based on attribute data. Here, we focus on the Shewhart p-chart, the Bernoulli CUSUM chart, the binomial CUSUM chart, and the SPRT chart [RSa, RSb, RS999, RS998]. For our application, since the crawler cannot visit all web pages once a day, continuous 100% inspection is not possible. Our inspection will therefore be based on samples taken from the process.

3.1 The Shewhart p-chart

When samples of n items are taken from the process, the Shewhart p-chart plots the fraction of defective items in each sample. So, if T is the total number of defective items in a sample of size n, then T/n is plotted on the p-chart. T has a binomial distribution, assuming $p$ is constant and items are independent. If the crawler visits a web page once a day for n days and X is the number of accesses at which the web page has not changed, the proportion X/n can be plotted on the Shewhart p-chart. Here, X has a binomial distribution with parameters n and $p$, where $p$ is the probability that the web page does not change between two consecutive accesses, and $p = e^{-r}$, where r is the frequency ratio. So the result of the ith access, $X_i$, takes the value 1 with probability $p$ and the value 0 with probability $1 - p$.

3.2 Bernoulli CUSUM Chart

The Bernoulli CUSUM chart is based on the individual observations $X_1, X_2, \ldots$. In order to detect an increase in $p$, the Bernoulli CUSUM control statistic is

$$B_i = \max(0, B_{i-1}) + (X_i - \gamma), \quad i = 1, 2, \ldots$$

where $\gamma$ is the reference value. This CUSUM chart signals that there has been an increase in $p$ if $B_k \ge h$, where $h > 0$ is the control limit. For detecting a decrease in $p$, the corresponding CUSUM control statistic is

$$B_i = \min(0, B_{i-1}) + (X_i - \gamma), \quad i = 1, 2, \ldots$$

where $\gamma$ is again the reference value. It signals that there has been a decrease in $p$ if $B_k \le h$, where $h < 0$ is the control limit.

In order to get the value of $\gamma$, we have to specify an out-of-control value $p_1$ that we want to detect quickly. With $p_0$ denoting the in-control value, the constants $r_1$ and $r_2$ are defined to be

$$r_1 = -\log\frac{1 - p_1}{1 - p_0}, \qquad r_2 = \log\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}$$

Then the reference value is $\gamma = r_1 / r_2$. Usually, $p_1$ is adjusted slightly so that $\gamma$ takes the value of the reciprocal of an integer m, i.e. $\gamma = 1/m$. The control limit h is obtained by making the false alarm rate (i.e. the average number of observations/samples to signal when $p = p_0$) satisfy some predefined value.

3.3 Binomial CUSUM Chart

The binomial CUSUM chart plots a cumulative sum based on the number of defective items in each sample of n consecutive items, $T_1, T_2, \ldots$, where each $T_k$ has a binomial distribution. For detecting an increase in $p$, the binomial CUSUM control statistic is

$$S_k = \max(0, S_{k-1}) + (T_k - n\gamma), \quad k = 1, 2, \ldots$$

where $n\gamma$ is the reference value. The binomial CUSUM chart signals that there has been an increase in $p$ if $S_k \ge h$, where $h > 0$ is the control limit. For detecting a decrease in $p$, the binomial CUSUM control statistic is

$$S_k = \min(0, S_{k-1}) + (T_k - n\gamma), \quad k = 1, 2, \ldots$$

where $n\gamma$ is the reference value. The chart signals that there has been a decrease in $p$ if $S_k \le h$, where $h < 0$ is the control limit. As with the Bernoulli CUSUM chart, the control limit h for the binomial CUSUM chart is obtained by making the false alarm rate satisfy some predefined value.

3.4 SPRT Chart

In most applications, both the p-chart and the CUSUM chart take a fixed sample size of n items with a fixed sampling interval between samples. The SPRT chart uses a variable sample size that is determined dynamically. It is a sequential test of the null hypothesis $H_0: p = p_0$ against $H_1: p = p_1$. For each item, $X_i = 1$ if the ith item is defective (in our application, if the web page does not change) and $X_i = 0$ otherwise. The statistic used by the SPRT is

$$S_j = r_2 T_j - r_1 j, \quad \text{where } T_j = \sum_{i=1}^{j} X_i$$

Here, $r_1$ and $r_2$ are defined as in Section 3.2. The SPRT requires specifying two constants a and b, with b < a. The following rules are used for sampling and for deciding whether to accept or reject $H_0$:

If $b < S_j < a$, then continue sampling;
If $S_j \ge a$, then stop sampling and reject $H_0$;
If $S_j \le b$, then stop sampling and accept $H_0$.

The constants a and b are usually chosen to satisfy some predefined error probabilities (the probabilities of type I and type II errors).

3.5 Comparison of Control Charts

The performance of the above control charts can be measured by ANSS (average number of samples to signal), ANOS (average number of observations to signal), and ATS (average time to signal). Since the sample size is not fixed for the SPRT chart and ATS depends on the length of the non-inspecting period, we suggest using ANOS to compare the performance of the control charts.
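The Bernoulli CUSUM recursion of Section 3.2 and the SPRT rules of Section 3.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an implementation from [RS999] or [RS998]. The function names, the starting value B_0 = 0, the control limit h = 2.0, and the SPRT constants a = 3.0 and b = -3.0 are hypothetical choices made for the example. The values p0 = 0.5 and p1 = 0.7 are the ones used in the example of Section 4. The constants are taken as r1 = -log((1 - p1)/(1 - p0)) and r2 = log(p1(1 - p0)/(p0(1 - p1))).

```python
import math


def reference_constants(p0, p1):
    """Return (r1, r2, gamma) for in-control p0 and out-of-control p1."""
    r1 = -math.log((1 - p1) / (1 - p0))
    r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return r1, r2, r1 / r2


def bernoulli_cusum_upper(observations, gamma, h):
    """Return the 1-based index of the first signal for an increase in p.

    Returns None if the chart never signals. Assumes B_0 = 0, which the
    text does not state explicitly.
    """
    b = 0.0
    for i, x in enumerate(observations, start=1):
        b = max(0.0, b) + (x - gamma)
        if b >= h:
            return i
    return None


def sprt(observations, p0, p1, a, b):
    """Run one SPRT of H0: p = p0 against H1: p = p1.

    Returns (decision, number of observations used).
    """
    r1, r2, _ = reference_constants(p0, p1)
    t = 0  # running total T_j of the observations
    for j, x in enumerate(observations, start=1):
        t += x
        s = r2 * t - r1 * j
        if s >= a:
            return "reject H0", j
        if s <= b:
            return "accept H0", j
    return "continue", len(observations)


# Illustrative run: x = 1 means the page was unchanged at that access.
r1, r2, gamma = reference_constants(0.5, 0.7)
signal_at = bernoulli_cusum_upper([1] * 10, gamma, h=2.0)
decision = sprt([1] * 20, 0.5, 0.7, a=3.0, b=-3.0)
print(round(gamma, 4), signal_at, decision)
```

A chart for detecting a decrease in p would replace `max` with `min` and signal when the statistic falls to or below a negative control limit.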

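A corrected diffusion (CD) theory approximation to the ANOS has been developed for the CUSUM chart and the SPRT chart. For each type of control chart, the ANOS can be obtained for a range of in-control values $p_0$ and out-of-control values $p_1$.

When the Bernoulli CUSUM chart is used, the CD approximation to the ANOS when $p = p_0$ is

$$\mathrm{ANOS}(p_0) \approx \frac{e^{h^* r_2} - h^* r_2 - 1}{|r_2 p_0 - r_1|}$$

From this we can find the value of $h^*$ that gives a desired in-control ANOS (average false alarm rate). Then we use

$$h^* = h + \varepsilon(p_0) \sqrt{p_0 q_0}, \qquad q_0 = 1 - p_0$$

where $\varepsilon(p)$ can be approximated by

$$\varepsilon(p) \approx \begin{cases} 0.410 - 0.0842 \log p - 0.0391 (\log p)^3 - 0.00376 (\log p)^4 - 0.000008 (\log p)^7, & \text{if } 0.01 \le p < 0.5 \\ \frac{1}{3}\left(\sqrt{q/p} - \sqrt{p/q}\right), & \text{if } 0 < p < 0.01 \end{cases}$$

with $q = 1 - p$, to find the control limit h. Also, the CD approximation to the ANOS when $p = p_1$ is

$$\mathrm{ANOS}(p_1) \approx \frac{e^{-h^* r_2} + h^* r_2 - 1}{|r_2 p_1 - r_1|}$$

When the SPRT chart is used, let $\alpha$ be the probability of a type I error and $\beta$ be the probability of a type II error. Using

$$h^* \approx \frac{1}{r_2} \ln\frac{1 - \beta}{\alpha}, \qquad g \approx \frac{1}{r_2} \ln\frac{\beta}{1 - \alpha}$$

and $h^* = h + (1 - 2 p_0)/3$, we can find the values of h and g. Since $g = b / r_2$ and $h = a / r_2$, we can then obtain a and b. $\mathrm{ANOS}(p_0)$ and $\mathrm{ANOS}(p_1)$ can be obtained using $p_0$, $p_1$, $r_1$, $r_2$, g and h.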
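The CD approximation for the Bernoulli CUSUM chart can be turned into a small calculation. The sketch below assumes the formulas take the form ANOS(p0) ≈ (exp(h* r2) - h* r2 - 1)/|r2 p0 - r1|, ANOS(p1) ≈ (exp(-h* r2) + h* r2 - 1)/|r2 p1 - r1| and h* = h + ε(p0) sqrt(p0 (1 - p0)), with the two-branch approximation of ε(p) coded in `epsilon`. The function names, the bisection search, the target in-control ANOS of 1000, and the values p0 = 0.1 and p1 = 0.2 are hypothetical choices for illustration, picked so that p0 falls in the range covered by ε(p).

```python
import math


def epsilon(p):
    """Overshoot correction term; the two branches are assumed forms."""
    if 0.01 <= p < 0.5:
        lp = math.log(p)
        return (0.410 - 0.0842 * lp - 0.0391 * lp**3
                - 0.00376 * lp**4 - 0.000008 * lp**7)
    if 0 < p < 0.01:
        q = 1 - p
        return (math.sqrt(q / p) - math.sqrt(p / q)) / 3
    raise ValueError("epsilon(p) is only given for 0 < p < 0.5")


def anos_in_control(h_star, p0, r1, r2):
    """CD approximation to the ANOS when p = p0."""
    return (math.exp(h_star * r2) - h_star * r2 - 1) / abs(r2 * p0 - r1)


def anos_out_of_control(h_star, p1, r1, r2):
    """CD approximation to the ANOS when p = p1."""
    return (math.exp(-h_star * r2) + h_star * r2 - 1) / abs(r2 * p1 - r1)


def solve_h_star(target, p0, r1, r2, lo=0.0, hi=50.0):
    """Find h* with ANOS(p0) = target by bisection.

    Relies on ANOS(p0) increasing in h*; the bracket [lo, hi] is an assumption.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if anos_in_control(mid, p0, r1, r2) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical design: p0 = 0.1, p1 = 0.2, desired in-control ANOS of 1000.
p0, p1, target = 0.1, 0.2, 1000.0
r1 = -math.log((1 - p1) / (1 - p0))
r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
h_star = solve_h_star(target, p0, r1, r2)
h = h_star - epsilon(p0) * math.sqrt(p0 * (1 - p0))  # control limit
anos1 = anos_out_of_control(h_star, p1, r1, r2)
print(h_star, h, anos1)
```

Bisection is used only because the in-control formula has no closed-form inverse for h*; any one-dimensional root finder would serve equally well.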

The Shewhart p-chart has the advantage of simplicity, but it also has some disadvantages. If the control limits are set three standard deviations from the target value, the false alarm rate will be quite different from that for a normal distribution. The Shewhart p-chart is not effective for detecting small changes in $p$. Its performance for detecting small shifts can be improved by using a larger sample size, but it will then not be very effective in detecting large shifts. The Bernoulli CUSUM chart detects shifts in $p$ much faster than the p-chart. The binomial CUSUM chart is a little slower than the Bernoulli CUSUM chart for small shifts in $p$ and considerably slower for very large shifts, since a binomial CUSUM has to wait until the end of a sample to signal. The SPRT chart has much better performance than the p-chart or the CUSUM chart.

4. Monitoring the Change Frequency

Based on the performance analysis of the control charts and the characteristics of our application, the Bernoulli CUSUM chart or the SPRT chart would be more appropriate for our purpose, since both are good at detecting small shifts. We also need to consider the following questions:

a. determining how we check the change frequency, either periodically or randomly; if periodically, determining how often we should check;
b. determining the sample size, since the sample size could have an important effect on the inspection result;
c. determining the out-of-control value $p_1$;
d. determining the false alarm rate for the CUSUM chart, so that the control limit h can be determined, or the error probabilities for the SPRT chart, so that the two constants a and b can be determined.

For example, suppose we know that the current change frequency is 0.693 per day, which means the web page changes 0.693 times a day on average. Based on 0.693 = -log(X/n), this change frequency corresponds to X/n = 0.5. We want to detect the shift when the change frequency becomes 0.3567, which corresponds to X/n = 0.7. In this case, our in-control value is $p_0 = 0.5$ and our out-of-control value is $p_1 = 0.7$. These values are relatively large compared with those used in quality control.

If we use the Bernoulli CUSUM chart, we can get the reference value $\gamma$ using $p_0$ and $p_1$. Then, given a desired value for $\mathrm{ANOS}(p_0)$, we can find the value of the control limit h. Next, we can get $\mathrm{ANOS}(p_1)$. If the SPRT chart is used, for some desired values of $\alpha$ and $\beta$, we can find the values of a and b, and further find the approximation of $\mathrm{ANOS}(p)$. In order to compare the performance of these control charts, we can adjust the values so that the charts give similar values of $\mathrm{ANOS}(p_0)$ and then compare $\mathrm{ANOS}(p_1)$ for different values of $p_0$ and $p_1$.

5. Conclusion & Future Work

This paper gives a brief review of control charts and some initial ideas for applying them to monitor the change frequency of a web page. However, in order to show the appropriateness of the control charts, experiments or simulation should be done, and specific data measuring the performance of these control charts should be obtained and compared.

References

[RSa] M. Reynolds, Jr. and Z. Stoumbos, Monitoring a Proportion Using CUSUM and SPRT Control Charts, Frontiers in Statistical Quality Control 6, pp. 56-76 ()
[RSb] M. Reynolds, Jr. and Z. Stoumbos, A General Approach to Modeling CUSUM Charts for a Proportion, IIE Transactions (2000) 32, pp. 515-535
[RS999] M. Reynolds, Jr. and Z. Stoumbos, A CUSUM Chart for Monitoring a Proportion When Inspecting Continuously, Journal of Quality Technology, Vol. 31, No. 1, Jan 1999
[RS998] M. Reynolds, Jr. and Z. Stoumbos, The SPRT Chart for Monitoring a Proportion, IIE Transactions (1998) 30, pp. 545-561
[W997] W. Woodall, Control Charts Based on Attribute Data: Bibliography and Review, Journal of Quality Technology, Vol. 29, No. 2, April 1997
[CGMa] Junghoo Cho and Hector Garcia-Molina, Estimating Frequency of Change
[CGMb] Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler, VLDB, Experience/Application track