Monitoring Frequency of Change

By Li Qin

Abstract

Control charts are widely used in process monitoring problems. This paper gives a brief review of control charts for monitoring a proportion and some initial ideas for using them to monitor the change frequency of a web page, whose estimator can be expressed as a function of a proportion. Experiments and/or simulations should be done to compare the performance of these control charts.

1. Introduction

Control charts such as the Shewhart p-chart, various CUSUM charts and the SPRT chart have been widely used in process monitoring problems such as quality control in manufacturing. Either all the items are inspected, or samples are taken from the process when 100% inspection is economically or practically impossible. Items are classified as defective or nondefective (nonconforming or conforming) based on the test result. Applications usually focus on a change (usually an increase) in the proportion of defective items.

To estimate the change frequency of a web page, the crawler may visit the page periodically and determine whether it has changed by computing a checksum of the page at each access. Changes to a web page have been shown to follow a Poisson process. The frequency ratio is defined as the ratio of the change frequency to the access frequency. We can estimate the frequency ratio first and then estimate the change frequency indirectly by multiplying the frequency ratio by the access frequency. The frequency ratio can be estimated by $-\log(X/n)$ [CGMa], where $n$ is the total number of accesses and $X$ is the number of accesses at which the page had not changed during the checking period.

The estimated change frequency can be applied to improve the freshness of data warehouses, web caching policies and data mining [CGMb]. One of the challenges for these applications is that the change frequency itself may change, so a proper method for monitoring shifts in the change frequency has to be proposed. So, this paper is about the possibility of using existing control
charts to monitor the change frequency. As with estimation, we can monitor the change frequency indirectly by monitoring the frequency ratio.

This paper is organized as follows: Section 2 reviews how to estimate the change frequency; Section 3 introduces the control charts to be considered; Section 4 presents some initial ideas for using control charts to monitor the change frequency, along with some questions to be considered; Section 5 concludes the paper.

2. Estimating the Change Frequency

In most cases, when we try to estimate the change frequency of web pages, we don't have the complete change history of the pages, i.e. we don't know exactly when each page changes or how many times it has changed between consecutive accesses. So, our discussion below is based on an incomplete change history.

In order to estimate the change frequency, [CGMa] traced the daily change history of 720,000 web pages from 270 sites for four months. The experiment shows that web page changes follow a Poisson process. The frequency ratio $r$ is defined to be the ratio of the change frequency $\lambda$ to the access frequency $f$, so $r = \lambda / f$.

An intuitive estimator for the change frequency is $X/T$, where $X$ is the number of detected changes and $T$ is the monitoring period; the corresponding estimator for the frequency ratio is $X/n$, where $n$ is the total number of accesses. This estimator has been proved to be biased and not consistent, since the bias does not decrease as the sample size increases.

Because of these drawbacks, [CGMa] proposed an improved estimator, $-\log(X/n)$, where $n$ is the total number of accesses and $X$ is now the number of accesses at which the web page had not changed. For example, if we access a web page once a day for 10 days and the page is found unchanged at 7 accesses, the estimated frequency ratio is $r = -\log(7/10) = 0.36$. This result is slightly larger than the intuitive estimate $3/10 = 0.3$, since some changes may have been missed between accesses. Performance analysis shows that this estimator has smaller bias than $X/n$, is more efficient, and is consistent.
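The worked example above can be checked numerically. The following is a minimal Python sketch, not code from [CGMa]; the function names are hypothetical and chosen only for illustration.

```python
import math

def naive_ratio(changes_detected, n_accesses):
    """Intuitive estimator: fraction of accesses at which a change was detected."""
    return changes_detected / n_accesses

def improved_ratio(unchanged, n_accesses):
    """Estimator -log(X/n), where X counts accesses with no detected change."""
    if not 0 < unchanged <= n_accesses:
        raise ValueError("need 0 < unchanged <= n_accesses")
    return -math.log(unchanged / n_accesses)

def change_frequency(ratio, access_frequency):
    """Change frequency = frequency ratio * access frequency."""
    return ratio * access_frequency

# Example from the text: 10 daily accesses, page unchanged at 7 of them.
n, unchanged = 10, 7
r_naive = naive_ratio(n - unchanged, n)
r_improved = improved_ratio(unchanged, n)
print(f"naive: {r_naive:.2f}, improved: {r_improved:.2f}")
print(f"changes per day: {change_frequency(r_improved, 1.0):.2f}")
```

Note that the improved estimator is undefined when the page changed at every access (X = 0); the sketch simply rejects that input.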
The change frequency, $\lambda$, can change in practice. We do not yet have any experiments showing how the change frequency itself changes. If it changes very quickly, it will be difficult and impractical to estimate $\lambda$ and use it in applications. So, we assume that the change frequency remains relatively stable for at least some period of time. What we can do is test it periodically to see whether it has changed and whether the change exceeds a predefined threshold. Since the frequency ratio $r$ can be estimated as $-\log(X/n)$, the proportion to be monitored, $p$, will be $X/n$. We are interested in detecting both increases and decreases in $p$.

In order not to miss too many changes, the crawler should access the web page as frequently as possible. Usually, the crawler cannot access web pages more than once a day, and we are not interested in web pages that change more than once a day, so the access frequency can be chosen as one access per day.

3. Control Charts

[W1997] gives a review and bibliography on control charts based on attribute data. Here, we focus on the Shewhart p-chart, Bernoulli CUSUM chart, Binomial CUSUM chart and SPRT chart [RSa, RSb, RS1999, RS1998].

For our application, since the crawler cannot visit all the web pages once a day, continuous 100% inspection is not possible. Therefore, our inspection will be based on samples taken from the process.

3.1 The Shewhart p-chart

When samples of $n$ items are taken from the process, the Shewhart p-chart plots the fraction of defective items in the samples. So, if $T$ is the total number of defective items in a sample of size $n$, then $T/n$ is plotted on the p-chart. $T$ has a binomial distribution, assuming $p$ is constant and items are independent.

If the crawler visits a web page once a day for $n$ days and $X$ is the number of accesses at which the web page has not changed, the proportion $X/n$ can be plotted on the
Shewhart p-chart. Here, $X$ has a binomial distribution with parameters $n$ and $p$, where $p$ is the probability that the web page does not change between two consecutive accesses and $p = e^{-r}$, where $r$ is the frequency ratio. So, the result of the $i$th access, $X_i$, takes the value 1 with probability $p$ and 0 with probability $1 - p$.

3.2 Bernoulli CUSUM Chart

The Bernoulli CUSUM chart is based on the individual observations $X_1, X_2, \ldots$. In order to detect an increase in $p$, the Bernoulli CUSUM control statistic is

$$B_i = \max(0, B_{i-1}) + (X_i - \gamma), \quad i = 1, 2, \ldots,$$

where $\gamma$ is the reference value. This CUSUM chart signals that there has been an increase in $p$ if $B_k \ge h$, where $h > 0$ is the control limit. For detecting a decrease in $p$, the corresponding CUSUM control statistic is

$$B_i = \min(0, B_{i-1}) + (X_i - \gamma), \quad i = 1, 2, \ldots,$$

where $\gamma$ is the reference value. It signals that there has been a decrease in $p$ if $B_k \le h$, where $h < 0$ is the control limit.

In order to get the value of $\gamma$, we have to specify an out-of-control value $p_1$ that we want to detect quickly; $p_0$ denotes the in-control value. Constants $r_1$ and $r_2$ are defined to be

$$r_1 = -\log\left(\frac{1 - p_1}{1 - p_0}\right), \qquad r_2 = \log\left(\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}\right).$$

Then, the reference value is $\gamma = r_1 / r_2$. Usually, $p_1$ is adjusted slightly so that $\gamma$ takes the value of the reciprocal of an integer $m$, i.e. $\gamma = 1/m$.

The control limit $h$ is obtained by making the false alarm rate (i.e. the average number of observations/samples to signal when $p = p_0$) satisfy some predefined value.

3.3 Binomial CUSUM Chart

The Binomial CUSUM chart plots a cumulative sum of the numbers of defective items in samples of $n$ consecutive items, $T_1, T_2, \ldots$, where each $T_k$ has a binomial distribution. For detecting an increase in $p$, the binomial CUSUM control statistic is

$$S_k = \max(0, S_{k-1}) + (T_k - n\gamma), \quad k = 1, 2, \ldots,$$

where $n\gamma$ is the reference value. The Binomial CUSUM chart signals that there has been an increase in $p$ if $S_k \ge h$, where $h > 0$ is the
control limit. For detecting a decrease in $p$, the binomial CUSUM control statistic is

$$S_k = \min(0, S_{k-1}) + (T_k - n\gamma), \quad k = 1, 2, \ldots,$$

where $n\gamma$ is the reference value. The Binomial CUSUM chart signals that there has been a decrease in $p$ if $S_k \le h$, where $h < 0$ is the control limit.

As with the Bernoulli CUSUM chart, the control limit $h$ for the binomial CUSUM chart is obtained by making the false alarm rate satisfy some predefined value.

3.4 SPRT Chart

In most applications, both the p-chart and the CUSUM chart take a fixed sample size of $n$ items, using a fixed sampling interval between samples. The SPRT instead uses a varying sample size that is determined dynamically. It is a sequential test of the null hypothesis $H_0: p = p_0$ against $H_1: p = p_1$. For each item, $X_i = 1$ if the $i$th item is defective (in our application, if the web page does not change) and $X_i = 0$ otherwise. The statistic used by the SPRT is

$$S_j = r_2 T_j - r_1 j, \quad \text{where } T_j = \sum_{i=1}^{j} X_i.$$

Here, $r_1$ and $r_2$ are defined as in Section 3.2.

The SPRT requires specifying two constants $a$ and $b$, with $b < a$. The following rules are used for sampling and for deciding whether to accept or reject $H_0$:

If $b < S_j < a$, then continue sampling;
If $S_j \ge a$, then stop sampling and reject $H_0$;
If $S_j \le b$, then stop sampling and accept $H_0$.

The constants $a$ and $b$ are usually chosen to satisfy some predefined error probabilities (the probabilities of type I and type II errors).

3.5 Comparison of Control Charts

The performance of the above control charts can be measured by ANSS (average number of samples to signal), ANOS (average number of observations to signal) and ATS (average time to signal). Since the sample size is not fixed for the SPRT chart and the ATS depends on the length of the non-inspecting period, we suggest using ANOS to compare the performance of the control charts.
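The upper Bernoulli CUSUM of Section 3.2 and the SPRT decision rules of Section 3.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the CUSUM is assumed to start at B_0 = 0, the reference value is called `gamma`, and the signal and stopping conditions are taken to be inclusive (>= and <=).

```python
import math

def llr_constants(p0, p1):
    """Constants r1 and r2 of Section 3.2 (requires 0 < p0 < p1 < 1)."""
    r1 = -math.log((1 - p1) / (1 - p0))
    r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return r1, r2

def bernoulli_cusum_upper(xs, gamma, h):
    """Return the 1-based index of the first signal (B_k >= h), or None."""
    b = 0.0  # assumption: the chart starts at B_0 = 0
    for k, x in enumerate(xs, start=1):
        b = max(0.0, b) + (x - gamma)
        if b >= h:
            return k
    return None

def sprt(xs, p0, p1, a, b):
    """Run the SPRT on 0/1 observations; return (decision, observations used)."""
    r1, r2 = llr_constants(p0, p1)
    t = 0  # running count T_j of observations equal to 1
    for j, x in enumerate(xs, start=1):
        t += x
        s = r2 * t - r1 * j
        if s >= a:
            return "reject H0", j
        if s <= b:
            return "accept H0", j
    return "continue", len(xs)

# X_i = 1 means the page was found unchanged at access i.
print(bernoulli_cusum_upper([1, 0, 1, 1, 1, 1], gamma=0.5, h=2.0))
# Limits from Wald's approximations with both error probabilities at 0.05.
a = math.log(0.95 / 0.05)
print(sprt([1] * 20, 0.5, 0.7, a, -a))
```

The lower CUSUM for detecting a decrease follows the same pattern with `min` in place of `max` and a negative control limit.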
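A corrected diffusion (CD) theory approximation to the ANOS has been developed for the CUSUM chart and the SPRT chart. For each type of control chart, the ANOS can be obtained for a range of in-control values $p_0$ and out-of-control values $p_1$.

When the Bernoulli CUSUM chart is used, the CD approximation to the ANOS when $p = p_0$ is

$$\mathrm{ANOS}(p_0) \approx \frac{e^{h^* r_2} - h^* r_2 - 1}{|r_2 p_0 - r_1|}.$$

We can find the value of $h^*$ required to give a desired in-control ANOS (average false alarm rate). Then, by using

$$h^* = h + \varepsilon(p_0) \sqrt{p_0 q_0},$$

where $q_0 = 1 - p_0$ and $\varepsilon(p_0)$ can be approximated by

$$\varepsilon(p_0) \approx \begin{cases} 0.410 - 0.0842 \log(p_0) - 0.0391 (\log(p_0))^3 - 0.00376 (\log(p_0))^4 - 0.000008 (\log(p_0))^7, & \text{if } 0.01 \le p_0 \le 0.5; \\ \frac{1}{3} \left( \sqrt{q_0 / p_0} - \sqrt{p_0 / q_0} \right), & \text{if } 0 < p_0 < 0.01, \end{cases}$$

we can find the control limit $h$. Also, the CD approximation to the ANOS when $p = p_1$ is

$$\mathrm{ANOS}(p_1) \approx \frac{e^{-h^* r_2} + h^* r_2 - 1}{|r_2 p_1 - r_1|}.$$

When the SPRT chart is used, let $\alpha$ be the probability of a type I error and $\beta$ the probability of a type II error. Using

$$h^* \approx \frac{1}{r_2} \ln\left(\frac{1 - \beta}{\alpha}\right), \qquad g^* \approx \frac{1}{r_2} \ln\left(\frac{\beta}{1 - \alpha}\right),$$

together with the correction relating $h^*$ and $g^*$ to $h$ and $g$ (see [RS1998]), we can find the values of $h$ and $g$. Since $g = b / r_2$ and $h = a / r_2$, we can then get $a$ and $b$. $\mathrm{ANOS}(p_0)$ and $\mathrm{ANOS}(p_1)$ can be obtained using $p_0$, $p_1$, $r_1$, $r_2$, $g$ and $h$.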
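The two-step use of the Bernoulli CUSUM approximation (fix the in-control ANOS, solve for the adjusted limit, then read off the out-of-control ANOS) can be sketched as below. This is a hedged illustration, not code from [RS1999]: it assumes the approximation has the form ANOS(p0) ≈ (exp(h* r2) − h* r2 − 1) / |r2 p0 − r1| and its out-of-control counterpart, it works directly with the adjusted limit h* (called `h_star`), and it leaves out the conversion from h* back to h. All function names are hypothetical.

```python
import math

def llr_constants(p0, p1):
    """Constants r1 and r2 of Section 3.2 (requires 0 < p0 < p1 < 1)."""
    r1 = -math.log((1 - p1) / (1 - p0))
    r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return r1, r2

def anos_in_control(h_star, p0, p1):
    """Assumed CD approximation to the ANOS when p = p0."""
    r1, r2 = llr_constants(p0, p1)
    return (math.exp(h_star * r2) - h_star * r2 - 1) / abs(r2 * p0 - r1)

def anos_out_of_control(h_star, p0, p1):
    """Assumed CD approximation to the ANOS when p = p1."""
    r1, r2 = llr_constants(p0, p1)
    return (math.exp(-h_star * r2) + h_star * r2 - 1) / abs(r2 * p1 - r1)

def solve_h_star(target_anos, p0, p1):
    """Bisection for the h_star that gives the desired in-control ANOS."""
    lo, hi = 0.0, 1.0
    # The in-control ANOS is 0 at h_star = 0 and increases with h_star,
    # so grow the bracket until it contains the target.
    while anos_in_control(hi, p0, p1) < target_anos:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if anos_in_control(mid, p0, p1) < target_anos:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical design: about 500 observations between false alarms.
h_star = solve_h_star(500.0, 0.5, 0.7)
print(f"h* = {h_star:.3f}")
print(f"ANOS at p1 = {anos_out_of_control(h_star, 0.5, 0.7):.1f}")
```

Growing the bracket by doubling keeps the exponential small, which avoids overflow for design points where r2 is large.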
The Shewhart p-chart has the advantage of simplicity, but it also has some disadvantages. If the control limits are set three standard deviations from the target value, the false alarm rate will differ considerably from that for a normal distribution. The Shewhart p-chart is also not effective for detecting small changes in $p$. Its performance for detecting small shifts can be improved by using a larger sample size, but then it will not be very effective in detecting large shifts.

The Bernoulli CUSUM chart detects shifts in $p$ much faster than the p-chart. The binomial CUSUM chart is a little slower than the Bernoulli CUSUM chart for small shifts in $p$ and considerably slower for very large shifts, since a binomial CUSUM has to wait until the end of a sample to signal. The SPRT chart has much better performance than the p-chart or the CUSUM chart.

4. Monitoring the Change Frequency

Based on the performance analysis of the control charts and the characteristics of our application, the Bernoulli CUSUM chart or the SPRT chart would be more appropriate for our purpose, since both are good at detecting small shifts. We also need to consider the following questions:

a. determining how we check the change frequency, either periodically or randomly, and, if periodically, how often we should check;
b. determining the sample size, since the sample size could have an important effect on the inspection result;
c. determining the out-of-control value $p_1$;
d. determining the false alarm rate for the CUSUM chart so that the control limit $h$ can be determined, or the error probabilities for the SPRT chart so that the two constants $a$ and $b$ can be determined.

For example, suppose we know that the current change frequency is 0.693 per day, which means the web page changes 0.693 times a day on average. Based on $0.693 = -\log(X/n)$, this change frequency corresponds to $X/n = 0.5$. We want to detect the shift when the change frequency becomes 0.3567, which corresponds to $X/n = 0.7$.
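The conversion in this example, and the reference value it leads to, can be reproduced with a short Python sketch. The helper names are hypothetical, and one access per day is assumed so that the frequency ratio equals the change frequency.

```python
import math

def no_change_probability(change_frequency, access_frequency=1.0):
    """p = exp(-r), with frequency ratio r = change frequency / access frequency."""
    r = change_frequency / access_frequency
    return math.exp(-r)

def reference_value(p0, p1):
    """Reference value r1 / r2, with r1 and r2 as defined in Section 3.2."""
    r1 = -math.log((1 - p1) / (1 - p0))
    r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return r1 / r2

p0 = no_change_probability(0.693)   # current change frequency
p1 = no_change_probability(0.3567)  # shifted change frequency to detect
gamma = reference_value(p0, p1)
print(f"p0 = {p0:.3f}, p1 = {p1:.3f}, reference value = {gamma:.3f}")
```

A lower change frequency gives a higher probability of finding the page unchanged, so this design point is an upward shift in the monitored proportion, and the reference value falls between the two proportions.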
In this case, our in-control value is $p_0 = 0.5$ and our out-of-control value is $p_1 = 0.7$. These values are
relatively large compared with those used in quality control. If we use the Bernoulli CUSUM chart, we can get the reference value $\gamma$ from $p_0$ and $p_1$. Then, given a desired value for $\mathrm{ANOS}(p_0)$, we can find the value of the control limit $h$. Next, we can get $\mathrm{ANOS}(p_1)$. If the SPRT chart is used, for some desired values of $\alpha$ and $\beta$, we can find the values of $a$ and $b$, and then find the approximation to the ANOS.

In order to compare the performance of these control charts, we can adjust the values so that the charts give similar values of $\mathrm{ANOS}(p_0)$ and then compare $\mathrm{ANOS}(p_1)$ for different values of $p_0$ and $p_1$.

5. Conclusion & Future Work

This paper gives a brief review of control charts and some initial ideas for applying them to monitor the change frequency of a web page. However, in order to show that the control charts are appropriate, experiments or simulations should be done, and specific data measuring the performance of these control charts should be obtained and compared.

References

[RSa] M. Reynolds, Jr. and Z. Stoumbos, Monitoring a Proportion Using CUSUM and SPRT Control Charts, Frontiers in Statistical Quality Control 6, pp. 56-76
[RSb] M. Reynolds, Jr. and Z. Stoumbos, A General Approach to Modeling CUSUM Charts for a Proportion, IIE Transactions (2000) 32, pp. 515-535
[RS1999] M. Reynolds, Jr. and Z. Stoumbos, A CUSUM Chart for Monitoring a Proportion When Inspecting Continuously, Journal of Quality Technology, Vol. 31, No. 1, Jan 1999
[RS1998] M. Reynolds, Jr. and Z. Stoumbos, The SPRT Chart for Monitoring a Proportion, IIE Transactions (1998) 30, pp. 545-561
[W1997] W. Woodall, Control Charts Based on Attribute Data: Bibliography and Review, Journal of Quality Technology, Vol. 29, No. 2, April 1997
[CGMa] Junghoo Cho and Hector Garcia-Molina, Estimating Frequency of Change
[CGMb] Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler, VLDB, Experience/Application track