File Size Distribution Model in Enterprise File Server toward Efficient Operational Management



Proceedings of the World Congress on Engineering and Computer Science 2012 Vol II, WCECS 2012, October 24-26, 2012, San Francisco, USA

Toshiko Matsumoto, Takashi Onoyama, and Norihisa Komoda

Abstract — Toward efficient operational management of enterprise file servers, we propose a method for estimating the relationship between file number and cumulative file size in descending order of file size, based on a model of file size distribution. We develop the model as a weighted summation of multiple log normal distributions selected by AIC. Data from technical and non-technical divisions of a company show that our model fits the observed file size distribution well, and that the estimated relationship can be utilized for cost-effective operational management of file servers.

Index Terms — enterprise file server, file size, operational management, tiered storage

I. INTRODUCTION

Recently the enterprise file server has received more and more attention because of the growing volume of unstructured files. In addition, file servers are expected to handle archiving needs for e-discovery and real-time backup for Business Continuity Management. From the viewpoint of cost effectiveness, several solutions have been provided: tiered storage technology, de-duplication, and deletion of unnecessary files [1], [2]. Tiered storage technology contributes not only to the cost-effectiveness of high performance storage systems but also to reducing the cost of real-time backup in Business Continuity Management by separating active and inactive files. De-duplication reduces the data volume in a file server and thus can shorten backup time. Deleting unnecessary files can improve the cost-performance of file servers and leads to more desirable usage of files.

Toward theoretical evaluation of the algorithms implemented in these solutions, several statistical characteristics of files have been reported [3]-[10]. These studies measured frequencies of file access, extension distribution, and fraction of duplicated contents. Some of them also suggested similarities between file size distribution and the Pareto or log normal distribution [4], [8]. However, quantitative estimation requires a survey where 1) system files are discriminated from user files, 2) files are stored in a shared file server for collaboration, 3) data are from industrial file servers, and 4) models are evaluated by statistically testing goodness of fit.

In this study, we propose a model for file size distribution, which is one of the most fundamental statistical characteristics of files. The model is calculated as a weighted summation of multiple log normal distributions. The number of log normal distributions is decided based on Akaike's Information Criterion (AIC) to prevent overfitting [11]. We also describe efficiency estimation for the introducing process of tiered storage technology and for reduction of usage volume based on our model. Finally, the model and the estimation are evaluated with actual data.

Manuscript received June 6th, 2012; revised July 2nd, 2012. T. Matsumoto is with Hitachi Solutions, Ltd., 4-12-7, Higashi-Shinagawa, Shinagawa-ku, Tokyo, 140-0002, Japan (phone: +81-3-578-2111; fax: +81-3-578-249; e-mail: toshiko.matsumoto.jz@hitachi-solutions.com). T. Onoyama is with Hitachi Solutions, Ltd., 4-12-7, Higashi-Shinagawa, Shinagawa-ku, Tokyo, 140-0002, Japan (e-mail: takashi.onoyama.js@hitachi-solutions.com). N. Komoda is a professor at Osaka University, 2-1, Yamadaoka, Suita, Osaka 565-0871, Japan (e-mail: komoda@ist.osaka-u.ac.jp).
II. ABOUT DATA

We use data of shared file servers from 24 divisions of a company: 11 are technical divisions such as research and development, and the other 13 are non-technical divisions such as sales, marketing, and management. Some divisions have tens of thousands of files and others have more than one million files. Sums of file sizes range from several gigabytes to several hundreds of gigabytes.

Sizes of files in a shared file server of a Research and Development division of the company are plotted in Fig 1. The x-axis and y-axis represent the number and the size of a file, respectively. This graph is equivalent to a transposed cumulative probability distribution, and demonstrates that there are very few large files. All the other divisions show the same tendency.

Fig 1. Example plot of file size in descending order of file size.

Fig 2 shows the same data as Fig 1, together with a Pareto distribution as in [8], on a logarithmic scale. Since the Pareto distribution is proposed for file sizes larger than some threshold, the linear regression line of the logarithm of file size upon the logarithm of file number is calculated with Finney's correction method [12] from the plot points with file size of 1KB or larger.

Fig 2. Example plot of file size in descending order for observed data and the Pareto distribution (logarithmic scale).

Fig 3 shows the histogram of the observed data and of two theoretical distributions proposed in previous studies [4], [8]. The number of classes of the histogram is decided based on Sturges' formula [13]. The log normal distribution is calculated from the average and standard deviation of the logarithm of file size.

Fig 3. Example plot of histogram for observed data and models in previous studies.

The observed plot seems to have linearity in Fig 2 and shows a roughly bell-shaped form in Fig 3. However, the frequency of the observed data differs from the Pareto distribution by more than 2,000 at 1KB and from the log normal distribution by almost 1,000 at the peak of the histogram. Because of these apparent discrepancies, the observed histogram shows a statistically significant difference both from the Pareto distribution and from the log normal distribution with a type I error rate of 0.01. Even if the threshold is set to 1MB, the Pareto distribution still shows a statistically significant difference from the observed data. In all the other divisions, the observed data differ significantly both from the Pareto distribution and from the log normal distribution. Therefore, neither the Pareto distribution nor the log normal distribution fits the observed data, even though both have a roughly similar histogram form.
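For readers who want to reproduce this test, the following is a minimal Python sketch of the procedure described above: a histogram of log file sizes with the class count from Sturges' formula, compared against a single fitted log normal distribution with a chi-square test. The function name, the byte-unit assumption, and the pooling of low-expectation classes are our own illustrative choices, not details taken from the paper.

```python
import numpy as np
from scipy import stats

def lognormal_gof(sizes_bytes):
    """Chi-square goodness-of-fit of one log normal distribution to
    observed file sizes; class count chosen by Sturges' formula."""
    log_sizes = np.log10(np.asarray(sizes_bytes, dtype=float))
    n = len(log_sizes)
    k = int(np.ceil(np.log2(n))) + 1                  # Sturges' formula
    observed, edges = np.histogram(log_sizes, bins=k)
    mu, sigma = log_sizes.mean(), log_sizes.std(ddof=1)
    expected = n * np.diff(stats.norm.cdf(edges, mu, sigma))
    keep = expected > 5                               # drop near-empty classes
    chi2 = ((observed[keep] - expected[keep]) ** 2 / expected[keep]).sum()
    dof = keep.sum() - 1 - 2                          # minus the 2 fitted params
    return chi2, stats.chi2.sf(chi2, dof)
```

A p-value below the chosen type I error rate rejects the single log normal model, matching the rejection reported for the observed divisions.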

III. FILE SIZE MODELING BY WEIGHTED SUMMATION OF MULTIPLE LOG NORMAL DISTRIBUTIONS

A. AIC-based Modeling

We propose modeling file size by a weighted summation of multiple log normal distributions. Our model is based on observations that file size distribution depends on content type (for example, movie files, database files, and executable files are larger than other files) [3], [5], and on the idea that more variety should be observed in general-purpose industrial file servers.

A weighted summation of c log normal distributions is calculated to fit the observed data. Similarity is evaluated with the χ² value from a contingency table. The χ² value is calculated for various c to minimize AIC. AIC is defined as expression (1), where L is the maximum likelihood and k is the number of degrees of freedom [11].

AIC = -2 ln L + 2k    (1)

The likelihood ratio λ is L0/L, where L0 and L are the maximum likelihood values under the null hypothesis and in the full parameter space, respectively. AIC can be minimized by minimizing (χ² + 3c), because of the following three facts. First, L0 can be treated as a constant value when the observed data are given. Second, -2 ln λ asymptotically follows a χ² distribution as the sample size grows. Third, each log normal distribution adds three degrees of freedom: average, variance, and weight. We can prevent overfitting during model calculation, since the AIC value increases when too many parameters are adopted.

B. Comparison of Observed Data and Proposed Model of File Size

According to the previous section, the theoretical distribution of logarithmically scaled file size is calculated as expression (2), where N(μ, σ²) is the normal distribution with average μ and variance σ².

28196.8 N(3.28, 0.6) + 42195.9 N(5.1, 0.1) + 13121.3 N(5.95, 0.14)
+ 12342.1 N(2.14, 0.28) + 862.5 N(4.23, 0.81) + 7144.9 N(6.5, 0.23)
+ 364.6 N(0.1, 0.4) + 2156.7 N(0.28, 0.49) + 844.5 N(1.38, 0.14)
+ 252.4 N(7.88, 0.25) + 141.5 N(8.5, 0.9) + 75.5 N(9.5, 9.0)    (2)
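As a sketch of the selection loop in Section III A: fitting a Gaussian mixture to log10 file sizes is the same as fitting a weighted summation of log normal distributions in the size domain. This version substitutes scikit-learn's likelihood-based AIC for the paper's (χ² + 3c) criterion, which plays the same role; the function name and the component cap are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_size_model(sizes_bytes, max_components=15):
    """Fit a weighted summation of log normal distributions by fitting a
    Gaussian mixture on log10(size) and keeping the component count c
    with the smallest AIC, to avoid overfitting."""
    sizes = np.asarray(sizes_bytes, dtype=float)
    X = np.log10(sizes[sizes > 0]).reshape(-1, 1)     # skip zero-byte files
    best, best_aic = None, np.inf
    for c in range(1, max_components + 1):
        gm = GaussianMixture(n_components=c, random_state=0).fit(X)
        aic = gm.aic(X)
        if aic < best_aic:
            best, best_aic = gm, aic
    return best   # weights_, means_, covariances_ give the (2)-style terms
```

Multiplying best.weights_ by the total file count yields per-component frequencies comparable to the coefficients in expression (2).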

Fig 4. Example plot of histogram for observed data and the proposed model.

Fig 4 shows the histogram for the observed data and the proposed model. Since the theoretical distribution fits the observed data very well at all classes, no statistically significant difference is observed. Expression (2) shows that the weights decrease in an exponential fashion. The decrease indicates that the first term has a dominant effect on the distribution and supports the roughly bell-shaped form of the observed distribution. The first term includes source code files and plain text files, which occupy a large part of the files in the file server. The second term includes HTML files and PDF files. HTML files have larger sizes than XML files on average, because the HTML files include specification documents of a programming language and files converted by word processing software. The observed distributions of file size in all the other 23 divisions fit their corresponding models very well and demonstrate no statistically significant difference.

IV. EFFICIENCY OF OPERATIONAL MANAGEMENT OF FILE SERVER BASED ON MODEL OF FILE SIZE DISTRIBUTION

A. Cases of Operational Management Utilizing Model of File Size Distribution

The file size distribution model can be directly utilized in the following two kinds of estimation for efficient operational management of file servers.

The first case is estimating the effect of a file moving policy in the introducing process of tiered storage technology. The processing time of moving files depends both on the size summation and on the number of files. From these dependences, we can expect that assigning a high priority to large files achieves an efficient moving policy: a large volume is moved to lower-cost storage in a short time, so the benefit of tiered storage technology can be realized efficiently with this policy. Since last access time and file size show no correlation (correlation coefficient < 0.1 for all divisions in our data), the file size distribution model of Section III A is expected to be effective for estimating the relationship between the number and the size summation of files moved to lower-cost storage.

The second case is estimating the effect of deletion of unnecessary files by users. When millions of files are stored in a file server, it is obviously unrealistic to manually check the deletability of all files one by one. Reduction of unnecessary files is tractable only when a large volume is deleted with manual confirmation of a small fraction of the files. Deletability confirmation in descending order of file size is effective for efficient volume reduction when the files are deletable. When some files are confirmed to be undeletable, this confirmation policy is still effective for efficient estimation of the upper limit of the eventually reducible volume.

B. Estimating Efficiency Based on Cumulative File Size

When the top n% of large files occupy p% of the total volume, the relationship between n% and p% is equivalent to the volume efficiency of the introducing process of tiered storage technology, where inactive files are moved to lower-cost storage. The relationship also represents the upper limit of reduction volume per confirmation of file deletability. We can estimate the relationship by integrating the theoretical distribution described in Section III A.
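The integration in Section IV B can be approximated by sampling from the fitted model: draw log sizes, convert them to sizes, and measure what fraction of the total volume the largest n% of files carry. The following Monte Carlo sketch is a stand-in for the analytic integration and assumes a model produced by the earlier fit_size_model sketch; the function name is ours.

```python
import numpy as np

def cumulative_size_fraction(gm, n_percent, samples=1_000_000):
    """Estimate p%: the fraction of total volume held by the largest n%
    of files, from a fitted mixture over log10 file size."""
    log_sizes, _ = gm.sample(samples)
    sizes = np.sort(10.0 ** log_sizes.ravel())[::-1]  # descending file size
    top = int(np.ceil(samples * n_percent / 100.0))
    return sizes[:top].sum() / sizes.sum()
```

For example, cumulative_size_fraction(model, 1) estimates the p% that Section V compares against the observed top-1% volume fraction.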
V. EXPERIMENTAL RESULTS AND EVALUATION

A. Relationship between Number of Files and Cumulative File Size

From expression (2) in Section III, Fig 5 shows the relationship between the fraction of file number n% and the fraction of cumulative file size p% in descending order of file size. The value of p% rapidly reaches almost 100%, whereas p% would be equal to n% in randomized order. The rapid increase of p% results from the tiny fraction of large files shown in Fig 1. Since a small value of n% can give a large p% value, the policy with descending order of file size is expected to achieve efficient operational management in both cases of Section IV A.

Fig 5. Fraction of file number and fraction of cumulative file size in descending order, based on our model.

B. Accuracy Evaluation of Estimating Relationship between Number and Cumulative Size of Files

The fraction of cumulative file size p% is calculated for fractions of file number n% = 1%, 5%, and 10% in the 24 divisions. Figs 6, 7, and 8 show comparisons of the observed data with the values estimated by the Pareto distribution, the log normal distribution, and our model of expression (2) in Section III. The plots show strong correlation between the observed and estimated values of our model in Fig 8 (correlation coefficient 0.9).

In contrast, no apparent correlation is observed for the Pareto distribution or the log normal distribution (correlation coefficient 0.06). These results demonstrate that our model can estimate the relationship between n% and p% more accurately than the models of previous studies can, and is suitable for quantitative evaluation.

Fig 6. Fraction of cumulative file size occupied by top large files in observed data and in the Pareto distribution.

Fig 7. Fraction of cumulative file size occupied by top large files in observed data and in the log normal distribution.

Fig 8. Fraction of cumulative file size occupied by top large files in observed data and in our model.

C. Shared File Servers in Technical/Non-technical Divisions

We compare technical and non-technical divisions regarding the fraction of cumulative file size p% for the fraction of file number n% = 1%, the number of log normal distributions c calculated according to Section III A, and the overview statistics of the file servers shown in Table I. A t-test with Bonferroni correction [14] reveals that only p% shows a statistically significant difference between technical and non-technical divisions. Values of p% in the technical and non-technical divisions are shown in Fig 9. These results mean that the overview statistics of a file server are not good estimators of p%, although p% depends on the type of division. They also indicate that our model can estimate p% better.

TABLE I. OVERVIEW STATISTICS OF FILE SERVER

No. | Overview statistic of file server
1   | File number
2   | File number (logarithmic value)
3   | Sum of file sizes
4   | Sum of file sizes (logarithmic value)
5   | Average file size
6   | Average file size (logarithmic value)
7   | Maximum file size
8   | Maximum file size (logarithmic value)
9   | Maximum file size (logarithmic value) / file number (logarithmic value)

Fig 9. Fraction of cumulative file size occupied by the top 1% of large files in technical and non-technical divisions.

VI. CONCLUSION

In this study, we propose a file size distribution model for enterprise file servers and describe how the model enables efficient operational management of file servers. Data of shared file servers from various technical and non-technical divisions demonstrate that file size distribution can be modeled as a weighted summation of multiple log normal distributions. The data also demonstrate that integrating the theoretical distribution gives an accurate estimate of the efficiency of the descending-order policy in the introducing process of tiered storage technology and in deleting unnecessary files. Our estimation can be applied to each file server by two conversions: 1) converting cumulative file size into cost effect on the basis of the price and volume capacity of a file server product, and 2) converting file number into introduction time on the basis of data transfer rate and metadata writing speed in the case of tiered storage technology, and into labor cost on the basis of the work efficiency of manual confirmation in the case of deleting unnecessary files. Therefore our model contributes to cost-effective file servers from the viewpoint of operational management. We believe that our model can also be utilized in many other simulation-based evaluations, such as access performance and file fragmentation [1], [15]. Furthermore, file access frequencies, fraction of duplicated content, and other statistical characteristics can be modeled and utilized for further efficiency of enterprise file servers.

REFERENCES

[1] E. Anderson, J. Hall, J. Hartline, M. Hobbs, A. R. Karlin, J. Saia, R. Swaminathan, and J. Wilkes, "An experimental study of data migration algorithms," in Proc. of the 5th International Workshop on Algorithm Engineering, pp. 145-158, 2001.
[2] J. Malhotra, P. Sarode, and A. Kamble, "A review of various techniques and approaches of data deduplication," International Journal of Engineering Practices, vol. 1, no. 1, pp. 1-8, 2012.
[3] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch, "A five-year study of file-system metadata," ACM Transactions on Storage, vol. 3, no. 3, pp. 31-45, 2007.
[4] A. B. Downey, "The structural cause of file size distributions," in Proc. of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2001.
[5] K. M. Evans and G. H. Kuenning, "A study of irregularities in file-size distributions," in Proc. of the International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
[6] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," in Proc. of the 9th USENIX Conference on File and Storage Technologies, 2011.
[7] M. Satyanarayanan, "A study of file sizes and functional lifetimes," in Proc. of the Eighth ACM Symposium on Operating Systems Principles, pp. 96-108, 1981.
[8] P. Barford and M. Crovella, "Generating representative web workloads for network and server performance evaluation," in Proc. of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, 1998.
[9] T. Gibson and E. L. Miller, "An improved long-term file-usage prediction algorithm," in Proc. of the Annual International Conference on Computer Measurement and Performance, 1999.
[10] SPEC SFS 2008 benchmark. http://www.spec.org/sfs2008/
[11] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proc. of the 2nd International Symposium on Information Theory, pp. 267-281, 1973.
[12] D. Finney, "On the distribution of a variate whose logarithm is normally distributed," Journal of the Royal Statistical Society, Supplement VII, pp. 155-161, 1941.
[13] H. A. Sturges, "The choice of a class interval," Journal of the American Statistical Association, vol. 21, no. 153, pp. 65-66, 1926.
[14] R. G. Miller, Simultaneous Statistical Inference, 2nd edition, New York: Springer-Verlag, 1981.
[15] T. Nakamura and N. Komoda, "Size adjusting pre-allocation methods to improve fragmentation and performance on simultaneous file creation by synchronous write," IPSJ Journal, vol. 50, no. 11, pp. 2690-2698, 2009.