File Size Distribution Model in Enterprise File Server toward Efficient Operational Management
Proceedings of the World Congress on Engineering and Computer Science 2012 Vol II, WCECS 2012, October 24-26, 2012, San Francisco, USA

File Size Distribution Model in Enterprise File Server toward Efficient Operational Management

Toshiko Matsumoto, Takashi Onoyama, and Norihisa Komoda

Abstract: Toward efficient operational management of enterprise file servers, we propose a method for estimating the relationship between file number and cumulative file size in descending order of file size, based on a model of file size distribution. We develop the model as a weighted summation of multiple log-normal distributions selected with AIC. Data from technical and non-technical divisions of a company show that our model fits the observed file size distribution well, and that the estimated relationship can be utilized for cost-effective operational management of file servers.

Index Terms: enterprise file server, file size, operational management, tiered storage

I. INTRODUCTION

Recently, enterprise file servers have received more and more attention because of the growing trend of unstructured files. In addition, file servers are expected to handle archiving needs for e-discovery and real-time backup for Business Continuity Management. From the viewpoint of cost effectiveness, several solutions have been provided: tiered storage technology, de-duplication, and deletion of unnecessary files [1], [2]. Tiered storage technology contributes not only to the cost-effectiveness of high-performance storage systems but also to reducing the cost of real-time backup in Business Continuity Management by separating active and inactive files. De-duplication reduces the data volume in a file server and thus can shorten backup time. Deleting unnecessary files can improve the cost-performance of file servers and leads to more desirable usage of files. Toward theoretical evaluation of the algorithms implemented in these solutions, several statistical characteristics of files have been reported [3]-[10].
They measured frequencies of file access, extension distributions, and the fraction of duplicated contents. Some of them also suggested similarities between file size distribution and the Pareto or log-normal distribution [4], [8]. However, quantitative estimation requires a survey where 1) system files are discriminated from user files, 2) files are stored in a shared file server for collaboration, 3) data are from industrial file servers, and 4) models are evaluated by statistically testing goodness of fit.

In this study, we propose a model for file size distribution, which is one of the most fundamental statistical characteristics of files. The model is calculated as a weighted summation of multiple log-normal distributions. The number of log-normal distributions is decided based on Akaike's Information Criterion (AIC) to prevent over-fitting [11]. We also describe, based on our model, efficiency estimation for the introduction process of tiered storage technology and for the reduction of used volume. Finally, the model and the estimation are evaluated with actual data.

Manuscript received June 6th, 2012; revised July 2nd, 2012. T. Matsumoto and T. Onoyama are with Hitachi Solutions, Ltd., Higashishinagawa, Shinagawa-ku, Tokyo, Japan ([email protected]; [email protected]). N. Komoda is a professor at Osaka University, 2-1 Yamadaoka, Suita, Osaka, Japan ([email protected]).

II. ABOUT DATA

We use data of shared file servers from 24 divisions of a company: 11 are technical divisions such as research and development, and the other 13 are non-technical divisions such as sales, marketing, and management. Some divisions have tens of thousands of files and others have more than one million files. The sums of file sizes range from several gigabytes to several hundreds of gigabytes.
[Fig 1. Example plot of file size in descending order of file size; x-axis: file number, y-axis: file size.]

Sizes of files in a shared file server of a research and development division of a company are plotted in Fig 1. The x-axis and y-axis represent the number and the size of a file, respectively. This graph is equivalent to a transposed cumulative probability distribution, and demonstrates that there are very few large files. All the other divisions show the same tendency. Fig 2 shows the same data as Fig 1 together with a Pareto distribution [8] on a logarithmic scale. Since the Pareto distribution is proposed for file sizes larger than some threshold, the linear regression line of the logarithm of file size upon the logarithm of file number is calculated with Finney's correction method [12] from the plot for file sizes of 1KB or larger. Fig 3 shows the histogram of the observed data and of two theoretical distributions proposed in previous studies [4], [8]. The number of classes of the histogram
[Fig 2. Example plot of file size in descending order for observed data and Pareto distribution (logarithmic scale).]

[Fig 3. Example plot of histogram of file size for observed data and models in previous studies.]

is decided based on Sturges' formula [13]. The log-normal distribution is calculated from the average and standard deviation of the logarithm of file size. The observed plot seems to have linearity in Fig 2 and shows a roughly bell-shaped form in Fig 3. However, the frequency of the observed data differs from the Pareto distribution by more than 2,000 at 1KB and from the log-normal distribution by almost 1,000 at the peak of the histogram. Because of these apparent discrepancies, the observed histogram shows a statistically significant difference both from the Pareto distribution and from the log-normal distribution with a type I error rate of 0.01. Even if the threshold is set to 1MB, the Pareto distribution still shows a statistically significant difference from the observed data. In all other divisions, the observed data differ significantly both from the Pareto distribution and from the log-normal distribution. Therefore, neither the Pareto distribution nor the log-normal distribution fits the observed data, even though they have roughly similar histogram forms.

III. FILE SIZE MODELING BY WEIGHTED SUMMATION OF MULTIPLE LOG NORMAL DISTRIBUTIONS

A. AIC-based Modeling

We propose modeling file size by a weighted summation of multiple log-normal distributions. Our model is based on the observation that file size distribution depends on content type, e.g., movie files, database files, and executable files are larger than other files [3], [5], and on the idea that more variety should be observed in general-purpose industrial file servers. A weighted summation of c log-normal distributions is calculated to fit the observed data. Similarity is evaluated with the χ2 value from a contingency table. The χ2 value is calculated for various c to minimize AIC.
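This selection of the component count c can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the equivalence derived in the text, namely that minimizing AIC here reduces to minimizing (χ2 + 3c). The χ2 values below are hypothetical.

```python
# Hypothetical chi^2 misfit values for mixtures of c = 1..5 log-normal
# components: the fit improves as c grows, but with diminishing returns.
chi_square_by_c = {1: 900.0, 2: 300.0, 3: 120.0, 4: 118.0, 5: 117.5}

def select_component_count(chi_square_by_c):
    # Minimizing chi^2 + 3c stands in for minimizing AIC: each extra
    # component adds three parameters (average, variance, and weight).
    return min(chi_square_by_c, key=lambda c: chi_square_by_c[c] + 3 * c)

best_c = select_component_count(chi_square_by_c)  # the 3c penalty stops at c = 3
```

With these values, moving from c = 3 to c = 4 improves χ2 by only 2, less than the penalty of 3, so the search stops at c = 3 rather than over-fitting.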
AIC is defined as expression (1), where L is the maximum likelihood and k is the number of degrees of freedom [11].

    AIC = -2 ln L + 2k    (1)

The likelihood ratio λ is L0/L1, where L0 and L1 are the maximum likelihood values under the null hypothesis and in the full parameter space, respectively. AIC can be minimized by minimizing (χ2 + 3c), because of the following three facts. First, L1 can be treated as a constant value when the observed data are given. Second, -2 ln λ asymptotically follows a χ2 distribution as the sample size grows. Third, each log-normal distribution adds three degrees of freedom: average, variance, and weight. We can prevent over-fitting during model calculation, since the AIC value increases when too many parameters are adopted.

B. Comparison of Observed Data and Proposed Model of File Size

According to the previous section, the theoretical distribution of logarithmically scaled file size is calculated as expression (2), where N(μ, σ2) is the normal distribution with average μ and variance σ2, and the wi are the mixture weights:

    f(x) = w1 N(3.28, 0.6) + w2 N(5.1, 0.1) + w3 N(5.95, 0.14) + w4 N(2.14, 0.28)
         + w5 N(4.23, 0.81) + w6 N(6.5, 0.23) + w7 N(0.1, 0.4) + w8 N(0.28, 0.49)
         + w9 N(1.38, 0.14) + w10 N(7.88, 0.25) + w11 N(8.5, 0.9) + w12 N(9.5, 9.0)    (2)
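As an illustration, a density of the form of expression (2) can be evaluated as a weighted sum of normal densities of log file size. This is a minimal sketch, not the authors' implementation: the weights below are hypothetical placeholders normalized to sum to 1, and only the first three (mean, variance) pairs from expression (2) are used.

```python
import math

def normal_pdf(x, mu, var):
    # Density of N(mu, var) at x.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, components):
    """components: list of (weight, mu, variance) triples."""
    return sum(w * normal_pdf(x, mu, var) for w, mu, var in components)

# Hypothetical weights; the (mu, variance) pairs follow the first terms
# of expression (2).
components = [
    (0.5, 3.28, 0.6),
    (0.3, 5.1, 0.1),
    (0.2, 5.95, 0.14),
]
density = mixture_pdf(4.0, components)  # density of log10(file size) at 4.0
```

Because the weights sum to 1 and each component integrates to 1, the mixture itself is a proper probability density.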
[Fig 4. Example plot of histogram of file size for observed data and the proposed model.]

Fig 4 shows the histogram for the observed data and the proposed model. Since the theoretical distribution fits the observed data very well at all classes, no statistically significant difference was observed. Expression (2) shows an exponential decrease of the weights. This decrease indicates that the first term has a dominant effect on the distribution, and supports the observation that the file size distribution shows a roughly bell-shaped form. The first term includes source code files and plain text files, which occupy a large part of the files in the file server. The second term includes HTML files and PDF files. HTML files are larger than XML files on average, because HTML files include specification documents of a programming language and files converted by word processing software. The observed file size distributions in all the other 23 divisions fit their corresponding models very well and demonstrate no statistically significant difference.

IV. EFFICIENCY OF OPERATIONAL MANAGEMENT OF FILE SERVER BASED ON MODEL OF FILE SIZE DISTRIBUTION

A. Cases of Operational Management utilizing Model of File Size Distribution

The file size distribution model can be directly utilized in the following two kinds of estimation for efficient operational management of file servers. The first case is estimating the effect of a file moving policy in the introduction process of tiered storage technology. The processing time of moving files depends both on the size summation and on the number of files. From these dependences, we can expect that assigning a high priority to large files results in an efficient moving policy: a large volume is moved to lower-cost storage in a short time, so the benefit of tiered storage technology can be realized efficiently with this policy.
Since last access time and file size show no correlation (correlation coefficient < 0.1 for all divisions in our data), the file size distribution model in section III A is expected to be effective for estimating the relationship between the number and the size summation of files moved to lower-cost storage. The second case is estimating the effect of deleting unnecessary files by users. When millions of files are stored in a file server, it is obviously unrealistic to manually check the deletability of all files one by one. Reduction of unnecessary files is tractable only when a large volume can be deleted with manual confirmation of a small fraction of the files. Deletability confirmation in descending order of file size is effective for efficient volume reduction when the files are deletable. When some files are confirmed to be undeletable, this confirmation policy is still effective for efficient estimation of the upper limit of the eventually reducible volume.

B. Estimating Efficiency Based on Cumulative File Size

When the top n% of large files occupy p% of the total volume, the relationship between n% and p% is equivalent to the volume efficiency of the introduction process of tiered storage technology, where inactive files are moved to lower-cost storage. The relationship also represents the upper limit of reduction volume per confirmation of file deletability. We can estimate the relationship by integrating the theoretical distributions described in section III A.

V. EXPERIMENTAL RESULT AND EVALUATION

A. Relationship between Number of Files and Cumulative File Size

From expression (2) in section III, Fig 5 shows the relationship between the fraction of file number n% and the fraction of cumulative file size p% in descending order of file size. The value of p% rapidly reaches almost 100%, whereas p% would be equal to n% in randomized order. The rapid increase of p% results from the tiny fraction of large files shown in Fig 1.
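The n%-to-p% relationship can also be approximated numerically by sampling from a fitted distribution. The sketch below uses a single log-normal with hypothetical parameters as a stand-in for the fitted mixture; it is an illustration of the idea, not the paper's procedure.

```python
import random

def top_fraction_volume(n_fraction, mu=3.3, sigma=1.5, samples=100_000, seed=0):
    """Estimate the fraction p of total volume held by the largest
    n_fraction of files, with log10(file size) ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    sizes = sorted((10 ** rng.gauss(mu, sigma) for _ in range(samples)),
                   reverse=True)
    top = sizes[: max(1, int(samples * n_fraction))]
    return sum(top) / sum(sizes)

p = top_fraction_volume(0.01)  # volume share of the top 1% of files
```

For a heavily right-skewed size distribution such as this, the top 1% of files accounts for far more than 1% of the total volume, which is the skew that makes a descending-order policy efficient.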
Since a small value of n% can give a large p% value, a policy with descending order of file size is expected to achieve efficient operational management in both cases of section IV A.

[Fig 5. Fraction of file number and fraction of cumulative file size in descending order of file size, based on our model.]

B. Accuracy Evaluation of Estimating Relationship between Number and Cumulative Size of Files

The fraction of cumulative file size p% is calculated for fractions of file number n% = 1%, 5%, and 10% in the 24 divisions. Figs 6, 7, and 8 show comparison results of observed data and values estimated by the Pareto distribution, the log-normal distribution, and our model of expression (2) in section III, respectively. The plots show strong correlation between observed and estimated values for our model in Fig 8 (correlation coefficient 0.9). In contrast,
no apparent correlation is observed for the Pareto distribution or the log-normal distribution (correlation coefficient 0.06). These results demonstrate that our model can estimate the relationship between n% and p% more accurately than the models of previous studies can, and is suitable for quantitative evaluation.

C. Shared File Servers in Technical/Non-technical Divisions

We compare technical and non-technical divisions regarding the fraction of cumulative file size p% for the fraction of file number n% = 1%, the number of log-normal distributions c calculated according to section III A, and the overview statistics of file servers shown in Table 1. A t-test with Bonferroni correction [14] reveals that only p% shows a statistically significant difference between technical and non-technical divisions. Values of p% in technical and non-technical divisions are shown in Fig 9. These results mean that the overview statistics of a file server are not good at estimating p%, although p% depends on the type of division. They also indicate that our model can estimate p% better.

VI. CONCLUSION

In this study, we propose a file size distribution model for enterprise file servers and describe how the model is effective for efficient operational management of file servers. Data of shared file servers from various technical and non-technical divisions demonstrate that file size distribution can be modeled as a weighted summation of multiple log-normal distributions. The data also demonstrate that integrating the theoretical distribution gives an accurate estimate of the efficiency of a descending-order-of-file-size policy in the introduction process of tiered storage technology and in deleting unnecessary files.
Our estimation can be applied to each file server by two conversions: 1) converting cumulative file size into cost effect on the basis of the price and volume capacity of a file server product, and 2) converting file number into introduction process time on the basis of data transfer rate and metadata writing speed in the case of tiered storage technology, and into labor cost on the basis of the work efficiency of manual confirmation in the case of deleting unnecessary files. Therefore our model contributes to cost-effective file servers from the viewpoint of operational management. We believe that our model can also be utilized in many other simulation-based evaluations, such as access performance and file fragmentation [10], [15]. Furthermore, file access frequencies, the fraction of duplicated content, and other statistical characteristics can be modeled and utilized for further efficiency of enterprise file servers.

TABLE I. OVERVIEW STATISTICS OF FILE SERVER

1. File number
2. File number (logarithmic value)
3. Sum of file sizes
4. Sum of file sizes (logarithmic value)
5. Average file size
6. Average file size (logarithmic value)
7. Maximum file size
8. Maximum file size (logarithmic value)
9. Maximum file size (logarithmic value) / file number (logarithmic value)

[Fig 6. Fraction of cumulative file size occupied by top 1% of large files in observed data and in Pareto distribution.]

[Fig 7. Fraction of cumulative file size occupied by top 1% of large files in observed data and in log-normal distribution.]

[Fig 8. Fraction of cumulative file size occupied by top 1% of large files in observed data and in our model.]

[Fig 9. Fraction of cumulative file size occupied by top 1% of large files in technical and non-technical divisions.]
REFERENCES

[1] E. Anderson, J. Hall, J. Hartline, M. Hobbs, A. R. Karlin, J. Saia, R. Swaminathan, and J. Wilkes, "An experimental study of data migration algorithms," in Proc. of the 5th International Workshop on Algorithm Engineering, 2001.
[2] J. Malhotra, P. Sarode, and A. Kamble, "A review of various techniques and approaches of data deduplication," International Journal of Engineering Practices, vol. 1, no. 1, pp. 1-8, 2012.
[3] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch, "A five-year study of file-system metadata," ACM Transactions on Storage, vol. 3, no. 3, 2007.
[4] A. B. Downey, "The structural cause of file size distributions," in Proc. of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2001.
[5] K. M. Evans and G. H. Kuenning, "A study of irregularities in file-size distributions," in Proc. of the International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
[6] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," in Proc. of the 9th USENIX Conference on File and Storage Technologies, 2011.
[7] M. Satyanarayanan, "A study of file sizes and functional lifetimes," in Proc. of the Eighth ACM Symposium on Operating Systems Principles, pp. 96-108, 1981.
[8] P. Barford and M. Crovella, "Generating representative web workloads for network and server performance evaluation," in Proc. of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, 1998.
[9] T. Gibson and E. L. Miller, "An improved long-term file-usage prediction algorithm," in Proc. of the Annual International Conference on Computer Measurement and Performance, 1999.
[10] SPEC SFS 2008 benchmark.
[11] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proc. of the 2nd International Symposium on Information Theory, pp. 267-281, 1973.
[12] D. Finney, "On the distribution of a variate whose logarithm is normally distributed," Journal of the Royal Statistical Society, Supplement VII, 1941.
[13] H. A. Sturges, "The choice of a class interval," Journal of the American Statistical Association, vol. 21, no. 153, pp. 65-66, 1926.
[14] G. R. Miller, Simultaneous Statistical Inference, 2nd edition, New York: Springer-Verlag, 1981.
[15] T. Nakamura and N. Komoda, "Size adjusting pre-allocation methods to improve fragmentation and performance on simultaneous file creation by synchronous write," IPSJ Journal, vol. 50, no. 11, 2009.
STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance
Principles of Statistics STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis
The State of Cloud Storage
203 Industry Report A Benchmark Comparison of Performance, Availability and Scalability Executive Summary In the last year, Cloud Storage Providers (CSPs) delivered over an exabyte of data under contract.
On Benchmarking Popular File Systems
On Benchmarking Popular File Systems Matti Vanninen James Z. Wang Department of Computer Science Clemson University, Clemson, SC 2963 Emails: {mvannin, jzwang}@cs.clemson.edu Abstract In recent years,
Empirical study of Software Quality Evaluation in Agile Methodology Using Traditional Metrics
Empirical study of Software Quality Evaluation in Agile Methodology Using Traditional Metrics Kumi Jinzenji NTT Software Innovation Canter NTT Corporation Tokyo, Japan [email protected] Takashi
Regularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
3.4 The Normal Distribution
3.4 The Normal Distribution All of the probability distributions we have found so far have been for finite random variables. (We could use rectangles in a histogram.) A probability distribution for a continuous
Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream
Manhattan Center for Science and Math High School Mathematics Department Curriculum
Content/Discipline Algebra 1 Semester 2: Marking Period 1 - Unit 8 Polynomials and Factoring Topic and Essential Question How do perform operations on polynomial functions How to factor different types
DATA INTERPRETATION AND STATISTICS
PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE
Factors affecting online sales
Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4
Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so:
Chapter 7 Notes - Inference for Single Samples You know already for a large sample, you can invoke the CLT so: X N(µ, ). Also for a large sample, you can replace an unknown σ by s. You know how to do a
Penalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
Quantitative Methods for Finance
Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain
Exploratory Data Analysis
Exploratory Data Analysis Johannes Schauer [email protected] Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction
GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE
ACTA UNIVERSITATIS AGRICULTURAE ET SILVICULTURAE MENDELIANAE BRUNENSIS Volume 62 41 Number 2, 2014 http://dx.doi.org/10.11118/actaun201462020383 GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE Silvie Kafková
Directions for using SPSS
Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...
Using Excel for inferential statistics
FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
Part 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
I/A Series Information Suite AIM*SPC Statistical Process Control
I/A Series Information Suite AIM*SPC Statistical Process Control PSS 21S-6C3 B3 QUALITY PRODUCTIVITY SQC SPC TQC y y y y y y y y yy y y y yy s y yy s sss s ss s s ssss ss sssss $ QIP JIT INTRODUCTION AIM*SPC
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of
AP Physics 1 and 2 Lab Investigations
AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
Data Deduplication Scheme for Cloud Storage
26 Data Deduplication Scheme for Cloud Storage 1 Iuon-Chang Lin and 2 Po-Ching Chien Abstract Nowadays, the utilization of storage capacity becomes an important issue in cloud storage. In this paper, we
Threshold Autoregressive Models in Finance: A Comparative Approach
University of Wollongong Research Online Applied Statistics Education and Research Collaboration (ASEARC) - Conference Papers Faculty of Informatics 2011 Threshold Autoregressive Models in Finance: A Comparative
Two-Level Metadata Management for Data Deduplication System
Two-Level Metadata Management for Data Deduplication System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3.,Young Woong Ko 1 1 Dept. of Computer Engineering, Hallym University Chuncheon, Korea { kongjs,
An Efficient Combination of Dispatch Rules for Job-shop Scheduling Problem
An Efficient Combination of Dispatch Rules for Job-shop Scheduling Problem Tatsunobu Kawai, Yasutaka Fujimoto Department of Electrical and Computer Engineering, Yokohama National University, Yokohama 240-8501
Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
Data Preparation and Statistical Displays
Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability
