REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION

Pilar Rey del Castillo
May 2013

Introduction

The exploitation of the vast amount of data generated by ICT tools, covering a wide variety of economic and social activities, is both a challenge and an opportunity for official statistics. How to extract significant value for the production of statistical figures from the diversity of data available remains to be discovered. This paper proposes to start exploring some Big Data sources by following a straightforward path to results. The next section gives a preliminary insight into the issues at stake, and the following one presents some ideas for a Eurostat road map.

Some defining features of Big Data

The most conspicuous feature of Big Data, compared with traditional statistical sources, is that they do not stem from a prior design aimed at obtaining specific statistics but become available as traces of human activity. This makes it difficult to use traditional statistical methods and tools such as probabilistic sampling and statistical classifications, and renders the Generic Statistical Business Process Model inapplicable.

The outcomes of a recent survey of executives from a wide range of industries around the world are illuminating: although a large share of the respondents agreed that data had become an important factor for their business, many companies were struggling with basic aspects of data management and were still attempting to exploit data effectively [1]. This confirms that extracting useful information from this kind of data is a non-obvious and rather difficult task that should be carefully planned.
Although the appearance of some statistical figures based on Big Data may suggest that the information is found directly in the data, ready to be published, the huge amounts of data that had to be analysed and processed beforehand to achieve these results should be borne in mind. Nevertheless, the appeal of a potential reduction in respondent burden and costs, together with the general aim of improving the productivity of the ESS, puts increasing pressure on statisticians to use Big Data as a source of statistical information.

Experience with administrative data, another source that shares with Big Data the feature of not being designed for statistical purposes, may illuminate the road map by showing which features the two have in common and which set them apart. The main aspects of administrative data to consider are the following:

a) Methods to obtain statistical information from administrative data usually depend on the specific data, which makes it difficult to establish general rules or to use generic production process models.

b) Administrative data are not structured as statistical data are; that is, they do not use statistical classifications and definitions. They still show a certain structure related to the purpose for which they were created. This means that some translating, linking or harmonizing of structures (units, definitions, classifications...) always has to be done.

c) Sampling procedures are not used to obtain the reporting units, but there is frequently some idea of how representative they are of the population of interest (sometimes all the population units are included).

d) The volume of administrative data is not usually a problem, and they can be treated with the statistical procedures used for other typical sources.

e) The ways in which they are increasingly used to produce statistical figures can be classified as (i) totally replacing statistical sources, (ii) partially replacing statistical sources, with the information completed by means of record linkage, matching or other procedures, and (iii) providing completely new statistical figures that complement the available statistical information from other perspectives. The first two may in theory yield significant reductions in costs and respondent burden, but they frequently entail new tasks of translating, linking or harmonizing that are not necessary when completely new statistical figures are produced. An example of this last case is the figures on registered unemployment.

As for Big Data, the corresponding features are:

a) Owing to the heterogeneity of the Big Data available, methods to produce statistical information have to be developed ad hoc for each case, exactly as with administrative data.

b) Some Big Data have a certain structure related to their source, and some are just unstructured text strings. Good metadata are not usually available, and in most cases the work needed to harmonize the data or translate them into statistical structures would seem to be enormous.

c) Apart from not using sampling procedures, Big Data frequently come from private companies, and their representativeness and coverage of the populations of interest for official statistics are difficult to assess.

d) The name Big Data refers precisely to the huge volume. This dimension affects storage and processing, which frequently fall outside the scope of traditional statistical tools.

e) The way Big Data could be used to produce statistical figures is a crucial issue. For the reasons given in the previous points, it does not seem easy to find Big Data able to replace statistical sources, totally or partially, in the short term, and pursuing that path may be too expensive in time and resources. A sound approach would therefore be to start by searching for sources that could provide completely new and independent statistical figures, not adapted to traditional statistical structures but offering new perspectives. For example, instead of looking for sources to replace the Household Budget Survey (HBS), one could try to build indicators of its trends over time. Once progress has been made in this area, the new set of statistics available will provide a valuable basis for redesigning the products and the production process of official statistics.

There may be opportunities to tackle the specific problems of Big Data by using suitable tools:

1. An apparently critical problem is the volume of the Big Data available: it is necessary to move away from exclusive dependence on statistical methods that cannot handle this volume of information and to adopt a more diverse set of tools. This can be addressed through algorithms specially developed for the purpose, such as data mining methods. These algorithms have the computational efficiency required and are scalable, that is, able to handle a growing amount of work in a capable manner or to be enlarged to accommodate that growth [2]. The state of the art provides a great variety of data mining tools for different objectives: classification, clustering, regression, association, feature extraction...

A first stage of exploration using data mining procedures should usually be carried out to learn about the unknown data structure and the possible outcomes, and later combined with traditional statistical procedures. The type and form of the Big Data determine the data mining tool to be used. The statistical production process from Big Data should therefore begin with an exploratory analysis, which may be followed by a combination of data mining and traditional statistical procedures to produce the best results.

2. Another important concern is the representativeness and validity of the statistics produced. Probabilistic sampling gives traditional statistics a theoretical framework that ensures confidence in the figures produced, with accuracy assessed through sampling errors. Most of the Big Data available cannot be fitted into this framework, and other procedures have to be devised. This seems to be an important weakness of Big Data, and efforts should be focused on it. Meanwhile, successful uses of Big Data could be examined with a view to following a similar approach. Two well-known examples are briefly considered here. The first is the estimation of the incidence of flu in different countries and regions around the world from Google searches on flu-related topics [3]. These estimates have been found to match traditional flu activity indicators very closely.
Similarly, a recent article in BBC News [4] reported that Google searches for finance-related terms may predict market moves, and that an investment strategy based on these search volume data between 2004 and 2011 would have made a profit of 326%.

These examples have two important features in common (apart from being Google products) that may help with the problem of representativeness, coverage and validity. The first is that both estimate changes or movements over time, not absolute figures. It is a well-established statistical principle that changes (over time or space) can be estimated more reliably than absolute figures, because some biases and errors cancel out when the change is computed. The first attempts to use Big Data should therefore perhaps aim at producing estimates of changes or trends.

The other relevant feature shared by both examples is the criterion used to evaluate the results. What is estimated are proxy variables that perform well in tracking the movements of a phenomenon of interest. That is, performance is assessed in terms of similarity to other available figures measuring the same or an analogous thing. In the same way, the performance of Big Data could be evaluated in the first instance by its similarity to or agreement with other available measures rather than by a sampling error criterion. This makes sense from a data mining perspective, where the equivalent of fitting a model is tuning an algorithm so that it fits the real world.

When many different statistical figures are produced from different and independent Big Data sources following these principles, the coherence and agreement among them may serve as an argument for the validity and representativeness of the whole system.

3. Although Big Data may not be structured as statistical data are, they may have the same type of structure, or lack of structure, across countries. This would make harmonization between countries unnecessary, which is of special interest for transnational statistics.

4. Other concerns about Big Data seem similar to those raised by statistical sources, such as the appearance of various types of problems or errors: noise, incompleteness, missing data, reporting errors, outliers... Data editing (cleaning, checking, imputing...) is a time- and resource-consuming activity in traditional statistical processing, and similar methods could be used to deal with these problems. Some errors (reporting errors, incompleteness...) are likely to occur less often in Big Data because no human intervenes at the origin, although machine or system failures may also happen and produce other errors. A new type of problem that does not occur with statistical sources but may emerge in Big Data is imprecision (for instance, vague or categorical measures such as high, medium, low...); it may be tackled with other data mining tools such as fuzzy and rough sets. Some data mining procedures are interesting because they are robust, in the sense of being tolerant of erroneous data or of departures from data assumptions. In any case, all these methods have to be developed on an ad hoc basis.

A final remark is that the opportunity offered by Big Data rests on the reduction of respondent burden and on the fact that the data can sometimes be obtained quickly. Hence, before engaging in a complex process to turn Big Data into a reliable source for statistics, the potential gains should be carefully analysed.
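As a purely illustrative sketch of two ideas from point 2 above (changes can be estimated more reliably than levels, and a proxy can be judged by its agreement with a benchmark), the following Python code simulates a benchmark series and a Big Data proxy that covers only part of the population. The simulated series, the 30% undercoverage, the noise level and all function names are assumptions made for this illustration; none of them comes from the sources cited in this paper.

```python
import random
import statistics


def simulate(n_periods=24, bias=0.30, noise_sd=0.01, seed=0):
    """Return (truth, proxy). The proxy sees only a share (1 - bias)
    of the true level in every period, plus small multiplicative noise."""
    rng = random.Random(seed)
    truth = [100.0]
    for _ in range(n_periods - 1):
        truth.append(truth[-1] * (1 + rng.gauss(0.01, 0.03)))
    proxy = [t * (1 - bias) * (1 + rng.gauss(0.0, noise_sd)) for t in truth]
    return truth, proxy


def growth_rates(series):
    """Period-on-period relative changes of a series."""
    return [b / a - 1 for a, b in zip(series, series[1:])]


def level_error(truth, proxy):
    """Mean absolute relative error of the proxy when used as a level."""
    return statistics.fmean([abs(p - t) / t for t, p in zip(truth, proxy)])


def change_error(truth, proxy):
    """Mean absolute error of the proxy when used to estimate changes."""
    g_true, g_proxy = growth_rates(truth), growth_rates(proxy)
    return statistics.fmean([abs(x - y) for x, y in zip(g_true, g_proxy)])


def pearson(a, b):
    """Pearson correlation, used here as a simple agreement measure."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    truth, proxy = simulate()
    print("error in levels :", round(level_error(truth, proxy), 3))
    print("error in changes:", round(change_error(truth, proxy), 3))
    print("agreement of changes with benchmark:",
          round(pearson(growth_rates(truth), growth_rates(proxy)), 3))
```

Because the coverage bias is the same in every period, it appears in both the numerator and the denominator of each growth rate and cancels, so the proxy can track movements well even though its level is far off. A bias that drifts over time would not cancel in this way, which is one reason why agreement with an independent benchmark remains a useful check.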
The reductions in cost and respondent burden and the gains in timeliness that Big Data offer may offset a possible loss of accuracy, or of quality in general.

A possible road map to exploit Big Data

This section sketches a few actions that Eurostat could promote as a first step towards exploiting Big Data sources:

1. Identify possible Big Data sources. These may be private or public, internet or non-internet; of particular interest are sources with international scope that are suitable for producing indicators of trends or changes in different economic and social activities. Access to these data and possible problems (confidentiality, ownership...) should be studied as well.

2. Gather information on practices in European countries regarding the use of Big Data for producing statistics, classifying the methods and tools used and the outcomes produced. This would provide information on alternative approaches.

3. Launch pilot research projects to produce statistical figures from the Big Data sources identified. An example of a possible research exercise using a non-internet Big Data source is the production of an indicator of the evolution of household budgets from the transaction records of a department store. It could be obtained by using association rules as a first step and then computing weighted indices. These can be checked against the outcomes of alternative sources such as the annual HBS.

References

[1] Big Data: Lessons from the leaders, The Economist Intelligence Unit Limited,
[2] André B. Bondi, Characteristics of scalability and their impact on performance, Proceedings of the 2nd international workshop on Software and performance, Ottawa, Ontario, Canada, 2000, ISBN X.
[3]
[4]
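Illustrative sketch for the pilot exercise in action 3. The paper does not specify how the association rules or the weighted indices would be computed, so the following Python code is only one possible reading: pairwise association rules (support and confidence) are used to explore which product categories are bought together, and an index is then computed as a weighted mean of category expenditure relatives. The basket data, the category names, the weights (imagined here as budget shares taken from an external source) and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.2):
    """Support and confidence of rules A -> B for pairs of categories.

    Returns a dict mapping (A, B) to (support, confidence), where
    support is the share of baskets containing both A and B and
    confidence is the share of baskets with A that also contain B.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    rules = {}
    for pair, count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        a, b = sorted(pair)
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


def weighted_index(base_spend, current_spend, weights):
    """Weighted mean of expenditure relatives, base period = 100."""
    total_weight = sum(weights[c] for c in base_spend)
    weighted_sum = sum(
        weights[c] * current_spend[c] / base_spend[c] for c in base_spend
    )
    return 100 * weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical baskets: each one lists the categories in a transaction.
    baskets = [
        ["food", "clothing"],
        ["food"],
        ["food", "clothing", "household"],
        ["clothing", "household"],
        ["food", "household"],
    ]
    for rule, (support, confidence) in sorted(pair_rules(baskets).items()):
        print(rule, "support", round(support, 2), "confidence", round(confidence, 2))

    # Hypothetical category spending in two periods and external weights.
    base = {"food": 100.0, "clothing": 50.0}
    current = {"food": 110.0, "clothing": 45.0}
    shares = {"food": 0.75, "clothing": 0.25}
    print("index:", round(weighted_index(base, current, shares), 1))
```

Taking the weights from an external source rather than from the store's own sales is a design choice of this sketch: it is one way to keep the customer mix of a single retailer from dominating the indicator, and the result would still need to be compared with the annual HBS as the paper suggests.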

Appendix B Data Quality Dimensions

Appendix B Data Quality Dimensions Appendix B Data Quality Dimensions Purpose Dimensions of data quality are fundamental to understanding how to improve data. This appendix summarizes, in chronological order of publication, three foundational

More information

OECD SHORT-TERM ECONOMIC STATISTICS EXPERT GROUP (STESEG)

OECD SHORT-TERM ECONOMIC STATISTICS EXPERT GROUP (STESEG) OECD SHORT-TERM ECONOMIC STATISTICS EXPERT GROUP (STESEG) 10-11 September 2009 OECD Conference Centre, Paris Session II: Short-Term Economic Statistics and the Current Crisis A national statistics office

More information

ICT Perspectives on Big Data: Well Sorted Materials

ICT Perspectives on Big Data: Well Sorted Materials ICT Perspectives on Big Data: Well Sorted Materials 3 March 2015 Contents Introduction 1 Dendrogram 2 Tree Map 3 Heat Map 4 Raw Group Data 5 For an online, interactive version of the visualisations in

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

THE JOINT HARMONISED EU PROGRAMME OF BUSINESS AND CONSUMER SURVEYS

THE JOINT HARMONISED EU PROGRAMME OF BUSINESS AND CONSUMER SURVEYS THE JOINT HARMONISED EU PROGRAMME OF BUSINESS AND CONSUMER SURVEYS List of best practice for the conduct of business and consumer surveys 21 March 2014 Economic and Financial Affairs This document is written

More information

Data quality and metadata

Data quality and metadata Chapter IX. Data quality and metadata This draft is based on the text adopted by the UN Statistical Commission for purposes of international recommendations for industrial and distributive trade statistics.

More information

DATA WAREHOUSE AND DATA MINING NECCESSITY OR USELESS INVESTMENT

DATA WAREHOUSE AND DATA MINING NECCESSITY OR USELESS INVESTMENT Scientific Bulletin Economic Sciences, Vol. 9 (15) - Information technology - DATA WAREHOUSE AND DATA MINING NECCESSITY OR USELESS INVESTMENT Associate Professor, Ph.D. Emil BURTESCU University of Pitesti,

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

ANNUAL QUALITY REPORT

ANNUAL QUALITY REPORT ANNUAL QUALITY REPORT FOR THE SURVEY ANNUAL STATISTICAL SURVEY ON THE QUANTITY OF WASTE AT WASTE LANDFILL SITES (KO-U) FOR 2013 Prepared by: Mojca Žitnik, Marko Polh, Department for Environment and Energy

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

5 Discussion and Implications

5 Discussion and Implications 5 Discussion and Implications 5.1 Summary of the findings and theoretical implications The main goal of this thesis is to provide insights into how online customers needs structured in the customer purchase

More information

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand

More information

Concept and Project Objectives

Concept and Project Objectives 3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the

More information

Cloud computing based big data ecosystem and requirements

Cloud computing based big data ecosystem and requirements Cloud computing based big data ecosystem and requirements Yongshun Cai ( 蔡 永 顺 ) Associate Rapporteur of ITU T SG13 Q17 China Telecom Dong Wang ( 王 东 ) Rapporteur of ITU T SG13 Q18 ZTE Corporation Agenda

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information

Will big data transform official statistics?

Will big data transform official statistics? Will big data transform official statistics? Denisa Florescu, Martin Karlberg, Fernando Reis, Pilar Rey Del Castillo, Michail Skaliotis and Albrecht Wirthmann 1 Abstract Official Statistics, confronted

More information

Intrusion Detection System using Log Files and Reinforcement Learning

Intrusion Detection System using Log Files and Reinforcement Learning Intrusion Detection System using Log Files and Reinforcement Learning Bhagyashree Deokar, Ambarish Hazarnis Department of Computer Engineering K. J. Somaiya College of Engineering, Mumbai, India ABSTRACT

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

THE INTELLIGENT BUSINESS INTELLIGENCE SOLUTIONS

THE INTELLIGENT BUSINESS INTELLIGENCE SOLUTIONS THE INTELLIGENT BUSINESS INTELLIGENCE SOLUTIONS ADRIAN COJOCARIU, CRISTINA OFELIA STANCIU TIBISCUS UNIVERSITY OF TIMIŞOARA, FACULTY OF ECONOMIC SCIENCE, DALIEI STR, 1/A, TIMIŞOARA, 300558, ROMANIA ofelia.stanciu@gmail.com,

More information

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH 205 A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH ABSTRACT MR. HEMANT KUMAR*; DR. SARMISTHA SARMA** *Assistant Professor, Department of Information Technology (IT), Institute of Innovation in Technology

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept Statistics 215b 11/20/03 D.R. Brillinger Data mining A field in search of a definition a vague concept D. Hand, H. Mannila and P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge. Some definitions/descriptions

More information

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved CHAPTER SIX DATA Business Intelligence 2011 The McGraw-Hill Companies, All Rights Reserved 2 CHAPTER OVERVIEW SECTION 6.1 Data, Information, Databases The Business Benefits of High-Quality Information

More information

OUTLIER ANALYSIS. Data Mining 1

OUTLIER ANALYSIS. Data Mining 1 OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Types of Studies. Systematic Reviews and Meta-Analyses

Types of Studies. Systematic Reviews and Meta-Analyses Types of Studies Systematic Reviews and Meta-Analyses Important medical questions are typically studied more than once, often by different research teams in different locations. A systematic review is

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support Rok Rupnik, Matjaž Kukar, Marko Bajec, Marjan Krisper University of Ljubljana, Faculty of Computer and Information

More information

Identifying IT Markets and Market Size

Identifying IT Markets and Market Size Identifying IT Markets and Market Size by Number of Servers Prepared by: Applied Computer Research, Inc. 1-800-234-2227 www.itmarketintelligence.com Copyright 2011, all rights reserved. Identifying IT

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING

COURSE RECOMMENDER SYSTEM IN E-LEARNING International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand

More information

Information Visualization WS 2013/14 11 Visual Analytics

Information Visualization WS 2013/14 11 Visual Analytics 1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and

More information

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013 ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

More information

Data mining and official statistics

Data mining and official statistics Quinta Conferenza Nazionale di Statistica Data mining and official statistics Gilbert Saporta président de la Société française de statistique 5@ S Roma 15, 16, 17 novembre 2000 Palazzo dei Congressi Piazzale

More information

Business Information Systems. IT Enabled Services And Emerging Technologies. Chapter 4: Facilitated e-learning Part 1 of 2 CA M S Mehta, FCA

Business Information Systems. IT Enabled Services And Emerging Technologies. Chapter 4: Facilitated e-learning Part 1 of 2 CA M S Mehta, FCA Business Information Systems IT Enabled Services And Emerging Technologies Chapter 4: Facilitated e-learning Part 1 of 2 CA M S Mehta, FCA 1 Business Information Systems Task Statements 1.6 Consider the

More information

Data Discovery, Analytics, and the Enterprise Data Hub

Data Discovery, Analytics, and the Enterprise Data Hub Data Discovery, Analytics, and the Enterprise Data Hub Version: 101 Table of Contents Summary 3 Used Data and Limitations of Legacy Analytic Architecture 3 The Meaning of Data Discovery & Analytics 4 Machine

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 File ID Filename Version uvapub:122992 1: Introduction unknown SOURCE (OR

More information

1. Understanding Big Data

1. Understanding Big Data Big Data and its Real Impact on Your Security & Privacy Framework: A Pragmatic Overview Erik Luysterborg Partner, Deloitte EMEA Data Protection & Privacy leader Prague, SCCE, March 22 nd 2016 1. 2016 Deloitte

More information

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration

More information

2. Issues using administrative data for statistical purposes

2. Issues using administrative data for statistical purposes United Nations Statistical Institute for Asia and the Pacific Seventh Management Seminar for the Heads of National Statistical Offices in Asia and the Pacific, 13-15 October, Shanghai, China New Zealand

More information

Automatic Document Categorization A Hummingbird White Paper

Automatic Document Categorization A Hummingbird White Paper Automatic Document Categorization A Hummingbird White Paper Automatic Document Categorization While every attempt has been made to ensure the accuracy and completeness of the information in this document,

More information

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress)

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) Leo Pipino University of Massachusetts Lowell Leo_Pipino@UML.edu David Kopcso Babson College Kopcso@Babson.edu Abstract: A series of simulations

More information

Current Situations and Issues of Occupational Classification Commonly. Used by Private and Public Sectors. Summary

Current Situations and Issues of Occupational Classification Commonly. Used by Private and Public Sectors. Summary Current Situations and Issues of Occupational Classification Commonly Used by Private and Public Sectors Summary Author Hiroshi Nishizawa Senior researcher, The Japan Institute for Labour Policy and Training

More information

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
