Statistics for BIG data



Similar documents
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

The Data Mining Process

COMP9321 Web Application Engineering

An Overview of Knowledge Discovery Database and Data mining Techniques

Introduction to Data Mining

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

MS1b Statistical Data Mining

Data Mining Solutions for the Business Environment

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Statistical Challenges with Big Data in Management Science

Information Management course

Sunnie Chung. Cleveland State University

not possible or was possible at a high cost for collecting the data.

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Collaborations between Official Statistics and Academia in the Era of Big Data

Big Data and Marketing

Big Data Strategies Creating Customer Value In Utilities

Data Mining + Business Intelligence. Integration, Design and Implementation

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

Data Isn't Everything

Healthcare Measurement Analysis Using Data mining Techniques

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Principles of Data Mining by Hand&Mannila&Smyth

Big Data and Analytics: Challenges and Opportunities

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Azure Machine Learning, SQL Data Mining and R

A Review of Data Mining Techniques

Information Visualization WS 2013/14 11 Visual Analytics

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining and Neural Networks in Stata

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Proposal for New Program: BS in Data Science: Computational Analytics

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

DATA MINING TECHNIQUES AND APPLICATIONS

Data Mining Analytics for Business Intelligence and Decision Support

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

Proposal for the Theme on Big Data. Analytics. Qiang Yang, HKUST Jiannong Cao, PolyU Qi-man Shao, CUHK. May 2015

Machine Learning for Data Science (CS4786) Lecture 1

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

An Introduction to Data Mining

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

This Symposium brought to you by

PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA

Perspectives on Data Mining

Data, Measurements, Features

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

The Scientific Data Mining Process

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

BIG DATA What it is and how to use?

An Introduction to Advanced Analytics and Data Mining

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Master of Science in Health Information Technology Degree Curriculum

Machine Learning Introduction

Chapter 6 - Enhancing Business Intelligence Using Information Systems

Data Warehousing and Data Mining in Business Applications

Data are everywhere. IBM projects that every day we generate 2.5 quintillion bytes of data. In relative terms, this means 90

Big Data Hope or Hype?

2015 Workshops for Professors

Analytics on Big Data

Advanced analytics at your hands

A New Era Of Analytic

Predictive Analytics Certificate Program

Data UNC. Vinayak Deshpande

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Big Data a threat or a chance?

Data Mining Applications in Higher Education

What is Data Science? Girl Develop It! Meetup Renée M. P. Teate, March 2015

Big Data Big Knowledge?

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Learning outcomes. Knowledge and understanding. Competence and skills

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Statistical Models in Data Mining

Of all the data in recorded human history, 90 percent has been created in the last two years. - Mark van Rijmenam, Think Bigger, 2014

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Chapter ML:XI. XI. Cluster Analysis

ANALYTICS CENTER LEARNING PROGRAM

Chapter 12 Discovering New Knowledge Data Mining

Clustering Methods in Data Mining with its Applications in High Education

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Nagarjuna College Of

MEDICAL DATA MINING. Timothy Hays, PhD. Health IT Strategy Executive Dynamics Research Corporation (DRC) December 13, 2012

TIETS34 Seminar: Data Mining on Biometric identification

CHAPTER 1 INTRODUCTION

The University of Jordan

AMIS 7640 Data Mining for Business Intelligence

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

The Big Deal about Big Data. Mike Skinner, CPA CISA CITP HORNE LLP

Bijan Raahemi, Ph.D., P.Eng, SMIEEE Associate Professor Telfer School of Management and School of Electrical Engineering and Computer Science

Intrusion Detection via Machine Learning for SCADA System Protection

Transcription:

Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before and After What is BIGdata? Who is working on BIGdata? What kinds of Statistics are needed? 1

What gets measured, get managed. If you cannot measure it, You cannot improve it! Wikipedia Big Data A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. 目 前 無 法 顯 示 此 圖 像 目 前 無 法 顯 示 此 圖 像 It looks at large data sets and turns raw data into actionable intelligence and insights to make better, more valuable marketing decisions. 2

Why Now? Computer power The price of digital storage Data warehouses are being widely implemented Astronomy, Genetic, etc Amazon, Google, Facebook, Yahoo, Youtube. Censors/RFID DeVeaux Big Data 2.5 EB bytes of data is created every day. 2,500,000,000,000,000,000 bytes More than 30 million sensors are being used. More than 4 billion people were using mobile phones in 2010. 90% of the total data was created in last two years. 10 Who is working on BIGdata? Computer Science Information Technology Applied Mathematics Computational Math/Science Operation Research Industrial Engineering Informatics Science/Data Scientist Electronic Engineering Political Science Anyone who needs funding $$$ BIG Data This is a golden age for Statistics, But not necessary for statisticians. 3

Genetic Study Astronomy Geology 目 前 無 法 顯 示 此 圖 像 G.I.S & Geo-Statistics 4

Characteristics of Big Data Emphasis on Automatic Computer users are domain experts, not computer professionals Too much data Too much technology Not enough useful information When is Automatic Analysis Useful? DeVeaux 17 BIGdata: The word Statistics was nowhere to be seen! 5

Keep in mind BIG Data: What should be the fourth V? Value? Big data is not necessarily complete, accurate, or true! Value is in the eye of beholder, not the person crunching the numbers! Bigger does not always implies better! If there is initial bias Challenges BIGdata typically may not have much Value at all: DRIP effect Data Rich Information Poor 87.5% global data hasn t been really developed and used. Large databases only used in data entry and query statistics without deep analysis on big data. Structured databases are used to store data, resulting in distortion and missing of large amounts on unstructured data. 6

Big Data: 3+1=4 V s Randomness Opportunity for Statisticians Volume Data at rest Velocity Data in Motion Variety Data in many forms Veracity Data in Doubt Big Data: The 4V s Big data in real world Skills Politics Techniques Cognition Privacy 7

What kind of Stat is needed: General Prediction vs Estimation (Inference) Only Statistician talk about inference Common Belief Big data is better than Small data New Methodologies are so powerful and work well The death of p-value and the death of science Fundamental Theory All theories with iid one population assumption CLT, LLN, etc What Statistics: Wish-List Seek out high-impact problems Provide structure for poorly defined problems Develop new theories for new (reality) BIG data world. Big Data: Basic concerns What to collect? Bias? How to collect? How to store? How to use? What to use? Analysis and potential risks? What Stat: Technical Details Statistics of (many descriptive) statistics how could we summarize thousands of correlations? How about thousands of p-values? ANOVA s? Regression models? Histograms? Classification and Clustering Low-dimension behavior Feature Extraction (for extremes) and pattern recognition (for norm) New type/structure of data How to build up a regression (say) model when both input and output variables are network? 8

A good Example Some Remarks Data have complete meaning in themselves, no theory is required. cf. Data has no meaning in themselves (BH 2, 1978) Using a Scientific approach, as oppose to an Algorithm approach, to Big data including data collection, data analysis, mode selection, and feature interpretation. Susan Hockfield (former MIT President) Around the dawn of the 20th century, physicists discovered the basic building blocks of the universe; a parts list, if you will. Engineers said we can build something from this list, and produced the electronics revolution, and subsequently the computer revolution. More recently, biologists have discovered and mapped the basic parts list of life the human genome. Engineers have said we can build something from this list, and are producing a revolution in personalized medicine. Who is Building Something Meaningful from the Statistical Science Parts List of Tools? What s Statistics All About? Classical View 9

Physics Mathematics ( 物 理 () 数 学 ) 科 学 探 索 Biology Chemistry (( 生 化 物 学 )) Deduction ( 演 绎 ) Induction ( 归 纳 ) Statistics( 统 计 分 析 ) 比 较 偏 重 于 观 察 与 试 验 的 累 积 ( 也 就 是 归 纳 类 型 的 研 究 ) 万 事 万 物 有 其 共 同 性, 这 使 得 统 计 变 成 可 能 ; 万 事 万 物 有 其 不 同 性, 这 使 得 统 计 变 成 必 须 Things are the same, this makes Statistics possible. Things are different, this makes Statistics necessary. 鏡 子 與 窗 子 成 功 者 失 敗 者 A Data Explosion is Coming! Are You Ready? RFID Study Group 10

BIGdata: Statistics is essential! Speak Loud and Remind all your friends and enemies! Computer Science Root Database SVM: Support Vector Machine ANN: Artificial Neural Network MBA: Market Basket Analysis (Association Rule) Genetic Algorithms OLAP: On-Line Analytic Processing Link Analysis High Dimensional Plots KDD Process Machine Learning Text Mining Statistical Root (Linear & Nonlinear) Regression Logistic Regression Classification Clustering Density Estimation, Bumps and Ridges Time Series: Trend & Spectral Estimation Dimension Reduction Techniques High Dimensional Plots Classical Multivariate Methods Statistical & Computer Science Mixture Data Visualization Dimension Reduction Multi-dimensional Scaling Local Linear Embedding Principal Component & Principal Manifolds Local Tangent Space Alignment Independent Component Analysis K-means & kd-tree Support Vector Machine 11

Which Professions??? Old Business Model Statistics All kind of errors Type-I error Type-II error Pure error Lack of fit Loss function Failure Rate Hazard Model Penalty False Discovery Rate Risk Computer Science Faithfulness Greedy Search Smart Algorithm Intelligent Procedure Golden Standard Knowledge Discovery Profit and Loss Social Investment Shareholder Value New Business Model Community Development Product Investment Environment Sustainability Send $500 to Dennis Lin STILL QUESTION? University Distinguished Professor 317 Thomas Building Department of Statistics Penn State University +1 814 865-0377 (phone) +1 814 863-7114 (fax) DennisLin@psu.edu (Customer Satisfaction or your money back!) 12