Introducing Big Data. Abstract. with Small Changes. Agenda. Big Data in the News. Bits and Bytes



Similar documents
Introduction to the Mathematics of Big Data. Philippe B. Laval

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Data Storage. 1s and 0s

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Here is a diagram of a simple computer system: (this diagram will be the one needed for exams) CPU. cache

SECURITY MEETS BIG DATA. Achieve Effectiveness And Efficiency. Copyright 2012 EMC Corporation. All rights reserved.

THE EMERGING GI ENVIRONMENT

Big Systems, Big Data

Big Data a threat or a chance?

Implications of Big Data for Statistics Instruction 17 Nov 2013

Big Data: Study in Structured and Unstructured Data

How Big Is Big Data Adoption? Survey Results. Survey Results Big Data Company Strategy... 6

Sunnie Chung. Cleveland State University


HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Education in Transition. Learning in the 21 st Century. Paige Johnson Education Strategist Intel

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

A Survey on Big Data Concepts and Tools

Big Data Architectures and Technologies

THE NEXT PHASE IN THE EVOLUTION OF BIG DATA ANALYTICS

Addressing the Challenges of Teaching Big Data in Technical Education. Christopher B. Davison. Ball State University.

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

REVIEW SHEETS INTRODUCTORY PHYSICAL SCIENCE MATH 52

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

Current Landscape of Business Analytics and Data Science 8/12/2015

Big Data & Analytics: Your concise guide (note the irony) Wednesday 27th November 2013

Predictive Analytics for Demand Forecasting and Planning Managers A Big Data Challenge Hans Levenbach, Delphus, Inc.

A Professional Big Data Master s Program to train Computational Specialists

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Data-Intensive Science and Scientific Data Infrastructure

Big Data Readiness. A QuantUniversity Whitepaper. 5 things to know before embarking on your first Big Data project

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Collaborations between Official Statistics and Academia in the Era of Big Data

Data Centric Computing Revisited

Séminaire du LaDHUL. Periklis Andritsos, «Big Data Challenges, Opportunities and Avenues of Research, or How did I grow up talking to data»

Big Data Analytics. David Dietrich, EMC Education Services. April 4, 2013

Big Data: Image & Video Analytics

Cleveland State University NAL/PAD/PDD/UST 504 Section 51 Levin College of Urban Affairs Fall, 2009 W 6 to 9:50 pm UR 108

Information Management course

Doing Multidisciplinary Research in Data Science

Business Intelligence & Product Analytics

Journal of Environmental Science, Computer Science and Engineering & Technology

Big Data Streams. Analytics Challenges, Analysis, and Applications. Adel M. Alimi

( Data Scientists in the Wild )

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Kai Wähner. The Next-Generation BPM for a Big Data World: Intelligent Business Process Management Suites (ibpms)

ว ชาโปรแกรมแกรมประย กต ด านการจ ดการ สาน กงานอ ตโนม ต

Introduction to Predictive Analytics. Dr. Ronen Meiri

EXECUTIVE REPORT. Big Data and the 3 V s: Volume, Variety and Velocity

Statistics for BIG data

The Top Challenges in Big Data and Analytics

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Massive Scale Analytics for a Smarter Planet

Better Decision Making

Digital Collections as Big Data. Leslie Johnston, Library of Congress Digital Preservation 2012

Visual Data Discovery & Streaming Data It s About Time! Pat Wall Product Marketing

What is Data Science? Girl Develop It! Meetup Renée M. P. Teate, March 2015

Statistics, Big Data and Data Science!?

Big Data Introduction, Importance and Current Perspective of Challenges

Visualizing Big Data. Activity 1: Volume, Variety, Velocity

Introducing Big Data Concepts in an Introductory Technology Course

North Highland Data and Analytics. Data Governance Considerations for Big Data Analytics

SPC BENCHMARK 1C FULL DISCLOSURE REPORT SEAGATE TECHNOLOGY LLC IBM 600GB 10K 6GBPS SAS 2.5" SFF SLIM-HS HDD SPC-1C V1.5

Getting Ready for Big Data

BIG DATA IN BUSINESS ENVIRONMENT

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Von Social Media zum Social Business Ein Megatrend für die Geschäftswelt


What does the future hold for predictive analytics?

WELCOME TO THE WORLD OF BIG DATA. NEW WORLD PROBLEMS, NEW WORLD SOLUTIONS

Cloudian The Storage Evolution to the Cloud.. Cloudian Inc. Pre Sales Engineering

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

A Presenta*on from Big Data 22 February 2013

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

REVIEW PAPER ON BIG DATA USING HADOOP

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

CONFIGURING TCP/IP ADDRESSING AND SECURITY

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Big Analytics: A Next Generation Roadmap

CS222: Systems Programming

File Organization and Management

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

A Course in Data Discovery and Predictive Analytics 16 Nov 2013

How To Teach Data Science

We are building the next generation of Big Data and Analytics solutions!

Keywords Cloud Computing, CRC, RC4, RSA, Windows Microsoft Azure

A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS

BUSINESS INTELLIGENCE COMPETENCY CENTER

Applying Process Mining to IT Big Data

Research Note What is Big Data?

Indexed Terms: Big Data, benefits, characteristics, definition, problems, unstructured data

Introduction to Data Mining

The Next Wave of Data Management. Is Big Data The New Normal?

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

jbeam A CEA-based enterprisewide reporting solution, even for big datasets

How Big Data is Different

Big Data in the context of Preservation and Value Adding

Don t worry! There is no right or wrong answer.be honest so that I can figure out the best way to help you next year!

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Transcription:

Introducing Big Data in Stat 101 with Small Changes 17 Nov 2013 Introducing Big Data in Stat 101 with Small Changes John D. McKenzie, Jr. Babson College Babson Park, MA 02457 0310 mckenzie@babson.edu DSI Baltimore, MD 2013 November 18 Abstract Today s technology produces massive amounts of data from a variety of sources such as social networking activities, financial transactions, genetic sequences, and astronomical transmissions. Very few introductory applied statistics courses consider such Big Data, for which many standard descriptive and inferential methods fail. This presentation will consider a number of ways that students can be easily exposed to the three V s of 'Big Data' (Volume, Velocity, and Variety) in such courses. 1 2 Agenda 2012 Mathematics Awareness Month http://www.maa.org/mathematics awareness month 2012 Big Data and its Three + V s Standard Introductory Applied Course Big Data Sets Volume Velocity Varieties 3 4 Big Data in the News OSTP s Big Data Initiative (US$200,000,000) (nsf.gov search on Big Data) McKinsey Global Institute Report (a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know how to use the analysis of big data to make effective decisions) Big Data Special Issue of Significance Magazine (August 2012) NSA Disclosures, Prefixes for multiples of bits (b) or bytes (B) Value Bits and Bytes Decimal Metric 1000 k kilo 1000 2 M mega 1000 3 G giga 1000 4 T tera 1000 5 P peta 1000 6 E exa 1000 7 Z zetta 1000 8 Y yotta Binary Value JEDEC IEC 1024 K kilo Ki kibi 10242 M mega Mi mebi 10243 G giga Gi gibi 10244 Ti tebi 10245 Pi pebi 10246 Ei exbi 10247 Zi zebi 10248 Yi yo 5 6 2013 McKenzie DSI MSMESB Slides.pdf 1

Introducing Big Data in Stat 101 with Small Changes 17 Nov 2013 The Three V s of Big Data Volume Velocity Variety Introductory Applied Course Terminology and Sampling Methods Descriptive Statistics (graphs and numeric measures) Basic Probability Fundamental Inference Advanced Topics META Group (now Gartner) analyst, Doug Laney Only one course (De Veaux) 7 8 Massive Data Sets Practice Significance Visualization Volume Big Data Sets http://www.kdnuggets.com/datasets/ Over 60 Data Repositories and growing Data Mining Competitions KDD Cup Results Summary 9 10 Practical Significance p value >.05 from one sample z test and versus p value =.000 from one sample z test with same sample mean and standard deviation but a 1000 times the sample size Doane and Steward (2009), Applied Statistics in Business & Economics pp. 364, 371, 374, 404, and 594 reinforcement Practical Significance 2 Chi Square Test of Independence 100 60 90 70 with p value of.255 to a p value of.000 for 1000 600 900 700 11 12 2013 McKenzie DSI MSMESB Slides.pdf 2

Introducing Big Data in Stat 101 with Small Changes 17 Nov 2013 Data Visualization Data Visualization A visualization created by IBM of Wikipedia edits. At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data Twitter Mentions 13 14 Velocity Variety (structure) Time Series Data Process Data Two Sample Data Missing Data Messy Data Text Data Date and Time Data 15 16 Variety: Two Sample Data Text Data: Word Cloud 17 18 2013 McKenzie DSI MSMESB Slides.pdf 3

Introducing Big Data in Stat 101 with Small Changes 17 Nov 2013 Text Data: Word Cloud DSI Constitution and By Laws 19 20 Text Data: N Gram Big Data, Business Analytics, Predictive Analytics,, Data Science 21 22 Variety (sources) http://www.amstat.org/publications/jse/jse_data_archive.htm: JSE Data Archive http://www.causeweb.org/cwis/spt--browseresources.php?parentid=5: CAUSE Data Sets http://stat-computing.org/dataexpo : ASA Statistical Computing and Statistical Graphics Bi- Annual Data Exposition http://www.kdnuggets.com/datasets : Datasets for Data Mining http://www.data.gov : U.S. Government Data http://data.worldbank.org/ : The World Bank Data http://bitly.com/bundles/hmason/1 : Research-Quality Data Sets http://aws.amazon.com/big-data : Big Data on Amazon Web Services http://www.bigdata-startups/public-data: 14 Sources of Public Data Sets http://es.slideshare.net/cengagelearning/mark-frydenberg-drinking-from-the-fire-hose : Big Data: What It Is and How You Can Use It slide show https://developers.google.com/fusiontables/ : Google experimental application that lets you store, share, query, and visualize data tables https://developers.google.com/bigquery/ : Google site to interactively analyze massive datasets http://citizen-statistician.org/ : Learning to Swim in the Data Deluge Blog http://www.williams.edu/feature-stories/visualizing-the-liberal-arts/ : Williams College Majors and Employment Future Introductory Course Math Common Core State Standards will result in Remedial Sections? Today s Course with More Topics? Today s Second Core? Big Data Analytics Course? or? 23 24 2013 McKenzie DSI MSMESB Slides.pdf 4

Introducing Big Data in Stat 101 with Small Changes 17 Nov 2013 Two Current Examples of Analytics Sharpe, De Veaux, and Velleman (2012), Business Statistics, Second Edition, Chapter 25, Introduction to Data Mining (Paralyzed Veterans of America) Berenson, Levine, and Krehbiel (2012), Basic Business Statistics, Twelfth Edition, Online Topic: Analytics and Data Mining 2015? 25 2013 McKenzie DSI MSMESB Slides.pdf 5

Introducing Big Data in Stat 101 with Small Changes John D. McKenzie, Jr. Babson College Babson Park, MA 02457 0310 mckenzie@babson.edu DSI Baltimore, MD 2013 November 18 1

Abstract Today s technology produces massive amounts of data from a variety of sources such as social networking activities, financial transactions, genetic sequences, and astronomical transmissions. Very few introductory applied statistics courses consider such Big Data, for which many standard descriptive and inferential methods fail. This presentation will consider a number of ways that students can be easily exposed to the three V s of 'Big Data' (Volume, Velocity, and Variety) in such courses. 2

Agenda Big Data and its Three + V s Standard Introductory Applied Course Big Data Sets Volume Velocity Varieties 3

2012 Mathematics Awareness Month http://www.maa.org/mathematics awareness month 2012 4

Big Data in the News OSTP s Big Data Initiative (US$200,000,000) (nsf.gov search on Big Data) McKinsey Global Institute Report (a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know how to use the analysis of big data to make effective decisions) Big Data Special Issue of Significance Magazine (August 2012) NSA Disclosures, 5

Bits and Bytes Prefixes for multiples of bits (b) or bytes (B) Decimal Value Metric 1000 k kilo 1000 2 M mega 1000 3 G giga 1000 4 T tera 1000 5 P peta 1000 6 E exa 1000 7 Z zetta 1000 8 Y yotta Binary Value JEDEC IEC 1024 K kilo Ki kibi 1024 2 M mega Mi mebi 1024 3 G giga Gi gibi 1024 4 Ti tebi 1024 5 Pi pebi 1024 6 Ei exbi 1024 7 Zi zebi 1024 8 Yi yo 6

The Three V s of Big Data Volume Velocity Variety META Group (now Gartner) analyst, Doug Laney 7

Introductory Applied Course Terminology and Sampling Methods Descriptive Statistics (graphs and numeric measures) Basic Probability Fundamental Inference Advanced Topics Only one course (De Veaux) 8

Volume Massive Data Sets Practice Significance Visualization 9

Big Data Sets http://www.kdnuggets.com/datasets/ Over 60 Data Repositories and growing Data Mining Competitions KDD Cup Results Summary 10

Practical Significance p value >.05 from one sample z test and versus p value =.000 from one sample z test with same sample mean and standard deviation but a 1000 times the sample size Doane and Steward (2009), Applied Statistics in Business & Economics pp. 364, 371, 374, 404, and 594 reinforcement 11

Practical Significance 2 Chi Square Test of Independence 100 60 90 70 with p value of.255 to a p value of.000 for 1000 600 900 700 12

Data Visualization A visualization created by IBM of Wikipedia edits. At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data 13

Data Visualization Twitter Mentions 14

Velocity Time Series Data Process Data 15

Variety (structure) Two Sample Data Missing Data Messy Data Text Data Date and Time Data 16

Variety: Two Sample Data 17

Text Data: Word Cloud 18

Text Data: Word Cloud 19

DSI Constitution and By Laws 20

Text Data: N Gram 21

Big Data, Business Analytics, Predictive Analytics,, Data Science 22

Variety (sources) http://www.amstat.org/publications/jse/jse_data_archive.htm: JSE Data Archive http://www.causeweb.org/cwis/spt--browseresources.php?parentid=5: CAUSE Data Sets http://stat-computing.org/dataexpo : ASA Statistical Computing and Statistical Graphics Bi- Annual Data Exposition http://www.kdnuggets.com/datasets : Datasets for Data Mining http://www.data.gov : U.S. Government Data http://data.worldbank.org/ : The World Bank Data http://bitly.com/bundles/hmason/1 : Research-Quality Data Sets http://aws.amazon.com/big-data : Big Data on Amazon Web Services http://www.bigdata-startups/public-data: 14 Sources of Public Data Sets http://es.slideshare.net/cengagelearning/mark-frydenberg-drinking-from-the-fire-hose : Big Data: What It Is and How You Can Use It slide show https://developers.google.com/fusiontables/ : Google experimental application that lets you store, share, query, and visualize data tables https://developers.google.com/bigquery/ : Google site to interactively analyze massive datasets http://citizen-statistician.org/ : Learning to Swim in the Data Deluge Blog http://www.williams.edu/feature-stories/visualizing-the-liberal-arts/ : Williams College Majors and Employment 23

Future Introductory Course Math Common Core State Standards will result in Remedial Sections? Today s Course with More Topics? Today s Second Core? Big Data Analytics Course? or? 24

Two Current Examples of Analytics Sharpe, De Veaux, and Velleman (2012), Business Statistics, Second Edition, Chapter 25, Introduction to Data Mining (Paralyzed Veterans of America) Berenson, Levine, and Krehbiel (2012), Basic Business Statistics, Twelfth Edition, Online Topic: Analytics and Data Mining 2015? 25