Mining Big Data. Pang-Ning Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University




Website: http://www.cse.msu.edu/~ptan

[Figure: Google Trends for "Big Data" and "Smart Cities"]

Big Data and Smart Cities

Outline
- Smart Cities
- Big Data and Its Challenges
- Mining Big Data

Smart Cities
"Cities are growing steadily, and the process of urbanization is a common trend in the world. Although cities are getting bigger, they are not necessarily getting better ... smart cities, founded on the use of information and communication technologies, aim at tackling many local problems, from local economy and transportation to quality of life and e-governance." [Martínez-Ballesté et al. IEEE Communications 2013]

Examples of Smart Cities
- Smart buildings
- E-governance
- Transportation
- Healthcare
- Education
- Energy
- Water
- Waste management
- Public safety
What are the key resources needed to realize this? DATA and $$$$

Types of Data from Smart Cities
- Sensor time series
- Surveillance video streams
- GPS trajectories from mobile devices
- Smart card
- Social media
- Structured data

Why Mine/Analyze the Data?
The data contains useful information that can be harnessed for various purposes:
- Monitoring/surveillance
- Event detection
- Adaptation
- Decision making
- Planning
- Forecasting
- Etc.


Big Data: How Much Data is Out There?
Source: http://www.emc.com/leadership/digital-universe/index.htm

How much is a Zettabyte?
1 ZettaByte = 1000 ExaBytes = 10^6 PetaBytes = 10^9 TeraBytes = 10^12 GigaBytes
A DVD stores about 5 GB of data and its case is ~1 cm thick
1 ZettaByte ~ 10^21 / (5 × 10^9) = 2 × 10^11, so about 200 billion DVDs are needed to store it
Distance from Earth to moon = 384,000 km = 3.84 × 10^10 cm
If you stack all the DVDs that contain 1 ZB of data, the stack is about 3 times the distance to the moon and back
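The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch using the slide's round figures (5 GB per DVD, 1 cm per case, 384,000 km to the moon); the variable names are illustrative.

```python
# Round figures taken from the slide; all names here are illustrative.
ZETTABYTE = 10**21                 # bytes
DVD_CAPACITY = 5 * 10**9           # bytes, about 5 GB per DVD
DVD_CASE_CM = 1                    # case thickness in cm
EARTH_MOON_CM = 384_000 * 10**5    # 384,000 km expressed in cm

dvds = ZETTABYTE // DVD_CAPACITY               # number of DVDs for 1 ZB
stack_cm = dvds * DVD_CASE_CM                  # height of the stack in cm
round_trips = stack_cm / (2 * EARTH_MOON_CM)   # moon-and-back distances

print(f"{dvds:,} DVDs, about {round_trips:.1f} round trips to the moon")
```

The result is roughly 2.6 round trips, which the slide rounds to "about 3 times".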

Challenges of Big Data
- Volume: large amount of data that is continuously growing
- Velocity: rapid streams of data collected
- Variety: structured and unstructured data obtained from (potentially) multiple data sources
- Veracity: messiness or trustworthiness of the data
- Value: usefulness of the data; needs a careful cost/benefit analysis before embarking on a big data project


What is Data Mining?
A collection of computer algorithms and techniques to automatically extract useful information from large data repositories
[Figure: Big Data Analytics Pipeline]

Garbage In, Garbage Out
Quality of output information depends on quality of input data

Data Preprocessing
Helps to alleviate many of the data quality issues:
- Noise
- Outliers
- Missing values (incomplete data)
- Duplicate data
- Data with irrelevant attributes
- Data with redundant attributes
- Data of varying format, scales, etc.

Types of Data Analysis
Simple, descriptive statistics:
- Mean/median/mode
- Standard deviation/mean absolute deviation
- Quartiles, percentiles, top-k
Example: heavy-hitter problem
- Find the hot topics (e.g., trending hashtags) used over the past 24 hours
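As a rough illustration of the descriptive statistics listed above, the sketch below uses Python's standard `statistics` and `heapq` modules. The `readings` list is made-up sample data, not taken from the slides.

```python
import heapq
import statistics

# Hypothetical sample values, for illustration only.
readings = [4, 8, 15, 16, 23, 42, 8]

mean = statistics.mean(readings)
median = statistics.median(readings)
mode = statistics.mode(readings)
std_dev = statistics.pstdev(readings)            # population standard deviation
# Mean absolute deviation around the mean, computed by hand.
mean_abs_dev = sum(abs(x - mean) for x in readings) / len(readings)
quartiles = statistics.quantiles(readings, n=4)  # three cut points: Q1, Q2, Q3
top3 = heapq.nlargest(3, readings)               # top-k with k = 3

print(mean, median, mode, std_dev, mean_abs_dev, quartiles, top3)
```

All of these are easy on a small in-memory list; the slides that follow show why even counting becomes harder on a stream.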

TrendMap

Finding Hot Topics (Unbounded Storage)
Data stream: 2013, discount, holiday, 2013, MSU
Associative array f (memory): 2013: 2, discount: 1, holiday: 1, MSU: 1
Naïve algorithm: assume storage space is unbounded
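The naïve approach can be sketched in Python as one counter per distinct token. It uses the stream shown on the slide and the standard `collections.Counter` as the associative array; memory grows with the number of distinct tokens, which is what the unbounded-storage assumption hides.

```python
from collections import Counter

# Stream from the slide.
stream = ["2013", "discount", "holiday", "2013", "MSU"]

f = Counter()            # associative array: token -> count
for token in stream:
    f[token] += 1        # one counter per distinct token

hot = f.most_common(1)   # the single most frequent token
print(dict(f), hot)
```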

Finding Hot Topics (Limited Storage)
Data stream: 2013, discount, holiday
Associative array f (memory): 2013: 1, discount: 1 (?), holiday: 1
Which one to replace?
Are there any theoretical guarantees that the solution will always be in the array?

Misra-Gries Algorithm
Data stream: 2013, discount, holiday, 2013, MSU
Associative array f (memory): 2013 1 0 discount MSU 1 0 holiday 1
The algorithm guarantees that every hot item that appears more than m/(k+1) times will be in the buffer (where m is the length of the data stream and k is the number of buffers)
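A minimal Python sketch of the Misra-Gries summary follows, assuming k counts the counters kept in memory. The function name `misra_gries` is illustrative. The stored counts are lower bounds on the true frequencies, so a second pass over the data is needed if exact counts matter.

```python
def misra_gries(stream, k):
    """Keep at most k counters; items seen more than m/(k+1) times survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1          # item already tracked
        elif len(counters) < k:
            counters[item] = 1           # free slot available
        else:
            # No free slot: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Stream from the slide, with k = 2 counters (k is an assumption here).
stream = ["2013", "discount", "holiday", "2013", "MSU"]
summary = misra_gries(stream, k=2)
print(summary)
```

Here m = 5 and k = 2, so the threshold is 5/3; "2013" appears twice and is therefore guaranteed to remain in the summary.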

Summary
- Even simple analysis becomes harder to compute when you have big data
- Need for fast and scalable algorithms that can produce good, approximate solutions

Advanced Data Mining Analysis
Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Ranking/Recommendation

Predictive Modeling: Classification
To infer the value of a nominal attribute based on the values of other observed attributes
Examples:
- Autonomous driving: traffic sign recognition, open lane detection
- Smart home/building: appliance identification based on electricity utilization

Predictive Modeling: Regression
To infer the value of a continuous attribute based on the values of other observed attributes
Examples:
- mHealth: monitoring heart rate and body temperature using wearable devices
- Intelligent transportation system: traffic volume prediction
- Smart building: electricity/water demand prediction

Framework for Predictive Modeling
[Figure: labeled examples (congestion / no congestion) are split into a training set and a test set; the training set is used to train a model, and the model is then applied to unlabeled examples]
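The train-then-apply workflow can be sketched with a toy nearest-centroid classifier in plain Python. The slides do not name a specific model, so the classifier, the two features (average speed, vehicles per minute) and all data values are assumptions made purely for illustration.

```python
import math


def train(examples):
    """Build a model: one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical labeled examples: (avg speed km/h, vehicles per minute).
training_set = [((20, 55), "congestion"), ((25, 60), "congestion"),
                ((80, 15), "no congestion"), ((90, 10), "no congestion")]
test_set = [((30, 50), "congestion"), ((75, 20), "no congestion")]
unlabeled = [(22, 50), (85, 12)]

model = train(training_set)
accuracy = sum(predict(model, x) == y for x, y in test_set) / len(test_set)
predictions = [predict(model, x) for x in unlabeled]
print(accuracy, predictions)
```

The held-out test set estimates how well the model generalizes before it is applied to the unlabeled examples.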

Cluster Analysis
Find groups of observations such that the observations in the same group are more similar to each other than to those in other groups
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized
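One common way to pursue this goal is k-means, sketched below in plain Python. The slides do not name a clustering algorithm, so k-means is a choice made here; the points are made up, and the initialization (first k points) is deliberately naive to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [list(p) for p in points[:k]]   # naive init: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:                          # keep old centroid if cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
              for p in points]
    return centroids, labels


# Hypothetical 2-D observations forming two visible groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Each update step lowers the total squared distance of points to their own centroid, which is the "intra-cluster distances are minimized" part of the definition; the result depends on the initialization.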

Applications of Cluster Analysis
- Crime hotspot detection
- GPS trajectory segmentation

Association Analysis
Extract patterns of frequently co-occurring events

Time                Sensor ID  State
3/1/2015 07:48:05   BR1        OFF
3/1/2015 07:48:07   LR1        ON
3/1/2015 07:48:10   LR6        ON
3/1/2015 07:48:20   BT1        ON
3/1/2015 07:48:40   LR6        OFF
3/1/2015 07:49:30   BT3        ON

Example patterns:
- {Weekday, 7-8am, BR2 = OFF, BR1 = OFF, LR6 = ON} -> LR1 = ON
- {Weekday, 10-11pm, BR1 = ON, BR2 = ON, LR6 = OFF} -> LR1 = OFF
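Patterns of this kind are usually scored by support and confidence. The sketch below computes both in plain Python over a handful of made-up "transactions" (sets of sensor states observed in the same time window); the data and function names are illustrative and not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent); antecedent must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical sensor states grouped by time window.
transactions = [
    {"BR1=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=ON", "LR6=OFF", "LR1=OFF"},
    {"BR1=OFF", "LR6=ON", "LR1=OFF"},
]

rule_support = support(transactions, {"LR6=ON", "LR1=ON"})
rule_confidence = confidence(transactions, {"LR6=ON"}, {"LR1=ON"})
print(rule_support, rule_confidence)
```

A rule is typically reported only when both values exceed user-chosen thresholds.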

Applications of Association Analysis
- Traffic accident analysis
- Smart health: adverse drug interactions

Anomaly Detection
Detect significant deviations from normal observations
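One simple way to flag such deviations is a z-score test, sketched below. The slides do not prescribe a method, so the z-score rule, the threshold of 2 standard deviations and the flow readings are all assumptions for illustration.

```python
import statistics


def zscore_anomalies(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` std devs from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []                      # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / std > threshold]


# Hypothetical water-flow readings with one suspicious spike.
flow = [10, 11, 9, 10, 12, 10, 11, 60]
anomalies = zscore_anomalies(flow)
print(anomalies)
```

Because the spike itself inflates the mean and standard deviation, robust variants based on the median are often preferred in practice.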

Applications of Anomaly Detection
- Smart transportation: congestion detection, sensor fault detection
- Smart home/building: water theft detection, pipe burst detection

Ranking (Recommendation)
Given a query q, recommend items in specific rank order based on their relevance to q
Examples:
- Location-aware services
- Smart home assistant

Other Challenges: Privacy
http://techland.time.com/2012/02/17/how-target-knew-a-high-school-girl-was-pregnant-before-her-parents/

Other Challenges: Security
http://arstechnica.com/security/2012/12/how-an-internet-connected-samsung-tv-can-spill-your-deepest-secrets/

Summary
Mining big data is both a challenge and an opportunity

CSE Courses on Data Mining
- CSE 491/891: Computational Techniques for Large-Scale Data Analysis
- CSE 881: Data Mining

References
- Pang-Ning Tan, Knowledge Discovery from Sensor Data, Feature Article in Sensors Magazine, March 1, 2006
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006