Data Mining: Opportunities and Challenges



Similar documents
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Information Management course

Data Mining: Concepts and Techniques

Introduction. A. Bellaachia Page: 1

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Sanjeev Kumar. contribute

Big Data Analytics in Mobile Environments

Introduction to Data Mining

DATA MINING AND KNOWLEDGE DISCOVERY FROM RESEARCH PROBLEMS

DATA PREPARATION FOR DATA MINING

Classification and Prediction

Top Top 10 Algorithms in Data Mining

CAS CS 565, Data Mining

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Top 10 Algorithms in Data Mining

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Data Mining & Data Stream Mining Open Source Tools

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Introduction to Data Mining

Application of Data Mining Techniques in Intrusion Detection

Project Participants

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Data Mining: Concepts and Techniques

Random Forest Based Imbalanced Data Cleaning and Classification

Prediction of DDoS Attack Scheme

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining

Collaborations between Official Statistics and Academia in the Era of Big Data

Introduction to Data Mining

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Clustering Data Streams

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Data Mining System, Functionalities and Applications: A Radical Review

An Introduction to Data Mining

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY. Elena Zheleva Evimaria Terzi Lise Getoor. Privacy in Social Networks

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Machine Learning Introduction

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

FREQUENT PATTERN MINING FOR EFFICIENT LIBRARY MANAGEMENT

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Intrusion Detection via Machine Learning for SCADA System Protection

Introduction to Data Mining. Lijun Zhang

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Data Mining: An Overview. David Madigan

Statistics for BIG data

Knowledge Engineering with Big Data

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Interactive Exploration of Decision Tree Results

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Data Mining Introduction

Information Security in Big Data using Encryption and Decryption

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

MATTEO RIONDATO Curriculum vitae

IC05 Introduction on Networks &Visualization Nov

IEEE JAVA Project 2012

A Game Theoretical Framework for Adversarial Learning

Data Mining for Knowledge Management. Classification

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Solutions for the Business Environment

SPMF: a Java Open-Source Pattern Mining Library

A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

INTRUSION PREVENTION AND EXPERT SYSTEMS

What is Data Mining?

MA2823: Foundations of Machine Learning

Data Warehousing and Data Mining

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

Dynamic Data in terms of Data Mining Streams

International Journal of Advanced Computer Technology (IJACT) ISSN: PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS

Knowledge Discovery Process and Data Mining - Final remarks

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

Knowledge Discovery from Data Bases Proposal for a MAP-I UC

Conclusions and Future Directions

Introduction to Learning & Decision Trees

Transcription:

Data Mining: Opportunities and Challenges Xindong Wu University of Vermont, USA; Hefei University of Technology, China ( 合 肥 工 业 大 学 计 算 机 应 用 长 江 学 者 讲 座 教 授 ) 1

Deduction Induction: My Research Background 1988 1990 Expert Systems Expert Systems 1995 2004 2

Outline 1. Data Mining Opportunities Major Conferences and Journals in Data Mining Main Topics in Data Mining Some Research Directions in Data Mining 2. 10 Challenging Problems in Data Mining Research 3

What Is Data Mining? The discovery of knowledge (in the form of rules, trees, frequent patterns etc.) from large volumes of data A hot field: 15 data mining conferences in 2003, including KDD, ICDM, SDM, IDA, PKDD and PAKDD excluding IJCAI, COMPSTAT, SIGMOD and other more general conferences that also publish data mining papers. 4

Main Activities in Data Mining: Conferences The birth of data mining/kdd: 1989 IJCAI Workshop on Knowledge Discovery in Databases 1991-1994 Workshops on Knowledge Discovery in Databases 1995 date: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD) 2001 date: IEEE ICDM and SIAM-DM (SDM) Several regional conferences, incl. PAKDD (since 1997) & PKDD (since 1997). 5

Data Mining: Major Journals Data Mining and Knowledge Discovery (DMKD, since 1997) Knowledge and Information Systems (KAIS, since 1999) IEEE Transactions on Knowledge and Data Engineering (TKDE) Many others, incl. TPAMI, ML, IDA, 6

ACM KDD vs. IEEE ICDM 700 KDD and ICDM Paper Submissions 630 # of Submissions 600 500 400 300 200 133 215 162 501 451 384 415 365 369 250 264 284 308 298 237 ACM SIGKDD IEEE ICDM 100 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 ACM SIGKDD 133 215 162 250 264 284 237 308 298 384 415 IEEE ICDM 365 369 501 451 630 Year 7

Main Topics in Data Mining Association analysis (frequent patterns) Classification (trees, Bayesian methods, etc) Clustering and outlier analysis Sequential and spatial patterns, and time-series analysis Text and Web mining Data visualization and visual data mining. 8

Some Research Directions Web mining (incl. Web structures, usage analysis, authoritative pages, and document classification) Intelligent data analysis in Bioinformatics Mining with data streams (in continuous, real-time, dynamic data environments) Integrated, intelligent data mining environments and tools (incl. induction, deduction, and heuristic computation). 9

Outline 1. Data Mining Opportunities Major Conferences and Journals in Data Mining Main Topics in Data Mining Some Research Directions in Data Mining 2. 10 Challenging Problems in Data Mining Research 10

10 Challenging Problems in Data Mining Research Joint Efforts with Qiang Yang (Hong Kong Univ. of Sci. & Tech.) With Contributions from ICDM & KDD Organizers Xindong Wu, (University of Vermont, USA; Hefei University of Technology, China) 11

Why Most Challenging Problems? What are the 10 most challenging problems in data mining, today? Different people have different views, a function of time as well What do the experts think? Experts we consulted: Previous organizers of IEEE ICDM and ACM KDD We asked them to list their 10 problems (requests sent out in Oct 05, and replies Obtained in Nov 05) Replies Edited into an article: hopefully be useful for young researchers Not in any particular importance order 12

1. Developing a Unifying Theory of Data Mining The current state of the art of data-mining research is too ``ad-hoc techniques are designed for individual problems no unifying theory Needs unifying research Exploration vs explanation Long standing theoretical issues How to avoid spurious correlations? Deep research. Knowledge discovery on hidden causes? Similar to discovery of Newton s Law? An Example (from Tutorial Slides by Andrew Moore): VC dimension. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting? formal analysis into the fascinating question of how overfitting can happen, estimating how well an algorithm will perform on future data that is solely based on its training set error, a property (VC dimension) of the learning algorithm. VC-dimension thus gives an alternative to crossvalidation, called Structural Risk Minimization (SRM), for choosing classifiers. CV,SRM, AIC and BIC. 13

2. Scaling Up for High Dimensional Data and High Speed Streams Scaling up is needed ultra-high dimensional classification problems (millions or billions of features, e.g., bio data) Ultra-high speed data streams Streams. continuous, online process e.g. how to monitor network packets for intruders? concept drift and environment drift? RFID network and sensor network data Excerpt from Jian Pei s Tutorial http://www.cs.sfu.ca/~jpei/ 14

3. Sequential and Time Series Data How to efficiently and accurately cluster, classify and predict the trends? Time series data used for predictions are contaminated by noise. How to do accurate short-term and long-term predictions? Signal processing techniques introduce lags in the filtered data, which reduces accuracy Key in source selection, domain knowledge in rules, and optimization methods Real time series data obtained from wireless sensors in Hong Kong UST CS department hallway 15

4. Mining Complex Knowledge from Complex Data Mining graphs Data that are not i.i.d. (independent and identically distributed) many objects are not independent of each other, and are not of a single type. mine the rich structure of relations among objects, E.g.: interlinked Web pages, social networks, metabolic networks in the cell Integration of data mining and knowledge inference The biggest gap: unable to relate the results of mining to the real-world decisions they affect - all they can do is hand the results back to the user More research on interestingness of knowledge. Citation (Paper 2) Conference Name Title Author (Paper1) 16

5. Data Mining in a Network Setting Community and Social Networks Linked data between emails, Web pages, blogs, citations, sequences and people Static and dynamic structural behavior Mining in and for Computer Networks. detect anomalies (e.g., sudden traffic spikes due to a DoS (Denial of Service) attack Need to handle 10Gig Ethernet links (a) detect (b) trace back (c ) drop packet Picture from Matthew Pirretti s slides, Penn State An Example of packet streams (data courtesy of NCSA, UIUC) 17

6. Distributed Data Mining and Mining Multi-agent Data Need to correlate the data seen at the various probes (such as in a sensor network) Adversary data mining: deliberately manipulate the data to sabotage them (e.g., make them produce false negatives) Game theory may be needed for help. Games Player 1:miner Action: H H T H T Player 2 (-1,1) (-1,1) (1,-1) (1,-1) Outcome T 18

7. Data Mining for Biological and Environmental Problems New problems raise new questions Large scale problems especially so Biological data mining, such as HIV vaccine design DNA, chemical properties, 3D structures, and functional properties need to be fused Environmental data mining Mining for solving the energy crisis. 19

8. Data-mining-Process Related Problems How to automate mining process? the composition of data mining operations Data cleaning, with logging capabilities Sampling Feature Sel Visualization and mining automation. Need a methodology: help users avoid many data mining mistakes What is a canonical set of data mining operations? Mining 20

9. Security, Privacy and Data Integrity How to ensure the users privacy while their data are being mined? How to do data mining for protection of security and privacy? Knowledge integrity assessment. Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security Development of measures to evaluate the knowledge integrity of a collection of Data Knowledge and patterns http://www.cdt.org/privacy/ Headlines (Nov 21 2005) Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files 21

10. Dealing with Non-static, Unbalanced and Cost-sensitive Data The UCI datasets are small and not highly unbalanced Real world data are large (10^5 features) but only < 1% of the useful classes (+ ve) There is much information on costs and benefits, but no overall model of profit and loss Data may evolve with a bias introduced by sampling. temperature 39 o c pressure blood test essay??? cardiogram? Each test incurs a cost Data extremely unbalanced Data change with time 22

10 Challenging Problems: Summary 1. Developing a Unifying Theory of Data Mining 2. Scaling Up for High Dimensional Data/High Speed Streams 3. Mining Sequence Data and Time Series Data 4. Mining Complex Knowledge from Complex Data 5. Data Mining in a Network Setting 6. Distributed Data Mining and Mining Multi-agent Data 7. Data Mining for Biological and Environmental Problems 8. Data-Mining-Process Related Problems 9. Security, Privacy and Data Integrity 10. Dealing with Non-static, Unbalanced and Cost-sensitive Data 23

Contributors Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghavan, Rajeev Rastogi, Salvatore J. Stolfo, Alexander Tuzhilin, and Benjamin W. Wah 24