Sanjeev Kumar. contribute



Similar documents

An Overview of Knowledge Discovery Database and Data mining Techniques

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Data Mining Research: Opportunities and Challenges

Information Visualization WS 2013/14 11 Visual Analytics

Introduction to Data Mining

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Doctor of Philosophy in Computer Science

Information Management course

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Mining Analytics for Business Intelligence and Decision Support

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Principles of Data Mining by Hand&Mannila&Smyth

Introduction. A. Bellaachia Page: 1

Data Mining System, Functionalities and Applications: A Radical Review

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

TDS - Socio-Environmental Data Science

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support

The Scientific Data Mining Process

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

ANALYTICS CENTER LEARNING PROGRAM

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

A Review of Data Mining Techniques

Bayesian networks - Time-series models - Apache Spark & Scala

Database Marketing, Business Intelligence and Knowledge Discovery

Data Mining for Fun and Profit

Introduction to Data Mining

not possible or was possible at a high cost for collecting the data.

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić

Learning is a very general term denoting the way in which agents:

ABSTRACT The World MINING R. Vasudevan. Trichy. Page 9. usage mining. basic. processing. Web usage mining. Web. useful information

Data Isn't Everything

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Master of Science in Computer Science

Computer Information Systems

D A T A M I N I N G C L A S S I F I C A T I O N

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

DATA MINING TECHNIQUES AND APPLICATIONS

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

Graduate Co-op Students Information Manual. Department of Computer Science. Faculty of Science. University of Regina

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Learning outcomes. Knowledge and understanding. Competence and skills

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED

Data Discovery, Analytics, and the Enterprise Data Hub

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Big Data Text Mining and Visualization. Anton Heijs

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data Mining: Overview. What is Data Mining?

Adobe Insight, powered by Omniture

Data Mining and Pattern Recognition for Large-Scale Scientific Data

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Visualization methods for patent data

MEng, BSc Applied Computer Science

Statistics for BIG data

Topics in basic DBMS course

A New Approach for Evaluation of Data Mining Techniques

Data warehouse and Business Intelligence Collateral

Unsupervised Data Mining (Clustering)

SPATIAL DATA CLASSIFICATION AND DATA MINING

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Page 1 of 5. (Modules, Subjects) SENG DSYS PSYS KMS ADB INS IAT

Big Data: Rethinking Text Visualization

In-Database Analytics

Assessing Data Mining: The State of the Practice

Data Mining Solutions for the Business Environment

Process Modelling from Insurance Event Log

IC05 Introduction on Networks &Visualization Nov

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Improving Decision Making and Managing Knowledge

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Professional Organization Checklist for the Computer Science Curriculum Updates. Association of Computing Machinery Computing Curricula 2008

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Supervised Learning (Big Data Analytics)

ICT Perspectives on Big Data: Well Sorted Materials

Foundations of Business Intelligence: Databases and Information Management

Clustering & Visualization

Comparison of Data Mining Techniques used for Financial Data Analysis

Big Data Executive Survey

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Steven C.H. Hoi School of Information Systems Singapore Management University

Statistical Models in Data Mining

Transcription:

RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a new, fundamental research area with important applications to science, engineering, medicine, business, and education. Dataa mining attempts to formulate analyze and implement basic induction processes that facilitate the extraction of meaningful information and knowledge from unstructured data. Data mining extracts patterns, changes, associations and anomalies from large data sets. Work in data mining ranges from theoretical work on the principles of learning and mathematical representations of data to building advanced engineering systems that perform informationn filtering on the web, find genes in DNA sequences, help understandd trends and anomalies in economics and education, and detect network intrusion. Data mining is also a promising computational paradigm that enhances traditional approaches to discovery and increases the opportunities for breakthroughs in the understanding of complex physical and biologicall systems. Researcherss from many intellectual communities have much to contribute to this field. These include the communities of machine learning, statistics, databases, visualization and graphics, optimization, computational mathematics, and the theory of algorithms. 2. The Process of Data Mining The data mining process is often characterized as a multi-stage iterative process involving dataa selection, data cleaning, and application of data mining algorithms, evaluation, and so forth. Heree we adopt a somewhat different process-oriented view and break it down into different steps: 2.1. Exploring and Pre-processing: the initial steps of exploring, visualizing, and querying the data, to gain insight into the data in an interactive manner. Preprocessing steps such as variable selection, data focusing, and data validation can also be included in these initial steps.

2.2. Modeling The steps involved in (a) selecting the model representations that we seek to fit to the data (e.g., a tree, a linear function, a probability density model, etc.), (b) selecting the score functions that score different models with respect to the data, and (c) specifying the computational methods and algorithms to optimize the score function (e.g., greedy local search). These \components" combined together specify the data mining algorithm to be used. The components may be \precompiled" into a specific algorithm (e.g., CART or C4.5 decision tree implementations) or may be integrated in a \customized" manner for a specific application (much more common in the sciences). 2.3. Mining The step (often repeated) is of actually running a particular data mining algorithm on a particular data set. 2.4. Evaluating The step (often ignored) of critically evaluating the quality of the output of the data mining algorithm from step 3, both the predictions of the model and the interpretation of the fitted model itself. 2.5. Deploying The step (rarely achieved) of putting a model from a data mining algorithm into routine predictive use, e.g., using the model continuously in real-time for scoring customers visiting an ecommerce Web site. A challenging (and under-appreciated) technical issue in this context is, how and when models should be updated for such continuous data stream applications. 3. Two Extremes of Data Mining Given the vast numbers of different users of data analysis tools, across different application disciplines, it is clearly dangerous to make broad generalizations about data analysis and data mining! Nonetheless, we will now consider two \prototype" users of data mining tools in the general context of the 5-step data mining process above. These two prototypes are in some respects at opposite ends of a hypothetical spectrum of approaches to data mining. 4. The Business Data Miner The first prototype data miner is \The Business-Person," or \BP" for short. BP typically deals with large numbers of customers, for example in retail consumer or financial services environments. BP's main goal is to use data mining techniques for prediction of customer behavior to gain competitive advantage, e.g., using decision trees to try to identify promising customers for a marketing campaign. BP's approach to data mining is often strictly constrained by a number of factors: cost factors (the cost of data mining development and deployment should not be less than anticipated gains in revenue), integration factors (the deployed predictive model may need be integrated into an existing transaction-oriented environment), and time constraints (any potential competitive advantage may depend critically on being able to develop and deploy a predictive model quickly). If one asks the BP what their most pressing problems are, they will almost certainly not say that they need a slightly more accurate decision tree algorithm, or a slightly faster association rule algorithm. Instead their most pressing problem is that of managing the overall process: how should features be defined for time-dependent data? Which models are 212

most appropriate? How much data is needed for training? Will a random sample of 100k customers from a database of 10 million be \good enough"? What kind of a system can be put in place to update the models as new data arrives? and so on. 5. The Science Data Miner The other prototype we consider is the \Science Person" (SP for short) which is in some sense at the other end of a hypothetical spectrum of data miners. The SP might be an atmospheric scientist, investigating global climate change patterns via analysis of spatio-temporal gridded observations of temperature, pressure, wind-speed, taken on a global grid on a daily basis for the past 30 years. Or the SP might be working in computational biology, exploring a large gene expression data set and its relation to cancer. The SP's data mining approach is quite different to that of the BP. SP typically spends significant time in Steps 1 and 2: exploring, visualizing, defining alternative models, and so forth. For example, in atmospheric science, the data can be \shaped" in different ways via numerous pre-processing techniques such as principal components analysis, time-series smoothing, and so forth. A single model may take 6 months or a year to develop and only be evaluated once on a particular data set. SP's primary goal is to explore model space in such a manner as to better understand the phenomena generating the data: generating better predictions is merely a stepping stone on the path to identifying better models. 6. Recent Research Achievements The opportunities today in data mining rest solidly on a variety of research achievements, which were interdisciplinary in nature, resting on discoveries made by researchers from different disciplines working together collaboratively. Neural Networks: Neural networks are systems inspired by the human brain. A basic example is provided by a back propagation network which consists of input nodes, output nodes, and intermediate nodes called hidden nodes. Initially, the nodes are connected with random weights. During the training, a gradient descent algorithm is used to adjust the weights so that the output nodes correctly classify data presented to the input nodes. The algorithm was invented independently by several groups of researchers. Tree-based Classifiers: A tree is a convenient way to break large data sets into smaller ones. By presenting a learning set to the root and asking questions at each interior node, the data at the leaves can often be analyzed very simply. For example, a classifier to predict the likelihood that a credit card transaction is fraudulent may use an interior node to divide a training data set into two sets, depending upon whether or not five or fewer transactions were processed during the previous hour. After a series of such questions, each leaf can be labeled fraud/no-fraud by using a simple majority vote. Tree based classifiers were independently invented in information theory, statistics, pattern recognition and machine learning. Graphical Models and Hierarchical Probabilistic Representations: A directed graph is a good means of organizing information about qualitative knowledge about conditional independence and causality gleamed from domain experts. Graphical models generalize Markov models and hidden Markov models, which have proved themselves to be a powerful modeling tool. Graphical models were independently invented by computational probabilists and artificial intelligence researchers studying uncertainty. Ensemble Learning: Rather than use data mining to build a single predictive model, it is often better to build a collection or ensemble of models and to combine them, say with a simple, efficient voting strategy. This simple idea has now been applied in a wide variety of 213

contexts and applications. In some circumstances, this technique is known to reduce variance of the predictions and therefore to decrease the overall error of the model. Linear Algebra: Scaling data mining algorithms often depends critically upon scaling underlying computations in linear algebra. Recent work in parallel algorithms for solving linear system and algorithms for solving sparse linear systems in high dimensions are important for a variety of data mining applications, ranging from text mining to detecting network intrusions. Large Scale Optimization: Some data mining algorithms can be expressed as large-scale, often non-convex, optimization problems. Recent work has provided parallel and distributed methods for large-scale continuous and discrete optimization problems, including heuristic search methods for problems too large to be solved exactly. High Performance Computing and Communication: Data mining requires statistically intensive operations on large data sets. These types of computations would not be practical without the emergence of powerful SMP workstations and high performance clusters of workstations supporting protocols for high performance computing such as MPI and MPIO. Distributed data mining can require moving large amounts of data between geographically separated sites, something which is now possible with the emergence of wide area high performance networks. Databases, Data Warehouses, and Digital Libraries: The most time consuming part of the data mining process is preparing data for data mining. This step can be stream-lined in part if the data is already in a database, data warehouse, or digital library, although mining data across different databases, for example, is still a challenge. Some algorithms, such as association algorithms, are closely connected to databases, while some of the primitive operations being built into tomorrow's data warehouses should prove useful for some data mining applications. Visualization of Massive Data Sets: Massive data sets, often generated by complex simulation programs, required graphical visualization methods for best comprehension. Recent advances in multi-scale visualization allow the rendering to be done far more quickly and in parallel, making these visualization tasks practical. 7. Research Challenges The amount of digital data has been exploding during the past decade, while the number of scientists, engineers, and analysts available to analyze the data has been static. To bridge this gap requires the solution of fundamentally new research problems, which can be grouped into the following broad areas: Developing a unifying theory of data mining Scaling up for high dimensional data and high speed data streams Mining sequence data and time series data Mining complex knowledge from complex data Data mining in a network setting Distributed data mining and mining multi-agent data Data mining for biological and environmental problems Data Mining process-related problems Security, privacy and data integrity Dealing with non-static, unbalanced and cost-sensitive data 214

Scaling data mining algorithms: Most data mining algorithms, assume that the data fits into memory. Although success on large data sets is often claimed, usually this is the result of sampling large data sets until they fit into memory. A fundamental challenge is to scale data mining algorithms as: 1. the number of records or observations increases; 2. the number of attributes per observation increases; 3. the number of predictive models or rule sets used to analyze a collection of observations increases; and, 4. as the demand for interactivity and real-time response increases. Not only must distributed, parallel, and out-of-memory versions of current data mining algorithms be developed, but genuinely new algorithms are required. For example, association algorithms today can analyze out-of-memory data with one or two passes, while requiring only some auxiliary data be kept in memory. Extending data mining algorithms to new data types: Most data mining algorithms work with vector-valued data. It is an important challenge to extend data mining algorithms to work with other data types, including 1) time series and process data, 2) unstructured data, such as text, 3) semi-structured data, such as HTML and XML documents, 4) multi-media and collaborative data, 5) hierarchical and multi-scale data, and 6) and collection-valued data. Developing distributed data mining algorithms: Today most data mining algorithms require bringing all together data to be mined in a single, centralized data warehouse. A fundamental challenge is to develop distributed versions of data mining algorithms so that data mining can be done while leaving some of the data in place. In addition, appropriate protocols, languages, and network services are required for mining distributed data to handle the meta-data and mappings required for mining distributed data. As wireless and pervasive computing environments become more common, algorithms and systems for mining the data produced by these types of systems must also be developed. Ease of Use: Data mining today is at best a semi-automated process and perhaps destined to always remain so. On the other hand, a fundamental challenge is to develop data mining systems which are easier to use, even by casual users. Relevant techniques include improving user interface, supporting casual browsing and visualization of massive and distributed data sets, developing techniques and systems to manage the meta-data required for data mining, and developing appropriate languages and protocols for providing casual access to data. In addition, the development of data mining and knowledge discovery environments which address the process of collecting, processing, mining, and visualizing data, as well as the collaborative and reporting aspects necessary when working with data and information derived from it, is another important fundamental challenge. Privacy and Security: Data mining can be a powerful means of extracting useful information from data. As more and more digital data becomes available, the potential for misuse of data mining grows. A fundamental challenge is to develop privacy and security models and protocols 215

appropriate for data mining and to ensure that next generation data mining systems are designed from the ground up to employ these models and protocols. 8. Conclusions Data mining and knowledge discovery are emerging as a new discipline with important applications to science, engineering, health care, education, and business. Data mining rests firmly on 1) research advances obtained during the past two decades in a variety of areas and 2) more recent technological advances in computing, networking and sensors. Data mining is driven by the explosion of digital data and the scarcity of scientists, engineers, and domain experts available to analyze it. Data mining is beginning to contribute research advances of its own, by providing scalable extensions and advances to work in associations, ensemble learning, graphical models, techniques for on-line discovery, and algorithms for the exploration of massive and distributed data sets. Advances in data mining requires a) supporting single investigators working in data mining and the underlying research domains supporting data mining; b) supporting inter-disciplinary and multi-disciplinary research groups working on important basic and applied data mining problems; and c) supporting the appropriate test-beds for mining large, massive and distributed data sets. Appropriate privacy and security models for data mining must be developed and implemented. 216