Data. Data Mining in Cognitive Science. How to look for patterns. From data to information 9/6/2010

Similar documents
Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Introduction to Data Mining

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Data Mining: Introduction

Foundations of Artificial Intelligence. Introduction to Data Mining

Data mining for prediction

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Introduction to Data Mining

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Information Management course

Introduction. A. Bellaachia Page: 1

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Information Visualization WS 2013/14 11 Visual Analytics

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

Big Data. Introducción. Santiago González

Data Mining Analytics for Business Intelligence and Decision Support

not possible or was possible at a high cost for collecting the data.

Data Mining: Concepts and Techniques

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Database Marketing, Business Intelligence and Knowledge Discovery

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

Data Mining Solutions for the Business Environment

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

DATA MINING - 1DL105, 1Dl111

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Data Mining. Yeow Wei Choong Anne Laurent

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Data Isn't Everything

Comparison of K-means and Backpropagation Data Mining Algorithms

Statistics for BIG data

Data Mining and Machine Learning in Bioinformatics

SPATIAL DATA CLASSIFICATION AND DATA MINING

Quick Introduction of Data Mining Techniques

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

FREQUENT PATTERN MINING FOR EFFICIENT LIBRARY MANAGEMENT

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Database management System, Data warehousing and Data Mining

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Machine Learning and Data Mining. Fundamentals, robotics, recognition

This software agent helps industry professionals review compliance case investigations, find resolutions, and improve decision making.

Data Mining System, Functionalities and Applications: A Radical Review

Sanjeev Kumar. contribute

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Data Mining Introduction

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Introduction to Data Mining

Dynamic Data in terms of Data Mining Streams

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

DATA WAREHOUSE AND DATA MINING NECCESSITY OR USELESS INVESTMENT

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

The Data Mining Process

A Review of Data Mining Techniques

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Data Mining Applications in Higher Education

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Chapter ML:XI. XI. Cluster Analysis

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

Data Mining. Introduction to Modern Information Retrieval from Databases and the Web. Administrivia

Foundations of Business Intelligence: Databases and Information Management

Chapter 20: Data Analysis

CHAPTER 1 INTRODUCTION

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

Concept and Applications of Data Mining. Week 1

MBA Data Mining & Knowledge Discovery

CSE4334/5334 Data Mining Lecturer 2: Introduction to Data Mining. Chengkai Li University of Texas at Arlington Spring 2016

An Empirical Study of Application of Data Mining Techniques in Library System

Data Mining Techniques

Role of Social Networking in Marketing using Data Mining

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

Introduction to Data Mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Master of Science in Marketing Analytics (MSMA)

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

I N T E L L I G E N T S O L U T I O N S, I N C. DATA MINING IMPLEMENTING THE PARADIGM SHIFT IN ANALYSIS & MODELING OF THE OILFIELD

Social Media Mining. Data Mining Essentials

Knowledge Discovery Process and Data Mining - Final remarks

Transforming the Telecoms Business using Big Data and Analytics

A New Approach for Evaluation of Data Mining Techniques

Categorical Data Visualization and Clustering Using Subjective Factors

Quality Assessment in Spatial Clustering of Data Mining

Assessing Data Mining: The State of the Practice

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Transcription:

Data Mining in Cognitive Science Chen Yu Indiana University Acknowledgement: Some of slides are adapted from the slides created by Drs. Balac, Tan, Steinbach, Kumar, Clifton, Tuchinda. Data We are overwhelmed with data. The amount of data in the world, in our lives, seems to be go on and on increasing. Ubiquitous electronics record our decisions, our choices in the supermarket, our financial habits, and our comings and goings. The WWW overwhelms us with information, but meanwhile every choice we make is recorded. The amount of data stored in the world s databases doubles every 20 months. From data to information Most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns that underlie the data. How to look for patterns Politicians seek patterns in voter opinions Lovers seek patterns in their partner's responses. Entrepreneurs identify opportunities, patterns that can be turned into a profitable business, and exploit them. Hunters seek patterns in animal migration behavior. A scientist s job is to make sense of data, to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations. 1

Data and Knowledge Discovery There is a growing gap between the generation of data and our understanding of it. Potentially useful information is hidden in the data that is rarely made explicit so that we can take advantage of. Very little data will ever be looked at by a human. Data Mining Definition Data mining is defined as the process of discovering patterns in data. The process must be automatic (or semiautomatic). The patterns discovered must be meaningful in that they lead to some advantage. Data Mining Examples Grocery store NBA: Advanced Scout http://www.springerlink.com/content/r32q73 7858160r97/ 8160 Personalization & Customer Profiling Campaign Management Necessity Is the Mother of Invention Data explosion problem Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories We are drowning in data, but starving for knowledge! Solution: data mining Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases 2

Data Mining Tasks Prediction Methods Use some variables to predict unknown or future values of other variables. Description i Methods Find human interpretable patterns that describe the data. Data Mining Tasks... Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for the class attribute as a function of the values of other attributes. Goal: previously unseen records should ldbe assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Classification Given training examples of inputs and corresponding outputs (x1,y1), (x2,y2),, (xn,yn), produce the correct outputs for new inputs. 3

Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem specific Measures. Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances Intercluster distances are minimized are maximized Clustering can be ambiguous Partitional clustering 4

Hierarchical clustering Data Mining Tasks... Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] Association Rule Discovery: Definition Definition of association rule Given a set of records each of which contain some number of items from a given collection; Produce dependency rules which will predict occurrence of an item based on occurrences of other items. TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer} 5

Results Regression Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Studied in statistics. Examples: Predicting sales amounts of new product tbased on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices. Sequential Pattern Discovery: Definition Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints. How to deal with noisy data items. Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection (A B) (C) (D E) Typical network traffic at University level may reach over 100 million connections per day 6

Related Fields Related Fields Machine Learning Visualization Machine Learning Visualization Data Mining and Knowledge Discovery Data mining in cognitive science Statistics Databases Statistics Human cognition and learning 25 26 27 Knowledge Discovery Definition Knowledge Discovery in Data is the non trivial process of identifying valid novel potentially useful and ultimately t l understandable d patterns in data. from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996 Why Mine Data? Scientific Viewpoint Data collected and stored at enormous speeds (GB/hour) remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression data scientific simulations generating terabytes of data Traditional techniques infeasible for raw data Data mining may help scientists in classifying and segmenting data in Hypothesis Formation 7

Knowledge Discovery Process Integration Steps of a KDD Process RawData DATA Ware house Target Data Transformed Data Interpretation & Evaluation Patterns and Rules adapted from: U. Fayyad, et al. (1995), From Knowledge Discovery to Data Mining: An Overview, Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press Knowledge Knowledge Und derstanding Learning the application domain relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 50% of effort!) Data reduction and transformation Find useful features, dimensionality/variable reduction, invariant representation. Choosing functions of data mining summarization, classification, regression, association, clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge Data cleaning and preprocessing You didn t collect it yourself. Data are not in the right format (e.g., with the sampling rate). People make mistakes (typos) People are busy ( this is good enough ) Data Mining and Scientific Discovery Thousand years ago: science was purely empirical: describing natural phenomena. Last few hundred years: A theoretical branch, using mathematical equations/models, e.g. Newton s laws of motion, Maxwell s equations. Last few decades: a computational branch, the theoretical models grew too complicated to solve analytically, and people had to start simulating. 8

Fourth paradigm: data intensive science Much of the vast volume of scientific data captured by new instruments on a 24/7 basis, along with simulated data from computer models. The data is likely to reside forever in a substantially publically accessible, curated state for the purposes of continued analysis. This analysis will result in the development of many new theories. Behavioral studies What we think people are doing What models tell us what people are doing What conclusions we can infer from empirical datacollectedfrom well designedexperiments experiments What fine grained data suggest, what patterns can be extracted from such data, and how those patterns relate to principles/theories of human cognition and learning. Advances in sensing techniques Data is Information Rich Cognitive processes/mechanisms are revealed by real time behaviors. E.g. eye movements. 9

Data reduction From data capture to data curation to data analysis The published results are just the tip of the data iceberg. People collect a lot of data and then reduce this down to some number of column inches in Science or Nature. Hunting the fox There are many different ways to reduce data and extract patterns. there is a lot of data that is collected but not curated/coded or analyzed/published. Rewarding: There are many opportunties to extract new patterns and infer new principles even from the same dataset. Frustrating: There are a few ways to do it properly and probably many incorrect ways that lead to nothing. There are also risks/pitfalls of making mistakes. Example Eye gaze data based on Areas of Interest (AOIs) e.g. aabbcdddbcccddabbbaddbb Different ways to look at this data: Proportion of looking time on each region. Average duration, overall or each region # of times looking at each AOI # of switches: transition ii matrix Histogram of looking durations, overall or each region Frequent sequential patterns Patterns within a certain timing event, e.g. the first 5 seconds Recalculating by filtering small/big events (e.g. short fixations). Within or across participants Grouping participants or instances across participants Relative values vs. absolute values Extensions from the example What if we want to analyze raw coordinate data but not Areas of Interest (AOIs) from categorical values to continuous values. What if the data has several dimensions at a single moment. What if we deal with a heterogeneous dataset with other measures. 10

Key issues At what stage to reduce/transform what aspects of the data, e.g. noisy items, from continuous to categorical, when to aggregate the data over time and to what degree. After doing that, being aware of limitation and simplification. Always go back to raw data (through information visualization). Always try to make sense of data (through human involvement). Human in the loop computation Two levels: visualization /human perception Theoretic driven Information Visualization Visualization uses human perception to recognize patterns in large data sets Advantages relative to data mining Perceive unconsidered patterns Recognize non linear relationships E.g. windirstat Disadvantages relative to data mining Hard to recognize small patterns Difficult to quantify results Data Mining and Visualization Approaches Visualization to display results of data mining Help analyst to better understand the results of the data mining tool Visualization to aid the data mining process Interactivecontrol control over the data exploration process Interactive steering of analytic approaches Interactive data mining issues Relationships between the analyst, the data mining tool and the visualization tool Analyst Data Mining Tool Visualized result CS590D 44 11

Intelligent data analysis Pattern discovery Pattern intepretation: theoretical driven Interactive and incremental Intelligible data analysis This course Information processing and visualization Data mining and machine learning algorithms, focusing on temporal data mining Data mining engineering: e.g. data cleaning and data preprocessing Basic math Three Flavors Computer skills using matlab to write some simple functions compiling and running some existing software systems. Human cognition Theories Assignment Type I Commenting on papers Three parts/slides: what are most compelling ideas/techniques in a paper. pp what are new challenges based on the present work. how to use (and extend) the ideas/techniques in the paper in your data/research, what research topics would be ideal for the techniques in the paper, and what aspects of the data the present techniques cannot handle. 12

Assignment Type 2 Presenting new papers Introducing the overall idea of a paper. Leading the discussion. Demonstratingthe the software with some dataset. Assignment Type 3 Installing and running some existing data mining/visualization software applications Feeding your data into such system to generate results. Assignment Type 4 Writing some Matlab functions to data mine your data and explore new patterns from the data. Final Project Using what you ve learned in the class to datamine a dataset. Wii h h b Writing a research report that can be extended to conference/journal submissions. 13

Grading Weekly Assignments: 50% Final Project: 30% Discussion and Participation: 20% http://www.indiana.edu/~dll/p657_dm.html 14