An Introduction to Data Mining



Similar documents
An Overview of Knowledge Discovery Database and Data mining Techniques

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to WEKA. As presented by PACE

Learning outcomes. Knowledge and understanding. Competence and skills

Machine learning for algo trading

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

Introduction to Data Mining

MS1b Statistical Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Big Data Analytics and Optimization

Data Mining Solutions for the Business Environment

Introduction Predictive Analytics Tools: Weka

Social Media Mining. Data Mining Essentials

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Machine Learning Introduction

Chapter 12 Discovering New Knowledge Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Supervised Learning (Big Data Analytics)

Principles of Data Mining by Hand&Mannila&Smyth

Data Mining. Dr. Saed Sayad. University of Toronto

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Predictive Dynamix Inc

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Machine Learning and Data Mining. Fundamentals, robotics, recognition

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Using Data Mining for Mobile Communication Clustering and Characterization

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

not possible or was possible at a high cost for collecting the data.

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Role of Neural network in data mining

Data Mining + Business Intelligence. Integration, Design and Implementation

Final Project Report

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Algorithms Part 1. Dejan Sarka

Advanced analytics at your hands

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

The Data Mining Process

Sunnie Chung. Cleveland State University

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

IBM SPSS Modeler Premium

IBM SPSS Modeler Professional

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Analytics on Big Data

Introduction. A. Bellaachia Page: 1

Predictive Modeling and Big Data

DATA PREPARATION FOR DATA MINING

How To Predict Web Site Visits

Make Better Decisions Through Predictive Intelligence

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

1. Classification problems

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

DATA MINING ALPHA MINER

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Learning is a very general term denoting the way in which agents:

Information Management course

DATA MINING IN FINANCE

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Linear Classification. Volker Tresp Summer 2015

Machine Learning for Data Science (CS4786) Lecture 1

Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Pentaho Data Mining Last Modified on January 22, 2007

Data Mining Part 5. Prediction

Statistical Models in Data Mining

The University of Jordan

HT2015: SC4 Statistical Data Mining and Machine Learning

Perspectives on Data Mining

Statistical Machine Learning

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Maximizing Return and Minimizing Cost with the Decision Management Systems

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Bayesian networks - Time-series models - Apache Spark & Scala

An In-Depth Look at In-Memory Predictive Analytics for Developers

Comparison of K-means and Backpropagation Data Mining Algorithms

The Prophecy-Prototype of Prediction modeling tool

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

IT services for analyses of various data samples

DATA MINING TECHNIQUES AND APPLICATIONS

Meta-learning. Synonyms. Definition. Characteristics

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Improving spam mail filtering using classification algorithms with discretization Filter

How To Perform An Ensemble Analysis

Transcription:

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014

Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail Regression Classification Clustering Summarization 3 Future Reading Recent Evolution Reference

What is What is is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistic and database system. The actual task is the automatic or semi-automatic analysis of big data to extract unknown interesting patterns, such as groups of data records(cluster analysis), unusual records(anomaly detection), dependencies(association rule mining). 1 Early methods of identifying patterns in data include Bayes theorem (1700s) and regression analysis (1800s). 2 Aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).

Notable Application of Notable Application of Games since the early 1960s, with the availability of oracles for certain combinatorial games, extraction of human-usable strategies from these oracles. Business analysis of historical business activities, to reveal hidden patterns,trends and unknown strategic business information.such as CRM, Customer Behavior, return on investment. Science and Engineering bioinformatics, genetics, medicine, education and electrical power engineering. Others Medical, Sensor, Music, Visual data Mining.

Conference, Software and Applications Conference and Research KDD Conference ACM SIGKDD Conference on Knowledge Discovery and. MLDM Conference Machine Learning and in Pattern Recognition. WSDM Conference ACM Conference on Web Search and.

Conference, Software and Applications Software and Applications Free open-source data mining software and applications NLTK (Natural Language Toolkit) A suite of libraries and programs for symbolic and statistical NLP for the Python. R A programming language and software environment for statistical computing, data mining and graphics. Weka A suite of machine learning software applications written in the Java. Commercial data-mining software and applications IBM SPSS Modeler data mining software provided by IBM. Microsoft Analysis Services data mining software Microsoft. Oracle data mining software by Oracle. Google Predict API Google s cloud-based machine learning tools.

Major Process in Major Process in Pre-processing Before data mining algorithms can be used, a target data set must be assembled.as data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. Data mining Data mining involves six common classes of tasks:anomaly detection, Association rule learning, Clustering, Classification, Regression, Summarization. Results validation verify that the patterns produced by the data mining algorithms occur in the wider data set. such as overfitting and underfitting.

Regression Regression Analysis estimating the relationships among variables, it estimates the conditional expectation of the dependent variable given the independent variables. common used for: 1 prediction and forecasting. 2 explore the forms of these relationships among the independent and dependent variables. common techniques: 1 linear regression or ordinary least squares. 2 non-linear regression, such as polynomial regression.

Regression Regression Analysis illustration of linear regression: Minimizing Errors an error is the distance between the actual and model data. Figure : simple linear regression Figure : error function

Regression Other Type of Regression what if your data is not a straight line? Logarithmic Regression Y = a + b (ln(x)) Quadratic Regression Y = a x 2 + b x + c Power Regression Y = a x b Exponential Regression Y = a b x

Classification Classification classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Bayesian procedure and discrete procedure. Binary classification and Multiclass classification.

Classification Common Algorithms Classification Algorithms Linear classifiers Logistic Regression Naive Bayes Classifier Perceptron Support Vector Machines Kernel estimation k-nearest neighbor Boosting (meta-algorithm) Decision trees ID3 algorithm C4.5 algorithm Neural networks Bayesian networks Hidden Markov models

Clustering Clustering Cluster Analysis clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Cluster analysis itself is not one specific algorithm, but the general task to be solved.

Clustering Clustering Algorithm Clustering Algorithm Connectivity models for example hierarchical clustering, hierarchical latent dirichlet process, hierarchical dirichlet process. Centroid models for example k-means algorithm represents each cluster by a single mean vector. Distribution models for example topic modeling based LDA, hlda rcrp etc. Graph-based models commonly also known as Bayesian Network, or Probabilistic Graphic Model.

Summarization Summarization the meaning of summarization is general, and mainly contains two aspects: visualization and report generation. Common Visualization Tools Matlab R language Octave Dot language and graphize tools package.

Summarization Summarization Here we lay attention on automatic summarization techniques. Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document Generally, there are two approaches to automatic summarization: extraction and abstraction. And the most used is extraction based techniques, which scores every sentences in documents, and ranking them according to scors rank: Graph based Topic Modeling based linguistics rule based.

Recent Evolution Recent Research Direction hierarchical topic modeling, such as HDP, rcrp, hlda. multi-layer neural network based, such as Deep Learning.

Reference Reference John Smith (2012) History of Association of Compuationall Linguistics 12(3), 45 678.

Reference The Beginning