Data Mining An introduction


1 Data Mining: An introduction. Devert Alexandre, School of Software Engineering of USTC, 13 February 2012

2 Table of Contents

3 Purpose: Data mining is looking for data inside data.

4 Purpose: But what's the point of looking for data in data?

5 Purpose: Data mining is looking for small, meaningful data inside a lot of raw data.

6 Table of Contents

7 Dataset: A dataset is a lump of data, usually without much structure.

8 Old Faithful: Old Faithful is a geyser located in Wyoming, in Yellowstone National Park, in the United States.

9 Old Faithful: A geyser can teach us a lot about what's going on underground.

10 Old Faithful: Geologists observe the geyser's activity: (1) eruption duration, (2) time since the previous eruption, (3) geyser height. Figure labels: duration, interval, height.

11 Old Faithful: A quick look shows us that Old Faithful is not random. Figure axes: interval, duration.

12 Old Faithful: A quick look shows us that Old Faithful is not random. Figure axes: height, interval.

13 Old Faithful: A quick look shows us that Old Faithful is not random. Figure axes: height, duration.

14 Old Faithful: Geologists would like to know: (1) how to sum up all those data? (2) can we learn something new? (3) can we predict the eruptions? (4) can we detect anomalies?

15 Planets discovery: Since 1992, astronomers have found direct evidence of planets around other stars. As of 4 February 2012, there were 758 known extrasolar planets around 707 stars.

16 Planets discovery: One way to find planets is the transit method. It works very well!

17 Planets discovery: On 7 March 2009, the Kepler space observatory was launched and put into orbit around the Sun. Kepler performs the transit method on stars.

18 Planets discovery: Kepler's data are not easy to analyse: star luminosity is variable; there is no such thing as a perfect sensor; the useful signal level is close to the noise level.

19 Planets discovery: A typical extract of Kepler's data (figure).

20 Planets discovery: Kepler returns a lot of data: 100 Gb/month. Looking through all the data takes years of work. The rate of false detections is high. Confirming a planet candidate is expensive.

21 On-demand streaming media: One very popular use of the Internet is watching movies.

22 On-demand streaming media: You have tens of thousands of movies. You have millions of users. How do you recommend to each user movies they will like?

23 On-demand streaming media: So important that the company Netflix offered $1 million to solve that problem.

24 On-demand streaming media: There is too much data to search by hand: 78 million past recommendations to analyse. What are the different kinds of users? What factors change users' preferences?

25 More applications: More examples: financial planning and prediction; molecule discovery for new drugs; large network monitoring; factory monitoring; market studies; ...

26 Table of Contents

27 Problems categories: Several types of problems have been identified: clustering; classification; regression; dimension reduction.

28 Clustering: Putting the elements of a dataset into groups of related elements.

29 Clustering: Clustering tries to find the number of different groups of similar data, and which data belong to which group. There are many clustering algorithms.
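As an illustration of what one clustering algorithm looks like, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The slides do not name a specific algorithm, so the choice of k-means, the function names, and the sample points are all assumptions made for this example. Note that k-means needs the number of groups k to be given; it only answers the question of which data belong to which group.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm on a list of (x, y) tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The two groups end up with different labels; which group gets label 0 depends on the random starting centers.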

30 Applications: Finding groups is a very popular data-mining application: finding groups of customers; automatic suggestion; data fusion; picture segmentation; data compression.

31 Data fusion: Tags are very helpful for searching information. But there are many tags for the same things, or for similar things.

32 Data fusion: Clustering the tags makes the tagging system even more useful.

33 Picture segmentation: Take the colors of a picture and cluster them. Each pixel belongs to a cluster. A cheap and effective processing step for object recognition!

34 Picture segmentation: (1) Take many pictures of faces. (2) Take their colors. (3) Compute clusters of colors. (4) Keep the clusters containing skin-like colors.

35 Picture segmentation: Instead of using just the pixel colors, one can use: local orientation (Gabor filters); Fourier coefficients; ...

36 Picture segmentation

37 Data compression: Find groups of related colors to reduce the number of colors. Figure panels: 24 bits/pixel; 4 bits/pixel; 4 bits/pixel, dithered.
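The bit counts follow from the palette size: once the colors are grouped into a palette, each pixel only stores the index of its nearest palette color, and 16 palette colors need 4 bits. A minimal sketch, assuming a hand-picked 4-color palette that stands in for cluster centers (the palette, the pixels, and the function name are hypothetical):

```python
def nearest_index(pixel, palette):
    """Index of the palette color closest to an (r, g, b) pixel."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
    )

# Hypothetical palette; in practice it would come from clustering the picture's colors.
palette = [(0, 0, 0), (255, 255, 255), (200, 30, 30), (30, 30, 200)]
pixels = [(10, 5, 0), (250, 250, 245), (190, 40, 35), (20, 40, 210), (210, 20, 25)]

# Each 24-bit pixel is replaced by a small palette index.
indices = [nearest_index(p, palette) for p in pixels]
bits_per_index = max(1, (len(palette) - 1).bit_length())

print(indices)         # [0, 1, 2, 3, 2]
print(bits_per_index)  # 2 bits for 4 colors; 16 colors would need 4 bits
```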


39 Classification: You have separated your dataset into 2 or more groups. A classifier tells which group new, incoming data instances belong to. The classifier is built from existing example data. All of that together is called classification.

40 Classification: Note the (huge) differences with clustering: we don't search for groups in the data; the groups are already defined; there is a learning step, which builds the classifier; there are data not from the dataset, and those are what we classify.

41 Example: the example dataset: all the dots; the groups, or classes: blue and yellow; the classifier: the red line; classification: blue side or yellow side of the line.

42 Classification: Different algorithms give different information: some say which group a data instance belongs to; some give the probability of belonging to a given group.

43 Classification: Different algorithms work differently: a classifier built example by example is an iterative, or online, algorithm.
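A classifier that is a straight line and is built example by example can be sketched with the perceptron rule. The slides do not say which algorithm produced the red line, so the perceptron, the function names, and the training points below are assumptions for illustration; the labels -1 and +1 play the role of the two colors.

```python
def train_perceptron(examples, epochs=20, rate=1.0):
    """examples: list of ((x, y), label) with label in {-1, +1}.

    Returns the weights w and bias b of the line w[0]*x + w[1]*y + b = 0.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x, y), label in examples:
            # Online update: adjust the line only when it misclassifies this example.
            if label * (w[0] * x + w[1] * y + b) <= 0:
                w[0] += rate * label * x
                w[1] += rate * label * y
                b += rate * label
    return w, b

def classify(w, b, point):
    """Report which side of the line a new point falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# Hypothetical training examples: two linearly separable groups.
examples = [((1.0, 1.0), -1), ((1.5, 0.5), -1), ((0.5, 1.5), -1),
            ((4.0, 4.0), +1), ((4.5, 3.5), +1), ((3.5, 4.5), +1)]
w, b = train_perceptron(examples)

print(classify(w, b, (0.0, 0.0)))  # a new point near the first group
print(classify(w, b, (5.0, 5.0)))  # a new point near the second group
```

The learning step is `train_perceptron`; `classify` is then applied to points that were never part of the dataset. The perceptron only reports a side of the line, not a probability.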

44 Applications: Obviously, automatic recognition: finding objects in pictures; speech recognition; optical character recognition; biometric identification; document classification; ...

45 Models: Let's say you have a model that produces data. A model can be: a simulation of the system you get data from; equations of something you observe; a relation between some variables of your data.

46 Models: A model to predict the luminosity of a star: size, mass, energy of the star; number of planets; distance, speed, mass, size of a planet.

47 Regression: Regression tries to accomplish the following goal: tuning a model so that the model gives the best explanation of some dataset you have.

48 Example: model for the data: y = ax + b; dots: regression data; red line: tuned model.

49 Example: model for the data: y = ax^2 + bx + c; dots: regression data; red line: tuned model.
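For the straight-line model y = ax + b, one common way to tune a and b is ordinary least squares, which has a closed-form solution. The sketch below assumes that criterion (the slides do not state which one was used) and uses made-up data points.

```python
def fit_line(xs, ys):
    """Least-squares estimate of a and b in y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical regression data (the dots).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(xs, ys)  # the tuned model (the red line)
print(round(a, 2), round(b, 2))  # 1.97 1.06
```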

50 Regression: Different algorithms give different information: some find a tuning of the model that matches the data; some say how well a tuning of the model matches the data.

51 Regression: Regression does not give explanations of data. With enough parameters, any model can generate any data. A wrong model might be able to generate the data you used for the regression.
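One concrete instance of this warning: a polynomial with as many coefficients as there are data points can pass exactly through all of them, whatever they are. The sketch below uses Lagrange interpolation on arbitrary values; the data and the function name are made up for illustration.

```python
def lagrange_fit(xs, ys):
    """Return a polynomial of degree len(xs) - 1 passing through every (x, y)."""
    def poly(t):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (t - xj) / (xi - xj)
            total += term
        return total
    return poly

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, -1.0, 4.0, -1.0, 5.0]  # arbitrary values with no underlying law

model = lagrange_fit(xs, ys)  # 5 points, 5 free parameters
errors = [abs(model(x) - y) for x, y in zip(xs, ys)]
print(max(errors))  # the fit on the regression data is perfect
```

A perfect match on the data used for the regression says nothing about whether the model is right; its value between or beyond those points can be arbitrary.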

52 Applications: prediction; automatic recognition; data compression; anomaly detection.

53 Prediction: If you trust your model: tune your model with past data; generate future data with the tuned model.

54 Data compression: If your model is smaller than the dataset you used for the regression, and is accurate enough, you have a lossy compression scheme!

55 Data compression: 100 dots, but 2 coefficients might be good enough to sum up the data.
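The 100-dots case can be sketched as follows, assuming synthetic dots that lie near a straight line and a least-squares fit; the data, the noise level, and the variable names are invented for the example.

```python
import random

rng = random.Random(0)
xs = [i / 10 for i in range(100)]
# Synthetic dots: a straight line plus small bounded noise (made up for this sketch).
ys = [2.0 * x + 1.0 + rng.uniform(-0.1, 0.1) for x in xs]

# Least-squares fit of y = a*x + b.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
a = sxy / sxx
b = mean_y - a * mean_x

compressed = (a, b)                 # 2 numbers stored instead of 100
restored = [a * x + b for x in xs]  # lossy reconstruction of the dots
max_error = max(abs(r - y) for r, y in zip(restored, ys))
print(compressed, max_error)
```

Whether the scheme is acceptable depends on whether `max_error` is small enough for the application; the noise itself is what gets lost.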

56 Anomaly detection & automatic recognition: With some families of models: tune your model to match normal data; some models can tell how likely it is that some data were generated by the model.
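A Gaussian distribution is one simple model family of this kind: it is tuned by computing a mean and a standard deviation on normal data, and it then assigns a likelihood to any new value. The sketch below is an assumed illustration; the measurements, the 3-standard-deviation threshold, and the function names are hypothetical.

```python
import math

def fit_gaussian(values):
    """Tune the model: mean and standard deviation of the normal data."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def log_likelihood(value, mean, std):
    """Log of the Gaussian density: how likely the model is to generate value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)

# Hypothetical measurements considered normal.
normal_data = [78, 82, 75, 80, 85, 79, 81, 77, 83, 80]
mean, std = fit_gaussian(normal_data)

# Assumed rule: flag anything less likely than a value 3 standard deviations away.
threshold = log_likelihood(mean + 3 * std, mean, std)

def is_anomaly(value):
    return log_likelihood(value, mean, std) < threshold

print(is_anomaly(81))   # False
print(is_anomaly(120))  # True
```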

57 Dimension reduction: Raw data are often in unfamiliar spaces with weird geometries.

58 Dimension reduction: Take a space of handwritten letter shapes: allographs. A space with 1000 dimensions, mapped into 3 dimensions.

59 Dimension reduction: How to define the distance between 2 shapes? How to build a meaningful map of the shapes? Where would a new shape be on the map?

60 Dimension reduction: Take the Netflix movie database. How to define the distance between 2 movies? How to build a meaningful map of the movies? Where would a new movie be on the map?

61 Dimension reduction: Dimension reduction tries to accomplish the following goal: mapping a dataset into a low-dimensional space, such that related data are close and less related data are far apart.
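One classical linear technique for this goal is principal component analysis (PCA), which projects the data onto the directions of largest variance. The slides do not name a method, so PCA is an assumed illustration here; the sketch reduces made-up 2-D points to 1 dimension and finds the main direction by power iteration.

```python
import math

def first_principal_axis(points, iterations=100):
    """Mean and direction of largest variance of 2-D points, by power iteration."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 0.5  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return (mx, my), (vx, vy)

# Hypothetical data lying close to a diagonal line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
mean, axis = first_principal_axis(points)

# The low-dimensional map: one coordinate per point along the main axis.
coords = [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1] for p in points]
print(coords)
```

A new point is placed on the map with the same projection formula, so points that are close in the original space stay close in the reduced one. Nonlinear datasets, such as folded ones, need other techniques.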

62 Applications: Dimension reduction is often used as a pre-processing step: data visualization; automatic recognition; data compression.

63 Data visualization: Dimension reduction helps to simplify data while preserving their meaning. Many algorithms would fail on the folded dataset, but work well on the unfolded version.

64 Automatic recognition: Use dimension reduction on complex data, then clustering to find groups.

65 Data compression: Dimension reduction techniques can build codebooks. You can approximate each of the 550 members of the Turkish parliament by combining those faces!

66 Table of Contents

67 Real-world data-mining: Real-world data-mining never fits perfectly into one problem category.

68 Real-world data-mining: Real-world data-mining is a blend of tweaked versions of standard algorithms.

69 Real-world data-mining: My goal: giving you the basics to understand real-world data-mining techniques.


More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

A Demonstration of a Robust Context Classification System (CCS) and its Context ToolChain (CTC)

A Demonstration of a Robust Context Classification System (CCS) and its Context ToolChain (CTC) A Demonstration of a Robust Context Classification System () and its Context ToolChain (CTC) Martin Berchtold, Henning Günther and Michael Beigl Institut für Betriebssysteme und Rechnerverbund Abstract.

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Machine Learning. CS494/594, Fall 2007 11:10 AM 12:25 PM Claxton 205. Slides adapted (and extended) from: ETHEM ALPAYDIN The MIT Press, 2004

Machine Learning. CS494/594, Fall 2007 11:10 AM 12:25 PM Claxton 205. Slides adapted (and extended) from: ETHEM ALPAYDIN The MIT Press, 2004 CS494/594, Fall 2007 11:10 AM 12:25 PM Claxton 205 Machine Learning Slides adapted (and extended) from: ETHEM ALPAYDIN The MIT Press, 2004 alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml What

More information

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA N. Zarrinpanjeh a, F. Dadrassjavan b, H. Fattahi c * a Islamic Azad University of Qazvin - nzarrin@qiau.ac.ir

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

Obtaining Value from Big Data

Obtaining Value from Big Data Obtaining Value from Big Data Course Notes in Transparency Format technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Data deluge, is it enough?

More information

Detection of Transiting Planet Candidates in Kepler Mission Data

Detection of Transiting Planet Candidates in Kepler Mission Data Detection of Transiting Planet Candidates in Kepler Mission Data Peter Tenenbaum For the Kepler Transiting Planet Search Team 2012-June-06 SAO STScI! The Kepler Mission A space-based photometer searching

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

TDA and Machine Learning: Better Together

TDA and Machine Learning: Better Together TDA and Machine Learning: Better Together TDA AND MACHINE LEARNING: BETTER TOGETHER 2 TABLE OF CONTENTS The New Data Analytics Dilemma... 3 Introducing Topology and Topological Data Analysis... 3 The Promise

More information

De Rotation of Images in Planetary Astrophotography Basic Concepts

De Rotation of Images in Planetary Astrophotography Basic Concepts De Rotation of Images in Planetary Astrophotography Basic Concepts Fernando Rodriguez,SFAAA Contrary to popular belief, high quality planetary imaging is no easy task although planetary imaging is by far

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: ogino@okinawa-ct.ac.jp

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

GAZETRACKERrM: SOFTWARE DESIGNED TO FACILITATE EYE MOVEMENT ANALYSIS

GAZETRACKERrM: SOFTWARE DESIGNED TO FACILITATE EYE MOVEMENT ANALYSIS GAZETRACKERrM: SOFTWARE DESIGNED TO FACILITATE EYE MOVEMENT ANALYSIS Chris kankford Dept. of Systems Engineering Olsson Hall, University of Virginia Charlottesville, VA 22903 804-296-3846 cpl2b@virginia.edu

More information

Crowdsourcing mobile networks from experiment

Crowdsourcing mobile networks from experiment Crowdsourcing mobile networks from the experiment Katia Jaffrès-Runser University of Toulouse, INPT-ENSEEIHT, IRIT lab, IRT Team Ecole des sciences avancées de Luchon Networks and Data Mining, Session

More information

Drugs store sales forecast using Machine Learning

Drugs store sales forecast using Machine Learning Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research Astrophysics with Terabyte Datasets Alex Szalay, JHU and Jim Gray, Microsoft Research Living in an Exponential World Astronomers have a few hundred TB now 1 pixel (byte) / sq arc second ~ 4TB Multi-spectral,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Security visualisation

Security visualisation Security visualisation This thesis provides a guideline of how to generate a visual representation of a given dataset and use visualisation in the evaluation of known security vulnerabilities by Marco

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Machine Learning Capacity and Performance Analysis and R

Machine Learning Capacity and Performance Analysis and R Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10

More information

Best Practices. Create a Better VoC Report. Three Data Visualization Techniques to Get Your Reports Noticed

Best Practices. Create a Better VoC Report. Three Data Visualization Techniques to Get Your Reports Noticed Best Practices Create a Better VoC Report Three Data Visualization Techniques to Get Your Reports Noticed Create a Better VoC Report Three Data Visualization Techniques to Get Your Report Noticed V oice

More information

Establishing the Uniqueness of the Human Voice for Security Applications

Establishing the Uniqueness of the Human Voice for Security Applications Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.

More information

www.quickontheuptake.co.uk 07949 071066 info@quickontheuptake.co.uk

www.quickontheuptake.co.uk 07949 071066 info@quickontheuptake.co.uk What is your subject / title? Generate as much information on your subject as possible below Put a star next to areas of your speech where extra research is needed. What is your core message? Mark on your

More information

How long would we have to wait until the next eruption of Old Faithful?

How long would we have to wait until the next eruption of Old Faithful? MDM 4U - Student Page How Faithful is Old Faithful? An Activity in Statistical Thinking Imagine that you have just arrived at Yellowstone National Park, the home of geyser basins, thermal mud pots, hot

More information

Visualisatie BMT. Introduction, visualization, visualization pipeline. Arjan Kok Huub van de Wetering (h.v.d.wetering@tue.nl)

Visualisatie BMT. Introduction, visualization, visualization pipeline. Arjan Kok Huub van de Wetering (h.v.d.wetering@tue.nl) Visualisatie BMT Introduction, visualization, visualization pipeline Arjan Kok Huub van de Wetering (h.v.d.wetering@tue.nl) 1 Lecture overview Goal Summary Study material What is visualization Examples

More information

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

More information

The Visualization Pipeline

The Visualization Pipeline The Visualization Pipeline Conceptual perspective Implementation considerations Algorithms used in the visualization Structure of the visualization applications Contents The focus is on presenting the

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph

More information

Data Mining: Introduction

Data Mining: Introduction Data Mining: Introduction Introducing the course How the course is organized How students are evaluated Deadlines Data Mining [Chapt. 1 of course book] What is it about? The KDD process Relations to other

More information

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

SKINAKAS OBSERVATORY. Astronomy Projects for University Students PROJECT THE HERTZSPRUNG RUSSELL DIAGRAM

SKINAKAS OBSERVATORY. Astronomy Projects for University Students PROJECT THE HERTZSPRUNG RUSSELL DIAGRAM PROJECT 4 THE HERTZSPRUNG RUSSELL DIGRM Objective: The aim is to measure accurately the B and V magnitudes of several stars in the cluster, and plot them on a Colour Magnitude Diagram. The students will

More information

Connecting library content using data mining and text analytics on structured and unstructured data

Connecting library content using data mining and text analytics on structured and unstructured data Submitted on: May 5, 2013 Connecting library content using data mining and text analytics on structured and unstructured data Chee Kiam Lim Technology and Innovation, National Library Board, Singapore.

More information

Prediction of Cancer Count through Artificial Neural Networks Using Incidence and Mortality Cancer Statistics Dataset for Cancer Control Organizations

Prediction of Cancer Count through Artificial Neural Networks Using Incidence and Mortality Cancer Statistics Dataset for Cancer Control Organizations Using Incidence and Mortality Cancer Statistics Dataset for Cancer Control Organizations Shivam Sidhu 1,, Upendra Kumar Meena 2, Narina Thakur 3 1,2 Department of CSE, Student, Bharati Vidyapeeth s College

More information