What is Data? Also called samples, examples, instances, data points, objects, tuples.

Similar documents
Lecture 6 - Data Mining Processes

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Data Mining 5. Cluster Analysis

Intro to GIS Winter Data Visualization Part I

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Foundations of Artificial Intelligence. Introduction to Data Mining

2+2 Just type and press enter and the answer comes up ans = 4

Concepts of Variables. Levels of Measurement. The Four Levels of Measurement. Nominal Scale. Greg C Elvers, Ph.D.

Data Mining Apriori Algorithm

Statistics. Measurement. Scales of Measurement 7/18/2012

Lecture 2: Types of Variables

Arithmetic and Algebra of Matrices

Optimization Modeling for Mining Engineers

2-1 Position, Displacement, and Distance

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Chapter 6: Constructing and Interpreting Graphic Displays of Behavioral Data

Microsoft Azure Machine learning Algorithms

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Data Storage 3.1. Foundations of Computer Science Cengage Learning

IBM SPSS Direct Marketing 23

Math 215 HW #6 Solutions

Quantitative vs. Categorical Data: A Difference Worth Knowing Stephen Few April 2005

IBM SPSS Direct Marketing 22

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

CSE4334/5334 Data Mining Lecturer 2: Introduction to Data Mining. Chengkai Li University of Texas at Arlington Spring 2016

Elementary Statistics

Big Data Text Mining and Visualization. Anton Heijs

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

Exploratory Data Analysis

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Data Mining: Introduction

Introduction to Data Mining

Business Statistics: Intorduction

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Chapter 1: The Nature of Probability and Statistics

SOST 201 September 18-20, Measurement of Variables 2

Physics Notes Class 11 CHAPTER 3 MOTION IN A STRAIGHT LINE

MAT 242 Test 2 SOLUTIONS, FORM T

In order to describe motion you need to describe the following properties.

Visualization Software

Largest Fixed-Aspect, Axis-Aligned Rectangle

Data Exploration Data Visualization

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

1 Determinants and the Solvability of Linear Systems

Data Mining: An Overview. David Madigan

Statistical research is always concerned with a group of research objects, called population or universe (populaatio/perusjoukko).

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Data Mining: Data Preprocessing. I211: Information infrastructure II

Framing Business Problems as Data Mining Problems

Chapter 4 One Dimensional Kinematics

Measurement and Measurement Scales

Data, Measurements, Features

6. Cholesky factorization

Descriptive Statistics and Measurement Scales

MAT 200, Midterm Exam Solution. a. (5 points) Compute the determinant of the matrix A =

Name: Section Registered In:

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite

CHAPTER 3: DIGITAL IMAGING IN DIAGNOSTIC RADIOLOGY. 3.1 Basic Concepts of Digital Imaging

Statistics Review PSY379

Introduction to Data Mining. Lijun Zhang

APPLICATIONS AND MODELING WITH QUADRATIC EQUATIONS

System Identification for Acoustic Comms.:

Algebra 1 Course Information

Schedule of Accreditation issued by United Kingdom Accreditation Service 2 Pine Trees, Chertsey Lane, Staines-upon-Thames, TW18 3HR, UK

Math Content by Strand 1

Logistic Regression (1/24/13)

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Social Media Mining. Data Mining Essentials

6 Scalar, Stochastic, Discrete Dynamic Systems

Math Quizzes Winter 2009

with functions, expressions and equations which follow in units 3 and 4.

IBM SPSS Direct Marketing 19

Answer Key for California State Standards: Algebra I

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

A Quick Guide to Constructing an SPSS Code Book Prepared by Amina Jabbar, Centre for Research on Inner City Health

Ratios and Proportional Relationships Set 1: Ratio, Proportion, and Scale... 1 Set 2: Proportional Relationships... 8

Implementing GIS in Optical Fiber. Communication


Digital Versus Analog Lesson 2 of 2

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

How To Understand The Scientific Theory Of Evolution

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Understanding Raster Data

Principles of Data Mining

GE Medical Systems Training in Partnership. Module 8: IQ: Acquisition Time

South Carolina College- and Career-Ready (SCCCR) Pre-Calculus

Example SECTION X-AXIS - the horizontal number line. Y-AXIS - the vertical number line ORIGIN - the point where the x-axis and y-axis cross

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Representing Geography

Digital Imaging and Multimedia. Filters. Ahmed Elgammal Dept. of Computer Science Rutgers University

Vectors. Objectives. Assessment. Assessment. Equations. Physics terms 5/15/14. State the definition and give examples of vector and scalar variables.

Core Components Data Type Catalogue Version October 2011

Measurement. How are variables measured?

NOTES ON LINEAR TRANSFORMATIONS

Module 2: Introduction to Quantitative Data Analysis

Foundation of Quantitative Data Analysis

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Solving Systems of Linear Equations Using Matrices

Jim Lambers MAT 169 Fall Semester Lecture 25 Notes

Transcription:

What is Data? Data sets are made up of data objects. A data object represents an entity. Also called samples, examples, instances, data points, objects, tuples. Data objects are described by attributes. An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. Attribute is also known as variable, field, characteristic, or feature A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance Data Mining 1

10 A Data Object Attributes Tid Refund Marital Status Taxable Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No Cheat Database rows -> data objects; columns -> attributes. Objects 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Data Mining 2

Attributes Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers But properties of attribute values can be different; ID has no limit but age has a maximum and minimum value Data Mining 3

Attribute Types There are different types of attributes Nominal: categories, states, or names of things Hair_color = {auburn, black, blond, brown, grey, red, white} marital status, occupation, ID numbers, zip codes Binary Nominal attribute with only 2 states (0 and 1) Ordinal Values have a meaningful order (ranking) but magnitude between successive values is not known. Size = {small, medium, large}, grades, army rankings Data Mining 4

Attribute Types: Numeric Attribute Types Quantity (integer or real-valued) Interval Measured on a scale of equal-sized units Values have order E.g., temperature in C or F, calendar dates No true zero-point Ratio Inherent zero-point We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K ). e.g., temperature in Kelvin, length, counts, monetary quantities Data Mining 5

Discrete vs. Continuous Attributes Discrete Attribute Has only a finite or countably infinite set of values E.g., zip codes, profession, or the set of words in a collection of documents Sometimes, represented as integer variables Note: Binary attributes are a special case of discrete attributes Continuous Attribute Has real numbers as attribute values E.g., temperature, height, or weight Practically, real values can only be measured and represented using a finite number of digits Continuous attributes are typically represented as floating-point variables Data Mining 6

Types of data sets Record Relational records Data matrix, e.g., numerical matrix, crosstabs Document data: text documents: term-frequency vector Transaction data Graph and network World Wide Web Social or information networks Molecular Structures Ordered Video data: sequence of images Temporal data: time-series Sequential Data: transaction sequences Genetic sequence data Spatial, image and multimedia: Spatial data: maps Image data: Video data: Data Mining 7

Important Characteristics of Structured Data Dimensionality Curse of dimensionality Sparsity Only presence counts Resolution Patterns depend on the scale Distribution Centrality and dispersion Data Mining 8

10 Record Data Data that consists of a collection of records, each of which consists of a fixed set of attributes Tid Refund Marital Status Taxable Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No Cheat 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Data Mining 9

Data Matrix If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute Projection of x Load Projection of y load Distance Load Thickness 10.23 5.27 15.22 2.7 1.2 12.65 6.25 16.22 2.2 1.1 Data Mining 10

Document Data Each document becomes a `term' vector, each term is a component (attribute) of the vector, the value of each component is the number of times the corresponding term occurs in the document Data Mining 11

Transaction Data A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk Data Mining 12

Graph Data Examples: Generic graph and HTML Links 5 2 2 1 <a href="papers/papers.html#bbbb"> Data Mining </a> <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a> <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a> <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers 5 Data Mining 13

Chemical Data Benzene Molecule: C 6 H 6 Data Mining 14

Sequences of transactions Items/Events Ordered Data An element of the sequence Data Mining 15

Ordered Data Genomic sequence data GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG Data Mining 16