Data Foundations Data Attributes and Features Data Pre-processing Data Storage Data Analysis 1 Data Attributes Describing data content and characteristics Representing data dimensions Set of all attributes: attribute vector or attribute array. 2 1
Attribute Types Nominal Data Sortable Data 3 Numerical Attributes Discrete vs. Continuous 4 2
Statistical Features of Data Mean: Median: Standard Deviation: 5 Similarity Matrix Distance for Categorical Data: p: total # of types; m: # of same types 6 3
Distance Measures (Numerical) Euclidean Distance: L2 Manhattan Distance: L1 Minkowski Distance: Lp Cosine Distance: s( X, Y ) X X Y Y 7 Data Uncertainty Source of uncertainty Attribute error Missing attributes Data integration error Resolution conversion Application uncertainty 8 4
Data Preprocessing Application database Data Products ETL: Extract, Transform, & Load Data Warehouse Commercial Intelligence Analysis 9 Data Preprocessing 1. Data Cleaning 2. Data Integration Data Quality Accuracy Completeness Consistency Timeliness Believability Interpretability 10 5
Data Visualization Quality Data-Ink Ratio: 11 Data Error Types Missing data Replace with constants Replace with average value of attributes Regression Manual filling Noisy data Regression Outlier analysis 12 6
Visual Data Cleaning Using visualization techniques for data cleaning 13 Data Integration Combining data from multiple sources Structural conflicts Schema differences Data Conflicts Repeated data Providing a uniform visualization 14 7
Data Integration Example Form 1: Form 2: Integrated Form: 15 Data Storage File Systems Databases and DBMS Data Warehouse 16 8
CSV file (comma-separated values) 17 Structured Files XML files: extensible Markup Language <note> <to>tove</to> <from>jani</from> <heading>reminder</heading> <body>don't forget me this weekend!</body> </note> XML Extensions: IVOA VOtable (Space programs), KML (Web maps), etc. Special formats: e.g. HDF (Hierarchical Data Format) for scientific data 18 9
Databases A database is an organized collection of data. The data is typically organized to model aspects of reality in a way that supports processes requiring information. For example, modelling the availability of rooms in hotels in a way that supports finding a hotel with vacancies. 19 Relational Databases DBMS: Database Management System Data definition and query languages (SQL) 20 10
Visualization of a database Challenge: Interactivity! Example: National Science Foundation Database: 21 Data Warehouse A data warehouse is a system used for reporting and data analysis. It is a central repository of integrated data from one or more disparate sources. 22 11
Database vs Data Warehouse Database Data Warehouse Purpose Data operation Information in the data Application Business Analysis Users Employee, DB Administrator Analyst, manager, executive Functionalities Daily operations decision making support Data Current Historical, time-variant Access Read, write, mean, etc Read Focus Input, query information output Size 1 GB ~ < 1 TB TBs 23 Data Analysis Statistical Analysis Exploratory Analysis Data Mining 24 12
Statistical Analysis Statistical description: properties, parameters, distribution, correlations. Statistical prediction: using probability methods and sampling theory to predict statistical properties (distribution, correlation, parameters, forecasting, etc.) 25 Data Exploration Using Visualization Raw data drawing Statistical drawing Multi-view 26 13
Data Trajectory 27 Data Comparison 28 14
Trend and Patterns 29 Relations 30 15
Line Chart 31 Line Chart: Sunspots 32 16
Bar Chart 33 Sorted Bar Chart 34 17
Bar Chart Labeling 35 Bar Chart: Scale 36 18
Bar Chart : Variant 37 Stacked Bar Chart 38 19
Stacked Bar Chart 39 Stacked Chart 40 20
Pie Chart 41 Pie Chart 42 21
Pie Charts of Van Gogh Paintings 43 Histogram 44 22
Contour Map 45 Scatter Plot Matrix 46 23
Reference Lines in Scatter Plot 47 Heatmap 48 24
Box Plot 49 Box Plot Variations 50 25
Multi-View 51 Data Mining Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive repositories or data streams J. Han and M Kamber, Data Mining: Concepts and Techniques Data Mining Model Validation Knowledge 52 26
Data Mining Tasks Description: Data Algorithm Features Prediction 1 Model Training Data Trained Model 2 New Data Trained Model Features 53 Data Mining Tasks Descriptive Tasks: Concept Description Association Mining Clustering Outlier Analysis Predictive Tasks: Classification Evolution Analysis 54 27
Data Mining Methods Statistical Method: Regression, Parameter Estimation Machine Learning: Decision tree, Neural Networks Statistical Learning: Probability model, Basian networks Algorithmic Method: K- mean, Graph operations 55 Data Mining New Applications Text Mining summarizes, navigates, and clusters documents contained in a database. Web Mining integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer 56 28
Data Visualization Pipeline 57 Looping Model 58 29
Visual Data Mining and Visual Analytics Users are involved in the data mining process through visualization and user interactions. Certain tasks are difficult to be automated Validation of clustering results Checking data focal points and noise Expert knowledge input, etc. 59 Visual Analytics Paradigm Knowledge Visualization Data Processing & Mining User Interaction Revision of Mining Engine & Decision Making 60 30
Visual Analytics By Daniel Keim 61 Visual Data Mining Examples: Visualizing data correlations 62 31
Visual Data Mining Example: Biomarker detection N 1 Z f ( D1, P i ) N i 1 1 7 Z f ( D1, P ) 1 7 i Z f ( D2, P i ) 7 i 1 7 i 1 1 Z f ( D1, Pi ), i 1, 2, 7 3 i 1 Z f( Pi, Dj ), i 1, 2, 7 3 i 1 7 Z f ( P i, D j ), 7 i 1 1 Z f ( Pi, Dj ), i 1, 2, 3, 5, 7 5 i 63 Visual Data Mining Example: Facial feature detection for medical diagnosis 64 32
Visual Data Mining Example: Concept detection in text data 65 Visual Data Mining Example: Visualizing decision trees 66 33
67 34