Introduction to IBM Watson Analytics Data Loading and Data Quality




Introduction to IBM Watson Analytics Data Loading and Data Quality December 16, 2014 Document version 2.0

This document applies to IBM Watson Analytics. Licensed Materials - Property of IBM. Copyright IBM Corporation 2014. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
IBM Watson Analytics needs your data to start helping you drive insights!
Data loading and file characteristics
Loading data files
Data file sizes and types
Data file structure
Microsoft Excel file restrictions
CSV file restrictions
Data quality
Data quality improvements
Review the Data Quality Report
Add to the breadth and depth of the data
Use your domain knowledge to determine if the results are making sense
Viewing and changing the properties of a field in Predict
Changing the role of a field in Predict
Changing the measurement level of a field in Predict
Browsers currently supported in Watson Analytics

IBM Watson Analytics needs your data to start helping you drive insights!

When you load data into IBM Watson Analytics, the service automatically starts analyzing and interrogating it for interestingness and quality. It then determines what you might want to analyze. This document will help you to understand the ins and outs of using your data with Watson Analytics.

Data loading and file characteristics

You can load comma-separated values (.csv) and Microsoft Excel spreadsheet (.xls, .xlsx) files into IBM Watson Analytics. Currently, Watson Analytics supports structured data files. Additionally, your data files must meet size and structure requirements.

Loading data files

You must load a data set to IBM Watson Analytics before you can analyze your data.

1. Log in to IBM Watson Analytics.
2. On the Welcome page, click Add.
3. In the Add your data area, add your data set. You can add .csv and Microsoft Excel spreadsheet files.

Important: If your data is filtered in a Microsoft Excel spreadsheet, the data is only hidden in the spreadsheet and the full original data set is imported into Watson Analytics. You can use the filtering options available in the Explore and Assemble capabilities to filter your data set. Even if you filter the data in an exploration or view, the full data set is still available if you create a new exploration or view.

After your file loads, it appears on the Welcome page as a data set. Choose a data set to create a prediction or exploration based on it.

Data file sizes and types

Each individual file that you upload must be smaller than 50 MB, with a maximum of 100,000 rows and 50 columns. If you upload a file that exceeds these limits, you receive an error message and the file is not loaded. The overall capacity for all data sets and other assets in your account is 500 MB.

Supported file types are:
- Microsoft Excel 97-2003 spreadsheet files (.xls).
- Microsoft Excel 2007 and later spreadsheet files (.xlsx).
- Comma-separated values files (.csv).
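
The size and type limits above can be checked locally before an upload. The following is only a minimal sketch, not part of Watson Analytics: the function name `check_upload_limits` is hypothetical, "50 MB" is assumed to mean 50 * 1024 * 1024 bytes, the row limit is assumed to count data rows without the header, and rows and columns are counted only for comma-delimited .csv files.

```python
import csv
import os

# Limits quoted in the text above.
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 * 1024 * 1024 bytes
MAX_ROWS = 100_000            # assumption: data rows, header row not counted
MAX_COLUMNS = 50
SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def check_upload_limits(path):
    """Return a list of problems; an empty list means no limit was exceeded."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is not smaller than 50 MB")
    if extension == ".csv":
        # Stream the file so that large files are not held in memory.
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)  # assumption: comma-delimited
            header = next(reader, [])
            data_rows = sum(1 for _ in reader)
        if len(header) > MAX_COLUMNS:
            problems.append(f"too many columns: {len(header)}")
        if data_rows > MAX_ROWS:
            problems.append(f"too many rows: {data_rows}")
    return problems
```

Calling `check_upload_limits("sales.csv")` on a hypothetical file returns an empty list when none of the quoted limits is exceeded. Excel files are checked only for extension and size here, because reading them requires a third-party library.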
Data file structure

Data files must meet the following structural requirements:

- Headers: Because IBM Watson Analytics relies on natural language and matches elements from the question you ask to elements in the data, files with descriptive column headers are preferred. Watson Analytics assumes that the first row of your file contains headers.
- List files: List files work best. List files are tabular data, with columns and rows. In Watson Analytics, we refer to columns as fields and to rows as records. The first row is a header. Watson Analytics does not currently work with nested headings or row headings.

The following example of a list file works well in Watson Analytics:

(Figure: example of a list file.)

The following example of a nested file does not work in Watson Analytics because it contains row headings and nested headings:

(Figure: example of a nested file.)

Additionally, data files must meet the following characteristics:
- You cannot have empty columns inserted before the data.
- You must have a header for every column. The number of columns in the header row is assumed by Watson Analytics to be the number of columns of data. For example, if the first six columns have headers but there are eight columns of data, the last two columns of data are ignored.
- You can have empty rows above the data. Empty rows preceding the data are ignored.
- You cannot have textual rows above the header row. For example, if you have a title or description of what the data is about above the header row, the file is not read appropriately.
- You cannot have textual rows following the data. For example, a row following the data that says "This information came from" is considered to be part of the data.

Microsoft Excel file restrictions

Specific conditions apply to Microsoft Excel files:
- .xlsx files saved in OpenOffice are currently not supported.
- Password-protected Microsoft Excel files are not supported.

- Only the first sheet in a Microsoft Excel file is imported, and remaining sheets are ignored.

CSV file restrictions

Specific conditions apply to comma-separated values (.csv) files:
- The file extension must be .csv.
- Delimiter symbols must separate the fields. Comma, tab, semicolon, and pipe (|) are supported.
- Quote characters escape literal values. Single quotes and double quotes are supported.
- Record separators separate rows. Newline (\n), carriage return (\r), and carriage return followed by newline (\r\n) are supported.

Data quality

When a data set is loaded, Watson Analytics creates a data quality report, which includes an overall average data quality score. The data quality score indicates how ready the data is for analysis and does not necessarily indicate whether Watson Analytics will provide good predictive or explorative results. In other words, a low data quality score indicates only that your data is less ready for analysis; Watson Analytics might still provide useful insights and answers about your data. The most problematic fields that cause the average data quality score to be low are usually excluded from analysis. Additionally, some data preparation steps are taken when Watson Analytics creates a prediction.

Watson Analytics computes the data quality score based on the original data, before any cleansing or transformation has occurred. The score is an average of the data quality score for every field in the data set, as determined by missing values, constant values, imbalance, influential categories, outliers, and skewness. Skewness is a measure of the asymmetry of a distribution. Symmetry describes how values are distributed on either side of the central value.

You can improve the data quality score by preparing your data before you load it and the score is calculated. Before loading your data set, clean your data as much as possible, in the following ways:
- Eliminate blank rows.
- Exclude summary rows and columns.
- Avoid column headings and row headings in the same cell.
- Avoid lookup tables.
- Avoid subtotals and aggregations.

You can see the score associated with each data set in the list of assets on the Welcome page. In the following example, 68 is the score assigned to the IBM Sales Sample data set and represents the data's readiness for analysis. A score of 68 indicates a data set of medium quality.

(Figure: IBM Sales Sample data set with a data quality score of 68.)

The lower the score, the more outliers, missing values, and other issues are associated with some of the fields in the data set. It is worth mentioning again that a poor data quality score indicates only how ready the data is for analysis, not the quality of the answers you will get for your queries.

You can access the Data Quality Report in the menu on the Main Insight page in the Predict capability.

Data quality improvements

In general, there are three things that you can do to improve the quality of your data:

- Review the Data Quality Report.
- Add to the breadth and depth of the data.
- Use your domain knowledge to determine if the results are making sense.

Review the Data Quality Report

In the Predict capability, review the Data Quality Report. It highlights areas where the source data needs to be cleaned. You can access the Data Quality Report from the menu in Predict.

For example, while looking at the Analysis Details of your prediction, you may see that some input fields are omitted. Use the Data Quality Report to determine why they were removed and, perhaps more importantly, whether you should include them. Watson Analytics might exclude a field from use for various reasons. Use your domain knowledge to determine whether an excluded field should be included.

- Too many categories in the field: If a field contains 50 or more categories, Watson Analytics ignores it and does not include it in subsequent analyses, even if you set the field role to Input.
- Constant or near-constant fields: If a single value accounts for more than 95% of the valid values in a field, Watson Analytics sets its field role to None. However, if you set the field role to Input or Target, Watson Analytics uses it in subsequent analyses. For example, let's say that you have a Churn field that is extremely unbalanced in that only 4% of the people in the data have churned. In this case, Watson Analytics excludes Churn from analysis. However, you know it is an important target field, so you set its role to Target.
- Missing values: Watson Analytics ignores a field when more than 25% of its values are missing. However, it uses the field if you set its role to Input or Target. Currently, Watson Analytics does not impute missing values for such a field, so records with missing values for the field are excluded in subsequent analyses. Alternatively, you can change the default threshold from 25% to another value in the dropdown box in the Data Quality Report. Watson Analytics then imputes missing values for fields whose share of missing values is below the threshold and uses the imputed values in subsequent analyses. For example, let's say that you have an Age field with 30% of the values missing. By default, Watson Analytics excludes it because more than 25% of the values are missing. However, you know that the Age field is an interesting input field that might explain the new program preference in a viewer survey. So, you might decide to include it to see how it affects the predictive results.

Add to the breadth and depth of the data

Adding more rows and columns to the data will often improve the quality of the data. The more data that IBM Watson Analytics has available to choose from, the more accurate its predictive and explorative results will be. Make sure that you follow the appropriate data structures and cleansing procedures before you add the new data.

Use your domain knowledge to determine if the results are making sense

You will always need to bring your domain knowledge to the analysis part of your prediction or exploration. IBM Watson Analytics provides you with recommended analytical starting points and predictive models based on the data you provide. However, you must determine what to do with the analysis and recommendations in order to create an appropriate response.

For example, let's say you are an HR professional trying to analyze employee attrition. In this case, Watson Analytics may initially determine that whether an employee had an exit interview is a near-perfect predictor of whether they have left the company. However, with your domain knowledge, you know that exit interviews are not a useful predictor of future attrition. In this situation, you could change the role of the Exit Interview field from Input to None and exclude it completely from the analysis.

Similarly, while Watson Analytics does its best to determine what questions you want to answer with your data, there is no substitute for your own expertise. For example, if you are examining payments received from customer accounts, Watson Analytics may initially determine that you want to predict the amount on the invoice. However, you actually want to predict whether a customer will pay the invoice by the due date. You can change the targets identified by Watson Analytics in order to influence how it interprets the data.

Viewing and changing the properties of a field in Predict

In the Predict capability, you can view and change the properties of a field in the Field Properties area to specify its role in a prediction. Additionally, you can specify the measurement level for a field. You can also view the interestingness of a field. You can access the Field Properties area from the menu in Predict.
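
To make the default exclusions described under "Review the Data Quality Report" more concrete, the following is a rough sketch of how a field might end up with a default role of None. It is not the Watson Analytics implementation. The function name `default_role`, the treatment of `None` and empty strings as missing, the order of the checks, and the rule that a field counts as categorical when it contains any text value are all assumptions made for illustration. Only the three thresholds (more than 25% missing, more than 95% a single value, 50 or more categories) come from the text.

```python
from collections import Counter


def default_role(values, missing_threshold=0.25):
    """Return (role, reason) for one field, using the thresholds quoted above."""
    # Assumption: None and empty strings represent missing values.
    valid = [v for v in values if v is not None and v != ""]
    if not valid:
        return ("None", "no valid values")

    # Missing values: excluded when the missing share exceeds the threshold.
    missing_share = (len(values) - len(valid)) / len(values)
    if missing_share > missing_threshold:
        return ("None", "too many missing values")

    # Constant or near-constant: one value is more than 95% of valid values.
    counts = Counter(valid)
    top_share = counts.most_common(1)[0][1] / len(valid)
    if top_share > 0.95:
        return ("None", "constant or near-constant")

    # Too many categories: 50 or more distinct values in a categorical field.
    # Assumption: a field is categorical if any valid value is text.
    is_categorical = any(isinstance(v, str) for v in valid)
    if is_categorical and len(counts) >= 50:
        return ("None", "too many categories")

    return ("Input", "no default exclusion applies")
```

With these assumptions, an Age field with 30% missing values gets a role of None at the default threshold and a role of Input when `missing_threshold` is raised to 0.5, which mirrors the Age example earlier in this document.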

Changing the role of a field in Predict

You can change the role of one or more input fields in the Predict capability by selecting a new role for the data.

A field can have one of several roles:
- Input: Most fields are input fields. Input fields are fields whose values might influence another field. For example, if you were conducting a study to analyze the effect of salary on overall happiness, salary is an input field.
- Target: Although input fields are the most common, target fields are the most important. Target fields are the fields whose outcome you are interested in predicting. Target fields are influenced by input fields. You cannot have more than five targets in a workbook.
- Record ID: Record ID fields are not used in the analysis. These fields are used for labeling but do not provide any analytical substance.
- None: Fields with a role of None are not used in a prediction. These fields might have too much missing data or might be fields that you choose to exclude because they contain standard data, such as counts that are identical through the length of the column. A field might have a role of None because it was excluded automatically by Watson Analytics. Alternatively, you might decide to exclude a field because your domain knowledge tells you that it is not important for the analysis that you want to perform.

Changing the measurement level of a field in Predict

You can change the measurement level of a field in the Predict capability to improve the accuracy of your prediction.

A field can have one of several measurement levels:
- Nominal: A nominal field is a field with a limited number of distinct values that have no inherent order or ranking. Examples of nominal fields include department, region, postal code, and religious affiliation.
- Ordinal: An ordinal field is a field with a limited number of distinct values that have an inherent order or ranking. Examples of ordinal fields include attitude scores that represent the degree of satisfaction or confidence, and preference rating scores. Like continuous fields, ordinal fields can be measured numerically. However, unlike continuous fields, distance comparisons between values are not appropriate.
- Continuous: A continuous field is measured numerically so that distance comparisons between values are appropriate. Examples of continuous fields include age in years and income in thousands of dollars.

Browsers currently supported in Watson Analytics

The following table lists the browsers, browser operating systems, and browser versions that are currently supported by IBM Watson Analytics.

Browser            Browser OS   Versions
Chrome *           Windows      37 and later
Mozilla Firefox    Windows      31 and later, 31 ESR
Internet Explorer  Windows      11
Safari             MacOS        6 and 7

* Chrome is the recommended browser for use with IBM Watson Analytics.