TC 2014 Hands on Presentation When Universes Collide: Table Joins in Tableau Guided PDF


Introduction - Anatomy of a Table

Concepts and vocabulary used in this class

Table - A set of values organized using a model of vertical columns (identified by unique header names) and horizontal rows.

Field - Any unique column that can contain either qualitative data (dimensions such as dates, descriptions, or names) or quantitative data (measures such as sum, count, average).

Row - A line of cells containing data corresponding to each of the fields in the table. Data may be aggregated at any level, which is important to keep in mind when considering joining tables.

Database - A data structure that stores organized information. Most contain multiple tables. Typical examples of database schemas are the Star Schema and the Snowflake Schema, both of which contain a single central Fact Table connected to one or more Dimension Tables.

Fact Table - Contains the measures, metrics, or values.

Dimension Table - Contains the descriptive attributes that allow for things like labeling, aggregating, grouping, or filtering of the measurements.

What is a Join?

Joins combine related data from different tables

When working with a relational database such as Microsoft SQL Server, or a flat file such as Excel, our data are often stored in different tables. We can combine data from multiple tables in Tableau using Joins. For example, imagine that we have two tables. The first contains Patient Names and Patient IDs. The second contains each patient's prescription information as well as Patient IDs. We can use the Patient ID field to link the two tables into one table. This allows us to see the patient names alongside their respective prescription information.

Join Requirements

To create a Join it is necessary to have a common field between the tables (a Primary Key and a Foreign Key). In this case the common field is Patient ID.

To create Joins in Tableau, our tables must all be present in a single database. When using Excel, the workbook represents the database and each sheet within the workbook is a different table.

Types of Joins

How tables are combined

The types of Joins that you are able to make within Tableau depend on the capabilities of your database. For example, Excel files are only capable of Left Joins, Right Joins, and Inner Joins, while Microsoft SQL Server data can also use a Full Outer Join.

Join types are best understood via a Venn diagram. The shaded areas represent the data that is kept from each table for each Join type. For example, in a Left Join all of the data is kept from the Primary table A, and only the data that matches table A is joined from the Secondary table B. If there are Foreign Keys in table B that are not present as Primary Keys in table A, that information will not appear in the output table.

The images below show examples of Left Outer, Right Outer, and Inner Joins between our Patient Data Tables.
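As a rough illustration of the Join types described above, the following sketch builds the two patient tables in an in-memory SQLite database and compares an Inner Join with a Left Join. The table names, column names, and sample rows are assumptions made for this example; they are not the workshop data, and SQLite stands in here for whatever database Tableau is connected to.

```python
import sqlite3

# Hypothetical sample data: patient 3 has no prescription,
# and one prescription refers to a patient_id (4) with no patient row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, patient_name TEXT);
    CREATE TABLE prescriptions (patient_id INTEGER, drug TEXT);
    INSERT INTO patients VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO prescriptions VALUES
        (1, 'Drug A'), (1, 'Drug B'), (2, 'Drug C'), (4, 'Drug D');
""")

# Inner Join: only rows whose patient_id appears in both tables.
inner_rows = conn.execute("""
    SELECT p.patient_name, r.drug
    FROM patients AS p
    INNER JOIN prescriptions AS r ON p.patient_id = r.patient_id
    ORDER BY p.patient_id, r.drug
""").fetchall()

# Left Join: every patient is kept; missing prescriptions become NULL (None).
left_rows = conn.execute("""
    SELECT p.patient_name, r.drug
    FROM patients AS p
    LEFT JOIN prescriptions AS r ON p.patient_id = r.patient_id
    ORDER BY p.patient_id, r.drug
""").fetchall()

print(inner_rows)
print(left_rows)
```

In this sketch the Left Join returns one more row than the Inner Join: the patient without a prescription, with a null in the prescription column. The prescription whose Patient ID has no match in the patients table is absent from both results.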

Other Joins

There are two other techniques for combining data in Tableau: Unions and Data Blending.

Unions

Unions append data to a dataset, like stacking one table on top of another. If we were to create a Union of Tables 1 and 2 below, we would get the Unions Results Table containing all of the information from both original tables.

Unions Table 1: Bicycle Rides in June

Date        Trip Name                    Distance (miles)   Duration (hh:mm:ss)
6/07/2014   Lake Union Loop              6.00               00:27:08
6/14/2014   Salmon Bay Loop              11.04              01:06:59
6/21/2014   Lake Washington Loop         54.00              07:11:36

Unions Table 2: Bicycle Rides in July

Date        Trip Name                    Distance (miles)   Duration (hh:mm:ss)
7/05/2014   To and from Magnuson Park    8.63               00:47:02
7/12/2014   Loop Lake Washington Loop    54.00              07:05:47
7/26/2014   Over I-90 and back           11.30              01:13:06

Unions Results Table: My Entire Bicycle Ride History

Date        Trip Name                    Distance (miles)   Duration (hh:mm:ss)
6/07/2014   Lake Union Loop              6.00               00:27:08
6/14/2014   Salmon Bay Loop              11.04              01:06:59
6/21/2014   Lake Washington Loop         54.00              07:11:36
7/05/2014   To and from Magnuson Park    8.63               00:47:02
7/12/2014   Loop Lake Washington Loop    54.00              07:05:47
7/26/2014   Over I-90 and back           11.30              01:13:06

For a Union to be possible, both tables must have the same number of columns with the same column names. The results table is lengthened rather than widened, because rows of data are added. A Union can be achieved by creating an extract of a table and using the "add data from a file" option. A Union can also be done using the custom SQL dialog by writing a Union query between two tables in the same database.

Data Blending

Data Blending is similar to a Join in that it combines information from two tables based on a Primary and Foreign Key. The main difference is that the combination is done in the view at a designated level of aggregation. The two tables are queried separately, and the data from each are linked and aggregated in the view. A Join is done at the row level and results in the formation of a new single table. A data blend leaves both tables separate and results in a view of combined information.
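The Union of the two bicycle ride tables can be sketched as a Union query of the kind one might write in the custom SQL dialog. The example below uses an in-memory SQLite database; the table and column names (rides_june, rides_july, ride_date, and so on) are assumptions for illustration, and the dates are stored in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides_june (ride_date TEXT, trip_name TEXT,
                             distance_miles REAL, duration TEXT);
    CREATE TABLE rides_july (ride_date TEXT, trip_name TEXT,
                             distance_miles REAL, duration TEXT);
    INSERT INTO rides_june VALUES
        ('2014-06-07', 'Lake Union Loop',       6.00, '00:27:08'),
        ('2014-06-14', 'Salmon Bay Loop',      11.04, '01:06:59'),
        ('2014-06-21', 'Lake Washington Loop', 54.00, '07:11:36');
    INSERT INTO rides_july VALUES
        ('2014-07-05', 'To and from Magnuson Park',  8.63, '00:47:02'),
        ('2014-07-12', 'Loop Lake Washington Loop', 54.00, '07:05:47'),
        ('2014-07-26', 'Over I-90 and back',        11.30, '01:13:06');
""")

# UNION ALL stacks the rows of both tables; the column lists must line up.
all_rides = conn.execute("""
    SELECT ride_date, trip_name, distance_miles, duration FROM rides_june
    UNION ALL
    SELECT ride_date, trip_name, distance_miles, duration FROM rides_july
    ORDER BY ride_date
""").fetchall()

for ride in all_rides:
    print(ride)
```

UNION ALL is used rather than UNION so that identical rows, if any existed, would be kept instead of being collapsed into one. The result has the same four columns as each input and six rows, which is the "lengthened rather than widened" effect described above.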

Cardinality

The relationships between tables

There are three types of cardinality: One-to-One, One-to-Many, and Many-to-Many. The cardinality of a Join can have a major effect on the resulting table. If not dealt with properly, certain types of cardinality can produce incorrect metrics, for example through duplicated records, which may skew analysis.

Cardinality's Effects

Duplication of Records

Joining tables that have One-to-Many or Many-to-Many relationships can produce resulting tables with duplicate values.
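To make the duplication effect concrete, here is a minimal sketch using two hypothetical tables (orders and order_items, not part of the workshop files) with a One-to-Many relationship. Each order's shipping fee is repeated once per matching item row after the Join, so summing it over the joined table overstates the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipping_fee REAL);
    CREATE TABLE order_items (order_id INTEGER, item TEXT);
    INSERT INTO orders VALUES (1, 5.0), (2, 7.0);
    INSERT INTO order_items VALUES
        (1, 'Paper'), (1, 'Pens'), (1, 'Binders'), (2, 'Paper');
""")

# Total taken from the orders table alone: one row per order.
true_total = conn.execute(
    "SELECT SUM(shipping_fee) FROM orders"
).fetchone()[0]

# Total taken after the Join: order 1 now appears three times.
joined_total = conn.execute("""
    SELECT SUM(o.shipping_fee)
    FROM orders AS o
    INNER JOIN order_items AS i ON o.order_id = i.order_id
""").fetchone()[0]

print(true_total, joined_total)  # prints 12.0 22.0
```

The joined total is larger only because of row duplication; no underlying value changed. This is the pattern to watch for whenever a measure lives on the "one" side of the relationship.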

Database Schemas

How tables are organized in a database

Primary and Foreign Keys

These identifying fields are used to establish and enforce a link between tables. Primary Keys are fields that uniquely identify each record (row) in a table. A Foreign Key is a field that refers to the Primary Key of another table and so identifies a row in that table.

Star Schema

Database model in which a Fact Table is connected directly to all of the Dimension Tables via Primary/Foreign Keys.
Wikipedia: http://en.wikipedia.org/wiki/star_schema

Snowflake Schema

Database model in which a Fact Table is connected to Dimension Tables and those Dimension Tables are connected to other Dimension Tables via Primary/Foreign Keys.
Wikipedia: http://en.wikipedia.org/wiki/snowflake_schema

Guided PDF Scenarios

For the remainder of the session, we will work through four example scenarios containing exercises in creating Table Joins. Bonus scenarios are also available for those who complete scenarios 1-4 and would like additional challenge questions.

NOTE: For all of the following exercises we will be using the legacy connector for Excel files. See below for instructions on how to do so.

1. Open Tableau Desktop and select Connect to data.

2. Select Microsoft Excel.

3. Click once on the One-to-Many file to highlight it. Then choose the drop-down arrow to the right of the Open option. Choose the second of the two options, Open with Legacy Connection.
   a. We will use the Legacy Connection because the new connection for Excel does not support all of the Join types that we will be using today.

Scenario 01: One-to-Many

Our database stores Customer Sales and Shipping Location in separate tables. We would like to view a map showing where customers are located and a bar chart of sales by country.

Notes:
This is a One-to-One relationship.
All of the shipping data for China was lost in a database failure. However, the Sales Table was unaffected, so all of the Customer Sales data for China are still available.

Considerations before you start:
What is the common field across tables?
Which table should I use as my primary?
How will the Join type affect my results? (Think about the missing data in China.)

Scenario 01 Step by Step:

1. Use the Excel Legacy Connector to open the One-to-Many.xlsx file. All of the tables will now show up in the connection screen.

2. First, drag in the Customer Sales Fact Table and then drag in the customer location Dimension Table.

3. Tableau will automatically recognize that the linking field between the two tables is Customer Name. The Join type will default to an Inner Join.

4. From the scenario notes we know that there are no shipping data for China in the customer location table. This is important to consider because if we leave the Join as an Inner Join we will lose the Customer Sales data for anyone who bought from China. This is because an Inner Join only combines the data that is present in both tables. Choose a Left Join to preserve the Customer Sales data for China.

5. Notice that when we choose a Left Join, null values appear in some of the rows in the data view. These represent the rows that contain Customer Sales data but no locational data. Because there was no linking field from the location table for customers in China, a null is put in as a placeholder for those rows.

6. Go to the worksheet and build a map of all of the cities and a bar chart of sales by country.
   Learn how to build a map: http://onlinehelp.tableausoftware.com/current/pro/online/enus/help.htm#maps_build.html
   Learn how to build a bar chart: http://onlinehelp.tableausoftware.com/current/pro/online/enus/help.htm#buildexamples_bar.html

7. China will not contain any data points on the map because it is not present in the location data. Also notice that upon sorting the countries in descending order of Sales, the second bar is Null. This represents all of the customers from China in our data set.

8. For practice, return to the connection screen by double-clicking the data source. Change the Join type to Inner and go back to the worksheet.

Notice that the Null is now removed from the bar chart. With an Inner Join, we have removed all of the data associated with China from the Customer Sales Table.

Challenge01:
Connect to the challenge01 workbook using the Legacy Connection and Join the sales and locational data together so that no data is lost in the Join. Then count the distinct number of customers that we have data for. There should be 62 customers with at least locational data available.

Scenario 02: Many-to-Many

Similar to the last scenario, we have Customer Sales data and location data stored separately. However, when customers purchase an item from us they are given the option to write down more than one home location. Because of this, there are sometimes multiple rows of location data for one customer if they have more than one home. We would like to map out where all of our customers have homes as well as view a chart of sales by customer.

Notes:
This is a Many-to-Many relationship.
There are no location data for customer 6.

Considerations before you start:
How can we account for duplicate rows that result from the Many-to-Many relationship?
How will the Join type affect the lack of data for customer 6?

Scenario 02 Step by Step:

1. Use the Excel Legacy Connector to open the Many-to-Many.xlsx file.

2. Drag in the Customer Sales Fact Table followed by the customer location Dimension Table.

3. Once again, Tableau will automatically recognize the linking field, which is Customer ID, and default to an Inner Join.

4. There are no locational data present for customer 6, so the situation is similar to scenario 01. If we create an Inner Join we will lose all of the sales data for customer 6, but if we create a Left Join we can preserve the data, with Null values for the locational data of customer 6. Choose a Left Join.

5. Go to the worksheet and build a map of customer locations. This map should show all of the locations where customers have a home (first and second homes). This view should also display one unknown for customer 6 because we do not have their locational data.

6. Now create a bar chart of Sales by Customer. Put Customer Name on rows with the customer id (from the Customer Sales Table) next to it. This will allow you to see the null associated with customer 6. Some of these values are incorrect because of the Many-to-Many relationship between our joined tables. Customers 1 and 2 (John and Nick) are showing sales values that are much larger than what they actually purchased. This is a result of these customers listing multiple home locations in the location data.

If we drag State and City out to the view we can see where the Many-to-Many relationship is occurring. John has two homes and Nick has three homes. The sales values are repeated on each row of data for John and Nick, which means that their sales were counted once for each home that they own.

7. There are a variety of ways to get the graph to display the correct values. The easiest is to use a field that is unique for each customer and will filter out the duplicated rows of data. In this case we can use the First Home? field to filter out all of the second homes. Right-click First Home? in the dimensions and choose Show Quick Filter.

8. Uncheck the No option to remove all second homes. Notice that there is now a single row of data for each customer. The chart now displays the correct values.

Challenge02:
There is not always a field that can easily be filtered to get correct results when dealing with a Many-to-Many relationship. In these cases we must find another way to get the proper results. Use the Many-to-Many Excel workbook for this challenge. You can also work off of the bar chart view created in scenario 02. For the bar chart of customer names and sales values, create a calculated field that will result in the correct sales values without using any filters. The trick is to find a field that is creating the duplication of records. If the sales are being tripled, then there is some field that is repeating three distinct times to make this happen. To get the correct sales we will need to divide the sales by the distinct count of this field.
***This method will not work for customer 6. Why not?
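The filter approach from Scenario 02 (steps 4 through 8) can be sketched in SQL as follows. This is only an illustration under stated assumptions: the table and column names, the sales figures, the cities, and the name given to customer 6 are all made up for the example, and the WHERE clause is an approximation of unchecking No in the quick filter (it keeps the Yes rows and the Null row for customer 6).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_sales (customer_id INTEGER, customer_name TEXT, sales REAL);
    CREATE TABLE customer_location (customer_id INTEGER, city TEXT, first_home TEXT);
    -- Hypothetical values; customer 6 has no location rows.
    INSERT INTO customer_sales VALUES (1, 'John', 100.0), (2, 'Nick', 50.0), (6, 'Sam', 30.0);
    INSERT INTO customer_location VALUES
        (1, 'Seattle', 'Yes'), (1, 'Portland', 'No'),
        (2, 'Austin', 'Yes'), (2, 'Denver', 'No'), (2, 'Boise', 'No');
""")

# Left Join keeps customer 6, but sales repeat once per home.
inflated = dict(conn.execute("""
    SELECT s.customer_name, SUM(s.sales)
    FROM customer_sales AS s
    LEFT JOIN customer_location AS l ON s.customer_id = l.customer_id
    GROUP BY s.customer_name
""").fetchall())

# Keep first homes only, plus customers with no location row at all.
corrected = dict(conn.execute("""
    SELECT s.customer_name, SUM(s.sales)
    FROM customer_sales AS s
    LEFT JOIN customer_location AS l ON s.customer_id = l.customer_id
    WHERE l.first_home = 'Yes' OR l.customer_id IS NULL
    GROUP BY s.customer_name
""").fetchall())

print(inflated)
print(corrected)
```

With these sample rows, John's sales are doubled and Nick's are tripled before filtering, while customer 6 is unaffected because the Left Join produces exactly one row for them. After filtering, each customer contributes a single row and the sums match the sales table.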

Scenario 03: multiple-tables

We have a simple snowflake schema with one Fact Table and three Dimension Tables. The goal of this exercise is to combine all of the tables and visualize the percentage of total sales that each manager is responsible for, as well as view sales over time.

Notes:
There is a Many-to-Many relationship.
One of the Foreign Keys has slightly different spelling than the Primary Key. In this case Tableau will not automatically recognize the common field.

Considerations before you start:
Which table contains the Many-to-Many relationship and why?
How will the Many-to-Many relationship affect the end results?
Find the Foreign Key that has different spelling before you start so that you know what to expect.

Scenario 03 Step by Step:

1. Use the Excel Legacy Connector to open the multiple-tables.xlsx file.

2. It may help to sketch the linking fields that combine each table. Remember that the Fact Table is in the center and the Dimension Tables surround it. This means that each Dimension Table should have a field that matches a field in the Fact Table. However, this is a snowflake schema, so some tables may instead have linking fields to other Dimension Tables.

3. Drag in the Customer Sales Fact Table followed by the customer location and products tables. Tableau will automatically find the linking fields.

4. Drag in the managers table and Tableau will immediately bring up the Join type and Join Clause dialog box. This is because it cannot find the linking field between the managers table and the tables already present. We want to match the region field from the customer location table to the managers table. From the data source list, find the field Regions under the customer location table, and from the managers table find the field Region. Because the region field is spelled differently in each table, Tableau cannot guess the linking field.

5. Change all of the Join types to Left to avoid losing any Customer Sales data, just as we did in previous exercises.

6. Go to the worksheet and create a bar chart of sales by manager.

7. Right-click the Sales pill and create a quick table calculation, Percent of Total. Why are there 5 managers for 4 regions, and why does Pat have 50% of the total sales?

8. Drag Region out to the view after Manager and notice that each manager is assigned to one region except for Pat, who is assigned to all of the regions. This is because Pat is a VP and is responsible for all regions. Drag Region before Manager in the Columns Shelf to see that each sales bar is duplicated because of this Many-to-Many relationship. This will double all of our sales metrics throughout the analysis. Because of this we need to filter out Pat before we look at other metrics. Right-click on the Manager field and create a quick filter. Uncheck Pat's name.

Now we can create a graph of sales over time using the same filtering trick to get the proper results.

9. Create a graph of sales over time and apply the same manager filter to the worksheet, unchecking Pat's name.

Challenge 03:
The best way to avoid a Many-to-Many relationship is to fix the problem straight from the source. This means restructuring the tables in a way that does not create a Many-to-Many relationship. In this particular scenario the managers table is the source of the problem. Rather than using filtering or calculation tricks, it would make more sense to change the way that the managers table stores the data to fix the problem from the start. How can we restructure the managers table to avoid the Many-to-Many relationship?
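The explicit Join Clause from Scenario 03 step 4 and the filter from step 8 can be sketched in SQL as below. The field names Regions and Region and the manager Pat come from the scenario; everything else (table layouts, the other manager names, the two regions, the sales values) is a simplified assumption for illustration and does not reproduce the workshop workbook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_sales (customer_name TEXT, sales REAL);
    CREATE TABLE customer_location (customer_name TEXT, Regions TEXT);
    CREATE TABLE managers (Region TEXT, manager TEXT);
    -- Hypothetical rows: Pat is listed against every region.
    INSERT INTO customer_sales VALUES ('A', 100.0), ('B', 300.0);
    INSERT INTO customer_location VALUES ('A', 'East'), ('B', 'West');
    INSERT INTO managers VALUES
        ('East', 'Lee'), ('West', 'Kim'), ('East', 'Pat'), ('West', 'Pat');
""")

# The Join Clause names both columns because they are spelled differently.
join_sql = """
    FROM customer_sales AS s
    LEFT JOIN customer_location AS l ON s.customer_name = l.customer_name
    LEFT JOIN managers AS m ON l.Regions = m.Region
"""

by_manager = dict(conn.execute(
    "SELECT m.manager, SUM(s.sales)" + join_sql + "GROUP BY m.manager"
).fetchall())
joined_total = sum(by_manager.values())
pat_share = by_manager["Pat"] / joined_total

# Filtering out Pat (while keeping any unmatched rows) restores the true total.
filtered_total = conn.execute(
    "SELECT SUM(s.sales)" + join_sql
    + "WHERE m.manager <> 'Pat' OR m.manager IS NULL"
).fetchone()[0]

print(by_manager, pat_share, filtered_total)
```

In this sketch every sales row matches two manager rows (the regional manager and Pat), so the joined total is twice the real total and Pat's share is one half. Excluding Pat leaves one manager row per sale and the total returns to the sum of the sales table.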

Scenario 04: Self Join

We want to better understand the purchase behavior of our customers. Specifically, we want to know what products individual customers purchase together. These types of analyses allow us to create effective packaged deals and make more efficient product placement in stores. In other words, let's find out which customers purchased both paper and binders, or paper, pens, and art supplies. Create a heat map that shows each combination of product sub-categories and how many customers purchased that combination. This is called a market basket analysis.

Notes:
We will need to create a Self Join to accomplish this.
It is important to think carefully about the type of Join (Inner/Outer) and the Join Clause.

Considerations before you start:
We want to join each customer name to itself to get the level of detail desired.
We are comparing each product sub-category to all the other product sub-categories.
How do we write a Join Clause that joins each product to all the other products but not to itself?

Scenario 04 Step by Step:

1. Use the Excel Legacy Connector to open the Self-Join.xlsx file.

2. The reason for creating the Self Join in this scenario is to examine which sub-categories of products are purchased together. To do so we must compare each product sub-category for a given customer to every other product sub-category that they purchased. In order to compare the product sub-category field to itself, we need to Join the Customer Orders table to itself. Drag customer orders into the view. Drag customer orders out to the view a second time.

3. Tableau will display an exclamation mark because you are joining the table to itself.

4. First, we must set the level of detail for our Self Join. We wish to examine the relationships between product sub-categories for each customer. This means that the analysis will be done at the level of the Customer Name field. Set the customer name from the left table equal to the customer name in the right table. If we wanted to see what products customers purchased together in individual orders, we would set the order ids equal to one another instead.

5. Now we must consider how to Join the product sub-category of each customer's purchase to all of the other sub-categories. This will give us a new record for each combination of sub-categories. Create a second Join Clause with product sub-category on the left and right tables. Set these keys not equal to each other using the <> symbol.

6. Go to the worksheet and notice that all of the dimensions are replicated across the two tables.

7. Drag product sub-category from the first orders table to the rows and product sub-category from the second table to the columns. It does not matter which of the two goes on which shelf. The view now shows a matrix with each cell representing a combination of product sub-categories.

8. Dragging Number of Records onto Color will now turn the matrix into a heat map, with darker colors representing the most common combinations of product sub-categories.

9. Why are there blank cells running diagonally across the matrix?

Challenge 04:
Every combination of product sub-categories is repeated twice in the matrix we created for the market basket analysis. The diagonal of blank cells splits this duplication in half. How can we remove the top or bottom half of this matrix so as to remove the duplications? This can be accomplished using Joins or calculated fields.

Challenge 05:
The techniques used in market basket analysis are also applicable to other problems. If we have a data set with each row containing a trip id and the start and stop day of that trip, it would be very difficult to calculate which trips are overlapping. However, it is possible to use a Self Join to find out which trips overlap. Use the challenge05 Excel doc to get a list of all the trips that overlap. For a further challenge, create a Gantt chart that shows each trip with a bar spanning the time that the trip overlaps. Then color each trip's bar if it overlaps with another trip.
**This is a very difficult problem to solve; do not expect to get this immediately.

Challenge 06:
We would like to create the visualization seen here: http://kb.tableausoftware.com/articles/knowledgebase/using-path-shelf-pattern-analysis
The above kb article shows how to use path data on maps with preformatted data. However, this is often not the way this type of data is stored, so it requires some reshaping to get the results. Using custom SQL in the connection dialog, reshape the data so that we can create a map with distribution paths in Tableau. This challenge will require some knowledge of SQL.

Challenge 07:
Parameterize challenge 06 so that we can choose the state at the center of the paths.
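For readers who prefer to see the Self Join from Scenario 04 as a query, the two Join Clauses from steps 4 and 5 can be sketched as below. The table name, column names, and sample rows are assumptions for illustration (one row per customer and sub-category), and COUNT(*) stands in for Number of Records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_orders (customer_name TEXT, sub_category TEXT);
    -- Hypothetical purchases, one row per customer and sub-category.
    INSERT INTO customer_orders VALUES
        ('Ana', 'Paper'), ('Ana', 'Binders'),
        ('Ben', 'Paper'), ('Ben', 'Binders'), ('Ben', 'Pens'),
        ('Cy',  'Paper');
""")

# Join the table to itself: same customer, different sub-category.
rows = conn.execute("""
    SELECT a.sub_category, b.sub_category, COUNT(*) AS number_of_records
    FROM customer_orders AS a
    INNER JOIN customer_orders AS b
        ON a.customer_name = b.customer_name
       AND a.sub_category <> b.sub_category
    GROUP BY a.sub_category, b.sub_category
    ORDER BY a.sub_category, b.sub_category
""").fetchall()

# Map (row sub-category, column sub-category) -> count, like the heat map cells.
pairs = {(left, right): count for left, right, count in rows}

for pair, count in pairs.items():
    print(pair, count)
```

Each key in pairs corresponds to one cell of the matrix built in step 7, and the count is the value that drives the color in step 8. A customer who bought only one sub-category, like the third customer here, contributes no rows at all under an Inner Join.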