Python for Data Analysis and Visualization. Fang (Cherry) Liu, Ph.D. fang.liu@oit.gatech.edu PACE Gatech July 2013





Outline
System requirements and IPython
Why use Python for data analysis and visualization
Data set: US baby names 1880-2012
Data loading
Data processing using lists
Data aggregation and grouping
Plotting and visualization

System Setup
Option 1 (preferred): Download and install the Enthought Canopy product: https://www.enthought.com/products/canopy/academic/ Enthought Canopy is free for academic users. This will install a full Python distribution onto your computer.
Option 2: Download and install Python(x,y) (Windows only): https://code.google.com/p/pythonxy/wiki/downloads This will install a full Python distribution onto your computer.
Note 1: Options 1 and 2 are mutually exclusive. Please do not install both Canopy and Python(x,y) on your computer.
Note 2: Downloading and installing either Canopy or Python(x,y) will take a long time.
Note 3: During this course, Canopy (option 1) will be used to type and execute all commands.
Option 3: Use the Python installed on the PACE clusters (you need a PACE account for this to work). If you choose this option, let me know and I'll send instructions that will help ensure that your environment is set up properly for the tutorials.
Option 4: Use the Python already installed on your laptop. As long as NumPy, SciPy, Matplotlib, IPython, and pandas are installed on your laptop, you will be able to follow both courses (Scientific Computing and Data Analysis and Visualization).

IPython
An interactive computing and development environment.
It provides an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages.
It provides very tight integration with the operating system's shell and file system.
It also includes:
A rich GUI console with inline plotting
A web-based interactive notebook format
A lightweight, fast parallel computing engine

Why use Python for Data Analysis
The Python language is easy to fall in love with.
Python is distinguished by its large and active scientific computing community.
Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
Python's improved library support (pandas) has made it a strong tool for data manipulation tasks.

Example: US Baby Names 1880-2012
The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through 2012. This data set is often used to illustrate data manipulation in R, Python, etc.
The data can be obtained at: http://www.ssa.gov/oact/babynames/limits.html
Things that can be done with this data set:
Visualize the proportion of babies given a particular name
Determine the naming trend
Determine the most popular names in each year

Check the Data
In IPython on MacOS or Linux: use the UNIX head command to look at the first 10 lines of one of the files.
On Windows: download the files and click to open them.
The data is in a convenient comma-separated form.
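For readers without the UNIX head command (for example on Windows), the same check can be sketched in plain Python. The helper name `head` and the toy file contents are assumptions made for illustration; the counts are made up and are not real SSA values.

```python
from itertools import islice
import os
import tempfile


def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Toy stand-in for one yearly file (made-up counts, not real SSA data).
sample = "Mary,F,70\nAnna,F,26\nJohn,M,96\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

first_lines = head(tmp.name, n=2)
os.remove(tmp.name)
print(first_lines)
```

With a real download, the temporary file would be replaced by the path of one of the yearly files.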

Load Data
Use the csv module from the standard library. CSV means Comma Separated Values, although any delimiter can be chosen.
The variable table holds a list of records, each with three fields: name, sex, count.
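A minimal sketch of this loading step follows. It reads from an in-memory string instead of a downloaded file so that it runs anywhere; the sample rows are made up, and converting the count to an integer is an assumption about what later steps need.

```python
import csv
import io

# Toy stand-in for one yearly file (made-up counts, not real SSA data).
sample = io.StringIO("Mary,F,70\nAnna,F,26\nJohn,M,96\nWilliam,M,95\n")

# The delimiter defaults to a comma; it is spelled out here to show
# where a different delimiter would be chosen.
reader = csv.reader(sample, delimiter=",")

# Each record has three fields: name, sex, count.
table = [[name, sex, int(count)] for name, sex, count in reader]
print(table[0])
```

For a real file, `sample` would be replaced by a file object opened with `open(path, newline="")`.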

Grouping the data based on sex
To find the total births by sex, the groupby function is used.
It returns an iterator over the groups, where the key of each group is extracted from x[1] (sex).
Then traverse each group and sum its counts.
Be sure to run from itertools import groupby first.
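The grouping step might look like the sketch below. One detail worth knowing: `itertools.groupby` only merges consecutive records with equal keys, so the records must already be ordered by sex. The explicit sort is added here as a precaution and is an assumption, not necessarily part of the original slide. The toy records are made up.

```python
from itertools import groupby

# Toy records in the form [name, sex, count] (made-up counts).
table = [
    ["Mary", "F", 70],
    ["John", "M", 96],
    ["Anna", "F", 26],
    ["William", "M", 95],
]

# groupby only groups adjacent records, so order by the key first.
table_by_sex = sorted(table, key=lambda x: x[1])

totals = {}
for sex, group in groupby(table_by_sex, key=lambda x: x[1]):
    # Traverse the group and add up the counts.
    totals[sex] = sum(record[2] for record in group)

print(totals)
```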

Anonymous (lambda) Functions
Anonymous or lambda functions are simple functions consisting of a single statement, whose result is the return value.
Lambda functions are convenient in data analysis because many data transformation functions take functions as arguments.
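As a small illustration, a lambda can be passed as the `key` argument of the built-in `sorted` function. The records below are made up, and the named function `get_count` is included only to show the equivalent long form.

```python
# Toy records in the form [name, sex, count] (made-up counts).
records = [["Mary", "F", 70], ["John", "M", 96], ["Anna", "F", 26]]

# The lambda extracts the count; sorted() calls it once per record.
by_count = sorted(records, key=lambda rec: rec[2], reverse=True)
names = [rec[0] for rec in by_count]


# The same key written as a named function, for comparison.
def get_count(rec):
    return rec[2]


same_order = sorted(records, key=get_count, reverse=True)
print(names)
```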

Aggregate the data by year and sex
Since the data set is split into one file per year, one needs to traverse all the files to get the total number of births per year for each sex.
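A sketch of this traversal is shown below. It assumes the yearly files follow a `yobYYYY.txt` naming pattern and writes two tiny made-up files into a temporary directory so that the example runs on its own; the function name `births_by_year` is hypothetical.

```python
import csv
import os
import tempfile


def births_by_year(directory, years):
    """Return a list of [year, female births, male births] records."""
    result = []
    for year in years:
        # Assumed file naming pattern: one file per year.
        path = os.path.join(directory, "yob%d.txt" % year)
        totals = {"F": 0, "M": 0}
        with open(path, newline="") as f:
            for name, sex, count in csv.reader(f):
                totals[sex] += int(count)
        result.append([year, totals["F"], totals["M"]])
    return result


# Made-up contents for two yearly files (not real SSA data).
toy_files = {
    1880: "Mary,F,70\nJohn,M,96\n",
    1881: "Mary,F,69\nAnna,F,27\nJohn,M,87\n",
}
with tempfile.TemporaryDirectory() as directory:
    for year, text in toy_files.items():
        with open(os.path.join(directory, "yob%d.txt" % year), "w") as f:
            f.write(text)
    result = births_by_year(directory, sorted(toy_files))

print(result)
```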

The result list
Left: first 10 records in the pieces list. Right: last 10 records in the pieces list.

Matplotlib review
Before we start plotting the result, let's review the plot function first.

Prepare the data for plotting
Currently, the result is a list of lists. Each inner list contains three values: [year, female births, male births]. To plot the births by year and sex, the plot needs year on the x-axis and births on the y-axis, with two lines representing female and male births.
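One way to turn the list of [year, female births, male births] records into three parallel sequences is to transpose it with `zip`. This is a sketch with made-up numbers; the variable names are chosen for illustration.

```python
# Toy result list: [year, female births, male births] (made-up values).
result = [[1880, 70, 96], [1881, 96, 87], [1882, 110, 105]]

# zip(*result) transposes rows into columns.
years, female, male = (list(column) for column in zip(*result))

print(years)
print(female)
print(male)
```

`years` then serves as the x-axis, while `female` and `male` supply the y-values of the two lines.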

Plot the total births by sex and year
[Plot: total births by sex and year]
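A minimal Matplotlib sketch of such a plot is given below. The values are made up, the labels and title are illustrative choices, and the non-interactive Agg backend is selected only so the example runs without a display; in Canopy or an IPython session with inline plotting that line would normally be dropped.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Made-up values standing in for the aggregated births.
years = [1880, 1881, 1882]
female = [70, 96, 110]
male = [96, 87, 105]

fig, ax = plt.subplots()
ax.plot(years, female, label="Female")
ax.plot(years, male, label="Male")
ax.set_xlabel("Year")
ax.set_ylabel("Births")
ax.set_title("Total births by sex and year")
ax.legend()

# Render to an in-memory PNG; fig.savefig("births.png") would write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```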

Reorganize the data
Concatenate all the files together to prepare for further analysis.
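The concatenation could be sketched as follows. Because the yearly files do not carry the year inside each record, the sketch adds it as an extra field; the record layout [year, name, sex, count] is an assumption made for this and the later sketches, and the file contents are made up.

```python
import csv
import io

# Made-up contents keyed by year, standing in for the yearly files.
yearly_text = {
    1880: "Mary,F,70\nJohn,M,96\n",
    1881: "Mary,F,69\nJohn,M,87\n",
}

names = []
for year in sorted(yearly_text):
    for name, sex, count in csv.reader(io.StringIO(yearly_text[year])):
        # Assumed layout of a combined record: [year, name, sex, count].
        names.append([year, name, sex, int(count)])

print(len(names))
print(names[0])
```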

Extract a subset of the data
Find the top 1000 names for each sex/year combination. This narrows down the data set and makes further analysis easier. Sorting is skipped here because the input files are already in descending order.
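Since each sex/year group is already in descending order of count, taking the first N records of every group is enough. The sketch below assumes the [year, name, sex, count] layout and uses a hypothetical helper `top_n`; the toy call uses n=2 instead of 1000 so the effect is visible on made-up data.

```python
from itertools import groupby, islice


def top_n(records, n=1000):
    """Keep the first n records of each (year, sex) group.

    Assumes records are ordered by year, then sex, then descending count.
    """
    top = []
    for _, group in groupby(records, key=lambda r: (r[0], r[2])):
        top.extend(islice(group, n))
    return top


# Made-up records in the assumed layout [year, name, sex, count].
records = [
    [1880, "Mary", "F", 70],
    [1880, "Anna", "F", 26],
    [1880, "Emma", "F", 20],
    [1880, "John", "M", 96],
    [1880, "William", "M", 95],
    [1880, "James", "M", 59],
]
top2 = top_n(records, n=2)
print([r[1] for r in top2])
```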

Compare the subset data with the original data
The subset has far fewer records than the original data set but still represents the majority of the information.

Analyzing Naming Trends
With the full data set and the Top 1,000 data set in hand, we can start analyzing various naming trends of interest.
Splitting the Top 1,000 names into the boy and girl portions:
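With list-based records, the split can be sketched with two list comprehensions. The list names boys1000 and girls1000 follow the names used later in the slides; the record layout [year, name, sex, count] and the values are assumptions.

```python
# Made-up Top 1,000 records in the assumed layout [year, name, sex, count].
top1000 = [
    [1880, "Mary", "F", 70],
    [1880, "John", "M", 96],
    [1881, "Mary", "F", 69],
    [1881, "John", "M", 87],
]

boys1000 = [r for r in top1000 if r[2] == "M"]
girls1000 = [r for r in top1000 if r[2] == "F"]

print(len(boys1000), len(girls1000))
```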

Analyzing Naming Trends (Cont.)
Plot a handful of names (John, Harry, Mary) in subplots to compare their trends over the years. First, prepare a data set for each chosen name.
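Preparing the per-name data might look like this sketch, which pulls out the years and births for one name. The helper `name_series`, the record layout [year, name, sex, count], and the values are assumptions for illustration.

```python
def name_series(records, name):
    """Return (years, births) lists for one name, ordered by year."""
    rows = sorted((r[0], r[3]) for r in records if r[1] == name)
    years = [year for year, _ in rows]
    births = [count for _, count in rows]
    return years, births


# Made-up records in the assumed layout [year, name, sex, count].
boys1000 = [
    [1881, "John", "M", 87],
    [1880, "John", "M", 96],
    [1880, "Harry", "M", 21],
    [1881, "Harry", "M", 23],
]
john_years, john_births = name_series(boys1000, "John")
print(john_years, john_births)
```

Each (years, births) pair can then be passed to the plotting call for its own subplot.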

Analyzing Naming Trends (Cont.)
Plot the three curves stacked vertically, with years on the x-axis and births on the y-axis. The result shows that these names have fallen out of favor with the American population.

Measuring the increase in naming diversity
To explain the decrease seen in the previous plots, we can measure the proportion of births represented by the top 1000 most popular names, by year and sex.
Step 1: Find the total births per year for each sex.
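Step 1 can be sketched with a dictionary keyed by (year, sex). The function name `total_births`, the record layout [year, name, sex, count], and the values are assumptions.

```python
from collections import defaultdict


def total_births(records):
    """Return {(year, sex): total births} over the full data set."""
    totals = defaultdict(int)
    for year, name, sex, count in records:
        totals[(year, sex)] += count
    return dict(totals)


# Made-up records in the assumed layout [year, name, sex, count].
names = [
    [1880, "Mary", "F", 70],
    [1880, "Anna", "F", 26],
    [1880, "John", "M", 96],
    [1881, "Mary", "F", 69],
]
totals = total_births(names)
print(totals)
```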

Measuring the increase in naming diversity (Cont.)
Step 2: Compute the proportion of top 1000 births to the total births per year for each sex.
For boys:

Measuring the increase in naming diversity (Cont.)
For girls:
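A possible shape for Step 2 is sketched below: sum the births of the top names per year for one sex, then divide by that year's total. The function name `top_share`, the record layout [year, name, sex, count], and all numbers are assumptions; the same call works for boys and girls by changing the `sex` argument.

```python
from collections import defaultdict


def top_share(top_records, totals, sex):
    """Return {year: share of that year's births covered by the top names}."""
    top_sum = defaultdict(int)
    for year, name, record_sex, count in top_records:
        if record_sex == sex:
            top_sum[year] += count
    return {year: top_sum[year] / totals[(year, sex)] for year in sorted(top_sum)}


# Made-up totals per (year, sex) and made-up top records.
totals = {(1880, "M"): 200, (1880, "F"): 100}
top1000 = [
    [1880, "John", "M", 96],
    [1880, "William", "M", 64],
    [1880, "Mary", "F", 70],
]
boys_share = top_share(top1000, totals, "M")
girls_share = top_share(top1000, totals, "F")
print(boys_share, girls_share)
```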

Measuring the increase in naming diversity (Cont.)
The plot shows that, over the years, fewer parents are choosing the popular names for their children.

Measuring the increase in naming diversity (Cont.)
Another interesting metric is the number of distinct popular names, taken in order of popularity from highest to lowest, in the top 50% of births.
Step 1: Add a fourth column to the girls1000 and boys1000 lists to represent each name's proportion of the total births in the given year. Sort the list in descending order of proportion, then sort it again in ascending order of year. The resulting list has each year's records in one chunk, with the proportions in decreasing order.

Measuring the increase in naming diversity (Cont.)
For girls:

Measuring the increase in naming diversity (Cont.)
For boys:
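Step 1 might be sketched as below. The two-pass sort works because Python's sort is stable: sorting by year second keeps the descending-proportion order within each year. The function name `add_proportion` and the record layout [year, name, sex, count] are assumptions; with that layout the proportion lands in the fifth field rather than the fourth. The values are made up.

```python
def add_proportion(records, totals):
    """Append each name's share of its year's births, then order the list."""
    with_prop = [
        [year, name, sex, count, count / totals[(year, sex)]]
        for year, name, sex, count in records
    ]
    # First pass: descending proportion.
    with_prop.sort(key=lambda r: r[4], reverse=True)
    # Second pass: ascending year; stability keeps the first ordering per year.
    with_prop.sort(key=lambda r: r[0])
    return with_prop


# Made-up totals and records, deliberately out of order.
totals = {(1880, "F"): 100, (1881, "F"): 200}
girls1000 = [
    [1881, "Mary", "F", 40],
    [1880, "Anna", "F", 30],
    [1880, "Mary", "F", 50],
    [1881, "Emma", "F", 60],
]
girls_sorted = add_proportion(girls1000, totals)
print([(r[0], r[1]) for r in girls_sorted])
```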

Measuring the increase in naming diversity (Cont.)
Step 2: For each year, add up the proportions from the highest down until the running total reaches 50%, and record the number of individual names needed.

Measuring the increase in naming diversity (Cont.)
For girls:

Measuring the increase in naming diversity (Cont.)
For boys:
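A sketch of Step 2 follows. It assumes the records are already chunked by year with proportions in decreasing order, and that the proportion sits in the fifth field (index 4). The function name `names_to_half` is hypothetical and the values are made up. If a year's listed names never reach the threshold, the sketch simply reports how many names that year has.

```python
from itertools import groupby


def names_to_half(sorted_records, threshold=0.5):
    """Return {year: number of top names needed to reach the threshold}."""
    diversity = {}
    for year, group in groupby(sorted_records, key=lambda r: r[0]):
        cumulative = 0.0
        needed = 0
        for record in group:
            cumulative += record[4]
            needed += 1
            if cumulative >= threshold:
                break
        diversity[year] = needed
    return diversity


# Made-up records: [year, name, sex, count, proportion].
girls_sorted = [
    [1880, "Mary", "F", 30, 0.30],
    [1880, "Anna", "F", 25, 0.25],
    [1880, "Emma", "F", 10, 0.10],
    [1881, "Mary", "F", 40, 0.20],
    [1881, "Anna", "F", 40, 0.20],
    [1881, "Emma", "F", 30, 0.15],
]
girls_diversity = names_to_half(girls_sorted)
print(girls_diversity)
```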

Measuring the increase in naming diversity (Cont.)
Step 3: Plot the result. As you can see, girls' names have always been more diverse than boys' names, and the number of distinct names has grown over time.

Python Library for Data Analysis
Pandas, written by Wes McKinney: http://pandas.pydata.org/
It provides rich data structures and functions for working with structured data.
It is one of the critical ingredients enabling Python to be a powerful and productive data analysis environment.
The primary object in pandas is called DataFrame: a two-dimensional tabular, column-oriented data structure with both row and column labels.
Pandas combines the features of NumPy, spreadsheets, and relational databases.
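To give a feel for the DataFrame, the sketch below repeats the births-by-year-and-sex aggregation with pandas. The column names and values are made up for illustration; `DataFrame.groupby` followed by `sum` is a standard pandas pattern, but this is only one of several ways to express the aggregation.

```python
import pandas as pd

# Made-up records; the column names are illustrative choices.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [70, 26, 96, 69, 87],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Total births for every (year, sex) combination.
total_births = names.groupby(["year", "sex"])["births"].sum()
print(total_births)
```

The grouping, traversal, and summation written by hand in the list-based version collapse into a single expression here.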

Useful Links
Python Scientific Lecture Notes: http://scipy-lectures.github.io/
Matplotlib: http://matplotlib.org/
Documentation: http://docs.python.org