Similar documents
CRASH COURSE PYTHON. Het begint met een idee

Intro to scientific programming (with Python) Pietro Berkes, Brandeis University

Part VI. Scientific Computing in Python

Scientific Programming in Python

Programming Languages & Tools

Exercise 0. Although Python(x,y) comes already with a great variety of scientic Python packages, we might have to install additional dependencies:

Introduction Installation Comparison. Department of Computer Science, Yazd University. SageMath. A.Rahiminasab. October9, / 17

Python for Data Analysis and Visualiza4on. Fang (Cherry) Liu, Ph.D PACE Gatech July 2013

CUDAMat: a CUDA-based matrix class for Python

Unlocking the True Value of Hadoop with Open Data Science

Mastering pandas. Mastering pandas. Mastering pandas. Femi Anthony. Master the features and capabilities of pandas, a data analysis toolkit for Python

Performance Monitoring using Pecos Release 0.1

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Lab 2: Visualization with d3.js

Scientific Programming, Analysis, and Visualization with Python. Mteor 227 Fall 2015

Analytics on Big Data

Usability of Visualization Libraries for Web Browsers for Use in Scientific Analysis

SmartArrays and Java Frequently Asked Questions

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables

An Introduction to APGL

CS1112 Spring 2014 Project 4. Objectives. 3 Pixelation for Identity Protection. due Thursday, 3/27, at 11pm

Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!

NetworkX: Network Analysis with Python

Tentative NumPy Tutorial

CIS 192: Lecture 13 Scientific Computing and Unit Testing

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

Today's Topics. COMP 388/441: Human-Computer Interaction. simple 2D plotting. 1D techniques. Ancient plotting techniques. Data Visualization:

The Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT): A Vision for Large-Scale Climate Data

Car Insurance. Prvák, Tomi, Havri

Introduction to the data.table package in R

What s New in MATLAB and Simulink

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Importing Data into R

Chapter 8 Detailed Modeling

Recall that two vectors in are perpendicular or orthogonal provided that their dot

Data Visualization. Christopher Simpkins

Analytics with Excel and ARQUERY for Oracle OLAP

Session 15 OF, Unpacking the Actuary's Technical Toolkit. Moderator: Albert Jeffrey Moore, ASA, MAAA

SAS ADD-IN FOR MICROSOFT OFFICE

R YOU READY FOR PYTHON? Sunday 19th April, 2015

Data Management: Best Practices. Michelle Craft Research IT Coordinator

$99.95 per user. SQL Server 2008/R2 Reporting Services CourseId: 162 Skill level: Run Time: 37+ hours (195 videos)

IBM Cognos Analysis for Microsoft Excel

Machine Learning in Python with scikit-learn. O Reilly Webcast Aug. 2014

6.170 Tutorial 3 - Ruby Basics

Architectures for Big Data Analytics A database perspective

R2MLwiN Using the multilevel modelling software package MLwiN from R

Using Computer Programs for Quantitative Data Analysis

Intellect Platform - Parent-Child relationship Basic Expense Management System - A103

How To Cluster

Linear Algebra Review. Vectors

1 Topic. 2 Scilab. 2.1 What is Scilab?

Working with Spreadsheets

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

Lesson 9. Reports. 1. Create a Visual Report. Create a visual report. Customize a visual report. Create a visual report template.

A Visualization is Worth a Thousand Tables: How IBM Business Analytics Lets Users See Big Data

Data exploration with Microsoft Excel: analysing more than one variable

Pivot Charting in SharePoint with Nevron Chart for SharePoint

Python. Python. 1 Python. M.Ulvrova, L.Pouilloux (ENS LYON) Informatique L3 Automne / 25

Ibis: Scaling Python Analy=cs on Hadoop and Impala

Modeling with Python

SAS R IML (Introduction at the Master s Level)

NetworkX: Network Analysis with Python

Machine Learning Final Project Spam Filtering

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Data Analytics at NERSC. Joaquin Correa NERSC Data and Analytics Services

Processing Data with rsmap3d Software Services Group Advanced Photon Source Argonne National Laboratory

<no narration for this slide>

Microsoft Excel 2010 Pivot Tables

The NumPy array: a structure for efficient numerical computation

Big Data Paradigms in Python

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

How to Use a Data Spreadsheet: Excel

The Scientific Data Mining Process

What s new in TIBCO Spotfire 6.5

Integrating SAS with JMP to Build an Interactive Application

An introduction to Python Programming for Research

Understanding Data: A Comparison of Information Visualization Tools and Techniques

ANSA and μeta as a CAE Software Development Platform

Intelligent Query and Reporting against DB2. Jens Dahl Mikkelsen SAS Institute A/S

Visualizing Data from Government Census and Surveys: Plans for the Future

CHAPTER 6: ANALYZE MICROSOFT DYNAMICS NAV 5.0 DATA IN MICROSOFT EXCEL

Resource Dashboard. Portfolio and Project Management. A PLM Consulting Solution. Public

Python for Scientific Computing.

Use Cases and Design Best Practices

CREATING EXCEL PIVOT TABLES AND PIVOT CHARTS FOR LIBRARY QUESTIONNAIRE RESULTS

MicroStrategy Analytics Express User Guide

Getting Started with R and RStudio 1

IBM SPSS Data Preparation 22

JMulTi/JStatCom - A Data Analysis Toolkit for End-users and Developers

Figure 1: Choose your Excel output format.

Advanced Visualizations Tools for CERN Institutional Data

Lecture 25: Database Notes

How To Write A Data Analysis Project

Assignment 2: Option Pricing and the Black-Scholes formula The University of British Columbia Science One CS Instructor: Michael Gelbart

Create Cool Lumira Visualization Extensions with SAP Web IDE Dong Pan SAP PM and RIG Analytics Henry Kam Senior Product Manager, Developer Ecosystem

Fast Query Manuale Sistemista. Fast Query System users guide

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

The difference between. BI and CPM. A white paper prepared by Prophix Software

Transcription:

Data structures for statistical computing in Python Wes McKinney SciPy 2010 McKinney () Statistical Data Structures in Python SciPy 2010 1 / 31

Environments for statistics and data analysis The usual suspects: R / S+, MATLAB, Stata, SAS, etc. Python being used increasingly in statistical or related applications scikits.statsmodels: linear models and other econometric estimators PyMC: Bayesian MCMC estimation scikits.learn: machine learning algorithms Many interfaces to mostly non-python libraries (pycluster, SHOGUN, Orange, etc.) And others (look at the SciPy conference schedule!) How can we attract more statistical users to Python? McKinney () Statistical Data Structures in Python SciPy 2010 2 / 31

What matters to statistical users? Standard suite of linear algebra, matrix operations (NumPy, SciPy) Availability of statistical models and functions More than there used to be, but nothing compared to R / CRAN rpy2 is coming along, but it doesn t seem to be an end-user project Data visualization and graphics tools (matplotlib,...) Interactive research environment (IPython) McKinney () Statistical Data Structures in Python SciPy 2010 3 / 31

What matters to statistical users? (cont d) Easy installation and sources of community support Well-written and navigable documentation Robust input / output tools Flexible data structures and data manipulation tools McKinney () Statistical Data Structures in Python SciPy 2010 4 / 31

What matters to statistical users? (cont d) Easy installation and sources of community support Well-written and navigable documentation Robust input / output tools Flexible data structures and data manipulation tools McKinney () Statistical Data Structures in Python SciPy 2010 5 / 31

Statistical data sets Statistical data sets commonly arrive in tabular format, i.e. as a two-dimensional list of observations and names for the fields of each observation. array([( GOOG, 2009-12-28, 622.87, 1697900.0), ( GOOG, 2009-12-29, 619.40, 1424800.0), ( GOOG, 2009-12-30, 622.73, 1465600.0), ( GOOG, 2009-12-31, 619.98, 1219800.0), ( AAPL, 2009-12-28, 211.61, 23003100.0), ( AAPL, 2009-12-29, 209.10, 15868400.0), ( AAPL, 2009-12-30, 211.64, 14696800.0), ( AAPL, 2009-12-31, 210.73, 12571000.0)], dtype=[( item, S4 ), ( date, S10 ), ( price, <f8 ), ( volume, <f8 )]) McKinney () Statistical Data Structures in Python SciPy 2010 6 / 31

Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31

Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data Cons Can t be immediately used in many (most?) NumPy methods Are not flexible in size (have to use or write auxiliary methods to add fields) Not too many built-in data manipulation methods Selecting subsets is often O(n)! McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31

Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data Cons Can t be immediately used in many (most?) NumPy methods Are not flexible in size (have to use or write auxiliary methods to add fields) Not too many built-in data manipulation methods Selecting subsets is often O(n)! What can be learned from other statistical languages? McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31

R s data.frame One of the core data structures of the R language. In many ways similar to a structured array. > df <- read.csv( data ) item date price volume 1 GOOG 2009-12-28 622.87 1697900 2 GOOG 2009-12-29 619.40 1424800 3 GOOG 2009-12-30 622.73 1465600 4 GOOG 2009-12-31 619.98 1219800 5 AAPL 2009-12-28 211.61 23003100 6 AAPL 2009-12-29 209.10 15868400 7 AAPL 2009-12-30 211.64 14696800 8 AAPL 2009-12-31 210.73 12571000 McKinney () Statistical Data Structures in Python SciPy 2010 8 / 31

R s data.frame Perhaps more like a mutable dictionary of vectors. Much of R s statistical estimators and 3rd-party libraries are designed to be used with data.frame objects. > df$isgoog <- df$item == "GOOG" > df item date price volume isgoog 1 GOOG 2009-12-28 622.87 1697900 TRUE 2 GOOG 2009-12-29 619.40 1424800 TRUE 3 GOOG 2009-12-30 622.73 1465600 TRUE 4 GOOG 2009-12-31 619.98 1219800 TRUE 5 AAPL 2009-12-28 211.61 23003100 FALSE 6 AAPL 2009-12-29 209.10 15868400 FALSE 7 AAPL 2009-12-30 211.64 14696800 FALSE 8 AAPL 2009-12-31 210.73 12571000 FALSE McKinney () Statistical Data Structures in Python SciPy 2010 9 / 31

pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or labeled data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31

pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or labeled data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31

pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or labeled data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) Core idea: ndarrays with labeled axes and lots of methods McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31

pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or labeled data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) Core idea: ndarrays with labeled axes and lots of methods Etymology: panel data structures McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31

pandas DataFrame Basically a pythonic data.frame, but with automatic data alignment! Arithmetic operations align on row and column labels. >>> data = DataFrame.fromcsv( data, index_col=none) date item price volume 0 2009-12-28 GOOG 622.9 1.698e+06 1 2009-12-29 GOOG 619.4 1.425e+06 2 2009-12-30 GOOG 622.7 1.466e+06 3 2009-12-31 GOOG 620 1.22e+06 4 2009-12-28 AAPL 211.6 2.3e+07 5 2009-12-29 AAPL 209.1 1.587e+07 6 2009-12-30 AAPL 211.6 1.47e+07 7 2009-12-31 AAPL 210.7 1.257e+07 >>> df[ ind ] = df[ item ] == GOOG McKinney () Statistical Data Structures in Python SciPy 2010 11 / 31

How to organize the data? Especially for larger data sets, we d rather not pay O(# obs) to select a subset of the data. O(1)-ish would be preferable >>> data[data[ item ] == GOOG ] array([( GOOG, 2009-12-28, 622.87, 1697900.0), ( GOOG, 2009-12-29, 619.40, 1424800.0), ( GOOG, 2009-12-30, 622.73, 1465600.0), ( GOOG, 2009-12-31, 619.98, 1219800.0)], dtype=[( item, S4 ), ( date, S10 ), ( price, <f8 ), ( volume, <f8 )]) McKinney () Statistical Data Structures in Python SciPy 2010 12 / 31

How to organize the data? Really we have data on three dimensions: date, item, and data type. We can pay upfront cost to pivot the data and save time later: >>> df = data.pivot( date, item, price ) >>> df AAPL GOOG 2009-12-28 211.6 622.9 2009-12-29 209.1 619.4 2009-12-30 211.6 622.7 2009-12-31 210.7 620 McKinney () Statistical Data Structures in Python SciPy 2010 13 / 31

How to organize the data? In this format, grabbing labeled, lower-dimensional slices is easy: >>> df[ AAPL ] 2009-12-28 211.61 2009-12-29 209.1 2009-12-30 211.64 2009-12-31 210.73 >>> df.xs( 2009-12-28 ) AAPL 211.61 GOOG 622.87 McKinney () Statistical Data Structures in Python SciPy 2010 14 / 31

Data alignment Data sets originating from different files or different database tables may not always be homogenous: >>> s1 >>> s2 AAPL 0.044 AAPL 0.025 IBM 0.050 BAR 0.158 SAP 0.101 C 0.028 GOOG 0.113 DB 0.087 C 0.138 F 0.004 SCGLY 0.037 GOOG 0.154 BAR 0.200 IBM 0.034 DB 0.281 VW 0.040 McKinney () Statistical Data Structures in Python SciPy 2010 15 / 31

Data alignment Arithmetic operations, etc., match on axis labels. Done in Cython so significantly faster than pure Python. >>> s1 + s2 AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 F NaN GOOG 0.26666583847 IBM 0.0833057542385 SAP NaN SCGLY NaN VW NaN McKinney () Statistical Data Structures in Python SciPy 2010 16 / 31

Missing data handling Since data points may be deemed missing or masked, having tools for these makes sense. >>> (s1 + s2).fill(0) AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 F 0.0 GOOG 0.26666583847 IBM 0.0833057542385 SAP 0.0 SCGLY 0.0 VW 0.0 McKinney () Statistical Data Structures in Python SciPy 2010 17 / 31

Missing data handling >>> (s1 + s2).valid() AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 GOOG 0.26666583847 IBM 0.0833057542385 >>> (s1 + s2).sum() 1.3103630754662747 >>> (s1 + s2).count() 6 McKinney () Statistical Data Structures in Python SciPy 2010 18 / 31

Categorical data and Group by Often want to compute descriptive stats on data given group designations: >>> s >>> cats industry AAPL 0.044 AAPL TECH IBM 0.050 IBM TECH SAP 0.101 SAP TECH GOOG 0.113 GOOG TECH C 0.138 C FIN SCGLY 0.037 SCGLY FIN BAR 0.200 BAR FIN DB 0.281 DB FIN VW 0.040 VW AUTO RNO AUTO F AUTO TM AUTO McKinney () Statistical Data Structures in Python SciPy 2010 19 / 31

GroupBy in R R users are spoiled by having vector recognized as something you might want to group by : > labels [1] GOOG GOOG GOOG GOOG AAPL AAPL AAPL AAPL Levels: AAPL GOOG > data [1] 622.87 619.40 622.73 619.98 211.61 209.10 211.64 210.73 > tapply(data, labels, mean) AAPL GOOG 210.770 621.245 McKinney () Statistical Data Structures in Python SciPy 2010 20 / 31

GroupBy in pandas We try to do something similar in pandas; the input can be any function or dict-like object mapping labels to groups: >>> data.groupby(labels).aggregate(np.mean) AAPL 210.77 GOOG 621.245 McKinney () Statistical Data Structures in Python SciPy 2010 21 / 31

GroupBy in pandas More fancy things are possible, like transforming groups by arbitrary functions: demean = lambda x: x - x.mean() def group_demean(obj, keyfunc): grouped = obj.groupby(keyfunc) return grouped.transform(demean) >>> group_demean(s, ind) AAPL -0.0328370881632 BAR 0.0358663891836 C -0.0261271326111 DB 0.11719543981 GOOG 0.035936259143 IBM -0.0272802815728 SAP 0.024181110593 McKinney () Statistical Data Structures in Python SciPy 2010 22 / 31

Merging data sets One commonly encounters a group of data sets which are not quite identically-indexed: >>> df1 >>> df2 AAPL GOOG MSFT YHOO 2009-12-24 209 618.5 2009-12-24 31 16.72 2009-12-28 211.6 622.9 2009-12-28 31.17 16.88 2009-12-29 209.1 619.4 2009-12-29 31.39 16.92 2009-12-30 211.6 622.7 2009-12-30 30.96 16.98 2009-12-31 210.7 620 McKinney () Statistical Data Structures in Python SciPy 2010 23 / 31

Merging data sets By default gluing these together on the row labels seems reasonable: >>> df1.join(df2) AAPL GOOG MSFT YHOO 2009-12-24 209 618.5 31 16.72 2009-12-28 211.6 622.9 31.17 16.88 2009-12-29 209.1 619.4 31.39 16.92 2009-12-30 211.6 622.7 30.96 16.98 2009-12-31 210.7 620 NaN NaN McKinney () Statistical Data Structures in Python SciPy 2010 24 / 31

Merging data sets Returning to our first example, one might also wish to join on some other key: >>> df.join(cats, on= item ) date industry item value 0 2009-12-28 TECH GOOG 622.9 1 2009-12-29 TECH GOOG 619.4 2 2009-12-30 TECH GOOG 622.7 3 2009-12-31 TECH GOOG 620 4 2009-12-28 TECH AAPL 211.6 5 2009-12-29 TECH AAPL 209.1 6 2009-12-30 TECH AAPL 211.6 7 2009-12-31 TECH AAPL 210.7 McKinney () Statistical Data Structures in Python SciPy 2010 25 / 31

Manipulating panel (3D) data In finance, econometrics, etc. we frequently encounter panel data, i.e. multiple data series for a group of individuals over time: >>> grunfeld capita firm inv value year 0 2.8 1 317.6 3078 1935 20 53.8 2 209.9 1362 1935 40 97.8 3 33.1 1171 1935 60 10.5 4 40.29 417.5 1935 80 183.2 5 39.68 157.7 1935 100 6.5 6 20.36 197 1935 120 100.2 7 24.43 138 1935 140 1.8 8 12.93 191.5 1935 160 162 9 26.63 290.6 1935 180 4.5 10 2.54 70.91 1935 1 52.6 1 391.8 4662 1936... McKinney () Statistical Data Structures in Python SciPy 2010 26 / 31

Manipulating panel (3D) data What you saw was the stacked or tabular format, but the 3D form can be more useful at times: >>> lp = LongPanel.fromRecords(grunfeld, year, firm ) >>> wp = lp.towide() >>> wp <class pandas.core.panel.widepanel > Dimensions: 3 (items) x 20 (major) x 10 (minor) Items: capital to value Major axis: 1935 to 1954 Minor axis: 1 to 10 McKinney () Statistical Data Structures in Python SciPy 2010 27 / 31

Manipulating panel (3D) data What you saw was the stacked or tabular format, but the 3D form can be more useful at times: >>> wp[ capital ].head() 1935 1936 1937 1938 1939 1 2.8 265 53.8 213.8 97.8 2 52.6 402.2 50.5 132.6 104.4 3 156.9 761.5 118.1 264.8 118 4 209.2 922.4 260.2 306.9 156.2 5 203.4 1020 312.7 351.1 172.6 6 207.2 1099 254.2 357.8 186.6 7 255.2 1208 261.4 342.1 220.9 8 303.7 1430 298.7 444.2 287.8 9 264.1 1777 301.8 623.6 319.9 10 201.6 2226 279.1 669.7 321.3 McKinney () Statistical Data Structures in Python SciPy 2010 28 / 31

Manipulating panel (3D) data What you saw was the stacked or tabular format, but the 3D form can be more useful at times: # mean over time for each firm >>> wp.mean(axis= major ) capital inv value 1 140.8 98.45 923.8 2 153.9 131.5 1142 3 205.4 134.8 1140 4 244.2 115.8 872.1 5 269.9 109.9 998.9 6 281.7 132.2 1056 7 301.7 169.7 1148 8 344.8 173.3 1068 9 389.2 196.7 1236 10 428.5 197.4 1233 McKinney () Statistical Data Structures in Python SciPy 2010 29 / 31

Implementing statistical models Common issues Model specification (think R formulas) Data cleaning Attaching metadata (labels) to variables To the extent possible, should make the user s life easy Short demo McKinney () Statistical Data Structures in Python SciPy 2010 30 / 31

Conclusions Let s attract more (statistical) users to Python by providing superior tools! Related projects: larry (la), tabular, datarray, others... Come to the BoF today at 6 pm pandas Website: http://pandas.sourceforge.net Contact: wesmckinn@gmail.com McKinney () Statistical Data Structures in Python SciPy 2010 31 / 31