Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!


Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!
Moderator: David L. Snell, ASA, MAAA
Presenters: Brian D. Holland, FSA, MAAA; Dihui Lai, Ph.D.; Sheamus Kee Parkes, FSA, MAAA

Python for Actuaries
Brian Holland, FSA, MAAA
2015 SOA Annual Meeting, Austin, TX

Disclaimer: Any views or opinions discussed or shown in this presentation are solely those of the author and do not represent those of AIG or any of its subsidiaries or employees.

Why learn Python?
- We hear a ton about machine learning, data science, and big data.
- To actually do these things personally, you have to have the technical skills, programming / hacking skills included.
- Python has a lot of traction in data science applications and is now quite popular. You don't have to look long before seeing it.
- Some data science companies are Python shops.

Why not learn, or learn about, Python:
- You don't program or manage programmers or programming.
- You can get by in a spreadsheet or with VBA.
- You have no interest in doing or trying advanced analytics.

Fair warning: this is a presentation about a programming language.

Purpose today: shake hands with Python
- See what you might want to dig into

What is Python?
- an object-oriented language
- with extensive scientific, numeric libraries
- with many special-purpose libraries
- with an expanding user base
- that is designed for readability
  - Forced tabbing (indentation is part of the syntax); many places to comment work in accessible ways
- around since 1991
- in two active versions: 2 and 3
  - For new work: not much case for sticking with 2 now; the big libraries are ported to 3.
- named after Monty Python, not the snake

Applications for actuaries
- A general-purpose master tool, with libraries for special purposes
- Can manipulate R, MS Office, and other Windows objects
- Data munging: easily read spreadsheets, text files, databases; scrape the web (with the library BeautifulSoup)
- Process automation and documentation
- Data visualization
- Statistical modeling / machine learning / data science / predictive modeling
- Presentations

Ways to use Python
- System command: for scripts
- Command line environment

Ways to use Python: IPython notebooks
- Edit browser-based documents saved in JSON
- Mix formatted text and computation
  - Typeset math
  - Section headings, HTML, markdown
  - Graphics inline with the flow of text, computed as you go
- Run on remote servers through the web, also on grids
- Convert the notebooks easily to slides, HTML, plain Python files; on to MS Word
- Note: IPython notebooks were recently folded into the Jupyter project
  - A front end for many other back-end computations, including R and Julia
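Since the slide points out that notebooks are documents saved in JSON, a small sketch may make that concrete. It builds a two-cell notebook by hand using only the standard library. It assumes the version 4 notebook layout (top-level `cells`, `metadata`, `nbformat`, `nbformat_minor`); the helper names and the cell contents are hypothetical and not from the talk.

```python
import json


def markdown_cell(text):
    """A formatted-text cell (headings, markdown, typeset math)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A computation cell that has not been executed yet."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def make_notebook(cells):
    """Wrap cells in a minimal version-4 notebook structure (assumed layout)."""
    return {
        "cells": cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 0,
    }


notebook = make_notebook([
    markdown_cell("# Lapse study\nFormatted text sits next to the computation."),
    code_cell("total_exposure = 1 + 1 + 2\nprint(total_exposure)"),
])

# The whole document is plain JSON text, so it can be diffed, stored, or
# generated like any other text file.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:60])
```

Writing `notebook_json` to a file with an `.ipynb` extension should give something a notebook server can open, though real notebooks usually carry more metadata than this sketch does.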

Ways to use Python: IPython notebooks
Could you do that in a spreadsheet? I could not, not reasonably.

What is knowing Python?
- Language: syntax, and the Python standard library
  - The Python Standard Library by Example, Doug Hellmann, 2011
- Libraries to do what you need
  - BeautifulSoup: to read and manipulate HTML/XML, scraping the web
  - PyODBC: to talk to databases
  - NumPy, Pandas, Scikit-Learn: essential for machine learning and computation generally
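To give a feel for the "read and manipulate HTML" task mentioned above, here is a minimal sketch that pulls the cells out of an HTML table. It deliberately uses the standard library's `html.parser` as a stand-in, not BeautifulSoup itself, so that it runs without extra installs; the class name `TableCellParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


SAMPLE_HTML = """
<table>
  <tr><th>ISSUE_AGE</th><th>EXPOSURE</th><th>LAPSE_CNT</th></tr>
  <tr><td>33-37</td><td>1</td><td>1</td></tr>
  <tr><td>63-67</td><td>1</td><td>0</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

A dedicated scraping library would typically do this in a few lines and cope better with messy real-world pages; the sketch only shows the shape of the task.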

Graphics libraries: Death by choice
- Bokeh: for interactive plots in the browser
- Seaborn
- ggplot: port for R fans and experts
- VisPy: bleeding edge, GPU, interactive, 2D, 3D, wow
- Matplotlib: the main one

Tip: come to the afternoon session to see what these LTC exhibits are.

Data I/O with Pandas
The Pandas library can import many document types directly into a DataFrame object (similar to R's):
- Fixed-width text
- Delimited text
- Spreadsheets
- HTML, JSON
- SQL queries, using an open connection to the DB
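A short sketch of three of the input routes listed above, assuming pandas is installed. In-memory strings stand in for real files, and an in-memory SQLite database stands in for whatever database you actually connect to; the sample rows reuse column names from the lapse study later in the session, and the table name `lapses` is hypothetical.

```python
import io
import sqlite3

import pandas as pd

DELIMITED = """STUDY_YEAR,ISSUE_AGE,POLICY_YEAR,EXPOSURE,LAPSE_CNT
2009-2010,33-37,10,1,1
2009-2010,63-67,10,1,0
2008-2009,28-32,10,2,2
"""

FIXED_WIDTH = """STUDY_YEAR ISSUE_AGE EXPOSURE
2009-2010  33-37     1
2008-2009  28-32     2
"""

# Delimited text: read_csv accepts a path or any file-like object.
delimited_df = pd.read_csv(io.StringIO(DELIMITED))

# Fixed-width text: column boundaries are inferred from the layout.
fixed_df = pd.read_fwf(io.StringIO(FIXED_WIDTH))

# SQL query over an open connection (SQLite used here as a stand-in).
connection = sqlite3.connect(":memory:")
delimited_df.to_sql("lapses", connection, index=False)
sql_df = pd.read_sql_query(
    "SELECT ISSUE_AGE, LAPSE_CNT FROM lapses WHERE LAPSE_CNT > 0",
    connection,
)
connection.close()

print(delimited_df.dtypes)
print(fixed_df)
print(sql_df)
```

Each call returns a DataFrame, so the downstream code does not need to care where the data came from.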

Machine learning: scikit-learn
The killer app? Many examples at http://scikit-learn.org/stable/auto_examples/index.html.
A very small sample from the page: [example plots not reproduced]

Cooperation with other software: RPy2 in a Notebook
- "R Magic" (there are many magic functions in IPython or Jupyter notebooks)
- Allows commands to other tools directly in the notebook

More on RPy2: accessing R objects

PypeR: another way to talk to R
PypeR uses pipes to communicate with R.
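The pipe idea can be illustrated without R installed. The sketch below starts a child process, sends it data through its standard input, and reads the answer back from its standard output. A second Python interpreter stands in for R here, and `mean_via_pipe` is a hypothetical helper, not part of PypeR's API.

```python
import subprocess
import sys

# Stand-in child program: reads one number per line from stdin and prints
# their mean. A pipe-based R bridge would launch R at this point instead.
CHILD_PROGRAM = (
    "import sys\n"
    "values = [float(line) for line in sys.stdin if line.strip()]\n"
    "print(sum(values) / len(values))\n"
)


def mean_via_pipe(values):
    """Send values down a pipe to the child process and read back its reply."""
    payload = "\n".join(str(v) for v in values) + "\n"
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_PROGRAM],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return float(completed.stdout.strip())


if __name__ == "__main__":
    print(mean_via_pipe([1, 2, 3]))
```

The appeal of this design is that the two programs only exchange text, so neither needs to be compiled against the other.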

Good luck, have fun!
Thanks for your interest.
Brian Holland, FSA, MAAA

R for Actuarial Science
Dihui Lai, PhD
Data Scientist, Reinsurance Group of America, Incorporated

Outline
- R, Whats and Whys?
- Use R for Actuarial Science
- R Demo
- Conquer Big Data with R

R, Whats and Whys?
- Powerful data manipulation, statistical modeling, and charting tools of modern data science
- Open source project since 1995
- Active community (>2 million users and developers)
- Incorporates features of object-oriented and functional programming

R, Whats and Whys?
- Statistical toolkits
- Easy data manipulation

  STUDY_YEAR  ISSUE_AGE  POLICY_YEAR  EXPOSURE  LAPSE_CNT
  2009-2010   33-37      10           1         1
  2009-2010   63-67      10           1         0
  2008-2009   28-32      10           2         2
  2008-2009   53-57      10           2         1
  2009-2010   38-42      10           1         1
  2008-2009   23-27      10           1         0

- Cutting edge analytics
- Database
- Integrate advanced data tech
- Visualization tools

Use R for Actuarial Science
Example: Term Tail Lapse Study

load("lapsedata.rdata")
head(LapseData)
##     STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT      FA_BAND
## 9    2009-2010     33-37          10        1         1 B. 100k-249k
## 71   2009-2010     63-67          10        1         0 B. 100k-249k
## 121  2008-2009     28-32          10        2         2 C. 250k-999k
## 210  2008-2009     53-57          10        2         1 B. 100k-249k
## 223  2009-2010     38-42          10        1         1 C. 250k-999k
## 237  2008-2009     23-27          10        1         0 B. 100k-249k

summary(LapseData)
##      STUDY_YEAR      ISSUE_AGE      POLICY_YEAR       EXPOSURE
##  2010-2011:98630   33-37  :92930   Min.   :10.00   Min.   : 0.002732
##  2011-2012:88353   38-42  :91723   1st Qu.:10.00   1st Qu.: 1.000000
##  2009-2010:83321   43-47  :76142   Median :10.00   Median : 1.000000
##  2008-2009:77505   28-32  :69777   Mean   :10.87   Mean   : 1.226270
##  2007-2008:59968   48-52  :57920   3rd Qu.:11.00   3rd Qu.: 1.000000
##  2006-2007:41000   53-57  :41278   Max.   :19.00   Max.   :26.000000
##  (Other)  :64476   (Other):83483
##    LAPSE_CNT               FA_BAND
##  Min.   : 0.000   A. < 100k    : 39121
##  1st Qu.: 0.000   B. 100k-249k :230897
##  Median : 1.000   C. 250k-999k :208131
##  Mean   : 0.615   D. 1M - 1.99M: 26042
##  3rd Qu.: 1.000   E. 2M+       :  7232
##  Max.   :24.000   D. 1M-1.99M  :  1830


Use R for Actuarial Science
Example: Term Tail Lapse Study

Model1 <- glm(LAPSE_CNT ~ offset(log(EXPOSURE)) + FA_BAND, family = poisson(), data = LapseData)
summary(Model1)
##
## Call:
## glm(formula = LAPSE_CNT ~ offset(log(EXPOSURE)) + FA_BAND, family = poisson(),
##     data = LapseData)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.6517  -0.9669  -0.2003   0.6752   2.8462
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -0.987363   0.007434 -132.81   <2e-16 ***
## FA_BANDB. 100k-249k   0.226844   0.007926   28.62   <2e-16 ***
## FA_BANDC. 250k-999k   0.372967   0.007905   47.18   <2e-16 ***
## FA_BANDD. 1M - 1.99M  0.488017   0.010462   46.65   <2e-16 ***
## FA_BANDE. 2M+         0.615627   0.015559   39.57   <2e-16 ***
## FA_BANDD. 1M-1.99M    0.857298   0.020445   41.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 413195  on 513252  degrees of freedom
## Residual deviance: 408135  on 513247  degrees of freedom
## AIC: 951877
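The model above is a Poisson regression with a log link and log exposure as an offset, so each coefficient is a log lapse rate per unit of exposure relative to the base band. When the only predictor is a single factor, the maximum-likelihood fit simply reproduces each band's total lapses divided by total exposure, which makes the coefficients easy to check by hand. The sketch below does that in plain Python for the six rows printed by head(); it is an illustration on that tiny sample only, so its numbers will not match the full-data output, and the function name is hypothetical.

```python
import math
from collections import defaultdict

# (FA_BAND, EXPOSURE, LAPSE_CNT) for the six rows printed by head().
SAMPLE_ROWS = [
    ("B. 100k-249k", 1, 1),
    ("B. 100k-249k", 1, 0),
    ("C. 250k-999k", 2, 2),
    ("B. 100k-249k", 2, 1),
    ("C. 250k-999k", 1, 1),
    ("B. 100k-249k", 1, 0),
]


def fit_poisson_one_factor(rows, base_level):
    """Closed-form Poisson fit with a log-exposure offset and one factor.

    Returns (intercept, effects), where exp(intercept) is the base band's
    lapse rate and exp(effects[band]) is that band's rate relative to base.
    Assumes every band has at least one lapse (log(0) is undefined).
    """
    exposure = defaultdict(float)
    lapses = defaultdict(float)
    for band, row_exposure, row_lapses in rows:
        exposure[band] += row_exposure
        lapses[band] += row_lapses

    rates = {band: lapses[band] / exposure[band] for band in exposure}
    intercept = math.log(rates[base_level])
    effects = {
        band: math.log(rate) - intercept
        for band, rate in rates.items()
        if band != base_level
    }
    return intercept, effects


intercept, effects = fit_poisson_one_factor(SAMPLE_ROWS, base_level="B. 100k-249k")
print(round(math.exp(intercept), 3))
print({band: round(math.exp(effect), 3) for band, effect in effects.items()})
```

Read the same way, the slide's intercept of -0.987363 corresponds to roughly 0.37 lapses per unit of exposure for the base band (the one band with no coefficient row), and the 0.226844 coefficient for band B corresponds to a lapse rate roughly 25% higher than the base.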

Use R for Actuarial Science
Example: Hierarchical Clustering

Use R for Actuarial Science
Examples: Other Potentials
- SVM
- Text Mining
- Map
- Have Fun

R Demo
Use R for Twitter Streaming

Conquer Big Data with R
R packages for big data
- Memory allocation: ff, bigmemory
- Integrate R with clusters: RHadoop, SparkR
- Parallel computing packages: snowfall, multicore
- Commercial distribution: Revolution R

Summary - Do You Want the Toolbox?
- Easy data manipulation

  STUDY_YEAR  ISSUE_AGE  POLICY_YEAR  EXPOSURE  LAPSE_CNT
  2009-2010   33-37      10           1         1
  2009-2010   63-67      10           1         0
  2008-2009   28-32      10           2         2
  2008-2009   53-57      10           2         1
  2009-2010   38-42      10           1         1
  2008-2009   23-27      10           1         0

- Statistical toolkits
- Cutting edge analytics
- Database
- Integrate advanced data tech
- Visualization tools

Questions?

R vs Python
SOA Annual Meeting, October 2015
Presented by Shea Parkes, FSA, MAAA

Limitations
The views expressed in this presentation are those of the presenter, and not those of Milliman. Nothing in this presentation is intended to represent a professional opinion or be an interpretation of actuarial standards of practice.

Data Science: A Useful Perspective
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
June 27, 2011

Data Science: A Useful Perspective
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
= Actuarial Student/Analyst Self-Assessment


Bending your brain
- The more you use Python, the better you are able to think about programming
- The more you use R, the better you are able to think about data analysis

Both are multi-paradigm, but
- Functions are first-class objects, but lambdas are constrained and an awkward nonlocal statement was only recently introduced
- 3+ ways to do Object Oriented Programming, but none of them are simple and easy to use
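The first bullet reads as a description of Python, and a few lines are enough to show each point it makes: functions passed around as values, the single-expression limit on lambdas, and the `nonlocal` statement needed to rebind a variable from an enclosing function. The function names below are hypothetical, chosen only for this sketch.

```python
def make_counter():
    """Closure that keeps state between calls."""
    count = 0

    def increment():
        # Without `nonlocal`, `count += 1` would raise UnboundLocalError,
        # because assignment makes `count` local to `increment`.
        nonlocal count
        count += 1
        return count

    return increment


def apply_twice(func, value):
    """Functions are ordinary objects, so they can be passed as arguments."""
    return func(func(value))


# A lambda is limited to a single expression: no statements, no assignments.
add_one = lambda x: x + 1

counter = make_counter()
counter()
print(counter())
print(apply_twice(add_one, 10))
```

Anything longer than one expression has to become a named `def`, which is the constraint the slide is pointing at.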

Both could use a little help

Recent growth coming together
- Data science stack: Pandas + scikit-learn + statsmodels + IPython
- Cutting-edge modeling: Theano and PyStan
- RStudio + devtools + more: encouraging best software development practices
- dplyr + magrittr = more readable code = faster development

But what should I use?
- Will you need to integrate with other systems at all?
- Is analyzing data 80%+ of what you will be doing?
- Whichever your colleagues have experience in!
