Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!
Moderator: David L. Snell, ASA, MAAA
Presenters: Brian D. Holland, FSA, MAAA; Dihui Lai, Ph.D.; Sheamus Kee Parkes, FSA, MAAA
Python for Actuaries
Brian Holland, FSA, MAAA
2015 SOA Annual Meeting, Austin, TX
Disclaimer: Any views or opinions discussed or shown in this presentation are solely those of the author and do not represent those of AIG or any of its subsidiaries or employees.
Why learn Python?
- We hear a ton about machine learning, data science, and big data.
- To actually do these things personally, you have to have the technical skills, programming / hacking skills included.
- Python has a lot of traction in data science applications and is now quite popular. You don't have to look long before seeing it.
- Some data science companies are Python shops.
Why not learn (or learn about) Python:
- You don't program or manage programmers or programming.
- You can get by in a spreadsheet or with VBA.
- You have no interest in doing or trying advanced analytics.
Fair warning: this is a presentation about a programming language.
Purpose today: shake hands with Python
- See what you might want to dig into
What is Python?
- an object-oriented language
  - with extensive scientific and numeric libraries
  - with many special-purpose libraries
  - with an expanding user base
- that is designed for readability
  - Forced indentation ("tabbing"); many places to comment work in accessible ways
- around since 1991
- in two active versions: 2 and 3
  - For new work: not much case for sticking with 2 now; the big libraries are ported to 3.
- named after Monty Python, not the snake
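To make the readability point concrete, here is a minimal sketch of a small class. The class name `LapseTally` and its methods are hypothetical, invented only for this illustration. It shows how indentation alone marks blocks and how docstrings and comments sit next to the code they describe.

```python
class LapseTally:
    """Accumulate exposure and lapse counts (a hypothetical example class)."""

    def __init__(self):
        self.exposure = 0.0
        self.lapses = 0

    def add(self, exposure, lapses):
        # The indented lines form the method body; no braces or
        # "end" keywords are needed.
        self.exposure += exposure
        self.lapses += lapses

    def rate(self):
        """Return lapses per unit of exposure, or None if nothing was added."""
        if self.exposure == 0:
            return None
        return self.lapses / self.exposure


tally = LapseTally()
for exposure, lapses in [(1, 1), (1, 0), (2, 2)]:
    tally.add(exposure, lapses)
print(tally.rate())  # 3 lapses over 4 units of exposure
```

The same code runs under Python 3; nothing here depends on a third-party library.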
Applications for actuaries
- A general-purpose master tool, with libraries for special purposes
- Can manipulate R, MS Office, and other Windows objects
- Data munging: easily read spreadsheets, text files, and databases; scrape the web (with the BeautifulSoup library)
- Process automation and documentation
- Data visualization
- Statistical modeling / machine learning / data science / predictive modeling
- Presentations
Ways to use Python
- System command: for scripts
- Command line environment
Ways to use Python: IPython notebooks
- Edit browser-based documents saved in JSON
- Mix formatted text and computation
  - Typeset math
  - Section headings, HTML, markdown
  - Graphics inline with the flow of text, computed as you go
- Run remote servers through the web, also grids
- Convert the notebooks easily to slides, HTML, plain Python files; on to MS Word
- Note: IPython notebooks were recently folded into the Jupyter project
  - Front end for many other back-end computations, including R and Julia
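Because a notebook is saved as JSON, it can be built and inspected with nothing more than the standard library. The sketch below hand-writes a two-cell notebook in what I understand to be the version 4 notebook layout; the cell contents are made up for illustration, and the structure is not validated against the official schema here.

```python
import json

# A minimal notebook: one markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Lapse study\n", "Typeset math: $q_x = d_x / l_x$"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["exposure = 5\n", "lapses = 2\n", "lapses / exposure"],
        },
    ],
}

# Serialize to the text that would be stored in a .ipynb file.
notebook_text = json.dumps(notebook, indent=1)

# Because it is plain JSON, any tool can read the cells back.
loaded = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Keeping text, code, and outputs in one JSON document is what makes the conversions to slides, HTML, or plain Python files straightforward.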
Ways to use Python: IPython notebooks
Could you do that in a spreadsheet? I could not, not reasonably.
What is knowing Python?
- Language: syntax, and the Python standard library
  - The Python Standard Library by Example, Doug Hellmann, 2011
- Libraries to do what you need
  - BeautifulSoup: to read and manipulate HTML/XML, scraping the web
  - PyODBC: to talk to databases
  - NumPy, Pandas, Scikit-Learn: essential for machine learning and computation generally
Graphics libraries: Death by choice
- Bokeh: for interactive plots in the browser
- Seaborn
- GGPLOT: port for R fans and experts
- VisPy: bleeding edge, GPU, interactive, 2d, 3d, wow
- Matplotlib: the main one
Tip: come to the afternoon session to see what these LTC exhibits are.
Data I/O with Pandas
The Pandas library can import many document types directly into a DataFrame object (similar to R's):
- Fixed-width text
- Delimited text
- Spreadsheets
- HTML, JSON
- SQL queries, using an open connection to the DB
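A minimal sketch of three of these readers, assuming pandas is installed. The sample rows are the first three lapse-study records shown later in this session; the fixed-width layout, the table name `lapses`, and the in-memory SQLite database are assumptions made only so the example is self-contained. Spreadsheets and HTML have analogous readers (`pd.read_excel`, `pd.read_html`), which need extra packages and are not shown.

```python
import io
import sqlite3

import pandas as pd

columns = ["STUDY_YEAR", "ISSUE_AGE", "POLICY_YEAR", "EXPOSURE", "LAPSE_CNT"]

# Delimited text: the first line supplies the column names.
delimited_text = (
    "STUDY_YEAR,ISSUE_AGE,POLICY_YEAR,EXPOSURE,LAPSE_CNT\n"
    "2009-2010,33-37,10,1,1\n"
    "2009-2010,63-67,10,1,0\n"
    "2008-2009,28-32,10,2,2\n"
)
from_csv = pd.read_csv(io.StringIO(delimited_text))

# Fixed-width text: no header row, so field widths and names are given.
fixed_text = (
    "2009-201033-3710 1 1\n"
    "2009-201063-6710 1 0\n"
    "2008-200928-3210 2 2\n"
)
from_fwf = pd.read_fwf(
    io.StringIO(fixed_text), widths=[9, 5, 2, 2, 2], header=None, names=columns
)

# SQL query over an open connection (an in-memory SQLite database here).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE lapses (STUDY_YEAR TEXT, ISSUE_AGE TEXT, "
    "POLICY_YEAR INTEGER, EXPOSURE INTEGER, LAPSE_CNT INTEGER)"
)
rows = [
    ("2009-2010", "33-37", 10, 1, 1),
    ("2009-2010", "63-67", 10, 1, 0),
    ("2008-2009", "28-32", 10, 2, 2),
]
connection.executemany("INSERT INTO lapses VALUES (?, ?, ?, ?, ?)", rows)
from_sql = pd.read_sql_query("SELECT * FROM lapses", connection)
connection.close()

print(from_csv)
print(from_fwf)
print(from_sql)
```

All three calls return a DataFrame with the same five columns, so downstream code does not need to care where the data came from.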
Machine learning: scikit-learn, the killer app?
Many examples at http://scikit-learn.org/stable/auto_examples/index.html. A very small sample from the page:
Cooperation with other software: RPy2 in a Notebook
- "R magic" (there are many magic functions in IPython or Jupyter notebooks)
- Magics allow commands to other tools directly in the notebook
More on RPy2: accessing R objects
PypeR: another way to talk to R
PypeR uses pipes to communicate with R.
Good luck, have fun!
Thanks for your interest.
Brian Holland, FSA, MAAA
R for Actuarial Science
Dihui Lai, PhD
Data Scientist
Reinsurance Group of America, Incorporated
Outline
- R, Whats and Whys?
- Use R for Actuarial Science
- R Demo
- Conquer Big Data with R
R, Whats and Whys?
- Powerful data manipulation, statistical modeling, and charting tools of modern data science
- Open source project since 1995
- Active community (>2 million users and developers)
- Incorporates features of object-oriented and functional programming
R, Whats and Whys?
- Statistical toolkits
- Easy data manipulation

    STUDY_YEAR  ISSUE_AGE  POLICY_YEAR  EXPOSURE  LAPSE_CNT
    2009-2010   33-37      10           1         1
    2009-2010   63-67      10           1         0
    2008-2009   28-32      10           2         2
    2008-2009   53-57      10           2         1
    2009-2010   38-42      10           1         1
    2008-2009   23-27      10           1         0

- Cutting edge analytics
- Database
- Integrate advanced data tech
- Visualization tools
Use R for Actuarial Science
Example: Term Tail Lapse Study

    load("lapsedata.rdata")
    head(lapsedata)
    ##     STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT      FA_BAND
    ## 9    2009-2010     33-37          10        1         1 B. 100k-249k
    ## 71   2009-2010     63-67          10        1         0 B. 100k-249k
    ## 121  2008-2009     28-32          10        2         2 C. 250k-999k
    ## 210  2008-2009     53-57          10        2         1 B. 100k-249k
    ## 223  2009-2010     38-42          10        1         1 C. 250k-999k
    ## 237  2008-2009     23-27          10        1         0 B. 100k-249k
    summary(lapsedata)
    ##      STUDY_YEAR      ISSUE_AGE      POLICY_YEAR       EXPOSURE
    ##  2010-2011:98630   33-37  :92930   Min.   :10.00   Min.   : 0.002732
    ##  2011-2012:88353   38-42  :91723   1st Qu.:10.00   1st Qu.: 1.000000
    ##  2009-2010:83321   43-47  :76142   Median :10.00   Median : 1.000000
    ##  2008-2009:77505   28-32  :69777   Mean   :10.87   Mean   : 1.226270
    ##  2007-2008:59968   48-52  :57920   3rd Qu.:11.00   3rd Qu.: 1.000000
    ##  2006-2007:41000   53-57  :41278   Max.   :19.00   Max.   :26.000000
    ##  (Other)  :64476   (Other):83483
    ##    LAPSE_CNT              FA_BAND
    ##  Min.   : 0.000   A. < 100k    : 39121
    ##  1st Qu.: 0.000   B. 100k-249k :230897
    ##  Median : 1.000   C. 250k-999k :208131
    ##  Mean   : 0.615   D. 1M - 1.99M: 26042
    ##  3rd Qu.: 1.000   E. 2M+       :  7232
    ##  Max.   :24.000   D. 1M-1.99M  :  1830
Use R for Actuarial Science
Example: Term Tail Lapse Study

    model1 <- glm(LAPSE_CNT ~ offset(log(exposure)) + FA_BAND,
                  family = poisson(), data = LapseData)
    summary(model1)
    ##
    ## Call:
    ## glm(formula = LAPSE_CNT ~ offset(log(exposure)) + FA_BAND, family = poisson(),
    ##     data = LapseData)
    ##
    ## Deviance Residuals:
    ##     Min       1Q   Median       3Q      Max
    ## -4.6517  -0.9669  -0.2003   0.6752   2.8462
    ##
    ## Coefficients:
    ##                       Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)          -0.987363   0.007434 -132.81   <2e-16 ***
    ## FA_BANDB. 100k-249k   0.226844   0.007926   28.62   <2e-16 ***
    ## FA_BANDC. 250k-999k   0.372967   0.007905   47.18   <2e-16 ***
    ## FA_BANDD. 1M - 1.99M  0.488017   0.010462   46.65   <2e-16 ***
    ## FA_BANDE. 2M+         0.615627   0.015559   39.57   <2e-16 ***
    ## FA_BANDD. 1M-1.99M    0.857298   0.020445   41.93   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for poisson family taken to be 1)
    ##
    ##     Null deviance: 413195  on 513252  degrees of freedom
    ## Residual deviance: 408135  on 513247  degrees of freedom
    ## AIC: 951877
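One way to read this output, sketched in Python with only the standard library: with a log link and a log(exposure) offset, exp(intercept) is the fitted lapse rate per unit of exposure for the baseline band (A. < 100k, the level with no coefficient of its own), and exp(coefficient) is a multiplicative relativity against that baseline. The second half illustrates, on just the six rows printed by head(lapsedata), that a Poisson model with a single categorical predictor fits each band at its crude rate (total lapses divided by total exposure). Six rows are far too few to reproduce the full-data coefficients; they only show the mechanics.

```python
import math

# Coefficients copied from the summary(model1) output.
intercept = -0.987363
band_coefficients = {
    "B. 100k-249k": 0.226844,
    "C. 250k-999k": 0.372967,
    "D. 1M - 1.99M": 0.488017,
    "E. 2M+": 0.615627,
    "D. 1M-1.99M": 0.857298,
}

# Fitted lapse rate per unit of exposure for the baseline band.
baseline_rate = math.exp(intercept)

# Each other band's fitted rate is the baseline times its relativity.
fitted_rates = {"A. < 100k": baseline_rate}
for band, coefficient in band_coefficients.items():
    fitted_rates[band] = baseline_rate * math.exp(coefficient)

for band, rate in fitted_rates.items():
    print(f"{band:<14} fitted lapse rate: {rate:.4f}")

# The six rows printed by head(lapsedata): (FA_BAND, EXPOSURE, LAPSE_CNT).
head_rows = [
    ("B. 100k-249k", 1, 1),
    ("B. 100k-249k", 1, 0),
    ("C. 250k-999k", 2, 2),
    ("B. 100k-249k", 2, 1),
    ("C. 250k-999k", 1, 1),
    ("B. 100k-249k", 1, 0),
]

# Crude rate per band = total lapses / total exposure.
exposure_by_band = {}
lapses_by_band = {}
for band, exposure, lapses in head_rows:
    exposure_by_band[band] = exposure_by_band.get(band, 0) + exposure
    lapses_by_band[band] = lapses_by_band.get(band, 0) + lapses

crude_rates = {
    band: lapses_by_band[band] / exposure_by_band[band]
    for band in exposure_by_band
}
print(crude_rates)
```

Note that the output carries two spellings of the D band ("D. 1M - 1.99M" and "D. 1M-1.99M"), so the model treats them as separate levels with separate coefficients.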
Use R for Actuarial Science
Example: Hierarchical Clustering
Use R for Actuarial Science
Examples: Other Potentials
- SVM
- Text Mining
- Map
- Have Fun
R Demo
Use R for Twitter Streaming
Conquer Big Data with R
R packages for big data
- Memory allocation: ff, bigmemory
- Integrate R with clusters: RHadoop, SparkR
- Parallel computing packages: snowfall, multicore
- Commercial distribution: Revolution R
Summary - Do You Want the Toolbox?
- Easy data manipulation

    STUDY_YEAR  ISSUE_AGE  POLICY_YEAR  EXPOSURE  LAPSE_CNT
    2009-2010   33-37      10           1         1
    2009-2010   63-67      10           1         0
    2008-2009   28-32      10           2         2
    2008-2009   53-57      10           2         1
    2009-2010   38-42      10           1         1
    2008-2009   23-27      10           1         0

- Statistical toolkits
- Cutting edge analytics
- Database
- Integrate advanced data tech
- Visualization tools
Questions?
R vs Python
SOA Annual Meeting, October 2015
Presented by Shea Parkes, FSA, MAAA
Limitations
The views expressed in this presentation are those of the presenter, and not those of Milliman. Nothing in this presentation is intended to represent a professional opinion or be an interpretation of actuarial standards of practice.
Data Science: A Useful Perspective
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
June 27, 2011
Data Science: A Useful Perspective
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Legend: marker on the diagram = Actuarial Student/Analyst Self-Assessment
Bending your brain
- The more you use Python, the better you are able to think about programming
- The more you use R, the better you are able to think about data analysis
Both are multi-paradigm, but...
- Python: functions are first-class objects, but lambdas are constrained and an awkward nonlocal statement was only recently introduced
- R: 3+ ways to do object-oriented programming, but none of them are simple and easy to use
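The Python half of that complaint can be sketched in a few lines. The function names below are hypothetical, chosen for illustration. The example shows a function passed as an argument, a lambda (which is limited to a single expression), and a closure that needs `nonlocal` (available in Python 3 only) to rebind a variable in the enclosing function.

```python
def make_counter():
    """Return a function that counts how many times it has been called."""
    count = 0

    def increment():
        # Without `nonlocal`, the assignment below would make `count` a new
        # local variable and reading it would raise UnboundLocalError.
        nonlocal count
        count = count + 1
        return count

    return increment


def apply_twice(func, value):
    """Functions are first-class objects: they can be passed as arguments."""
    return func(func(value))


# A lambda is a single expression: no statements, no assignments.
add_three = lambda x: x + 3

counter = make_counter()
first = counter()
second = counter()
result = apply_twice(add_three, 10)
print(first, second, result)  # prints: 1 2 16
```

Anything longer than one expression has to become a named `def`, which is the constraint the slide refers to.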
Both could use a little help
Recent growth coming together
- Data Science stack: Pandas + scikit-learn + statsmodels + IPython
- Cutting edge modeling: Theano and PyStan
- RStudio + devtools + more: encouraging best software development practices
- dplyr + magrittr = more readable code = faster development
But what should I use?
- Will you need to integrate with other systems at all?
- Is analyzing data 80%+ of what you will be doing?
- Whichever your colleagues have experience in!