1 Introduction to Python for Text Analysis Jennifer Pan Institute for Quantitative Social Science Harvard University (Political Science Methods Workshop, February ) *Much credit to Andy Hall and Learning to Think Like a Computer Scientist
2 Outline 1 Comparing Python and R 2 Python Basics 3 Web-scraping with Python 4 Text Pre-processing and Analysis
3 Outline 1 Comparing Python and R 2 Python Basics 3 Web-scraping with Python 4 Text Pre-processing and Analysis
4 How do we compare languages? Depends on the task For the same task Speed of processing Amount of code (how much does each line accomplish) Readability of code There are many ways of doing the same thing in each language your preference
5 Comparing Python and R Similarities: Both are high-level languages Both are interpreted languages Both have shell and script mode (though most rarely use script mode for R) General advantages of each: R: more richness for data analysis, e.g., >5,000 packages for data analysis, lots of packages developed for social science Python: much more varied applications, e.g., internet protocols, OS interface, data analysis
6 Outline 1 Comparing Python and R 2 Python Basics 3 Web-scraping with Python 4 Text Pre-processing and Analysis
7 Comparing Python and R Refer to basics.py for all exercises in this section: 1. Data types 2. Variables 3. Functions 4. Conditionals 5. Iteration 6. String 7. Lists 8. Modules 9. Scripting
8 1. Data types Every thing you come across belongs to a data type: >>> type(4) <type int > >>> type(5.3226) <type float > >>> type("hello, World!") <type str > >>> type( Hello, MIT ) <type str > >>> type( 4 ) <type str > >>> type(true) <type bool >
9 1. Data types Data types can be converted >>> int(5.3326) 5 >>> float(4) 4.0 >>> int( 4 ) 4 >>> str(5.3326)
10 2. Variables A variables is a name that refers to a value. In python, a name is assigned to a value with = (equivalent of <- in R) >>> message = "How are you doing?" >>> n = 34 >>> pi = >>> print message How are you doing? >>> print n 34 >>> print pi >>> type(message) <type str > >>> type(n) <type int >
11 2. Variables Variables of the same type can be combined >>> n + pi >>> n + message Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unsupported operand type(s) for +: int and str Variables can be altered >>> count = 0 >>> count = count + 1 >>> count 1 >>> count += 1 >>> count 2
12 3. Functions A function is series of statements with names that perform a desired operation. The python syntax is always (indent crucial): def NAME( PARAMETER LIST ): STATEMENTS For example: >>> def funny_add(a, b):... return a + b >>> funny_add(3,4) 8 >>> weird = funny_add(3,4) >>> weird 8
13 3. Functions There are lots of useful built-in functions, some of which we have used >>> type(pi) <type float > >>> abs(-15) 15 >>> max(7,11) 11 >>> max(abs(-15),7,11) 15 >>> len(message) 18
14 4. Conditionals Conditionals allows us to change our program based on pre-set conditions (e.g., scrape only entries with X). They take the form: if BOOLEAN EXPRESSION: STATEMENTS For example: >>> x = 5 >>> if x > 0:... print "x is positive"... x is positive >>> if x % 2 == 0:... print x, "is even"... else:... print x, "is odd"... 5 is odd
15 4. Conditionals Conditionals can be chained: >>> y = 4 >>> if x < y:... print x, "is less than", y... elif x > y:... print x, "is greater than", y... else:... print x, "and", y, "are equal" 5 is greater than 4 They can also be nested: >>> if x == y:... print x, "and", y, "are equal"... else:... if x < y:... print x, "is less than", y... else:... print x, "is greater than", y 5 is greater than 4
16 4. Conditionals Conditionals can also be part of functions: >>> def is_divisible(a, b):... if a % b == 0:... return True... else:... return False... >>> is_divisible(x,y) False >>> type(is_divisible) <type function >
17 5. Iteration Iterators allow us to repeat execution of a set of statements (e.g., stem all of the text from each row of a csv file). Important are the while statement >>> def countdown(n):... while n > 0:... print n... n = n-1... print "Blastoff!"... >>> countdown(6) And the for statement: >>> for x in message:... print x
18 6. Strings Strings are a data type that s critical for working with text data Strings can be sliced: >>> message How are you doing? >>> message H >>> message[-1]? >>> message[0:5] How a >>> message[5:] re you doing? But they are immutable: >>> message = "L" Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: str object does not support item assignment >>> new message = "L" + message[1:] >>> new message Low are you doing?
19 6. Strings We often use iterators to process strings: >>> index = 0 >>> while index < len(message):... letter = message[index]... print letter... index += 1 >>> for thing in message:... print thing >>> prefixes = "JKLMNOPQ" >>> suffix = "ack" >>> >>> for letter in prefixes:... print letter + suffix
20 6. Strings Here s an example of a function that removes vowels, using an iterator and a conditional: >>> def remove_vowels(s):... vowels = "aeiouaeiou"... s_without_vowels = ""... for letter in s:... if letter not in vowels:... s_without_vowels += letter... return s_without_vowels >>> remove_vowels(message) Hw r y dng?
21 6. Strings There are a number of useful string methods (a method is a member of a class, here methods are functions that belong to the string class) >>> message.upper() >>> message.lower() >>> bad_message = " How are you doing? " >>> bad_message.strip() >>> bad_message.replace(" ","") >>> bad_message.split() >>> " ".join(bad_message.split()) >>> message.find( How ) >>> message.find( how ) >>> message.find(? )
22 6. Strings The string formatting operator % is very useful. >>> name = "John" >>> age = 24 >>> weight = >>> "I am %s and I am %d years old and weight %f pounds." % (name, age, weight) I am John and I am 24 years old and weight pounds. >>> "I am %s and I am %d years old and weight %.1f pounds." % (name, age, weight) I am John and I am 24 years old and weight pounds.
23 7. Lists Lists are a very important data type because they allow you to store data for later use >>> numlist = [10, 20, 30, 40] >>> numlist [10, 20, 30, 40] >>> numlist 10 >>> numlist 20 >>> numlist[1:3] [20, 30] >>> numlist[3:] 
24 7. Lists Iterators are crucial for going through each item of a list >>> for item in numlist:... print item You can create a numerical list with the range function: >>> range(1,5) [1, 2, 3, 4] >>> range(5) [0, 1, 2, 3, 4]
25 7. Lists You can create any list with the.append() method: >>> democracies =  >>> democracies  >>> democracies.append("united States") >>> democracies.append("germany") >>> democracies.insert(0,"canada") >>> democracies [ Canada, United States, Germany ] Lists are mutable: >>> democracies Canada >>> democracies = "India" >>> democracies [ India, United States, Germany ]
26 8. Modules A module is a neatly packaged set of functions. Some come with python, others need to be installed. You need to import modules to use them (like library() in R) Some useful modules: import os: allows you to interact with your operating system, e.g., set working directory, set the path to file import time: time functions (e.g., for benchmarking) import string: lots of string operations, some similar to the methods available to string data type, others as well, e.g., >>> string.find(message, are ) >>> string.split(message) >>> string.printable >>> string.digits
27 8. Modules Two modules for web-scraping we ll use are urllib2 and BeautifulSoup >>> import urllib2 >>> url = " >>> page = urllib2.urlopen(url) >>> type(page) >>> page = urllib2.urlopen(url).read() >>> type(page) >>> print page >>> from bs4 import BeautifulSoup >>> soup = BeautifulSoup(page) >>> print soup.prettify()
28 9. Scripting We ve been working in the command interface, but if you re doing anything that requires more than a few lines of code (i.e., web-scraping, text analysis), you don t want to input each line of code manually, so you need a script. A script is just lines of code for python to execute, that you save as a file ending in.py. cleanup.py is a script that cleans up text Run the script by cd to the directory where the file is, and type: $ python cleanup.py I cleaned the bad string: How are you doing? made it into a good string: How are you doing? and it took seconds cleanup.py has about 10 lines of executable code. How many lines to do this in R? Can you do it in R without RegEx?
29 Outline 1 Comparing Python and R 2 Python Basics 3 Web-scraping with Python 4 Text Pre-processing and Analysis
30 Web-scraping example We re going to do the same thing Danny did in the last workshop Code found in scraping.py, does the following things Load needed packages Define the url we re scraping from Use BeautifulSoup to find the country name and rate Save country name and rate as items of a list Fix a problem with Northern Ireland (because of bad HTML) Save output of country names and rates to.csv
31 Web-scraping example stats Here are the stats for this code: 25 lines of code (not including comments) Does everything in seconds How does this compare to R? Amount of code Readability of code Processing speed
32 Outline 1 Comparing Python and R 2 Python Basics 3 Web-scraping with Python 4 Text Pre-processing and Analysis
33 Comparing Available Resources Python R Access URL urllib2 ReadLines Clean HTML BeautifulSoup Regular expressions grep gsub Pre-processing nltk tm Classification nltk or scikit-learn tm Topic models gensim topicmodels Topic models More info More info Pre-processing include: segmenting, tokenizing, stripping, stopword removal, stemming, creating a TDM It s very easy to write wrappers for other languages in python (e.g., Stanford NLP group has lots of great tools in Java, it s easy to access these tools with python), can be harder with R.
34 For more information bit.ly/jenpan_python Please send corrections to
CWCS Workshop May 2009 PYTHON Basics http://hetland.org/writing/instant-hacking.html Python is an easy to learn, modern, interpreted, object-oriented programming language. It was designed to be as simple
Caltech/LEAD Summer 2012 Computer Science Lecture 2: July 10, 2012 Introduction to Python The Python shell Outline Python as a calculator Arithmetic expressions Operator precedence Variables and assignment
CRASH COURSE PYTHON nr. Het begint met een idee This talk Not a programming course For data analysts, who want to learn Python For optimizers, who are fed up with Matlab 2 Python Scripting language expensive
Last time we...  Learned how to set up our computer for scripting with python et al.  Thought about breaking down a scripting problem into its constituent steps.  Solved a simple data logistics
Chapter 2 Writing Simple Programs Charles Severance Textbook: Python Programming: An Introduction to Computer Science, John Zelle Software Development Process Figure out the problem - for simple problems
CS 1133, LAB 2: FUNCTIONS AND TESTING http://www.cs.cornell.edu/courses/cs1133/2015fa/labs/lab02.pdf First Name: Last Name: NetID: The purpose of this lab is to help you to better understand functions:
Exercise 4 Learning Python language fundamentals Work with numbers Python can be used as a powerful calculator. Practicing math calculations in Python will help you not only perform these tasks, but also
WEEK ONE Introduction to Python Python is such a simple language to learn that we can throw away the manual and start with an example. Traditionally, the first program to write in any programming language
CSCE 110 Programming Basics of Python: Variables, Expressions, and nput/output Dr. Tiffani L. Williams Department of Computer Science and Engineering Texas A&M University Fall 2011 Python Python was developed
CS177 MIDTERM 2 PRACTICE EXAM SOLUTION Name: Student ID: This practice exam is due the day of the midterm 2 exam. The solutions will be posted the day before the exam but we encourage you to look at the
Introduction to Python Sophia Bethany Coban Problem Solving By Computer March 26, 2014 Introduction to Python Python is a general-purpose, high-level programming language. It offers readable codes, and
NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
Reading Computer Science 1 CSci 1100 Lecture 3 Python Functions Most of this is covered late Chapter 2 in Practical Programming and Chapter 3 of Think Python. Chapter 6 of Think Python goes into more detail,
COMS 3101-3 Programming Languages Python: Lecture 1 Kangkook Jee firstname.lastname@example.org Agenda Course descripgon IntroducGon to Python Language aspects and usage cases GeJng started How to run Python Basic
ESCI 386 Scientific Programming, Analysis and Visualization with Python Lesson 5 Program Control 1 Interactive Input Input from the terminal is handled using the raw_input() function >>> a = raw_input('enter
LEARNING TO PROGRAM WITH PYTHON Richard L. Halterman Copyright 2011 Richard L. Halterman. All rights reserved. i Contents 1 The Context of Software Development 1 1.1 Software............................................
Example of a Java program class SomeNumbers static int square (int x) return x*x; public static void main (String args) int n=20; if (args.length > 0) // change default n = Integer.parseInt(args);
Python Programming: An Introduction to Computer Science Chapter 7 Decision Structures Python Programming, 1/e 1 Objectives To understand the programming pattern simple decision and its implementation using
Welcome to Introduction to programming in Python Suffolk One, Ipswich, 4:30 to 6:00 Tuesday Jan 14, Jan 21, Jan 28, Feb 11 Welcome Fire exits Toilets Refreshments 1 Learning objectives of the course An
CS101 Lecture 24: Thinking in Python: Input and Output Variables and Arithmetic Aaron Stevens 28 March 2011 1 Overview/Questions Review: Programmability Why learn programming? What is a programming language?
LING115 Lecture Note Session #4 Python (1) 1. Introduction As we have seen in previous sessions, we can use Linux shell commands to do simple text processing. We now know, for example, how to count words.
Introduction Welcome to our Python sessions. University of Hull Department of Computer Science Wrestling with Python Week 01 Playing with Python Vsn. 1.0 Rob Miles 2013 Please follow the instructions carefully.
WEEK THREE Python Lists and Loops You ve made it to Week 3, well done! Most programs need to keep track of a list (or collection) of things (e.g. names) at one time or another, and this week we ll show
Exercise 0 Deadline: None Computer Setup Windows Download Python(x,y) via http://code.google.com/p/pythonxy/wiki/downloads and install it. Make sure that before installation the installer does not complain
Introduction to Java The HelloWorld program Primitive data types Assignment and arithmetic operations User input Conditional statements Looping Arrays CSA0011 Matthew Xuereb 2008 1 Java Overview A high
CS-889 Spring 2011 Project 2: Term Clouds (HOF) Implementation Report Members: Nicole Sparks (project leader), Charlie Greenbacker Abstract: This report describes the methods used in our implementation
CME 193: Introduction to Scientific Python Lecture 8: Unit testing, more modules, wrap up Sven Schmit stanford.edu/~schmit/cme193 8: Unit testing, more modules, wrap up 8-1 Contents Unit testing More modules
IVR Studio 3.0 Guide May-2013 Knowlarity Product Team Contents IVR Studio... 4 Workstation... 4 Name & field of IVR... 4 Set CDR maintainence property... 4 Set IVR view... 4 Object properties view... 4
Python Programming: An Introduction To Computer Science Chapter 8 Booleans and Control Structures Python Programming, 2/e 1 Objectives æ To understand the concept of Boolean expressions and the bool data
Chapter 3 Writing Simple Programs Charles Severance Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/.
s CMPS 5P (Professor Theresa Migler-VonDollen ): Assignment #8 Problem 6 Problem 1 Programming Exercises Modify the recursive Fibonacci program given in the chapter so that it prints tracing information.
Compsci 101: Test 1 October 2, 2013 Name: NetID/Login: Community Standard Acknowledgment (signature) Problem 1 Problem 2 Problem 3 Problem 4 Problem 5 TOTAL: value 16 pts. 12 pts. 20 pts. 12 pts. 20 pts.
Python Programming: An Introduction to Computer Science Sequences: Strings and Lists Python Programming, 2/e 1 Objectives To understand the string data type and how strings are represented in the computer.
21.1 Advanced Tornado Advanced Tornado One of the main reasons we might want to use a web framework like Tornado is that they hide a lot of the boilerplate stuff that we don t really care about, like escaping
CLC Server Command Line Tools USER MANUAL Manual for CLC Server Command Line Tools 2.5 Windows, Mac OS X and Linux September 4, 2015 This software is for research purposes only. QIAGEN Aarhus A/S Silkeborgvej
Visual Basic Programming An Introduction Why Visual Basic? Programming for the Windows User Interface is extremely complicated. Other Graphical User Interfaces (GUI) are no better. Visual Basic provides
CS106A, Stanford Handout #38 Fall, 2004-05 Nick Parlante Strings and Chars The char type (pronounced "car") represents a single character. A char literal value can be written in the code using single quotes
First Java Programs V. Paúl Pauca Department of Computer Science Wake Forest University CSC 111D Fall, 2015 Hello World revisited / 8/23/15 The f i r s t o b l i g a t o r y Java program @author Paul Pauca
http://htsql.org/ HTSQL A Database Query Language HTSQL is a comprehensive navigational query language for relational databases. HTSQL is designed for data analysts and other accidental programmers who
Lecture 5: Java Fundamentals III School of Science and Technology The University of New England Trimester 2 2015 Lecture 5: Java Fundamentals III - Operators Reading: Finish reading Chapter 2 of the 2nd
ECPE 170 University of the Pacific Crash Dive into Python 2 Lab Schedule Ac:vi:es Assignments Due Today Lab 8 Python Due by Oct 26 th 5:00am Endianness Lab 9 Tuesday Due by Nov 2 nd 5:00am Network programming
AWK: The Duct Tape of Computer Science Research Tim Sherwood UC Santa Barbara Duct Tape Systems Research Environment Lots of simulators, data, and analysis tools Since it is research, nothing works together
Computers An Introduction to Programming with Python CCHSG Visit June 2014 Dr.-Ing. Norbert Völker Many computing devices are embedded Can you think of computers/ computing devices you may have in your
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
JobScheduler - Job Execution and Scheduling System JobScheduler Web Services Executing JobScheduler commands Technical Reference March 2015 March 2015 JobScheduler Web Services page: 1 JobScheduler Web
Python Programming: An Introduction to Computer Science Chapter 1 Computers and Programs 1 Objectives To understand the respective roles of hardware and software in a computing system. To learn what computer
Object Oriented Software Design Introduction to Java - II Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa September 14, 2011 G. Lipari (Scuola Superiore Sant Anna) Introduction
Federated Desktop and File Server Search with libferris Ben Martin Abstract How to federate CLucene personal document indexes with PostgreSQL/TSearch2. The libferris project has two major goals: mounting
Test automation of Web applications can be done more effectively by accessing the plumbing within the user interface. Here is a detailed walk-through of Watir, a tool many are using to check the pipes.
Outline Chapter 7 Software Development 7.1 The Development Process 7.1.1 The Waterfall Model 7.1.2 The Iterative Methodology 7.1.3 Elements of UML 7.2 Software Testing 7.2.1 The Essence of Testing 7.2.2
Sample midterm Multiple Choice (circle one) (2 pts) Evaluate the following Boolean expressions and indicate whether short-circuiting happened during evaluation: Assume variables with the following names
Term Test #2 COSC 1020 3.0 Introduction to Computer Science I Section A, Summer 2005 Family Name: Given Name(s): Student Number: Question Out of Mark A Total 16 B-1 7 B-2 4 B-3 4 B-4 4 B Total 19 C-1 4
Basic C Shell email@example.com 11th August 2003 This is a very brief guide to how to use cshell to speed up your use of Unix commands. Googling C Shell Tutorial can lead you to more detailed information.
MINING DATA FROM TWITTER Abhishanga Upadhyay Luis Mao Malavika Goda Krishna 1 Abstract The purpose of this report is to illustrate how to data mine Twitter to anyone with a computer and Internet access.
PHP Tutorial From beginner to master PHP is a powerful tool for making dynamic and interactive Web pages. PHP is the widely-used, free, and efficient alternative to competitors such as Microsoft's ASP.
Unit 6 Loop statements Summary Repetition of statements The while statement Input loop Loop schemes The for statement The do statement Nested loops Flow control statements 6.1 Statements in Java Till now
A Test Suite for Basic CWE Effectiveness Paul E. Black firstname.lastname@example.org http://samate.nist.gov/ Static Analysis Tool Exposition (SATE V) News l We choose test cases by end of May l Tool output uploaded
AUTHORIZED DOCUMENTATION Manual Task Service Driver Implementation Guide Novell Identity Manager 4.0.1 April 15, 2011 www.novell.com Legal Notices Novell, Inc. makes no representations or warranties with
Who Am I Chris Smith (@chrismsnz) Previously: Polyglot Developer - Python, PHP, Go + more Linux Sysadmin Currently: Pentester, Consultant at Insomnia Security Little bit of research Insomnia Security Group
Scraping class Documentation Release 0.1 IRE/NICAR May 09, 2016 Contents 1 What you will make 3 2 Prelude: Prerequisites 5 2.1 Command-line interface......................................... 5 2.2 Text
USC VITERBI SCHOOL OF ENGINEERING INFORMATICS PROGRAM INF 510: Principles of Programming for Informatics Dr. Jeremy Abramson Abramson@isi.usc.edu Time: 5:00-7:20 PM Day: Tuesdays Room: KAP 164 Instructor
Python KS3 Programming Workbook Do you speak Parseltongue? Name ICT Teacher Form Welcome to Python The python software has two windows that we will use. The main window is called the Python Shell and allows
CSC 221: Computer Programming I Fall 2011 Python control statements operator precedence importing modules random, math conditional execution: if, if-else, if-elif-else counter-driven repetition: for conditional
Integrating Secure FTP into Data Services SAP Data Services includes decently-robust native support for FTP transport, as long as you don t mind it being non-secured. However, understandably, many applications
PHP Integration Kit Version 2.5.1 User Guide 2012 Ping Identity Corporation. All rights reserved. PingFederate PHP Integration Kit User Guide Version 2.5.1 December, 2012 Ping Identity Corporation 1001
Software Testing Theory and Practicalities Purpose To find bugs To enable and respond to change To understand and monitor performance To verify conformance with specifications To understand the functionality
Text Clustering Using LucidWorks and Apache Mahout (Nov. 17, 2012) 1. Module name Text Clustering Using Lucidworks and Apache Mahout 2. Scope This module introduces algorithms and evaluation metrics for
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
CIS 192: Lecture 10 Web Development with Flask Lili Dworkin University of Pennsylvania Last Week s Quiz req = requests.get("http://httpbin.org/get") 1. type(req.text) 2. type(req.json) 3. type(req.json())
App Building Guidelines App Building Guidelines Table of Contents Definition of Apps... 2 Most Recent Vintage Dataset... 2 Meta Info tab... 2 Extension yxwz not yxmd... 3 Map Input... 3 Report Output...