CRASH COURSE PYTHON. It starts with an idea




This talk
- Not a programming course
- For data analysts who want to learn Python
- For optimizers who are fed up with Matlab

Python
- Scripting language; expensive computations (matrix multiplication, optimization, classification) typically run in compiled modules
- Faster Python code: Numba's @jit decorator (or Cython)
- Support for functions and OOP (classes, abstract classes, polymorphism, inheritance), but no enforced encapsulation
- Direct competitors: R, Julia, Matlab
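The remark about OOP without encapsulation can be made concrete with a small sketch. The class names Shape and Square are invented for this illustration; the point is only that "private" attributes are a naming convention, not something the language enforces.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstract base class: cannot be instantiated directly."""

    @abstractmethod
    def area(self):
        ...


class Square(Shape):
    def __init__(self, side):
        self._side = side      # single underscore: "private" by convention only
        self.__tag = "sq"      # double underscore: name-mangled, still reachable

    def area(self):
        return self._side ** 2


s = Square(3)
print(s.area())            # 9
print(s._side)             # 3, nothing stops outside access
print(s._Square__tag)      # sq, the mangled name is still accessible
```

Calling Shape() directly raises a TypeError, which is how abstract classes are enforced.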

Zen of Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.

Python 2 or 3?
- 1994: Python 1
- 2000: Python 2 (backward compatible)
- 2008: Python 3
- Most pronounced difference:
    Python 2: print "hello world!"
    Python 3: print("hello world!")
- Strength of Python: broad availability of modules
- Many modules have been updated for Python 3
- Some people still use Python 2

Installing Python
Windows users: use WinPython
- Has MKL for fast linear algebra, and many preinstalled modules
- Portable, so extract & go
- Ships with the Spyder editor for coding and debugging, and a compiler for new modules
- WinPython 3.4 is currently recommended (3.5 does not yet ship with a compiler)
Mac users
- OS X ships with Python 2.7 (and depends on it, so do not replace it with Python 3)
- Python 3 can be installed alongside
Linux
- Ubuntu ships with both Python 2.7 and Python 3.4
- Commands: python and python3

Installing modules
Mac/Linux/POSIX-compatible systems: run pip from the terminal, e.g.:
    pip install cylp
WinPython: run "WinPython Command Prompt.exe" and use pip
For dependencies that require a shell script (./configure):
1. Add the folder winpython/share/mingwpy/bin to the path
2. Install msys from mingw.org
3. Start msys (C:\MinGW\msys\1.0\msys.bat)
4. Configure and compile the dependency

Running Python
- The editor probably has a hotkey (F5 in Spyder)
- Shell command: python filename.py
- Alternative: python (starts an interactive session that runs commands as they are entered)

Crash course

Data type   Initialize empty   Initialize with data
List        x = []             x = [1,2,5]
Tuple       -                  x = (1,2,5)
Set         x = set()          x = {1,2,5}
Dict        x = {}             x = {"one": 1, "two": 2, "five": 5}
String      x = ""             x = "hello world"
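As a rough illustration of how these five types behave once they hold data, the sketch below applies one typical operation to each. The variable names are chosen for this example only.

```python
x_list = [1, 2, 5]
x_list.append(7)               # lists are ordered and mutable
x_tuple = (1, 2, 5)            # tuples are ordered and immutable
x_set = {1, 2, 5}
x_set.add(2)                   # sets ignore duplicates
x_dict = {"one": 1, "two": 2, "five": 5}
x_dict["seven"] = 7            # dicts map keys to values
x_str = "hello world"

print(x_list[-1])              # 7
print(len(x_set))              # 3
print(x_dict["seven"])         # 7
print(x_str.upper())           # HELLO WORLD
print(5 in x_tuple)            # True
```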

Precision
- Integers have unlimited precision
- Floats have finite precision; use the decimal or mpmath modules for arbitrary precision

    >> print(2**1000)
    107150860718626732094842504906000181056140481170
    553360744375038837035105112493612249319837881569
    585812759467291755314682518714528569231404359845
    775746985748039345677748242309854210746050623711
    418779541821530464749835819412673987675591655439
    460770629145711964776865421676604298316526243868
    37205668069376
  (a single number, wrapped here for display)
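The difference between finite-precision floats and arbitrary-precision arithmetic can be sketched with the standard library alone. The fractions module is an addition for this illustration and is not mentioned in the slides; the choice of 50 digits is arbitrary.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

print(0.1 + 0.2 == 0.3)                    # False: binary floats are inexact

getcontext().prec = 50                     # work with 50 significant digits
third = Decimal(1) / Decimal(3)
print(third)                               # 0.333... with 50 digits

exact = Fraction(1, 10) + Fraction(2, 10)
print(exact == Fraction(3, 10))            # True: rationals are exact
```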

Creating a list
Code
    x = [0,1,2,3,4,5,6,7,8,9,10]
    print(x)
    x = range(11)
    print(x)
    print(list(x))
Output
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    range(0, 11)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- range(n) is list-like; internally it is an object that can be converted to a list
- range(int(1e10)) requires a few bytes instead of 74.5 GB
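The memory claim about range can be checked directly. This is a small sketch that assumes CPython on a 64-bit system; the exact byte count reported by sys.getsizeof may differ between versions.

```python
import sys

r = range(int(1e10))
print(sys.getsizeof(r))        # a few dozen bytes on CPython
print(len(r))                  # 10000000000
print(r[123456789])            # 123456789, computed on demand
print(9999999999 in r)         # True, checked arithmetically without iterating
```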

Loops
Code
    for i in [1,2,3]:
        print(i)
    while i < 5:
        i += 1
        print(i)
Output
    1
    2
    3
    4
    5
- No curly braces or "end for"
- Structure is derived from the level of indentation
- One statement per line
- No semicolons required

Functions
Code
    def fun(name, greeting='Hi', me='Evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0

    fun('group', me='Python')
Output
    Hi group, this is Python
- All arguments can be passed by name: fun(name='group')
- Naming is useful for optional arguments
- Return is optional
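The points about named arguments and the optional return can be sketched as follows. The function describe and its parameters are hypothetical names used only for this example.

```python
def describe(name, greeting="Hi", punctuation="!"):
    # No return statement: the function implicitly returns None
    print(greeting + " " + name + punctuation)


describe("group")                          # Hi group!
describe("group", punctuation="?")         # Hi group?
describe(punctuation=".", name="group")    # Hi group.
result = describe(name="group", greeting="Hello")   # Hello group!
print(result)                              # None
```

When arguments are passed by name their order is free, which is why the third call works.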

Functions
Code
    def trick_me(a,b,c):
        a.append('o')
        b.append('o')
        c += 1

    x = ['m','n']
    y = x
    z = 1
    trick_me(x,y,z)
    print(x,y,z)
Output
    ['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1
Behavior depends on whether the type is mutable
- Variables are references (pointers) to objects; only mutable objects can be changed in place
- x and y refer to the same list, so both appends show up in both; c += 1 only rebinds the local name c, so z stays 1
- str, int, float and tuple are immutable
- list, set and dict are mutable
- y = list(x) creates a shallow copy (use y = copy.deepcopy(x) when x contains mutable data)
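The difference between an alias, a shallow copy and a deep copy is easiest to see on a nested list. The following is a minimal sketch; the variable names are chosen for illustration.

```python
import copy

x = [['m', 'n'], 1]
alias = x                   # same object, no copy at all
shallow = list(x)           # new outer list, shared inner list
deep = copy.deepcopy(x)     # fully independent copy

x[0].append('o')            # mutate the inner list
x[1] = 2                    # rebind an element of the outer list

print(alias)    # [['m', 'n', 'o'], 2]
print(shallow)  # [['m', 'n', 'o'], 1]
print(deep)     # [['m', 'n'], 1]
```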

List comprehensions
Creating a list with squares: 0, 1, 4, ..., 100
Naive code
    x = []
    for i in range(11):
        x.append(i*i)
Idiomatic Python
    x = [i*i for i in range(11)]
Creating a list of even numbers: 6, 8, 10, 12, 14
Naive code
    x = []
    for i in range(6,15):
        if i % 2 == 0:
            x.append(i)
Idiomatic Python
    x = [i for i in range(6,15) if i%2==0]
    # or
    x = [i for i in range(6,15,2)]
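The same comprehension syntax also works for dicts, sets and generators. These variants are not covered in the slides; the sketch below is only meant to show that the idiom generalizes.

```python
squares = {i: i * i for i in range(4)}          # dict comprehension
remainders = {i % 3 for i in range(10)}         # set comprehension
total = sum(i * i for i in range(11))           # generator expression, no list built

print(squares)      # {0: 0, 1: 1, 2: 4, 3: 9}
print(remainders)   # {0, 1, 2}
print(total)        # 385
```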

One-liner example
Find the last ten digits of the series 1^1 + 2^2 + 3^3 + ... + 1000^1000 (projecteuler.net)
    >> print(str(sum([k**k for k in range(1,1001)]))[-10:])
    9110846700
- [k**k for k in range(1,1001)] creates the terms
- sum(.) takes the sum
- str(.) converts the argument to a string
- [-10:] takes the substring of the last ten characters
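The one-liner relies on Python's big integers. An alternative that never builds large numbers uses the three-argument form of pow for modular exponentiation; this variant is an addition for comparison and not part of the original slide.

```python
MOD = 10 ** 10

# Reference result: build the full sum (thousands of digits), then slice the string
full = str(sum(k ** k for k in range(1, 1001)))[-10:]

# Modular variant: pow(k, k, MOD) never creates a number larger than MOD
modular = sum(pow(k, k, MOD) for k in range(1, 1001)) % MOD

print(full)                          # 9110846700
print(str(modular).zfill(10))        # 9110846700
```

The zfill call pads with leading zeros, which would matter if the last ten digits happened to start with 0.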

Modules
Matlab replacements
- scipy (free, linear algebra)
- matplotlib (free, graphing)
Optimization
- cylp (free, linear and mixed integer optimization)
- pyipopt (free, convex optimization)
- gurobi / cplex (academic license)
Data mining
- pandas (free, importing and slicing data)
- scikit-learn (free, machine learning)
- xgboost (free, gradient boosting)
- It takes less than 20 lines to create a cross-validated ensemble of classifiers

Recap
Example: function, for-loop, range, comment
    def take_sum(S):
        total = 0
        for i in S:
            total += i
        return total

    print(take_sum(range(7)))  # outputs 21
Example: named arguments
    def fun(name, greeting='Hi', me='Evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0

    fun('group', me='Python')

Data mining
- Reading data with pandas
- Visualization with matplotlib
- Machine learning with scikit-learn

Reading data
- Pandas offers read_csv, read_excel, read_sql, read_json, read_html, read_sas, etc.
- read_* returns a pandas data structure: the DataFrame
- Having data in a DataFrame is useful for:
    filtering, combining, grouping, sorting
    to_csv, to_excel, etc. (for, e.g., converting csv to json)

Example: reading csv file
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    y = X.target
    X.drop(['target', 'id'], axis=1, inplace=True)

Filtering data
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    print(data[0:2])
Output
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    0   1       1       0       0       0       0       1
    1   2       0       0       0       0       0       0

Filtering data
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    print(data[data.feat_1 == 1])
Output
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    0   1       1       0       0       0       0       1
    3   4       1       0       0       1       6       0
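Filters can also be combined. The sketch below reuses the five-row CSV from the slide, read from a string via io.StringIO so that it runs without a file on disk; the variable names both, either and ids are illustrative.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""
data = pandas.read_csv(io.StringIO(csv_text))

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = data[(data.feat_1 == 1) & (data.target == 1)]
either = data[(data.feat_5 > 0) | (data.target == 1)]

# .loc selects rows by condition and columns by name in one step
ids = data.loc[data.feat_1 == 1, 'id'].tolist()

print(both['id'].tolist())     # [1]
print(either['id'].tolist())   # [1, 4, 5]
print(ids)                     # [1, 4]
```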

Visualization
Code
    data[data.feat_2<=5].feat_2.plot(kind='hist')
    # since the data takes few distinct values:
    data[data.feat_2<=5].feat_2.value_counts().sort_index().plot(kind='bar')

Grouping
Code
    import numpy as np
    import pandas
    pandas.set_option('display.precision', 2)
    for feat_2_value, group in data.groupby('feat_2'):
        # group is the DataFrame data[data.feat_2 == feat_2_value]
        pass
    data.groupby('feat_2').aggregate(pandas.Series.nunique)
    # other aggregation functions: np.min, np.max, np.sum, np.std
Output
               id  feat_1  feat_3  feat_4  feat_5  target
    feat_2
    0       55018      37      39      48      15       9
    1        4012      26      39      36      10       9
    2        1215      14      31      39       7       9
    3         549       9      24      27       7       7
    4         310      13      21      27       4       5
    5         170       5      10      13       3       6

Example: time series
Code
    import pandas
    import numpy as np
    ts = pandas.Series(np.random.randn(1000),
                       index=pandas.date_range('1/1/2000', periods=1000))
    ts = ts.cumsum()
    ts.plot()
    print(ts.mean())
    # example output (differs per run because the data is random): 28.642802230898678

Example: large data set
- Suppose the csv file is 100 GB and has thousands of columns
- A subset of three columns is manageable
Code
    import pandas
    infile = 'train.csv'
    outfile = 'output.xlsx'
    chunks = []
    # chunksize is the number of rows to read per iteration
    for data in pandas.read_csv(infile, chunksize=100):
        chunks.append(data[['feat_1', 'feat_2', 'target']])
    df = pandas.concat(chunks)
    # the with-block saves and closes the workbook on exit
    with pandas.ExcelWriter(outfile) as writer:
        df.to_excel(writer, sheet_name='Sheet1')
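When only a few columns are needed, read_csv can also skip the others at parse time through its usecols parameter. The sketch below is an illustration on a tiny in-memory stand-in for the large file; the generated data, the chunk size of 4 and the per-chunk filter are assumptions made for the example.

```python
import io
import pandas

# Stand-in for a large file; a real path would be passed instead of StringIO
csv_text = "id,feat_1,feat_2,target\n" + "".join(
    "{0},{1},{2},{3}\n".format(i, i % 2, i % 3, i % 2) for i in range(1, 11)
)

chunks = []
# usecols skips unwanted columns at parse time; chunksize bounds memory per step
for chunk in pandas.read_csv(io.StringIO(csv_text),
                             usecols=['feat_1', 'feat_2', 'target'],
                             chunksize=4):
    chunks.append(chunk[chunk.target == 1])    # optional per-chunk filtering

df = pandas.concat(chunks, ignore_index=True)
print(df.shape)     # (5, 3)
```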

Logistic regression
Code
    import pandas
    from sklearn import linear_model
    # train_test_split lives in sklearn.model_selection (formerly sklearn.cross_validation)
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import log_loss

    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    y = (X.target > 1).astype(int)   # class 1 becomes 0, all larger classes become 1
    X.drop(['target', 'id'], axis=1, inplace=True)
    X, X_test, y, y_test = train_test_split(X, y, test_size=0.5)
    clf = linear_model.LogisticRegression()
    clf.fit(X, y)
    prediction = clf.predict_proba(X_test)
    print(log_loss(y_test, prediction))
    # output: 0.00159227347414; log_loss is in [0, 34.5]
    # 0 for perfect fit, 0.7 for constant p=0.5, 34.5 for all wrong
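The three reference values quoted for log_loss can be reproduced by hand. The function below is a plain-Python sketch of binary log loss, not scikit-learn's implementation; the clipping constant eps=1e-15 is an assumption, chosen because it reproduces the quoted upper bound of 34.5.

```python
import math


def log_loss_manual(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels, with clipped probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)       # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y = [0, 1, 1, 0]
print(round(log_loss_manual(y, [0.0, 1.0, 1.0, 0.0]), 4))   # 0.0 (perfect fit)
print(round(log_loss_manual(y, [0.5, 0.5, 0.5, 0.5]), 4))   # 0.6931 (constant p=0.5)
print(round(log_loss_manual(y, [1.0, 0.0, 0.0, 1.0]), 1))   # 34.5 (all wrong)
```

The value 0.7 on the slide is ln 2 rounded, and 34.5 is -ln(1e-15).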