CRASH COURSE PYTHON
It starts with an idea

This talk
- Not a programming course
- For data analysts who want to learn Python
- For optimizers who are fed up with Matlab

Python
- Scripting language
- Expensive computations are typically done in compiled modules, such as matrix multiplication, optimization, classification
- Faster Python code: Numba's @jit construct (or Cython)
- Support for functions and OOP (classes, abstract classes, polymorphism, inheritance; but no enforced encapsulation)
- Direct competitors: R, Julia, Matlab

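The OOP features listed above can be illustrated with a minimal sketch. The class names (Shape, Square, Circle) are hypothetical and chosen only for this illustration. The last lines show what "no encapsulation" means in practice: a leading underscore marks an attribute as private by convention, but nothing stops outside code from changing it.

```python
from abc import ABC, abstractmethod
import math


class Shape(ABC):
    """Abstract base class: it cannot be instantiated directly."""

    def __init__(self, name):
        # Leading underscore: "private" by convention only
        self._name = name

    @abstractmethod
    def area(self):
        """Every concrete subclass must implement this method."""

    def describe(self):
        return f"{self._name} with area {self.area():.2f}"


class Square(Shape):  # inheritance
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2


class Circle(Shape):
    def __init__(self, radius):
        super().__init__("circle")
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


# Polymorphism: the same call works on every Shape subclass
shapes = [Square(2), Circle(1)]
descriptions = [s.describe() for s in shapes]
for d in descriptions:
    print(d)

# No enforced encapsulation: this assignment raises no error
shapes[0]._name = "renamed square"
print(shapes[0].describe())
```

Calling Shape("x") directly raises a TypeError, because the abstract method area has no implementation.
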
Zen of Python
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.
- Sparse is better than dense.
- Readability counts.
- Special cases aren't special enough to break the rules.
- Although practicality beats purity.
- Errors should never pass silently.
- Unless explicitly silenced.

Python 2 or 3?
- 1994: Python 1
- 2000: Python 2 (backward compatible)
- 2008: Python 3
- Most pronounced difference:
    Python 2: print "hello world!"
    Python 3: print("hello world!")
- Strength of Python: broad availability of modules
- Many modules have been updated for Python 3
- Some people still use Python 2

Installing Python
Windows users: use winpython
- Has MKL for fast linear algebra, and many preinstalled modules
- Portable, so extract & go
- Ships with the Spyder editor for coding and debugging, and a compiler for new modules
- winpython 3.4 is currently recommended (3.5 does not (yet) ship with a compiler)
Mac users
- OS X ships with Python 2.7 (and depends on it, so do not replace it with Python 3)
- Python 3 can be installed alongside
Linux
- Ubuntu ships with both Python 2.7 and Python 3.4
- Commands: python & python3

Installing modules
Mac/Linux/POSIX-compatible systems: run pip from the terminal
- e.g.: pip install cylp
WinPython: run "WinPython Command Prompt.exe" and use pip
For dependencies that require a shell script (./configure):
- add the folder winpython/share/mingwpy/bin to the path
- install msys from mingw.org
- start msys (C:\MinGW\msys\1.0\msys.bat)
- configure & compile the dependency

Running Python
- The editor probably has a hotkey (F5 in Spyder)
- Shell command: python filename.py
- Alternative: python (interactive mode, runs commands as they are entered)

Crash course
Data type   Initialize empty   Initialize with data
List        x = []             x = [1,2,5]
Tuple       -                  x = (1,2,5)
Set         x = set()          x = {1,2,5}
Dict        x = {}             x = {"one": 1, "two": 2, "five": 5}
String      x = ""             x = "hello world"

Precision
- Integers have arbitrary precision (limited only by available memory)
- Floats have finite precision
- Use the decimal or mpmath modules for arbitrary-precision non-integer arithmetic

    >>> print(2**1000)
    107150860718626732094842504906000181056140481170
    553360744375038837035105112493612249319837881569
    585812759467291755314682518714528569231404359845
    775746985748039345677748242309854210746050623711
    418779541821530464749835819412673987675591655439
    460770629145711964776865421676604298316526243868
    37205668069376

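The difference between finite-precision floats and the decimal module can be seen in a short sketch. It uses only the standard library; the precision of 50 digits is an arbitrary choice for this illustration.

```python
from decimal import Decimal, getcontext

# Binary floats cannot represent 0.1 exactly, so rounding errors appear
float_sum_is_exact = (0.1 + 0.2 == 0.3)
print(float_sum_is_exact)  # False

# Decimal works in base 10 and represents 0.1 exactly
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(decimal_sum_is_exact)  # True

# The working precision (number of significant digits) is configurable
getcontext().prec = 50
third = Decimal(1) / Decimal(3)
print(third)

# Integers need no special module: 2**1000 is computed exactly
n_digits = len(str(2**1000))
print(n_digits)  # 302
```

Note that Decimal values should be created from strings; Decimal(0.1) would inherit the rounding error of the float 0.1.
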
Creating a list
Code
    x = [0,1,2,3,4,5,6,7,8,9,10]
    print(x)
    x = range(11)
    print(x)
    print(list(x))
Output
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    range(0, 11)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- range(n) is list-like; internally it is an object that can be converted to a list
- range(int(1e10)) requires a few bytes instead of 74.5 GB

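The memory claim can be checked with sys.getsizeof. The following is a minimal sketch; the exact byte counts depend on the Python build, so none are stated here. It assumes a 64-bit Python, where len() of such a large range is supported.

```python
import sys

# A range object stores only start, stop and step, not the elements
lazy = range(10**10)
print(sys.getsizeof(lazy))

# It still behaves like a sequence: length, indexing and membership work
print(len(lazy))
print(lazy[123456789])
print(10**10 - 1 in lazy)

# A real list stores a reference per element, so it grows with its length
small_list = list(range(1000))
print(sys.getsizeof(small_list))
```
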
Loops
Code
    for i in [1,2,3]:
        print(i)
    while i < 5:
        i += 1
        print(i)
Output
    1
    2
    3
    4
    5
- No curly braces or "end for"
- Structure is derived from level of indentation
- One statement per line
- No semicolons required

Functions
Code
    def fun(name, greeting='hi', me='evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0

    fun('group', me='python')
Output
    hi group, this is python
- All arguments can be passed by name: fun(name='group')
- Naming is useful for optional arguments
- Return is optional

Functions
Code
    def trick_me(a,b,c):
        a.append('o')
        b.append('o')
        c += 1

    x = ['m','n']
    y = x
    z = 1
    trick_me(x,y,z)
    print(x,y,z)
Output
    ['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1
- Behavior depends on whether the type is mutable
- Variables are references to objects; an in-place change is visible through every reference to the same object, and only mutable types can be changed in place
- String, int, float, tuple are immutable
- List, set, dict are mutable
- y = list(x) creates a shallow copy (use y = copy.deepcopy(x) when x contains mutable data)

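The difference between an alias, a shallow copy and a deep copy becomes visible once the list contains mutable items. A minimal sketch, with variable names chosen only for this illustration:

```python
import copy

x = [['m', 'n'], ['p']]
alias = x                  # same object as x
shallow = list(x)          # new outer list, but the inner lists are shared
deep = copy.deepcopy(x)    # new outer list and new inner lists

x.append(['q'])            # changes the outer list: seen by alias only
x[0].append('o')           # changes an inner list: seen by alias and shallow

print(alias)    # [['m', 'n', 'o'], ['p'], ['q']]
print(shallow)  # [['m', 'n', 'o'], ['p']]
print(deep)     # [['m', 'n'], ['p']]
```
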
List comprehensions
Creating a list with squares: 0, 1, 4, ..., 100
Naive code
    x = []
    for i in range(11):
        x.append(i*i)
Idiomatic Python
    x = [i*i for i in range(11)]
Creating a list of even numbers: 6, 8, 10, 12, 14
Naive code
    x = []
    for i in range(6,15):
        if i % 2 == 0:
            x.append(i)
Idiomatic Python
    x = [i for i in range(6,15) if i%2==0]
    # or
    x = [i for i in range(6,15,2)]

One-liner example
Find the last ten digits of the series 1^1 + 2^2 + 3^3 + ... + 1000^1000 (projecteuler.net)
    >>> print(str(sum([k**k for k in range(1,1001)]))[-10:])
    9110846700
- [k**k for k in range(1,1001)] creates the terms
- sum(.) takes the sum
- str(.) converts the argument to a string
- [-10:] takes a substring (the last ten characters)

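The one-liner relies on Python's arbitrary-precision integers and computes numbers with thousands of digits. As a cross-check, the same answer can be obtained with modular arithmetic, which keeps every intermediate value below 10^10. This is an alternative sketch, not part of the original slide; it uses the built-in three-argument pow.

```python
MOD = 10**10  # keeping the last ten digits equals working modulo 10^10

# Approach from the slide: build the full sum, then slice the string
brute_force = str(sum([k**k for k in range(1, 1001)]))[-10:]

# Alternative: pow(k, k, MOD) computes k**k modulo MOD without big numbers
modular = sum(pow(k, k, MOD) for k in range(1, 1001)) % MOD
modular_digits = str(modular).zfill(10)  # pad in case of leading zeros

print(brute_force)
print(modular_digits)
```
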
Modules
Matlab replacements
- scipy (free, linear algebra)
- matplotlib (free, graphing)
Optimization
- cylp (free, linear and mixed integer optimization)
- pyipopt (free, convex optimization)
- gurobi / cplex (academic license)
Data mining
- pandas (free, importing and slicing data)
- scikit-learn (free, machine learning)
- xgboost (free, gradient boosting)
- takes less than 20 lines to create a cross-validated ensemble of classifiers

Recap
Example: function, for-loop, range, comment
    def take_sum(s):
        total = 0
        for i in s:
            total += i
        return total

    print(take_sum(range(7))) # outputs 21
Example: named arguments
    def fun(name, greeting='hi', me='evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0

    fun('group', me='python')

Data mining
- Reading data with pandas
- Visualization with matplotlib
- Machine learning with scikit-learn

Reading data
- pandas offers read_csv, read_excel, read_sql, read_json, read_html, read_sas, etc.
- read_* returns a pandas data structure: DataFrame
- Having data in a DataFrame is useful:
  - filtering, combining, grouping, sorting
  - to_csv, to_excel, etc. (for, e.g., converting csv to json)

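The csv-to-json conversion mentioned above could look as follows. This is a minimal sketch that assumes pandas is installed; the CSV content is made up for the illustration and read from a string (via io.StringIO) so that the example needs no file on disk.

```python
import io
import json
import pandas

# Hypothetical CSV content, read from memory instead of a file
csv_text = "id,feat_1,target\n1,1,1\n2,0,0\n"
df = pandas.read_csv(io.StringIO(csv_text))

# orient="records" writes one JSON object per row
json_text = df.to_json(orient="records")
print(json_text)

# Parse the JSON again to inspect it as plain Python data
records = json.loads(json_text)
print(records)
```

With a real file, pandas.read_csv('train.csv') and df.to_json('train.json', orient='records') would do the same from disk to disk.
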
Example: reading csv file
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    y = X.target
    X.drop(['target', 'id'], axis=1, inplace=True)

Filtering data
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    print(data[0:2])  # first two rows
Output (for the CSV file shown above)
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    0   1       1       0       0       0       0       1
    1   2       0       0       0       0       0       0

Filtering data
CSV file
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Code
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    print(data[data.feat_1 == 1])  # rows where feat_1 equals 1
Output (for the CSV file shown above)
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    0   1       1       0       0       0       0       1
    3   4       1       0       0       1       6       0

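Filters can also be combined. The sketch below assumes pandas is installed and reuses the five-row CSV excerpt from the slide, read from a string so that no file is needed. Conditions are combined with & (and) and | (or), and each condition needs its own parentheses.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""
data = pandas.read_csv(io.StringIO(csv_text))

# Rows where both conditions hold
both = data[(data.feat_1 == 1) & (data.target == 1)]
print(both.id.tolist())    # [1]

# Rows where at least one condition holds
either = data[(data.feat_1 == 1) | (data.target == 1)]
print(either.id.tolist())  # [1, 4, 5]

# .loc filters rows and selects columns in one step
subset = data.loc[data.feat_5 > 0, ['id', 'feat_5']]
print(subset)
```
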
Visualization
Code
    data[data.feat_2<=5].feat_2.plot(kind='hist')
    # since the data takes few distinct values:
    data[data.feat_2<=5].feat_2.value_counts().sort_index().plot(kind='bar')

Grouping
Code
    import numpy as np
    pandas.set_option('display.precision',2)
    for feat_2_value,group in data.groupby('feat_2'):
        # group is the DataFrame data[data.feat_2 == feat_2_value]
        pass
    data.groupby('feat_2').aggregate(pandas.Series.nunique)
    # other aggregation functions: np.min, np.max, np.sum, np.std
Output (number of distinct values per column, for each value of feat_2)
    feat_2     id  feat_1  feat_3  feat_4  feat_5  target
    0       55018      37      39      48      15       9
    1        4012      26      39      36      10       9
    2        1215      14      31      39       7       9
    3         549       9      24      27       7       7
    4         310      13      21      27       4       5
    5         170       5      10      13       3       6

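The grouping pattern can be tried on a small data set. This sketch assumes pandas is installed and uses the five-row CSV excerpt from the earlier slides, grouped on feat_1 because feat_2 is constant in that excerpt.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""
data = pandas.read_csv(io.StringIO(csv_text))

# Iterating over a groupby yields (key, sub-DataFrame) pairs
group_sizes = {}
for feat_1_value, group in data.groupby('feat_1'):
    group_sizes[int(feat_1_value)] = len(group)
print(group_sizes)  # {0: 3, 1: 2}

# Number of distinct values per column within each group
distinct = data.groupby('feat_1').aggregate('nunique')
print(distinct)
```
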
Example: time series
Code
    import pandas
    import numpy as np
    ts = pandas.Series(np.random.randn(1000),
                       index=pandas.date_range('1/1/2000', periods=1000))
    ts = ts.cumsum()
    ts.plot()
    print(ts.mean())
    # example output (differs per run, since the data is random): 28.642802230898678

Example: large data set
- Suppose the csv file is 100 GB and has thousands of columns
- A subset of three columns is manageable
Code
    import pandas
    infile = 'train.csv'
    outfile = 'output.xlsx'
    df = pandas.DataFrame()
    # chunksize is the number of rows to read per iteration
    for data in pandas.read_csv(infile, chunksize=100):
        data = data[['feat_1', 'feat_2', 'target']]
        df = pandas.concat([df,data])
    # the with-block saves and closes the Excel file
    with pandas.ExcelWriter(outfile) as writer:
        df.to_excel(writer, sheet_name='Sheet1')

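A variant of the chunked approach is sketched below, assuming pandas is installed. It differs from the slide in two ways, both of which are suggestions rather than the original method: usecols lets read_csv skip the unwanted columns while parsing, and the chunks are collected in a list and concatenated once, which avoids copying the growing DataFrame in every iteration. The data is the five-row excerpt, read from a string, with a chunk size of 2 to force several chunks.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""
wanted = ['feat_1', 'feat_2', 'target']

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = []
for chunk in pandas.read_csv(io.StringIO(csv_text), usecols=wanted, chunksize=2):
    chunks.append(chunk)

# Concatenate once at the end; ignore_index renumbers the rows 0..n-1
df = pandas.concat(chunks, ignore_index=True)
print(len(chunks))  # 3 chunks: 2 + 2 + 1 rows
print(df.shape)     # (5, 3)
```
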
Logistic regression
Code
    import pandas
    # in scikit-learn 0.20 and later, cross_validation is named model_selection
    from sklearn import model_selection, linear_model
    from sklearn.metrics import log_loss
    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    y = X.target.copy()
    X.drop(['target', 'id'], axis=1, inplace=True)
    # make the problem binary: class 1 becomes 0, all other classes become 1
    y[y==1] = 0
    y[y>1] = 1
    X,X_test,y,y_test = model_selection.train_test_split(X, y, test_size=0.5)
    clf = linear_model.LogisticRegression()
    clf.fit(X,y)
    prediction = clf.predict_proba(X_test)
    print(log_loss(y_test,prediction))
    # output: 0.00159227347414; log_loss is in [0, 34.5]
    # 0 for perfect fit, 0.7 for constant p=0.5, 34.5 for all wrong

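The three reference values for the log loss (0, 0.7 and 34.5) can be reproduced by hand. The sketch below is a plain-Python version of binary log loss, not scikit-learn's implementation. It assumes that predicted probabilities are clipped to [1e-15, 1 - 1e-15] before taking logarithms; this clipping value is an assumption that would explain the upper bound, since -ln(1e-15) is about 34.5.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of labels y_true under probabilities p_pred.

    eps is an assumed clipping constant that keeps log() away from 0.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [0, 1, 1, 0]

perfect = binary_log_loss(y_true, [0.0, 1.0, 1.0, 0.0])    # all correct
constant = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])   # always p = 0.5
all_wrong = binary_log_loss(y_true, [1.0, 0.0, 0.0, 1.0])  # all wrong

print(perfect)    # practically 0
print(constant)   # ln(2), about 0.693
print(all_wrong)  # about 34.5
```
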