No black magic: Text processing using the UNIX command line

Barbara Plank, http://cst.dk/bplank, Nov 6, 2014

Motivation (1994)

What is UNIX? Operating system (OS), 1969, AT&T / Bell Labs. Used loosely to refer to any OS sharing the same basic design (Linux, Solaris, Mac OS). Unix philosophy: build functionality out of small programs that do one thing and do it well. Slide inspired by: https://software.rc.fas.harvard.edu/training/intro_unix/

What is the command line? $ is the command prompt. This window is called the terminal, which is the program that allows us to interact with the shell. The shell is an environment that executes the commands we type in at the command prompt.

What is the command line? INPUT, then OUTPUT: a REPL (read-eval-print loop). Very different from the well-known graphical user interface.

Input/output process model: Shell programs do I/O (input/output) with the terminal, using three streams: stdin (INPUT, from the keyboard), stdout (OUTPUT, printed to the display) and stderr (error messages, also printed to the display). The programs run inside the shell environment (e.g. Bash shell). Interactively, you rarely notice there's separate stdout and stderr (today we won't worry about stderr).

Unix philosophy: combine many small programs for more advanced functionality. [Diagram: keyboard -> stdin -> shell program -> stdout -> shell program -> stdout -> shell program -> stdout -> display (print), all inside the shell environment (e.g. Bash shell)]

Why (still) the command line? Advantages: allows you to be agile (REPL vs edit-compile-run-debug cycle); the command line is extensible and complementary; automation and reproducibility; to run jobs on big clusters of computers (HPC). Disadvantage: takes some time to get acquainted.

Start a terminal: on Mac OS Applications, Utilities, Terminal

On Linux

Note: Windows. The Windows Command Prompt cmd (or PowerShell) is fundamentally different from, and incompatible with, the commands we will see today! For today: download PuTTY http://the.earth.li/~sgtatham/putty/latest/x86/putty.exe

Getting started: start a terminal. Connect to the server for today's workshop (see handout): ssh username@hostname. Type yes when you see this: The authenticity of host.(10.1.) can't be established. ECDSA key fingerprint is c0:7b:40:5f:c9:d4:97:6f:33:27:76:8f:5e:b9:25:92. Are you sure you want to continue connecting (yes/no)? yes. Enter the password. You now have a prompt. Windows users use PuTTY: hostname

now we are all connected to a shell where we can issue commands

First shell commands Type text (your command) after the prompt ($), followed by ENTER: pwd: print working directory (shows the current location in the filesystem)

Shell command: Structure. A shell command (or shell program) usually takes parameters: arguments (required) and options (optional). Shell program with argument(s): cat text.txt; cat text1.txt text2.txt text3.txt. With argument and option: cat -n text.txt (prefix every line with its line number)

Note: shell commands are CaSe SeNsItIvE (pwd, PWD and Pwd are three different names). Spaces have special meanings (do not use them in file names or folder names).

Where to find help: To know what options and arguments a command takes, consult the man (manual) pages: man whoami; man cat. Use q to exit.

Tips: m<tab> (use auto-completion); use the arrow-up key to recall a command from your command history (or, more advanced, search the history of commands with <CTRL>+r); <CTRL>+d or <CTRL>+c or just q to quit

Word frequency list

Prerequisite: copy file. Copy the text file from my home directory to yours: cp /home/bplank/text.txt . (cp is the command name, copy; arg1 says what to copy: /home/bplank/text.txt; arg2 says where to: the dot, i.e. the current directory). Check if the file is in your directory with ls.

Inspect files: head text.txt prints out the first ten lines of the file. Try out the following commands - what do they do? tail text.txt; cat text.txt; less text.txt (continue with SPACE or arrow UP/DOWN; quit by typing q)

Line-based processing: head text.txt prints out the first ten lines of the file (ten is the default); head -4 text.txt prints out the first 4 lines of the file

I/O redirection to files: Shell commands can be redirected to write to files instead of to the screen, or to read from files instead of the keyboard. Append to any command: > myfile (send stdout to a file called myfile); < myfile (send the content of myfile as input to some program); 2> myfile (send stderr to a file called myfile)

Line-based processing and I/O redirection: head text.txt is equivalent to head < text.txt. head -1 text.txt > tmp takes the first line of the file and stores it in the file tmp instead of printing it to the screen. Exercise: store the last 4 lines of the file text.txt in a file called footer.txt
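A minimal sketch of the three redirections, run in a scratch directory. The six-line text.txt and the file name errors.log are made up for illustration; the footer.txt line is one possible answer to the exercise.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in for text.txt with six numbered lines (made-up data).
printf 'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\n' > text.txt

# > sends stdout to a file: the first line ends up in tmp, not on the screen.
head -1 text.txt > tmp

# < feeds a file to stdin: prints the same as "head -1 text.txt".
head -1 < text.txt

# Exercise sketch: store the last 4 lines in footer.txt.
tail -4 text.txt > footer.txt

# 2> sends stderr to a file: the error message lands in errors.log.
# "|| true" only keeps the sketch going if the shell stops on errors.
cat no-such-file.txt 2> errors.log || true
```

Inspect the results with cat tmp, cat footer.txt and cat errors.log.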

Recipe for counting words An algorithm: a. split text into one word per line (tokenize) b. sort words c. count how often each word appears

a) Split text: one word per line. tr translates A into B, where A = set of characters and B = single character (\n, newline). -s squeezes each run of repeated output characters into one; -c takes the complement of A. tr -sc 'A-Za-z' '\n' < text.txt. More examples: tr -sc 'A-Za-z0-9' '\n' < text.txt; tr -sc '[:alnum:]' '\n' < text.txt; tr -sc '[:alnum:]@#' '\n' < tweets.txt
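A small sketch of the tokenization step on a made-up sentence (the variable names sample and words are only for illustration):

```shell
# Hypothetical sample sentence standing in for text.txt.
sample='The cat saw the cat, twice!'

# -c: use the complement of the set A-Za-z (everything that is not a letter)
# -s: squeeze each run of replaced characters into a single newline
words=$(printf '%s\n' "$sample" | tr -sc 'A-Za-z' '\n')

# One word per line: The, cat, saw, the, cat, twice
printf '%s\n' "$words"
```

Note that "The" and "the" stay distinct at this point; case is not touched by this command.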

b) Sorting lines of text: sort FILE; sort -r (reverse sort); sort -n (numeric sort); sort -nr (reverse numeric sort). Exercise: try out the sort command with the different options above on the file /home/bplank/numbers
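To see why -n matters, here is a sketch on four made-up numbers (a stand-in for the numbers file):

```shell
# Hypothetical numbers, one per line.
nums=$(printf '10\n9\n100\n9\n')

printf '%s\n' "$nums" | sort      # compares lines as text: 10 comes before 9
printf '%s\n' "$nums" | sort -n   # compares lines as numbers: 9 comes first
printf '%s\n' "$nums" | sort -nr  # numeric, largest first
printf '%s\n' "$nums" | sort -r   # text order, reversed
```

Plain sort compares character by character, so "100" sorts before "9" because "1" comes before "9".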

c) Count words = count duplicate lines in a sorted text file: uniq -c SORTEDFILE. uniq assumes a SORTED file as input! Exercise: frequency list of numbers in a file. Sort the numbers file and save it (> redirects to a file) in a new file called numsorted; now use uniq -c to count how often each number appears. Solution: sort -n /home/bplank/numbers > numsorted; uniq -c numsorted
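A sketch of why uniq needs sorted input, using a made-up numbers file in a scratch directory:

```shell
workdir=$(mktemp -d)
# Hypothetical stand-in for /home/bplank/numbers.
printf '3\n10\n3\n2\n10\n3\n' > "$workdir/numbers"

# Unsorted input: uniq only merges ADJACENT duplicate lines,
# so here every line is reported with a count of 1.
uniq -c "$workdir/numbers"

# Sorted input: equal numbers sit next to each other, so the counts are right.
sort -n "$workdir/numbers" > "$workdir/numsorted"
uniq -c "$workdir/numsorted"
```

The second uniq -c reports 2 once, 3 three times and 10 twice.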

Now we have seen all necessary ingredients for our recipe on counting words An algorithm: a. split text into one word per line (tokenize) b. sort words c. count how often each word appears

The UNIX game: commands ~ bricks. Building more powerful tools by combining bricks using the pipe: |

The Pipe. Unix philosophy: combine many small programs, using | as glue. [Diagram: keyboard -> stdin -> tr -sc '[:alnum:]' '\n' -> stdout | sort -> stdout | uniq -c -> stdout -> display (print), all inside the shell environment (e.g. Bash shell)]

Word frequency list: combining the three single commands (tr, sort, uniq): tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c. The < text.txt part specifies the input for the first program. Combine commands using the pipe (the | symbol), i.e., the stdout of the previous command is the stdin of the next command.

The Pipe: tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c. [Diagram: text.txt -> stdin of tr -> stdout | stdin of sort -> stdout | stdin of uniq -c -> stdout -> display (print), all inside the shell environment (e.g. Bash shell)]

Using the pipe to avoid extra files: without the pipe, 2 commands = 2 REPLs and an intermediate file; with the pipe, no intermediate file is necessary (1 REPL).
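A sketch of both routes for the word frequency list, on a made-up one-line text.txt. The intermediate file names (words, words.sorted, freq.files, freq.piped) are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'to be or not to be\n' > text.txt

# Without pipes: three commands and two intermediate files.
tr -sc '[:alnum:]' '\n' < text.txt > words
sort words > words.sorted
uniq -c words.sorted > freq.files

# With pipes: one command line, no intermediate files.
tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c > freq.piped

# cmp prints nothing and succeeds when both results are identical.
cmp freq.files freq.piped && echo "same result"
```

Both routes produce the same counts; the piped version simply skips the bookkeeping.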

Alternative to split text: sed. The sed (replace) command: sed 's/what/with/g' FILE, e.g. sed 's/ /\n/g' text.txt. What happens if you leave out g? Try the following (with and without g): sed 's/i/**you**/g' /home/bplank/short.txt
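A sketch of the effect of the g flag, on a made-up line standing in for short.txt:

```shell
# Hypothetical one-line input.
line='this is it'

# Without g: only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/i/**you**/')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/i/**you**/g')

printf '%s\n%s\n' "$first_only" "$all_matches"
# prints:
# th**you**s is it
# th**you**s **you**s **you**t
```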

tr. Another use of tr: tr '[:upper:]' '[:lower:]' < text.txt. Extra exercise: merge upper and lower case by downcasing everything

Exercise: Extract the 10 most frequent hashtags from the file /home/bplank/tweets.txt (hint: create a word frequency list first and then use sort and head). Also, use the command grep '^#' in your pipeline (to extract words that start with a hashtag); we will see grep again later.
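One possible way to approach the exercise, sketched on three made-up tweets (the real tweets.txt will of course give different counts):

```shell
workdir=$(mktemp -d)
# Hypothetical stand-in for /home/bplank/tweets.txt.
cat > "$workdir/tweets.txt" <<'EOF'
Loving the #nlp workshop #unix
More #unix tricks today
#unix pipes are great #nlp
EOF

# 1. one token per line, keeping # and @ as part of a token
# 2. keep only tokens that start with #
# 3. count duplicates, sort by count (largest first), keep the top 10
top_hashtags=$(tr -sc '[:alnum:]@#' '\n' < "$workdir/tweets.txt" |
  grep '^#' | sort | uniq -c | sort -nr | head -10)
printf '%s\n' "$top_hashtags"
```

On this sample, #unix (3 times) comes out ahead of #nlp (2 times).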

File system and navigation

File system: the usual system with files, folders, and paths to files. The root of the file system hierarchy is always /. Paths can be absolute or relative, e.g. /home/bplank/data vs data/. Commonly used directories: . (current working directory); .. (parent directory); ~ (home directory of the user; for me: /home/bplank == ~bplank)

Navigating the file system: cd (change directory), e.g. cd data/001/; mkdir project (creates a directory called project); ls (list content of directory), e.g. ls /home/bplank; pwd (print working directory)
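A short navigation sketch in a scratch directory (created with mktemp so it does not touch your own files):

```shell
workdir=$(mktemp -d)
cd "$workdir"          # absolute path: jump straight to the scratch directory

mkdir project          # create a directory called project
cd project             # relative path: enter it
pwd                    # prints the absolute path of the current directory
cd ..                  # .. is the parent directory
ls                     # lists the content of the current directory: project
```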

What we have seen so far: what UNIX is, what the command line is, and why to use it; inspecting a file on the command line; creating a word frequency list (sed, sort, uniq, tr, and the pipe) and extracting the most frequent words; file system and navigation

Overview Bigrams, working with tabular data Searching files with grep A final tiny story

Bigram = word pair. Algorithm: tokenize by word; print word_i and word_i+1 next to each other; count

Print words next to each other: the paste command. paste FILE1 FILE2: if your two files contain lists of words, paste prints them next to each other, line by line, separated by a tab

Get the next word: create a file with one word per line; create a second file from the first, but which starts at the second line: tail -n +2 file > next [start with the second line and output everything until the end]

Bigrams Exercise: find the 5 most frequent bigrams of text.txt

Solution: Find the 5 most common bigrams Extra: Find the 5 most common trigrams
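One possible solution sketch for the bigram exercise, run on a made-up sentence. The file names words, next and bigrams are assumptions; only text.txt comes from the slides.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'the cat sat on the mat and the cat slept\n' > text.txt

# 1. tokenize: one word per line
tr -sc '[:alnum:]' '\n' < text.txt > words
# 2. shift by one: the same list, starting at the second line
tail -n +2 words > next
# 3. put word_i and word_i+1 side by side (tab-separated);
#    the last word has no successor, so its second column is empty
paste words next > bigrams
# 4. count and keep the 5 most frequent bigrams
sort bigrams | uniq -c | sort -nr | head -5
```

For trigrams, the same idea should extend with a third file made by tail -n +3 words and a three-file paste.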

Tabular data: paste FILES (in contrast to cat); cut -f1 FILE (cut out the first column from FILE). Exercise: create a frequency list from column 4 in the file parses.conll: cut -f 4 parses.conll | sed '/^$/d' | sort | uniq -c | sort -nr
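A sketch of the column pipeline on a tiny made-up file. The column layout (id, word, lemma, tag) is an assumption for illustration and may differ from the real parses.conll.

```shell
workdir=$(mktemp -d)
# Hypothetical tab-separated stand-in for parses.conll:
# columns are id, word, lemma, tag; a blank line separates sentences.
printf '1\tdogs\tdog\tNOUN\n2\tbark\tbark\tVERB\n\n1\tcats\tcat\tNOUN\n' > "$workdir/parses.conll"

# cut -f 4 keeps column 4; sed '/^$/d' deletes the empty lines left over
# from sentence boundaries; the rest is the usual frequency recipe.
tag_freq=$(cut -f 4 "$workdir/parses.conll" | sed '/^$/d' | sort | uniq -c | sort -nr)
printf '%s\n' "$tag_freq"
```

Without the sed step, the empty lines would be counted as a "word" of their own.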

grep: grep finds lines that match a given pattern, e.g. grep star text.txt

grep: grep finds patterns specified as regular expressions (the name stands for: globally search for a regular expression and print). grep is a filter: you only keep certain lines of the input, e.g., lines with words that end in -ing: grep -w "[a-z]*ing" text.txt. Exercises: try the above command without the -w option; with the -o and -w options (or -ow for short); what do the -v and -i options do? Use man grep to find out.
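A sketch of what -w and -o change, on a made-up line (the variable names are only for illustration):

```shell
# Hypothetical input line.
line='singing a ring tone is boring'

# -w: the match must be a whole word; grep prints the whole matching line.
printf '%s\n' "$line" | grep -w '[a-z]*ing'

# -o: print only the matched parts, one per line, instead of the whole line.
ing_words=$(printf '%s\n' "$line" | grep -ow '[a-z]*ing')
printf '%s\n' "$ing_words"
```

With -ow the output is one -ing word per line, which is a handy input for sort and uniq -c.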

grep: grep gh (keep lines containing gh); grep -i gh (keep lines containing gh independent of casing: gh, GH, ...); grep ^ch (keep lines beginning with ch); grep ing$ (keep lines ending with ing); grep -v gh (do NOT keep lines containing gh)
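The patterns above can be tried on a small made-up word list (words.txt is a hypothetical file name):

```shell
workdir=$(mktemp -d)
printf 'ghost\nLaugh\nchair\nsinging\nmuch\n' > "$workdir/words.txt"

grep gh "$workdir/words.txt"        # lines containing gh: ghost, Laugh
grep -i GH "$workdir/words.txt"     # same lines, ignoring case
grep '^ch' "$workdir/words.txt"     # lines beginning with ch: chair
grep 'ing$' "$workdir/words.txt"    # lines ending with ing: singing
grep -v gh "$workdir/words.txt"     # lines NOT containing gh: chair, singing, much
```

Quoting the pattern, as in '^ch' and 'ing$', keeps the shell from interpreting special characters before grep sees them.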

More on regular expressions: see Lindberg [1] or chapter 2 on regular expressions of [4] Jurafsky & Martin

Counting: wc Counting lines (-l), words and characters in a file: wc FILE Why is the number of words different?

Exercises with grep & wc: How many uppercase words are in text.txt? How many 4-letter words? How many one-syllable words are there (with exactly one vowel)?
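A hedged sketch for two of the questions, on a made-up sentence. It uses grep -c, which counts matching lines directly (an alternative to piping into wc -l), and it treats "exactly one vowel" as a rough proxy for one syllable.

```shell
# One word per line from a hypothetical sentence.
words=$(printf 'The tree fell down near a lake\n' | tr -sc 'A-Za-z' '\n')

# Total number of words: count the lines.
printf '%s\n' "$words" | wc -l

# 4-letter words: exactly four letters between line start (^) and line end ($).
four_letter=$(printf '%s\n' "$words" | grep -c '^[A-Za-z]\{4\}$')

# Words with exactly one vowel: non-vowels, one vowel, non-vowels (-i ignores case).
one_vowel=$(printf '%s\n' "$words" | grep -ic '^[^aeiou]*[aeiou][^aeiou]*$')

echo "4-letter words: $four_letter, one-vowel words: $one_vowel"
```

On this sample the 4-letter words are tree, fell, down, near and lake, and the one-vowel words are The, fell, down and a.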

stop words

Removing bigrams that contain stop words Exercise: Use grep to filter out stop words from the text.bigram file
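One way this filter could look, assuming text.bigram holds one tab-separated word pair per line and that the stop words are kept in a file (stopwords.txt is a hypothetical name, and both files here contain made-up data):

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical bigram file: one tab-separated word pair per line.
printf 'of\tthe\nnew\tyork\nthe\tcity\nfashion\tweek\n' > text.bigram
# Hypothetical stop word list: one word per line.
printf 'the\nof\nand\n' > stopwords.txt

# -f FILE reads the patterns from FILE, -F treats them as plain strings,
# -w matches whole words only, -i ignores case,
# -v keeps the lines that do NOT match any pattern.
grep -v -i -w -F -f stopwords.txt text.bigram > text.bigram.filtered
cat text.bigram.filtered
```

The -w option matters here: without it, the stop word "of" would also throw out bigrams containing words like "often".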

Most frequent bigrams w/o stop words towards more useful bigrams pre-processing matters!

Shell scripts: Basically, a shell script is a text file with shell commands in it, used to automate and avoid repetition. Example: backup.sh. Make it executable: chmod +x backup.sh. Run it: ./backup.sh (or sh backup.sh)

Shell scripts: example. Create a text file called bigram.sh. Execute it on a text file: ./bigram.sh | head -5; sh bigram.sh | sort | uniq -c | sort -nr | head
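The content of bigram.sh is not shown here, so the following is only a guess at what such a script could look like. It assumes the script reads its text from stdin and prints one tab-separated bigram per line; the sample text is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write the script; the quoted 'EOF' keeps $ signs and backslashes literal.
cat > bigram.sh <<'EOF'
#!/bin/sh
# Print one tab-separated bigram per line for the text on stdin.
tmp=$(mktemp -d)
tr -sc '[:alnum:]' '\n' > "$tmp/words"
tail -n +2 "$tmp/words" > "$tmp/next"
paste "$tmp/words" "$tmp/next"
rm -r "$tmp"
EOF

# Make it executable, so that ./bigram.sh works as well as sh bigram.sh.
chmod +x bigram.sh

printf 'a rose is a rose is a rose\n' > text.txt
sh bigram.sh < text.txt | sort | uniq -c | sort -nr | head -5
```

Keeping the counting outside the script leaves bigram.sh as a small brick that can be combined with other commands through the pipe.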

a tiny story (real-world example) in the end..

I never seem to remember when the New York Fashion Week takes place

New York Fashion Week: we'll consult the New York Times (web API) to find out. Step 1: get the data (with your own API key in place of <your-key>)

New York Fashion week Step 2: combine the results

Extract year-month Extract year and month and sort by frequency to get a first impression

References: [1] Nikolaj Lindberg. egrep for Linguists. http://stts.se/egrep_for_linguists/egrep_for_linguists.pdf [2] Ken W. Church (1994). Unix for Poets. http://cst.dk/bplank/refs/unixforpoets.pdf [3] Jeroen Janssens (2014). Data Science at the Command Line. O'Reilly. [4] Jurafsky & Martin. Speech and Language Processing. 2nd edition (2009).