No black magic: Text processing using the UNIX command line

No black magic: Text processing using the UNIX command line. Barbara Plank, http://cst.dk/bplank, Nov 6, 2014

Motivation (1994)

What is UNIX?
An operating system (OS), developed in 1969 at AT&T / Bell Labs.
The name is used loosely to refer to any OS sharing the same basic design (Linux, Solaris, Mac OS).
Unix philosophy: build functionality out of small programs that do one thing and do it well.
Slide inspired by: https://software.rc.fas.harvard.edu/training/intro_unix/

What is the command line?
$ is the command prompt.
This window is called the terminal, which is the program that allows us to interact with the shell.
The shell is an environment that executes the commands we type in at the command prompt.

What is the command line?
A REPL (read-eval-print loop): you type input, and the shell evaluates it and prints the output.
It is very different from the well-known graphical user interface.

Input/output process model
Shell programs do I/O (input/output) with the terminal, using three streams: stdin (input, from the keyboard), stdout (output, printed on the display) and stderr (error messages, also printed on the display).
The programs run inside the shell environment (e.g. the Bash shell).
Interactively, you rarely notice that stdout and stderr are separate (today we won't worry about stderr).

Unix philosophy: combine many small programs for more advanced functionality.
[Diagram: keyboard input reaches the first shell program on stdin; the stdout of each shell program feeds the next one; the stdout of the last program is printed on the display. Everything runs inside the shell environment, e.g. the Bash shell.]

Why (still) the command line?
Advantages: it allows you to be agile (REPL vs edit-compile-run-debug cycle); the command line is extensible and complementary; it supports automation and reproducibility; it lets you run jobs on big clusters of computers (HPC).
Disadvantage: it takes some time to get acquainted.

Start a terminal on Mac OS: Applications > Utilities > Terminal

On Linux

Note for Windows users: the Windows Command Prompt cmd (or PowerShell) is fundamentally different from, and incompatible with, the commands we will see today!
For today: download PuTTY from http://the.earth.li/~sgtatham/putty/latest/x86/putty.exe

Getting started: start a terminal
Connect to the server for today's workshop (see handout):
    ssh username@hostname
Type yes when you see this:
    The authenticity of host.(10.1.) can't be established.
    ECDSA key fingerprint is c0:7b:40:5f:c9:d4:97:6f:33:27:76:8f:5e:b9:25:92.
    Are you sure you want to continue connecting (yes/no)? yes
Enter the password. You now have a prompt.
Windows users: use PuTTY to connect to hostname.

now we are all connected to a shell where we can issue commands

First shell commands
Type text (your command) after the prompt ($), followed by ENTER:
    pwd    print working directory (shows the current location in the filesystem)

Shell command: structure
A shell command (or shell program) usually takes parameters: arguments (required) and options (optional).
Shell program with argument(s):
    cat text.txt
    cat text1.txt text2.txt text3.txt
With argument and option:
    cat -n text.txt    (prefix every line with its line number)

Note
Shell commands are CaSe SeNsItIvE: pwd is not the same as PWD or Pwd.
Spaces have special meanings (do not use them in file names or folder names).

Where to find help
To know what options and arguments a command takes, consult the man (manual) pages:
    man whoami
    man cat
Use q to exit.

Tips
Use auto-completion: type the start of a name and press <TAB> (e.g. m<TAB>).
Use the arrow up key to reload a command from your command history (or, more advanced, search the history of commands with <CTRL>+r).
Use <CTRL>+d, <CTRL>+c, or just q to quit.

Word frequency list

Prerequisite: copy file
Copy the text file from my home directory to yours:
    cp /home/bplank/text.txt .
Here cp is the command name (copy), arg1 says what to copy, and arg2 says where to copy it to (. is the current directory).
Check that the file is in your directory with ls.

Inspect files
    head text.txt    prints out the first ten lines of the file
Try out the following commands. What do they do?
    tail text.txt
    cat text.txt
    less text.txt    (continue with SPACE or arrow UP/DOWN; quit by typing q)

Line-based processing
    head text.txt       prints out the first ten lines of the file (the default)
    head -4 text.txt    prints out the first 4 lines of the file

I/O redirection to files
The output of shell commands can be redirected to files instead of the screen, and their input can be read from files instead of the keyboard.
Append to any command:
    > myfile     send stdout to a file called myfile
    < myfile     send the content of myfile as input to some program
    2> myfile    send stderr to a file called myfile

Line-based processing and I/O redirection
    head text.txt             is equivalent to head < text.txt
    head -1 text.txt > tmp    prints out the first line of the file and stores it in the file tmp
Exercise: store the last 4 lines of the file text.txt in a file called footer.txt
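The three redirections can be tried out safely on a throwaway file. This is a minimal sketch: sample.txt, errors.txt and no_such_file are made-up names, and the commands run in a scratch directory rather than on the workshop files.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
cd "$(mktemp -d)"

# Hypothetical sample file with six lines.
printf 'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\n' > sample.txt

# > sends stdout to a file instead of the screen.
head -n 1 sample.txt > tmp

# < feeds the file to the program on stdin; prints the same as "head -n 1 sample.txt".
head -n 1 < sample.txt

# tail works like head, but from the end of the file.
tail -n 2 sample.txt > footer.txt
cat footer.txt

# 2> sends stderr to a file; ls complains because no_such_file does not exist.
# "|| true" only keeps a strict shell from stopping on the failed ls.
ls no_such_file 2> errors.txt || true
cat errors.txt
```

Note that > overwrites the target file without asking, so pick file names with care.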

Recipe for counting words
An algorithm:
a. split text into one word per line (tokenize)
b. sort words
c. count how often each word appears

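a) Split text: one word per line
tr translates the characters in set A into B, where A is a set of characters and B is a single character (here \n, newline).
-c takes the complement of A, so every character NOT in A is translated; -s squeezes runs of the repeated output character into one.
    tr -sc 'A-Za-z' '\n' < text.txt
More examples:
    tr -sc 'A-Za-z0-9' '\n' < text.txt
    tr -sc '[:alnum:]' '\n' < text.txt
    tr -sc '[:alnum:]@#' '\n' < tweets.txt
Quote the sets so that the shell passes them to tr unchanged.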
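To see what the tokenizer does without needing text.txt, it can be run on a single sentence. A small sketch; the sentence and the variable names text and tokens are made up for illustration.

```shell
# A made-up sentence standing in for text.txt.
text='The cat saw the dog. The dog ran!'

# Every run of non-letters (spaces, ".", "!") becomes a single newline,
# which leaves one word per line.
tokens=$(printf '%s\n' "$text" | tr -sc 'A-Za-z' '\n')

# Prints eight lines: The, cat, saw, the, dog, The, dog, ran
printf '%s\n' "$tokens"
```

Without the quotes around '\n', the shell would strip the backslash and tr would translate to the letter n instead of a newline.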

b) Sorting lines of text: sort FILE
    sort -r     reverse sort
    sort -n     numeric sort
    sort -nr    reverse numeric sort
Exercise: try out the sort command with the different options above on the file /home/bplank/numbers

c) Count words = count duplicate lines in a sorted text file: uniq -c
uniq assumes a SORTED file as input!
    uniq -c SORTEDFILE
Exercise: frequency list of the numbers in a file.
Sort the numbers file and save it (> redirects to a file) in a new file called numsorted.
Now use uniq -c to count how often each number appears.
Solution:
    sort -n /home/bplank/numbers > numsorted
    uniq -c numsorted
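Why uniq needs sorted input is easiest to see on a tiny file. The sketch below uses a made-up file called nums (not the workshop's numbers file) and also contrasts plain sort with sort -n.

```shell
cd "$(mktemp -d)"

# Hypothetical input: six numbers, unsorted, with repeats.
printf '10\n9\n100\n9\n10\n9\n' > nums

# Plain sort compares lines as text: 10 10 100 9 9 9
sort nums

# Numeric sort compares values: 9 9 9 10 10 100
sort -n nums

# uniq only merges ADJACENT duplicates, so on the unsorted file
# it reports six separate runs, each with count 1.
uniq -c nums

# On sorted input the counts are right: 3 for 9, 2 for 10, 1 for 100.
sort -n nums | uniq -c
```

The amount of whitespace uniq -c prints in front of each count differs between systems, so the counts are described in the comments rather than shown verbatim.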

Now we have seen all the necessary ingredients for our recipe for counting words.
An algorithm:
a. split text into one word per line (tokenize)
b. sort words
c. count how often each word appears

The UNIX game: commands ~ bricks. Build more powerful tools by combining bricks using the pipe: |

The Pipe
Unix philosophy: combine many small programs, using | as glue.
[Diagram: inside the shell environment (e.g. the Bash shell), the stdout of tr -sc '[:alnum:]' '\n' feeds sort, the stdout of sort feeds uniq -c, and the stdout of uniq -c is printed on the display.]

Word frequency list
Combining the three single commands (tr, sort, uniq):
    tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c
< text.txt specifies the input for the first program.
Combine commands using the pipe (the | symbol), i.e., the stdout of the previous command is the stdin of the next command.

The Pipe:
    tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c
[Diagram: tr reads text.txt on stdin; its stdout is piped to sort, whose stdout is piped to uniq -c, whose stdout is printed on the display.]
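The whole recipe can be tried on one line of text. A sketch with a made-up sentence in place of text.txt; the trailing sort -nr and head are additions that bring the most frequent words to the top.

```shell
# Made-up input standing in for text.txt.
text='the cat and the dog and the bird'

printf '%s\n' "$text" |
  tr -sc '[:alnum:]' '\n' |  # a. one word per line
  sort |                     # b. identical words end up next to each other
  uniq -c |                  # c. count each run of identical lines
  sort -nr |                 # highest count first
  head -n 2                  # the two most frequent words: "the" (3), then "and" (2)
```

Each | hands the stdout of the command on its left to the stdin of the command on its right, so no intermediate files are written.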

Using the pipe to avoid extra files
Without the pipe: 2 commands = 2 REPLs, with an intermediate file.
With the pipe: 1 REPL, no intermediate file necessary!
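The two variants can be compared directly. This sketch reuses the sort and uniq -c commands from the numbers exercise on a made-up file; the output file names counts_two_steps and counts_piped are hypothetical and exist only so that the results can be compared.

```shell
cd "$(mktemp -d)"

# Hypothetical stand-in for the numbers file.
printf '3\n1\n3\n2\n3\n1\n' > numbers

# Without the pipe: two commands and an intermediate file (numsorted).
sort -n numbers > numsorted
uniq -c numsorted > counts_two_steps

# With the pipe: one command line, no intermediate file.
sort -n numbers | uniq -c > counts_piped

# cmp prints nothing and succeeds when both files are byte-for-byte equal.
cmp counts_two_steps counts_piped && echo "identical results"
```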

Alternative way to split text: sed
The sed (replace) command:
    sed 's/what/with/g' FILE
    sed 's/ /\n/g' text.txt
What happens if you leave out g? Try the following (with and without g):
    sed 's/i/**you**/g' /home/bplank/short.txt
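The effect of the g flag shows up on any line that contains the pattern more than once. A tiny sketch on a made-up string:

```shell
# Without g, only the first match on each line is replaced.
echo 'a b a b' | sed 's/a/X/'     # prints: X b a b

# With g ("global"), every match on the line is replaced.
echo 'a b a b' | sed 's/a/X/g'    # prints: X b X b
```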

tr
Another use of tr:
    tr '[:upper:]' '[:lower:]' < text.txt
Extra exercise: merge upper and lower case by downcasing everything

Exercise
Extract the 10 most frequent hashtags from the file /home/bplank/tweets.txt (hint: create a word frequency list first and then use sort and head).
Also, use the command grep '^#' in your pipeline (to extract words that start with a hashtag); we will see grep again later.
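One possible shape for such a pipeline, sketched on three made-up tweets rather than the workshop file (the local tweets.txt created here is only an illustration):

```shell
cd "$(mktemp -d)"

# Three hypothetical tweets.
printf '%s\n' 'loving #nlp and #unix today' 'more #unix tricks' '#unix #nlp rocks' > tweets.txt

# Keep @ and # as part of a token, then keep only tokens starting with #.
tr -sc '[:alnum:]@#' '\n' < tweets.txt |
  grep '^#' |
  sort |
  uniq -c |
  sort -nr |
  head -n 10    # here: #unix (3 times), then #nlp (2 times)
```

The quotes around '^#' matter: an unquoted # preceded by a space starts a shell comment.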

File system and navigation

File system
The usual system with files, folders, and paths to files.
The root of the file system hierarchy is always: /
Paths can be absolute or relative, e.g. /home/bplank/data vs data/
Commonly used directories:
    .     current working directory
    ..    parent directory
    ~     home directory of the user (for me: /home/bplank == ~bplank)

Navigating the file system
    cd               change directory, e.g. cd data/001/
    mkdir project    creates a directory called project
    ls               list content of directory, e.g. ls /home/bplank
    pwd              print working directory

What we have seen so far
What UNIX is, what the command line is, and why to use it
Inspecting a file on the command line
Creating a word frequency list (sed, sort, uniq, tr, and the pipe) and extracting the most frequent words
File system and navigation

Overview
Bigrams, working with tabular data
Searching files with grep
A final tiny story

Bigram = word pair
Algorithm:
a. tokenize by word
b. print word_i and word_i+1 next to each other
c. count

Print words next to each other: the paste command
    paste FILE1 FILE2
If your two files contain lists of words, paste prints them next to each other.

Get the next word
Create a file with one word per line.
Create a second file from the first, but which starts at the second line:
    tail -n +2 file > next
[start with the second line and output everything until the end]

Bigrams Exercise: find the 5 most frequent bigrams of text.txt

Solution: Find the 5 most common bigrams Extra: Find the 5 most common trigrams
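Put together, the bigram recipe might look like the sketch below. It runs on a made-up sentence; sample.txt and words are hypothetical file names, while next follows the earlier tail example.

```shell
cd "$(mktemp -d)"

# Made-up text standing in for text.txt.
printf 'the cat saw the cat and the cat ran\n' > sample.txt

# a. tokenize: one word per line
tr -sc '[:alnum:]' '\n' < sample.txt > words

# b. the same list shifted by one line, so line i holds word i+1
tail -n +2 words > next

# c. put word_i and word_i+1 side by side (tab-separated) and count the pairs
paste words next | sort | uniq -c | sort -nr | head -n 5
```

The top line is the pair "the cat" with count 3. Because next is one line shorter than words, the last line of the paste output holds a word without a partner.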

Tabular data
    paste FILES     joins files side by side (in contrast to cat, which appends them one after another)
    cut -f1 FILE    cuts out the first column from FILE
Exercise: create a frequency list from column 4 in the file parses.conll
    cut -f 4 parses.conll | sed '/^$/d' | sort | uniq -c | sort -nr
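The column pipeline can be tested on a small hand-made table. The file sample.conll and its contents are invented for illustration (tab-separated columns, with an empty line between the two sentences); the real parses.conll may have a different layout.

```shell
cd "$(mktemp -d)"

# Hypothetical tab-separated data: index, word, lemma, tag.
printf '1\tThe\tthe\tDET\n2\tdog\tdog\tNOUN\n3\tbarks\tbark\tVERB\n\n1\tA\ta\tDET\n2\tcat\tcat\tNOUN\n' > sample.conll

# cut keeps column 4; the empty separator line passes through unchanged,
# so sed deletes empty lines before counting.
cut -f 4 sample.conll | sed '/^$/d' | sort | uniq -c | sort -nr
```

This prints three lines: DET and NOUN with count 2 each, and VERB with count 1 last. cut splits on tabs by default; for other separators use -d.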

grep
grep finds lines that match a given pattern:
    grep star text.txt

grep
grep finds patterns specified as regular expressions (the name stands for: globally search for a regular expression and print).
grep is a filter: you only keep certain lines of the input.
E.g., words that end with -ing:
    grep -w "[a-z]*ing" text.txt
Exercises:
Try the above command without the -w option.
Try it with the -o and -w options (or -ow for short).
What do the -v and -i options do? Use man grep to find out.

grep
    grep gh       keep lines containing gh
    grep -i gh    keep lines containing gh independent of casing (gh, GH, ...)
    grep ^ch      keep lines beginning with ch
    grep ing$     keep lines ending with ing
    grep -v gh    do NOT keep lines containing gh
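These patterns can be checked against a four-line test file. A sketch; lines.txt and its contents are made up, and the patterns are quoted so the shell leaves ^ and $ alone.

```shell
cd "$(mktemp -d)"

# Four hypothetical lines to filter.
printf '%s\n' 'chasing light' 'Night flight' 'a quiet song' 'ending' > lines.txt

grep gh lines.txt        # chasing light, Night flight
grep -i GH lines.txt     # the same two lines: case is ignored
grep '^ch' lines.txt     # chasing light
grep 'ing$' lines.txt    # ending
grep -v gh lines.txt     # a quiet song, ending
grep -c gh lines.txt     # -c counts matching lines instead of printing them: 2
```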

More on regular expressions: see Lindberg [1] or chapter 2 on regular expressions of [4], Jurafsky & Martin.

Counting: wc
Counting lines (-l), words and characters in a file:
    wc FILE
Why is the number of words different?
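A quick way to get a feel for wc is to count something small enough to check by hand. A sketch on two made-up lines, held in a variable named text:

```shell
# Two hypothetical lines: "one two" and "three four five".
text=$(printf 'one two\nthree four five')

printf '%s\n' "$text" | wc -l    # 2 lines
printf '%s\n' "$text" | wc -w    # 5 words
printf '%s\n' "$text" | wc -c    # 24 bytes, including the two newlines
```

wc counts whitespace-separated strings as words, so punctuation attached to a word is counted as part of it. Some systems pad the numbers with leading spaces.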

Exercises with grep & wc
How many uppercase words are in text.txt?
How many 4-letter words?
How many 1-syllable words are there (with exactly one vowel)?

stop words

Removing bigrams that contain stop words Exercise: Use grep to filter out stop words from the text.bigram file

Most frequent bigrams without stop words: towards more useful bigrams. Pre-processing matters!

Shell scripts
Basically, a shell script is a text file with shell commands in it, used to automate tasks and avoid repetition.
Example: backup.sh
Make it executable: chmod +x backup.sh
Run it: ./backup.sh (or sh backup.sh)

Shell scripts: example
Create a text file called bigram.sh.
Execute it on a text file:
    ./bigram.sh | head -5
    sh bigram.sh | sort | uniq -c | sort -nr | head
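The contents of bigram.sh are not shown in the slides, so the script below is only a guess at what such a file could contain. It is assembled from the tr, tail and paste steps of the bigram recipe; the default input text.txt, the optional file argument and the temporary file names are all assumptions.

```shell
cd "$(mktemp -d)"

# Made-up input so that the sketch runs on its own.
printf 'to be or not to be\n' > text.txt

# Write a hypothetical bigram.sh; the quoted 'EOF' keeps $1 and $file literal.
cat > bigram.sh <<'EOF'
#!/bin/sh
# Print one bigram per line for the given file (default: text.txt).
file=${1:-text.txt}
tr -sc '[:alnum:]' '\n' < "$file" > words.tmp
tail -n +2 words.tmp > next.tmp
paste words.tmp next.tmp
rm -f words.tmp next.tmp
EOF

# Make the script executable so that ./bigram.sh works as well.
chmod +x bigram.sh

# Run it and count the bigrams; the top line is "to be" with count 2.
sh bigram.sh | sort | uniq -c | sort -nr | head -n 5
```

Keeping the counting outside the script means the same file can feed both head -5 and a frequency list.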

a tiny story (real-world example) in the end..

I never seem to remember when the New York Fashion Week takes place

New York Fashion Week: we'll consult the New York Times (web API) to find out. Step 1: get the data <your-key>

New York Fashion Week. Step 2: combine the results

Extract year-month Extract year and month and sort by frequency to get a first impression

References
[1] Nikolaj Lindberg. egrep for Linguists. http://stts.se/egrep_for_linguists/egrep_for_linguists.pdf
[2] Ken W. Church (1994). Unix for Poets. http://cst.dk/bplank/refs/unixforpoets.pdf
[3] Jeroen Janssens (2014). Data Science at the Command Line. O'Reilly.
[4] Jurafsky & Martin (2009). Speech and Language Processing. 2nd edition.