Jerzy Nawrocki Faculty of Computing & Information Sci. Poznan University of Technology jerzy.nawrocki@put.poznan.pl File conversion problem FName:John SName:Great Salary 585 Text processing and AWK FName:Ann SName:Nice Salary 7 FName SName Salary John Great 585 Ann Nice 7 Text processing & AWK (2) File conversion problem File conversion problem #include <stdio.h> #include <stdlib.h> FILE *fin; char token[2]; char gettoken(void) {int i=; char c; do {c = getc(fin); if (c == EOF) return (EOF); } while (c <! ); Solution in C: 4 lines of code Text processing & AWK (3) Solution in AWK BEGIN {FS=": ";} NR == 1 {print $1, "\t", $3, "\t", $5;} {gsub(/,/, "., $6); print $2, "\t", $4, "\t", $6;} Text processing & AWK (4) Origins of AWK Authors of AWK Bell Labs, Murray Hill (New Jersey), Foto: http://en.wikipedia.org/wiki/bell_labs Bell Labs, New Jersey (USA), 1977 AWK: Aho, Weinberger, Kernighan Platforms: Unix, MS DOS/Windows Similarity to C Text processing & AWK (5) Alfred Aho http://www.underforty.us/geeks.html Peter Weinberger Brian Kernighan Text processing & AWK (6) Text processing and AWK 1
Aim of the lecture Agenda To present: Another programming paradigm (rule-based programming) Basics of AWK Fundamentals of AWK Simplest programs Patterns Variables Text processing & AWK (7) Text processing & AWK (8) Fundamental question Input file What is text? Field Jerzy Nawrocki 4389 I1 Jane Kowalski 4378 I2 Line Fields: $1, $2, $3,... Text processing & AWK (9) Text processing & AWK (1) Program structure Execution principle pattern1 {instruction1} pattern2 {instruction2}...... Processing rule Jerzy Nawrocki Jane Kowalski Adam Malinowski pattern1 {instruction1} pattern2 {instruction2}...... Text processing & AWK (11) Text processing & AWK (12) Text processing and AWK 2
Agenda Simplest programs Fundamentals of AWK Simplest programs Patterns Variables How many fields on the output? Jerzy Nawrocki 4389 I1 Jane Kowalski 4378 I2 $4== I1 { print $2, $1; } Nawrocki Jerzy Malinowski Adam Text processing & AWK (13) Text processing & AWK (14) Simplest programs Simplest programs How many fields on the output? Jerzy Nawrocki 4389 I1 Jane Kowalski 4378 I2 What field will be first? Jerzy Nawrocki 4389 I1 Jane Kowalski 4378 I2 $4== I1 { print $2, $1; } Jerzy Nawrocki 4389 I1 Nawrocki Jerzy Kowalski Jane Malinowski Adam Text processing & AWK (15) Text processing & AWK (16) Agenda Patterns Fundamentals of AWK Simplest programs Patterns Variables Begining and end of text Relations Compound patterns Range patterns Text processing & AWK (17) Text processing & AWK (18) Text processing and AWK 3
Begining and end of text Begining and end of text Jerzy Nawrocki Jane Kowalski BEGIN { print ----- ; } $4== I2 { print $2, $1; } END { print ***** ; } ----- Kowalski Jane ***** 4389 I1 4378 I2 Text processing & AWK (19) Jerzy Nawrocki Jane Kowalski END { print ***** ; } $4== I2 { print $2, $1; } BEGIN { print ----- ; } ----- Kowalski Jane ***** 4389 I1 4378 I2 Text processing & AWK (2) Relations Compound patterns 12 11 2 11 $1 > $2 12 11 or $1==1 $2==1 && and $1==1 && $2==1! not! $1==1 Text processing & AWK (21) Text processing & AWK (22) Compound patterns Agenda Jerzy Adam 4389 I1 Adam Kowalski 4378 I2 $4== I1 && $1== Adam { print $2, $1; } Fundamentals of AWK Simplest programs Patterns Variables Malinowski Adam Text processing & AWK (23) Text processing & AWK (24) Text processing and AWK 4
http://www.math.wisc.edu/~gpslogic/ Stephen Kleene 199-1-5, Connecticut, USA 1934: Dr, Princeton Univ., (Alonzo Church) 1935: Univ. of Wisconsin- Madison (USA) 1939-4: Inst. for Advanced Study, Princeton recursion theory 199: National Medal of Sci. 1994-1-25, Madison Text processing & AWK (25) Regular expressions Arithmetic expressions Value: Text Number Value(2 3 + 3) = 9 Regular expressions Value: Text SetOfCharacterStrings Value(/Ala Ola/) = {"Ala", "Ola"} Text processing & AWK (26) $, $1, $2,.. $, $1, $2,.. Begining String ~ /^ reg_exp / $1 ~ /^I$/ $1 ~ /^I/ Text processing & AWK (27) Text processing & AWK (28) $, $1, $2,.. $, $1, $2,.. Begining String ~ /^ reg_exp / End String ~ / reg_exp $/ $3 ~ /e$/ Begining String ~ /^ reg_exp / End String ~ / reg_exp $/ Substring String ~ / reg_exp / $3 ~ /e/ Text processing & AWK (29) Text processing & AWK (3) Text processing and AWK 5
$, $1, $2,.. $, $1, $2,.. Begining String ~ /^ reg_exp / End String ~ / reg_exp $/ Substring String ~ / reg_exp / $ ~ /ne/ Text processing & AWK (31) Begining String ~ /^ reg_exp / End String ~ / reg_exp $/ Substring String ~ / reg_exp / $ ~ / wyr_reg / = / wyr_reg / /ne/ Text processing & AWK (32) Special characters Special characters. Any character [ ] Set of characters \n New line \. Dot \ Quotation \ddd Character of octal code = ddd What does it mean? /^.$/ /[123456789]/ /[-9]/ Text processing & AWK (33) Text processing & AWK (34) Complement of a set of characters Disjunction (or) reg_exp reg_exp What s the difference? [^... ] /[^-9]/ /^[-9]/ /im in/ Text processing & AWK (35) Text processing & AWK (36) Text processing and AWK 6
Parentheses and disjunction Parentheses change precedence of operators: 3*(4 + 5) = 3* 9 = 27 Distributive law: 3*(4 + 5) = 3*4 + 3*5 = 12 + 15 = 27 (4 + 5)*3 = 4*3 + 5*3 = 12 +15 = 27 Select all lines for which the first field is a number. 1 2 3 AD 1984 $1 ~ /[-9]/ Distributivity of concatenation over disjunction: /i(m n)/ = /Lond(o y)n/ = /im in/ /London Londyn/ Text processing & AWK (37) Text processing & AWK (38) Select all lines for which the first field is a number. 1 2 3 4tune: AD 1984 2 cents Select all lines for which the first field is a 1-digit number. 1 2 3 4tune: 4tune: AD 1984 2 cents 32 cents $1 ~ /[-9]/ $1 ~ /^ [-9] $/ 1 2 3 4tune: 2 cents Text processing & AWK (39) Text processing & AWK (4) Select all lines for which the first field is a 1- or 2-digit number. 1 2 3 4tune: 4tune: AD1984 2 cents 32 cents Select all lines for which the first field is a 1- or 2-digit number. 1 2 3 4tune: 4tune: 4tune: AD 1984 2 cents 32 cents 32 cents 12 euro $1 ~ /^( [-9] [-9][-9] )$/ $1 ~ /^( [-9] [-9][-9] )$/ 1 digit 2 digits 1 2 3 2 cents 32 cents Text processing & AWK (41) Text processing & AWK (42) Text processing and AWK 7
$1 ~ /^( [-9] [-9][-9] [-9][-9][-9] )$/ Select all lines for which the first field is a number. 1 2 3 4tune: 4tune: 4tune: AD 1984 2 cents 32 cents 32 cents 12 euro 1 digit 2 digits 3 digits [-9] = [-9] 1 [-9][-9] = [-9] 2 [-9][-9][-9] = [-9] 3.................. [-9] 1 [-9] 2 [-9] 3... = [-9]+ Text processing & AWK (43) Text processing & AWK (44) $1 ~ /^ [-9]+ $/ Select all lines for which the first field is a number. 1 2 3 4tune: 4tune: 4tune: AD 1984 2 cents 32 cents 32 cents 12 euro 1 2 3 2 cents 32 cents 32 cents 12 euro Text processing & AWK (45) A non-empty sequence Can be a of w s. sequence of w s. w+ = w ww www... Empty string w* = w ww www... x = x x = x w+ = w w* = w* w x w* = x x w+ w*x = x w+x w( w ww www.. )= w ww www wwww.. Text processing & AWK (46) Riddle Riddle What s its meaning? What on the output? Payment 21/11 ================ Nawrocki 16 Antczak 359 $2 ~ /^[-9][-9]*$/ $2 ~ /^[-9]+$/ Text processing & AWK (47) Text processing & AWK (48) Text processing and AWK 8
Agenda Variables Fundamentals of AWK Simplest programs Patterns Variables Variables introduced by a programmer (type: string of characters; initial value: empty string / zero) Built-in variables (standard meaning) Field variables $1, $(i+j-1),.. Text processing & AWK (49) Text processing & AWK (5) Some built-in variables NF number of fields in a row NR row number FILENAME file name with input data NR NF total 1 2 2 1 2 3 Variables If you have a hammer, everything looks like a nail {total= total + NF;} END {print "Fields: ", total; print "Rows: ", NR;} Text processing & AWK (51) Text processing & AWK (52) Input file Summary Field Jerzy Nawrocki 4389 I1 Jane Kowalski 4378 I2 Line Fields: $1, $2, $3,... Text processing & AWK (53) Text processing & AWK (54) Text processing and AWK 9
Program structure Execution principle pattern1 {instruction1} pattern2 {instruction2}...... Processing rule Jerzy Nawrocki Jane Kowalski Adam Malinowski pattern1 {instruction1} pattern2 {instruction2}...... Text processing & AWK (55) Text processing & AWK (56) Regular expressions Summary (1)* = {ε, 1, 11, 111, } ε x = x ε L 1 L 2 = {xy: x L 1 y L 2 } L (r) = { ε } L n+1 (r) = L(r) L n (r) L n (r) = {xx x: x L(r) } n L(r*) = U L n (r) n= L(r*) = {ε,x, xx, xxx, : x L(r)} Text processing & AWK (57) At last! gawk -f prog.awk <in.txt >out.txt Other features of AWK: Compound instructions (if, while,..) Dynamic arrays Built-in functions (gsub,..) Text processing & AWK (58) Literature A. Aho, B. Kernighan, P. Weinberger, The AWK Programming Language, Addison-Wesley, Reading, 1988. J. Nawrocki, W. Complak, Wprowadzenie do przetwarzania tekstów w języku AWK, Pro Dialog 2 (1994), 23-46. J. Cybulka, B. Jankowska, J.R. Nawrocki, Automatyczne przetwarzanie tekstów. AWK, Lex i YACC, Nakom, Poznań, 22. Text processing & AWK (59) Thank you for your attention! Text processing & AWK (6) Text processing and AWK 1