(mini) Unix for Poets

from Unix for Poets
by Kenneth Ward Church
AT&T Bell Laboratories
kwc @research att com

Disclaimer: this text was OCR'd and more-or-less HTML-ized by Paai and Diwi. Now, corrected by G. Rigau. Please note that there still are many errors, especially where shell-syntax is used. It is left as an exercise for the students to correct them.
We will add comments and reflections to the original sheets of Church. Such comments will be easily recognizable.

Text is available like never before
Dictionaries, corpora, etc.
Data Collection Efforts: ACL/DCI, BNC, CLR, ECI, EDR, ICAME, LDC
Information Super Highway Roadkill: email, blogs, wikis, etc.
Billions and billions of words
What can we do with it all?
It is better to do something simple, than nothing at all.
You can do the simple things yourself ...

Exercises to be addressed

See a file

Count words in a text

Sort a list of words in various ways

ascii order
dictionary order
"rhyming" order

Extract useful info from a dictionary

Compute ngram statistics

Tools

grep: search for a pattern (regular expression)
sort
uniq -c (count duplicates)
tr (translate characters)
wc (word count)
sed (edit string)
awk (simple programming language)
cut
paste
comm
join

Please check the man-pages of the commands you are using and try to recognize the options that are used in the examples!

Uncompress and see

Type the following commands:

file bible.txt.gz
gunzip -c bible.txt.gz | more
zmore bible.txt.gz
gunzip -c bible.txt.gz | less
gunzip -c bible.txt.gz | tail
gunzip -c bible.txt.gz | head
gunzip -c bible.txt.gz | wc
gunzip bible.txt

Exercise 1: Count words in a text

Input: text file (bible.txt)
Output: list of words in the file with frequency counts

Algorithm

Tokenize (tr)
Sort (sort)
Count duplicates (uniq -c)

Solution to Exercise 1

tr -sc 'A-Za-z' '\012' < bible.txt | sort | uniq -c | more

      1
   7973 a
    236 A
      1 aa
    350 Aaron
      2 Aaronites
      1 Abaddon
      1 Abagtha
      1 Abana
      4 Abarim
...

Glue

Note in the above example how the powerful syntax of a typical Unix shell is used. If a program expects input from the keyboard (stdin), it can also take its input from an existing text file (bible.txt) by using the < sign. The > sign redirects output to a file instead of the default device (stdout, normally the screen). The | sign pipes the output of one program directly into the input of the next program. In this way you can create veritable assembly lines of programs that progressively change the original input into the output you need.


read from input file    < 
write to output file    > 
pipe                    |
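
The three glue symbols can be tried on a tiny file before running them on the whole Bible. This is only a sketch: sample.txt and sample.hist are made-up names, and LC_ALL=C is added so that the sort order is the same on every machine.

```shell
# Toy sample standing in for bible.txt (hypothetical file name).
printf 'In the beginning\nthe earth and the heaven\n' > sample.txt

# "<" feeds the file to tr on stdin, "|" chains the three stages,
# ">" stores the final result in sample.hist instead of printing it.
tr -sc 'A-Za-z' '\012' < sample.txt | LC_ALL=C sort | uniq -c > sample.hist

# Look at the stored result.
cat sample.hist
```

The file sample.hist should contain one line per distinct word, each preceded by its count.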

Step by Step

1) more bible.txt

...
1:1 In the beginning God created the heaven and
1:2 And the earth was without form, and void; an
1:3 And God said, Let there be light: and there
1:4 And God saw the light, that [it was] good: a
...

2) tr -sc 'A-Za-z' '\012' < bible.txt | more

DOC
Welcome
To
The
World
...

3) Filtering with a simple gawk program ...

gunzip -c bible.txt.gz | tr -sc 'A-Za-z' '\012' | gawk 'BEGIN{flag=0};$0~/\<TEXT\>/{flag=1;next};$0~/\<\/TEXT\>/{flag=0;next};{if(flag>0){print}}' > bible.clean

4) Ordering and counting ...

tr -sc 'A-Za-z' '\012' < bible.clean | sort | uniq -c | more

   7943 a
    234 A
    350 Aaron
      2 Aaronites
...

More Counting Exercises

Merge the counts for upper and lower case.

tr 'a-z' 'A-Z' < bible.clean |
tr -sc 'A-Z' '\012' |
sort | 
uniq -c

Count sequences of vowels

tr 'a-z' 'A-Z' < bible.clean |
tr -sc 'AEIOU' '\012' | 
sort |
uniq -c

Count sequences of consonants

tr 'a-z' 'A-Z' < bible.clean | 
tr -sc 'BCDFGHJKLMNPQRSTVWXYZ' '\012' |
sort |
uniq -c

sort lines of text

Example       Explanation

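sort -d	      dictionary order
sort -f	      fold case
sort -n	      numeric order
sort -nr      reverse numeric order
sort +1	      start with field 1 (starting from 0)
sort +0.50    start with 50th character
sort +1.5     start with 5th character of field 1

Note: the +N forms are the historical syntax. Current versions of sort use -k, which counts fields from 1, so sort -k 2 corresponds to sort +1.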

See man page:

man sort
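
A small experiment makes the difference between these options visible. The sketch below assumes a made-up four-line histogram called toy.hist, and it uses the -k spelling of the field option, which is an assumption about which sort you have installed.

```shell
# Hypothetical toy histogram in "count word" form, like the output of uniq -c.
printf '9 surely\n15 Surely\n100 a\n2 fly\n' > toy.hist

# Plain sort compares the lines as text, so "100" comes before "15" and "2".
LC_ALL=C sort toy.hist

# -n compares the leading numbers; -r reverses, so the largest count is first.
sort -nr toy.hist

# -k 2 starts the sort key at the second field (the word);
# -f folds lower case to upper case while comparing.
LC_ALL=C sort -f -k 2 toy.hist
```

Compare the three listings: only the second one is ordered by frequency.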

Sort Exercises

Sort the words in Bible by freq

tr -sc 'A-Za-z' '\012' < bible.clean | sort | uniq -c | sort -nr > bible.hist

Sort them by dictionary order

Sort them by rhyming order (hint: rev)

. . .
 1 freely
 1 sorely
 5 Surely
15 surely
 1 falsely
 1 fly
. . .

echo hello world | rev 
dlrow olleh

echo hello world | rev | rev 
hello world
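
One possible way to combine rev and sort on a word list is sketched below. The file toy.words is a made-up example, and LC_ALL=C is only there to make the order reproducible.

```shell
# Hypothetical toy word list, one word per line.
printf 'surely\nfly\nfreely\nking\nsing\n' > toy.words

# Reverse every word, sort the reversed words, then reverse them back:
# words that share an ending now sit next to each other.
rev < toy.words | LC_ALL=C sort | rev
```

Words ending in -ing come out together, followed by the words ending in -ly.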

Important Points Thus Far

Tools: tr, sort, uniq, sed, rev
Glue: | < >
Example: count words in a text
Pipes - flexibility: simple yet powerful
Variations

tokenize by vowel, merge upper and lower case
sort by freq, dictionary order, rhyming order

Bigrams Algorithm

tokenize by word
print word i and word i + 1 on the same line
count

tr -sc 'A-Za-z' '\012' < bible.clean > bible.words

tail -n +2 bible.words > bible.nextwords

paste bible.words bible.nextwords | more

The     Old
Old     Testament
Testament       of
of      the
...

paste bible.words bible.nextwords | sort | uniq -c > bible.bigrams
sort -nr < bible.bigrams | more

  11445 of      the
   5964 the     LORD
   4880 in      the
   4044 and     the
   2461 shall   be
...
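
The same three steps can be checked by hand on six words. This is a sketch with made-up file names (toy.words, toy.nextwords, toy.bigrams), not part of the original sheets.

```shell
# Hypothetical toy token list, one word per line.
printf 'in\nthe\nbeginning\nin\nthe\nend\n' > toy.words

# Shift the list up by one line so that line i holds word i + 1.
tail -n +2 toy.words > toy.nextwords

# paste puts word i and word i + 1 on the same line, separated by a tab.
# The last line has no partner, so its second column is empty.
paste toy.words toy.nextwords | LC_ALL=C sort | uniq -c > toy.bigrams

# Most frequent bigram first.
sort -nr toy.bigrams
```

The pair "in the" occurs twice in the toy list, so it ends up at the top.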

Exercise 2: count trigrams of Bible

grep & egrep: An Example of a Filter

Count "-ing" words

tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ing$' | sort | uniq -c | more

Example	  Explanation

grep gh	        find lines containing "gh"
grep '^con'	find lines beginning with "con"
grep 'ing$'	find lines ending with "ing"
grep -v gh	don't display lines containing "gh"
grep -v '^con'  don't display lines beginning with "con"
grep -v 'ing$'  don't display lines ending with "ing"

More examples

Example 		Explanation

grep '[A-Z]'		lines with an uppercase char
grep '^[A-Z]'		lines starting with an uppercase
grep '[A-Z]$'	        lines ending with an uppercase
grep '^[A-Z]*$'	lines with all uppercase chars
grep '[aeiouAEIOU]'	lines with a vowel
grep '^[aeiouAEIOU]'	lines starting with a vowel
grep '[aeiouAEIOU]$'	lines ending with a vowel
grep -i '[aeiou]'	ditto
grep -i '^[aeiou]'
grep -i '[aeiou]$'
grep -i '^[^aeiou]'	lines starting with a non-vowel
grep -i '[^aeiou]$'	lines ending with a non-vowel
grep -i '[aeiou].*[aeiou]'  lines with two or more vowels
grep -i '^[^aeiou]*[aeiou][^aeiou]*$' lines with exactly one vowel

Regular Expressions

Example	Explanation

a	match the letter "a"
[a-z]	match any lowercase letter
[A-Z]	match any uppercase letter
[0-9]	match any digit
[0123456789]	match any digit
[aeiouAEIOU]	match any vowel
[^aeiouAEIOU]	match any character but a vowel
.	match any character
^	beginning of line
$       end of line

x*	any number of x
x+	one or more of x (egrep only)
x|y	x or y (egrep only)
(x)	override precedence rules (egrep only)
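
A few of these patterns can be tried on a handful of words. The sketch below uses a made-up file toy.words and writes grep -E, which on current systems accepts the same extended syntax as egrep.

```shell
# Hypothetical toy word list, one word per line.
printf 'sing\nking\nfly\nAaron\nrhythm\nI\n' > toy.words

# Basic syntax: a literal string anchored at the end of the line.
grep 'ing$' toy.words

# Extended syntax: grouping with () and alternation with |.
grep -E '^(s|k)ing$' toy.words

# Extended syntax: + means one or more. These lines contain no vowel at all.
grep -E '^[^aeiouAEIOU]+$' toy.words
```

Adding -c to any of the three commands prints the number of matching lines instead of the lines themselves.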

Grep Exercises

How many uppercase words are there in the Bible? Lowercase? Hint: wc -l or grep -c
How many 4-letter words?
Are there any words with no vowels?
Find "1-syllable" words (words with exactly one vowel)
Find "2-syllable" words (words with exactly two vowels)
Some words with two orthographic vowels have only one phonological vowel. Delete words ending with a silent "e" from the 2-syllable list. Delete diphthongs.
Find verses in the Bible with the word "light." How many have two or more instances of "light" ? Three or more? Exactly two?
WARNING: grep, fgrep, egrep, ...

sed (string editor)

print the first 5 lines (quit after the 5th line)
```
sed 5q < bible.clean
```
print up to the first instance of a regular expression
```
sed '/light/q' bible.clean
```

substitution

	Example	                Explanation
	sed 's/light/dark/g'	replace every "light" with "dark"
	sed 's/ly$/-ly/g'	simple morph prog
	sed 's/[[:space:]].*//g'	select first field
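
The three substitutions can be tried on a tiny tab-separated file. This is a sketch only: toy.tsv is a made-up file, and the class [[:space:]] is used for "space or tab" on the assumption that your sed follows POSIX.

```shell
# Hypothetical toy file: a word, a tab, and a label.
printf 'surely\tadverb\nlight\tnoun\nfly\tverb\n' > toy.tsv

# Replace every occurrence of "light" with "dark".
sed 's/light/dark/g' toy.tsv

# Select the first field: delete the first blank and everything after it.
sed 's/[[:space:]].*//' toy.tsv

# Chain two sed calls: keep the first field, then split off a final "ly".
sed 's/[[:space:]].*//' toy.tsv | sed 's/ly$/-ly/'
```

Note that the last command also turns "fly" into "f-ly", which shows how crude such a morphological rule is.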

sed exercises

Count morphs in bible.clean.
Hint: use spell -v to extract morphs, select first field and count
```
echo darkness | spell -v
+ness darkness
```
Count word initial consonant sequences: tokenize by word, delete the vowel and the rest of the word, and count
Count word final consonant sequences

awk

Etymology

Alfred Aho

Peter Weinberger

Brian Kernighan

It is a general purpose programming language, though generally intended for shorter programs (1 or 2 lines)
Especially good for manipulating lines and fields in simple ways
WARNING: awk, nawk, gawk

Selecting Fields by Position

print the first field

awk '{print $1}'
cut -f1

print the second field

awk '{print $2}'
cut -f2

print the last field

awk '{print $NF}'
rev | cut -f1 | rev

print the penultimate field

awk '{print $(NF-1)}'
rev | cut -f2 | rev

print the number of fields

awk '{print NF}'
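
The awk and cut variants can be compared on a small tab-separated file. The sketch below assumes a made-up file toy.fields. Keep in mind that awk splits on any run of blanks by default, while cut -f splits on tabs only, so the two agree here only because the file uses single tabs.

```shell
# Hypothetical toy file with tab-separated fields (3 fields, then 2 fields).
printf 'one\ttwo\tthree\nalpha\tbeta\n' > toy.fields

# First field, two ways.
awk '{print $1}' toy.fields
cut -f1 toy.fields

# Last field, two ways: NF is the number of fields, so $NF is the last one.
awk '{print $NF}' toy.fields
rev toy.fields | cut -f1 | rev

# Number of fields on each line.
awk '{print NF}' toy.fields
```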

Exercise 3: sort the words in the Bible by the number of syllables (sequences of vowels). Which word has the most syllables?

Filtering by Numerical Comparison

get lines with large frequencies. Recall bible.hist contains the words in the Bible and their frequencies

awk '$1 > 100 {print $0}' bible.hist

awk '$1 > 100 {print}' bible.hist

awk '$1 > 100' bible.hist

operators:
>, <, >=, <=, ==, !=, &&, ||
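
The comparison operators can be combined with && (and) and || (or). A sketch on a made-up four-line histogram, toy.hist:

```shell
# Hypothetical toy histogram in "count word" form.
printf '7943 a\n234 A\n350 Aaron\n2 Aaronites\n' > toy.hist

# Middle of the range: count at least 100 and below 1000.
awk '$1 >= 100 && $1 < 1000' toy.hist

# Both extremes: very rare or very frequent. Print only the word.
awk '$1 < 10 || $1 > 1000 {print $2}' toy.hist
```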

Exercise 4: How many bigrams appear more than 10 times?

Filtering by String Comparison

sort -u bible.words > bible.types

Find palindromes

rev < bible.types | paste - bible.types | awk '$1 == $2'

a       a
A       A
aha     aha
deed    deed
did     did
...

== works on strings
paste joins the corresponding lines of its inputs with a tab
- stands for standard input
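
The role of "-" is easier to see on four words. This sketch uses a made-up file toy.types: the reversed words arrive on standard input and fill the first column, the original words fill the second.

```shell
# Hypothetical toy type list, one word per line.
printf 'aha\ndam\ndeed\nmad\n' > toy.types

# Column 1: each word reversed (read from stdin via "-").
# Column 2: the original word.
rev < toy.types | paste - toy.types

# Keep the lines where both columns are the same string.
rev < toy.types | paste - toy.types | awk '$1 == $2 {print $2}'
```

Only "aha" and "deed" survive the filter. The pair "dam" and "mad" is not a palindrome, although each is the other spelled backwards.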

Find words that can also be spelled backwards

rev < bible.types | cat - bible.types | sort | uniq -c | awk '$1 >= 2 {print $2}'
a
A
ah
aha
dam
deed
deeps
...

Filtering by Regular Expression Matching

lookup words ending in "ed"

awk '$2~/ed$/' bible.hist
grep 'ed$' bible.hist

count "ed" words (by token)

awk '$2~/ed$/ {x = x + $1} END{print x}' bible.hist

tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ed$' | wc -l

count "ed" words (by type)

awk '$2~/ed$/ {x = x + 1} END{print x}' bible.hist

tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ed$' | sort | uniq -c | wc -l

count "ed" words both ways

awk '/ed$/ {token = token + $1;
            type = type + 1}
     END   {print token, type}' bible.hist

awk '/ed$/ {token += $1; type++}
     END   {print token, type}' bible.hist
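
The difference between the two counts shows up on a very small histogram. A sketch with a made-up toy.hist, matching on the second field only:

```shell
# Hypothetical toy histogram: three "ed" types, one other word.
printf '3 walked\n1 talked\n5 the\n2 bed\n' > toy.hist

# token adds up the frequencies (3 + 1 + 2), type counts the matching lines.
awk '$2 ~ /ed$/ {token += $1; type++}
     END        {print token, type}' toy.hist
```

The program prints the token count followed by the type count. Note that "bed" is counted as well, since the pattern looks at spelling only.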

Exercise 5: It is said that English avoids sequences of -ing words. Find bigrams where both words end in -ing. Do these count as counter-examples to the -ing -ing rule? For comparison's sake, find bigrams where both words end in -ed. Should there also be a prohibition against -ed -ed? Are there any examples of -ed -ed in the Bible? If so, how many? Which verse(s)?

Arrays

Two programs for counting word frequencies:

tr -sc 'A-Za-z' '\012' < bible.clean | sort | uniq -c

tr -sc 'A-Za-z' '\012' < bible.clean | awk '{ freq[$0]++ }; END{for(w in freq) print freq[w], w }'

Arrays are really hashtables

They grow as needed.

They take strings (and numbers) as keys.
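
A sketch of the array version on six made-up tokens. The array freq is never declared or sized: each new word creates its own entry the first time it is seen. The order of "for (w in freq)" is not defined, which is why the output is sorted afterwards.

```shell
# Six hypothetical tokens, one per line, counted with an awk array.
printf 'the\ncat\nthe\ndog\nthe\ncat\n' |
awk '{ freq[$0]++ }                            # the word itself is the key
     END { for (w in freq) print freq[w], w }' |
sort -nr
```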

Mutual Info: An Example of Arrays

   I(x;y) = log2 [ Pr(x,y) / (Pr(x) Pr(y)) ]

   I(x;y) ~ log2 [ N f(x,y) / (f(x) f(y)) ]
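
The approximation can be checked by hand on invented numbers. In the sketch below all four counts are hypothetical and chosen so that the result is a whole number; the variable names are made up for the illustration.

```shell
# Hypothetical counts, chosen so the arithmetic is easy to follow.
N=1024      # number of words in the corpus
fxy=8       # frequency of the bigram "x y"
fx=16       # frequency of x
fy=64       # frequency of y

# N * f(x,y) / (f(x) * f(y)) = 8192 / 1024 = 8, and log2(8) = 3.
# awk's log() is the natural logarithm, hence the division by log(2).
mi=$(awk -v N="$N" -v fxy="$fxy" -v fx="$fx" -v fy="$fy" \
    'BEGIN { printf "%.2f", log(N * fxy / (fx * fy)) / log(2) }')
echo "$mi"
```

A positive value means the pair occurs more often than it would if x and y were independent; here it occurs 8 times more often.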

paste bible.words bible.nextwords | sort | uniq -c > bible.bigrams

N=`wc -l < bible.words`

cat bible.hist bible.bigrams |
awk -v N="$N" 'NF == 2 { f[$2] = $1 }
     NF == 3 { print log(N*$1/(f[$2]*f[$3]))/log(2), $2, $3 }'

Exercise 6: Mutual information is unstable for small bigram counts. Modify the previous program so that it doesn't produce any output when the bigram count is less than 5.