Script utilities in depth: grep, awk, and sed
Utility Players
A variety of scripts rely on common Linux commands as well as Bash scripting features to accomplish their tasks. For example, the code example in Listing 1 displays the current date and time in a specified city.
Listing 1: Display Date and Time
if [ "$1" = "" ]; then
  echo "City name required."
  exit
fi
city=`echo $1 | sed -e 's/ /_/g'`
z=`find /usr/share/zoneinfo -type f | \
   grep -Ev 'posix|right|Etc|SystemV' | \
   grep -i $city`
echo -n "Time in $1: "; TZ=`echo $z` date
The first section of this code uses an if statement to ensure that the script's argument was not omitted. The next command defines the variable city as the script's first argument, transformed so that spaces are replaced by underscores via sed. Next, the variable z is constructed from a sequence of three commands: the find command locates all time zone definition files, the first grep command removes some unwanted entries from that list, and the final grep extracts the entry corresponding to the specified city.

The final command displays the desired information in an attractive format. This small snippet illustrates the way Bash control structures and commands like grep can be combined to perform a specific job. Here, grep operates as a tool that extracts the desired portions from some larger block of information, providing utility and functionality to the Bash scripting environment (analogous to a utility infielder in baseball).
In this column, I'll take a closer, more detailed look at three of these important tools, which are useful both at the command line and within scripts: grep, awk, and sed.
Locate and Scrutinize with grep
The grep command's function is to locate lines matching a specified pattern within one or more files (or standard input). It has the following general syntax:

grep regular-expression file(s)

When grep is included as the second or later segment of a pipe, only the first argument is used.

What grep should search for is a pattern specified via a regular expression. Regular expressions – a somewhat odd name, by the way – are a scheme for specifying complex search patterns. They are built from literal characters (letters, numbers, symbols, etc.) and a small set of characters given special meanings.
Regular expression components, summarized in Table 1, are used by several of the commands I will consider here.
Table 1: Regular Expressions Summary
Pattern Building Block | What to Match
c | Match that character
\c | Match a literal instance of the special character c
^ | Match beginning of line
$ | Match end of line
. | Match any character
[list] | Match any listed character (no separators); can include ranges (e.g., [a-z])
[^list] | Match any character not in the character list
[]list] | Place a literal ] in a character list by listing it first
pattern1|pattern2 | Match either pattern 1 or pattern 2
(pattern) | Used for grouping and back references
\n | Back reference: Expands to whatever matched the nth parenthesized pattern

Suffixes | How Many to Match
? | 0 or 1
* | 0 or more
+ | 1 or more
{n} | Exactly n
{n,} | n or more
{,n} | n or fewer
{n,m} | At least n and at most m
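A few of these building blocks can be exercised directly at the command line. The following sketch (the sample file name and its contents are invented for illustration) shows anchors, a character list, and alternation at work:

```shell
# Create a three-line sample file (hypothetical contents)
printf 'cat\nconcatenate\nCats rule\n' > /tmp/pets.txt

grep -E '^cat$' /tmp/pets.txt     # anchors: matches only the line "cat"
grep -E '[cC]at' /tmp/pets.txt    # character list: matches all three lines
grep -E 'ate|rule' /tmp/pets.txt  # alternation: "concatenate" and "Cats rule"
```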
The simplest regular expression is just a literal string of one or more characters that you want to look for. For example, these two grep commands search a file for lines containing the letter "e" and the string "cat":

grep e somefile
grep cat somefile

One regular expression construct that might be new to readers is the curly brace. It indicates a specific numeric range for the item to which it is attached. Here are some examples:
- _{3,} – Three or more consecutive underscores
- ·{2,5} – Between two and five spaces in a row (where · represents a space)
- (the·){2} – Two consecutive instances of "the "
Regular expressions are case sensitive. You can perform case-insensitive searches with grep by including the -i option. However, if you want to search for lines containing the word "cat" with a capital or lowercase "c" (but the rest of the word lowercase), you would use this search expression:

grep "[cC]at" somefile

Note that you now must enclose the regular expression in quotation marks because the square brackets also have special meaning to the Bash shell, and you need to pass them on unaltered to grep.
The following are the most useful grep command-line options:

- -v – Display only non-matching lines.
- -i – Perform case-insensitive comparisons.
- -c – Display only the number of matching lines.
- -l, -L – Display only the names of files with/without matches.
- -F – Treat the search pattern as a literal string.
- -E – Use extended regular expressions.
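Here is a quick sketch of several of these options in action (the file names and contents are invented):

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/a.txt
printf 'delta\nepsilon\n'     > /tmp/b.txt

grep -c  a     /tmp/a.txt            # count matching lines: 3
grep -vc a     /tmp/b.txt            # count lines without "a": 1 (epsilon)
grep -il ALPHA /tmp/a.txt /tmp/b.txt # case-insensitive, names only: /tmp/a.txt
```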
On most current Linux systems, the distinction between ordinary and extended regular expressions is no longer relevant, but I still alias grep to grep -E out of habit.
The following small script illustrates the use of the -c option. It waits for a running instance of the gauss application to complete and then starts another one:

x=1
while [ $x -gt 0 ]; do
  sleep $waittime
  x=`ps aux | grep -c "[g]auss"`
done
/apps/bin/gauss $1 &
The script begins by setting the variable x to 1. Next, a while loop continues until x becomes 0. The body of the loop sleeps for a specified period and then searches the output of the ps command for the application name (gauss).

The grep command does the search and returns the number of matching lines (processes); x is set to that value. When x is 0, it means that grep found no matching lines in the ps output, indicating that the application is no longer running. At this point, the loop ends, and a new process is started.
Although they might seem a bit artificial, the next examples will illustrate important concepts about ordering and exclusion within regular expressions. These examples all search the following lines:
1 all of the eggs inside our fridge are old.
2 all eggs in our fridge were white.
3 but an egg is foul.
4 the old egg is awful.
If you want to match lines containing all five vowels, appearing in alphabetical order but allowing intervening letters, you can use the search expression:
"a.*e.*i.*o.*u"
This would match lines 1, 2, and 3. If you want lines containing all five vowels in order with only consonants between them, you would use the regular expression:
"a[^iou]*e[^aou]*i[^aeu]*o[^aei]*u"
Now, each vowel is followed by a bracket expression excluding all vowels other than itself and the next one in alphabetic sequence. This would match lines 2 and 3.
If you want only one instance of each vowel within the matched portion, you would include the current vowel within the square brackets (i.e., a[^aiou] and so on). Such an expression would still match lines 2 and 3. If you want no other vowels before/after the matching section, you would add to the beginning/end of the search string:
^[^aeiou]*vowel-search-expression[^aeiou]*$
The ^ beginning-of-line character is followed by the expression for zero or more (*) non-vowel characters (not a, e, i, o, or u), and the $ end-of-line character is similarly preceded by it. With this modified regular expression, grep would no longer find any matching lines.
Matching lines containing all five vowels in any order is more verbose:
grep a eggfile | grep e | grep i | grep o | grep u
This would match all four lines. However, grep is not the best tool for this job. I'll show a much more compact solution with sed a bit later.
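These examples are easy to reproduce. The following sketch recreates the four sample lines (minus their reference numbers) in a temporary file and checks both counts:

```shell
cat > /tmp/eggfile <<'EOF'
all of the eggs inside our fridge are old.
all eggs in our fridge were white.
but an egg is foul.
the old egg is awful.
EOF

grep -cE 'a.*e.*i.*o.*u' /tmp/eggfile                       # vowels in order: 3
grep a /tmp/eggfile | grep e | grep i | grep o | grep -c u  # any order: 4
```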
The final set of grep examples is taken from a script that removes HTML tags from text and reformats it for importing into a page layout program. I will operate on the short sample HTML file in Listing 2. The script uses a regular expression to locate specific items, which are then transformed by other commands.
Listing 2: A Sample HTML File
01 <h1>Some Simple HTML</h1>
02 <p>Here is some simple HTML.</p>
03 <p class="listintro">These are some of
04 my favorite paint colors:</p>
05 <ul>
06 <li>Alizarin Crimson</li>
07 <li>Naples Yellow</li>
08 <li>Phthalo Blue
09 (Green Shade)</li></ul>
10 <p class="end">This is the final bit.</p>
The following command will locate the beginning of each ordinary paragraph within the file:
grep "<p" something.htm
This command will match lines 2, 3, and 10. Note that I do not specify the entire opening tag because it can take different forms depending on whether it includes options (e.g., compare the tags in lines 2 and 3).
The following command will return the number of bulleted lists within a series of HTML files,
grep -c "<ul" *.htm
and it returns 1 for the sample file. The following commands will return the number of bulleted items that span multiple lines. The first grep command locates the opening tag for a bullet item (<li>), and the second one returns the number of lines that do not contain the closing tag (</li>):

grep "<li>" something.htm | grep -c -v "</li>"

This command returns 1 for the sample file (corresponding to line 8).
The following command finds all lines that begin with an opening tag and end with the corresponding closing tag:
grep -E "^<([^/ >]+)[^>]*>.*</\1>$" something.htm
The command displays lines 1, 2, 6, 7, and 10. Here is a detailed look at the regular expression:
- ^< – Beginning of line and opening angle bracket.
- ([^/ >]+) – One or more characters other than a forward slash, a space, or a closing angle bracket. The expression is enclosed in parentheses for later back reference. It corresponds to the tag name, which is terminated by the closing angle bracket or a space, depending on whether the tag includes options.
- [^>]* – Zero or more characters other than a closing angle bracket. This will match any options present within the tag.
- >.* – Closing angle bracket plus zero or more additional characters.
- </\1>$ – Opening angle bracket and forward slash, followed by the tag name matched previously (\1) and a closing angle bracket, all appearing at the end of the line.
I'll conclude this look at grep with a bit of grep trivia: some official regular expression jargon. An individual item within an expression is known as an atom. The different patterns connected by a vertical bar (OR) are called branches. Suffixes are used to specify bounds for their components. Square bracket expressions define character classes.
Wield awk with Panache
Awk is another of the Linux utilities that finds frequent use within Bash scripts. My most frequent use of it is to extract certain items from the output of other commands. For example, the following command displays the current time zone by extracting the fifth field from the default output of the date command:
$ date
Fri Apr 1 10:50:29 EDT 2011
$ date | awk '{print $5}'
EDT
Fields are separated by whitespace by default, but you can specify the field separator character with awk's -F option (as I'll show).
The awk command takes a set of internal commands to execute – known as an awk script – as its first argument, followed by one or more file names when not part of a pipe. Alternatively, the script can be stored in an external file and specified to awk with the -f option. Awk has a rich built-in command set that is far too extensive to cover exhaustively here. Instead, I'll look at its most-used features in simple commands and scripts. See the awk documentation for information about all of its features and capabilities.
The general syntax of an awk script statement is
match-expression {action}
where the first item is an expression against which each input line is tested. This expression can be a regular expression enclosed in forward slashes, a conditional expression, a keyword, or one of several other formats. When a line matches, the operations enclosed in curly braces are executed. The match expression is optional; omitting it applies the operations to every input line. Multiple statements can be included in an awk script, and multiple actions can be included within curly braces; the separator character in both cases is the semicolon.
Here is an example of awk that includes a match expression. This command displays the program and username for all processes whose command path includes the string "games":
$ ps -ef | awk '/games/ {print $8 "\t" $1}'
/usr/games/blackjack wong
/usr/games/bridge sharif
/apps/econ/games2 smith
/usr/games/same-gnome chavez
If the intent of this command is to list all users running a game application, then the search pattern should include the forward slashes (to eliminate the false positive in the third output line). I can also add a second awk command to the pipe to make the output less verbose:

$ ps -ef |\
awk '/\/games\// {print $8 "\t" $1}' |\
awk -F/ '{print $NF}'
blackjack wong
...

Note that the forward slashes that are part of the match expression must be escaped with a backslash. In the second awk command, $NF selects the last of the /-separated fields, which holds the program name plus the tab and username (the tab is not a field separator here).
The print statement is the most used statement in awk. It takes a series of items to display. Literal strings are enclosed in double quotation marks (e.g., the tab character in the preceding example).

If the items to be displayed are separated by commas, then awk places the output field separator between them (by default, a single space). For example, the following command generates a cp command that will create a new file of the same name plus .old for each file listed by ls. The resulting commands are sent on to the Bash shell for execution:
ls *.dat | awk '{print "cp",$1,$1".old"}' | bash
Alternatively, I could have included the spaces needed in the command explicitly in the list to print and then omitted the commas between items:

{print "cp " $1 " " $1".old"}
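The separator awk inserts between comma-separated print items is held in the built-in variable OFS (output field separator), which you can reset, typically in a BEGIN block:

```shell
echo 'alpha beta gamma' | awk '{print $3, $1}'
# default OFS (a space) between the items

echo 'alpha beta gamma' | awk 'BEGIN {OFS="-"} {print $3, $1}'
# the same print with OFS set to a hyphen
```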
Awk provides the special match keywords BEGIN and END, which are used to specify actions to be performed before processing the first line and after processing the final line, respectively:

cat somefile |\
awk '{ if (x < length()) x=length() };\
END { print "max line length is " x }'
This command displays the maximum line length within the specified file. The awk script has two statements. The first does not include any match pattern, so it operates on every line and tests whether the internal awk script variable x is less than the length of the current line; if so, it resets x to the line length with the internal awk function length. After all lines have been examined, the END statement is executed; it uses print to display the final value of x.
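You can verify the behavior by piping a few lines of input in directly:

```shell
# Line lengths are 3, 7, and 2; the longest wins.
printf 'one\nthree33\nxx\n' |
awk '{ if (x < length()) x = length() }; END { print "max line length is " x }'
```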
Awk is also useful for adding up numerical values within output. For example, the following script displays the total space used on all local disks for a specified user:
find / -user $1 -xdev ! -path /dev/\* -ls |\
awk '{sum+=$7}; \
END {print $5 "\t" sum/(1024*1024*1024) "GB"}'
The find command locates all files owned by the user specified as the script's argument on local disks (-xdev) that are not device files (! -path /dev/\*) and produces a long directory listing for each one. The awk command adds up the file size from each line, accumulating the total in the internal variable sum. Once all of the output from find has been processed, awk prints the username (found in the fifth field of every line) and the sum of all the file sizes (converted to and labeled GB), separating the two items of information with a tab. Invoking this script for user chavez yields:
chavez 12.12537GB
Such commands are often useful in loops over several usernames, as in the following example, which summarizes current CPU and memory usage by user:

echo "User CPU% Mem%"
for u in $*; do
  ps aux | grep ^$u | awk '{c+=$3}; {m+=$4}; \
  END {printf "%s%s%4.1f%s%4.1f%s", $1,"\t",c,"\t",m,"\n"}'
done
After printing an initial header line, the script loops over the list of users specified as the script's arguments. For each user, it extracts information about that user's processes from the output of ps aux using grep and then accumulates the total CPU and memory percentage usage in the internal variables c and m.

Once all lines are processed, the results are printed using awk's formatted printing statement printf, which takes a format string as its first argument, followed by the list of items to print. It is analogous to the command of the same name provided by Bash. Here is some sample output from the script:
User CPU% Mem%
chavez 14.1 18.5
keynes 5.7 8.3
My final awk example illustrates the use of the BEGIN keyword along with a logical expression as a match pattern. It searches the password file for alternative root entries: any user account with UID 0 but a username other than root. Here is an example of the type of line that it locates:

oops::000:100::/:/bin/sh

This password file entry allows someone to log in as the root-equivalent user oops without a password:
grep '.*:.*:00*:' /etc/passwd |\
awk -F: 'BEGIN {n=0}; $1!="root" {print $0; n=1} \
END {if (n==0) print "None found."}'
The grep command searches the password file for any lines containing one or more zeros enclosed by colons. The awk command begins by setting the value of the internal variable n to 0. It then compares the first field of each line (where fields are separated by colons via the -F option) with the string root. If the field is not root, the entire line is printed ($0), and n is set to 1. When all lines have been processed, the message "None found." is printed if n is still 0.
I could use a more complex matching pattern in the second awk statement and eliminate the need for grep, but then the code would be less readable.
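To try the check safely, run it against a hand-made copy of a password file rather than the real /etc/passwd (the entries below are invented; only the oops line should be reported):

```shell
cat > /tmp/passwd.test <<'EOF'
root:x:0:0:root:/root:/bin/bash
oops::000:100::/:/bin/sh
chavez:x:1000:100::/home/chavez:/bin/bash
EOF

# grep selects the root and oops lines (UID fields of all zeros);
# awk then filters out the legitimate root entry.
grep '.*:.*:00*:' /tmp/passwd.test |
awk -F: '$1!="root" {print $0; n=1} END {if (n==0) print "None found."}'
```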
Dynamic Transformations with sed
The third command in this scripting infield triad is sed. The sed utility is a stream editor designed to transform input according to the specifications it receives. The set of sed directives can be given on the command line or stored in a file specified to the command via its -f option.
The sed command will accept input file names on the command line; the transformed files are displayed on standard output. It can also edit files in place when given the -i option. This option accepts an optional string to append to the current file name, under which the original file will be saved (e.g., -i.orig). If the string is omitted, then no backup file will be created (a bit risky).
I've shown an example of sed already:

city=`echo $1 | sed -e 's/ /_/g'`
This command replaces spaces with underscores in the script's first argument and displays the resulting string. The -e option says to append the command that follows to the current list of commands (i.e., the sed script). When there is only a single command, this option is superfluous, but I often include it out of habit.
Like awk, sed has more features than can be covered in detail here, so I'll look at the most useful. Generally, sed statements consist of two parts: a range of lines to act on followed immediately by the action to perform. The syntax uses the same characters in multiple roles and can be less than intuitive at times. I'll consider these components in reverse order.
The sed actions are specified by a single character, possibly followed by arguments. The following are the most important:

- s/old/new/ – Substitute the first text portion in each line matching the old regular expression with the new text. The latter may contain back references. The conventional delimiters for the two components are forward slashes, but any character may be used; the character immediately following s is taken as the delimiter for that operation. You can also include a g following the final delimiter to perform the substitution globally: on all matches within each line. Including i in this location requests a case-insensitive comparison.
- d – Delete the selected range of lines. !d says not to delete the selected lines, which implicitly causes all other lines to be deleted.
- p – Print the specified range of lines. Typically used in combination with the -n option, which reverses sed's default behavior of printing every line. Combined with -n, !p suppresses the selected lines and prints everything else.
- N – Join the next line to the pattern space. Useful for locating patterns that span multiple lines.
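A few one-liners illustrate these actions (input is piped in, so no files are needed):

```shell
echo 'one two' | sed 's/o/0/g'              # global substitute: 0ne tw0
printf 'one\ntwo\nthree\n' | sed '/t/d'     # delete lines containing "t": one
printf 'one\ntwo\nthree\n' | sed -n '/t/p'  # print only those lines: two, three
```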
Line ranges are specified using the following components:

- /regex/ – Lines matching the specified regular expression.
- n – Line n (the first line of the input is line 1).
- $ – The last line.
- line1,line2 – All lines between line1 and line2, inclusive.
The following example contains two sed statements (separated by a semicolon), both of which include line ranges:

sed '1,/<body>/d; /<\/body>/,$d' somepage.htm

This command extracts and displays only the lines in the specified file following the <body> tag and before the corresponding closing tag. The first statement deletes lines starting at line 1 and ending with the first line containing <body>, and the second statement deletes all lines beginning with the first one containing </body> through the last line. Note that the forward slash in the second search string is preceded by a backslash.
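The command can be exercised against a minimal page (a sketch; the file contents here are invented):

```shell
cat > /tmp/somepage.htm <<'EOF'
<html>
<head><title>Test</title></head>
<body>
first line of body
second line of body
</body>
</html>
EOF

# Only the two lines between the body tags survive:
sed '1,/<body>/d; /<\/body>/,$d' /tmp/somepage.htm
```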
The following two equivalent sed commands remove all blank lines from the specified file:

sed -n '/^$/!p' sometext
sed '/^$/d' sometext
The following command will display lines in the specified file that match all of the specified regular expressions, in any order:

sed '/apple/!d;/orange/!d;/pear/!d' < fruits.txt

The first sed statement excludes lines not containing apple, the second those not containing orange, and the final one those not containing pear. This single command is equivalent to three piped grep commands.
The following commands list the date and time when a certain network error message was received. The relevant lines are located by the grep command (although once again this could be handled by sed as well, with a loss of readability). The sed command uses a regular expression including parentheses to locate and extract the desired information via its substitution command:

$ grep "Invalid query" /var/log/messages |\
sed -r 's/^([^:]*:[0-9][0-9]).*/\1/'
Apr 5 10:29
Mar 19 13:07
Mar 17 19:18
Mar 10 16:36
The parenthesized portion of the search expression matches all the characters up to but not including the first colon ([^:]*), followed by that colon and the two digits after it ([0-9][0-9]) – that is, the date plus the hours and minutes – marking the whole sequence for later reference. The remainder of the search expression (.*) matches all remaining characters on the line.

The replacement expression is simply a back reference to whatever matched the first (and only) parenthesized portion of the search expression, replacing the entire line with the marked substring and effectively stripping out all unwanted information in the line.
The sed command in the preceding example used the -r option, which ensures that all regular expressions are interpreted as extended regular expressions.
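A substitution of this kind can be tried on a single hand-written log line (the hostname, program, and message below are invented):

```shell
# Keep everything up to and including the minutes field of the timestamp.
echo 'Apr  5 10:29:03 myhost named[1234]: Invalid query from 10.0.0.5' |
sed -r 's/^([^:]*:[0-9][0-9]).*/\1/'
```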
The following sed command edits a group of files in place (without creating backup files). In lines beginning with a hash mark (#), possibly preceded by some leading spaces, sed replaces that hash mark and the following space or t (either uppercase or lowercase) with #p and adds an additional keyword after it:

sed -i '/^ *#/s/#[ tT] */#p iop1=timestamp /' *.job
My final sed examples come from the HTML-to-text conversion script mentioned previously. The first is a command that extracts all of the text enclosed in <p>...</p> tags:

... | sed -n '/<p[^>]*>/,/<\/p>/p'
The first line addressing pattern matches lines with the opening tag, and the second one matches lines containing the closing tag. Note that this command will not work when the opening and closing tags are on the same line (which does not occur in the actual HTML files processed with the script) because of the way sed processes line range expressions. After a line matching the start-of-range pattern is located, searching for the end-of-range pattern begins with the next input line.
This last example illustrates some of sed's most advanced features:

... | sed -r ':a;N;$!ba; s%\n\n+%=newpara=%g; s%\n% %g; s%  +% %g; s%=newpara=%\n%g'

The sed command includes several statements. The group immediately following the sed command verb uses an internal loop and the N directive to combine separate input lines into a single pattern space so that searches and substitutions can span line boundaries. Here are the meanings of these various segments:
- :a – Define the label "a" for later reference.
- N – Append the next line to the pattern space.
- $ – Match the final line of input.
- !ba – Don't branch to label "a" when the address expression is matched.
When all of the layers of negation are deconvoluted, the effect of the final two components is to jump back to the beginning of the sed script until the final line of input is encountered. The remaining four sed components are substitution operations.
The purpose of the substitutions is to combine all of the separate lines within a paragraph into a single line, where separate paragraphs are defined by blank lines. The operators do this through the following steps:
- Convert all sequences of one or more blank lines (recognized as two or more consecutive newlines) to a special string indicating a paragraph boundary (=newpara=).
- Change all remaining newlines to spaces.
- Squeeze all runs of two or more spaces to a single space.
- Change the paragraph boundary string back to a newline.
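The whole pipeline stage can be tested with a fragment of input typed inline – two lines belonging to one paragraph, a couple of blank lines, and a second paragraph:

```shell
# The two blank lines become a paragraph boundary; the line break
# within the first paragraph becomes a space.
printf 'line one\nline two\n\n\nnext para\n' |
sed -r ':a;N;$!ba; s%\n\n+%=newpara=%g; s%\n% %g; s%  +% %g; s%=newpara=%\n%g'
```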
I hope you've enjoyed this look at using grep, awk, and sed in scripts. If you have some fine examples of using these commands, drop me a line. I'd love to see them.