It is the regular expressions that make SED powerful and efficient. A number of complex tasks can be solved with regular expressions. Any command-line expert knows the power of regular expressions.
Like many other GNU/Linux utilities, SED supports regular expressions, which are often referred to as regex. This chapter describes regular expressions in detail. The chapter is divided into three sections: standard regular expressions, POSIX classes of regular expressions, and metacharacters.
Standard Regular Expressions
Start of line (^)
In regular expressions terminology, the caret(^) symbol matches the start of a line. The following example prints all the lines that start with the pattern “The”.
[jerry]$ sed -n '/^The/ p' books.txt
On executing the above code, you get the following result:
The Two Towers, J. R. R. Tolkien
The Alchemist, Paulo Coelho
The Fellowship of the Ring, J. R. R. Tolkien
The Pilgrimage, Paulo Coelho
End of Line ($)
End of line is represented by the dollar($) symbol. The following example prints the lines that end with “Coelho”.
[jerry]$ sed -n '/Coelho$/ p' books.txt
On executing the above code, you get the following result:
The Alchemist, Paulo Coelho
The Pilgrimage, Paulo Coelho
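The two anchors can also be combined. As a minimal sketch (the sample input and the variable name `out` are made up for illustration), the pattern /^$/ matches only lines with nothing between the start and the end, so the `d` command removes blank lines:

```shell
# Hypothetical three-line input with an empty line in the middle.
# /^$/ matches a line only if it has no characters at all.
out=$(printf 'first\n\nsecond\n' | sed '/^$/d')
printf '%s\n' "$out"
```

Only the two non-empty lines remain in the output.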
Single Character (.)
The dot (.) matches any single character except the end-of-line character. The following example prints all three-letter words that end with the character “t”.
[jerry]$ echo -e "cat\nbat\nrat\nmat\nbatting\nrats\nmats" | sed -n '/^..t$/p'
On executing the above code, you get the following result:
cat
bat
rat
mat
Match Character Set ([])
In regular expression terminology, a character set is represented by square brackets ([]). It is used to match only one out of several characters. The following example matches the patterns “Call” and “Tall” but not “Ball”.
[jerry]$ echo -e "Call\nTall\nBall" | sed -n '/[CT]all/ p'
On executing the above code, you get the following result:
Call
Tall
Exclusive Set ([^])
In an exclusive set, a caret placed immediately after the opening bracket negates the set of characters in the square brackets. The following example prints only “Ball”.
[jerry]$ echo -e "Call\nTall\nBall" | sed -n '/[^CT]all/ p'
On executing the above code, you get the following result:
Ball
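An exclusive set still has to match one character. The following sketch, which uses made-up input lines and a hypothetical variable name, shows that a bare “all” is not printed because nothing precedes it, while “Small” is printed because “m” precedes “all”:

```shell
# [^CT] must consume exactly one character that is neither C nor T.
# "all" has no character before it, so it does not match.
out=$(printf 'Call\nBall\nall\nSmall\n' | sed -n '/[^CT]all/p')
printf '%s\n' "$out"
```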
Character Range ([-])
When a character range is provided, the regular expression matches any character within the range specified in square brackets. The following example matches “Call” and “Tall” but not “Ball”.
[jerry]$ echo -e "Call\nTall\nBall" | sed -n '/[C-Z]all/ p'
On executing the above code, you get the following result:
Call
Tall
Now let us modify the range to “A-P” and observe the result.
[jerry]$ echo -e "Call\nTall\nBall" | sed -n '/[A-P]all/ p'
On executing the above code, you get the following result:
Call
Ball
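Which characters fall inside a range can depend on the active locale. A cautious sketch, assuming you want plain ASCII ordering, is to force the C locale for the SED invocation (the variable name `out` is only for illustration):

```shell
# LC_ALL=C makes [A-P] mean the ASCII characters A through P.
out=$(printf 'Call\nTall\nBall\n' | LC_ALL=C sed -n '/[A-P]all/p')
printf '%s\n' "$out"
```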
Zero or One Occurrence (\?)
In GNU SED, the escaped question mark (\?) matches zero or one occurrence of the preceding character. The following example matches “Behaviour” as well as “Behavior”. Here, “\?” makes the character “u” optional.
[jerry]$ echo -e "Behaviour\nBehavior" | sed -n '/Behaviou\?r/ p'
On executing the above code, you get the following result:
Behaviour
Behavior
One or More Occurrence (\+)
In GNU SED, the escaped plus symbol (\+) matches one or more occurrences of the preceding character. The following example prints the lines that contain one or more occurrences of “2”.
[jerry]$ echo -e "111\n22\n123\n234\n456\n222" | sed -n '/2\+/ p'
On executing the above code, you get the following result:
22
123
234
222
Zero or More Occurrence (*)
The asterisk (*) matches zero or more occurrences of the preceding character. The following example matches “ca”, “cat”, “catt”, and so on.
[jerry]$ echo -e "ca\ncat" | sed -n '/cat*/ p'
On executing the above code, you get the following result:
ca
cat
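Because “zero occurrences” is allowed, /cat*/ matches any line that merely contains “ca”. The sketch below uses made-up input and hypothetical variable names to show this, and then shows one portable way to require at least one “t” by writing the character once before the starred copy:

```shell
# "car" is printed too: t* matches zero t's right after "ca".
zero_or_more=$(printf 'ca\ncat\ncar\ndog\n' | sed -n '/cat*/p')

# "catt*" means one t followed by zero or more t's, i.e. at least one t.
one_or_more=$(printf 'ca\ncat\ncatt\n' | sed -n '/catt*/p')

printf '%s\n' "$zero_or_more" "$one_or_more"
```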
Exactly N Occurrences {n}
{n} matches exactly “n” occurrences of the preceding character. The following example prints only three-digit numbers. Before running it, create the following file, which contains one number per line.
[jerry]$ cat numbers.txt
On executing the above code, you get the following result:
1
10
100
1000
10000
100000
1000000
10000000
100000000
1000000000
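The chapter does not show how numbers.txt is created. One possible way, offered only as a sketch, is to let printf write each argument on its own line in the current directory:

```shell
# '%s\n' is reused for every argument, so each number lands on its own line.
printf '%s\n' 1 10 100 1000 10000 100000 1000000 10000000 100000000 1000000000 > numbers.txt
```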
Let us write the SED expression.
[jerry]$ sed -n '/^[0-9]\{3\}$/ p' numbers.txt
On executing the above code, you get the following result:
100
Note that the pair of curly braces is escaped by the “\” character.
At least n Occurrences {n,}
{n,} matches at least “n” occurrences of the preceding character. The following example prints all the numbers that have five or more digits.
[jerry]$ sed -n '/^[0-9]\{5,\}$/ p' numbers.txt
On executing the above code, you get the following result:
10000
100000
1000000
10000000
100000000
1000000000
M to N Occurrences {m,n}
{m,n} matches at least “m” and at most “n” occurrences of the preceding character. The following example prints all the numbers that have at least five digits but not more than eight digits.
[jerry]$ sed -n '/^[0-9]\{5,8\}$/ p' numbers.txt
On executing the above code, you get the following result:
10000
100000
1000000
10000000
Pipe (|)
In SED, the pipe character behaves like a logical OR: it matches the expression on either side of the pipe. The following example matches either “str1” or “str3”.
[jerry]$ echo -e "str1\nstr2\nstr3\nstr4" | sed -n '/str\(1\|3\)/ p'
On executing the above code, you get the following result:
str1
str3
Note that the parentheses and the pipe (|) are each escaped with the “\” character.
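The backslashes can be avoided by switching to extended regular expressions with the -E option, which GNU SED and the BSD variants accept. The following is a minimal sketch that repeats three of the earlier examples in that syntax; the variable names are hypothetical and the third input is shortened for illustration:

```shell
# With -E, the characters ? + | ( ) { } act as operators without a backslash.
alt=$(printf 'str1\nstr2\nstr3\nstr4\n' | sed -E -n '/str(1|3)/p')
opt=$(printf 'Behaviour\nBehavior\n' | sed -E -n '/Behaviou?r/p')
rep=$(printf '1\n100\n1000\n' | sed -E -n '/^[0-9]{3}$/p')
printf '%s\n' "$alt" "$opt" "$rep"
```

In this mode a literal question mark, plus, pipe, parenthesis, or brace has to be escaped instead.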
Escaping Characters
Some special characters are written as escape sequences. For example, newline is represented by “\n”, carriage return is represented by “\r”, and so on. To match such a sequence as ordinary text, escape it with the backslash (\) character. This section illustrates escaping of special characters.
Escaping “\”
The following example matches the pattern “\”.
[jerry]$ echo 'str1\str2' | sed -n '/\\/ p'
On executing the above code, you get the following result:
str1\str2
Escaping “\n”
The following example matches the literal two-character sequence “\n” (a backslash followed by the letter “n”), not an actual newline character.
[jerry]$ echo 'str1\nstr2' | sed -n '/\\n/ p'
On executing the above code, you get the following result:
str1\nstr2
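The difference between the literal text “\n” and a real newline can be sketched as follows. The variable names are hypothetical, and printf is used in place of echo only because its handling of backslashes is the same in every shell:

```shell
# printf '%s\n' prints its argument verbatim, so the input is ONE line
# that contains a backslash followed by the letter n.
literal=$(printf '%s\n' 'str1\nstr2' | sed -n '/\\n/p')

# Here the input is TWO lines. N appends the next line to the pattern
# space, and \n in the regex then matches the embedded newline.
joined=$(printf 'str1\nstr2\n' | sed -n 'N; /str1\nstr2/p')

printf '%s\n' "$literal" "$joined"
```

The first command prints the single line unchanged; the second prints both input lines, because the newline between them was matched inside the pattern space.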
Escaping “\r”
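The following example matches the literal two-character sequence “\r” (a backslash followed by the letter “r”), not an actual carriage return character.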
[jerry]$ echo 'str1\rstr2' | sed -n '/\\r/ p'
On executing the above code, you get the following result:
str1\rstr2
Escaping “\dnnn”
This matches a character whose decimal ASCII value is “nnn”. The following example matches only the character “a”.
[jerry]$ echo -e "a\nb\nc" | sed -n '/\d97/ p'
On executing the above code, you get the following result:
a
Escaping “\onnn”
This matches a character whose octal ASCII value is “nnn”. The following example matches only the character “b”.
[jerry]$ echo -e "a\nb\nc" | sed -n '/\o142/ p'
On executing the above code, you get the following result:
b
Escaping “\xnn”
This matches a character whose hexadecimal ASCII value is “nn”. The following example matches only the character “c”.
[jerry]$ echo -e "a\nb\nc" | sed -n '/\x63/ p'
On executing the above code, you get the following result:
c
POSIX Classes of Regular Expressions
Certain reserved names have a special meaning when they appear inside square brackets. These names are referred to as POSIX classes of regular expressions. This section describes the POSIX classes supported by SED.
[:alnum:]
It matches alphabetical and numeric characters. The following example matches only “One” and “123”, but does not match the tab character.
[jerry]$ echo -e "One\n123\n\t" | sed -n '/[[:alnum:]]/ p'
On executing the above code, you get the following result:
One
123
[:alpha:]
It implies alphabetical characters only. The following example matches only the word “One”.
[jerry]$ echo -e "One\n123\n\t" | sed -n '/[[:alpha:]]/ p'
On executing the above code, you get the following result:
One
[:blank:]
It matches a blank character, which can be either a space or a tab. The following example matches only the tab character.
[jerry]$ echo -e "One\n123\n\t" | sed -n '/[[:blank:]]/ p' | cat -vte
On executing the above code, you get the following result:
^I$
Note that the command “cat -vte” is used to show tab characters (^I).
[:digit:]
It matches decimal digits only. The following example matches only the line “123”.
[jerry]$ echo -e "abc\n123\n\t" | sed -n '/[[:digit:]]/ p'
On executing the above code, you get the following result:
123
[:lower:]
It implies lowercase letters only. The following example matches only “one”.
[jerry]$ echo -e "one\nTWO\n\t" | sed -n '/[[:lower:]]/ p'
On executing the above code, you get the following result:
one
[:upper:]
It implies uppercase letters only. The following example matches only “TWO”.
[jerry]$ echo -e "one\nTWO\n\t" | sed -n '/[[:upper:]]/ p'
On executing the above code, you get the following result:
TWO
[:punct:]
It matches punctuation marks, that is, printable characters that are neither spaces nor alphanumeric characters. The following example prints only the line that contains a comma.
[jerry]$ echo -e "One,Two\nThree\nFour" | sed -n '/[[:punct:]]/ p'
On executing the above code, you get the following result:
One,Two
[:space:]
It implies whitespace characters. The following example illustrates this.
[jerry]$ echo -e "One\n123\f\t" | sed -n '/[[:space:]]/ p' | cat -vte
On executing the above code, you get the following result:
123^L^I$
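A class name is only valid inside an outer pair of square brackets, which also means that several classes can share one bracket expression. The following sketch uses made-up input and hypothetical variable names; the first command prints lines made up solely of digits and uppercase letters, and the second collapses each run of whitespace into a single space:

```shell
# One bracket expression holding two classes: digits or uppercase letters.
classes=$(printf 'abc\n123\nXYZ\n12AB\n' | sed -n '/^[[:digit:][:upper:]]*$/p')

# One whitespace character followed by zero or more others becomes one space.
squeezed=$(printf 'a \t  b\n' | sed 's/[[:space:]][[:space:]]*/ /g')

printf '%s\n' "$classes" "$squeezed"
```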
Metacharacters
In addition to traditional regular expressions, SED supports metacharacters in the style of Perl regular expressions. Note that metacharacter support is GNU SED specific and may not work with other variants of SED. Let us discuss metacharacters in detail.
Word Boundary (\b)
In regular expression terminology, “\b” matches the word boundary. For example, “\bthe\b” matches “the” but not “these”, “there”, “they”, “then”, and so on. The following example illustrates this.
[jerry]$ echo -e "these\nthe\nthey\nthen" | sed -n '/\bthe\b/ p'
On executing the above code, you get the following result:
the
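Word boundaries are especially useful in substitutions, where they prevent a replacement from touching longer words. A minimal sketch, assuming GNU SED and a made-up input sentence:

```shell
# Only the standalone word "the" is replaced; "theme" and "thesis" are untouched.
out=$(printf '%s\n' 'the theme of the thesis' | sed 's/\bthe\b/THE/g')
printf '%s\n' "$out"
```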
Non-Word Boundary (\B)
In regular expression terminology, “\B” matches non-word boundary. For example, “the\B” matches “these” and “they” but not “the”. The following example illustrates this.
[jerry]$ echo -e "these\nthe\nthey" | sed -n '/the\B/ p'
On executing the above code, you get the following result:
these
they
Single Whitespace (\s)
In SED, “\s” implies single whitespace character. The following example matches “Line\t1” but does not match “Line1”.
[jerry]$ echo -e "Line\t1\nLine2" | sed -n '/Line\s/ p'
On executing the above code, you get the following result:
Line 1
Single Non-Whitespace (\S)
In SED, “\S” implies a single non-whitespace character. The following example matches “Line2” but does not match “Line\t1”.
[jerry]$ echo -e "Line\t1\nLine2" | sed -n '/Line\S/ p'
On executing the above code, you get the following result:
Line2
Single Word Character (\w)
In SED, “\w” implies single word character, i.e., alphabetical characters, digits, and underscore (_). The following example illustrates this.
[jerry]$ echo -e "One\n123\n1_2\n&;#" | sed -n '/\w/ p'
On executing the above code, you get the following result:
One
123
1_2
Single Non-Word Character (\W)
In SED, “\W” implies single non-word character which is exactly opposite to “\w”. The following example illustrates this.
[jerry]$ echo -e "One\n123\n1_2\n&;#" | sed -n '/\W/ p'
On executing the above code, you get the following result:
&;#
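Since these metacharacters are GNU SED specific, a script that must run on other variants could fall back on POSIX bracket expressions. The sketch below assumes that [[:alnum:]_] is an acceptable stand-in for “\w” and [[:space:]] for “\s”; the variable names are hypothetical:

```shell
# Portable stand-in for \w: an alphanumeric character or an underscore.
word=$(printf 'One\n123\n1_2\n&;#\n' | sed -n '/[[:alnum:]_]/p')

# Portable stand-in for \s: any whitespace character (here, a tab).
space=$(printf 'Line\t1\nLine2\n' | sed -n '/Line[[:space:]]/p')

printf '%s\n' "$word" "$space"
```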
Beginning of Pattern Space (\`)
In SED, “\`” implies the beginning of the pattern space. The following example matches only the word “One”.
[jerry]$ echo -e "One\nTwo One" | sed -n '/\`One/ p'
On executing the above code, you get the following result:
One