Infolinks

Monday 16 July 2012

What Are Regular Expressions?

What Are Regular Expressions?
A regular expression is a pattern template you define that a Linux utility Uses to filter text. A Linux utility (such as the sed editor or the gawk program)matches the regular expression pattern against data as that data flows Into the utility. If the data matches the pattern, it’s accepted for processing.
              If the data doesn’t match the pattern, it’s rejected. The regular expression pattern makes use of wildcard characters to represent one or more characters in the data stream.
Types of regular expressions:
There are two popular regular expression engines:
  • The POSIX Basic Regular Expression (BRE) engine
  • The POSIX Extended Regular Expression (ERE) engine
Defining BRE Patterns:
The most basic BRE pattern is matching text characters in a data stream.
Eg 1: Plain text
$ echo "This is a test" | sed -n ’/test/p’
This is a test.
$ echo "This is a test" | sed -n ’/trial/p’
$
$ echo "This is a test" | gawk ’/test/{print $0}’
This is a test.
$ echo "This is a test" | gawk ’/trial/{print $0}’
$
Eg 2: Special characters
The special characters recognized by regular expressions are:
.*[]^${}\+?|()
For example, if you want to search for a dollar sign in your text, just precede it with a backslash character:
$ cat data2
The cost is $4.00
$ sed -n ’/\$/p’ data2
The cost is $4.00
$
Eg 3: Looking for the ending
The dollar sign ($) special character defines the end anchor.
$ echo "This is a good book" | sed -n ’/book$/p’
This is a good book
$ echo "This book is good" | sed -n ’/book$/p’
$
Eg 4: Using ranges
You can use a range of characters within a character class by using the dash symbol.
Now you can simplify the zip code example by specifying a range of digits:
$ sed -n ’/^[0-9][0-9][0-9][0-9][0-9]$/p’ data8
60633
46201
45902
$
Extended Regular Expressions:
The POSIX ERE patterns include a few additional symbols that are used by some Linux applications and utilities. The gawk program recognizes the ERE patterns, but the sed editor doesn’t.
Eg 1: The question mark
The question mark indicates that the preceding character can appear zero or one time, but that’s all. It doesn’t match repeating occurrences of the character:
$ echo "bt" | gawk ’/be?t/{print $0}’
bt
$ echo "bet" | gawk ’/be?t/{print $0}’
Bet
$ echo "beet" | gawk ’/be?t/{print $0}’
$
$ echo "beeet" | gawk ’/be?t/{print $0}’
$
Eg 2: The plus sign
The plus sign indicates that the preceding character can appear one ormore times, but must be present at least once. The pattern doesn’t match if the character is not present:
$ echo "beeet" | gawk ’/be+t/{print $0}’
beeet
$ echo "beet" | gawk ’/be+t/{print $0}’
beet
$ echo "bet" | gawk ’/be+t/{print $0}’
bet
$ echo "bt" | gawk ’/be+t/{print $0}’
$
Eg 3: The pipe symbol
The pipe symbol allows to you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.
The format for using the pipe symbol is:
expr1|expr2|...
Here’s an example of this:
$ echo "The cat is asleep" | gawk ’/cat|dog/{print $0}’
The cat is asleep
$ echo "The dog is asleep" | gawk ’/cat|dog/{print $0}’
The dog is asleep
$ echo "The sheep is asleep" | gawk ’/cat|dog/{print $0}’
$
Eg 4: Grouping expressions
When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character.
For example:
$ echo "Sat" | gawk ’/Sat(urday)?/{print $0}’
Sat
$ echo "Saturday" | gawk ’/Sat(urday)?/{print $0}’
Saturday
$

No comments:

Post a Comment