Regular expressions
Regular Expressions are a pattern matching language that is part of many modern programming languages. Regular Expressions allow you to apply a pattern to an input string and return a list of the matches within the text. They also allow text to be replaced using replacement patterns, which makes them a very powerful version of find and replace.
This article introduces you to the Regular Expression syntax. Once you have learned the syntax you can use it in many different languages, as the syntax is fairly similar between languages.
The Regular Expression Designer
When learning Regular Expressions, it helps to have a tool that you can use to test Regex patterns. There is a Free Regular Expression Tool available that will help as you go through the article.
The basics - Finding text
Regular Expressions are similar to find and replace in that ordinary characters match themselves. If I want to match the word "went" the Regular Expression pattern would be "went".
Text: Anna Jones and a friend went to lunch
Regex: went
Matches: went
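As a minimal sketch, the same search can be tried from code. Python's built-in re module is used here purely as an assumption for illustration; the article itself targets .NET, where the options and method names differ.

```python
import re

text = "Anna Jones and a friend went to lunch"

# Ordinary characters match themselves, so the pattern is simply the word.
matches = re.findall(r"went", text)
print(matches)  # ['went']
```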
The following are special characters when working with Regular Expressions. They will be discussed throughout the article.
. $ ^ { [ ( | ) * + ? \
Matching any character with dot
The full stop or period character (.) is known as dot. It is a wildcard that will match any character except a new line (\n). For example, the following Regular Expression matches the 'a' character followed by any two characters.
Text: abc def ant cow
Regex: a..
Matches: abc, ant
If the Singleline option is enabled, a dot matches any character including the new line character.
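The dot behaviour can be sketched in Python, which is an assumption here rather than the article's own environment. Python's re.DOTALL flag is taken to be the counterpart of the Singleline option, and the two-line sample string is made up for the demonstration.

```python
import re

text = "abc def ant cow"
# 'a' followed by any two characters.
dot_matches = re.findall(r"a..", text)
print(dot_matches)  # ['abc', 'ant']

# Hypothetical sample with a new line between 'a' and 'b'.
two_lines = "a\nb"
# By default the dot will not match "\n", so nothing is found.
default_matches = re.findall(r"a..", two_lines)
# With DOTALL the dot also matches the new line character.
dotall_matches = re.findall(r"a..", two_lines, flags=re.DOTALL)
print(default_matches, dotall_matches)  # [] ['a\nb']
```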
Matching word characters
Backslash and a lowercase 'w' (\w) is a character class that will match any word character (a letter, a digit or the underscore). The following Regular Expression matches 'a' followed by two word characters.
Text: abc anaconda ant cow apple
Regex: a\w\w
Matches: abc, ana, ant, app
Backslash and an uppercase 'W' (\W) will match any non-word character.
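A short sketch of both classes, assuming Python's re module as a stand-in for the .NET engine. The second sample string ("ant, cow!") is a made-up input chosen to contain some non-word characters.

```python
import re

text = "abc anaconda ant cow apple"
# 'a' followed by two word characters.
word_matches = re.findall(r"a\w\w", text)
print(word_matches)  # ['abc', 'ana', 'ant', 'app']

# \W is the opposite class: it matches the comma, the space and the '!'.
non_word_matches = re.findall(r"\W", "ant, cow!")
print(non_word_matches)  # [',', ' ', '!']
```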
Matching white-space
White-space can be matched using \s (backslash and 's'). The following Regular Expression matches the letter 'a' followed by two word characters then a white-space character.
Text: "abc anaconda ant"
Regex: a\w\w\s
Matches: "abc "
Note that ant was not matched as it is not followed by a white space character.
White-space is defined as the space character, new line (\n), form feed (\f), carriage return (\r), tab (\t) and vertical tab (\v). Be careful using \s as it can lead to unexpected behaviour by matching line breaks (\n and \r). Sometimes it is better to explicitly specify the characters to match instead of using \s, e.g. to match tab and space use [\t\x20].
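The difference between \s and an explicit set of characters can be sketched as follows. Python is an assumption here, the set is written as [\t\x20] (tab and space), and the string containing a tab and a line break is a hypothetical input.

```python
import re

text = "abc anaconda ant"
# 'a', two word characters, then one white-space character.
ws_matches = re.findall(r"a\w\w\s", text)
print(ws_matches)  # ['abc ']

# Hypothetical input containing a tab, a space and a line break.
mixed = "one\ttwo three\nfour"
# \s splits on the line break as well as on the tab and the space.
split_any = re.split(r"\s", mixed)
# An explicit set splits on tab and space only and leaves "\n" alone.
split_explicit = re.split(r"[\t\x20]", mixed)
print(split_any)       # ['one', 'two', 'three', 'four']
print(split_explicit)  # ['one', 'two', 'three\nfour']
```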
Matching digits
The digits zero to nine can be matched using \d (backslash and lowercase 'd'). For example, the following Regular Expression matches any three digits in a row.
Text: 123 12 843 8472
Regex: \d\d\d
Matches: 123, 843, 847
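The digit example can be reproduced with a few lines of Python (an assumed language for this sketch, not the article's .NET setting).

```python
import re

text = "123 12 843 8472"
# Three digits in a row; "12" is too short, and "8472" yields only "847".
digit_matches = re.findall(r"\d\d\d", text)
print(digit_matches)  # ['123', '843', '847']
```

The trailing "2" in "8472" is left over because matches never overlap: once "847" is consumed, the search resumes after it.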
Matching sets of single characters
The square brackets are used to specify a set of single characters to match. Any single character within the set will match. For example, the following Regular Expression matches any three characters where the first character is either 'd' or 'a'.
Text: abc def ant cow
Regex: [da]..
Matches: abc, def, ant
The caret (^) can be added to the start of the set of characters to specify that none of the characters in the character set should be matched. The following Regular Expression matches any three characters where the first character is not 'd' and not 'a'.
Text: abc def ant cow
Regex: [^da]..
Matches: "bc " "ef " "nt " "cow"
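Both forms of the character set can be compared side by side. This is a sketch in Python, which is assumed here only for illustration.

```python
import re

text = "abc def ant cow"

# First character must be 'd' or 'a', then any two characters.
set_matches = re.findall(r"[da]..", text)
print(set_matches)  # ['abc', 'def', 'ant']

# First character must be anything except 'd' or 'a'.
negated_matches = re.findall(r"[^da]..", text)
print(negated_matches)  # ['bc ', 'ef ', 'nt ', 'cow']
```

Note that the negated set still consumes a character, which is why the matches include the trailing spaces.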
Matching ranges of characters
Ranges of characters can be matched using the hyphen (-). The following Regular Expression matches any three characters where the second character is either 'a', 'b', 'c' or 'd'.
Text: abc pen nda uml
Regex: .[a-d].
Matches: abc, nda
Ranges of characters can also be combined together. The following Regular Expression matches any of the characters from 'a' to 'z' or any digit from '0' to '9' followed by two word characters.
Text: abc no 0aa i8i
Regex: [a-z0-9]\w\w
Matches: abc, 0aa, i8i
The character set could be written more simply as [a-z\d], giving the pattern [a-z\d]\w\w.
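The range examples can be checked with a small sketch, again assuming Python's re module rather than the .NET classes the article is based on. It also confirms that, for this input, [a-z\d] finds the same matches as [a-z0-9].

```python
import re

# Second character must fall in the range 'a' to 'd'.
range_matches = re.findall(r".[a-d].", "abc pen nda uml")
print(range_matches)  # ['abc', 'nda']

text = "abc no 0aa i8i"
# Two ranges combined in one set, followed by two word characters.
combined_matches = re.findall(r"[a-z0-9]\w\w", text)
# The same set written with the \d shorthand.
shorthand_matches = re.findall(r"[a-z\d]\w\w", text)
print(combined_matches)   # ['abc', '0aa', 'i8i']
print(shorthand_matches)  # ['abc', '0aa', 'i8i']
```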
Specifying the number of times to match with Quantifiers
Quantifiers let you specify the number of times that an expression must match. The most frequently used quantifiers are the asterisk character (*) and the plus sign (+). Note that the asterisk (*) is usually called the star when talking about Regular Expressions.
Matching zero or more times with star (*)
The star tells the Regular Expression to match the character, group, or character class that immediately precedes it zero or more times. This means that the character, group, or character class is optional: it can be matched but it does not have to match. The following Regular Expression matches the character 'a' followed by zero or more word characters.
Text: Anna Jones and a friend owned an anaconda
Regex: a\w*
Options: IgnoreCase
Matches: Anna, and, a, an, anaconda
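A minimal sketch of the star quantifier, assuming Python, where the re.IGNORECASE flag plays the role of the IgnoreCase option.

```python
import re

text = "Anna Jones and a friend owned an anaconda"

# 'a' followed by zero or more word characters, ignoring case.
star_matches = re.findall(r"a\w*", text, flags=re.IGNORECASE)
print(star_matches)  # ['Anna', 'and', 'a', 'an', 'anaconda']
```

The lone "a" is included because the star allows zero word characters after it.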
Matching one or more times with plus (+)
The plus sign tells the Regular Expression to match the character, group, or character class that immediately precedes it one or more times. This means that the character, group, or character class must be found at least once. After it is found once it will be matched again if it follows the first match. The following Regular Expression matches the character 'a' followed by at least one word character.
Text: Anna Jones and a friend owned an anaconda
Regex: a\w+
Options: IgnoreCase
Matches: Anna, and, an, anaconda
Note that "a" was not matched as it is not followed by any word characters.
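The plus quantifier can be tried in the same way. Python is used as an assumed illustration language, with re.IGNORECASE standing in for IgnoreCase.

```python
import re

text = "Anna Jones and a friend owned an anaconda"

# 'a' followed by at least one word character, ignoring case.
plus_matches = re.findall(r"a\w+", text, flags=re.IGNORECASE)
print(plus_matches)  # ['Anna', 'and', 'an', 'anaconda']
```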
Matching zero or one times with question mark (?)
To specify an optional match use the question mark (?). The question mark matches zero or one times. The following Regular Expression matches the character 'a' optionally followed by a single 'n'.
Text: Anna Jones and a friend owned an anaconda
Regex: an?
Options: IgnoreCase
Matches: An, a, an, a, an, an, a, a
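The optional match is easiest to follow by listing every hit in order. The sketch below assumes Python's re module and its re.IGNORECASE flag as a stand-in for IgnoreCase.

```python
import re

text = "Anna Jones and a friend owned an anaconda"

# 'a' optionally followed by a single 'n', ignoring case.
optional_matches = re.findall(r"an?", text, flags=re.IGNORECASE)
print(optional_matches)
# ['An', 'a', 'an', 'a', 'an', 'an', 'a', 'a']
```

"Anna" contributes two matches ("An" and "a") because the question mark takes at most one 'n', so the second 'n' is skipped and the search resumes at the final 'a'.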
Specifying the number of matches
The exact number of matches required for a character, group, or character class can be specified with the curly brackets ({n}). The following Regular Expression matches the character 'a' followed by exactly two 'n' characters. There must be two 'n' characters for a match to occur. To specify only a minimum, add a comma after the number ({n,}).
Text: Anna Jones and Anne owned an anaconda
Regex: an{2}
Options: IgnoreCase
Matches: Ann, Ann
A range of matches can be specified by curly brackets with two numbers inside ({n,m}). The first number (n) is the minimum number of matches required, the second (m) is the maximum number of matches permitted. This Regular Expression matches the character 'a' followed by a minimum of two 'n' characters and a maximum of three 'n' characters.
Text: Anna and Anne lunched with an anaconda annnnnex
Regex: an{2,3}
Options: IgnoreCase
Matches: Ann, Ann, annn
The Regex stops matching after the maximum number of matches has been found.
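The curly-bracket quantifiers can be sketched as below, assuming Python as the illustration language. Note that {2} matches exactly two repetitions; the open-ended form {2,} is shown with a made-up sample string that is not from the article.

```python
import re

# Exactly two 'n' characters after 'a'.
exact_matches = re.findall(
    r"an{2}", "Anna Jones and Anne owned an anaconda", flags=re.IGNORECASE
)
print(exact_matches)  # ['Ann', 'Ann']

# Between two and three 'n' characters after 'a'.
range_matches = re.findall(
    r"an{2,3}",
    "Anna and Anne lunched with an anaconda annnnnex",
    flags=re.IGNORECASE,
)
print(range_matches)  # ['Ann', 'Ann', 'annn']

# Hypothetical sample: at least two 'n' characters, with no upper limit.
at_least_matches = re.findall(r"an{2,}", "an ann annnn")
print(at_least_matches)  # ['ann', 'annnn']
```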
Matching the start and end of a string
To specify that a match must occur at the beginning of a string use the caret character (^). For example, the following Regular Expression pattern matches the beginning of the string followed by the character 'a'.
Text: an anaconda ate Anna Jones
Regex: ^a
Matches: "a" at position 1
The pattern above only matches the a in "an".
Note that the caret (^) has different behaviour when used inside the square brackets.
If the Multiline option is on, the caret (^) will match the beginning of each line in a multiline string rather than only the start of the string.
To specify that a match must occur at the end of a string use the dollar character ($). If the Multiline option is on then the pattern will match at the end of each line in a multiline string. This Regular Expression pattern matches the word at the end of the line in a multiline string.
Text: "an anaconda ate Anna Jones"
Regex: \w+$
Options: Multiline, IgnoreCase
Matches: Jones
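The anchors can be explored with the following sketch. It assumes Python, where re.MULTILINE is taken as the counterpart of the Multiline option, and it uses a hypothetical two-line version of the sample text to show the per-line behaviour. Python reports positions from zero, so the first character is at index 0.

```python
import re

text = "an anaconda ate Anna Jones"
# ^a only matches the 'a' at the very start of the string.
start_match = re.search(r"^a", text)
print(start_match.group(), start_match.start())  # a 0

# Hypothetical two-line input to show the effect of the Multiline option.
two_lines = "an anaconda\nate Anna Jones"
# Without MULTILINE, $ only matches at the end of the whole string.
end_default = re.findall(r"\w+$", two_lines)
# With MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", two_lines, flags=re.MULTILINE)
print(end_default)  # ['Jones']
print(line_starts)  # ['an', 'ate']
print(line_ends)    # ['anaconda', 'Jones']
```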
Microsoft have an online reference for Regex in .NET: Regular Expression Syntax on MSDN
To learn more about Regular Expression syntax see the next article: C# Regular Expression (Regex) Examples in .NET
source: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx