Introduction
Regular Expressions provide a standard and powerful way of
pattern matching for text data. The .NET Framework
exposes its regular expression engine via the System.Text.RegularExpressions
namespace. The Regex class is the primary way for developers to perform
pattern matching, search and replace, and splitting operations on a string.
Many beginners avoid using regular expressions because of the apparently
difficult syntax. However, if your application calls for heavy pattern matching
then learning and using regular expressions over ordinary string manipulation
functions is strongly recommended. This article is intended to give beginners a
quick overview of .NET Framework’s offerings for pattern matching using regular
expressions.
Note:
This article will not teach you how to write regular expressions. It focuses
primarily on using classes from System.Text.RegularExpressions namespace. It is
assumed that you are already familiar with regular expression syntax and are
able to write basic regular expressions.
Basic Terminology
Before you go any further let’s quickly glance over the
basic terminology used in the context of regular expressions.
- Capture : When you perform pattern matching using a
regular expression, the result of a single sub-expression match is called a
Capture. The Capture and CaptureCollection classes represent a single
capture and a collection of captures respectively.
- Group : A regular expression often consists of one
or more Groups. A group is represented by rounded brackets within a
regular expression (the whole regular expression itself is considered as a
group). There can be zero or more captures for a single group. The Group
and GroupCollection classes represent a single group and a collection of
groups respectively.
- Match : A result obtained after a single match of a
regular expression is termed a Match. A match contains one or more
groups. The Match and MatchCollection classes represent a single match and
a collection of matches respectively.
Thus the relation between the regular expression related
objects is:
Regex class -> MatchCollection -> Match objects -> GroupCollection -> Group objects -> CaptureCollection -> Capture objects
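The hierarchy above can be sketched with a small example. This is a minimal illustration, not part of the original article: the class name CaptureDemo, the method GroupOneCaptures, and the sample input are hypothetical. It shows that a group which repeats inside a pattern records one capture per repetition.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class CaptureDemo
{
    // Returns every capture recorded for group 1 of the first match.
    public static string[] GroupOneCaptures(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);   // Regex -> Match
        Group group = match.Groups[1];               // Match -> Group
        string[] values = new string[group.Captures.Count];
        for (int i = 0; i < group.Captures.Count; i++)
        {
            values[i] = group.Captures[i].Value;     // Group -> Capture
        }
        return values;
    }

    public static void Main()
    {
        // The group (\d+,?) repeats, so it captures once per number.
        foreach (string value in GroupOneCaptures("10,20,30", @"(\d+,?)+"))
        {
            Console.WriteLine(value);
        }
    }
}
```

The Value property of a group returns only its last capture, which is why the Captures collection is needed to see every repetition.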
The Regex Class
The Regex class, along with a few supporting classes, represents the regular
expression engine of .NET Framework. The Regex class allows you to perform
pattern matching, search and replace, and splitting on the source strings. You
can use the Regex class in two ways, viz. calling static methods of Regex class
or by instantiating Regex class and then calling instance methods. The
difference between these two approaches will be clear in the section related to
performance. The following table lists some of the important methods of the
Regex class along with the purpose of each:
Method | Description
IsMatch | Determines whether a string matches a specified pattern and returns true or false.
Match | Searches a string for a specified pattern and returns the first occurrence as a Match object.
Matches | Searches a string for all the occurrences of a pattern and returns them as a MatchCollection.
Replace | Replaces all the occurrences of a pattern with a specified replacement string.
Split | Splits a string into an array of substrings, using a specified pattern as the delimiter.
In the following sections you are going to use many of the
methods mentioned above.
Pattern Matching Using Regex Class
In this section you will use the pattern matching abilities
of the Regex class. Begin by creating a new Console Application and import
System.Text.RegularExpressions namespace at the top.
using System.Text.RegularExpressions;
Using IsMatch() Method
In this example you will check whether a string is a valid
URL. Key-in the following code in the Main() method.
static void Main(string[] args)
{
    // The string to test is passed as the first command line argument.
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    bool success = Regex.IsMatch(source, pattern);
    if (success)
    {
        Console.WriteLine("Entered string is a valid URL!");
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
    Console.ReadLine();
}
The Main() method receives the string to be tested as a
command line argument. The pattern string variable holds the regular expression
for verifying URLs. The code then calls the IsMatch() static method on the
Regex class and passes the source and pattern strings to it. Depending on the
returned boolean value a message is displayed to the user.
You could have achieved the same result by creating an
instance of Regex class and then calling IsMatch() method on it, as shown
below:
Regex ex = new Regex(pattern);
success = ex.IsMatch(source);
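One point worth keeping in mind: the URL pattern is not anchored, so IsMatch() returns true whenever the string contains a URL-like substring, not only when the whole string is a URL. The sketch below contrasts the two checks. The class UrlCheck and its method names are hypothetical, and wrapping the pattern in ^(?:...)$ is just one possible way to require a full-string match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class UrlCheck
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // True if a URL-like substring occurs anywhere in the source string.
    public static bool ContainsUrl(string source)
    {
        return Regex.IsMatch(source, UrlPattern);
    }

    // True only if the entire source string matches the URL pattern.
    public static bool IsWholeUrl(string source)
    {
        return Regex.IsMatch(source, "^(?:" + UrlPattern + ")$");
    }

    public static void Main()
    {
        string text = "see https://www.codeguru.com/";
        Console.WriteLine(ContainsUrl(text));
        Console.WriteLine(IsWholeUrl(text));
    }
}
```

Which check is appropriate depends on whether you are validating a single field or scanning free text.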
Using Match() Method
In order to see how Match() method can be used, modify the
Main() method as shown below:
static void Main(string[] args)
{
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    // Match() returns the first occurrence of the pattern.
    Match match = Regex.Match(source, pattern);
    if (match.Success)
    {
        Console.WriteLine("Entered string is a valid URL!");
        Console.WriteLine("{0} Groups", match.Groups.Count);
        // Walk through every group of the match.
        for (int i = 0; i < match.Groups.Count; i++)
        {
            Console.WriteLine("Group {0} Value = {1} Status = {2}", i, match.Groups[i].Value, match.Groups[i].Success);
            Console.WriteLine("\t{0} Captures", match.Groups[i].Captures.Count);
            // Walk through every capture of the current group.
            for (int j = 0; j < match.Groups[i].Captures.Count; j++)
            {
                Console.WriteLine("\t\t Capture {0} Value = {1} Found at = {2}", j, match.Groups[i].Captures[j].Value, match.Groups[i].Captures[j].Index);
            }
        }
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
    Console.ReadLine();
}
The code shown above makes use of the Match() method to perform pattern matching. As mentioned earlier the Match() method returns an instance of Match class that represents the first occurrence of the pattern. The Success property of the Match object tells you whether the pattern matching was successful or not. A for loop then iterates through the Groups collection (GroupCollection object). With each iteration, the group searched for and its status is outputted. Further, the Captures collection of each group is also iterated and with each iteration the captured value and its index in the string is outputted. The following figure shows a sample run of the above application.
Figure 1: A sample run of the application
Observe the above figure carefully. Our pattern contains 4
groups (three in rounded brackets of the regular expression, plus the whole
expression), so the Count property of the Groups collection returns 4. The first
group (the whole expression) has the value https://www.codeguru.com/. The second
group has the value s (from https). The third group has two captures, www. and
codeguru. (each including its trailing dot), and its Value property returns the
last of them. Finally, the last group has the value / (the / at the end of the URL).
Using Matches() Method
Matches() method is similar to Match() method but returns a
collection of Match objects (MatchCollection). You can then iterate through all
of the Match instances and see various group and capture values. The following
code illustrates how this is done:
MatchCollection matches = Regex.Matches(source, pattern);
foreach (Match match in matches)
{
    Console.WriteLine("Match Value = {0}", match.Value);
    Console.WriteLine("============");
    if (match.Success)
    {
        Console.WriteLine("Entered string is a valid URL!");
        Console.WriteLine("{0} Groups", match.Groups.Count);
        for (int i = 0; i < match.Groups.Count; i++)
        {
            Console.WriteLine("Group {0} Value = {1} Status = {2}", i, match.Groups[i].Value, match.Groups[i].Success);
            Console.WriteLine("\t{0} Captures", match.Groups[i].Captures.Count);
            for (int j = 0; j < match.Groups[i].Captures.Count; j++)
            {
                Console.WriteLine("\t\t Capture {0} Value = {1} Found at = {2}", j, match.Groups[i].Captures[j].Value, match.Groups[i].Captures[j].Index);
            }
        }
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
}
The following figure shows a sample run of the above code:
Figure 2: Matches() method returns two Match objects
Notice how Matches() method has returned two Match objects
(one for http://site1.com and other for http://site2.com).
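When only the matched text is of interest, the loop can be much shorter. The following is a minimal sketch under that assumption. The class UrlFinder, the method FindUrls, and the sample sentence are hypothetical and chosen to mirror the two-site input described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class UrlFinder
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Collects the text of every URL-like match found in the source string.
    public static List<string> FindUrls(string source)
    {
        List<string> urls = new List<string>();
        foreach (Match match in Regex.Matches(source, UrlPattern))
        {
            urls.Add(match.Value);
        }
        return urls;
    }

    public static void Main()
    {
        string text = "Visit http://site1.com and http://site2.com today";
        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
    }
}
```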
Search and Replace Using Regex Class
The Regex class not only allows you to perform pattern
matching but also allows you to search and replace strings. Consider, for
example, that you are developing a discussion forum in ASP.NET. To reduce spam and
promotional content, you want to scan forum posts made by new members for URLs
and then replace the URLs with a placeholder such as ****. Something like this can easily be done
with the search and replace abilities of the Regex class. Let’s see how.
static void Main(string[] args)
{
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    // Replace every URL found in the source string with a fixed message.
    string result = Regex.Replace(source, pattern, "[*** URLs not allowed ***]");
    Console.WriteLine(result);
    Console.ReadLine();
}
In the code fragment shown above the regular expression is
intended to find URLs in the input string. You then call the Replace() method
of the Regex class. The first parameter of the Replace() method is the string
in which you wish to perform the replacement, the second parameter is the
pattern to search for, and the third parameter is the replacement string. The
Replace() method returns the resultant string after
performing the replacement. If you run the above code you should see something
like this in the console window:
Figure 3: The Replace() method of the Regex class
Notice how the URL has been replaced with the text you
specify.
Splitting Strings Using Regex
The Regex class also allows you to split an input string based
on a regular expression. Say, for example, you wish to split a date in
DD/MM/YYYY format at / so as to retrieve individual day, month and year values,
or a comma-separated list so as to retrieve its individual items.
The Split() method of the Regex class allows you to do just that. The following
example splits a comma-separated list of fruit names:
string strFruits = "Apple,Mango,Banana";
string[] fruits = Regex.Split(strFruits, ",");
foreach (string s in fruits)
{
    Console.WriteLine(s);
}
In the above code the Split() method takes the source string
and a regular expression for searching the delimiter (, in the above example).
It then splits the string and returns an array of strings consisting of
individual elements. A sample run of the above code is shown below:
Figure 4: Splitting Strings Using Regex
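The date scenario mentioned earlier works the same way, and a pattern (rather than a fixed character) becomes useful when the delimiter varies. The sketch below shows both cases. The class SplitDemo, its method names, and the sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class SplitDemo
{
    // Splits a DD/MM/YYYY date into day, month and year parts.
    public static string[] SplitDate(string date)
    {
        return Regex.Split(date, "/");
    }

    // Splits on commas and discards any whitespace around each comma.
    public static string[] SplitList(string list)
    {
        return Regex.Split(list, @"\s*,\s*");
    }

    public static void Main()
    {
        foreach (string part in SplitDate("25/12/2010"))
        {
            Console.WriteLine(part);
        }
        foreach (string item in SplitList("Apple , Mango,Banana"))
        {
            Console.WriteLine(item);
        }
    }
}
```

For a fixed single-character delimiter, String.Split() would do the same job; Regex.Split() is mainly worthwhile when the delimiter is itself a pattern.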
Regex Options
Most of the methods discussed above are overloaded to take a
parameter of type RegexOptions enumeration. As the name suggests, the
RegexOptions enumeration is used to indicate certain configuration options to
the regular expression engine during the pattern matching process. The
following table lists some of the important options of RegexOptions
enumeration:
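Option | Description
IgnoreCase | Indicates that the pattern matching operation should be case-insensitive.
Multiline | Indicates that ^ and $ characters are to be applied to the beginning and end of each line, not just of the whole string.
Singleline | Indicates that dot (.) should match every character, including the newline character.
RightToLeft | Indicates that the pattern matching will be performed from right to left instead of from left to right.
Compiled | Indicates that the regular expression is to be converted to MSIL code rather than to regular expression internal instructions.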
To illustrate how the RegexOptions enumeration can be used,
write the following code in the Main() method and observe the difference due to
the RegexOptions value.
bool success1 = Regex.IsMatch(source, "hello");
Console.WriteLine("String found? {0}", success1);
bool success2 = Regex.IsMatch(source, "hello", RegexOptions.IgnoreCase);
Console.WriteLine("String found? {0}", success2);
As you can see, the second call to the IsMatch() method makes use of the RegexOptions enumeration and specifies that case should be ignored during pattern matching. If you observe the output of the above code (see below) you will find that the IsMatch() call without any RegexOptions returns false whereas the call with RegexOptions.IgnoreCase returns true.
Figure 5: IsMatch() method without RegexOptions returns false; with RegexOptions.IgnoreCase returns true
Note:
You can combine multiple RegexOptions values like this :
bool success2 = Regex.IsMatch(source, "hello", RegexOptions.IgnoreCase | RegexOptions.Compiled);
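The Multiline and Singleline options are easy to confuse, so a small sketch may help. The class OptionsDemo, its method names, and the sample strings are hypothetical. The first method counts how many times ^\w+ matches, and the second checks whether the dot in a.b can match a newline.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class OptionsDemo
{
    // Counts matches of a word at the start of the string
    // (or at the start of every line when Multiline is set).
    public static int CountLineStarts(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+", options).Count;
    }

    // Checks whether "." is allowed to match the newline in "a\nb".
    public static bool DotSpansNewline(RegexOptions options)
    {
        return Regex.IsMatch("a\nb", "a.b", options);
    }

    public static void Main()
    {
        string text = "one\ntwo\nthree";
        Console.WriteLine(CountLineStarts(text, RegexOptions.None));
        Console.WriteLine(CountLineStarts(text, RegexOptions.Multiline));
        Console.WriteLine(DotSpansNewline(RegexOptions.None));
        Console.WriteLine(DotSpansNewline(RegexOptions.Singleline));
    }
}
```

Despite the names, the two options are independent of each other and can be combined.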
Performance Considerations
As mentioned earlier, the Regex class provides static as
well as instance methods for pattern matching. The static methods accept the
source string and the pattern as parameters, whereas the instance methods
accept only the source string (since the pattern is specified while creating the
instance itself). The following code fragment makes it clear:
//Using static method
bool success = Regex.IsMatch(source, pattern);

//Using instance method
Regex ex = new Regex(pattern);
success = ex.IsMatch(source);
When you use static methods, the regular expression engine caches the parsed regular expressions, so if the same pattern is used multiple times the later calls are faster. The cache holds only a limited number of recently used patterns (15 by default, adjustable through the Regex.CacheSize property). On the other hand, patterns passed to the Regex constructor are not cached, so every new Regex(pattern) call parses the pattern again. Regex instances are immutable (i.e. you cannot change them later), which means you can safely create an instance once and reuse it for every match. Reusing a single instance avoids the repeated parsing, whereas creating a new instance for each operation gives you no such benefit.
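A common way to reuse an instance is to keep it in a static readonly field. The sketch below assumes a scenario where many lines of text are checked against the same URL pattern. The class UrlCounter, the method CountLinesWithUrl, and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class UrlCounter
{
    // Created once; the pattern is parsed a single time and then reused.
    private static readonly Regex UrlRegex =
        new Regex(@"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

    // Counts the lines that contain at least one URL-like substring.
    public static int CountLinesWithUrl(IEnumerable<string> lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (UrlRegex.IsMatch(line))
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        string[] lines = { "see http://site1.com", "nothing here", "https://www.codeguru.com/" };
        Console.WriteLine(CountLinesWithUrl(lines));
    }
}
```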
You should also be aware of the impact of
RegexOptions.Compiled on the performance. While calling any of the Regex
methods, if you use the RegexOptions.Compiled option then the regular
expression is converted to MSIL code rather than to regular expression internal
instructions. This usually makes the matching itself faster, but the
compilation takes extra time when the regular expression is created and the
generated code occupies memory, so startup time and memory use can increase.
You should therefore carefully evaluate the use of the RegexOptions.Compiled
option; it tends to pay off only for patterns that are used many times.
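One way to evaluate the trade-off is to measure it for your own pattern and input. The following is a rough sketch rather than a rigorous benchmark: the class CompiledDemo, the method CountMatches, the iteration count, and the sample input are all hypothetical, and the timings will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class CompiledDemo
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Both instances are created once and reused for every call.
    public static readonly Regex Interpreted = new Regex(UrlPattern);
    public static readonly Regex Compiled = new Regex(UrlPattern, RegexOptions.Compiled);

    // Runs IsMatch() repeatedly, returns the number of successful matches
    // and reports the elapsed time through the out parameter.
    public static int CountMatches(Regex regex, string input, int iterations, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (regex.IsMatch(input))
            {
                hits++;
            }
        }
        watch.Stop();
        elapsed = watch.Elapsed;
        return hits;
    }

    public static void Main()
    {
        const string input = "visit https://www.codeguru.com/ today";
        TimeSpan interpretedTime, compiledTime;
        CountMatches(Interpreted, input, 100000, out interpretedTime);
        CountMatches(Compiled, input, 100000, out compiledTime);
        Console.WriteLine("Interpreted: {0} ms", interpretedTime.TotalMilliseconds);
        Console.WriteLine("Compiled:    {0} ms", compiledTime.TotalMilliseconds);
    }
}
```

Both instances return the same results; only the construction cost and the matching speed differ.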
Summary
Regular expressions provide a standard and powerful way of
pattern matching. The Regex class represents .NET Framework’s regular
expression engine. The methods of Regex class are exposed as static as well as
instance methods. These methods allow you to perform search, replace and
splitting operations on input strings. Behavior of the regular expression
engine can be configured with the help of the RegexOptions enumeration.