Regular Expressions in Python
A regular expression is a sequence of characters that forms a search pattern, used mainly to find or replace text in a string.
Regular expressions are supported by many languages such as Python, Java, and R. The most common uses of regular expressions are:
- Finding patterns in a string or file (Ex: find all the numbers present in a string).
- Replacing a part of the string with another string.
- Searching for a substring in a string or file.
- Splitting a string into substrings.
- Validating an email format.
We will see examples of the uses mentioned above in detail.
RegEx Module
Python has a built-in module called re for working with regular expressions. We can use it as shown in the example below:
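A minimal sketch of such an example, assuming a sample sentence chosen for illustration (the variable names are also assumptions):

```python
import re

# Sample sentence, assumed for illustration.
txt = "Turing is based in London"

# ^Turing anchors the pattern to the start of the string, .* matches any
# characters in between, and London$ anchors it to the end of the string.
match = re.search(r"^Turing.*London$", txt)

if match:
    print("We have a match!")
else:
    print("No match")
```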
This prints “We have a match!” because the given string starts with Turing and ends with London. We will see how this works later in the article.
Python RegEx Functions
The ‘re’ module in Python provides various functions for working with regular expressions. We will discuss some commonly used ones.
S No | Function | Description |
---|---|---|
1 | findall(pattern, string) | Returns a list of all non-overlapping matches of the pattern in the string. |
2 | search(pattern, string) | Scans the string and returns a match object for the first occurrence of the pattern, wherever it appears. Returns None if there is no match. |
3 | split(pattern, string) | Splits the string at every match of the pattern and returns the pieces as a list. |
4 | sub(pattern, repl, string) | Replaces every match of the pattern in the string with the replacement string repl. |
Meta Characters
Meta characters are characters that have a special meaning in a pattern. The following are some of the meta characters with their uses.
S No | Meta character | Description |
---|---|---|
1 | [ ] (Square brackets) | Matches any single character listed inside the brackets. |
2 | . (Period) | Matches any single character except a newline. If we pass this as the pattern to findall(), it matches every character in the string except newline characters. |
3 | ^ (Caret) | Matches the pattern that follows it only at the start of the string. It is used to check whether the string starts with a particular pattern. |
4 | $ (Dollar) | Matches the pattern that precedes it only at the end of the string. It is used to check whether the string ends with a particular pattern. |
5 | * (Star) | Matches 0 or more occurrences of the pattern to its left. |
6 | + (Plus) | Matches 1 or more occurrences of the pattern to its left. |
7 | ? (Question mark) | Matches 0 or 1 occurrence of the pattern to its left. |
8 | { } (Braces) | Matches the pattern to its left repeated the number of times given inside the braces, for example {3} or {2,4}. |
9 | \| (Pipe, alternation) | Works like an ‘or’ condition between two or more patterns. If the string contains at least one of the given patterns, there is a match. |
10 | ( ) (Group) | Groups part of a regular expression so that it can be treated as a single unit and its match can be retrieved separately. |
11 | \ (Backslash) | Introduces a special sequence, or escapes a meta character so that it is treated as a normal character. |
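The meta characters above can be tried out with findall. The following is a small sketch; the sample strings are assumptions made up for illustration:

```python
import re

# Sample text, assumed for illustration.
text = "cat bat rat 2023 colour color"

print(re.findall(r"[cb]at", text))    # ['cat', 'bat']           set of characters
print(re.findall(r".at", text))       # ['cat', 'bat', 'rat']    any character
print(re.findall(r"^cat", text))      # ['cat']                  start of string
print(re.findall(r"color$", text))    # ['color']                end of string
print(re.findall(r"colou?r", text))   # ['colour', 'color']      0 or 1 'u'
print(re.findall(r"\d{4}", text))     # ['2023']                 exactly 4 digits
print(re.findall(r"bat|rat", text))   # ['bat', 'rat']           alternation
print(re.findall(r"(c)(at)", text))   # [('c', 'at')]            two groups

# Star versus plus on a second sample string.
print(re.findall(r"ab*", "a ab abb"))  # ['a', 'ab', 'abb']      0 or more 'b'
print(re.findall(r"ab+", "a ab abb"))  # ['ab', 'abb']           1 or more 'b'

# A backslash escapes a meta character so it matches literally.
print(re.findall(r"\.", "3.14"))       # ['.']
```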
Special Sequences in Python RegEx
S No | Sequence | Description |
---|---|---|
1 | \A | Matches only at the beginning of the string, so the pattern to its right must appear at the start. |
2 | \b | Matches at a word boundary: the pattern to its right must be at the beginning of a word, or the pattern to its left must be at the end of a word. |
3 | \B | Matches at any position that is not a word boundary: the adjacent pattern must not be at the beginning or end of a word. |
4 | \d | Matches any single digit character. |
5 | \D | Matches any single character that is not a digit. |
6 | \s | Matches any single white space character (space, tab, newline, etc.). |
7 | \S | Matches any single character that is not a white space character. |
8 | \w | Matches any single word character: a-z, A-Z, 0-9 and underscore (_). For str patterns in Python 3 it also matches other Unicode letters and digits unless the re.ASCII flag is used. |
9 | \W | Matches any single character that is not a word character. |
10 | \Z | Matches only at the end of the string, so the pattern to its left must appear at the end. |
Examples for each sequence are given below.
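The following sketch exercises each special sequence with findall. The sample strings are assumptions chosen for illustration:

```python
import re

# Sample sentence, assumed for illustration.
text = "The rain in Spain costs 25 euros"

print(re.findall(r"\AThe", text))     # ['The']   'The' is at the start
print(re.findall(r"\bin", text))      # ['in']    only the word 'in' starts with 'in'
print(re.findall(r"in\b", text))      # ['in', 'in', 'in']   rain, in, Spain end with 'in'
print(re.findall(r"\Bin", text))      # ['in', 'in']         'in' inside rain and Spain
print(re.findall(r"\d", text))        # ['2', '5']
print(re.findall(r"\D", "a1b2"))      # ['a', 'b']
print(len(re.findall(r"\s", text)))   # 6 spaces in the sentence
print(re.findall(r"\S+", "a1 b2"))    # ['a1', 'b2']
print(re.findall(r"\w", "hi_5!"))     # ['h', 'i', '_', '5']
print(re.findall(r"\W", "hi_5!"))     # ['!']
print(re.findall(r"euros\Z", text))   # ['euros'] 'euros' is at the end
```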
Sets
A set is a group of characters inside square brackets; it matches any single character from that group. Given below are some examples of sets:
S No | Set | Description |
---|---|---|
1 | [abcd] | Matches any one of the characters a, b, c or d. |
2 | [a-z] | Matches any lowercase letter from a to z. |
3 | [A-Z] | Matches any uppercase letter from A to Z. |
4 | [0-9] | Matches any digit from 0 to 9. |
5 | [a-zA-Z0-9] | Matches any letter or digit, that is, any character covered by the three sets above. |
6 | [^a-zA-Z] | Matches any character that is not a letter. A caret at the start of a set negates it. |
7 | [%&$#@*] | Matches any one of these characters. Inside square brackets they are treated as normal characters. |
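As a quick illustration of the sets in the table, here is a sketch using a made-up sample string:

```python
import re

# Sample string, assumed for illustration.
text = "Room 42B costs $99!"

print(re.findall(r"[a-z]", text))         # ['o', 'o', 'm', 'c', 'o', 's', 't', 's']
print(re.findall(r"[A-Z]", text))         # ['R', 'B']
print(re.findall(r"[0-9]", text))         # ['4', '2', '9', '9']
print(re.findall(r"[^a-zA-Z0-9]", text))  # [' ', ' ', ' ', '$', '!']
print(re.findall(r"[%&$#@*]", text))      # ['$']  the $ is literal inside a set
```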
findall(pattern, string)
This function works like search, but instead of stopping at the first match it finds every non-overlapping occurrence of the pattern in the given string and returns them as a list. The length of the list is the number of times the pattern occurs in the string.
Ex: The following example will make it clear.
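A minimal sketch, assuming a sample sentence that contains the word ‘Turing’ three times:

```python
import re

# Sample sentence, assumed for illustration.
text = "Turing was a mathematician. The Turing machine is named after Turing."

matches = re.findall(r"Turing", text)

print(matches)       # ['Turing', 'Turing', 'Turing']
print(len(matches))  # 3
```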
In the output you can see that the function finds every match for the pattern ‘Turing’. Use findall when you need all occurrences of a pattern; search and match return only the first match.
search(pattern, string)
This is similar to the match function, which only looks for the pattern at the beginning of the string, but search finds the pattern irrespective of its position. The pattern can be present anywhere in the string. This function returns the first occurrence of the pattern.
Ex: The following example shows how to use the function.
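A short sketch, assuming a sample sentence in which the pattern appears twice and not at the start:

```python
import re

# Sample sentence, assumed for illustration.
sentence = "Alan Turing devised the Turing machine"

# re.match only looks at the beginning of the string, so it finds nothing here.
at_start = re.match(r"Turing", sentence)

# re.search scans the whole string and stops at the first occurrence.
anywhere = re.search(r"Turing", sentence)

print(at_start)           # None
print(anywhere.group())   # Turing
```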
The function returns a re.Match object if the pattern is present in the string; otherwise it returns None.
We can also get the start and end positions of the match by calling the span method on the re.Match object.
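For instance, assuming the sample sentence below, span can be used to slice the matched text back out of the string:

```python
import re

# Sample sentence, assumed for illustration.
sentence = "Alan Turing worked at Bletchley Park"

found = re.search(r"Turing", sentence)

if found is not None:
    start, end = found.span()
    print(found.span())          # (5, 11)
    print(sentence[start:end])   # Turing
```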
split(pattern, string)
This function splits a string at every match of the given pattern and returns the resulting pieces as a list. The example given below will make it clear.
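A minimal sketch, assuming a sample string whose items are separated by commas, semicolons, or white space:

```python
import re

# Sample string with mixed separators, assumed for illustration.
text = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or white space.
parts = re.split(r"[,;\s]+", text)
print(parts)   # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of splits; the rest of the string is kept whole.
first_two = re.split(r"\s", "a b c", maxsplit=1)
print(first_two)   # ['a', 'b c']
```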
sub(pattern, repl, string)
This function replaces every match of a pattern in a given string with the given substring. In the example below we will replace the word ‘theoretical’ with ‘practical’.
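A minimal sketch, assuming a sample sentence that contains ‘theoretical’ twice:

```python
import re

# Sample sentence, assumed for illustration.
text = "A theoretical course with theoretical exercises"

# By default every occurrence is replaced.
replaced_all = re.sub(r"theoretical", "practical", text)
print(replaced_all)    # A practical course with practical exercises

# The optional count argument limits how many occurrences are replaced.
replaced_once = re.sub(r"theoretical", "practical", text, count=1)
print(replaced_once)   # A practical course with theoretical exercises
```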
In the output we can see that the function replaced the pattern ‘theoretical’ with the given substring ‘practical’. By default this function replaces every occurrence of the pattern in the string with the given substring.
Match Object
Functions such as search and match look for the pattern in the string. If they find a match they return a match object; otherwise they return None. (findall, split and sub return lists or strings instead.) We will see what the match object looks like and how to access its methods and attributes. Let’s search for a pattern in a string and print the match object.
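A small sketch, assuming a sample sentence; the printed representation shown in the comment is what recent Python 3 versions produce:

```python
import re

# Sample sentence, assumed for illustration.
text = "Turing was born in London"

hit = re.search(r"London", text)
miss = re.search(r"Paris", text)

print(hit)    # <re.Match object; span=(19, 25), match='London'>
print(miss)   # None
```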
In the example above we can see that if a match is found then a re.Match object is returned. If there is no match then None is returned.
Now we will see the methods and attributes of re.Match objects one by one. They are as follows:
- match.group(): This returns the part of the string that matched the pattern.
- match.start(): This returns the start position of the match in the string.
- match.end(): This returns the end position of the match in the string, that is, the index just after the last matched character.
- match.span(): This returns a tuple containing the start and end positions of the match.
- match.re: This returns the compiled pattern object used for matching.
- match.string: This returns the string that was passed in for matching.
- Using the r prefix before a regex: This makes the pattern a raw string, which means Python does not process backslash escape sequences in it. Ex: in r"\n" the \ character is not treated as an escape character, so the backslash reaches the regex engine unchanged.
Now we will see examples for each of the methods and attributes mentioned above.
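The sketch below goes through each item in turn, assuming a sample sentence that contains a year:

```python
import re

# Sample sentence, assumed for illustration.
text = "Turing was born in 1912"

m = re.search(r"\d+", text)

print(m.group())      # 1912
print(m.start())      # 19
print(m.end())        # 23
print(m.span())       # (19, 23)
print(m.re.pattern)   # \d+  (m.re is the compiled pattern object)
print(m.string)       # Turing was born in 1912

# Raw strings: without the r prefix, "\n" is a single newline character;
# with it, r"\n" is two characters, a backslash followed by 'n'.
print(len("\n"), len(r"\n"))   # 1 2
```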
If there is no match, the function returns None instead of a re.Match object, so check the result before calling any of these methods.
The concepts that you have learned above can be used to validate email addresses, phone numbers, addresses, or names of colleges and institutes.
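As an illustration of the email case, here is a deliberately simplified sketch. The helper name is_valid_email and the pattern are assumptions made for this example; real-world email validation is considerably more permissive and more complex than this pattern.

```python
import re

# Simplified pattern (an assumption, not a complete email grammar):
# one or more word characters, dots, plus signs or hyphens, then '@',
# then a domain made of at least two dot-separated labels.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def is_valid_email(address):
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None


print(is_valid_email("alice@example.com"))     # True
print(is_valid_email("not-an-email"))          # False
print(is_valid_email("missing@domain"))        # False
```

fullmatch is used here so that the entire string has to match, which avoids accepting an address followed by stray text.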