Regular Expression Operations in Python

In this post, we will look at regular expression operations in Python.

What is Regular Expression?

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this language, we can specify the rules for the set of possible strings that we want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like.

Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

The special characters are:

.

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
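
As a minimal sketch of the dot behaviour described above, the two-line sample string below (made up for illustration) is searched with and without the DOTALL flag.

```python
import re

text = "ab\ncd"  # illustrative sample containing one newline

# By default '.' stops at a newline, so 'b.c' cannot span the line break.
default_match = re.search(r"b.c", text)

# With re.DOTALL, '.' also matches the newline character.
dotall_match = re.search(r"b.c", text, flags=re.DOTALL)

print(default_match)                # None
print(repr(dotall_match.group(0)))  # 'b\nc'
```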

^

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
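
The effect of MULTILINE on '^' and '$' can be sketched as follows. The sample text and variable names are assumptions chosen for illustration.

```python
import re

text = "first line\nsecond line\n"  # illustrative sample

# Default mode: '^' matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# MULTILINE mode: '^' also matches immediately after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Default mode: '$' matches at the end of the string or just before a
# trailing newline. MULTILINE mode: it also matches before every newline.
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(ends_default)      # ['line']
print(ends_multiline)    # ['line', 'line']
```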

*

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

+

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

?

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
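
A small sketch comparing '*', '+', and '?' on the same candidates may help. The candidate list is an assumption for illustration, and fullmatch() is used so that each pattern has to account for the whole candidate string.

```python
import re

candidates = ["a", "ab", "abbb"]  # illustrative inputs

star = [s for s in candidates if re.fullmatch(r"ab*", s)]      # zero or more 'b'
plus = [s for s in candidates if re.fullmatch(r"ab+", s)]      # one or more 'b'
optional = [s for s in candidates if re.fullmatch(r"ab?", s)]  # zero or one 'b'

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab']
```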


*?, +?, ??

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
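
The '<a> b <c>' case described above can be reproduced directly. This is a short sketch, and the variable names are only illustrative.

```python
import re

text = "<a> b <c>"

greedy = re.search(r"<.*>", text).group(0)  # consumes as much as possible
lazy = re.search(r"<.*?>", text).group(0)   # stops at the first '>'
all_tags = re.findall(r"<.*?>", text)       # every minimal <...> span

print(greedy)    # <a> b <c>
print(lazy)      # <a>
print(all_tags)  # ['<a>', '<c>']
```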

{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous quantifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
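
The counted forms {m}, {m,n}, and {m,n}? can be checked with a quick sketch that uses the same strings as the descriptions above. The variable names are illustrative.

```python
import re

text = "aaaaaa"  # six 'a' characters

exact = re.match(r"a{6}", text).group(0)       # exactly six
too_few = re.match(r"a{6}", "aaaaa")           # five is not enough
greedy = re.match(r"a{3,5}", text).group(0)    # takes as many as allowed
lazy = re.match(r"a{3,5}?", text).group(0)     # takes as few as allowed
open_ended = re.fullmatch(r"a{4,}b", "aaaab")  # four or more 'a', then 'b'
too_short = re.fullmatch(r"a{4,}b", "aaab")    # only three 'a'

print(exact)                   # aaaaaa
print(too_few)                 # None
print(greedy, lazy)            # aaaaa aaa
print(open_ended is not None)  # True
print(too_short)               # None
```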

\

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.
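
A brief sketch of escaping follows. The sample text is made up, and re.escape() is a standard helper from the re module that this post does not otherwise cover.

```python
import re

text = "2 * 3 = 6?"  # illustrative sample

# A raw string (r"...") passes the backslash through to the regex engine.
stars = re.findall(r"\*", text)
questions = re.findall(r"\?", text)

# re.escape() adds the backslashes needed to match a string literally.
literal = re.search(re.escape("6?"), text).group(0)

print(stars)      # ['*']
print(questions)  # ['?']
print(literal)    # 6?
```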

[]

Used to indicate a set of characters. In a set:

    1. Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
    2. Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.
    3. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
    4. Character classes such as \w or \S are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.
    5. Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
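
The set rules listed above can be tried out with a few short calls. This is only a sketch, and the input strings are assumptions chosen to exercise each rule.

```python
import re

listed = re.findall(r"[amk]", "make")                # individual characters
minutes_ok = re.fullmatch(r"[0-5][0-9]", "59")       # range 00 to 59
minutes_bad = re.fullmatch(r"[0-5][0-9]", "60")      # outside the range
hex_runs = re.findall(r"[0-9A-Fa-f]+", "ff 10 zz")   # hexadecimal digits
literals = re.findall(r"[(+*)]", "(1+2)*3")          # specials are literal in a set
not_five = re.findall(r"[^5]", "1525")               # complemented set
with_hyphen = re.findall(r"[a\-z]", "a-z b")         # escaped '-' is literal

print(listed)       # ['m', 'a', 'k']
print(hex_runs)     # ['ff', '10']
print(literals)     # ['(', '+', ')', '*']
print(not_five)     # ['1', '2']
print(with_hyphen)  # ['a', '-', 'z']
```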

|

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.
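
A minimal sketch of alternation, using made-up sample strings. Alternatives are tried from left to right, so the first branch that matches is taken even when a later branch would match more text. The (?:...) syntax used in the last call is a non-capturing group that limits how far the '|' reaches.

```python
import re

pets = re.findall(r"cat|dog", "cat, dog, bird")

# 'Data' is tried first and succeeds, so 'Data Science' is never attempted.
first_branch = re.search(r"Data|Data Science", "Data Science").group(0)

# The group restricts the alternation to the single letter a or e.
greys = re.findall(r"gr(?:a|e)y", "gray grey")

print(pets)          # ['cat', 'dog']
print(first_branch)  # Data
print(greys)         # ['gray', 'grey']
```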

\d

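Matches any Unicode decimal digit, which includes [0-9] and many other digit characters. If the ASCII flag is used, only [0-9] is matched.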


\D

Matches any character which is not a decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9].


\s

Matches Unicode whitespace characters, which include [ \t\n\r\f\v] (space, tab, newline, and so on) and many other characters. If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

\S

Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

\w

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

\W

Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_].
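
The difference the ASCII flag makes to \d and \w can be sketched as below. The sample text is an assumption: it mixes an accented letter and a non-ASCII digit (written as escapes so the code stays plain ASCII).

```python
import re

# "\u00e9" is the letter e with an acute accent.
# "\u0663" is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
text = "caf\u00e9_1 \u0663\t42"

unicode_digits = re.findall(r"\d+", text)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# \S+ splits on any whitespace, here a space and a tab.
chunks = re.findall(r"\S+", text)

print(ascii_digits)  # ['1', '42']
print(ascii_words)   # ['caf', '_1', '42']
print(len(chunks))   # 3
```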

\A

Matches only at the start of the string.

\Z

Matches only at the end of the string.

\b

Matches the empty string, but only at the beginning or end of a word.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. \B is just the opposite of \b.
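
The anchors \b, \B, \A, and \Z match positions rather than characters. The sketch below, with made-up sample strings, records where each pattern matches so that the positions are visible.

```python
import re

text = "cat concat catalog"  # illustrative sample

whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
word_start = [m.start() for m in re.finditer(r"\bcat", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_word)   # [0]      only the standalone word
print(word_start)   # [0, 11]  'cat' and the start of 'catalog'
print(inside_word)  # [7]      the 'cat' inside 'concat'

# Unlike '^' and '$', \A and \Z are not affected by MULTILINE.
lines = "one\ntwo"
caret = re.findall(r"^\w+", lines, flags=re.MULTILINE)
start_only = re.findall(r"\A\w+", lines, flags=re.MULTILINE)
end_only = re.findall(r"\w+\Z", lines, flags=re.MULTILINE)

print(caret)       # ['one', 'two']
print(start_only)  # ['one']
print(end_only)    # ['two']
```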

Import re module

In [1]:
import re

Search simple text

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
In [3]:
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'

match = re.search(pattern, text)

Check match found or not

In [4]:
if match:
    print('found a match!')
else:
    print('no match')
found a match!
In [5]:
match.group(0)
Out[5]:
'Machine Learning'

The start() and end() methods give the indexes into the string showing where the text matched by the pattern occurs.

In [6]:
start = match.start()
end = match.end()
print('Found "{}" \n in "{}"\nfrom {} to {} ("{}")'.format(match.re.pattern, match.string, start, end, text[start:end]))
Found "Machine Learning" 
 in "Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning."
from 126 to 142 ("Machine Learning")

Match the pattern

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

re.match() checks for a match only at the beginning of the string; it does not scan ahead for a later occurrence. If the pattern matches at position 0, it returns a match object. If the pattern occurs only further along in the string, it returns None. Even in MULTILINE mode, re.match() matches only at the beginning of the string and not at the beginning of each line.

In [7]:
m = re.match(pattern, text)
In [9]:
print(m)
None

Because the literal text "Machine Learning" does not appear at the start of the input text, match() does not find it.

Let us change the pattern and match

In [10]:
pattern = 'Data Science'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
In [11]:
m = re.match(pattern, text)
print(m)
<re.Match object; span=(0, 12), match='Data Science'>

This time the pattern matches because it is at the beginning of the string.

Match the full text

re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The fullmatch() method requires that the entire input string match the pattern.

In [12]:
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
In [13]:
s = re.fullmatch(pattern, text)
print('Full match :', s)
Full match : None

It did not match, so fullmatch() returned None.

Change the pattern and text, then call fullmatch() again

In [14]:
pattern = "This"
text = "This"
In [15]:
s = re.fullmatch(pattern, text)
print('Full match :', s)
Full match : <re.Match object; span=(0, 4), match='This'>
In [16]:
pattern = "This"
text = "This is a flower."
In [17]:
s = re.fullmatch(pattern, text)
print('Full match :', s)
Full match : None

Find all the matches

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
In [18]:
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
In [19]:
matches = re.findall(pattern, text)
In [20]:
for match in matches:
    print('Found {!r}'.format(match))
Found 'Machine Learning'
Found 'Machine Learning'
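
The way capturing groups change the shape of the findall() result, as described above, can be sketched with a small made-up string. The key=value sample and the variable names are assumptions for illustration.

```python
import re

text = "alice=30, bob=25"  # illustrative sample

no_groups = re.findall(r"\w+=\d+", text)          # whole matches
one_group = re.findall(r"\w+=(\d+)", text)        # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)     # tuples, one item per group
non_capturing = re.findall(r"(?:\w+)=\d+", text)  # (?:...) does not count

print(no_groups)      # ['alice=30', 'bob=25']
print(one_group)      # ['30', '25']
print(two_groups)     # [('alice', '30'), ('bob', '25')]
print(non_capturing)  # ['alice=30', 'bob=25']
```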

Find all patterns using regular expression syntax

In [21]:
# \b    - matches the empty string, but only at the beginning or end of a word
# M     - matches the literal character 'M'
# [a-z] - matches any one lowercase ASCII letter from a to z
# *     - repeats the preceding set zero or more times
pattern = r'\bM[a-z]*'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
In [22]:
re.findall(pattern, text)
Out[22]:
['Machine', 'Models', 'Machine']

re.finditer()

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The finditer() function returns an iterator that produces Match instances instead of the strings returned by findall().

In [23]:
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
In [24]:
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "{}" \n from {} to {} ("{}")'.format(pattern, s, e, text[s:e]))
Found "Machine Learning" 
 from 126 to 142 ("Machine Learning")
Found "Machine Learning" 
 from 248 to 264 ("Machine Learning")

Compile a regular expression pattern

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.
In [25]:
regexes = [
    re.compile(p)
    for p in ['Data', 'Machine', 'test']
]

text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '

print('Text: {}\n'.format(text))
Text: Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. 

In [26]:
for regex in regexes:
    print('Searching "{}" ->'.format(regex.pattern),
          end=' ')

    if regex.search(text):
        print('match!')
    else:
        print('no match')
Searching "Data" -> match!
Searching "Machine" -> match!
Searching "test" -> no match
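
A compiled pattern object exposes the same operations as the module-level functions. The sketch below assumes a short sample sentence and passes re.IGNORECASE (a standard flag not used elsewhere in this post) through the flags parameter of re.compile().

```python
import re

# Compile once, then reuse the object for several operations.
machine_re = re.compile(r"\bmachine\b", flags=re.IGNORECASE)

text = "Machine Learning and machine translation"  # illustrative sample

found = machine_re.findall(text)
first_span = machine_re.search(text).span()
replaced = machine_re.sub("ML", text)

print(found)       # ['Machine', 'machine']
print(first_span)  # (0, 7)
print(replaced)    # ML Learning and ML translation
```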

Replace the string

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
In [27]:
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
In [28]:
# replacing Machine Learning with ML
result = re.sub(pattern, 'ML', text)  
In [29]:
print(result)
Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and ML Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and ML. 

Another example

In [30]:
text = "john xxx john yyy"

# match either xxx or yyy
pattern = "xxx|yyy"
replacing_string = "john"
In [31]:
result = re.sub(pattern, replacing_string, text)
print(result)
john john john john

One more example

In [32]:
pattern = r'\bM[a-z]*\sL[a-z]*'

text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '

replacing_string = "ML"
In [33]:
result = re.sub(pattern, replacing_string, text)
print(result)
Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and ML Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and ML. 
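
The re.sub() signature also allows backreferences in repl, a function as repl, and a count limit. The following is a hedged sketch of those three options; the sample strings and the helper name double are made up for illustration.

```python
import re

# Backreferences: \1 and \2 refer to the first and second groups.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

# Function as repl: it receives each match object and returns the
# replacement text. 'double' is a hypothetical helper.
def double(match):
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(r"a", "A", "banana", count=1)

print(swapped)     # Ada Lovelace
print(doubled)     # 6 apples and 20 pears
print(first_only)  # bAnana
```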

re.subn(pattern, repl, string, count=0, flags=0)

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made): the new string together with the number of replacements made.
In [34]:
pattern = r'\bM[a-z]*\sL[a-z]*'

text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '

replacing_string = "ML"
In [35]:
result = re.subn(pattern, replacing_string, text)
print(result)
('Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and ML Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and ML. ', 2)

The result is a tuple. The first item is the new text after the replacements, and the second item is the number of replacements made.

In [36]:
type(result)
Out[36]:
tuple
In [37]:
print(len(result))
print(result[0])
print(result[1])
2
Data Science is forcing every business to act differently. The decision making today is far more complex and driven by AI and ML Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and ML. 
2

Some more patterns

'ab*' - a followed by zero or more b
'ab+' - a followed by one or more b
'ab?' - a followed by zero or one b
'ab{3}' - a followed by three b
'ab{2,3}' - a followed by two to three b
'ab*?' - a followed by zero or more b, not greedy
'ab+?' - a followed by one or more b, not greedy
'ab??' - a followed by zero or one b, not greedy
'ab{3}?' - a followed by three b (the ? has no effect on an exact count)
'ab{2,3}?' - a followed by two to three b, not greedy
'[ab]' - either a or b
'a[ab]+' - a followed by one or more a or b
'a[ab]+?' - a followed by one or more a or b, not greedy
'[a-z]+' - sequences of lowercase letters
'[A-Z]+' - sequences of uppercase letters
'[a-zA-Z]+' - sequences of lowercase or uppercase letters
'[A-Z][a-z]+' - one uppercase letter followed by lowercase letters
'a.' - a followed by any one character
'b.' - b followed by any one character
'a.*b' - a followed by anything, ending in b
'a.*?b' - a followed by anything, ending in b, not greedy
r'\d+' - sequence of digits
r'\D+' - sequence of non-digits
r'\s+' - sequence of whitespace
r'\S+' - sequence of non-whitespace
r'\w+' - sequence of word characters (letters, digits, and underscore)
r'\W+' - sequence of non-word characters
'[^-. ]+' - sequences without -, ., or space
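
A few of the repetition patterns listed above can be compared side by side. This is a minimal sketch: the sample string and the selection of patterns are assumptions made for illustration.

```python
import re

text = "abbaabbba"  # illustrative sample

# A hypothetical selection of greedy and non-greedy pairs.
patterns = ["ab*", "ab*?", "ab+", "ab+?", "ab{2,3}", "ab{2,3}?"]

results = {p: re.findall(p, text) for p in patterns}

for pattern, found in results.items():
    print("{!r:12} -> {}".format(pattern, found))
```

Each non-greedy variant stops as soon as its minimum is satisfied, so 'ab*?' finds only the single letter a four times, while 'ab*' also takes every b that follows.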

That's it for this post. Thanks for reading.
