
Python regular expression


May 10, 2021 Python2



Regular expressions are special sequences of characters that make it easy to check whether a string matches a pattern. The re module, included with Python since version 1.5, provides Perl-style regular expression patterns.

The re module gives Python full regular expression functionality.

The compile function builds a regular expression object from a pattern string and optional flag arguments. The object has a set of methods for regular expression matching and substitution.

The re module also provides module-level functions that do the same job as these methods; they take a pattern string as their first argument.
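As a minimal sketch of that equivalence, the snippet below runs the same search once through a compiled pattern object and once through the module-level function. The sample text and variable names are made up for illustration.

```python
import re

text = "order 66 shipped"  # sample input, chosen only for illustration

# Compile once, then call the method on the pattern object.
pattern = re.compile(r'\d+')
via_object = pattern.search(text)

# The module-level function takes the pattern string as its first argument.
via_module = re.search(r'\d+', text)

print(via_object.group())  # 66
print(via_module.group())  # 66
```

Compiling is mainly worthwhile when the same pattern is reused many times.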

This section focuses on regular expression handling functions that are commonly used in Python.


The re.match function

re.match tries to match a pattern at the beginning of the string. If the match at the starting position does not succeed, match() returns None.

Function syntax:

re.match(pattern, string, flags=0)

Description of function parameters:

Parameter   Description
pattern     The regular expression to match
string      The string to match against
flags       Flags that control how the regular expression matches, such as case-insensitive or multi-line matching

On success, re.match returns a match object; otherwise it returns None.

Use the match object's group(num) or groups() methods to get the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing all the subgroup strings, from group 1 up to however many groups the pattern contains.

Example:

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : %s" % matchObj.group())
    print("matchObj.group(1) : %s" % matchObj.group(1))
    print("matchObj.group(2) : %s" % matchObj.group(2))
else:
    print("No match!!")

The example above produces the following output:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
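The following short sketch shows the tuple-returning forms of group() and groups(), reusing the same sentence and pattern. The variable names m and pair are only illustrative.

```python
import re

line = "Cats are smarter than dogs"
m = re.match(r'(.*) are (.*?) .*', line)

# Passing several group numbers returns a tuple of those groups.
pair = m.group(1, 2)
print(pair)          # ('Cats', 'smarter')

# groups() returns every subgroup, starting from group 1.
print(m.groups())    # ('Cats', 'smarter')
```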

The re.search function

re.search scans through the whole string and returns the first match it finds.

Function syntax:

re.search(pattern, string, flags=0)

Description of function parameters:

Parameter   Description
pattern     The regular expression to match
string      The string to search
flags       Flags that control how the regular expression matches, such as case-insensitive or multi-line matching

On success, re.search returns a match object; otherwise it returns None.

Use the match object's group(num) or groups() methods to get the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing all the subgroup strings, from group 1 up to however many groups the pattern contains.

Example:

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

# Test searchObj here; the name matchObj is not defined in this script.
if searchObj:
    print("searchObj.group() : %s" % searchObj.group())
    print("searchObj.group(1) : %s" % searchObj.group(1))
    print("searchObj.group(2) : %s" % searchObj.group(2))
else:
    print("Nothing found!!")

The example above produces the following output:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter

The difference between re.match and re.search

re.match matches only at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.

Example:

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : %s" % matchObj.group())
else:
    print("No match!!")

matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : %s" % matchObj.group())
else:
    print("No match!!")

The example above produces the following output:

No match!!
search --> matchObj.group() : dogs
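One consequence worth a small illustration: the multi-line flag does not change where re.match starts. The sketch below uses a made-up two-line string; re.match stays anchored to the start of the whole string, while re.search with ^ and re.M can match at the start of a later line.

```python
import re

text = "first line\nsecond line"

# re.match is anchored to the start of the string, even with re.M.
at_start = re.match(r'second', text, re.M)
print(at_start)            # None

# re.search with ^ and re.M can match at the start of any line.
found = re.search(r'^second', text, re.M)
print(found.group())       # second
print(found.start())       # 11
```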

Search and replace

Python's re module provides re.sub to replace matches in strings.

Syntax:

re.sub(pattern, repl, string, count=0)

re.sub returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional parameter count is the maximum number of matches to replace. The default value 0 replaces all matches.

Example:

#!/usr/bin/python
import re

phone = "2004-959-559 # This is a foreign phone number"

# Remove the Python-style comment from the string
num = re.sub(r'#.*$', "", phone)
print("Phone number: %s" % num)

# Remove every character that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number: %s" % num)

The example above produces the following output:

Phone number: 2004-959-559
Phone number: 2004959559
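To illustrate the count parameter, here is a minimal sketch that replaces dashes in a phone number. The variable names are illustrative only.

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first (leftmost) match.
first_only = re.sub(r'-', " ", phone, count=1)
print(first_only)    # 2004 959-559

# count=0 (the default) replaces every match.
all_dashes = re.sub(r'-', " ", phone)
print(all_dashes)    # 2004 959 559
```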

Using a function as the repl argument

The following example multiplies every number matched in the string by 2:

Example:

#!/usr/bin/python
import re

# Multiply the matched number by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

The output is:

A46G8HFD1134
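The replacement function is called once per match with the match object and must return the replacement string. As a further hedged sketch (the sample text is made up), a lambda can serve as repl to capitalize the first letter of each word:

```python
import re

title = 'hello regex world'

# \b[a-z] matches a lowercase letter at the start of a word;
# the lambda returns its uppercase form as the replacement.
capitalized = re.sub(r'\b[a-z]', lambda m: m.group().upper(), title)
print(capitalized)   # Hello Regex World
```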

The re.compile function

The compile function compiles a regular expression and produces a regular expression (Pattern) object, whose methods such as match() and search() can then be used.

The syntax format is:

re.compile(pattern[, flags])

Parameters:

  • pattern : A regular expression in the form of a string
  • flags : Optional. Sets the matching mode, such as ignoring case or multi-line mode. The available values are:
    1. re.I ignores case
    2. re.L makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale
    3. re.M enables multi-line mode
    4. re.S makes . match any character, including a line break (without it, . does not match a line break)
    5. re.U makes the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database
    6. re.X ignores whitespace and any comment that follows a #, to make the pattern more readable

Example:

>>> import re
>>> pattern = re.compile(r'\d+')                    # matches one or more digits
>>> m = pattern.match('one12twothree34four')        # tries the start of the string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # starts matching at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # starts matching at '1': a match
>>> print(m)                                        # a Match object is returned
<_sre.SRE_Match object at 0x10a42aac0>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)

Above, when the match succeeds, a Match object is returned, where:

  • The group([group1, ...]) method returns the string matched by one or more groups; to get the entire matched substring, call group() or group(0);
  • The start([group]) method returns the starting position of the group's matched substring within the whole string (the index of the substring's first character); the group argument defaults to 0;
  • The end([group]) method returns the end position of the group's matched substring within the whole string (the index of the substring's last character plus 1); the group argument defaults to 0;
  • The span([group]) method returns (start(group), end(group)).
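A small sketch of how these positions relate to slicing, using the same sample string as above. Because end() is one past the last matched character, slicing with start() and end() reproduces the matched text.

```python
import re

s = 'one12twothree34four'
m = re.search(r'\d+', s)

# span() is the pair (start(), end()).
print(m.span())                    # (3, 5)

# end() is exclusive, so this slice equals m.group().
print(s[m.start():m.end()])        # 12
```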

Look at another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # match succeeds; a Match object is returned
<_sre.SRE_Match object at 0x10bea83e8>
>>> m.group(0)                            # the entire matched substring
'Hello World'
>>> m.span(0)                             # the indexes of the entire matched substring
(0, 11)
>>> m.group(1)                            # the substring matched by the first group
'Hello'
>>> m.span(1)                             # the indexes of the substring matched by the first group
(0, 5)
>>> m.group(2)                            # the substring matched by the second group
'World'
>>> m.span(2)                             # the indexes of the substring matched by the second group
(6, 11)
>>> m.groups()                            # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

findall

Finds all substrings in the string that the regular expression matches and returns them as a list; if nothing matches, it returns an empty list.

Note: match and search return a single match, whereas findall returns all of them.

The syntax format, as a method of a compiled pattern object, is:

findall(string[, pos[, endpos]])

Parameters:

  • string : The string to be matched.
  • pos : Optional parameter that specifies the starting position in the string; defaults to 0.
  • endpos : Optional parameter that specifies the end position in the string; defaults to the length of the string.

Find all the numbers in the string:

# -*- coding:UTF8 -*-
 
import re
 
pattern = re.compile(r'\d+')   # find digits
result1 = pattern.findall('school 123 google 456')
result2 = pattern.findall('sch88ool123google456', 0, 10) 
print(result1)
print(result2)

Output:

['123', '456']
['88', '12']
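One detail the example above does not show: when the pattern contains groups, findall returns the group contents rather than the whole match. The following sketch uses the module-level re.findall and a made-up input string.

```python
import re

text = 'width=100, height=250'

# No groups: findall returns the full matches.
full = re.findall(r'\w+=\d+', text)
print(full)     # ['width=100', 'height=250']

# Two groups: findall returns one tuple of group values per match.
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)    # [('width', '100'), ('height', '250')]
```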

re.finditer

Similar to findall, finditer finds all substrings in the string that the regular expression matches, but returns them as an iterator of match objects.

re.finditer(pattern, string, flags=0)

Parameters:

Parameter   Description
pattern     The regular expression to match
string      The string to search
flags       Flags that control how the regular expression matches, such as case-insensitive or multi-line matching. See also: Regular expression modifier - optional flag

Example:

# -*- coding: UTF-8 -*-

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output:

12
32
43
3
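Because finditer yields match objects rather than plain strings, each result also carries its position. A minimal sketch on the same input (the name spans is illustrative):

```python
import re

text = "12a32bc43jf3"

# Collect the matched text together with its (start, end) span.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(spans)
# [('12', (0, 2)), ('32', (3, 5)), ('43', (7, 9)), ('3', (11, 12))]
```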

re.split

The split function splits a string at each match of the pattern and returns the pieces as a list. It has the following form:

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters:

Parameter   Description
pattern     The regular expression to match
string      The string to split
maxsplit    Maximum number of splits; maxsplit=1 splits once. The default 0 means no limit.
flags       Flags that control how the regular expression matches, such as case-insensitive or multi-line matching. See also: Regular expression modifier - optional flag

Example:

>>> import re
>>> re.split(r'\W+', 'w3cschool, w3cschool, w3cschool.')
['w3cschool', 'w3cschool', 'w3cschool', '']
>>> re.split(r'(\W+)', ' w3cschool, w3cschool, w3cschool.')
['', ' ', 'w3cschool', ', ', 'w3cschool', ', ', 'w3cschool', '.', '']
>>> re.split(r'\W+', ' w3cschool, w3cschool, w3cschool.', 1)
['', 'w3cschool, w3cschool, w3cschool.']

>>> re.split(r'a+', 'hello world')   # the pattern matches nowhere in the string, so split leaves it in one piece
['hello world']



Regular expression modifier - optional flag

Regular expressions can take optional flag modifiers that control how matching works. Modifiers are passed as an optional flags argument. Multiple flags can be combined with the bitwise OR operator (|); for example, re.I | re.M sets both the I and M flags:

Modifier   Description
re.I       Makes the match case-insensitive
re.L       Makes matching locale-aware
re.M       Multi-line matching; affects ^ and $
re.S       Makes . match all characters, including line breaks
re.U       Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B.
re.X       Allows a more flexible format so that you can write regular expressions that are easier to understand.
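The effect of the most common flags can be sketched as follows. The sample text and the date pattern are made up for illustration; only the flags themselves come from the table above.

```python
import re

text = "Python\npython\nPYTHON"

# re.I alone: case-insensitive, but ^ and $ still refer to the whole string.
only_i = re.findall(r'^python$', text, re.I)
print(only_i)        # []

# re.I | re.M: ^ and $ also match at the start and end of each line.
i_and_m = re.findall(r'^python$', text, re.I | re.M)
print(i_and_m)       # ['Python', 'python', 'PYTHON']

# Without re.S the dot does not match the line break; with re.S it does.
no_dotall = re.search(r'Python.python', text)
with_dotall = re.search(r'Python.python', text, re.S)
print(no_dotall)                  # None
print(with_dotall is not None)    # True

# re.X ignores whitespace and allows # comments inside the pattern.
date = re.compile(r'''
    (\d{4})    # year
    -(\d{2})   # month
    -(\d{2})   # day
''', re.X)
parts = date.match('2021-05-10').groups()
print(parts)         # ('2021', '05', '10')
```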

Regular expression pattern

Pattern strings use a special syntax to represent a regular expression:

Letters and digits represent themselves: a letter or digit in a regular expression pattern matches the same character in the string.

Most letters and digits take on a different meaning when preceded by a backslash.

Punctuation marks match themselves only when they are escaped; otherwise they have a special meaning.

The backslash itself must be escaped with another backslash.

Because regular expressions usually contain backslashes, it is best to write them as raw strings. Pattern elements such as r'\t' (equivalent to '\\t') match the corresponding special characters.
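A short sketch of what the raw-string prefix changes. The sample strings are made up for illustration.

```python
import re

# A raw string keeps the backslash, so both literals are the same
# two-character pattern: a backslash followed by the letter t.
same_pattern = (r'\t' == '\\t')
print(same_pattern)               # True

# The regex engine reads that pattern as "a tab character".
tab_match = re.search(r'\t', 'a\tb')
print(tab_match is not None)      # True

# To match a literal backslash, the pattern needs an escaped backslash.
path = 'C:\\temp'                 # the text C:\temp
backslashes = re.findall(r'\\', path)
print(len(backslashes))           # 1
```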

The following table lists special elements in the regular expression pattern syntax. If you use patterns while providing optional flag parameters, the meaning of some pattern elements changes.

Pattern        Description
^              Matches the beginning of the string
$              Matches the end of the string
.              Matches any character except a line break; when the re.DOTALL flag is specified, it matches any character including a line break
[...]          Denotes a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'
[^...]         Matches any character that is not in the set: [^abc] matches any character other than a, b, or c
re*            Matches 0 or more occurrences of the preceding expression
re+            Matches 1 or more occurrences of the preceding expression
re?            Matches 0 or 1 occurrence of the preceding expression. Placed after another quantifier (as in *?, +?, ??), it makes that quantifier non-greedy
re{n}          Matches exactly n occurrences of the preceding expression
re{n,}         Matches n or more occurrences of the preceding expression
re{n, m}       Matches from n to m occurrences of the preceding expression, greedily
a|b            Matches a or b
(re)           Matches the expression in parentheses and also marks it as a group
(?imx)         The regular expression turns on the optional i, m, or x flags. Only the area in parentheses is affected.
(?-imx)        The regular expression turns off the optional i, m, or x flags. Only the area in parentheses is affected.
(?: re)        Like (...), but does not create a group
(?imx: re)     Uses the optional i, m, or x flags within the parentheses
(?-imx: re)    Does not use the optional i, m, or x flags within the parentheses
(?#...)        Comment
(?= re)        Positive lookahead assertion. Succeeds if the contained expression matches at the current position and fails otherwise. The matching engine does not advance: once the contained expression has been tried, the rest of the pattern is matched from the same position.
(?! re)        Negative lookahead assertion. The opposite of the positive assertion: succeeds when the contained expression does not match at the current position of the string
(?> re)        Independent pattern that matches without backtracking
\w             Matches letters, digits, and the underscore
\W             Matches any character that is not a letter, digit, or underscore
\s             Matches any whitespace character, equivalent to [ \t\n\r\f]
\S             Matches any non-whitespace character
\d             Matches any digit, equivalent to [0-9]
\D             Matches any non-digit
\A             Matches the start of the string
\Z             Matches the end of the string; if a line break exists at the end, matches just before that line break
\z             Matches the end of the string
\G             Matches the position where the last match finished
\b             Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'.
\B             Matches a non-word boundary. For example, 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never'.
\n, \t, etc.   Match a line break, a tab, and so on
\1...\9        Matches the content of the nth group
\10            Matches the content of the nth group if that group has matched. Otherwise, it refers to the octal character code.
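A few of the table entries are easier to grasp in action. The sketch below, with made-up sample strings, covers greedy versus non-greedy repetition, counted repetition, lookahead assertions, a backreference, and word boundaries.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible.
greedy = re.findall(r'<.*>', html)
print(greedy)      # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first possible '>'.
lazy = re.findall(r'<.*?>', html)
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']

# {n,m}: between two and three digits at a time.
counted = re.findall(r'\d{2,3}', '1 22 333 4444')
print(counted)     # ['22', '333', '444']

# Positive lookahead: 'Python' only when followed by '3'; the '3' is not consumed.
ahead = re.search(r'Python(?=3)', 'Python2 Python3')
print(ahead.span())        # (8, 14)

# Negative lookahead: 'Python' only when NOT followed by '3'.
not_ahead = re.search(r'Python(?!3)', 'Python2 Python3')
print(not_ahead.span())    # (0, 6)

# Backreference: \1 must repeat whatever group 1 matched.
repeated = re.search(r'(\w+) \1', 'it is is fine')
print(repeated.group())    # is is

# Word boundaries: 'er' at the end of a word vs. inside a word.
at_boundary = re.search(r'er\b', 'never verb')
inside_word = re.search(r'er\B', 'never verb')
print(at_boundary.start())   # 3
print(inside_word.start())   # 7
```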

Regular expression examples

Character matching

Example   Description
python    Matches "python".

Character classes

Example        Description
[Pp]ython      Matches "Python" or "python"
rub[ye]        Matches "ruby" or "rube"
[aeiou]        Matches any one of the letters in the brackets
[0-9]          Matches any digit. Same as [0123456789]
[a-z]          Matches any lowercase letter
[A-Z]          Matches any uppercase letter
[a-zA-Z0-9]    Matches any letter or digit
[^aeiou]       Matches any character other than a, e, i, o, u
[^0-9]         Matches any character other than a digit
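The character classes above can be tried out directly with re.findall. The input strings in this sketch are made up for illustration.

```python
import re

either_case = re.findall(r'[Pp]ython', 'Python and python')
print(either_case)     # ['Python', 'python']

ruby_rube = re.findall(r'rub[ye]', 'ruby rube rubs')
print(ruby_rube)       # ['ruby', 'rube']

vowels = re.findall(r'[aeiou]', 'regex')
print(vowels)          # ['e', 'e']

non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)      # ['a', 'b']

alnum_runs = re.findall(r'[a-zA-Z0-9]+', 'abc-123')
print(alnum_runs)      # ['abc', '123']
```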

Special character classes

Example   Description
.         Matches any single character other than "\n". To match any character including '\n', use the re.S (DOTALL) flag.
\d        Matches a digit character. Equivalent to [0-9].
\D        Matches a non-digit character. Equivalent to [^0-9].
\s        Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S        Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w        Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W        Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
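As a rough check of these shorthand classes, the sketch below compares them with explicit bracket sets on a made-up ASCII sample. The comparison assumes ASCII input; on Unicode strings the shorthand classes can match more characters than the bracket sets.

```python
import re

sample = 'id_42 = x7\n'   # made-up ASCII sample text

digits = re.findall(r'\d', sample)
print(digits)                                          # ['4', '2', '7']
print(digits == re.findall(r'[0-9]', sample))          # True

words = re.findall(r'\w+', sample)
print(words)                                           # ['id_42', 'x7']
print(words == re.findall(r'[A-Za-z0-9_]+', sample))   # True

spaces = re.findall(r'\s', sample)
print(len(spaces))                                     # 3

# The dot skips the line break unless re.S is given.
dot_default = re.findall(r'.', 'a\nb')
dot_all = re.findall(r'.', 'a\nb', re.S)
print(dot_default)                                     # ['a', 'b']
print(len(dot_all))                                    # 3
```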