Python 3 regular expressions

A regular expression is a special sequence of characters that lets you check whether a string matches a pattern.

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.

The re module gives Python full regular expression functionality.

The re.compile function builds a regular expression object from a pattern string and optional flag parameters. The object has a set of methods for regular expression matching and substitution.

The re module also provides module-level functions that do the same work as these methods; they take a pattern string as their first argument.
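The two styles described above can be compared side by side. The following is a minimal sketch; the pattern and the variable names are chosen only for illustration.

```python
import re

text = "Cats are smarter than dogs"

# Compile the pattern once, then call methods on the pattern object.
word_re = re.compile(r'[a-z]+', re.I)
compiled_result = word_re.match(text).group()

# The module-level function does the same job; it takes the pattern
# string as its first argument.
module_result = re.match(r'[a-z]+', text, re.I).group()

print(compiled_result)  # Cats
print(module_result)    # Cats
```

Compiling is mostly worthwhile when the same pattern is reused many times.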

This section focuses on regular expression handling functions that are commonly used in Python.


The re.match function

re.match tries to match a pattern at the beginning of the string. If the pattern does not match at the starting position, match() returns None.

Function syntax:

re.match(pattern, string, flags=0)

Description of function parameters:

Parameter  Description
pattern    The regular expression to match
string     The string to match against
flags      Flags that control how the regular expression matches, such as case sensitivity and multi-line matching

On success, re.match returns a match object; otherwise it returns None.

Use the match object's group(num) or groups() method to retrieve the matched text.

Match object method  Description
group(num=0)         Returns the string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()             Returns a tuple containing all the group strings, from group 1 up to the last group in the pattern.
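As a short sketch of the two methods in the table, the snippet below calls group() with no argument, with one group number, and with several, then calls groups(). The pattern is an illustrative assumption.

```python
import re

m = re.match(r'(\w+) are (\w+)', "Cats are smarter than dogs")

whole = m.group()         # same as m.group(0): the entire match
first = m.group(1)        # text captured by the first group
pair = m.group(1, 2)      # several group numbers give a tuple
all_groups = m.groups()   # every group, starting from group 1

print(whole)        # Cats are smarter
print(first)        # Cats
print(pair)         # ('Cats', 'smarter')
print(all_groups)   # ('Cats', 'smarter')
```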

Example 1:

#!/usr/bin/python3

import re
print(re.match('www', 'www.w3cschool.cn').span())  # matches at the start of the string
print(re.match('cn', 'www.w3cschool.cn'))          # does not match at the start of the string

The example above prints:

(0, 3)
None

Example 2:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")

The example above prints:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

The re.search function

re.search scans the entire string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)

Description of function parameters:

Parameter  Description
pattern    The regular expression to match
string     The string to match against
flags      Flags that control how the regular expression matches, such as case sensitivity and multi-line matching

On success, re.search returns a match object; otherwise it returns None.

Use the match object's group(num) or groups() method to retrieve the matched text.

Match object method  Description
group(num=0)         Returns the string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()             Returns a tuple containing all the group strings, from group 1 up to the last group in the pattern.
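Because re.search returns None when nothing matches, calling span() or group() directly on the result raises AttributeError in that case. One way to guard against this is sketched below; find_span is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = re.search(pattern, text)
    # Test the result before calling match-object methods on it.
    return m.span() if m else None

found = find_span('cn', 'www.w3cschool.cn')
missing = find_span('org', 'www.w3cschool.cn')

print(found)    # (14, 16)
print(missing)  # None
```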

Example 1:

#!/usr/bin/python3

import re

print(re.search('www', 'www.w3cschool.cn').span())  # matches at the start of the string
print(re.search('cn', 'www.w3cschool.cn').span())   # matches later in the string

The example above prints:

(0, 3)
(14, 16)

Example 2:

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
   print ("searchObj.group() : ", searchObj.group())
   print ("searchObj.group(1) : ", searchObj.group(1))
   print ("searchObj.group(2) : ", searchObj.group(2))
else:
   print ("Nothing found!!")

The example above prints:

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter

The difference between re.match and re.search

re.match matches only at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.

Example:

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print ("match --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")

matchObj = re.search( r'dogs', line, re.M|re.I)
if matchObj:
   print ("search --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")

The example above prints:

No match!!
search --> matchObj.group() :  dogs
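The difference also holds in multi-line mode: re.match still tries only the very start of the string, while re.search with ^ and re.M can match at the start of any line. A minimal sketch, with a sample string made up for illustration:

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0 of the string, even with re.M.
match_result = re.match(r'second', text, re.M)

# re.search with ^ and re.M can match at the start of a later line.
search_result = re.search(r'^second', text, re.M)

print(match_result)           # None
print(search_result.span())   # (11, 17)
```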

Retrieval and replacement

Python's re module provides re.sub to replace matches in strings.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional parameter count is the maximum number of matches to replace. The default value, 0, replaces all matches.
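The effect of count can be seen in a small sketch; the sample string is an assumption used only for illustration.

```python
import re

phone = "2004-959-559"

# Default count=0: every match is replaced.
all_replaced = re.sub(r'-', '', phone)

# count=1: only the leftmost match is replaced.
first_only = re.sub(r'-', '', phone, count=1)

# No match: the string comes back unchanged.
unchanged = re.sub(r'x', '', phone)

print(all_replaced)  # 2004959559
print(first_only)    # 2004959-559
print(unchanged)     # 2004-959-559
```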

Example:

#!/usr/bin/python3

import re

phone = "2004-959-559 # This is a phone number"

# Remove the comment
num = re.sub(r'#.*$', "", phone)
print ("Phone number : ", num)

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print ("Phone number : ", num)

The example above prints:

Phone number :  2004-959-559 
Phone number :  2004959559

Regular expression modifiers: optional flags

Regular expressions can take optional flag modifiers that control how matching works. A modifier is passed as the optional flags argument. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I and M flags:

Modifier  Description
re.I      Makes the match case-insensitive
re.L      Makes the match locale-aware (in Python 3, allowed only with bytes patterns)
re.M      Multi-line matching; affects ^ and $
re.S      Makes . match any character, including a newline
re.U      Interprets characters according to the Unicode character set. This flag affects \w, \W, \b and \B. It is the default for str patterns in Python 3.
re.X      Allows a more flexible layout, with whitespace and comments inside the pattern, so that regular expressions are easier to read.
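The flags in the table can be tried out as follows. This is a sketch under a few assumptions: the sample strings are invented, and re.findall (a real re function that this section does not otherwise cover) is used only to list every match compactly.

```python
import re

text = "Cats are smarter\nthan DOGS"

# re.I: ignore case.
ci = re.search(r'dogs', text, re.I).group()

# re.M: ^ and $ also match at line boundaries.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: . also matches a newline.
without_s = re.search(r'smarter.than', text)
with_s = re.search(r'smarter.than', text, re.S).group()

# Several flags combined with |.
combined = re.findall(r'^[a-z]+$', "one\nTWO", re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored.
number_re = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", re.X)
verbose = number_re.search("pi is 3.14").group()

print(ci)           # DOGS
print(line_starts)  # ['Cats', 'than']
print(without_s)    # None
print(repr(with_s)) # 'smarter\nthan'
print(combined)     # ['one', 'TWO']
print(verbose)      # 3.14
```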

Regular expression pattern

Pattern strings use a special syntax to represent a regular expression:

Letters and digits stand for themselves: a letter or digit in a pattern matches the same character in the string.

Most letters and digits take on a different meaning when preceded by a backslash.

Many punctuation characters have a special meaning; they match themselves literally only when escaped with a backslash.

The backslash itself must be escaped with another backslash.

Because regular expressions usually contain backslashes, it is best to write them as raw strings. A pattern element such as r'\t' (equivalent to '\\t') matches the corresponding special character.
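The raw-string equivalence mentioned above can be checked directly. The snippet is a small sketch; the sample strings are assumptions for illustration.

```python
import re

# r'\t' and '\\t' spell the same two characters: a backslash and a t.
same_spelling = (r'\t' == '\\t')
pattern_length = len(r'\t')

# The regex engine reads those two characters as "tab".
tab_match = re.search(r'\t', "a\tb")

# A literal backslash is matched by an escaped backslash.
backslash_match = re.search(r'\\', 'C:\\temp')

print(same_spelling)            # True
print(pattern_length)           # 2
print(tab_match.span())         # (1, 2)
print(backslash_match.span())   # (2, 3)
```

Without the r prefix, the last pattern would have to be written as '\\\\'.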

The following table lists the special elements of regular expression pattern syntax. If you pass optional flag arguments along with a pattern, the meaning of some pattern elements changes.

Pattern      Description
^            Matches the beginning of the string
$            Matches the end of the string
.            Matches any character except a newline; when the re.DOTALL flag is specified, it matches any character, including a newline
[...]        Represents a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'
[^...]       Matches any character not in the set: [^abc] matches any character other than a, b or c
re*          Matches 0 or more occurrences of the preceding expression
re+          Matches 1 or more occurrences of the preceding expression
re?          Matches 0 or 1 occurrence of the preceding expression. Placed after another quantifier, as in re*? or re+?, it makes that quantifier non-greedy
re{n}        Matches exactly n occurrences of the preceding expression. For example, "o{2}" does not match the "o" in "Bob", but it does match the two o's in "food"
re{n,}       Matches at least n occurrences of the preceding expression. For example, "o{2,}" cannot match the "o" in "Bob", but it can match all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*"
re{n,m}      Matches n to m occurrences of the preceding expression, greedily
a|b          Matches a or b
(re)         Matches the expression in parentheses and captures it as a group
(?imx)       Turns on the optional i, m, or x flags. In Python's re the flags apply to the whole pattern, and the construct belongs at the start of the pattern
(?-imx)      Turns off the optional i, m, or x flags. Python's re accepts flag removal only in the scoped form (?-imx:re)
(?:re)       Like (...), but does not create a group
(?imx:re)    Turns on the i, m, or x flags inside the parentheses only
(?-imx:re)   Turns off the i, m, or x flags inside the parentheses only
(?#...)      Comment
(?=re)       Positive lookahead assertion. Succeeds if the contained expression matches at the current position and fails otherwise. The assertion consumes no characters: once the contained expression has been tried, the matching engine does not advance
(?!re)       Negative lookahead assertion. The opposite of positive lookahead: it succeeds when the contained expression does not match at the current position
(?>re)       Atomic group: matches the contained expression independently, without backtracking into it (supported by Python's re from version 3.11)
\w           Matches a word character: a letter, a digit or an underscore
\W           Matches a non-word character
\s           Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text
\S           Matches any non-whitespace character
\d           Matches any digit; equivalent to [0-9] for ASCII text
\D           Matches any non-digit
\A           Matches at the start of the string
\Z           Matches at the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end
\z           Matches at the very end of the string (Perl-style; in Python's re, \Z plays this role)
\G           Matches at the position where the previous match ended (Perl-style; not supported by Python's re)
\b           Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in 'never', but not the 'er' in 'verb'
\B           Matches a non-word boundary. 'er\B' matches the 'er' in 'verb', but not the 'er' in 'never'
\n, \t, etc. Match a newline, a tab, and so on
\1...\9      Matches the content of the nth captured group
\10          Matches the content of group 10 if that group exists; otherwise Perl-style engines read it as an octal character code
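A few rows of the table are easier to follow with concrete input. The sketch below covers greedy and non-greedy quantifiers, counted repetition, word boundaries, lookahead, backreferences and non-capturing groups; all sample strings are assumptions chosen for illustration.

```python
import re

# Greedy vs non-greedy: .* takes as much as possible, .*? as little.
html = "<b>bold</b>"
greedy = re.match(r'<.*>', html).group()
lazy = re.match(r'<.*?>', html).group()

# Counted repetition.
exact = re.search(r'o{2}', "food").group()
at_least = re.search(r'o{2,}', "foooood").group()
too_few = re.search(r'o{2}', "Bob")

# Word boundary.
boundary = re.search(r'er\b', "never").span()
no_boundary = re.search(r'er\b', "verb")

# Lookahead checks what follows without consuming it.
lookahead = re.search(r'\w+(?=\.cn)', "www.w3cschool.cn").group()

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.search(r'(\w)\1', "hello").group()

# Non-capturing group: (?:ab) groups without creating a group number.
captured = re.match(r'(?:ab)+(c)', "ababc").groups()

print(greedy)       # <b>bold</b>
print(lazy)         # <b>
print(exact)        # oo
print(at_least)     # ooooo
print(too_few)      # None
print(boundary)     # (3, 5)
print(no_boundary)  # None
print(lookahead)    # w3cschool
print(doubled)      # ll
print(captured)     # ('c',)
```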

Regular expression examples

Character matching

Example  Description
python   Matches "python"

Character classes

Example       Description
[Pp]ython     Matches "Python" or "python"
rub[ye]       Matches "ruby" or "rube"
[aeiou]       Matches any one of the letters inside the brackets
[0-9]         Matches any digit; the same as [0123456789]
[a-z]         Matches any lowercase letter
[A-Z]         Matches any uppercase letter
[a-zA-Z0-9]   Matches any letter or digit
[^aeiou]      Matches any character other than a, e, i, o or u
[^0-9]        Matches any character other than a digit
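The character classes above can be exercised with a short sketch. The sample strings are invented, and re.findall (a real re function not otherwise covered in this section) is used only to list every match.

```python
import re

names = re.findall(r'[Pp]ython', "Python and python")
rub = re.findall(r'rub[ye]', "ruby rube rubx")
vowels = re.findall(r'[aeiou]', "regular")
digits = re.findall(r'[0-9]', "w3cschool 2004")
non_vowels = re.findall(r'[^aeiou]', "regex")

print(names)       # ['Python', 'python']
print(rub)         # ['ruby', 'rube']
print(vowels)      # ['e', 'u', 'a']
print(digits)      # ['3', '2', '0', '0', '4']
print(non_vowels)  # ['r', 'g', 'x']
```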

Special character classes

Example  Description
.        Matches any single character other than "\n". To match any character, including '\n', use the re.S flag or a pattern like '[\s\S]'
\d       Matches a digit. Equivalent to [0-9]
\D       Matches a non-digit character. Equivalent to [^0-9]
\s       Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v]
\S       Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]
\w       Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'
\W       Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'
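The shorthand classes can be checked the same way. This is a sketch with an invented sample string; re.findall is again used only for compact output. Note that with str patterns in Python 3 these classes are Unicode-aware, so the bracketed equivalents in the table hold for ASCII text.

```python
import re

text = "tel: 2004-959-559\n"

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)
parts = re.findall(r'\S+', text)
non_word = re.findall(r'\W', "a_b-c")   # the underscore is a word character

# . does not match the trailing newline unless re.S is given.
dot = re.search(r'559.', text)
dot_all = re.search(r'559.', text, re.S).group()

print(digits)         # ['2004', '959', '559']
print(words)          # ['tel', '2004', '959', '559']
print(parts)          # ['tel:', '2004-959-559']
print(non_word)       # ['-']
print(dot)            # None
print(repr(dot_all))  # '559\n'
```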