
Regular Expressions - Syntax


May 28, 2021 Regular expression





A regular expression describes a string-matching pattern. It can be used to check whether a string contains a given substring, to replace a matching substring, or to extract from a string the substring that meets a given condition.

  • When you list a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * there means something different from * in a regular expression.
  • Regular expressions are constructed in the same way as arithmetic expressions. That is, small expressions are combined using various meta-characters and operators to create larger expressions. The components of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these.

Regular expressions are text patterns that consist of ordinary characters, such as the letters a through z, and special characters, called meta-characters. A pattern describes one or more strings to match when searching text. The regular expression acts as a template that is matched against the string being searched.

Normal characters

Normal characters include all printable and non-printable characters that are not explicitly specified as meta-characters. This includes all capital and lowercase letters, all numbers, all punctuation marks, and some other symbols.

Non-printing characters

Non-printing characters can also be part of a regular expression. The following table lists the escape sequences that represent non-printing characters:

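Character Description
\cx Matches the control character indicated by x. For example, \cM matches Control-M, or carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f Matches a form feed. Equivalent to \x0c and \cL.
\n Matches a newline. Equivalent to \x0a and \cJ.
\r Matches a carriage return. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab. Equivalent to \x09 and \cI.
\v Matches a vertical tab. Equivalent to \x0b and \cK.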
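
As a quick check of the equivalences in the table, the following sketch assumes a JavaScript engine (the sample strings are made up for illustration). It shows three spellings of the carriage return matching the same character, and \s and \S splitting a string into whitespace and non-whitespace.

```javascript
// \cM, \r, and \x0d all denote the carriage-return character.
const cr = "\r";
const crMatches = [/\cM/.test(cr), /\r/.test(cr), /\x0d/.test(cr)];
console.log(crMatches); // [ true, true, true ]

// \s matches any whitespace character; \S matches any non-whitespace character.
const sample = "a\tb\nc d"; // hypothetical input: tab, newline, and space
const collapsed = sample.replace(/\s/g, "_");
console.log(collapsed); // a_b_c_d

const nonSpaceCount = sample.match(/\S/g).length;
console.log(nonSpaceCount); // 4
```

Note that in JavaScript \s also covers some Unicode space characters beyond the six listed in the table.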

Special characters

Special characters are characters that carry a special meaning, such as the * in the "*.txt" example above, which, simply put, stands for any string. If you want to find a file whose name contains a literal *, you need to escape the *, that is, put a \ in front of it: ls \*.txt.

Many meta-characters require special treatment when you try to match them. To match one of these special characters, you must first "escape" it, that is, place a backslash (\) in front of it. The following table lists the special characters in regular expressions:

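Special character Description
$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( ) Marks the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
* Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ Matches the preceding subexpression one or more times. To match the + character, use \+.
. Matches any single character except the newline \n. To match ., use \. .
[ Marks the start of a bracket expression. To match [, use \[.
? Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\ Marks the next character as a special character, a literal character, a back-reference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^ Matches the position at the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{ Marks the start of a quantifier expression. To match {, use \{.
| Indicates a choice between two items. To match |, use \|.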
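
To illustrate the difference between a meta-character and its escaped form, here is a small sketch in JavaScript. The file names and the price string are hypothetical inputs chosen only for the example.

```javascript
// Unescaped, "." matches any single character except a line break.
const dotMatchesLetter = /a.c/.test("abc");
const dotMatchesNewline = /a.c/.test("a\nc");
console.log(dotMatchesLetter, dotMatchesNewline); // true false

// Escaped, "\." matches only a literal period.
const literalDot = /\.txt$/.test("notes.txt");
const underscoreInstead = /\.txt$/.test("notes_txt");
console.log(literalDot, underscoreInstead); // true false

// "\$" and "\*" match the literal "$" and "*" characters.
const price = "cost: $5 * 2".match(/\$\d+ \* \d+/)[0];
console.log(price); // $5 * 2
```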

Quantifiers

Quantifiers specify how many times a given component of a regular expression must appear for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

The quantifiers for regular expressions are:

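Character Description
* Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.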
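
The examples in the table can be checked with a short JavaScript sketch. The ^ and $ around the zo patterns are an addition made here so that the whole test string has to match; they are not part of the table's patterns.

```javascript
const words = ["z", "zo", "zoo"];

// zo* allows zero or more o's; zo+ requires at least one.
const zoStar = words.map((s) => /^zo*$/.test(s));
const zoPlus = words.map((s) => /^zo+$/.test(s));
console.log(zoStar); // [ true, true, true ]
console.log(zoPlus); // [ false, true, true ]

// ? makes the group optional.
const doesMatch = "does".match(/do(es)?/)[0];
const doxyMatch = "doxy".match(/do(es)?/)[0];
console.log(doesMatch, doxyMatch); // does do

// Counted repetition.
const bobHasTwoOs = /o{2}/.test("Bob");
const foodOs = "food".match(/o{2}/)[0];
const allOs = "foooood".match(/o{2,}/)[0];
const firstThreeOs = "fooooood".match(/o{1,3}/)[0];
console.log(bobHasTwoOs, foodOs, allOs, firstThreeOs); // false oo ooooo ooo
```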

Because chapter numbers are likely to exceed nine in large input documents, you need a way to handle two- or three-digit chapter numbers. Quantifiers give you this ability. The following regular expression matches a chapter title numbered with any number of digits:

/Chapter [1-9][0-9]*/

Note that the quantifier appears after the range expression. It therefore applies to the entire range expression, which in this case specifies only digits from 0 through 9 inclusive.

The + quantifier is not used here because a digit is not necessarily required in the second or later position. The ? character is not used either, because it would limit the chapter number to just two digits. You need to match at least one digit after "Chapter" and the space character.

If you know that the chapter number is limited to only 99 chapters, you can use the following expression to specify at least one but up to two digits.

/Chapter [0-9]{1,2}/

The disadvantage of the above expression is that a chapter number larger than 99 still matches, but only on its first two digits. Another drawback is that Chapter 0 also matches. A better expression, which matches at most two digits and rejects a leading zero, is as follows:

/Chapter [1-9][0-9]?/

Or

/Chapter [1-9][0-9]{0,1}/
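
The behaviour of the three chapter patterns can be compared side by side. This is a minimal sketch in JavaScript; the variable names and the sample titles are chosen here for illustration.

```javascript
const anyDigits = /Chapter [1-9][0-9]*/;  // any number of digits, no leading zero
const oneOrTwo = /Chapter [0-9]{1,2}/;    // one or two digits, leading zero allowed
const upTo99 = /Chapter [1-9][0-9]?/;     // one or two digits, no leading zero

const longNumber = "Chapter 123";
const fullNumber = longNumber.match(anyDigits)[0];
const truncated = longNumber.match(oneOrTwo)[0];
console.log(fullNumber); // Chapter 123
console.log(truncated);  // Chapter 12

// Chapter 0 slips through the {1,2} version but not the [1-9] version.
const zeroLoose = oneOrTwo.test("Chapter 0");
const zeroStrict = upTo99.test("Chapter 0");
console.log(zeroLoose, zeroStrict); // true false

const singleDigit = "Chapter 7".match(upTo99)[0];
console.log(singleDigit); // Chapter 7
```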

The *, +, and ? quantifiers are all greedy: they match as much text as possible. Placing a ? after one of them makes the match non-greedy, or minimal.

For example, you might search an HTML document for chapter titles that are enclosed in the H1 tag. The text is as follows in your document:

<H1>Chapter 1 – Introduction to Regular Expressions</H1>

The following expression matches everything from the opening less-than symbol (<) to the greater-than symbol (>) that ends the closing H1 tag.

/<.*>/

If you only need to match the opening H1 tag, the following "non-greedy" expression matches only <H1>.

/<.*?>/

Placing a ? after the *, +, or ? quantifier converts the expression from a greedy match to a "non-greedy", or minimal, match.
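
The two expressions can be run against the heading line from the example. A minimal JavaScript sketch, with variable names chosen here:

```javascript
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: .* runs to the last ">" in the string.
const greedy = heading.match(/<.*>/)[0];
console.log(greedy === heading); // true

// Non-greedy: .*? stops at the first ">".
const nonGreedy = heading.match(/<.*?>/)[0];
console.log(nonGreedy); // <H1>
```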

Anchors

Anchors enable you to pin a regular expression to the beginning or end of a line. They also enable you to create regular expressions that match within a word, at the beginning of a word, or at the end of a word.

Anchors describe the boundaries of a string or a word: ^ and $ refer to the beginning and end of a string respectively, \b describes the front or back boundary of a word, and \B denotes a non-word boundary.

The anchors for regular expressions are:

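Character Description
^ Matches the position at the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a non-word boundary.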

Note: Quantifiers cannot be used with anchors. Because there cannot be more than one position immediately before or after a line break or word boundary, expressions such as ^* are not allowed.

To match text at the beginning of a line, use the ^ character at the start of the regular expression. Don't confuse this use of ^ with its use inside a bracket expression.

To match the text at the end of a line of text, use the $ character at the end of the regular expression.

To use anchors when searching for chapter titles, the following regular expression matches a chapter title that has at most two trailing digits and appears at the beginning of the line:

/^Chapter [1-9][0-9]{0,1}/

A real chapter title not only appears at the beginning of a line; it is also the only text on that line, so it appears at both the beginning and the end of the same line. The following expression ensures that the match finds only chapter titles and not cross-references. It does so by anchoring the pattern to both the beginning and the end of a line of text.

/^Chapter [1-9][0-9]{0,1}$/
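
The effect of each anchor can be seen with a few test lines. This sketch is in JavaScript; the sample lines are hypothetical, and the last example adds the m (multiline) flag, which is an assumption made here so that ^ and $ apply to each line of a multi-line string.

```javascript
const atStart = /^Chapter [1-9][0-9]{0,1}/;
const wholeLine = /^Chapter [1-9][0-9]{0,1}$/;

// ^ alone rejects a cross-reference but still accepts trailing text.
const startTitle = atStart.test("Chapter 12 Summary");
const startCrossRef = atStart.test("See Chapter 12");
console.log(startTitle, startCrossRef); // true false

// ^ and $ together require the title to be the only text on the line.
const exactTitle = wholeLine.test("Chapter 12");
const trailingText = wholeLine.test("Chapter 12 Summary");
const threeDigits = wholeLine.test("Chapter 123");
console.log(exactTitle, trailingText, threeDigits); // true false false

// With the g and m flags, the anchors are checked at every line break.
const text = "Intro\nChapter 3\nSee Chapter 4";
const titles = text.match(/^Chapter [1-9][0-9]{0,1}$/gm);
console.log(titles); // [ 'Chapter 3' ]
```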

Matching word boundaries is slightly different, but it adds an important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter, because those three characters appear after a word boundary:

/\bCha/

The position of the \b operator is very important. If it is at the beginning of the string to match, it looks for a match at the beginning of the word. If it is at the end of the string, it looks for a match at the end of the word. For example, the following expression matches the string ter in the word Chapter because it appears in front of a word boundary:

/ter\b/

The following expression matches the string apt in Chapter, but not the string apt in aptitude:

/\Bapt/

The string apt appears at the non-word boundary in the word Chapter, but at the word boundary in the word aptitude. The position is not important for the non-word boundary operator, because matching does not care whether it is the beginning or end of a word.
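
The three boundary expressions above can be tried directly. A minimal JavaScript sketch; the word "terminal" is an extra test input added here to show a case where ter is not followed by a word boundary.

```javascript
// \b before the pattern: match at the beginning of a word.
const wordStart = "Chapter".match(/\bCha/)[0];
console.log(wordStart); // Cha

// \b after the pattern: match at the end of a word.
const terAtEnd = /ter\b/.test("Chapter");
const terAtStart = /ter\b/.test("terminal");
console.log(terAtEnd, terAtStart); // true false

// \B: match only where there is no word boundary.
const aptInside = /\Bapt/.test("Chapter");
const aptAtStart = /\Bapt/.test("aptitude");
console.log(aptInside, aptAtStart); // true false
```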

Alternation

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the matched text is captured (cached). Placing ?: before the first alternative eliminates this side effect.

?: is one of the non-capturing elements. Two others are ?= and ?!, and these carry additional meaning. The former is a positive lookahead: it matches the search string at any position where text matching the pattern in the parentheses begins. The latter is a negative lookahead: it matches the search string at any position where text that does not match the pattern begins.
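
The difference between a capturing group, a non-capturing group, and the two lookaheads can be sketched as follows in JavaScript. The patterns and test strings (industry, Windows2000, and so on) are illustrative assumptions, not taken from the text above.

```javascript
// A plain group captures the alternative that matched.
const capturing = "industry".match(/industr(y|ies)/);
console.log(capturing[0], capturing[1]); // industry y

// (?: ) groups the alternatives without capturing anything.
const nonCapturing = "industries".match(/industr(?:y|ies)/);
console.log(nonCapturing[0], nonCapturing.length); // industries 1

// (?= ) positive lookahead: "Windows" only if a listed version follows.
// The version itself is not part of the match.
const positive = "Windows2000".match(/Windows(?=95|98|NT|2000)/)[0];
console.log(positive); // Windows

// (?! ) negative lookahead: "Windows" only if no listed version follows.
const negativeOther = /Windows(?!95|98|NT|2000)/.test("Windows3.1");
const negativeListed = /Windows(?!95|98|NT|2000)/.test("Windows2000");
console.log(negativeOther, negativeListed); // true false
```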

Back-reference

Adding parentheses around a regular expression pattern, or part of one, causes the matching text to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears, from left to right, in the pattern. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies a particular buffer.

You can override capturing with the non-capturing meta-characters '?:', '?=', or '?!', so that the related match is not saved.

One of the simplest and most useful applications of back-references is finding two identical adjacent words in text. Take the following sentence as an example:

Is is the cost of of gasoline going up up?

The above sentence obviously contains several repeated words. It would be nice to have a way to fix the sentence without having to look for the repetition of each individual word. The following regular expression uses a single subexpression to do this:

/\b([a-z]+) \1\b/gi

The captured expression, as specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured submatch; that is, the second occurrence of the word is exactly what the parenthesized expression matched. \1 specifies the first submatch. The word-boundary meta-characters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.

The global flag (g) after the regular expression indicates that the expression is applied to the input string to find as many matches as possible. The case-insensitivity flag (i) at the end of the expression specifies that case is ignored. A multiline flag (m), where used, specifies that a potential match may occur on either side of a line break.
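
The duplicate-word expression can be applied to the example sentence both to list the repetitions and to remove them. This is a sketch in JavaScript; using replace with "$1" to keep a single copy of each word is one possible way to fix the sentence, assumed here for illustration.

```javascript
const sentence = "Is is the cost of of gasoline going up up?";
const duplicateWords = /\b([a-z]+) \1\b/gi;

// List every pair of repeated adjacent words.
const found = sentence.match(duplicateWords);
console.log(found); // [ 'Is is', 'of of', 'up up' ]

// Replace each pair with its first captured word.
const fixed = sentence.replace(duplicateWords, "$1");
console.log(fixed); // Is the cost of gasoline going up?

// Without the \b anchors, "this is" is wrongly reported as a repetition.
const withBoundaries = /\b([a-z]+) \1\b/i.test("this is");
const withoutBoundaries = /([a-z]+) \1/i.test("this is");
console.log(withBoundaries, withoutBoundaries); // false true
```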

Back-references can also break a Uniform Resource Identifier (URI) into its components. Suppose you want to break down the following URI into the protocol (ftp, http, etc.), the domain address, and the page/path:

http://www.w3cschool.cn:80/html/html-tutorial.html

The following regular expressions provide this functionality:

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/

The first parenthesized subexpression captures the protocol portion of the Web address. It matches any word that precedes the colon and the two forward slashes. The second parenthesized subexpression captures the domain address portion of the address. It matches one or more characters other than / or :. The third parenthesized subexpression captures the port number, if one is specified. It matches a colon followed by zero or more digits, and it can appear at most once. Finally, the fourth parenthesized subexpression captures the path and/or page information specified by the Web address. It matches any sequence of characters that does not include # or a space.

Apply regular expressions to the URI above, and each child match contains the following:

  • The first parenthesis subexpression contains "http"
  • The second parenthesis subexpression contains "www.w3cschool.cn"
  • The third parenthesis subexpression contains ":80"
  • The fourth parenthesis subexpression contains "/html/html-tutorial.html"
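
The submatches listed above can be read back from the match result. A minimal JavaScript sketch; the variable names protocol, host, port, and path are chosen here for illustration.

```javascript
const uri = "http://www.w3cschool.cn:80/html/html-tutorial.html";
const parts = uri.match(/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/);

// parts[0] is the whole match; parts[1] to parts[4] are the four submatches.
const [, protocol, host, port, path] = parts;
console.log(protocol); // http
console.log(host);     // www.w3cschool.cn
console.log(port);     // :80
console.log(path);     // /html/html-tutorial.html
```

If the URI has no port, the third group does not participate in the match and port is undefined.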