Regular expression extensions in ES6

1. RegExp constructor

In ES5, the RegExp constructor accepts its arguments in two ways.

In the first case, the first argument is a string, and the second argument specifies the modifier (flag) of the regular expression.

var regex = new RegExp('xyz', 'i');
// equivalent to
var regex = /xyz/i;

In the second case, the argument is a regular expression, and a copy of the original regular expression is returned.

var regex = new RegExp(/xyz/i);
// equivalent to
var regex = /xyz/i;

However, in this second case ES5 does not allow a second argument that adds modifiers; doing so throws an error.

var regex = new RegExp(/xyz/, 'i');
// Uncaught TypeError: Cannot supply flags when constructing one RegExp from another

ES6 changes this behavior. If the first argument of the RegExp constructor is a regular expression object, you can use the second argument to specify the modifiers. The returned regular expression ignores the modifiers of the original and uses only the newly specified ones.

new RegExp(/abc/ig, 'i').flags
// "i"

In the code above, the modifiers of the original regular expression object are ig, which are overridden by the second argument, i.
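As a minimal sketch of how this can be used, the helper below copies a regular expression while replacing its flags entirely. The name withFlags is hypothetical and is not part of any standard API.

```javascript
// Hypothetical helper: copy a regex, replacing (not merging) its flags.
function withFlags(regex, flags) {
  return new RegExp(regex, flags);
}

const original = /abc/ig;
const copy = withFlags(original, 'm');

console.log(copy.source);    // "abc"
console.log(copy.flags);     // "m"
console.log(original.flags); // "gi" (the original is left untouched)
```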

2. Regular expression methods of strings

String objects have four methods that can use regular expressions: match(), replace(), search(), and split().

ES6 makes all four methods call RegExp instance methods internally, so that all regex-related methods are defined on the RegExp object.

String.prototype.match calls RegExp.prototype[Symbol.match]
String.prototype.replace calls RegExp.prototype[Symbol.replace]
String.prototype.search calls RegExp.prototype[Symbol.search]
String.prototype.split calls RegExp.prototype[Symbol.split]
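One consequence of this delegation is that any object with a Symbol.match method can be passed to match(), not only a RegExp. The sketch below illustrates that; digitMatcher is a hypothetical object made up for this example.

```javascript
// Hypothetical matcher object (not a RegExp): it only needs a Symbol.match method.
const digitMatcher = {
  [Symbol.match](str) {
    // Return every decimal digit found in the string, in order.
    return Array.from(str).filter(ch => ch >= '0' && ch <= '9');
  }
};

// String.prototype.match delegates to digitMatcher[Symbol.match].
const digits = 'a1b22c'.match(digitMatcher);
console.log(digits); // [ '1', '2', '2' ]
```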

3. u modifier

ES6 adds a u modifier to regular expressions, meaning "Unicode mode", to correctly handle Unicode characters larger than \uFFFF. That is, four-byte UTF-16 encodings are handled correctly.

/^\uD83D/u.test('\uD83D\uDC2A') // false
/^\uD83D/.test('\uD83D\uDC2A') // true

In the code above, \uD83D\uDC2A is a four-byte UTF-16 encoding that represents a single character. However, ES5 does not support four-byte UTF-16 encodings and treats it as two characters, so the second line returns true. With the u modifier, ES6 recognizes it as one character, so the first line returns false.

Once the u modifier is added, the behavior of the following regular expression features changes.

(1) Dot character

In a regular expression, the dot (.) means any single character except line terminators. The dot does not recognize Unicode characters with code points greater than 0xFFFF unless the u modifier is added.

var s = '𠮷';
/^.$/.test(s) // false
/^.$/u.test(s) // true

The code above shows that without the u modifier, the regular expression treats the string as two characters and therefore fails to match.

(2) Unicode character notation

ES6 added a brace notation for Unicode characters. In a regular expression, this notation is recognized only when the u modifier is present; otherwise the braces are interpreted as a quantifier.

/\u{61}/.test('a') // false
/\u{61}/u.test('a') // true
/\u{20BB7}/u.test('𠮷') // true

The code above shows that without the u modifier, the regular expression does not recognize the \u{61} notation and instead matches 61 consecutive u characters.

(3) Quantifiers

With the u modifier, all quantifiers correctly recognize Unicode characters with code points greater than 0xFFFF.

/a{2}/.test('aa') // true
/a{2}/u.test('aa') // true
/𠮷{2}/.test('𠮷𠮷') // false
/𠮷{2}/u.test('𠮷𠮷') // true

(4) Predefined patterns

The u modifier also affects predefined patterns, determining whether they correctly recognize Unicode characters with code points greater than 0xFFFF.

/^\S$/.test('𠮷') // false
/^\S$/u.test('𠮷') // true

In the code above, \S is a predefined pattern that matches any non-whitespace character. Only with the u modifier does it correctly match a Unicode character with a code point greater than 0xFFFF.

With this, you can write a function that correctly returns the length of the string.

function codePointLength(text) {
  var result = text.match(/[\s\S]/gu);
  return result ? result.length : 0;
}
var s = '𠮷𠮷';
s.length // 4
codePointLength(s) // 2
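For comparison, here is a small sketch that counts code points in two ways: with the u-flag regular expression used above, and by spreading the string, which relies on the string iterator walking code points rather than UTF-16 code units. The function names are illustrative only.

```javascript
// Count code points with a u-flag regex (same idea as codePointLength above).
function codePointLengthRegex(text) {
  const result = text.match(/[\s\S]/gu);
  return result ? result.length : 0;
}

// Count code points with the string iterator via spread.
function codePointLengthSpread(text) {
  return [...text].length;
}

const sample = 'a\u{20BB7}b'; // U+20BB7 needs two UTF-16 code units
console.log(sample.length);                 // 4
console.log(codePointLengthRegex(sample));  // 3
console.log(codePointLengthSpread(sample)); // 3
```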

(5) i modifier

Some Unicode characters have different code points but look very similar; for example, \u004B and \u212A are both capital K.

/[a-z]/i.test('\u212A') // false
/[a-z]/iu.test('\u212A') // true

In the code above, the non-standard K character is not recognized without the u modifier.

(6) Escapes

Without the u modifier, an escape that has no defined meaning (such as escaping a comma) is simply ignored; in u mode it throws an error.

/\,/ // /\,/
/\,/u // throws an error

In the code above, without the u modifier the backslash in front of the comma has no effect, while with the u modifier it throws an error.
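Because an invalid regex literal fails when the script is parsed, a runtime check has to go through the RegExp constructor. The sketch below does that; isValidPattern is a hypothetical helper name.

```javascript
// Hypothetical helper: report whether a pattern compiles with the given flags.
function isValidPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(isValidPattern('\\,', ''));  // true  (escape ignored)
console.log(isValidPattern('\\,', 'u')); // false (undefined escape in u mode)
```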

4. RegExp.prototype.unicode property

RegExp instances have a unicode property that indicates whether the u modifier is set.

const r1 = /hello/;
const r2 = /hello/u;
r1.unicode // false
r2.unicode // true

In the code above, you can see from the unicode property whether the regular expression has a u modifier set.

5. y modifier

In addition to the u modifier, ES6 adds a y modifier to regular expressions, called the "sticky" modifier.

The y modifier works like the g modifier: it is also a global match, with each match starting at the position after the previous successful match. The difference is that the g modifier only requires a match somewhere in the remaining string, while the y modifier requires the match to start at the first remaining position. That is what "sticky" means.

var s = 'aaa_aa_a';
var r1 = /a+/g;
var r2 = /a+/y;
r1.exec(s) // ["aaa"]
r2.exec(s) // ["aaa"]
r1.exec(s) // ["aa"]
r2.exec(s) // null

The code above has two regular expressions, one with the g modifier and the other with the y modifier. Each is executed twice. The first time they behave the same, and the remaining string is _aa_a. Because the g modifier has no position requirement, the second execution returns a result, whereas the y modifier requires the match to start at the head, so it returns null.

If you change the regular expression to ensure that the head matches each time, the y modifier returns the result.

var s = 'aaa_aa_a';
var r = /a+_/y;
r.exec(s) // ["aaa_"]
r.exec(s) // ["aa_"]

Each time the above code matches, it starts at the head of the remaining string.

The y modifier is better explained with the lastIndex property.

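const REGEX = /a/g;
// start matching from position 2 (y)
REGEX.lastIndex = 2;
// the match succeeds
const match = REGEX.exec('xaya');
// the match was found at position 3
match.index // 3
// the next match starts from position 4
REGEX.lastIndex // 4
// matching from position 4 fails
REGEX.exec('xaya') // null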

In the code above, the lastIndex property specifies where each search starts; the g modifier searches forward from that position until it finds a match.

The y modifier also adheres to the lastIndex property, but requires that a match be found at the location specified by lastIndex.

const REGEX = /a/y;
// start matching from position 2
REGEX.lastIndex = 2;
// not sticky at this position, so the match fails
REGEX.exec('xaya') // null
// start matching from position 3
REGEX.lastIndex = 3;
// position 3 is sticky, so the match succeeds
const match = REGEX.exec('xaya');
match.index // 3
REGEX.lastIndex // 4
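The combination of y and lastIndex can be wrapped in a small "match exactly at this position" helper. The following is a sketch only; matchAt is a hypothetical name, and it builds a fresh sticky copy so the caller's regular expression is not mutated.

```javascript
// Hypothetical helper: try to match `pattern` starting exactly at `position`.
function matchAt(pattern, str, position) {
  // Add the y flag if the pattern does not already have it.
  const flags = pattern.flags.includes('y') ? pattern.flags : pattern.flags + 'y';
  const sticky = new RegExp(pattern.source, flags);
  sticky.lastIndex = position;
  return sticky.exec(str);
}

const miss = matchAt(/a/, 'xaya', 2); // position 2 holds "y"
const hit = matchAt(/a/, 'xaya', 3);  // position 3 holds "a"

console.log(miss);      // null
console.log(hit.index); // 3
```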

In fact, the y modifier implies the head-matching anchor ^.

/b/y.exec('aba')
// null

The code above returns null because the match cannot be guaranteed to start at the head. The y modifier was designed to make the head-matching anchor ^ effective in global matching.

The following is an example of the replace method for string objects.

const REGEX = /a/gy;
'aaxa'.replace(REGEX, '-') // '--xa'

In the code above, the last a is not replaced because it does not appear at the head of the next match.

With the match method, the y modifier on its own returns only the first match; it must be combined with the g modifier to return all matches.

'a1a2a3'.match(/a\d/y) // ["a1"]
'a1a2a3'.match(/a\d/gy) // ["a1", "a2", "a3"]

One application of the y modifier is extracting tokens from a string: the y modifier ensures that no characters are skipped between matches.

const TOKEN_Y = /\s*(\+|[0-9]+)\s*/y;
const TOKEN_G  = /\s*(\+|[0-9]+)\s*/g;
tokenize(TOKEN_Y, '3 + 4')
// [ '3', '+', '4' ]
tokenize(TOKEN_G, '3 + 4')
// [ '3', '+', '4' ]
function tokenize(TOKEN_REGEX, str) {
  let result = [];
  let match;
  while (match = TOKEN_REGEX.exec(str)) {
    result.push(match[1]);
  }
  return result;
}

In the code above, if the string contains no illegal characters, the y modifier extracts the same tokens as the g modifier. However, once an illegal character appears, the two behave differently.

tokenize(TOKEN_Y, '3x + 4')
// [ '3' ]
tokenize(TOKEN_G, '3x + 4')
// [ '3', '+', '4' ]

In the above code, the g modifier ignores illegal characters, while the y modifier does not, which makes it easy to find errors.
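Building on that idea, a tokenizer can use the y modifier to report where the illegal character is instead of stopping silently. This is a minimal sketch under the same token grammar as above; tokenizeStrict is a hypothetical name.

```javascript
// Hypothetical strict tokenizer: throws at the first position that is not a token.
function tokenizeStrict(str) {
  const TOKEN = /\s*(\+|[0-9]+)\s*/y;
  const tokens = [];
  let pos = 0;
  while (pos < str.length) {
    TOKEN.lastIndex = pos;          // the match must start exactly here
    const match = TOKEN.exec(str);
    if (match === null) {
      throw new SyntaxError(`Unexpected character at position ${pos}`);
    }
    tokens.push(match[1]);
    pos = TOKEN.lastIndex;          // continue right after the match
  }
  return tokens;
}

console.log(tokenizeStrict('3 + 4')); // [ '3', '+', '4' ]

let tokenError = null;
try {
  tokenizeStrict('3x + 4');
} catch (e) {
  tokenError = e;
}
console.log(tokenError.message); // "Unexpected character at position 1"
```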

6. RegExp.prototype.sticky property

Corresponding to the y modifier, ES6 RegExp instances have a sticky property that indicates whether the y modifier is set.

var r = /hello\d/y;
r.sticky // true

7. RegExp.prototype.flags property

ES6 adds a flags property to regular expressions, which returns the modifiers of the regular expression.

// ES5 source property
// returns the body of the regular expression
/abc/ig.source
// "abc"
// ES6 flags property
// returns the modifiers of the regular expression
/abc/ig.flags
// 'gi'

8. s modifier: dotAll mode

In regular expressions, the dot (.) is a special character that represents any single character, with two exceptions. One is four-byte UTF-16 characters, which can be handled with the u modifier; the other is line terminator characters.

A line terminator is a character that marks the end of a line. The following four characters are line terminators.

U+000A line feed (\n)
U+000D carriage return (\r)
U+2028 line separator
U+2029 paragraph separator

/foo.bar/.test('foo\nbar')
// false

In the code above, because . does not match \n, the regular expression returns false.

However, we often want to match any single character, and there is a workaround.

/foo[^]bar/.test('foo\nbar')
// true

This solution is not very intuitive, so ES2018 introduced the s modifier, which makes . match any single character.

/foo.bar/s.test('foo\nbar') // true

This is called dotAll mode, in which the dot represents all characters. Accordingly, regular expressions also gain a dotAll property, which returns a Boolean indicating whether the regular expression is in dotAll mode.

const re = /foo.bar/s;
// alternative syntax
// const re = new RegExp('foo.bar', 's');
re.test('foo\nbar') // true
re.dotAll // true
re.flags // 's'

The /s modifier and the multiline modifier /m do not conflict. When both are used, . matches all characters, while ^ and $ match the beginning and end of each line.
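A short sketch of how the two modifiers combine; the sample text and variable names are made up for illustration.

```javascript
const text = 'start\nmiddle\nend';

// m only: ^ and $ match at line boundaries.
const multilineOnly = /^middle$/m.test(text);

// Without s, the dot stops at the line feed.
const dotWithoutS = /start.*end/.test(text);

// With s, the dot also matches the line feed.
const dotWithS = /start.*end/s.test(text);

// Both together: . crosses lines, while ^ and $ still work per line.
const both = /^start.+^end$/ms.test(text);

console.log(multilineOnly, dotWithoutS, dotWithS, both); // true false true true
```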

9. Lookbehind assertions

Before ES2018, JavaScript regular expressions supported only lookahead and negative lookahead assertions, not lookbehind or negative lookbehind assertions. ES2018 introduces lookbehind assertions, which V8 version 4.9 (Chrome 62) already supports.

A "lookahead assertion" means that x matches only when it is followed by y, and is written /x(?=y)/. For example, to match only the digits before a percent sign, write /\d+(?=%)/. A "negative lookahead assertion" means that x matches only when it is not followed by y, and is written /x(?!y)/. For example, to match only digits that are not before a percent sign, write /\d+(?!%)/.

/\d+(?=%)/.exec('100% of US presidents have been male')  // ["100"]
/\d+(?!%)/.exec('that’s all 44 of them')                 // ["44"]

If you swap the regular expressions used for the two strings above, you will not get the same results. Also note that the parenthesized part of the lookahead, (?=%), is not included in the returned result.

A "lookbehind assertion" is the exact opposite of a lookahead: x matches only when it is preceded by y, and is written /(?<=y)x/. For example, to match only the digits after a dollar sign, write /(?<=\$)\d+/. A "negative lookbehind assertion" is the opposite of a negative lookahead: x matches only when it is not preceded by y, and is written /(?<!y)x/. For example, to match only digits that are not after a dollar sign, write /(?<!\$)\d+/.

/(?<=\$)\d+/.exec('Benjamin Franklin is on the $100 bill')  // ["100"]
/(?<!\$)\d+/.exec('it’s is worth about €90')                // ["90"]

In the example above, the parenthesized part of the lookbehind, (?<=\$), is also not included in the returned result.

The following example uses a lookbehind assertion for string replacement.

const RE_DOLLAR_PREFIX = /(?<=\$)foo/g;
'$foo %foo foo'.replace(RE_DOLLAR_PREFIX, 'bar');
// '$bar %foo foo'

In the code above, only foo after the dollar sign is replaced.
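One thing to watch with negative lookbehind is that the assertion is tested at every position, including the middle of a number. The sketch below shows the surprise and one possible fix, which is to also exclude a preceding digit; the sample string is made up.

```javascript
const prices = '$10 €20';

// Naive: "0" in "$10" is preceded by "1", not "$", so it still matches.
const naive = prices.match(/(?<!\$)\d+/g);

// Stricter: also refuse to start in the middle of a number.
const strict = prices.match(/(?<![$\d])\d+/g);

console.log(naive);  // [ '0', '20' ]
console.log(strict); // [ '20' ]
```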

To evaluate a lookbehind assertion, the engine first matches the x in /(?<=y)x/ and then goes back to the left to match the y part. This right-to-left order of execution is the opposite of every other regular expression operation and leads to some unexpected behavior.

First, capture groups inside a lookbehind produce different results from the normal case.

/(?<=(\d+)(\d+))$/.exec('1053') // ["", "1", "053"]
/^(\d+)(\d+)$/.exec('1053') // ["1053", "105", "3"]

In the code above, two groups need to be captured. Without the lookbehind, the first parenthesis is greedy and the second captures only one character, so the results are 105 and 3. With the lookbehind, execution runs right to left, so the second parenthesis is greedy and the first captures only one character, giving 1 and 053.

Second, a backreference inside a lookbehind, contrary to the usual order, must be placed before the corresponding parentheses.

/(?<=(o)d\1)r/.exec('hodor')  // null
/(?<=\1d(o))r/.exec('hodor')  // ["r", "o"]

In the code above, if the backreference (\1) in the lookbehind is placed after the parentheses, no match is found; it must be placed before them. This is because a lookbehind first scans from left to right and, after finding the match, goes back and completes the backreference from right to left.

10. Unicode property classes

ES2018 introduces a new class notation, \p{...} and \P{...}, that lets a regular expression match all characters having a given Unicode property.

const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test('π') // true

In the code above, \p{Script=Greek} specifies matching a Greek letter, so matching π succeeds.

The Unicode property class specifies the property name and property value.

\p{UnicodePropertyName=UnicodePropertyValue}

For some properties, you can write only the property name, or only the property value.

\p{UnicodePropertyName}
\p{UnicodePropertyValue}

\P{…} is the negated form of \p{…}; it matches characters that do not satisfy the condition.

Note that these two classes only work in Unicode mode, so be sure to add the u modifier when using them. Without the u modifier, \p and \P are not recognized as property classes (engines typically treat them as the literal letters p and P), so the pattern does not work as intended.
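As a small sketch of the two forms side by side, the example below tests \p{…} and \P{…} against the same characters and uses the negated class to strip everything that is not Greek. The variable names and sample strings are made up.

```javascript
const greek = /\p{Script=Greek}/u;
const notGreek = /\P{Script=Greek}/u;

console.log(greek.test('π'));    // true
console.log(notGreek.test('π')); // false
console.log(notGreek.test('a')); // true

// Remove every character that is not in the Greek script.
const onlyGreek = 'aπ1'.replace(/\P{Script=Greek}/gu, '');
console.log(onlyGreek); // "π"
```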

Because Unicode has so many properties, this new class is very expressive.

const regex = /^\p{Decimal_Number}+$/u;
regex.test('1234567890123456') // true

In the code above, the property class matches all decimal digits; decimal digit characters in any script or style will also match successfully.

Even the Roman numerals can be matched.

// matches all numbers
const regex = /^\p{Number}+$/u;
regex.test('²³¹¼½¾') // true
regex.test('㉛㉜㉝') // true
regex.test('ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫ') // true

Here are some other examples.

// matches all whitespace
\p{White_Space}
// matches all letters in any script, the Unicode equivalent of \w
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
// matches all non-letter characters in any script, the Unicode equivalent of \W
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
// matches Emoji
/\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu
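// matches all arrow characters (the Arrows block, U+2190 to U+21FF);
// ECMAScript does not support \p{Block=...}, so an explicit range is used
const regexArrows = /^[\u2190-\u21FF]+$/u;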
regexArrows.test('←↑→↓↔↕↖↗↘↙⇏⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙⇧⇩') // true

11. Named capture groups

Introduction

Regular expressions use parentheses for group matching.

const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;

In the code above, there are three sets of parentheses in the regular expression. Using the exec method, you can extract the three sets of matching results.

const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;
const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj[1]; // 1999
const month = matchObj[2]; // 12
const day = matchObj[3]; // 31

One problem with group matching is that the meaning of each group is not obvious, and groups can only be referenced by numeric index (such as matchObj[1]); if the order of the groups changes, the indexes in the referencing code must be changed too.

ES2018 introduces named capture groups, which let you give each group a name, making the code easier to read and the groups easier to reference.

const RE_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj.groups.year; // 1999
const month = matchObj.groups.month; // 12
const day = matchObj.groups.day; // 31

In the code above, a named group is written by adding a question mark, angle brackets, and the group name, as in (?<year>), at the head of the pattern inside the parentheses. The group name can then be referenced on the groups property of the result returned by exec. The numeric index (matchObj[1]) still works as well.

A named group effectively attaches an ID to each group, which makes it easy to describe the purpose of the match. If the order of the groups changes, the code that processes the match does not need to change.

If a named group does not match, the corresponding property of the groups object is undefined.

const RE_OPT_A = /^(?<as>a+)?$/;
const matchObj = RE_OPT_A.exec('');
matchObj.groups.as // undefined
'as' in matchObj.groups // true

In the code above, the named group as finds no match, so matchObj.groups.as is undefined, but the key as is always present in groups.
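Because an unmatched named group is present but undefined, a destructuring default can fill it in. The following is a minimal sketch; RE_VERSION and parseVersion are hypothetical names made up for this example.

```javascript
// Hypothetical pattern: a major version with an optional ".minor" part.
const RE_VERSION = /^(?<major>\d+)(?:\.(?<minor>\d+))?$/;

function parseVersion(text) {
  const match = RE_VERSION.exec(text);
  if (match === null) return null;
  // minor is undefined when the optional group did not match, so the default applies.
  const { major, minor = '0' } = match.groups;
  return { major: Number(major), minor: Number(minor) };
}

console.log(parseVersion('3'));   // { major: 3, minor: 0 }
console.log(parseVersion('3.7')); // { major: 3, minor: 7 }
console.log(parseVersion('x'));   // null
```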

Destructuring assignment and replacement

Once you have named groups, you can use destructuring assignment to assign variables directly from the match result.

let {groups: {one, two}} = /^(?<one>.*):(?<two>.*)$/u.exec('foo:bar');
one  // foo
two  // bar

When replacing a string, use $<group name> to reference a named group.

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
'2015-01-02'.replace(re, '$<day>/$<month>/$<year>')
// '02/01/2015'

In the code above, the second argument of the replace method is a string, not a regular expression.

The second argument of the replace method can also be a function, and the sequence of arguments for that function is as follows.

'2015-01-02'.replace(re, (
   matched, // the entire match: 2015-01-02
   capture1, // first group: 2015
   capture2, // second group: 01
   capture3, // third group: 02
   position, // position where the match starts: 0
   S, // the original string: 2015-01-02
   groups // an object made up of the named groups: {year, month, day}
 ) => {
 let {day, month, year} = groups;
 return `${day}/${month}/${year}`;
});

Named groups add one final function parameter on top of the existing ones: an object made up of the named groups. It can be destructured directly inside the function.
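Since the groups object is always the last argument when the pattern has named groups, a callback can pick it up with rest parameters and stay independent of the number of capture groups. This is a sketch only; reformatDates is a hypothetical name.

```javascript
// Hypothetical helper: rewrite every YYYY-MM-DD date in a text as DD/MM/YYYY.
function reformatDates(text) {
  const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/gu;
  return text.replace(re, (...args) => {
    // With named groups in the pattern, the last argument is the groups object.
    const { day, month, year } = args[args.length - 1];
    return `${day}/${month}/${year}`;
  });
}

console.log(reformatDates('from 2015-01-02 to 2016-03-04'));
// "from 02/01/2015 to 04/03/2016"
```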

Backreferences

To reference a named group from inside the regular expression itself, use the \k<group name> syntax.

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test('abc!abc') // true
RE_TWICE.test('abc!ab') // false

The numeric reference (\1) still works.

const RE_TWICE = /^(?<word>[a-z]+)!\1$/;
RE_TWICE.test('abc!abc') // true
RE_TWICE.test('abc!ab') // false

Both reference syntaxes can also be used at the same time.

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>!\1$/;
RE_TWICE.test('abc!abc!abc') // true
RE_TWICE.test('abc!abc!ab') // false
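A typical use of a named backreference is spotting an accidentally repeated word. The sketch below assumes words are runs of \w characters separated by whitespace; RE_DOUBLED and findDoubledWord are hypothetical names.

```javascript
// Hypothetical pattern: a word followed by whitespace and the same word again.
const RE_DOUBLED = /\b(?<word>\w+)\s+\k<word>\b/;

function findDoubledWord(text) {
  const match = RE_DOUBLED.exec(text);
  return match ? match.groups.word : null;
}

console.log(findDoubledWord('this is is a test')); // "is"
console.log(findDoubledWord('this is a test'));    // null
```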

12. Match indices

Getting the exact start and end positions of match results has not been very convenient. The result returned by a RegExp instance's exec() method has an index property giving the start position of the whole match, but when the pattern contains groups, the start position of each group is hard to obtain.

A proposal (stage 3 at the time of writing, since standardized in ES2022) adds an indices property to the result of exec(), from which you can get the start and end positions of the match. In the standardized version, indices is only populated when the regular expression has the d flag.

const text = 'zabbcdef';
const re = /ab/d; // the d flag enables result.indices
const result = re.exec(text);
result.index // 1
result.indices // [ [1, 3] ]

In the example above, exec() returns result, whose index property is the start position of the whole match (ab), and whose indices property is an array in which each member is an array holding the start and end positions of one match. Because the regular expression in this example has no groups, the indices array has only one member, indicating that the whole match starts at 1 and ends at 3.

Note that the start position is included in the match, but the end position is not. For example, the match ab occupies positions 1 and 2 of the original string, so the end position is 3.

If the regular expression contains a group match, the array corresponding to the indices property contains multiple members, providing the start and end positions of each group match.

const text = 'zabbcdef';
const re = /ab+(cd)/d;
const result = re.exec(text);
result.indices // [ [ 1, 6 ], [ 4, 6 ] ]

In the example above, the regular expression contains one group, so the indices array has two members: the first holds the start and end positions of the whole match (abbcd), and the second holds the start and end positions of the group (cd).

The following are examples of multiple group matches.

const text = 'zabbcdef';
const re = /ab+(cd(ef))/d;
const result = re.exec(text);
result.indices // [ [1, 8], [4, 8], [6, 8] ]

In the example above, the regular expression contains two groups, so the indices array has three members.

If the regular expression contains a named group match, the indices property array also has a groups property. The property is an object from which you can get the start and end positions of the named group match.

const text = 'zabbcdef';
const re = /ab+(?<Z>cd)/d;
const result = re.exec(text);
result.indices.groups // { Z: [ 4, 6 ] }

In the example above, the indices.groups property of the result returned by exec() is an object that provides the start and end positions of the named group Z.

If a group does not match, the corresponding member of the indices array is undefined, and so is the corresponding property of the indices.groups object.

const text = 'zabbcdef';
const re = /ab+(?<Z>ce)?/d;
const result = re.exec(text);
result.indices[1] // undefined
result.indices.groups['Z'] // undefined

In the example above, because the group did not match, the corresponding members of the indices array and the indices.groups object are both undefined.
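The sketch below wraps these lookups in a helper that returns the span of a named group, or undefined when the group did not match. It assumes an engine that implements the ES2022 d flag, which is what makes indices available; namedGroupSpan is a hypothetical name.

```javascript
// Hypothetical helper: [start, end) of a named group, or undefined.
// Assumes `re` carries the d flag (ES2022); without it, indices is not provided.
function namedGroupSpan(re, text, name) {
  const match = re.exec(text);
  if (match === null || !match.indices || !match.indices.groups) {
    return undefined;
  }
  return match.indices.groups[name];
}

const span = namedGroupSpan(/ab+(?<Z>cd)/d, 'zabbcdef', 'Z');
console.log(span);                            // [ 4, 6 ]
console.log('zabbcdef'.slice(span[0], span[1])); // "cd"

const missing = namedGroupSpan(/ab+(?<Z>ce)?/d, 'zabbcdef', 'Z');
console.log(missing);                         // undefined
```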

13. String.prototype.matchAll()

If a regular expression has multiple matches in a string, the usual approach is to use the g or y modifier and retrieve them one by one in a loop.

var regex = /t(e)(st(\d?))/g;
var string = 'test1test2test3';
var matches = [];
var match;
while (match = regex.exec(string)) {
  matches.push(match);
}
matches
// [
//   ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"],
//   ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"],
//   ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]
// ]

In the code above, the while loop retrieves one regular expression match per round, for three rounds in total.

ES2020 adds the String.prototype.matchAll() method, which retrieves all matches at once. However, it returns an iterator rather than an array.

const string = 'test1test2test3';
// the g modifier is required: matchAll() throws a TypeError without it
const regex = /t(e)(st(\d?))/g;
for (const match of string.matchAll(regex)) {
  console.log(match);
}
// ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"]
// ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"]
// ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]

In the code above, because string.matchAll(regex) returns an iterator, you can retrieve the results with a for...of loop. The advantage of returning an iterator over an array is that if the match results form a very large array, the iterator uses fewer resources.

Converting the iterator to an array is simple: either the spread operator (...) or the Array.from() method will do.

// convert to an array, method 1
[...string.matchAll(regex)]
// convert to an array, method 2
Array.from(string.matchAll(regex))
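Array.from() also accepts a mapping function, which makes it convenient to pull one capture group out of every match. The sketch below does that, and also shows that matchAll() rejects a regular expression without the g modifier; captureThirdGroups is a hypothetical name.

```javascript
// Hypothetical helper: collect the third capture group of every match.
function captureThirdGroups(text) {
  const regex = /t(e)(st(\d?))/g;
  return Array.from(text.matchAll(regex), match => match[3]);
}

const numbers = captureThirdGroups('test1test2test3');
console.log(numbers); // [ '1', '2', '3' ]

// A regex without the g modifier is rejected by matchAll().
let nonGlobalError = null;
try {
  'test1'.matchAll(/t(e)/);
} catch (e) {
  nonGlobalError = e;
}
console.log(nonGlobalError instanceof TypeError); // true
```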
