
Java regular expression


May 10, 2021 Java


Java regular expression

A regular expression defines a pattern for matching strings.

Regular expressions can be used to search, edit, or work with text.

Regular expressions are not limited to one language, but their syntax differs slightly from language to language.

Java regular expressions are most similar to those of Perl.

The java.util.regex package consists of three main classes:

  • Pattern class:

    A Pattern object is the compiled representation of a regular expression. The Pattern class has no public constructor. To create a Pattern object, call one of its public static compile methods, which return a Pattern object. These methods accept a regular expression as their first argument.

  • Matcher class:

    A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher has no public constructor. You obtain a Matcher object by calling the matcher method of a Pattern object.

  • PatternSyntaxException:

    PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.
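
As a rough illustration of how the three classes fit together, here is a minimal sketch. The class name RegexBasics and the helper containsMatch are made up for this example; only Pattern.compile, Pattern.matcher, Matcher.find, and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexBasics {
    // Hypothetical helper: true if regex is found anywhere in input,
    // false if it is not found or if the regex itself is malformed.
    static boolean containsMatch(String regex, String input) {
        try {
            Pattern p = Pattern.compile(regex); // static factory; there is no public constructor
            Matcher m = p.matcher(input);       // a Matcher is always obtained from a Pattern
            return m.find();
        } catch (PatternSyntaxException e) {
            // Unchecked exception thrown by Pattern.compile for invalid syntax
            System.out.println("Invalid pattern: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("\\d+", "order 66"));  // true
        System.out.println(containsMatch("(\\d+", "order 66")); // unclosed group, so false
    }
}
```

Catching the exception is optional because it is unchecked; it is shown here only to make the third class visible.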


Capture groups

A capture group treats multiple characters as a single unit. It is created by enclosing the characters in parentheses.

For example, a regular expression (dog) creates a single group that contains "d", "o", and "g".

Capture groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:

  • ((A)(B(C)))
  • (A)
  • (B(C))
  • (C)

You can find out how many groups an expression has by calling the groupCount method of a Matcher object. The groupCount method returns an int giving the number of capture groups in the matcher's pattern.

There is also a special group, group 0, which always represents the entire expression. This group is not included in the count returned by groupCount.
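
The numbering rule can be checked with a short sketch. The class GroupCountDemo and its groups helper are illustrative names, not part of the original article; the pattern is the ((A)(B(C))) expression discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Hypothetical helper: returns group 0 through group groupCount()
    // when the whole input matches, or an empty list otherwise.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        System.out.println("groupCount(): " + m.groupCount()); // 4, group 0 is not counted
        System.out.println(groups("((A)(B(C)))", "ABC"));      // [ABC, ABC, A, BC, C]
    }
}
```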

The following example shows how to find a run of digits in a given string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    public static void main( String args[] ){

      // Search the string for the given pattern
      String line = "This order was placed for QT3000! OK?";
      String pattern = "(.*)(\\d+)(.*)";

      // Create a Pattern object
      Pattern r = Pattern.compile(pattern);

      // Now create a Matcher object
      Matcher m = r.matcher(line);
      if (m.find( )) {
         System.out.println("Found value: " + m.group(0) );
         System.out.println("Found value: " + m.group(1) );
         System.out.println("Found value: " + m.group(2) );
      } else {
         System.out.println("NO MATCH");
      }
   }
}

Compiling and running the example above produces the following output:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0
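
In this output, group 2 holds only "0" because the first (.*) is greedy and takes as many characters as it can, leaving a single digit for \d+. One possible adjustment, sketched below with a hypothetical class NumberFinder and helper firstNumber, is to make the first group non-greedy so that group 2 captures the whole number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberFinder {
    // Hypothetical helper: returns the first run of digits in line, or null.
    // (.*?) is non-greedy, so it stops as soon as \d+ can start matching.
    static String firstNumber(String line) {
        Matcher m = Pattern.compile("(.*?)(\\d+)(.*)").matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        System.out.println("Found value: " + firstNumber(line)); // Found value: 3000
    }
}
```

If only the digits are needed, the simpler pattern \d+ used with find() gives the same result without the surrounding groups.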

Regular expression syntax

Character

Description

\

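Marks the next character as a special character, a literal, a back reference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline. The sequence "\\" matches "\", and "\(" matches "(". In a Java string literal each backslash must itself be escaped, so the regex \d is written "\\d".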

^

Matches the position at the start of the input string. If multiline mode is enabled (Pattern.MULTILINE in Java), ^ also matches the position after a line terminator such as "\n" or "\r".

$

Matches the position at the end of the input string. If multiline mode is enabled (Pattern.MULTILINE in Java), $ also matches the position before a line terminator such as "\n" or "\r".

*

Matches the preceding character or subexpression zero or more times. For example, "zo*" matches "z" and "zoo". Equivalent to {0,}.

+

Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but does not match "z". Equivalent to {1,}.

?

Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" and "does". Equivalent to {0,1}.

{n}

n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob" but does match the two "o"s in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the "o"s in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".

{n,m}

m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three "o"s in "fooooood". "o{0,1}" is equivalent to "o?". Note: you cannot insert spaces between the comma and the numbers.

?

When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching becomes "non-greedy". A non-greedy pattern matches as little of the searched string as possible, while the default "greedy" pattern matches as much as possible. For example, in the string "oooo", "o+?" matches only a single "o", while "o+" matches all the "o"s.

.

Matches any single character except line terminators such as "\n" and "\r". To match any character, including line terminators, use a pattern such as "[\s\S]" or enable Pattern.DOTALL.

(pattern)

Matches pattern and captures the matched subexpression. In Java, the captured text can be retrieved with Matcher.group(n), or referenced as $n in a replacement string. To match the literal parenthesis characters ( ), use "\(" or "\)".

(?:pattern)

Matches pattern but does not capture the matched subexpression; that is, it is a non-capturing match and the result is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.

(?=pattern)

A subexpression that performs a positive lookahead: it matches at any position where a string matching pattern begins. It is a non-capturing match, i.e. it cannot be captured for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.

(?!pattern)

A subexpression that performs a negative lookahead: it matches at any position where a string that does not match pattern begins. It is a non-capturing match, i.e. it cannot be captured for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.

x | y

Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".

[xyz]

Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".

[^xyz]

Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches the "p", "l", "i", and "n" in "plain".

[a-z]

Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" to "z".

[^a-z]

Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".

\b

Matches a word boundary, that is, the position between a word character and a non-word character such as a space. For example, "er\b" matches the "er" in "never" but does not match the "er" in "verb".

\B

Matches a non-word boundary. "er\B" matches the "er" in "verb" but does not match the "er" in "never".

\cx

Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x is normally a letter in the range A-Z.

\d

Matches a digit character. Equivalent to [0-9].

\D

Matches a non-digit character. Equivalent to [^0-9].

\f

Matches a form feed. Equivalent to \x0c and \cL.

\n

Matches a newline. Equivalent to \x0a and \cJ.

\r

Matches a carriage return. Equivalent to \x0d and \cM.

\s

Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \t\n\x0B\f\r].

\S

Matches any non-whitespace character. Equivalent to [^ \t\n\x0B\f\r].

\t

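Matches a tab. Equivalent to \x09 and \cI.

\v

Matches a vertical tab. Equivalent to \x0b and \cK. In Java 8 and later, \v matches any vertical whitespace character (including \n, \x0B, \f, and \r), not only the vertical tab.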

\w

Matches any word character, including the underscore. Equivalent to [A-Za-z0-9_].

\W

Matches any non-word character. Equivalent to [^A-Za-z0-9_].

\xn

Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer: a back reference to a captured match. For example, "(.)\1" matches two consecutive identical characters.

\n

Identifies an octal escape code or a back reference. If \n is preceded by at least n captured subexpressions, n is a back reference. Otherwise, if n is an octal digit (0-7), n is an octal escape code.

\nm

Identifies an octal escape code or a back reference. If \nm is preceded by at least nm captured subexpressions, nm is a back reference. If \nm is preceded by at least n captures, n is a back reference followed by the literal character m. If neither of these cases applies, \nm matches the octal value nm, where n and m are octal digits (0-7).

\nml

When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the octal escape code nml. Note that in Java itself, \1 through \9 are always treated as back references, and octal escapes are written with a leading zero (\0n, \0nn, or \0mnn).

\un

Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
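
A few of the constructs in the table can be tried out with a small sketch. The class SyntaxSamples and the helper firstMatch are hypothetical names introduced for this illustration; the patterns are taken from the examples in the table, and the input "balloon" is an extra sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxSamples {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy versus non-greedy quantifier
        System.out.println(firstMatch("o+", "oooo"));   // oooo
        System.out.println(firstMatch("o+?", "oooo"));  // o

        // Non-capturing group combined with alternation
        System.out.println(firstMatch("industr(?:y|ies)", "industries")); // industries

        // Positive lookahead: the version is checked but not consumed
        System.out.println("[" + firstMatch("Windows (?=95|98|NT|2000)", "Windows 2000") + "]"); // [Windows ]
        System.out.println(firstMatch("Windows (?=95|98|NT|2000)", "Windows 3.1"));              // null

        // Negative lookahead
        System.out.println("[" + firstMatch("Windows (?!95|98|NT|2000)", "Windows 3.1") + "]");  // [Windows ]

        // Back reference to group 1: two identical consecutive characters
        System.out.println(firstMatch("(.)\\1", "balloon")); // ll

        // Word boundary
        System.out.println(firstMatch("er\\b", "never")); // er
        System.out.println(firstMatch("er\\b", "verb"));  // null
    }
}
```

Every backslash in a pattern is doubled because the pattern is written as a Java string literal.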

Methods of the Matcher class

Index methods

The index methods provide index values that show precisely where a match was found in the input string:

No. Method and description
1 public int start()
Returns the start index of the previous match.
2 public int start(int group)
Returns the start index of the subsequence captured by the given group during the previous match operation.
3 public int end()
Returns the offset after the last character matched.
4 public int end(int group)
Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
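
The group-specific overloads can be illustrated with a small sketch. The class IndexMethodsDemo, the helper span, and the sample input are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexMethodsDemo {
    // Hypothetical helper: {start, end} of the given group in the first match, or null.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\w+)@(\\w+)";
        String input = "mail bob@example now";
        System.out.println(Arrays.toString(span(regex, input, 0))); // [5, 16] whole match
        System.out.println(Arrays.toString(span(regex, input, 1))); // [5, 8]  "bob"
        System.out.println(Arrays.toString(span(regex, input, 2))); // [9, 16] "example"
    }
}
```

Because end is exclusive, input.substring(start, end) returns exactly the text of the group.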

Study methods

The study methods examine the input string and return a boolean indicating whether the pattern was found:

No. Method and description
1 public boolean lookingAt()
Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
2 public boolean find()
Attempts to find the next subsequence of the input sequence that matches the pattern.
3 public boolean find(int start)
Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
4 public boolean matches()
Attempts to match the entire region against the pattern.
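
To see how find(int start) differs from plain find(), consider the following sketch. The class StudyMethodsDemo and the helper findFrom are hypothetical names, and the input "12ab34" is chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StudyMethodsDemo {
    // Hypothetical helper: text of the first match at or after index from, or null.
    static String findFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.group() : null; // find(int) resets the matcher first
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\d+").matcher("12ab34");
        // Successive find() calls walk through the input
        while (m.find()) {
            System.out.println("find(): " + m.group()); // 12, then 34
        }
        // find(int) restarts the search at the given index
        System.out.println(findFrom("\\d+", "12ab34", 0)); // 12
        System.out.println(findFrom("\\d+", "12ab34", 2)); // 34
        System.out.println(findFrom("\\d+", "12ab34", 6)); // null, nothing left to match
    }
}
```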

Replacement methods

The replacement methods replace text in the input string:

No. Method and description
1 public Matcher appendReplacement(StringBuffer sb, String replacement)
Implements a non-terminal append-and-replace step.
2 public StringBuffer appendTail(StringBuffer sb)
Implements a terminal append-and-replace step.
3 public String replaceAll(String replacement)
Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
4 public String replaceFirst(String replacement)
Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
5 public static String quoteReplacement(String s)
Returns a literal replacement string for the specified string. The returned string works as a literal replacement when passed to the appendReplacement method of the Matcher class.
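
In a replacement string, "$" followed by a number refers to a capture group and "\" escapes the next character, which is why quoteReplacement exists. The sketch below shows both behaviors; the class ReplacementDemo and its two helpers are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementDemo {
    // Hypothetical helper: replacement may contain group references such as $1.
    static String replaceWithGroups(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Hypothetical helper: literal is inserted verbatim, even if it contains $ or \.
    static String replaceLiteral(String regex, String input, String literal) {
        return Pattern.compile(regex).matcher(input)
                .replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        // $2 and $1 refer to the second and first capture groups
        System.out.println(replaceWithGroups("(\\w+) (\\w+)", "hello world", "$2 $1")); // world hello

        // Without quoteReplacement, "$5" would be read as a reference to group 5 and fail
        System.out.println(replaceLiteral("PRICE", "Cost: PRICE", "$5.00")); // Cost: $5.00
    }
}
```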

The start and end methods

Here's an example of counting the number of times the word "cat" appears in the input string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT =
                                    "cat cat cat cattie cat";

    public static void main( String args[] ){
       Pattern p = Pattern.compile(REGEX);
       Matcher m = p.matcher(INPUT); // get a matcher object
       int count = 0;

       while(m.find()) {
         count++;
         System.out.println("Match number "+count);
         System.out.println("start(): "+m.start());
         System.out.println("end(): "+m.end());
      }
   }
}

Compiling and running the example above produces the following output:

Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22

You can see that this example uses word boundaries to ensure that the letters "c", "a", "t" are not merely a substring of a longer word. It also gives some useful information about where in the input string each match occurs.

The start method returns the start index of the subsequence matched by the previous match operation, and the end method returns the index of the last matched character plus one.

The matches and lookingAt methods

Both the matches and lookingAt methods attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to match, while lookingAt does not.

Both methods always start matching at the beginning of the input string.

Let's explain this feature with the following example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;

    public static void main( String args[] ){
       pattern = Pattern.compile(REGEX);
       matcher = pattern.matcher(INPUT);

       System.out.println("Current REGEX is: "+REGEX);
       System.out.println("Current INPUT is: "+INPUT);

       System.out.println("lookingAt(): "+matcher.lookingAt());
       System.out.println("matches(): "+matcher.matches());
   }
}

Compiling and running the example above produces the following output:

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false

The replaceFirst and replaceAll methods

The replaceFirst and replaceAll methods replace text that matches a regular expression. The difference is that replaceFirst replaces only the first match, while replaceAll replaces all matches.

Here's an example to explain this feature:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static String REGEX = "dog";
    private static String INPUT = "The dog says meow. " +
                                    "All dogs say meow.";
    private static String REPLACE = "cat";

    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       // get a matcher object
       Matcher m = p.matcher(INPUT); 
       INPUT = m.replaceAll(REPLACE);
       System.out.println(INPUT);
   }
}

Compiling and running the example above produces the following output:

The cat says meow. All cats say meow.
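
For comparison, the same input can be passed through replaceFirst. This is a minimal sketch; the class ReplaceFirstDemo and the helper replaceFirstMatch are names invented for the illustration.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Hypothetical helper: replaces only the first match of regex in input.
    static String replaceFirstMatch(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        // Only the first "dog" is replaced; "dogs" is left untouched
        System.out.println(replaceFirstMatch("dog", input, "cat"));
        // The cat says meow. All dogs say meow.
    }
}
```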

The appendReplacement and appendTail methods

The Matcher class also provides the appendReplacement and appendTail methods for text replacement. The following example illustrates them:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
   private static String REGEX = "a*b";
   private static String INPUT = "aabfooaabfooabfoob";
   private static String REPLACE = "-";
   public static void main(String[] args) {
      Pattern p = Pattern.compile(REGEX);
      // get a matcher object
      Matcher m = p.matcher(INPUT);
      StringBuffer sb = new StringBuffer();
      while(m.find()){
         m.appendReplacement(sb,REPLACE);
      }
      m.appendTail(sb);
      System.out.println(sb.toString());
   }
}

Compiling and running the example above produces the following output:

-foo-foo-foo-
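
The same loop is useful when each replacement has to be computed from the matched text, which replaceAll with a fixed string cannot do. The sketch below doubles every number in a string; the class AppendReplacementDemo and the helper doubleNumbers are hypothetical, and the numbers are assumed to be small enough for Integer.parseInt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Hypothetical helper: replaces every run of digits with twice its value.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Appends the text before the match, then the computed replacement
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb); // appends whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("a1b22c")); // a2b44c
    }
}
```

If the computed replacement could contain "$" or "\", wrapping it in Matcher.quoteReplacement keeps it literal.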

Methods of the PatternSyntaxException class

PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.

The PatternSyntaxException class provides the following methods to help us see what went wrong.

No. Method and description
1 public String getDescription()
Retrieves the description of the error.
2 public int getIndex()
Retrieves the index of the error.
3 public String getPattern()
Retrieves the erroneous regular expression pattern.
4 public String getMessage()
Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern.
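
These accessors can be exercised by deliberately compiling a malformed pattern. In this sketch the class SyntaxErrorDemo, the helper compileError, and the sample pattern "a[b" (an unclosed character class) are assumptions made for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Hypothetical helper: returns the exception thrown while compiling regex,
    // or null if the regex compiles without error.
    static PatternSyntaxException compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException e = compileError("a[b"); // the character class is never closed
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index: " + e.getIndex());
            System.out.println("Pattern: " + e.getPattern());
            System.out.println("Message:\n" + e.getMessage());
        }
    }
}
```

The exact wording of the description and message depends on the JDK version, so the sketch prints them rather than assuming a particular text.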