Julia string

May 14, 2021 Julia

1. String

2. Character

3. String basis

4. Unicode and UTF-8

5. Interpolation

6. General operation

7. Non-standard string text

8. Byte array text

9. Version number text

String

Handling plain ASCII text in Julia is simple and efficient, and you can also work with Unicode. C-style string code that processes ASCII strings works as expected, in both performance and semantics. If such code encounters non-ASCII text, it fails with a clear error instead of silently producing garbled results. At that point, it is easy to modify the code to handle non-ASCII data.

There are some notable advanced features about Julia strings:

String is an abstraction, not a concrete type
Char represents a single character: it is a 32-bit integer whose value is interpreted as a Unicode code point
As in Java, strings are immutable: the value of a String object cannot be changed. To get a different string value, you construct a new string
Conceptually, a string is a partial function from index values to characters: for some index values no character is returned, and an exception is thrown instead
Julia supports the full range of Unicode characters: string literals are always ASCII or UTF-8, but other encodings are also supported

Character

A Char value represents a single character: it is a 32-bit integer whose numeric value is interpreted as a Unicode code point. Char literals must be written with single quotes:

julia> 'x'
'x'

julia> typeof(ans)
Char

You can convert a Char to the corresponding integer value:

julia> int('x')
120

julia> typeof(ans)
Int64

On a 32-bit architecture, typeof(ans) is Int32. You can also convert an integer back to a Char:

julia> char(120)
'x'

Not all integer values are valid Unicode code points, but for performance, char generally does not check whether its argument is valid. To check that a value is a valid code point, use the is_valid_char function:

julia> char(0x110000)
'\U110000'

julia> is_valid_char(0x110000)
false

Currently, the valid Unicode code points are U+00 through U+d7ff and U+e000 through U+10ffff.
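The conversions above use function names from an early Julia release. As a minimal sketch, assuming a Julia 1.x installation (where these operations are spelled Int, Char and isvalid rather than int, char and is_valid_char), the same round trip and validity checks could look like this:

```julia
# Sketch assuming Julia 1.x; these are the 1.x spellings, not the
# int/char/is_valid_char functions used in the transcripts above.
code = Int('x')            # Char -> integer code point
ch = Char(120)             # integer -> Char

# isvalid(Char, n) reports whether n is a valid Unicode code point.
in_range = isvalid(Char, 0x2200)       # inside U+0000..U+D7FF
surrogate = isvalid(Char, 0xd800)      # surrogate range, not valid
too_large = isvalid(Char, 0x110000)    # beyond U+10FFFF, not valid

println((code, ch, in_range, surrogate, too_large))
```

The three checks cover one value from each side of the ranges listed above.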

You can enter any Unicode character in single quotes using \u followed by up to four hexadecimal digits, or \U followed by up to eight hexadecimal digits (the largest valid value needs only six):

julia> '\u0'
'\0'

julia> '\u78'
'x'

julia> '\u2200'
'∀'

julia> '\U10ffff'
'\U10ffff'

Julia uses the system's locale and language settings to determine which characters can be displayed as-is and which must be shown with the \u or \U escape forms. In addition to the Unicode escape forms, all of the C language escape input forms can be used:

julia> int('\0')
0

julia> int('\t')
9

julia> int('\n')
10

julia> int('\e')
27

julia> int('\x7f')
127

julia> int('\177')
127

julia> int('\xff')
255

You can compare Char values, or you can do a little arithmetic:

julia> 'A' < 'a'
true

julia> 'A' <= 'a' <= 'Z'
false

julia> 'A' <= 'X' <= 'Z'
true

julia> 'x' - 'a'
23

julia> 'A' + 1
'B'
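Comparison and arithmetic on Char values are enough to build simple character transformations. The following is a small sketch, assuming Julia 1.x, of a ROT13 letter rotation; the names rot13 and rot13_string are hypothetical helpers introduced only for this illustration:

```julia
# Hypothetical helper: rotate an ASCII letter by 13 places using only
# Char comparison, Char - Char (an integer) and Char + integer (a Char).
function rot13(c::Char)
    if 'a' <= c <= 'z'
        return 'a' + (c - 'a' + 13) % 26
    elseif 'A' <= c <= 'Z'
        return 'A' + (c - 'A' + 13) % 26
    else
        return c    # leave every other character unchanged
    end
end

# map over a string applies the function to each character and
# builds a new string from the results.
rot13_string(s::AbstractString) = map(rot13, s)

println(rot13_string("Hello, world."))
```

Applying rot13_string twice returns the original text, which makes the helper easy to check.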

String basis

String literals are delimited by double quotes "..." or by triple double quotes """...""":

julia> str = "Hello, world.\n"
"Hello, world.\n"

julia> """Contains "quote" characters"""
"Contains \"quote\" characters"

Use an index to extract characters from a string:

julia> str[1]
'H'

julia> str[6]
','

julia> str[end]
'\n'

All indexing in Julia is 1-based: the first element is at index 1, and for a string of length n such as this one, the last element is at index n.

In any index expression, the keyword end is shorthand for the last index value, as computed by endof(str). You can perform arithmetic and other operations with end, just like a normal value:

julia> str[end-1]
'.'

julia> str[end/2]
' '

julia> str[end/3]
ERROR: InexactError()
 in getindex at string.jl:59

julia> str[end/4]
ERROR: InexactError()
 in getindex at string.jl:59

Using an index less than 1 or greater than end raises an error:

julia> str[0]
ERROR: BoundsError()

julia> str[end+1]
ERROR: BoundsError()

Use a range index to extract substrings:

julia> str[4:9]
"lo, wo"

Note that str[k] and str[k:k] do not give the same result:

julia> str[6]
','

julia> str[6:6]
","

The former is a single Char and the latter is a string with only one character. The two are completely different in Julia.
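The indexing rules can be collected into one short sketch. It assumes Julia 1.x, where an index must be an integer, so div(end, 2) is used in place of the end/2 form shown in the older transcript above:

```julia
str = "Hello, world.\n"

middle = str[div(end, 2)]   # integer division keeps the index an integer
sub = str[4:9]              # range index -> String
single_char = str[6]        # integer index -> Char
single_str = str[6:6]       # one-element range -> String of one character

println(typeof(single_char), " vs ", typeof(single_str))
```

The last line prints the two types, which shows that the single-index and range forms return different kinds of value.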

Unicode and UTF-8

Julia fully supports Unicode characters and strings. As discussed above, in character literals a Unicode code point can be written with the \u and \U escape sequences, as well as with all the standard C escape sequences. These can also be used to write string literals:

julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"

Non-ASCII string literals are encoded with UTF-8. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes. In UTF-8, ASCII characters (code points below 0x80, i.e. 128) are encoded using a single byte, as in ASCII, while the remaining code points use multiple bytes, up to a maximum of four bytes per character. This means that not every byte index into a UTF-8 string is a valid character index. If you index into a string at an invalid byte index, an error is thrown:

julia> s[1]
'∀'

julia> s[2]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

julia> s[3]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

julia> s[4]
' '

In the example above, ∀ is a three-byte character, so the index values 2 and 3 are invalid, while the index value of the next character is 4.

Because of the variable-width encoding, the number of characters in a string, given by length(s), is not necessarily equal to the last index value. If you iterate over the indices 1 through endof(s) and index into s, the characters returned whenever no exception is thrown are exactly the characters that make up s. Thus length(s) <= endof(s), since each character must have its own index. Here is an inefficient and verbose way to iterate through the characters of s:

julia> for i = 1:endof(s)
         try
           println(s[i])
         catch
           # ignore the index error
         end
       end
∀

x

∃

y

Fortunately, this awkward idiom is unnecessary: a string can be used directly as an iterable object, with no exception handling required:

julia> for c in s
         println(c)
       end
∀

x

∃

y
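The relationship between character count, byte count and valid indices can be checked directly. This is a minimal sketch assuming Julia 1.x, where lastindex plays the role of endof and eachindex yields only the valid character indices:

```julia
s = "\u2200 x \u2203 y"

nchars = length(s)                  # number of characters
nbytes = ncodeunits(s)              # number of bytes (UTF-8 code units)
last_idx = lastindex(s)             # byte index of the last character
valid = collect(eachindex(s))       # the valid character indices only

# isvalid(s, i) tests an index without throwing an exception.
middle_of_char = isvalid(s, 2)

for i in eachindex(s)
    println(i, " => ", repr(s[i]))
end
```

The gaps in the list of valid indices fall after the two three-byte characters, which is why indexing at 2 or 3 fails.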

UTF-8 is not the only encoding that Julia supports, and it is easy to add support for other encodings. In particular, Julia also provides the UTF16String and UTF32String types, constructed by the utf16(s) and utf32(s) functions respectively, for the UTF-16 and UTF-32 encodings. It also provides the aliases WString and wstring(s) for UTF-16 or UTF-32 strings, depending on the size of Cwchar_t. For further discussion of UTF-8 encoding issues, see the byte array text section below.

Interpolation

Concatenation is one of the most common string operations:

julia> greet = "Hello"
"Hello"

julia> whom = "world"
"world"

julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"

Like Perl, Julia allows interpolation into string literals using $:

julia> "$greet, $whom.\n"
"Hello, world.\n"

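The system rewrites this form into a string concatenation like the one above.

The shortest complete expression after the $ is taken as the expression whose value is interpolated into the string. You can interpolate any expression by wrapping it in parentheses: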

julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"

Both concatenation and interpolation call the string function to convert objects to String form. Most non-String objects are converted to strings the same way they are shown in an interactive session:

julia> v = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> "v: $v"
"v: [1,2,3]"

Char values can also be interpolated into strings:

julia> c = 'x'
'x'

julia> "hi, $c"
"hi, x"

To include a literal $ in a string literal, escape it with a backslash:

julia> print("I have \$100 in my account.\n")
I have $100 in my account.
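The forms of concatenation and interpolation described above can be compared side by side. This sketch assumes Julia 1.x; the * operator for joining strings is not mentioned in the text and is included here only as a further equivalent form:

```julia
greet = "Hello"
whom = "world"

by_function = string(greet, ", ", whom, ".\n")
by_operator = greet * ", " * whom * ".\n"     # * concatenates strings
by_interp = "$greet, $whom.\n"

total = "1 + 2 = $(1 + 2)"        # parentheses delimit the expression
price = "I have \$100"            # an escaped dollar sign stays literal

print(by_interp)
println(total)
println(price)
```

All three ways of building the greeting produce the same string.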

General operation

You can compare strings lexicographically using the standard comparison operators:

julia> "abracadabra" < "xylophone"
true

julia> "abracadabra" == "xylophone"
false

julia> "Hello, world." != "Goodbye, world."
true

julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true

Use the search function to find the index of a particular character:

julia> search("xylophone", 'x')
1

julia> search("xylophone", 'p')
5

julia> search("xylophone", 'z')
0

You can start the search at a given offset by providing a third argument:

julia> search("xylophone", 'o')
4

julia> search("xylophone", 'o', 5)
7

julia> search("xylophone", 'o', 8)
0
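In Julia 1.x the search function shown above is no longer available. As a sketch under that assumption, the same lookups can be written with findfirst and findnext, which return nothing rather than 0 when there is no match:

```julia
word = "xylophone"

first_p = findfirst(isequal('p'), word)      # index of the first 'p'
no_z = findfirst(isequal('z'), word)         # nothing: 'z' does not occur
second_o = findnext(isequal('o'), word, 5)   # search from index 5 onward
no_more = findnext(isequal('o'), word, 8)    # nothing past the last 'o'

println((first_p, no_z, second_o, no_more))
```

Because a failed search returns nothing, callers test the result with === nothing instead of comparing against 0.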

Another useful string function is repeat:

julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."

Some other useful functions:

endof(str) gives the maximal (byte) index that can be used to index into str
length(str) gives the number of characters in str
i = start(str) gives the first valid index at which a character can be found in str (usually 1)
c, j = next(str,i) returns the next character at or after the index i and the next valid character index after that. Together with start and endof, it can be used to iterate through the characters in str
ind2chr(str,i) gives the number of characters in str up to and including the one at index i
chr2ind(str,j) gives the (first byte) index at which the jth character in str occurs
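The functions in this list belong to an early Julia release. As a rough sketch, assuming Julia 1.x, index-based traversal can be written with firstindex, lastindex and nextind, and the roles of ind2chr and chr2ind can be played by three-argument length and nextind. The helper name chars_with_indices is hypothetical:

```julia
# Hypothetical helper: collect (byte index, character) pairs by
# stepping from one valid index to the next with nextind.
function chars_with_indices(s::AbstractString)
    out = Tuple{Int,Char}[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(out, (i, s[i]))
        i = nextind(s, i)
    end
    return out
end

s = "\u2200 x \u2203 y"
found = chars_with_indices(s)
char_number = length(s, 1, 4)   # characters from index 1 through 4 (ind2chr role)
byte_index = nextind(s, 0, 5)   # index of the 5th character (chr2ind role)

println(found)
```

In everyday code, iterating with for c in s is simpler; the explicit loop is only useful when the byte indices themselves are needed.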

Non-standard string text

Julia provides non-standard string literals. A non-standard string literal looks like a normal double-quoted string literal but is immediately prefixed by an identifier. The regular expressions, byte array literals, and version number literals described below are examples of non-standard string literals. There are other examples in the Metaprogramming section.

Regular expression

Julia's regular expressions (regexes) are Perl-compatible and are provided by the PCRE library. A regular expression is a non-standard string literal prefixed with r, optionally followed by flags after the closing quote. The most basic regular expression literal is just r"...":

julia> r"^\s*(?:#|$)"
r"^\s*(?:#|$)"

julia> typeof(ans)
Regex (constructor with 3 methods)

To check whether a regular expression matches a string, use the ismatch function:

julia> ismatch(r"^\s*(?:#|$)", "not a comment")
false

julia> ismatch(r"^\s*(?:#|$)", "# a comment")
true

ismatch returns true or false depending on whether the regular expression matches the string. To find out how it matched, use the match function instead:

julia> match(r"^\s*(?:#|$)", "not a comment")

julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")

If there is no match, match returns nothing, a value that is not printed in an interactive session. Apart from not being printed, it is a completely normal value that you can test for in a program:

m = match(r"^\s*(?:#|$)", line)
if m == nothing
  println("not a comment")
else
  println("blank or comment")
end

If the regular expression does match, the value returned by match is a RegexMatch object. This object records how the expression matched, including the substring that the pattern matched and any captured substrings. The example above captures only the matching portion of the string, but we may also want the non-blank text that follows the comment character. That can be written as:

julia> m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")

When calling match, you can optionally specify an index at which to start the search. For example:

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")
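The pieces above (testing for a match, handling nothing, reading a capture, and starting at an index) fit together as follows. This is a sketch assuming Julia 1.x, where occursin takes the place of ismatch; the constant COMMENT and the helper classify are hypothetical names chosen for the example:

```julia
# Pattern from the text: a blank line, or a comment whose text is captured.
const COMMENT = r"^\s*(?:#\s*(.*?)\s*$|$)"

# Hypothetical helper that reports what kind of line it was given.
function classify(line::AbstractString)
    m = match(COMMENT, line)
    if m === nothing
        return "code"                          # no match at all
    elseif m.captures[1] === nothing
        return "blank"                         # matched, but the group did not
    else
        return "comment: " * m.captures[1]     # captured comment text
    end
end

is_comment = occursin(r"^\s*(?:#|$)", "# a comment")   # plays the role of ismatch
third_digit = match(r"[0-9]", "aaaa1aaaa2aaaa3", 11)   # start searching at index 11

println(classify("# a comment "))
```

Comparing against nothing with === is the usual idiom in Julia 1.x.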

The following information can be extracted from the RegexMatch object:

Entire matched substring: m.match
Captured substrings, as an array of strings: m.captures
Offset at which the whole match begins: m.offset
Offsets of the captured substrings, as a vector: m.offsets

When a capture group does not match, m.captures contains nothing in that position instead of a substring, and m.offsets has a zero offset (index values in Julia start at 1, so a zero offset is invalid):

julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")

julia> m.match
"acd"

julia> m.captures
3-element Array{Union(SubString{UTF8String},Nothing),1}:
 "a"
 "c"
 "d"

julia> m.offset
1

julia> m.offsets
3-element Array{Int64,1}:
 1
 2
 3

julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")

julia> m.match
"ad"

julia> m.captures
3-element Array{Union(SubString{UTF8String},Nothing),1}:
 "a"
 nothing
 "d"

julia> m.offset
1

julia> m.offsets
3-element Array{Int64,1}:
 1
 0
 2
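Because an unmatched group shows up as nothing, code that consumes m.captures usually has to deal with that case explicitly. The following sketch, assuming Julia 1.x, shows two possible treatments; using an empty string as a placeholder is a choice made for this illustration only:

```julia
m = match(r"(a|b)(c)?(d)", "ad")

# Replace unmatched groups with an empty-string placeholder so the
# result is a plain vector of strings.
filled = [c === nothing ? "" : String(c) for c in m.captures]

# Keep only the groups that took part in the match, with their offsets.
matched = [(String(c), off) for (c, off) in zip(m.captures, m.offsets) if c !== nothing]

println(filled)
println(matched)
```

The second form drops the optional group entirely, so its length varies with the input.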

Because the captures are returned as a collection, you can use destructuring syntax to bind them to local variables:

julia> first, second, third = m.captures; first
"a"

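You can modify the behavior of a regular expression by adding some combination of the flags i, m, s, and x after the closing quotation mark. These flags have the same meaning as in Perl; see the perlre manpage for details:

i   Case-insensitive matching

m   Multi-line matching. "^" and "$" match the start and end of any line

s   Single-line matching. "." matches any character, including a newline

    Used together, as in r""ms, "." matches any character, while "^" and "$" match just after and just before newlines within the string

x   Ignore most whitespace that is not backslashed. You can use this flag to break a regular expression into more readable parts. The '#' character is treated as a metacharacter that introduces a comment

For example, the following regular expression has three of these flags turned on: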

julia> r"a+.*b+.*?d$"ism
r"a+.*b+.*?d$"ims

julia> match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
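The effect of each flag is easier to see in isolation. The sketch below assumes Julia 1.x (occursin in place of ismatch); the sample patterns and inputs are invented for illustration:

```julia
# i: letter case is ignored.
case_insensitive = occursin(r"hello"i, "HELLO")

# s: without it "." stops at a newline; with it "." can cross lines.
dot_plain = occursin(r"a.b", "a\nb")
dot_single = occursin(r"a.b"s, "a\nb")

# m: "^" matches at the start of any line, not only the whole string.
line_start = match(r"^second"m, "first\nsecond")

# x: whitespace and # comments inside the pattern are ignored.
date = match(r"
    (\d{4})   # year
    -
    (\d{2})   # month
"x, "2021-05")

println((case_insensitive, dot_plain, dot_single))
```

The x flag is mainly useful for long patterns, where the comments document each part.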

Julia supports triple-quoted regular expression literals of the form r"""...""". This form is convenient when the regular expression contains quotation marks or newlines.

Byte array text

Another type of non-standard string literal is the byte array literal, b"...", which lets you use string notation to write an array of bytes, that is, a Uint8 array. By convention, non-standard literals with uppercase prefixes produce actual string objects, while those with lowercase prefixes produce non-string objects such as byte arrays or compiled regular expressions. The rules for byte array literals are as follows:

ASCII characters and ASCII escapes each produce a single byte
\x and octal escape sequences produce the byte corresponding to the escape value
Unicode escape sequences produce the sequence of bytes that encodes that code point in UTF-8

Here is an example that uses all three:

julia> b"DATA\xff\u2200"
8-element Array{Uint8,1}:
 0x44
 0x41
 0x54
 0x41
 0xff
 0xe2
 0x88
 0x80

The ASCII string "DATA" corresponds to the bytes 68, 65, 84, 65. \xff produces the single byte 255. The Unicode escape \u2200 is encoded in UTF-8 as the three bytes 226, 136, 128. Note that the resulting byte array does not correspond to a valid UTF-8 string, and if you write it as a normal string literal, you get a syntax error:

julia> "DATA\xff\u2200"
ERROR: syntax: invalid UTF-8 sequence

Note the important difference between \xff and \uff: the former escape sequence encodes the byte 255, whereas the latter represents the code point 255, which UTF-8 encodes as two bytes:

julia> b"\xff"
1-element Array{Uint8,1}:
 0xff

julia> b"\uff"
2-element Array{Uint8,1}:
 0xc3
 0xbf
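The byte-level rules can be checked against the UTF-8 encoding of ordinary strings. This sketch assumes Julia 1.x, where the element type is spelled UInt8 and the b"..." form yields a read-only view of bytes, so collect is used to obtain a plain vector. The last line also relies on an assumption about Julia 1.x behavior that differs from the transcript above: the string literal is accepted there, and isvalid reports that it is not well-formed UTF-8.

```julia
bytes = b"DATA\xff\u2200"
as_vector = collect(bytes)            # plain vector of bytes

one_byte = length(b"\xff")            # \xff is a single byte
two_bytes = collect(b"\uff")          # code point 255 encoded as UTF-8

# codeunits exposes the UTF-8 bytes of an ordinary string.
forall_bytes = collect(codeunits("\u2200"))

# Assumed Julia 1.x behavior: the literal parses, but is not valid UTF-8.
well_formed = isvalid("DATA\xff\u2200")

println(as_vector)
```

The last three bytes of as_vector are the same three bytes that codeunits reports for the ∀ character on its own.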

In character literals, this distinction disappears: \xff is allowed to represent the code point 255, because a character always represents a code point. In strings, however, \x escapes always represent bytes rather than code points, while \u and \U escapes always represent code points, which are encoded as one or more bytes.

Version number text

Version numbers can easily be expressed with non-standard string literals of the form v"...". A version number literal creates a VersionNumber object that follows the semantic versioning specification, so it is composed of major, minor, and patch numeric values, followed by pre-release and build alphanumeric annotations. For example, v"0.2.1-rc1+win64" is broken into major version 0, minor version 2, patch version 1, pre-release rc1, and build win64. When you enter a version literal, everything except the major version number is optional, so v"0.2" is equivalent to v"0.2.0", v"2" is equivalent to v"2.0.0", and so on.

VersionNumber objects are mostly useful for comparing two (or more) versions easily and correctly. For example, the constant VERSION holds the Julia version number as a VersionNumber object, so you can define version-specific behavior with simple statements such as:

if v"0.2" <= VERSION < v"0.3-"
    # do something specific to 0.2 release series
end

Note that the example above uses the non-standard version number v"0.3-", with a trailing -. This notation is a Julia extension of the standard, used to indicate a version lower than any 0.3 release, including all of its pre-releases. So the code in the example above runs only with stable 0.2 versions and excludes versions such as v"0.3.0-rc1". To also allow unstable (i.e. pre-release) 0.2 versions, the lower bound check should be changed to v"0.2-" <= VERSION.
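The effect of the trailing - on the bounds can be checked with a few sample versions. This is a minimal sketch assuming Julia 1.x; the helper names in_stable_02 and in_any_02 are hypothetical and stand in for the VERSION checks described above:

```julia
v = v"0.2.1-rc1+win64"
parts = (v.major, v.minor, v.patch, v.prerelease, v.build)

# Hypothetical helper: true only for stable releases in the 0.2 series.
in_stable_02(ver::VersionNumber) = v"0.2" <= ver < v"0.3-"

# Variant that also admits 0.2 pre-releases, using the lower bound v"0.2-".
in_any_02(ver::VersionNumber) = v"0.2-" <= ver < v"0.3-"

println(parts)
println(in_stable_02(v"0.2.5"), " ", in_stable_02(v"0.3.0-rc1"))
```

Passing VERSION to either helper reproduces the kind of check shown in the if statement above.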

Another non-standard version specification extension allows a trailing + to express an upper limit on build versions. For example, VERSION > v"0.2-rc1+" can be used to mean any version above 0.2-rc1 and any of its builds: it returns false for v"0.2-rc1+win64" and true for v"0.2-rc2".

It is good practice to use these special versions in comparisons (in particular, the trailing - should always be used on upper bounds unless there is a good reason not to), but they must not be used as the actual version number of anything, because they are invalid in the semantic versioning scheme.

Besides being used for the VERSION constant, VersionNumber objects are widely used in the Pkg module to specify package versions and their dependencies.