May 14, 2021 Julia
Handling plain ASCII text in Julia is simple and efficient, and Unicode is supported as well. C-style string code works on ASCII strings with the expected performance and semantics. If such code encounters non-ASCII text, it fails with a clear error instead of silently producing garbled output. At that point, it is easy to modify the code to handle non-ASCII data.
Julia strings have a few notable high-level features:
String is an abstract type, not a concrete type.
Char represents a single character: a Unicode code point stored as a 32-bit integer.
String objects are immutable. To get a different string, you construct a new one.
A Char value represents a single character: it is a 32-bit integer whose value is a Unicode code point.
Char literals must be written with single quotes:
julia> 'x'
'x'
julia> typeof(ans)
Char
You can convert a Char to the corresponding integer value:
julia> int('x')
120
julia> typeof(ans)
Int64
On a 32-bit architecture, typeof(ans) is Int32.
You can also convert an integer back to a Char:
julia> char(120)
'x'
Not all integer values are valid Unicode code points, but for performance, char generally does not check whether its argument is valid.
To make sure a value is a valid code point, use the is_valid_char function:
julia> char(0x110000)
'\U110000'
julia> is_valid_char(0x110000)
false
Currently, the valid Unicode code points are U+00 through U+d7ff and U+e000 through U+10ffff.
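The conversions and the validity check can be sketched as follows. This is a minimal sketch that assumes a current Julia 1.x release rather than the older release described in this article: there the conversions are spelled Int(c) and Char(n), and the check is isvalid(Char, n). Those names are assumptions relative to the text.

```julia
# Assumes Julia 1.x names: Int(c), Char(n), isvalid(Char, n).
code = Int('x')                     # code point of 'x' as an integer
c = Char(120)                       # integer back to a Char
ok = isvalid(Char, 0x78)            # an ordinary, valid code point
too_high = isvalid(Char, 0x110000)  # above U+10ffff
in_gap = isvalid(Char, 0xd800)      # inside the gap between U+d7ff and U+e000

println((code, c, ok, too_high, in_gap))
```

Checking a value before converting it is the safer order when the integer comes from untrusted input.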
You can enter any Unicode character in single quotes using \u followed by up to four hexadecimal digits, or \U followed by up to eight hexadecimal digits (the longest valid value needs only six):
julia> '\u0'
'\0'
julia> '\u78'
'x'
julia> '\u2200'
'∀'
julia> '\U10ffff'
'\U10ffff'
Julia uses the system's locale and language settings to determine which characters can be printed as-is and which must be displayed using the \u or \U escape forms.
In addition to these Unicode escape forms, all of C's traditional escape input forms can also be used:
julia> int('\0')
0
julia> int('\t')
9
julia> int('\n')
10
julia> int('\e')
27
julia> int('\x7f')
127
julia> int('\177')
127
julia> int('\xff')
255
You can compare Char values, and you can also do a limited amount of arithmetic on them:
julia> 'A' < 'a'
true
julia> 'A' <= 'a' <= 'Z'
false
julia> 'A' <= 'X' <= 'Z'
true
julia> 'x' - 'a'
23
julia> 'A' + 1
'B'
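Comparison and arithmetic on Char values combine naturally into small text utilities. The sketch below uses a hypothetical helper, rotate_letter, which is not part of Julia or of this article, and assumes a current Julia 1.x release.

```julia
# Hypothetical helper: rotate a lowercase ASCII letter by k positions,
# leaving every other character unchanged.
function rotate_letter(c::Char, k::Integer)
    'a' <= c <= 'z' || return c        # comparison on Char values
    offset = c - 'a'                   # Char - Char gives an integer distance
    return 'a' + mod(offset + k, 26)   # Char + integer gives a Char
end

println(rotate_letter('x', 3))   # wraps around past 'z'
println(rotate_letter('A', 3))   # not lowercase, returned as-is
```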
String literals are delimited by double quotes "..." or by triple double quotes """...""":
julia> str = "Hello, world.\n"
"Hello, world.\n"
julia> """Contains "quote" characters"""
"Contains \"quote\" characters"
Use an index to extract characters from a string:
julia> str[1]
'H'
julia> str[6]
','
julia> str[end]
'\n'
All indexing in Julia is 1-based: the first element is at index 1, and the last element is at index n, where n is the length of the string.
In any indexing expression, end is shorthand for the last index, which is computed by endof(str).
You can do arithmetic and other operations with end, just as with a normal value:
julia> str[end-1]
'.'
julia> str[end/2]
' '
julia> str[end/3]
ERROR: InexactError()
in getindex at string.jl:59
julia> str[end/4]
ERROR: InexactError()
in getindex at string.jl:59
Using an index less than 1 or greater than end raises an error:
julia> str[0]
ERROR: BoundsError()
julia> str[end+1]
ERROR: BoundsError()
Use a range index to extract substrings:
julia> str[4:9]
"lo, wo"
Note that str[k] and str[k:k] do not give the same result:
julia> str[6]
','
julia> str[6:6]
","
The former is a single Char value, while the latter is a string that happens to contain only one character.
In Julia these are very different things.
Julia fully supports Unicode characters and strings. As discussed above, in character literals, Unicode code points can be written with the \u and \U escape sequences, as well as with all the standard C escape sequences.
These can all be used to write string literals as well:
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Non-ASCII string literals are encoded in UTF-8. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes. In UTF-8, code points below 0x80 (128), i.e. the ASCII characters, are encoded in a single byte, as in ASCII, while all other code points use multiple bytes, up to four per character. This means that not every byte index into a UTF-8 string is a valid character index.
Indexing into a string at an invalid byte index throws an error:
julia> s[1]
'∀'
julia> s[2]
ERROR: invalid UTF-8 character index
in next at ./utf8.jl:68
in getindex at string.jl:57
julia> s[3]
ERROR: invalid UTF-8 character index
in next at ./utf8.jl:68
in getindex at string.jl:57
julia> s[4]
' '
In the example above, ∀ is a three-byte character, so the indices 2 and 3 are invalid, and the index of the next character is 4.
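Valid character indices can be found without triggering errors. The sketch below assumes a current Julia 1.x release, where lastindex(s) plays the role of endof(s), isvalid(s, i) reports whether a character starts at byte index i, and nextind(s, i) returns the index of the following character. These names are assumptions relative to the release this article describes.

```julia
s = "\u2200 x \u2203 y"     # the string "∀ x ∃ y"

# Byte indices at which a character actually starts.
valid = [i for i in 1:lastindex(s) if isvalid(s, i)]
println(valid)

# Step from one character to the next without guessing byte widths.
second = nextind(s, 1)
println(second, " => ", repr(s[second]))
```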
Because of the variable-width encoding, the number of characters in a string, given by length(s), is not necessarily equal to its last index. If you step through the indices 1 through endof(s) and index into s at each one, the characters returned when no error is thrown are exactly the characters of s. Thus length(s) <= endof(s).
Here is an inefficient and verbose way to iterate over the characters of s:
julia> for i = 1:endof(s)
         try
           println(s[i])
         catch
           # ignore the index error
         end
       end
∀
x
∃
y
Fortunately, a string can be used directly as an iterable object, with no exception handling required:
julia> for c in s
         println(c)
       end
∀
x
∃
y
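The relationship between character count and last index is easy to check directly. This sketch assumes a current Julia 1.x release, where lastindex(s) stands in for endof(s) and eachindex(s) yields only the valid character indices (both names are assumptions relative to the text).

```julia
s = "\u2200 x \u2203 y"

n_chars = length(s)          # number of characters
last_index = lastindex(s)    # last valid byte index
chars = collect(s)           # iterate once and gather the characters

# Pair each character with the byte index at which it starts.
for (i, c) in zip(eachindex(s), s)
    println(i, " => ", repr(c))
end
```

Iterating this way visits each character exactly once, so it avoids both the wasted index probes and the try/catch of the verbose loop.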
UTF-8 is not the only encoding that Julia supports, and adding support for other encodings is easy. In particular, Julia also provides the UTF16String and UTF32String types, constructed by utf16(s) and utf32(s) respectively, for the UTF-16 and UTF-32 encodings. It also provides the alias WString and the constructor wstring(s) for either UTF-16 or UTF-32 strings, depending on the size of Cwchar_t.
For further discussion of UTF-8, see the section on byte array literals below.
Concatenation is one of the most common string operations:
julia> greet = "Hello"
"Hello"
julia> whom = "world"
"world"
julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
Like Perl, Julia allows the use of $ to interpolate values into string literals:
julia> "$greet, $whom.\n"
"Hello, world.\n"
The system rewrites this literal into a string concatenation.
The shortest complete expression after the $ is taken as the expression whose value is interpolated into the string.
You can use parentheses to interpolate any expression:
julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"
Both concatenation and interpolation call the string function to convert objects to String form.
Most non-String objects are converted to strings the same way they are shown in an interactive session:
julia> v = [1,2,3]
3-element Array{Int64,1}:
1
2
3
julia> "v: $v"
"v: [1,2,3]"
Char values can also be interpolated into strings:
julia> c = 'x'
'x'
julia> "hi, $c"
"hi, x"
To include a literal $ in a string literal, escape it with a backslash:
julia> print("I have \$100 in my account.\n")
I have $100 in my account.
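The three interpolation forms (a plain variable, a parenthesized expression, and an escaped dollar sign) can appear in the same literal. The helper below, summarize_cart, is hypothetical and exists only to illustrate this; the sketch assumes a current Julia 1.x release.

```julia
# Hypothetical helper that builds a summary line.
function summarize_cart(name, amounts)
    # $name interpolates a variable, $(...) interpolates a full expression,
    # and \$ produces a literal dollar sign.
    return "$name: $(length(amounts)) items, total \$$(sum(amounts))"
end

line = summarize_cart("cart", [5, 10, 20])
println(line)

# The same text built with explicit concatenation.
println(line == string("cart", ": ", 3, " items, total \$", 35))
```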
Strings can be compared lexicographically using the standard comparison operators:
julia> "abracadabra" < "xylophone"
true
julia> "abracadabra" == "xylophone"
false
julia> "Hello, world." != "Goodbye, world."
true
julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true
Use the search function to find the index of a particular character:
julia> search("xylophone", 'x')
1
julia> search("xylophone", 'p')
5
julia> search("xylophone", 'z')
0
You can supply a third argument to start the search at a given offset:
julia> search("xylophone", 'o')
4
julia> search("xylophone", 'o', 5)
7
julia> search("xylophone", 'o', 8)
0
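Repeating the search from successive offsets finds every occurrence of a character. The sketch below assumes a current Julia 1.x release, where search has been replaced by findfirst and findnext, and a failed search returns nothing rather than 0. The helper all_positions is hypothetical.

```julia
# Hypothetical helper: collect every index at which character c occurs in s.
function all_positions(s::AbstractString, c::Char)
    positions = Int[]
    i = findfirst(isequal(c), s)
    while i !== nothing
        push!(positions, i)
        # Resume just past the hit; nextind keeps the index valid for UTF-8.
        i = findnext(isequal(c), s, nextind(s, i))
    end
    return positions
end

println(all_positions("xylophone", 'o'))
println(all_positions("xylophone", 'z'))
```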
Another useful function for handling strings is repeat:
julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."
Some other useful functions:
endof(str) gives the maximal (byte) index that can be used to index into str.
length(str) gives the number of characters in str.
i = start(str) gives the first valid index at which a character can be found in str (usually 1).
c, j = next(str,i) returns the next character at or after the index i, together with the index of the next valid character after that. Together with start and endof, it can be used to iterate over the characters in str.
ind2chr(str,i) gives the number of characters in str up to and including the one at index i.
chr2ind(str,j) gives the (first byte) index at which the jth character in str occurs.
Julia provides non-standard string literals. A non-standard string literal looks like a regular double-quoted string literal, but is prefixed by an identifier. The regular expressions, byte array literals, and version number literals described below are examples of non-standard string literals. Other examples are given in the Metaprogramming section.
Julia's regular expressions (regexps) are Perl-compatible and are provided by the PCRE library. A regular expression is a non-standard string literal prefixed with r, optionally followed by one or more flags after the closing quote.
The most basic regular expression literal has the form r"...":
julia> r"^\s*(?:#|$)"
r"^\s*(?:#|$)"
julia> typeof(ans)
Regex (constructor with 3 methods)
To check whether a regular expression matches a string, use the ismatch function:
julia> ismatch(r"^\s*(?:#|$)", "not a comment")
false
julia> ismatch(r"^\s*(?:#|$)", "# a comment")
true
ismatch simply returns true or false depending on whether the regular expression matches the string.
The match function returns the details of the match:
julia> match(r"^\s*(?:#|$)", "not a comment")
julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")
If the regular expression does not match the string, match returns nothing, a special value that is not printed in an interactive session.
Apart from not being printed, it is a completely normal value that you can test for in code:
m = match(r"^\s*(?:#|$)", line)
if m == nothing
  println("not a comment")
else
  println("blank or comment")
end
If the regular expression does match, match returns a RegexMatch object. This object records how the expression matched, including the substring that the pattern matched and any captured substrings.
The example above captures only the portion of the string that matches. To capture the non-blank text after the comment character instead, you could write:
julia> m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
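The three possible outcomes (no match, a match without a capture, a match with a capture) can be told apart in code. The helper classify_line below is hypothetical, and the sketch assumes a current Julia 1.x release, where comparing against nothing is usually written with ===.

```julia
# Hypothetical helper: classify a line as code, blank, or comment.
function classify_line(line::AbstractString)
    m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", line)
    if m === nothing
        return :code        # the pattern did not match at all
    elseif m.captures[1] === nothing
        return :blank       # matched through the empty alternative
    else
        return :comment     # the capture group holds the comment text
    end
end

for line in ["# a comment ", "   ", "x = 1"]
    println(repr(line), " => ", classify_line(line))
end
```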
When calling match, you can optionally specify an index at which to start the search.
For example:
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")
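The start index makes it possible to walk through a string one match at a time. The helper digit_offsets below is hypothetical; the sketch assumes a current Julia 1.x release, where lastindex and nextind are the index functions (assumptions relative to the text).

```julia
# Hypothetical helper: offsets of every digit, found by restarting the
# search just past the previous match.
function digit_offsets(text::AbstractString)
    offsets = Int[]
    i = 1
    while i <= lastindex(text)
        m = match(r"[0-9]", text, i)
        m === nothing && break
        push!(offsets, m.offset)
        i = nextind(text, m.offset)   # the pattern matches a single character
    end
    return offsets
end

println(digit_offsets("aaaa1aaaa2aaaa3"))
```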
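The following information can be extracted from a RegexMatch object:
m.match: the entire substring that matched
m.captures: the captured substrings
m.offset: the offset at which the whole match begins
m.offsets: the offsets of the captured substrings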
For a capture group that does not match, m.captures contains nothing in that position instead of a substring, and m.offsets contains a zero offset (indices in Julia are 1-based, so a zero offset into a string is invalid):
julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")
julia> m.match
"acd"
julia> m.captures
3-element Array{Union(SubString{UTF8String},Nothing),1}:
"a"
"c"
"d"
julia> m.offset
1
julia> m.offsets
3-element Array{Int64,1}:
1
2
3
julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")
julia> m.match
"ad"
julia> m.captures
3-element Array{Union(SubString{UTF8String},Nothing),1}:
"a"
nothing
"d"
julia> m.offset
1
julia> m.offsets
3-element Array{Int64,1}:
1
0
2
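Code that consumes captures usually has to handle the unmatched case explicitly. This is a minimal sketch, assuming a current Julia 1.x release; the "<none>" placeholder is an arbitrary choice for illustration.

```julia
m = match(r"(a|b)(c)?(d)", "ad")

# Replace missing captures with a placeholder before using them as strings.
parts = [c === nothing ? "<none>" : String(c) for c in m.captures]
println(parts)

# A zero offset marks the same unmatched group.
unmatched = [i for (i, off) in enumerate(m.offsets) if off == 0]
println(unmatched)
```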
You can bind the resulting multiple groups to local variables:
julia> first, second, third = m.captures; first
"a"
You can modify the behavior of a regular expression with some combination of the flags i, m, s, and x after the closing double quote.
These flags have the same meaning as in Perl; see the perlre manpage for details:
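i Case-insensitive matching.
m Multi-line matching. "^" and "$" match at the start and end of each line.
s Single-line matching. "." matches any character, including a newline.
When used together, as in r""ms, "." matches any character, while "^" and "$" match just after and just before newlines within the string.
x Ignore most whitespace unless it is escaped with a backslash. You can use this flag to break a regular expression into readable pieces. The '#' character is treated as a metacharacter that introduces a comment.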
For example, the following regular expression has the i, s, and m flags turned on:
julia> r"a+.*b+.*?d$"ism
r"a+.*b+.*?d$"ims
julia> match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
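The effect of each flag is easier to see in isolation. The sketch below assumes a current Julia 1.x release, where occursin(regex, string) is the boolean test (an assumption relative to the ismatch function used in this article).

```julia
text = "Goodbye,\nOh, angry,\nBad world\n"

println(occursin(r"^bad", text))     # no flags
println(occursin(r"^bad"i, text))    # i only: ^ still anchors at the start of the string
println(occursin(r"^bad"im, text))   # i and m: ^ also matches at the start of each line

# s lets "." cross newlines, so a single match can span several lines.
m = match(r"a+.*b+.*?d$"ism, text)
println(repr(m.match))
```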
Triple-quoted regular expression literals of the form r"""...""" are also supported.
This form is convenient when the regular expression contains quotation marks or newlines.
Another useful non-standard string literal is the byte array literal, b"...", which lets you use string notation to express a literal byte array, i.e. an array of Uint8 values. By convention, non-standard literals with uppercase prefixes produce actual string objects, while those with lowercase prefixes produce non-string objects such as byte arrays or compiled regular expressions.
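The rules for byte array literals are as follows:
ASCII characters and ASCII escapes produce a single byte.
\x and octal escape sequences produce the byte corresponding to the escape value.
Unicode escape sequences produce the sequence of bytes that encodes that code point in UTF-8.
The following example covers all three cases: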
julia> b"DATA\xff\u2200"
8-element Array{Uint8,1}:
0x44
0x41
0x54
0x41
0xff
0xe2
0x88
0x80
The ASCII string "DATA" corresponds to the bytes 68, 65, 84, 65. \xff produces the single byte 255. The Unicode escape \u2200 is encoded in UTF-8 as the three bytes 226, 136, 128.
Note that the resulting byte array does not correspond to a valid UTF-8 string. If you try to use it as a regular string literal, you get a syntax error:
julia> "DATA\xff\u2200"
ERROR: syntax: invalid UTF-8 sequence
Also note the significant difference between \xff and \uff: the former escape sequence encodes the byte 255, whereas the latter represents the code point 255, which is encoded as two bytes in UTF-8:
julia> b"\xff"
1-element Array{Uint8,1}:
0xff
julia> b"\uff"
2-element Array{Uint8,1}:
0xc3
0xbf
In character literals, the two are the same: \xff is allowed to represent the code point 255, because a character always represents a code point.
In strings, however, \x escapes always represent bytes rather than code points, while \u and \U escapes always represent code points, which are encoded as one or more bytes.
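The byte-level difference between the escapes can be checked programmatically. This sketch assumes a current Julia 1.x release, where the element type is spelled UInt8 and collect is used here to turn a b"..." literal into a plain byte vector (both are assumptions relative to the text).

```julia
bytes = collect(b"DATA\xff\u2200")   # materialize the literal as a Vector{UInt8}
println(bytes)

one_byte = collect(b"\xff")     # \x escape: the single byte 255
two_bytes = collect(b"\uff")    # \u escape: code point 255, UTF-8 encoded
println(length(one_byte), " ", length(two_bytes))
```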
Version numbers can easily be expressed with non-standard string literals of the form v"...". A version number literal creates a VersionNumber object that follows the semantic versioning specification, so it is composed of major, minor, and patch numeric values, followed by pre-release and build alphanumeric annotations. For example, v"0.2.1-rc1+win64" is broken into major version 0, minor version 2, patch version 1, pre-release rc1, and build win64.
When you enter a version literal, all fields except the major version number are optional, so v"0.2" is equivalent to v"0.2.0", v"2" is equivalent to v"2.0.0", and so on.
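The components of a version literal can be inspected directly. This is a minimal sketch, assuming a current Julia 1.x release, where a VersionNumber exposes the fields major, minor, patch, prerelease, and build (the field names are an assumption relative to the text).

```julia
v = v"0.2.1-rc1+win64"

println((v.major, v.minor, v.patch))   # numeric components
println(v.prerelease)                  # pre-release annotation(s), as a tuple
println(v.build)                       # build annotation(s), as a tuple

# Omitted fields default to zero.
println(v"0.2" == v"0.2.0")
println(v"2" == v"2.0.0")
```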
VersionNumber objects make it easy to compare two (or more) versions correctly.
For example, the constant VERSION holds the Julia version number as a VersionNumber object, so you can define version-specific behavior with simple statements such as:
if v"0.2" <= VERSION < v"0.3-"
# do something specific to 0.2 release series
end
Note that the example above uses the non-standard version number v"0.3-", with a trailing -. This notation is a Julia extension of the standard, and it denotes a version lower than any 0.3 release, including all of its pre-releases. So the code in the example above runs only with stable 0.2 versions and excludes versions such as v"0.3.0-rc1".
To also allow unstable (i.e. pre-release) 0.2 versions, the lower-bound check should be changed to v"0.2-" <= VERSION.
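The difference between the two lower bounds shows up when a pre-release version is tested. The helpers stable_02 and any_02 below are hypothetical names introduced only for this sketch, which assumes a current Julia 1.x release.

```julia
# Hypothetical helpers wrapping the two range checks.
stable_02(v::VersionNumber) = v"0.2" <= v < v"0.3-"    # stable 0.2 releases only
any_02(v::VersionNumber)    = v"0.2-" <= v < v"0.3-"   # also 0.2 pre-releases

for v in [v"0.2.0-rc1", v"0.2.3", v"0.3.0-rc1", v"0.3.0"]
    println(v, ": stable_02 = ", stable_02(v), ", any_02 = ", any_02(v))
end
```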
Another non-standard version specification extension allows a trailing + to express an upper limit on build versions. For example, VERSION > v"0.2-rc1+" can be used to mean any version above 0.2-rc1 and any of its builds: it returns false for v"0.2-rc1+win64" and true for v"0.2-rc2".
It is good practice to use such special versions in comparisons (in particular, the trailing - should always be used on upper bounds unless there is a good reason not to), but they must not be used as the actual version number of anything, because they are invalid in the semantic versioning scheme.
Besides being used for the VERSION constant, VersionNumber objects are used in the Pkg module to specify package versions and their dependencies.