
Julia code performance optimization


May 14, 2021 Julia


Code performance optimization

The following sections describe some techniques for speeding up Julia code.

Avoid global variables

The value and type of a global variable can change at any time, which makes it difficult for the compiler to optimize code that uses global variables. Use local variables wherever possible, or pass values to functions as arguments.

Performance-critical code should be put into functions.
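
As a minimal sketch of both points (the names sum_global and sum_arg are illustrative, and exact timings vary): the first loop reads a non-constant global, so the compiler cannot assume its type, while the second receives the same data as an argument and specializes on its type:

  data = rand(1000)              # a non-constant global

  function sum_global()
      s = 0.0
      for x in data              # type of data is unknown at compile time
          s += x
      end
      s
  end

  function sum_arg(v)
      s = 0.0
      for x in v                 # type of v is known; this compiles to a tight loop
          s += x
      end
      s
  end

  # sum_arg(data) is typically many times faster than sum_global()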

Declaring a global variable as a constant can significantly improve performance:

const DEFAULT_VAL = 0

If you must use global variables, annotating their types at the point of use also helps the compiler optimize:

global x
y = f(x::Int + 1)

Writing code inside functions is also better style: it produces more reusable code and makes the inputs and outputs clearer.

Use @time to measure performance and watch memory allocation

The most useful tool for measuring performance is the @time macro. The following example shows how to use it:

  julia> function f(n)
             s = 0
             for i = 1:n
                 s += i/2
             end
             s
          end
  f (generic function with 1 method)

  julia> @time f(1)
  elapsed time: 0.008217942 seconds (93784 bytes allocated)
  0.5

  julia> @time f(10^6)
  elapsed time: 0.063418472 seconds (32002136 bytes allocated)
  2.5000025e11

On the first call, @time f(1), f gets compiled. (If you have not yet used @time in this session, the timing functions are compiled as well.) The result of this first run is therefore not very meaningful. On the second call, the macro prints the time the call took, and note that a large amount of memory was allocated during this execution. Reporting allocations is the big advantage of the @time macro over functions like tic and toc, which only report time.

Unexpectedly large memory allocations often mean that some part of the program has a problem, usually related to type stability. Therefore, beyond the cost of the allocations themselves, it is likely that the code Julia generates for your function has serious performance problems. Take such signs seriously and follow the recommendations below.

As a teaser, an improved version of this function allocates no memory (apart from returning the result to the REPL) and runs about thirty times faster:

  julia> @time f_improved(10^6)
  elapsed time: 0.00253829 seconds (112 bytes allocated)
  2.5000025e11

The following sections explain how to identify the problem with f and how to fix it.
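
The article does not show f_improved itself; here is a plausible sketch, assuming the culprit is the type instability of s (it starts as the integer 0 and becomes a Float64 after the first s += i/2):

  function f_improved(n)
      s = 0.0          # a Float64 from the start, so the type of s never changes
      for i = 1:n
          s += i/2
      end
      s
  end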

In some cases, your function may need to allocate memory as part of its operation, which can complicate matters. In such cases, consider using one of the tools below to diagnose the problem, or write a version of the function that separates allocation from its algorithmic core (see Pre-allocating outputs below).
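
As a minimal sketch of the second approach (the names kernel! and squared are illustrative, not from the manual), the in-place kernel does all the work without allocating, and a thin wrapper isolates the one allocation:

  # the algorithm itself: writes into pre-existing storage, allocates nothing
  function kernel!(out, x)
      for i = 1:length(x)
          out[i] = x[i]^2
      end
      out
  end

  # the only place where memory is allocated
  squared(x) = kernel!(similar(x), x)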

Tools

Julia provides several tools for identifying performance problems:

- Profiling (http://julia-cn.readthedocs.org/zh_cn/late/stdlib/profile/#stdlib-profiling) can be used to measure the performance of running code and to find bottlenecks. For complex projects, the ProfileView package can display the profiling results graphically.
- Unexpectedly large memory allocations, as reported by @time, @allocated, or the profiler, suggest that your code may have problems. If you see no other reason for the allocations, suspect a type problem. You can also start Julia with the --track-allocation=user option and then inspect the resulting *.mem files to see where those allocations occur.
- The TypeCheck package can point out certain kinds of type problems.

Avoid containers with abstract type parameters

When working with parameterized types, such as arrays, it is best to avoid parameterizing with abstract types where possible. Consider the following:

  a = Real[]    # typeof(a) = Array{Real,1}
  if (f = rand()) < .8
      push!(a, f)
  end

Because a is an array of the abstract type Real, it must be able to hold any Real value, so it is represented as an array of pointers to individually allocated objects. Since f is always a Float64 here, it is better to declare a = Float64[], which creates a contiguous block of 64-bit floating-point values that can be manipulated efficiently.

Access arrays in memory order, along columns

Multidimensional arrays in Julia are stored in column-major order: arrays are stacked one column at a time, as can be seen by flattening a matrix:

  julia> x = [1 2; 3 4]
  2x2 Array{Int64,2}:
   1  2
   3  4

  julia> x[:]
  4-element Array{Int64,1}:
   1
   3
   2
   4

This convention for ordering arrays is common in many languages such as Fortran, Matlab, and R (to name a few). The alternative is row-major ordering, the convention adopted by C and by Python (numpy), among others. Remembering the ordering of arrays can have a dramatic effect on performance when looping over them. A rule of thumb to keep in mind is that with column-major arrays, the first index changes fastest: looping is faster when the innermost loop index is the first index into the array.

Consider the following contrived example. Suppose we want to implement a function that accepts a Vector and returns a square Matrix whose rows or columns are filled with copies of the input vector. Assume that it does not matter whether the rows or the columns are filled this way (perhaps the rest of the code can easily be adapted accordingly). We could conceivably do this in at least four ways (in addition to the recommended call to the built-in repmat):

  function copy_cols{T}(x::Vector{T})
      n = size(x, 1)
      out = Array(eltype(x), n, n)
      for i = 1:n
          out[:, i] = x
      end
      out
  end

  function copy_rows{T}(x::Vector{T})
      n = size(x, 1)
      out = Array(eltype(x), n, n)
      for i = 1:n
          out[i, :] = x
      end
      out
  end

  function copy_col_row{T}(x::Vector{T})
      n = size(x, 1)
      out = Array(T, n, n)
      for col = 1:n, row = 1:n
          out[row, col] = x[row]
      end
      out
  end

  function copy_row_col{T}(x::Vector{T})
      n = size(x, 1)
      out = Array(T, n, n)
      for row = 1:n, col = 1:n
          out[row, col] = x[col]
      end
      out
  end

Now we time each of these functions using the same random input vector:

  julia> x = randn(10000);

  julia> fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))

  julia> map(fmt, {copy_cols, copy_rows, copy_col_row, copy_row_col});
  copy_cols:    0.331706323
  copy_rows:    1.799009911
  copy_col_row: 0.415630047
  copy_row_col: 1.721531501

Notice that copy_cols is much faster than copy_rows. This is expected, because copy_cols respects the column-based memory layout of the Matrix and fills it one column at a time. In addition, copy_col_row is much faster than copy_row_col because it follows our rule of thumb: the first index to appear in a slice expression should be coupled with the innermost loop.
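
The same rule applies when reading an array, not just filling one. Here is a minimal sketch (the function names are illustrative, and exact timings vary by machine): both functions sum every element of a matrix, but the first keeps the first index in the innermost loop and therefore walks memory contiguously, which is typically several times faster:

  A = rand(1000, 1000)

  # inner loop over the first index: contiguous, cache-friendly access
  function sum_first_inner(A)
      s = 0.0
      for j = 1:size(A, 2), i = 1:size(A, 1)
          s += A[i, j]
      end
      s
  end

  # inner loop over the second index: strided, cache-unfriendly access
  function sum_second_inner(A)
      s = 0.0
      for i = 1:size(A, 1), j = 1:size(A, 2)
          s += A[i, j]
      end
      s
  end
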
Pre-allocating outputs

If your function returns an Array or some other complex type, it may have to allocate memory. Unfortunately, allocation and its converse, garbage collection, are often substantial bottlenecks. Sometimes you can avoid allocating memory on each call by pre-allocating the output. As a trivial example, compare

  function xinc(x)
      return [x, x+1, x+2]
  end

  function loopinc()
      y = 0
      for i = 1:10^7
          ret = xinc(i)
          y += ret[2]
      end
      y
  end

with

  function xinc!{T}(ret::AbstractVector{T}, x::T)
      ret[1] = x
      ret[2] = x+1
      ret[3] = x+2
      nothing
  end

  function loopinc_prealloc()
      ret = Array(Int, 3)
      y = 0
      for i = 1:10^7
          xinc!(ret, i)
          y += ret[2]
      end
      y
  end

Timing results:

  julia> @time loopinc()
  elapsed time: 1.955026528 seconds (1279975584 bytes allocated)
  50000015000000

  julia> @time loopinc_prealloc()
  elapsed time: 0.078639163 seconds (144 bytes allocated)
  50000015000000

Pre-allocation has other advantages, for example allowing the caller to control the "output" type of an algorithm. In the example above, we could have passed a SubArray rather than an Array, had we so desired. Taken to its extreme, pre-allocation can make your code uglier, so performance measurements and some judgment may be required.

Avoid string interpolation for I/O

When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of:

  println(file, "$a $b")

use:

  println(file, a, " ", b)

The first version forms a string and then writes it to the file, while the second writes the values directly to the file. Also note that string interpolation can sometimes be harder to read. Compare:

  println(file, "$(f(a))$(f(b))")

with:

  println(file, f(a), f(b))

Fix deprecation warnings

A deprecated function internally performs a lookup in order to display a warning, and this lookup hurts performance. It is recommended to modify your code as the deprecation warnings suggest.

Tweaks

Some minor points can make inner loops tighter (short sketches of these appear at the end of this article):

- Avoid unnecessary arrays. For example, instead of sum([x,y,z]) use x+y+z.
- For small integer powers, * is better than ^. For example, use x*x*x instead of x^3.
- Use abs2(z) instead of abs(z)^2 for complex z. In general, try to rewrite code to use abs2 instead of abs for complex arguments.
- For integer division, use div(x,y) instead of trunc(x/y), fld(x,y) instead of floor(x/y), and cld(x,y) instead of ceil(x/y).

Performance annotations

Sometimes you can enable better optimization by promising certain properties of your program.

- Use @inbounds to eliminate array bounds checking within expressions. Be certain before doing this: if a subscript is ever out of bounds, you may suffer crashes or silently incorrect results.
- Write @simd in front of for loops that are amenable to vectorization. This feature is experimental and may change in future versions of Julia.
Here is an example containing both forms:

  function inner(x, y)
      s = zero(eltype(x))
      for i = 1:length(x)
          @inbounds s += x[i]*y[i]
      end
      s
  end

  function innersimd(x, y)
      s = zero(eltype(x))
      @simd for i = 1:length(x)
          @inbounds s += x[i]*y[i]
      end
      s
  end

  function timeit(n, reps)
      x = rand(Float32, n)
      y = rand(Float32, n)
      s = zero(Float64)
      time = @elapsed for j in 1:reps
          s += inner(x, y)
      end
      println("GFlop        = ", 2.0*n*reps/time*1E-9)
      time = @elapsed for j in 1:reps
          s += innersimd(x, y)
      end
      println("GFlop (SIMD) = ", 2.0*n*reps/time*1E-9)
  end

  timeit(1000, 1000)

On a computer with a 2.4GHz Intel Core i5 processor, this produces:

  GFlop        = 1.9467069505224963
  GFlop (SIMD) = 17.578554163920018

The range of a @simd for loop should be one-dimensional. A variable used for accumulating, such as s in the example, is called a reduction variable. By using @simd, you assert several properties of the loop:

- It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
- Floating-point operations on reduction variables can be reordered, possibly producing different results than without @simd.
- No iteration ever waits on another iteration to make forward progress.

Using @simd merely gives the compiler license to vectorize; whether it actually does so depends on the compiler. To actually benefit from the current implementation, your loop should also have the following properties:

- The loop must be an innermost loop.
- The loop body must be straight-line code. This is why @inbounds is currently needed for all array accesses.
- Accesses must follow a stride pattern and cannot be "gathers" (random-index reads) or "scatters" (random-index writes).
- The stride should be unit stride.
- In some simple cases, for example with 2 or 3 arrays accessed in a loop, LLVM auto-vectorization may kick in automatically, with no further speedup from @simd.
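
To close, here are the promised sketches for the Tweaks section (all values are illustrative):

  x, y, z = 1.0, 2.0, 3.0
  x + y + z       # instead of sum([x, y, z]): no temporary array is built

  x*x*x           # instead of x^3 for small integer powers

  w = 3.0 + 4.0im
  abs2(w)         # 25.0; cheaper than abs(w)^2, which computes a square root

  a, b = 7, 2
  div(a, b)       # 3: truncating division, instead of trunc(a/b)
  fld(a, b)       # 3: floor division, instead of floor(a/b)
  cld(a, b)       # 4: ceiling division, instead of ceil(a/b)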