performance - Suggestions for improving Julia performance

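This is my first attempt at moving from Matlab to Julia. My code already runs about 3x faster, but I think there is more to gain. I am not using any global variables inside the function, and I have preallocated all the arrays it uses (I think?). Any thoughts on how to speed it up further would be greatly appreciated. Even at the current improvement I think I will convert fully!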

function word_sim(tau::Int, omega::Int, mu::Float64)
# inserts a word in position (tau+1), at each point creates a new word with prob mu
# otherwise randomly chooses a previously used. Runs the program until time omega

words = zeros(Int32, 1, omega) # to store the words
tests = rand(1,omega) # will compare mu to these
words[1] = 1; # initialize the words
next_word = 2 # will be the next word used
words[tau+1] = omega + 1; # max possible word so insert that at time tau
innovates = mu .> tests; # when we'll make a new word
for i = 2:tau # simulate the process
    if innovates[i] == 1 # innovate 
        words[i] = next_word
        next_word = next_word + 1
    else # copy
        words[i] = words[rand(1:(i-1))]
    end
end
# force the word we're interested in
for i = (tau+2):omega
    if innovates[i] == 1 # innovate 
        words[i] = next_word
        next_word = next_word + 1
    else # copy
        words[i] = words[rand(1:(i-1))]
    end
end
result = sum(words .== (omega + 1)); # count how many times our word occurred
return result
end

When I run it with these values, it takes ~0.26 seconds on my PC

using Statistics
@time begin
nsim = 10^3;
omega = 100;
seed = [0:1:(omega-1);]; 
mu = 0.01; 

results = zeros(Float64, 1, length(seed));
pops = zeros(Int64, 1, nsim);
for tau in seed
    for jj = 1:nsim
        pops[jj] = word_sim(tau, omega, mu);
    end
    results[tau+1] = mean(pops);
end
end

Or would I perhaps be better off writing the code in C++? Julia was my first choice because I have heard rave reviews of its syntax, which honestly is fantastic!

Any comments are greatly appreciated.

Best answer

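A 3x speedup is a nice start, but it turns out there are a few more things you can do to improve performance significantly!
As a starting point, running the example posted above in Julia 1.6.1, I get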

  0.301665 seconds (798.10 k allocations: 164.778 MiB, 12.70% gc time)

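That is a lot of allocations, along with a fair amount of garbage collector ("gc") time, so it looks like we are generating quite a bit of garbage here. Some of the culprits are lines like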

tests = rand(1,omega) # will compare mu to these

or

innovates = mu .> tests; # when we'll make a new word

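In languages like Matlab or Python, precomputing whole vectors like this in one go can be good for performance. In Julia it is usually unnecessary and can even hurt, because each of these lines causes a brand-new array to be allocated. If we remove them and generate our tests on the fly, we avoid those allocations. Another line that allocates here is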

result = sum(words .== (omega + 1))

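which first builds a whole new array before summing it. You can avoid this by writing it as a for loop (that may feel wrong coming from Matlab, but loops are quite fast in Julia). Alternatively, to keep it as a one-liner, use count or sum with a function that performs the comparison as the first argument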

result = count(x->(x == omega+1), words)

(in this case, just the anonymous function x->(x == omega+1) ).
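To see the difference for yourself, here is a minimal sketch comparing the two forms with @allocated. The helper names count_with_broadcast and count_with_predicate and the test data are made up for illustration. The exact byte counts depend on your Julia version, so none are quoted here, but the broadcast version should normally report more.

```julia
# Two ways to count how often `target` occurs in `words`.

# Broadcasting builds a temporary Boolean array, then sums it.
count_with_broadcast(words, target) = sum(words .== target)

# Passing a predicate to `count` tests each element in turn, with no temporary array.
count_with_predicate(words, target) = count(x -> x == target, words)

words = rand(Int32(1):Int32(5), 1, 100)  # hypothetical test data, same shape as in the question
target = 3

# Call each function once first so compilation is not part of the measurement.
count_with_broadcast(words, target)
count_with_predicate(words, target)

println("broadcast: ", @allocated(count_with_broadcast(words, target)), " bytes")
println("predicate: ", @allocated(count_with_predicate(words, target)), " bytes")
```

Both functions return the same count; only the amount of temporary memory differs.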
Adding these changes so far:

function word_sim(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    words = zeros(Int32, 1, omega) # to store the words
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    for i = 2:tau # simulate the process
        if mu > rand()  # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end

With the same timing code, this now gets us down to

  0.177766 seconds (298.10 k allocations: 51.863 MiB, 13.01% gc time)

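So about 60% of the time and well under half the allocations. There is more to gain, though!
First, let's move the allocation of the words array out of the word_sim function and create an in-place version of that function instead. We can also speed things up by adding @inbounds to the tight for loops.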

function word_sim!(words::AbstractArray, tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    fill!(words, 0) # Probably not necessary actually, but I haven't spent enough time looking at the code to be sure
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    @inbounds for i = 2:tau # simulate the process
        if mu > rand()  # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    @inbounds for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end

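By Julia convention, in-place functions that modify one of their input arguments are marked with a ! at the end of their name, hence the new function name.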
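As a small illustration of that convention, separate from the simulation itself, here is a sketch of an allocating function next to its in-place counterpart. The names scaled and scaled! are hypothetical and exist only for this example.

```julia
# Out-of-place: allocates and returns a new vector on every call.
scaled(x::AbstractVector, a::Real) = a .* x

# In-place: overwrites the caller's buffer `out`, hence the trailing `!`.
function scaled!(out::AbstractVector, x::AbstractVector, a::Real)
    @assert length(out) == length(x)
    for i in eachindex(out, x)
        out[i] = a * x[i]
    end
    return out
end

x = [1.0, 2.0, 3.0]
out = similar(x)   # allocate the buffer once...
for a in 1:3       # ...and reuse it across many calls
    scaled!(out, x, a)
end
```

The ! changes nothing about how the function behaves. It is purely a naming signal to the caller that an argument (by convention, the first one) gets modified.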
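Since we now have to modify the timing code slightly to preallocate words, let's also take the opportunity to put that timing code into a function, so that no global variables are involved in the timing.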

function run_word_sim()
    nsim = 10^3
    omega = 100
    seed = [0:1:(omega-1);]
    mu = 0.01

    results = zeros(Float64, 1, length(seed))
    pops = zeros(Int64, 1, nsim)
    words = zeros(Int32, 1, omega) # to store the words
    for tau in seed
        for jj = 1:nsim
            pops[jj] = word_sim!(words, tau, omega, mu)
        end
        results[tau+1] = mean(pops)
    end
    return results
end

We can then get the most accurate timing results (along with some useful plots and statistics) from the BenchmarkTools package and its @btime or @benchmark macros

julia> using BenchmarkTools

julia> @btime run_word_sim()
  124.178 ms (4 allocations: 10.17 KiB)

or, with @benchmark:

[Image: benchmark result with histogram and statistics]

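So this is roughly 2.4x faster than the original 0.30 seconds, and it cuts the number of allocations and the memory use by four to five orders of magnitude, down to just the four arrays used in the timing code (seed, results, pops, and words).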
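For absolute maximum performance, you could go further with LoopVectorization.jl and its @turbo macro, although that would likely require changing the algorithm: these loops depend on previous state, so they appear to be incompatible with loop reordering. You could, however, turn the count into a for loop and apply @turbo to that, for a slight additional speedup.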
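As a rough sketch of what that loop could look like, here is a version that uses @simd from Base as a stand-in, so that it runs without extra packages. The function name count_matches is made up for this example. The assumption is that, with LoopVectorization.jl installed, the @inbounds @simd annotation could be swapped for @turbo, but I have not benchmarked that here.

```julia
# Count occurrences of `target` with an explicit loop instead of `count`.
# `count_matches` is a hypothetical helper, not part of the original code.
function count_matches(words::AbstractArray{<:Integer}, target::Integer)
    n = 0
    # Each iteration is independent of the others, so reordering is safe here,
    # unlike in the main simulation loops.
    @inbounds @simd for i in eachindex(words)
        n += (words[i] == target)  # a Bool is added as 0 or 1
    end
    return n
end

words = Int32[1, 101, 2, 101, 101, 3]
count_matches(words, 101)
```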
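There are also other options for random number generation that may be faster, such as VectorizedRNG.jl, as discussed in the Discourse thread linked in the comments. Allocating a new vector of random numbers on every call to word_sim is probably not optimal, but RNGs are usually faster when they can generate many random numbers at once. Passing a preallocated buffer of random numbers to word_sim! and filling it in place with rand!, as provided by the Random stdlib or by VectorizedRNG, may therefore give a significant additional speedup.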
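A minimal sketch of that buffered approach, using only rand! from the Random standard library, might look like the following. The function name word_sim_buffered! and the tests argument are hypothetical, and the two loops are merged into one that skips position tau+1. Only the innovation tests are buffered; the copy step still calls rand(1:(i-1)). Whether this actually beats calling rand() inside the loop is something to measure, not assume.

```julia
using Random  # standard library; provides rand!

# Hypothetical variant of word_sim! that reads its innovation tests from a
# caller-supplied buffer `tests` instead of calling rand() in the loop.
function word_sim_buffered!(words::AbstractArray, tests::AbstractArray{Float64},
                            tau::Int, omega::Int, mu::Float64)
    @assert length(words) >= omega && length(tests) >= omega
    @assert 0 <= tau < omega
    rand!(tests)                  # refill the existing buffer in place, no new array
    fill!(words, 0)
    words[1] = 1                  # initialize the words
    next_word = 2                 # will be the next word used
    words[tau+1] = omega + 1      # insert the word of interest at time tau
    @inbounds for i = 2:omega
        i == tau + 1 && continue  # position tau+1 holds the forced word
        if mu > tests[i]          # innovate
            words[i] = next_word
            next_word += 1
        else                      # copy an earlier word
            words[i] = words[rand(1:(i-1))]
        end
    end
    return count(x -> (x == omega + 1), words)
end

omega = 100
words = zeros(Int32, 1, omega)    # allocated once, reused on every call
tests = zeros(Float64, 1, omega)  # random-number buffer, also reused
n = word_sim_buffered!(words, tests, 10, omega, 0.01)
```

In run_word_sim, both buffers would be allocated once before the loops and passed in on every call, just like words.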
Some of the tricks and rules of thumb used in this answer are discussed more generally at https://github.com/brenhinkeller/JuliaAdviceForMatlabProgrammers, along with a few other general Matlab -> Julia tips.

Regarding "performance - Suggestions for improving Julia performance", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61349348/
