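arrays - What is an efficient way to split an array into training and test sets in Julia?

Tags: arrays optimization machine-learning julia

I'm running a machine learning algorithm in Julia on a machine with limited spare memory. I've noticed a fairly large bottleneck in code I'm using from a repository: (randomly) splitting the array takes longer than reading the file from disk, which suggests the splitting code is inefficient. Any tips for speeding up this function would be greatly appreciated. The original function can be found here. Since it is short, I'm also posting it below.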

using Random  # shuffle lives in the Random standard library on Julia 0.7 and later

# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding a rating to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other rating.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

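As noted above, any way to get the data through this step faster would be much appreciated. I don't strictly need to keep the ability to remove duplicates, but it would be a nice feature. Also, if this is already implemented in a Julia library, I'd be glad to hear about it. Bonus points for any solution that takes advantage of Julia's parallelism!

Best answer

In terms of memory, this is the most efficient code I could come up with.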

using Random  # provides shuffle! on Julia 0.7 and later

function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
  N = length(ratings)
  splitindex = round(Int, target_percentage * N)
  shuffle!(ratings) # Shuffles in place, which avoids allocating another array
  # view (called sub before Julia 0.5) returns subarrays instead of copying the original array
  return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)
end
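
To make the behaviour of this approach concrete, here is a minimal usage sketch. It assumes a hypothetical `Rating` struct with integer `user` and `item` fields plus a `value` field (the real definition lives in the asker's repository and is not shown here), and it uses `view`, which is what `sub` has been called since Julia 0.5. The function name `split_views!` is also made up for this illustration. Note that, unlike the function in the question, a plain shuffle-and-slice split does not guarantee that every user and item in the test set also appears in the training set.

```julia
using Random

# Hypothetical stand-in for the asker's Rating type (an assumption).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Same idea as the answer: shuffle in place, then return two views.
# The trailing ! signals that the input array is reordered.
function split_views!(ratings::Vector{Rating}, target_percentage=0.10)
    n = length(ratings)
    splitindex = round(Int, target_percentage * n)
    shuffle!(ratings)
    return view(ratings, splitindex+1:n), view(ratings, 1:splitindex)
end

# 100 users x 10 items = 1000 synthetic ratings
ratings = [Rating(u, i, 1.0) for u in 1:100 for i in 1:10]
training_set, test_set = split_views!(ratings)

println((length(training_set), length(test_set)))  # (900, 100)
println(parent(test_set) === ratings)              # true: a view, not a copy
```

Because both results are views into `ratings`, reordering or overwriting `ratings` afterwards also changes what the training and test sets contain.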

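However, Julia's extremely slow file IO is now the bottleneck. This algorithm takes about 20 seconds to run on an array of 170 million elements, so I'd say it performs quite well.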
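
If the property from the question matters (every user and item in the test set must also occur in the training set), the in-place shuffle can be combined with the original loop. The following is only a sketch under assumptions, not the asker's or the answerer's code: it again uses a hypothetical `Rating` struct with `Int` ids, a made-up function name `split_ratings_typed!`, concretely typed `Set{Int}` containers in place of the untyped `Set()` (which is a `Set{Any}` and is a likely source of overhead), and index vectors with views so that no `Rating` values are copied.

```julia
using Random

# Hypothetical stand-in for the asker's Rating type (an assumption).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Shuffle in place, then assign each rating to train or test by index.
# A rating goes to the test set only if its user and item were both
# seen earlier and the test set is still below the target size.
function split_ratings_typed!(ratings::Vector{Rating}, target_percentage=0.10)
    shuffle!(ratings)
    max_test = target_percentage * length(ratings)
    seen_users = Set{Int}()   # concretely typed, unlike Set()
    seen_items = Set{Int}()
    train_idx = Int[]
    test_idx = Int[]
    sizehint!(train_idx, length(ratings))
    for (k, rating) in enumerate(ratings)
        if rating.user in seen_users && rating.item in seen_items &&
           length(test_idx) < max_test
            push!(test_idx, k)
        else
            push!(train_idx, k)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    # Views over index vectors avoid copying the ratings themselves.
    return view(ratings, train_idx), view(ratings, test_idx)
end

# 50 users x 20 items = 1000 synthetic ratings
ratings = [Rating(u, i, 1.0) for u in 1:50 for i in 1:20]
train, test = split_ratings_typed!(ratings)
println(length(train) + length(test))  # 1000
```

The first rating seen for any user or item always lands in the training set, which is what preserves the coverage property. The cost relative to the pure slicing approach is two index vectors and two sets; whether that trade-off is acceptable depends on how tight memory is.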

Regarding "arrays - What is an efficient way to split an array into training and test sets in Julia?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37036757/
