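dask - Accessing a single element of a large published array with Dask

Is there a faster way to retrieve only a single element of a large published array with Dask, without retrieving the whole array?

In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').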

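import distributed

client = distributed.Client()
data = [1] * 10000000  # plain Python list with 1e7 elements
payload = {'array1': data}
client.publish_dataset(**payload)  # the whole list is sent to the scheduler

# get_dataset returns the full list; the [0] is applied locally afterwards
one_element = client.get_dataset('array1')[0]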

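Best answer

Note that anything you publish is sent to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections such as dask.array.

Client 1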

import dask.array as da

# client is a distributed.Client connected to the cluster, as in the question
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 size chunks
x = x.persist()  # persist array on the workers of the cluster

client.publish_dataset(x=x)  # store the metadata of x on the scheduler
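
To see why selecting one element can be cheap once the array is chunked, it may help to work out the chunk layout by hand. The sketch below is plain Python rather than Dask, and the helper name `locate` is hypothetical; it only shows the index arithmetic for a 1-D array of 1e7 elements split into chunks of 1e5.

```python
def locate(index, chunk_size):
    """Return (chunk number, offset inside that chunk) for a 1-D index."""
    return divmod(index, chunk_size)


n_elements = 10_000_000
chunk_size = 100_000

# Ceiling division: number of chunks needed to cover the array
n_chunks = -(-n_elements // chunk_size)

print(n_chunks)                        # 100
print(locate(0, chunk_size))           # (0, 0)
print(locate(1_234_567, chunk_size))   # (12, 34567)
```

Element 0 lives in the first of 100 chunks, so a selection such as x[0] only needs to read from that one chunk rather than from the whole array.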

Client 2

x = client.get_dataset('x')  # get the lazy collection x
x[0].compute()  # this selection happens on the worker, only the result comes down

Regarding dask - accessing a single element of a large published array with Dask, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45225485/
