python - Best strategy for reading multiple large CSV files with multiprocessing in Python?

I'm writing some code and would like to improve it with multiprocessing.

Initially, I had the following code:

with Pool() as p:
    lst = p.map(self._path_to_df, paths)
...
df = pd.concat(lst, ignore_index=True)

where self._path_to_df() basically just calls pandas.read_csv(...) and returns a pandas DataFrame.

This results in the following error:

.
.
.
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[                    ts                  id.orig  ...  successful history_category
0         1.331901e+09               ...        True            other
1         1.331901e+09               ...        True                ^
2         1.331901e+09               ...        True               Sh
3         1.331901e+09               ...        True               Sh
4         1.331901e+09               ...        True               Sh
...                ...               ...         ...              ...
23192090  1.332018e+09               ...       False            other
23192091  1.332017e+09               ...        True            other
23192092  1.332018e+09               ...        True            other
23192093  1.332018e+09               ...        True            other
23192094  1.332018e+09               ...        True            other

[23192095 rows x 24 columns]]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'

The error occurs because one of the files being read is too large, so self._path_to_df() cannot return its DataFrame when multiprocessing is used.
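
For context on the 'i' in that message: it is the struct format code for a signed 32-bit integer. In Python 3.7, multiprocessing prefixes each pickled result with its length packed in that format, so a result larger than about 2 GiB cannot be sent back to the parent process. The following is only a sketch that reproduces the packing failure without any DataFrame; pack_length and fits_in_header are hypothetical helper names made up for illustration, not part of multiprocessing.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value the 'i' (signed 32-bit) format can hold


def pack_length(n_bytes):
    """Pack a message length as a big-endian signed 32-bit integer."""
    return struct.pack("!i", n_bytes)


def fits_in_header(n_bytes):
    """Return True if a payload of n_bytes can be described by an 'i' header."""
    try:
        pack_length(n_bytes)
        return True
    except struct.error:
        # Raised when n_bytes falls outside the signed 32-bit range.
        return False


if __name__ == "__main__":
    print(fits_in_header(INT32_MAX))      # a payload just under 2 GiB
    print(fits_in_header(INT32_MAX + 1))  # one byte too many
```

This would also explain why the error only shows up for the largest files: small DataFrames pickle to well under that limit.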

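There can be multiple files of varying sizes (from small to very large, 3 GB+), so I'm trying to figure out the best way to use multiprocessing for this task.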

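Should I somehow chunk all the data so that p.map() can work, or is that too much overhead? If so, how would I do that? Or should I go through the files sequentially and use multiprocessing while reading each one?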

Edit: Also, it doesn't seem to error out when only smaller files are involved.

Best answer

If the final result is too large to fit in memory, try dask:

import dask.dataframe as dd
df = dd.read_csv('*.csv')

Once read, you can do your aggregations and so on, and finally call compute to get the answer you want.

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/59256876/
