python - Is there a fast way to convert the columns of a Pandas DataFrame into a list of strings?

This is somewhat the reverse of what most people want to do when converting between lists and DataFrames.

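I want to convert a large DataFrame (10M+ rows, 20+ columns) into a list of strings, where each entry is the string representation of one row of the DataFrame. I can do this with pandas' to_csv() method, but I am wondering whether there is a faster way, because this is proving to be a bottleneck in my code.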

Minimal working example:

import numpy as np
import pandas as pd

# Create the initial dataframe.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)

# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]

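For a 10,000,000-row DataFrame on a single thread of my Core i9, the conversion above takes about 90 seconds and is heavily CPU-bound. I would like to reduce that by an order of magnitude if possible.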

EDIT: I am not looking to save the data to a .csv or to any other file. I just want to convert the DataFrame to an array of strings.

EDIT: Input/output example with only 5 columns:

In  [1]: df.head(10)
Out [1]:    a       b       c       d       e
         0  a_0     b_0     c_0     d_0     e_0
         1  a_1     b_1     c_1     d_1     e_1
         2  a_2     b_2     c_2     d_2     e_2
         3  a_3     b_3     c_3     d_3     e_3
         4  a_4     b_4     c_4     d_4     e_4
         5  a_5     b_5     c_5     d_5     e_5
         6  a_6     b_6     c_6     d_6     e_6
         7  a_7     b_7     c_7     d_7     e_7
         8  a_8     b_8     c_8     d_8     e_8
         9  a_9     b_9     c_9     d_9     e_9

In  [2]: ret_val[:10]
Out [2]: ['a_0,b_0,c_0,d_0,e_0',
          'a_1,b_1,c_1,d_1,e_1',
          'a_2,b_2,c_2,d_2,e_2',
          'a_3,b_3,c_3,d_3,e_3',
          'a_4,b_4,c_4,d_4,e_4',
          'a_5,b_5,c_5,d_5,e_5',
          'a_6,b_6,c_6,d_6,e_6',
          'a_7,b_7,c_7,d_7,e_7',
          'a_8,b_8,c_8,d_8,e_8',
          'a_9,b_9,c_9,d_9,e_9']

Best answer

With multiprocessing I get a speedup of about 2.5x...

import multiprocessing

# df from the OP's code above is available in global scope

def fn(i):
    return df[i:i+1000].to_csv(index=False, header=False).split('\n')[:-1]

with multiprocessing.Pool() as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), 1000)):
        result.extend(a)

On my laptop, the total time for 1 million rows drops from 6.8 seconds to 2.8 seconds, so this should hopefully scale to the larger number of cores in an i9 CPU.

This relies on Unix fork semantics to share the DataFrame with the child processes, and it obviously does more work in total, but it might help...
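Because the approach depends on fork, it can help to check for it explicitly instead of relying on the platform default: Windows has no fork at all, macOS has defaulted to spawn since Python 3.8, and the default may differ between Python versions. The following is only a sketch; the helper names fork_supported and fork_context are made up for illustration and are not part of the original answer.

```python
import multiprocessing


def fork_supported():
    """Return True if this platform offers the "fork" start method."""
    return "fork" in multiprocessing.get_all_start_methods()


def fork_context():
    """Return a context that always forks, or raise if fork is unavailable."""
    if not fork_supported():
        # Without fork, workers would not inherit the global df from the parent.
        raise RuntimeError("the 'fork' start method is not available on this platform")
    return multiprocessing.get_context("fork")
```

With this in place, the pool could be created as fork_context().Pool() rather than multiprocessing.Pool(), so that a platform without fork fails loudly instead of starting workers that cannot see df.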

Combining Massifox's numpy.savetxt suggestion with multiprocessing brings the time down to 2.0 seconds, just by mapping the following function instead:

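from io import StringIO
N = 1000  # rows per chunk (assumed here; same as the 1000 used in fn above)

def fn2(i):
    with StringIO() as fd:
        np.savetxt(fd, df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]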

The result is essentially the same.
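One way to convince yourself of that on your own data is to run both conversions on a small frame and compare the lists. The snippet below is a minimal single-process sketch; the frame small and the variable names are invented for the check and stand in for the question's much larger df.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Small stand-in for the question's frame: string columns "a_0", "b_0", ...
small = pd.DataFrame({c: [f"{c}_{i}" for i in range(5)] for c in "abc"})

# Baseline from the question; splitlines() avoids depending on the line terminator.
via_csv = small.to_csv(index=False, header=False).splitlines()

# savetxt variant from this answer, applied to the whole frame in one go.
with StringIO() as fd:
    np.savetxt(fd, small.to_numpy(), fmt="%s", delimiter=",")
    via_savetxt = fd.getvalue().splitlines()

print(via_csv == via_savetxt)
```

The two can diverge on other data: to_csv quotes values that contain commas or quote characters and writes missing values as empty strings, while savetxt with fmt='%s' simply formats each value with %s.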

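Your comment that "the DataFrame is a variable in a class" can be handled in several ways. A simple one is to pass the DataFrame to the Pool initializer, at which point it is not pickled (under Unix, anyway), and to store a reference to it in a global variable. Each worker process can then use that reference, for example: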

def stash_df(df):
    global the_df
    the_df = df

def fn(i):
    with StringIO() as fd:
        np.savetxt(fd, the_df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

with multiprocessing.Pool(initializer=stash_df, initargs=(df,)) as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), N)):
        result.extend(a)

This works fine as long as each Pool is used for a single DataFrame.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/57858514/
