python - Looking for an elegant solution that avoids merging two dataframes

I have a dask dataframe df that looks like this:

Main_Author PaperID
A           X
B           Y
C           Z

I also have another dask dataframe pa that looks like this:

PaperID  Co_Author
X        D
X        E
X        F
Y        A
Z        B
Z        D

I want a resulting dataframe that looks like this:

Main_Author  Co_Authors   Num_Co_Authors
A            (D,E,F)      3
B            (A)          1
C            (B,D)        2

This is what I did:

df = df.merge(pa, on="PaperID")

df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()

df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))

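This works for small dataframes. However, since I am working with very large ones, the process keeps getting killed. I believe this is because of the merge. Is there a more elegant way to get the desired result?

Best answer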

If you are looking to work with two large DataFrames, you could try wrapping the merge in dask.delayed.

There is a good example of dask.delayed here in the Dask docs, or here on SO.
See the Dask use cases here.

Imports
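
As a minimal sketch of what wrapping a function in dask.delayed does: the call only builds a task, and the function body runs when compute() is called. The add function and the calls list are hypothetical names used for illustration, not part of the original answer.

```python
import dask

calls = []  # records every time the wrapped function actually runs


def add(a, b):
    calls.append((a, b))
    return a + b


# Wrapping the call builds a lazy task; add() has not been executed yet.
lazy_sum = dask.delayed(add)(1, 2)
calls_before = len(calls)

# compute() executes the task and returns the plain Python result.
total = lazy_sum.compute()
calls_after = len(calls)

print(calls_before, total, calls_after)
```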

from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()

Generate dummy data so that each DataFrame has a large number of rows

Specify the number of rows of dummy data to generate in each DataFrame

number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000

Generate some large datasets using the faker library (per this SO post)

def create_rows(auth_colname, num=1):
    output = [{auth_colname:fake.name(),
               "PaperID":random.randint(1000,2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))
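
If faker is not installed, or repeatable data is wanted, a similar dataset can be sketched with only the standard library and pandas. This is an assumption-laden variant, not the answer's code: create_rows_seeded, the seed values, and the "author_NNN" name format are all hypothetical.

```python
import random

import pandas as pd


def create_rows_seeded(auth_colname, num, seed):
    """Build `num` rows with a synthetic author name and a random PaperID."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    return [
        {
            auth_colname: f"author_{rng.randint(0, 499):03d}",
            "PaperID": rng.randint(1000, 2000),
        }
        for _ in range(num)
    ]


df_seeded = pd.DataFrame(create_rows_seeded("Main_Author", 3000, seed=0))
pa_seeded = pd.DataFrame(create_rows_seeded("Co_Author", 8000, seed=1))

print(df_seeded.shape, pa_seeded.shape)
```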

Print the first 5 rows of each DataFrame

print(df.head())
       Main_Author  PaperID
0   Kyle Morton MD     1522
1    April Edwards     1992
2  Rachel Sullivan     1874
3    Kevin Johnson     1909
4     Julie Morton     1635

print(pa.head())
        Co_Author  PaperID
0  Deborah Cuevas     1911
1     Melissa Fox     1095
2    Sean Mcguire     1620
3     Cory Clarke     1424
4     David White     1569

Wrap the merge operations in a helper function

def merge_operations(df1, df2):
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df

Dask approach - generate the final DataFrame using dask.delayed

ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    df_dask = dask.compute(ddf)

Output of the Dask approach

[                                        ] | 0% Completed |  0.0s
[                                        ] | 0% Completed |  0.1s
[                                        ] | 0% Completed |  0.2s
[                                        ] | 0% Completed |  0.3s
[                                        ] | 0% Completed |  0.4s
[                                        ] | 0% Completed |  0.5s
[########################################] | 100% Completed |  0.6s

print(df_dask[0].head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6
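
The [0] index above is needed because dask.compute returns a tuple with one entry per argument, whereas calling .compute() on a single delayed object returns the result directly. A small sketch, where square is a hypothetical stand-in for merge_operations:

```python
import dask


def square(x):
    return x * x


lazy = dask.delayed(square)(4)

# dask.compute accepts several objects and always returns a tuple of results.
as_tuple = dask.compute(lazy)

# The method form on one delayed object returns the bare result.
direct = lazy.compute()

print(as_tuple, direct)
```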

Pandas approach - generate the final DataFrame using Pandas

df_pandas = merge_operations(df, pa)

print(df_pandas.head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Compare the DataFrames obtained using the Pandas and Dask approaches

from pandas.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'

Result of comparing the two approaches

print(message)
DataFrames created using Dask and Pandas are equivalent.
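
Since the question asks about avoiding the large merge, one possible alternative is to collapse pa to one row per PaperID first and then look those tuples up from df, so the row-multiplying join never happens. The sketch below uses plain pandas on the question's toy data; it assumes each main author has a single paper and that every PaperID in df also appears in pa. The names co_by_paper and result are hypothetical, and a Dask dataframe version would need adapting.

```python
import pandas as pd

df = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                   "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

# Step 1: reduce pa to one tuple of co-authors per paper.
co_by_paper = pa.groupby("PaperID")["Co_Author"].apply(lambda x: tuple(x))

# Step 2: look up each main author's paper in the reduced Series.
result = df[["Main_Author"]].copy()
result["Co_Authors"] = df["PaperID"].map(co_by_paper)

# Step 3: count the co-authors in each tuple.
result["Num_Co_Authors"] = result["Co_Authors"].map(len)

print(result)
```

Aggregating before the lookup keeps the intermediate data at one row per paper, rather than one row per (main author, co-author) pair.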

Regarding "python - Looking for an elegant solution that avoids merging two dataframes", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55068338/
