I have a dask dataframe df that looks like this:
Main_Author PaperID
A X
B Y
C Z
I also have another dask dataframe pa that looks like this:
PaperID Co_Author
X D
X E
X F
Y A
Z B
Z D
I want a resulting dataframe like this:
Main_Author Co_Authors Num_Co_Authors
A (D,E,F) 3
B (A) 1
C (B,D) 2
This is what I did:
df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
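This works for small dataframes. However, since I am working with really large ones, the process keeps getting killed. I believe this is because I am merging. Is there a more elegant way to get the desired result?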
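For reference, the three lines above can be reproduced on the toy tables from the question. This is a minimal pandas sketch, not the dask version; the column name Co_Authors is taken from the desired output, and the variable names merged and result are illustrative assumptions.

```python
import pandas as pd

# Toy data copied from the tables in the question.
df = pd.DataFrame({"Main_Author": ["A", "B", "C"], "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({
    "PaperID": ["X", "X", "X", "Y", "Z", "Z"],
    "Co_Author": ["D", "E", "F", "A", "B", "D"],
})

# One row per (main author, co-author) pair; this is the step that grows large.
merged = df.merge(pa, on="PaperID")

# Collect each main author's co-authors into a tuple, then count them.
result = (
    merged.groupby("Main_Author")["Co_Author"]
    .apply(tuple)
    .reset_index(name="Co_Authors")
)
result["Num_Co_Authors"] = result["Co_Authors"].map(len)
print(result)
```

The merged frame holds one row per co-author of every paper, which is why its size is driven by pa rather than by df.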
Accepted answer
If you are looking to work with two large DataFrames, you could try wrapping this merge in dask.delayed. There is a good example of dask.delayed here in the Dask docs or here on SO. See the Dask use cases here.
Imports
from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()
Generate dummy data so that each DataFrame has a large number of rows
- Specify the number of rows of dummy data to generate in each DataFrame
number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000
Generate some large datasets using the faker library (per this SO post)
def create_rows(auth_colname, num=1):
    # Build `num` rows, each with a fake author name and a random PaperID
    output = [{auth_colname: fake.name(),
               "PaperID": random.randint(1000, 2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))
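If faker is not installed, a stand-in built on the standard library can produce data of the same shape. The sketch below is an assumption for illustration only: create_rows_stdlib, its seed parameter, and the "Author N" naming scheme are hypothetical and do not come from the original answer.

```python
import random

import pandas as pd


def create_rows_stdlib(auth_colname, num=1, seed=None):
    """Hypothetical stand-in for create_rows that does not need faker."""
    rng = random.Random(seed)  # a private generator keeps runs repeatable
    return [
        {auth_colname: f"Author {rng.randint(1, 500)}",
         "PaperID": rng.randint(1000, 2000)}
        for _ in range(num)
    ]


# Small frame with the same columns as df in the answer.
df_small = pd.DataFrame(create_rows_stdlib("Main_Author", 5, seed=0))
print(df_small.columns.tolist())
```

Passing a fixed seed makes the generated rows identical from run to run, which can help when comparing approaches.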
Print the first 5 rows of each DataFrame
print(df.head())
Main_Author PaperID
0 Kyle Morton MD 1522
1 April Edwards 1992
2 Rachel Sullivan 1874
3 Kevin Johnson 1909
4 Julie Morton 1635
print(pa.head())
Co_Author PaperID
0 Deborah Cuevas 1911
1 Melissa Fox 1095
2 Sean Mcguire 1620
3 Cory Clarke 1424
4 David White 1569
Wrap the merge operations in a helper function
def merge_operations(df1, df2):
    # Join on PaperID, then collect each main author's co-authors into a tuple
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df
Dask approach - generate the final DataFrame using dask.delayed
ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    # dask.compute returns a tuple, so the DataFrame is df_dask[0]
    df_dask = dask.compute(ddf)
Output of the Dask approach
[ ] | 0% Completed | 0.0s
[ ] | 0% Completed | 0.1s
[ ] | 0% Completed | 0.2s
[ ] | 0% Completed | 0.3s
[ ] | 0% Completed | 0.4s
[ ] | 0% Completed | 0.5s
[########################################] | 100% Completed | 0.6s
print(df_dask[0].head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
Pandas approach - generate the final DataFrame using Pandas
df_pandas = merge_operations(df, pa)
print(df_pandas.head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
Compare the DataFrames obtained using the Pandas and Dask approaches
# pandas.util.testing is deprecated; assert_frame_equal lives in pandas.testing
from pandas.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'
Print the result of comparing the two approaches
print(message)
DataFrames created using Dask and Pandas are equivalent.
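Because merge_operations runs as a single delayed task on in-memory pandas DataFrames, the intermediate merged frame still has one row per (main author, co-author) pair. One possible way to shrink that intermediate, sketched here under assumptions, is to collapse the co-author table to one row per PaperID before merging. The function name merge_preaggregated and the toy data are hypothetical and not part of the original answer.

```python
import pandas as pd


def merge_preaggregated(df1, df2):
    """Hypothetical variant of merge_operations that aggregates before merging."""
    # Collapse df2 to one row per PaperID, holding a tuple of its co-authors.
    per_paper = (
        df2.groupby("PaperID")["Co_Author"].apply(tuple).reset_index(name="Co_Author")
    )
    # The merge now yields at most one row per row of df1.
    merged = df1.merge(per_paper, on="PaperID")
    # A main author may have several papers, so flatten their tuples into one.
    out = (
        merged.groupby("Main_Author")["Co_Author"]
        .apply(lambda tuples: tuple(name for t in tuples for name in t))
        .reset_index()
    )
    out["Num_Co_Authors"] = out["Co_Author"].map(len)
    return out


# Toy data from the question, used only to show the output shape.
toy_df = pd.DataFrame({"Main_Author": ["A", "B", "C"], "PaperID": ["X", "Y", "Z"]})
toy_pa = pd.DataFrame({
    "PaperID": ["X", "X", "X", "Y", "Z", "Z"],
    "Co_Author": ["D", "E", "F", "A", "B", "D"],
})
toy_result = merge_preaggregated(toy_df, toy_pa)
print(toy_result)
```

Whether this helps in practice depends on how many co-authors each paper has. The tuples still hold every co-author name, so the saving comes mainly from not repeating Main_Author and PaperID on every co-author row.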
Regarding "python - looking for an elegant solution to avoid merging two dataframes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55068338/