I want to extend the list entries of a DataFrame using the information in column i:
i  s_1        s_1        s_3
2  [1, 2, 3]  [3, 4, 5]  NaN
1  NaN        [0, 0, 0]  [2]
The value in i simply specifies how many times the last value of each list should be repeated:
i  s_1              s_1              s_3
2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]  NaN
1  NaN              [0, 0, 0, 0]     [2, 2]
I am currently using a nested apply:
test.apply(
    lambda x: x.apply(
        lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y) == list else 0
    ),
    axis=1,
)
However, this is very slow, and the code breaks when I have many rows (> 10,000). The solution also feels convoluted. What is the best way to do this?
Best answer
You can try extending the lists in place:
for col in df.loc[:, "s_1":]:
    m = df[col].notna()
    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)
    df.loc[~m, col] = 0
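Note that the loop above mutates the list objects stored in the frame. If the original lists must stay untouched, a non-mutating variant can be sketched as follows. This is only an illustration, not part of the answer: the helper name pad_lists, its count_col parameter, and the column label s_2 (used to avoid duplicate labels) are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def pad_lists(df, count_col="i"):
    """Return a new frame; each list gets its last value repeated `i` times.

    Hypothetical helper: non-list cells (e.g. NaN) become 0, mirroring the
    `else 0` branch in the question's code.
    """
    out = df.copy()
    for col in out.columns.drop(count_col):
        values = [
            # `v + [...]` builds a new list, so the input lists are not mutated.
            v + [v[-1]] * int(n) if isinstance(v, list) else 0
            for n, v in zip(out[count_col], out[col])
        ]
        out[col] = pd.Series(values, index=out.index, dtype=object)
    return out


# Sample data from the question (second column renamed to "s_2" here).
df = pd.DataFrame(
    {
        "i": [2, 1],
        "s_1": [[1, 2, 3], np.nan],
        "s_2": [[3, 4, 5], [0, 0, 0]],
        "s_3": [np.nan, [2]],
    }
)
result = pad_lists(df)
print(result)
```

This trades a little speed for safety, since every padded list is a fresh object rather than an in-place extension.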
Benchmark:
from ast import literal_eval
from io import StringIO
from timeit import timeit

import numpy as np
import pandas as pd


def get_df():
    dfs = []
    # create some big dataframe
    for i in range(5000):
        # columns are separated by two or more spaces (see `sep` below)
        txt = """
i  s_1        s_1        s_3
2  [1, 2, 3]  [3, 4, 5]  NaN
1  NaN        [0, 0, 0]  [2]"""

        df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")
        # turn the string cells into real lists, leaving NaN as-is
        df.loc[:, "s_1":] = df.loc[:, "s_1":].apply(
            lambda x: [v if pd.isna(v) else literal_eval(v) for v in x]
        )
        dfs.append(df)
    return pd.concat(dfs)


def f1(df):
    # in-place list extension (this answer)
    for col in df.loc[:, "s_1":]:
        m = df[col].notna()
        for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
            v.extend([v[-1]] * i)
        df.loc[~m, col] = 0
    return df


def f2(df):
    # nested apply with np.pad (from the question)
    df = df.apply(
        lambda x: x.apply(
            lambda y: np.pad(y, (0, x.i), "constant", constant_values=y[-1])
            if type(y) == list
            else 0
        ),
        axis=1,
    )
    return df


df1 = get_df()
df2 = get_df()

t1 = timeit(lambda: f1(df1), number=1)
t2 = timeit(lambda: f2(df2), number=1)

print(t1)
print(t2)
Prints:
0.01171580795198679
2.3192087680799887
So the improvement is roughly 200x.
Regarding python - Apply function to each cell of a Pandas DataFrame using information from a specific column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/67226612/