Edited/reposted with the correct example output.
I have a DataFrame that looks like this:
import pandas as pd

data = {
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
    "bool": [True, False, False, True, False, False],
    "Update": ['started', 'finished', 'started', 'started', 'finished', 'started']
}
df = pd.DataFrame(data)
print(df)
ID Year Combined bool
1 2021 started True
1 2021 finished False
1 2023 started False
2 2015 started True
2 2017 finished False
2 2018 started False
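This DataFrame is grouped by ID.
I want to create an updated version of the Combined column. A row should be changed only if df['bool'] == True and the same group contains another "finished" row with a later (not equal) year.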
Example output:
ID Year Combined bool Update
1 2021 started True started
1 2021 finished False finished
1 2023 started False started
2 2015 started True finished
2 2017 finished False finished
2 2018 started False started
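We do not update the first group, because it has no finished value in a later year. We do update the second group, because it has a finished value in a later year. Thanks!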
Best answer
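One solution I can think of uses the apply and groupby methods. Each group is passed to the update function, which in turn passes every row of that group to the update_group function. This makes it possible to run the test and, when the condition is met, to return the updated "Update" column. The returned DataFrame is then the expected one.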
Taking your example, but leaving out the "Update" column, which will be created in the second part:
import pandas as pd

data = {
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ["started", "finished", "started", "started", "finished", "started"],
    "bool": [True, False, False, True, False, False],
}
df = pd.DataFrame(data)
print(df)
I get the following input DataFrame:
ID Year Combined bool
0 1 2021 started True
1 1 2021 finished False
2 1 2023 started False
3 2 2015 started True
4 2 2017 finished False
5 2 2018 started False
Here are the two functions I use to update the DataFrame:
def update_group(row, group):
    """Update a single row, given the group it belongs to."""
    # A plain truth test is used here; `is True` can fail for NumPy booleans
    if row["bool"]:
        # Rows of the same group with a strictly later year
        group_later = group[group["Year"] > row["Year"]]
        # If a "finished" row is found among them, set Update to "finished"
        if (group_later["Combined"] == "finished").any():
            row["Update"] = "finished"
        else:
            row["Update"] = row["Combined"]
    else:
        row["Update"] = row["Combined"]
    return row


def update(group):
    """Apply the update to every row of a group."""
    return group.apply(update_group, group=group, axis=1)
So, if you apply these functions to the DataFrame:
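# group_keys=False keeps the original row index; recent pandas versions
# otherwise prepend the ID group key to the index of the result.
df = df.groupby("ID", group_keys=False).apply(update)
print(df)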
The returned DataFrame is:
ID Year Combined bool Update
0 1 2021 started True started
1 1 2021 finished False finished
2 1 2023 started False started
3 2 2015 started True finished
4 2 2017 finished False finished
5 2 2018 started False started
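For larger frames, the row-by-row apply can become slow. The following is a minimal vectorized sketch of the same rule, offered as an alternative and not as part of the original answer. It assumes the column names from the question; the intermediate names finished_year, last_finished and needs_update are introduced here only for illustration. It relies on the fact that a strictly later "finished" row exists in the group exactly when the group's latest "finished" year is greater than the row's own year.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ["started", "finished", "started", "started", "finished", "started"],
    "bool": [True, False, False, True, False, False],
})

# Year of each "finished" row, NaN for every other row
finished_year = df["Year"].where(df["Combined"] == "finished")

# Latest "finished" year within each ID group, broadcast back to every row
# (stays NaN for groups that contain no "finished" row at all)
last_finished = finished_year.groupby(df["ID"]).transform("max")

# Flagged rows whose group has a "finished" row in a strictly later year;
# comparisons against NaN evaluate to False, so such rows are left alone
needs_update = df["bool"] & (last_finished > df["Year"])

# Copy Combined, overwriting only the rows selected above
df["Update"] = df["Combined"].mask(needs_update, "finished")
print(df)
```

On the example data this selects only the 2015 row of group 2, so the Update column matches the expected output shown above.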
Regarding "python - Updating a column based on grouped date values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73030393/